Domino Diagnostics and Crash detection + Fault Recovery
Daniel Nashed – 5 May 2021 09:35:16
Prompted by an AHA idea that I don't really agree with, I want to explain the background of how and why fault recovery is implemented in Domino the way it is.
A very long time ago at Lotusphere, in the developer labs, I discussed why Domino restarts the server if one task fails with the developers responsible for NSD, fault recovery & co.
They put a lot of effort into making Domino reliable and available, and into making diagnostics easy to run.
Domino has many features in this area, which used to be called RAS (Reliability, Availability and Serviceability).
Let me explain some of the aspects and I will link this blog post to the AHA idea.
https://domino-ideas.hcltechsw.com/ideas/DOMINO-I-1682
-- Daniel
NSD/Fault Recovery and Diagnostic Collection
Domino fault recovery, NSD, memcheck, trapleak debugging and other features were developed a long time ago and are still really outstanding in the industry for collecting diagnostic data that helps HCL support and development pinpoint issues.
Some of it is really geeky and not documented in detail. A long time ago I ran two-day customer workshops on this topic, including hands-on exercises.
But clearly the information is intended for developers to look at. Still, I got many questions about the LND tool, which was able to extract some details from NSDs.
This tool was developed by someone in support and wasn't transferred from IBM to HCL. But there are many out-of-the-box features that already help you.
Domino itself has a fault analyzer, which is already a great way to correlate call-stacks and crash information.
Why does Domino crash if one servertask fails?
A kill -9 is a hard kill (SIGKILL), which is always the last resort. Other applications use SIGHUP to notify a task to reload its configuration.
In Domino, many configuration changes are applied without a restart. For example, in Domino V12 certificate changes don't require a restart when certstore.nsf is used -- other applications, even on Linux, still need a trigger to reload.
Any servertask using the C-API initializes the Domino run-time environment.
So it becomes part of the Domino environment and leverages all kinds of resources.
These include shared memory, semaphores and other resources that are shared among the processes.
If one process crashes, there are resources which are not cleaned up.
The process could also have overwritten shared memory that other processes rely on.
In addition, the process could have locked a semaphore that might never get released if the process is gone unexpectedly.
So a single crashed process can lead to more damage and to server hangs later on.
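To illustrate the semaphore problem described above, here is a minimal sketch in Python. It is not Domino code. It assumes a Unix system (it uses the "fork" start method), and the function names are made up for the example. A worker process takes a shared lock and then dies without releasing it, so the lock stays held after the process is gone.

```python
import multiprocessing as mp
import os


def crash_while_holding(lock):
    """Hypothetical worker: takes the shared lock, then dies without cleanup."""
    lock.acquire()
    os._exit(1)  # simulated crash: no release, no exit handlers


def lock_is_orphaned(timeout=0.5):
    """Return (orphaned, exit_code) after the worker has crashed."""
    ctx = mp.get_context("fork")  # assumption: Unix-like system
    lock = ctx.Lock()             # backed by an OS-level semaphore
    worker = ctx.Process(target=crash_while_holding, args=(lock,))
    worker.start()
    worker.join()                 # the worker process is gone now

    # Nobody is left to release the lock, so this attempt times out.
    acquired = lock.acquire(timeout=timeout)
    if acquired:
        lock.release()
    return (not acquired), worker.exitcode


if __name__ == "__main__":
    print(lock_is_orphaned())
```

Any other process that needs this lock would now wait forever, which is the kind of hang described above.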
That's why Domino has an internal monitor to check whether Domino processes are shut down cleanly.
On Linux/Unix, SIGCHLD is also checked to detect process terminations.
On Windows there isn't such a signal, so the process monitor panics the server if a process terminates unexpectedly.
This is all designed to protect the remaining processes and data in memory.
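The general idea of such a process monitor can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Domino implements it. It polls the child processes instead of handling SIGCHLD, and the names `monitor` and `demo` are hypothetical. As soon as one child ends with a non-zero exit code, the remaining children are stopped so the whole group can be restarted in a clean state.

```python
import subprocess
import sys
import time


def monitor(children, poll_interval=0.05):
    """Watch child processes; on the first unexpected exit, stop all others.

    Returns (name, exit_code) of the failed child, or (None, 0) if every
    child ended cleanly.
    """
    while True:
        for name, proc in children.items():
            code = proc.poll()
            if code is not None and code != 0:
                # Unexpected termination: take the remaining processes down.
                for other in children.values():
                    if other.poll() is None:
                        other.kill()
                        other.wait()
                return name, code
        if all(p.poll() is not None for p in children.values()):
            return None, 0
        time.sleep(poll_interval)


def demo():
    # Two stand-in "server tasks": one keeps running, one crashes.
    healthy = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"])
    crashing = subprocess.Popen(
        [sys.executable, "-c", "import os; os._exit(3)"])
    failed, code = monitor({"healthy": healthy, "crashing": crashing})
    return failed, code, healthy.poll() is not None


if __name__ == "__main__":
    print(demo())
```

Stopping the healthy process looks harsh, but it follows the reasoning above: after an unexpected termination the shared state can no longer be trusted.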
Fault Recovery / Transaction Logging & Co
A couple of features work hand in hand here.
Here is a very brief overview of the most important components -- but there is much more than those main functions.
I could fill pages of blog posts explaining each of them in detail.
No panic -- You don't need to know all those technical details.
What matters is that you use those functions, so that HCL support or an HCL business partner can use the data to help you.
1. Fault Recovery (server doc)
Detects a crash and restarts the server
2. Diagnostic Collection/NSD (server doc)
Collects an NSD and memcheck at crash time and works hand in hand with fault recovery
3. Transaction Log (server doc)
Ensures databases are always consistent. Improves run-time performance by writing changes into the translog first to let the process continue its work.
Writes changed data asynchronously into the databases.
And most important: Ensures that at server restart the transaction log applies the logged changes to make the databases consistent without the need for a fixup.
So the server will be up and running with all databases quickly.
4. Automatic Data collection (config doc)
Collects NSDs + other diagnostic information and sends them to a central database.
5. Fault Analyzer (config doc)
Runs on the central mail-in database to annotate and correlate NSDs.
This works for servers and clients and you have all information in one place.
There is no need to request an NSD from a Notes client or log into a Domino server.
There is much more. But those are the most important parts to configure to leverage Domino diagnostics.
Of course there are also many notes.ini settings for debug information, all of which is written to console.log.
And there is also specific tracing for HTTP, SMTP and other tasks.
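To make the transaction log idea from point 3 a bit more concrete, here is a tiny sketch of generic write-ahead logging in Python. It is not Domino's implementation, and the class and file names are invented for the example. Every change is appended to a log and flushed to disk first, the "database" file is only rewritten later, and a restart replays the log so no change is lost.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical key/value store with a write-ahead log (illustration only)."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "translog.jsonl")
        self.db_path = os.path.join(directory, "database.json")
        self.data = {}
        self.recover()

    def recover(self):
        # Load the last checkpoint, then replay every logged change on top.
        if os.path.exists(self.db_path):
            with open(self.db_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        # Log first and force it to disk; only then is the change accepted.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

    def checkpoint(self):
        # Write the database file later, then start a fresh log.
        tmp = self.db_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.db_path)
        open(self.log_path, "w").close()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = TinyStore(directory)
        store.put("doc1", "draft")
        store.checkpoint()
        store.put("doc1", "final")
        store.put("doc2", "new")
        # Simulated crash: no checkpoint, in-memory state is simply dropped.
        restarted = TinyStore(directory)
        return restarted.data


if __name__ == "__main__":
    print(demo())
```

The restarted store sees both changes made after the checkpoint, because they were in the log before the simulated crash. This is the same reason a server with transaction logging can come back quickly without a fixup.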