Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

 

Daniel Nashed

 

    The Art of Troubleshooting

    Daniel Nashed  15 January 2024 08:26:50

    In all the years I have been involved in troubleshooting, I keep seeing the same patterns. I am planning to start a new initiative this year.

    To start with, I wrote a short abstract at the end of last year while looking into this.
    See this as the beginning of a change on my side to better help at the community level and also to provide better services as an HCL business partner.


    It will also include troubleshooting steps for different kinds of problems such as crashes, hangs, memory leaks, and performance problems.

    Not all of it can be described in how-to material. But raising awareness of all parts of the support process can significantly help to solve problems faster.

    I have been in the troubleshooting business for over 25 years, read NSDs before breakfast, and have written my own troubleshooting tools.


    Notes System Diagnostics (NSD)


    There is an AHA idea "We need a tool to analyze NSD files" (https://domino-ideas.hcltechsw.com/ideas/DOMINO-I-1451).

    Lotus Notes Diagnostics (LND) was a great tool written by IBM support. It was lost in the transition to HCL, because it was a support tool.

    But NSD was never really a documented customer tool, and LND was just a helper tool to speed up the analysis of NSD, semaphore, and memory dump information.

    Very few admins and developers can really read NSDs to the extent that is needed to pinpoint problems. LND didn't change that.

    NSD is mainly intended to collect information about crashes, hangs and performance issues and provide all the information to support and development.


    Notes/Domino has an outstanding design for Reliability, Availability and Serviceability (RAS).
    This is not limited to NSD crash analysis, memory dumps, memory leak detection, semaphore and LockManager diagnostics.

    My Lotusphere presentation from 2010 is a good starting point for understanding what types of functionality are available out of the box.
    Most of the presentation is still relevant and the features are part of the DNA of Notes/Domino.



    BP204 “CSI Domino” - Diagnostic Collection and Analysis


    https://www.nashcom.de/nsh/web.nsf/ff5ce882e73ab026c1256942003bdf10/6084a81e9f2c2b00c1256cc30030c6c1/$FILE/BP204.pdf

    I am planning to revisit this topic and create new material. But the goal is mainly to help you understand it and to provide better information to a subject matter expert who can help solve the problem.

    Even today, I don't think all of this can be done by a tool like LND. It was written mainly by support, for support.


    The following abstract doesn't apply only to Notes/Domino. It might help you better understand why some of your problem cases are not moving forward as quickly as you would like.

    IMHO admins and developers have to help support and the product developers to pinpoint problems.


    I can't write this up in a day. It will probably be a GitHub repository where I add more material over time.

    Maybe I will even add to it when I run into a problem and write up what information to collect for that particular case.


    It's a new initiative I am starting in 2024. Stay tuned and I hope you like the idea.


    -- Daniel




    The Art of Troubleshooting


    Troubleshooting can be quite challenging for everyone involved in the process. The challenges are not always only on the technical side.
    Success also depends heavily on soft skills on the administrative, developer, and support sides.


    Reproducing and Narrowing Down the Problem


    In many cases, the problem is complex, especially when the problem situation isn't clear.
    Trying to reproduce the problem helps a lot in finding a solution and resolving the issue faster.
    Once the problem is reproduced in the production environment, attempting to reproduce it in another environment can help a support team to replicate it as well.

    It can also help narrow down the problem and extract it into an easier-to-troubleshoot scenario.
    This often shows the problem from a different angle and points to a different root cause.



    You Don't Know What You Don't Know


    If you can't narrow down and extract the problem, it is essential to provide as many details about the problem as possible.

    Someone with a different skill set, looking at the data from a different perspective, might make sense of it.

    Filtering out information is well intended but might reduce the chance of solving the problem.



    Collecting All Relevant Information


    Someone looking into the problem outside the actual environment needs to understand the full picture and rule out any assumptions.

    The sequence of events, including changes to the environment, is as helpful to a troubleshooter as the anamnesis is to a medical doctor.


    A troubleshooter doesn't know your environment and what you know about it, even when talking to the developer of the application itself. All details can be important.
    This doesn't only include logs and crash stacks but also details about the environment.

    This is why support organizations have checklists of information to collect for the next support level, so that a more senior support engineer gets a clear picture as well.
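
    As an illustration only, such a checklist can be kept as data and checked before a case is handed over. The following Python sketch is not taken from any HCL tool; the SupportCase class, the CHECKLIST entries, and the environment keys ("os", "product_version") are all hypothetical names made up for this example. A real checklist would come from your support organization.

```python
from dataclasses import dataclass, field


@dataclass
class SupportCase:
    """Hypothetical container for what gets handed to support."""
    description: str = ""
    # Chronological (timestamp, event) pairs, including environment changes.
    timeline: list = field(default_factory=list)
    # Free-form environment facts; the keys used below are assumptions.
    environment: dict = field(default_factory=dict)
    # Paths of collected logs, crash stacks, or dumps.
    attachments: list = field(default_factory=list)


# Illustrative checklist: item name -> check that returns True when covered.
CHECKLIST = {
    "problem description": lambda c: bool(c.description.strip()),
    "sequence of events": lambda c: len(c.timeline) > 0,
    "operating system": lambda c: "os" in c.environment,
    "product version": lambda c: "product_version" in c.environment,
    "logs or crash stacks": lambda c: len(c.attachments) > 0,
}


def missing_items(case: SupportCase) -> list:
    """Return the checklist items that the case does not cover yet."""
    return [name for name, check in CHECKLIST.items() if not check(case)]


if __name__ == "__main__":
    case = SupportCase(
        description="Server task hangs after nightly maintenance",
        environment={"os": "Linux"},
    )
    for item in missing_items(case):
        print("Still missing:", item)
```

    The point of the sketch is only that gaps become visible before the case is escalated, instead of being discovered by the next support level.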


    Assumptions on both ends of the support process are among the most common complications that slow down the resolution of a problem.



    The International Support Language is English


    Even though most support organizations provide local language support, a best practice is to provide all information in English if possible.

    This includes error messages and problem descriptions. In most support organizations, internal support is mainly in English.

    Most internal knowledge bases and escalation support are in English. Internal communication would require translation for incoming information and responses.
    This not only affects response time but can also cause information to "get lost in translation."

