Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

RunFaster=1 for Domino on Linux

Daniel Nashed  30 April 2012 10:39:06

We traced this problem together with IBM for half a year and finally got a solution for a performance problem that blocked us from migrating 8500 users from GroupWise to Domino.
When we first started to run large-scale imports into Domino (a load that creates many documents with attachments in parallel from multiple workstations), I discovered that some transactions slowed down by 100 ms or even a multiple of 100 ms.
Part of the problem was caused by issues introduced with the newer kernel that SuSE uses in SLES 11. The new process scheduler CFS, which I have blogged about earlier, needs some "tuning" to work nicely with Domino.
SLES 11 SP2 even ships with a 3.0 kernel, which has an updated CFS implementation that needs a different tuning setting than SP1 (you need to run echo NO_FAIR_SLEEPERS > /sys/kernel/debug/sched_features after each reboot).
In addition, the shipping 3.0 kernel fixes another disk I/O issue, which further reduces the I/O.
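
Because the scheduler feature flag is lost on reboot, it has to be re-applied at boot time. The following is a minimal sketch of a helper for a boot script, not a tested production script. The function name set_sched_feature and the SCHED_FEATURES override variable are illustrative names that do not come from the post, and the sketch assumes debugfs is mounted at /sys/kernel/debug.

```shell
#!/bin/sh
# Sketch: re-apply a CFS scheduler feature flag after each reboot.
# SCHED_FEATURES defaults to the debugfs path from the post; the override
# is an illustrative addition that allows a dry run against a scratch file.
SCHED_FEATURES="${SCHED_FEATURES:-/sys/kernel/debug/sched_features}"

set_sched_feature() {
    # $1 = feature flag to write, e.g. NO_FAIR_SLEEPERS
    if [ ! -w "$SCHED_FEATURES" ]; then
        echo "cannot write $SCHED_FEATURES (root? debugfs mounted?)" >&2
        return 1
    fi
    echo "$1" > "$SCHED_FEATURES"
}
```

Called as set_sched_feature NO_FAIR_SLEEPERS from whatever local boot script your distribution provides, this fails with a message instead of silently doing nothing when debugfs is not available.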

Still, those settings did not completely solve the problem, and we continued tracing. I built test tools to run all sorts of load tests locally and remotely to figure out what could cause the delay.
And it turned out that there was an issue with the IOCP handling (sys-epoll) which caused delays in the thread pool coordination.

Last week we got a hotfix (853FP1HF85) containing a changed behaviour in the way the worker threads are coordinated (SPR PHEY8RJHXR).
This SPR is already submitted for 8.5.4 and should hopefully make it into 8.5.3 FP2.

In testing the response time for a normal work-load was five times lower than without the fix!
In my testing with an attachment workload I had even better results.

Load Test Environment

- 30 threads
- 100 documents 
- attachment size 2 MB 
- create 8.5.3 mailfiles from template 
- search for all documents after writing them 

nshload -t30 -w100 -a /largefile.txt -s servername

Infrastructure 

- SLES 11 SP2 64bit Domino 8.5.3 FP1 
- fast quad-core CPU, fast, local RAID10 disks, 32 GB RAM 
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz 

Special Configuration 
- amgr disabled 
- transaction logging enabled 
- NO_FAIR_SLEEPERS for most tests beside one test 
- noop scheduler instead of CFQ for disk devices 

I tested without the network interface, creating the documents locally, as a reference.
Then I tested without the fix but with NO_FAIR_SLEEPERS set.
After that I tested with the hotfix.

And I got the following results:

                             no network   without fix   fix + fair sleeper   fix + no fair sleeper
elapsed time (sec)           56           828           78                   68
response time client (ms)    2            40-60         5                    3


Elapsed time improvement: 12 times faster (828 sec down to 68 sec, reduced to 8%)
Response time improvement: 16 times faster (40-60 ms down to 3 ms, reduced to 6%)
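
The factors can be recomputed from the measured values in the results. This is a small sketch; the helper name improvement is made up for illustration, and using 50 ms as the midpoint of the measured 40-60 ms range is an assumption.

```shell
#!/bin/sh
# Recompute the improvement factors from the measured values above.
# improvement BEFORE AFTER prints "<factor>x faster, reduced to <pct>%".
improvement() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        printf "%.1fx faster, reduced to %.1f%%\n", before / after, after / before * 100
    }'
}

# Elapsed time: 828 sec without the fix, 68 sec with fix + NO_FAIR_SLEEPERS.
improvement 828 68
# Response time: 40-60 ms without the fix (50 ms midpoint is an assumption),
# 3 ms with fix + NO_FAIR_SLEEPERS.
improvement 50 3
```

With these inputs the elapsed time comes out at 12.2x faster (reduced to 8.2%) and the response time at 16.7x faster (reduced to 6.0%).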

We have a dramatic performance improvement with the hotfix.
You can also see that the NO_FAIR_SLEEPERS is helpful but the real performance boost we get is because of the hotfix :-)

Our current nightly migration runs for 24 hours. With this fix I would assume that it will be done in at most 1/5 of the time (a really careful estimate, as long as the migration tool does not cause other delays).
So we would be done in 5 hours instead of 24 hours.
This solves our performance issues during migration and also improves response time during normal operations dramatically.

I have also run another test creating documents with attachments in parallel with 100 threads.
There was no response time slowdown in the nshopen tests. I got the same 3 ms average response time.
The test creating 100 documents with 100 threads and a 1.5 MB attachment in parallel took 187 seconds!!!

If you are not running into critical performance issues you should wait for the next fixpack to ship.
In case you are having issues, you should first look into the other performance optimizations I have posted (CFS changes, switching from the CFQ to the noop disk I/O scheduler).
If that does not help you could open a PMR to get the hotfix :-)

I am looking forward to the Domino performance team shipping a new performance benchmark comparing Linux with Windows in the future :-)

-- Daniel



Linux I/O Performance Tweak

Daniel Nashed  23 April 2012 06:51:09


During a performance troubleshooting session at a customer we found a very interesting tuning option that can boost your storage performance dramatically.
This setting is really helpful for SAN environments, SSD disks, virtual servers and also physical servers using RAID storage.

Current Linux versions use an I/O scheduler called CFQ (Completely Fair Queuing).
This scheduler is helpful when you have a simple local disk where Linux can optimize the I/O requests by re-ordering them.

But this can introduce latency with modern I/O devices.
If you have a caching RAID controller, a SAN or an SSD disk, the hardware back-end is far more effective at optimizing the I/O.
Most disk sub-systems also include a larger amount of cache to optimize the I/O.

Changing the I/O scheduler from "CFQ" to "noop" can dramatically reduce I/O latency.
The "await" in the iostat output shows the time needed for an I/O request from the application to the queue, to the disk and back to the application.
Even if the service time of the device is good (svctm) the await could be a lot higher because of CFQ that tries to re-order the I/O requests.

If you switch from "CFQ" to "noop", all I/O requests are sent to the I/O device directly without any additional delay.
You will notice that the await is dramatically reduced.
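
To compare await before and after, it helps to confirm which scheduler is actually active on a device. The kernel lists the available schedulers in /sys/block/<device>/queue/scheduler and marks the active one with square brackets, for example "noop deadline [cfq]". A minimal sketch follows; the function name active_scheduler and the SYSFS_BLOCK override are illustrative and not from the post.

```shell
#!/bin/sh
# Print the active I/O scheduler for a block device.
# The active scheduler is the bracketed entry, e.g. "noop deadline [cfq]".
# SYSFS_BLOCK can be overridden for testing; /sys/block is the real location.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

active_scheduler() {
    # $1 = device name, e.g. sda
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$SYSFS_BLOCK/$1/queue/scheduler"
}
```

Running active_scheduler sda before and after the switch, next to "iostat -x 1", makes it clear which scheduler a given await value belongs to.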

We have seen good results with SAN environments and also physical disks.

Here is an example of two tests that we ran on a larger physical machine with RAID 10 disks.

First Test

I used 80 threads to read 32000 docs each (80 separate local databases on the server with small documents)

The result was as follows: 
51 sec with CFQ scheduler 
28 sec with noop scheduler 
19 sec all data in cache. 

Second Test

80 threads creating 2000 docs each 
with noop 42 sec 
with cfq 132 sec 

You can see from the results that there is a dramatic difference in performance between CFQ and noop.
The server had a lot of RAM, and repeating the test did not show any read I/O.
So all data was served from cache in this case! That means the 19 seconds is without physical disk I/O, just the processing time in the back-end.

Side note for File-System Cache

Even in larger environments caching can affect the results. So for all tests except the last one we flushed the file-system cache.

Here is the cache flush command on Linux 
echo 3 > /proc/sys/vm/drop_caches 
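
For repeated test runs the flush can be wrapped in a small helper. This is a sketch; the function name flush_fs_cache and the DROP_CACHES override are illustrative additions. The sync call is added because drop_caches only discards clean cache entries, so dirty pages should be written out first.

```shell
#!/bin/sh
# Flush the file-system cache between test runs (needs root).
# DROP_CACHES can be overridden for a dry run; the default is the real path.
DROP_CACHES="${DROP_CACHES:-/proc/sys/vm/drop_caches}"

flush_fs_cache() {
    # Write dirty pages to disk first; drop_caches only discards clean
    # cache entries, so without sync part of the cache would survive.
    sync
    # 3 = free the page cache plus dentries and inodes
    echo 3 > "$DROP_CACHES"
}
```

Calling flush_fs_cache before each run keeps a warm cache from the previous run out of the next measurement.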

So if you do your own testing you should flush the file-system cache before each run.


Here is a short extract from an older IBM Linux redbook that is a great summary of the background:

Select the right I/O elevator in kernel 2.6 
For most server workloads, the complete fair queuing (CFQ) elevator is an adequate choice 
as it is optimized for the multiuser, multiprocess environment a typical server operates in. 
However, certain environments can benefit from a different I/O elevator.

Intelligent disk subsystems 
Benchmarks have shown that the NOOP elevator is an interesting alternative in high-end 
server environments. When using IBM ServeRAID or TotalStorage DS class disk 
subsystems, the lack of ordering capability of the NOOP elevator becomes its strength. 
Intelligent disk subsystems such as IBM ServeRAID and TotalStorage DS class disk 
subsystems feature their own I/O ordering capabilities. Enterprise class disk subsystems 
may contain multiple SCSI or FibreChannel disks that each have individual disk heads and 
data striped across the disks. It would be very difficult for an operating system to anticipate 
the I/O characteristics of such complex subsystems correctly, so you might often observe 
at least equal performance at less overhead when using the NOOP I/O elevator.

Virtual machines 
Virtual machines, regardless of whether in VMware or VM for zSeries®, may only
communicate through the virtualization layer with the underlying hardware. Hence a virtual 
machine is not aware of the fact if the assigned disk device consists of a single SCSI 
device or an array of FibreChannel disks on a TotalStorage DS8000. The virtualization 
layer takes care of necessary I/O reordering and the communication with the physical 
block devices. Therefore, we recommend using the NOOP elevator for virtual machines to 
ensure minimal processor overhead.


How do you change the settings?

You can change the setting dynamically for testing with this manual command (example for hda device)

# echo noop > /sys/block/hda/queue/scheduler
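
On a server with several disks the same change can be applied in a loop. The sketch below assumes the devices show up as sd* or hd* under /sys/block. The function name set_scheduler_all and the SYSFS_BLOCK override are illustrative and not part of the original post.

```shell
#!/bin/sh
# Switch every sd*/hd* block device under a sysfs root to one scheduler.
# SYSFS_BLOCK can be overridden for a dry run; /sys/block is the real location.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

set_scheduler_all() {
    # $1 = scheduler name, e.g. noop
    for f in "$SYSFS_BLOCK"/sd*/queue/scheduler "$SYSFS_BLOCK"/hd*/queue/scheduler; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$f" ] || continue
        # Write the new scheduler, then echo back what the file now reports.
        echo "$1" > "$f" && echo "${f}: $(cat "$f")"
    done
}
```

Running set_scheduler_all noop as root prints one line per device with the scheduler line the kernel reports after the change. Like the manual command, this is lost at the next reboot.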

If you want to permanently change the scheduler for all block devices you can set it as a boot option.

Edit /boot/grub/grub.conf and add elevator=noop to the kernel line.


Conclusion

You should really look into this setting and try it in your environment.
Changing the setting dynamically in a production environment does not cause any issues, and you can see with "iostat -x 1" how the "await" changes.

I am interested in your feedback from your environments.

-- Daniel

Process iCalendar attachment while in other mailfiles

Daniel Nashed  19 April 2012 14:17:31

Normally you can only import iCal data into your own mail-file. This limitation sounds like it is working as designed.
But there is a notes.ini setting that I just found that allows import into other mail-files as well :-)
According to the technote, other databases with calendars should work as well.

Just found it in a technote (#1469271) checking for something else ...

notes.ini ICAL_IMPORT_MANAGED_USER=1

-- Daniel

New Traveler Companion APP 2.0.7

Daniel Nashed  18 April 2012 16:46:54

Description of the update just says "Bug Fixes"

Hmm.. still a good idea to install ..

-- Daniel

ComputeWithForm Performance

Daniel Nashed  16 April 2012 10:43:28


I have done some analysis for a customer which led to a PMR and an SPR which is marked as an enhancement request.
We got feedback that this is not going to be addressed in the current code stream, but I think this is still quite a relevant improvement, especially for customer and business partner applications.

In our analysis it turned out that with NSFNoteComputeWithForm (and also when you use the derived functions in LS and Java)

a.) subforms and shared fields are not cached from cache.ndk

b.) and also when searching for subforms and shared fields (potentially more element types; we only found those in the client_clock data) the design collection is searched instead of using FINDDESIGN transactions.

So not only is cache.ndk not used, but those transactions are also not optimized.
It means that if you have multiple subforms and shared fields, the design collection is read for each of them -- every time.

For a database with many design elements this means quite a lot of overhead.
Instead of zero bytes going over the wire (which is what we would get if cache.ndk were leveraged after the design element has been loaded for the first time), or a cheap transaction
(FINDDESIGN -- transferring only a couple of bytes), we have many transactions reading the design collection.
For a database with a large design, each of those lookups could be multiple 64k read transactions until the design element is found.

Those transactions look usually like this:

(442-79 [618]) READ_ENTRIES(REPC125757C:00308E76-NTFFFF0020): 34 ms. [76+64958=65034]

The internal number for the design collection is 0xFFFF0020. And you can see that a 64K block is read.
The number of reads depends on the size of the database design and where in the design collection the design element is located.
So you could end up having 3-15 additional transactions with
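
To get a feeling for how much of this overhead an application produces, the READ_ENTRIES transactions against the design collection can be counted in collected client_clock output. The sketch below assumes the line layout of the sample transaction above, with the byte counts in a trailing "[a+b=total]" field. The function name design_read_stats is illustrative.

```shell
#!/bin/sh
# Sum up READ_ENTRIES transactions against the design collection
# (NTFFFF0020) in client_clock output read from stdin.
# Assumes the line layout of the sample above, with the byte counts
# in a trailing "[a+b=total]" field.
design_read_stats() {
    awk '/READ_ENTRIES\(.*-NTFFFF0020\)/ {
        # Take the total after "=" in the final bracketed field.
        total = $NF
        sub(/.*=/, "", total)
        sub(/\]/, "", total)
        bytes += total
        count++
    }
    END { printf "%d transactions, %d bytes\n", count, bytes }'
}
```

Feeding a captured log through it, for example design_read_stats < client_clock.log (the file name is hypothetical), prints the number of design collection reads and the total bytes transferred for them.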

This sounds like code that has not been changed since the design cache was introduced a long time ago.
We got SPR #BHUY8SELK9, which is marked as an enhancement request and is not planned to be addressed in this code-stream.
If you are interested in having this enhanced, you could open a PMR referencing this SPR.

For now you should be very careful with subforms and shared fields -- and big main forms -- if you use "ComputeWithForm" in an application that you access on a server over WAN connections.

-- Daniel

Unix/Linux Start Script Update for high CPU issue

Daniel Nashed  10 April 2012 12:19:09

There was an issue that in some cases when the monitor command was running and the shell was closed, the script could not catch the dead text file caused by the shell.
I have changed the trap events and now I hopefully trap all signals that could occur.
We got this problem only on some platforms. And I cannot test on all different versions of Linux.
I hope this change resolves all cases. If not, send me your feedback.

You can request the new version on the start script homepage.

http://www.nashcom.de/nshweb/pages/startscript.htm

-- Daniel



V2.4 10.04.2012

Problems Solved
---------------
Solved an issue when closing a terminal window while the monitor was running.
With some OS releases and some shells this caused the script not to terminate due to issues in the shell.
This could lead to high CPU usage (100% for one core) for the script because the loop did not terminate.
The change to catch more events from the shell should resolve this issue.
If you still run into problems in this area, please send feedback.
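
The general pattern behind such a fix can be sketched as follows. This is an illustration of trapping signals in a polling loop, not the actual code of the start script. The function name monitor_loop and the one-second poll interval are made up for the example.

```shell
#!/bin/sh
# Illustration only: a monitor loop that ends cleanly when the terminal
# goes away, instead of spinning at 100% CPU.
monitor_loop() {
    # Leave the loop on hangup (terminal closed), interrupt, or terminate.
    trap 'running=0' HUP INT TERM
    running=1
    while [ "$running" -eq 1 ]; do
        # Real monitoring work would go here; sleeping keeps the loop
        # from consuming a full CPU core between checks.
        sleep 1
    done
    # Restore default signal handling before returning.
    trap - HUP INT TERM
}
```

The trap handler only sets a flag, so the loop finishes its current iteration and then exits normally.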

FTIndex Crash with C-API based tools caused by a change in D8.5.3

Daniel Nashed  10 April 2012 08:43:21

We ran into this problem quite badly and the root cause was hard to track.
One of my applications (nshrun -- a tool to do multiple tasks in parallel on multiple databases) started to crash without a meaningful call-stack.
I started to figure out the root cause and identified that the C-API call FTIndex causes a crash of the calling function because the stack is damaged.

It turned out that IBM changed the structure of the statistic buffer for FTIndex by adding two new variables.
This change caused incompatibility with all existing applications using this structure with FTIndex.
When using a previous version of the C-API toolkit, the memory buffer passed to the function was too small and the function overwrote memory, which corrupted the stack.

The problem exists with 8.5.3 and 8.5.3 FP1 and will be corrected in 8.5.3 FP2 and 8.5.4. The exposed structure will be reverted back to the old format.
In the meantime to get your application working you have multiple options

a.) wait for 8.5.3 FP2 or 8.5.4
b.) don't use the statistics returned and pass NULL as the parameter
c.) recompile just for 8.5.3 with the current 8.5.3 C-API
d.) redefine the structure in your earlier toolkit version and compile -- for older versions the bigger buffer does not cause any issues

There is an upcoming technote (TN #1590244) which is not yet released, and the SPR we got for the problem is APAR#LO68258/SPR #VDES8SMFCJ.

I am going to compile my applications with a changed header structure to ensure it will continue to work with all releases of Domino.

-- Daniel



STATUS LNPUBLIC FTIndex(DBHANDLE hDB,WORD Options,char far *StopFile, FT_INDEX_STATS far *retStats);

In version 853, the structure FT_INDEX_STATS was updated as shown below in both product code and the C API toolkit.
typedef struct
{
    DWORD DocsAdded;     /* # of new documents */
    DWORD DocsUpdated;   /* # of revised documents */
    DWORD DocsDeleted;   /* # of deleted documents */
    DWORD BytesIndexed;  /* # of bytes indexed */
    DWORD Merges;        /* # of index merges */
    DWORD MergeMsec;     /* Msec spent merging */
}
FT_INDEX_STATS;

In previous versions of the product, this structure was defined as shown below.
typedef struct
{
    DWORD DocsAdded;     /* # of new documents */
    DWORD DocsUpdated;   /* # of revised documents */
    DWORD DocsDeleted;   /* # of deleted documents */
    DWORD BytesIndexed;  /* # of bytes indexed */
}
FT_INDEX_STATS;

Domino on Linux Platform stats might use 100% CPU

Daniel Nashed  22 March 2012 20:52:47
We ran into this issue at a couple of customer sites. It is not completely clear in which release it starts (in a fixpack of 8.5.2, and in 8.5.3 for sure).
The problem is that the server thread responsible for platform stats might use 100% of one CPU thread.
Basically there is an issue when reading a system file in a loop.

There are two SPRs involved in this

SPR #YPHG8ET496 will be fixed in 8.5.3 FP1
SPR #PHEY8KRJJ7 will be fixed in 8.5.2 FP2 and 8.5.4

If you run into this problem in your environment the only work-around is to disable the platform stats via PLATFORM_STATISTICS_DISABLED=1.

What also happens in this case is that the server might not terminate cleanly, because the platform stat server thread is in an infinite loop which does not terminate.

-- Daniel

Companion App 2.0.6 released -- fixes session based authentication issues

Daniel Nashed  20 March 2012 09:28:27

A new update of the Companion app (2.0.6) has been released which fixes the session based authentication issue we discussed.
According to the description this is the only fix in the app. Again, you should still use basic authentication, also to ensure that the device detects password changes.
But it is still good that IBM fixed this issue :-) Huge thanks for the very fast response!

-- Daniel


New Traveler Db Usage command in 8.5.3.2

Daniel Nashed  9 March 2012 10:41:39


There is a new tell command to show how much data is synced to devices.
This command is interesting to figure out how much data users are syncing.

The tell traveler dbusage command is new in 8.5.3.2 and the results look like the example below.

With iOS devices you cannot use policies to prevent users from syncing all their mail.
But there is a general switch for the whole Traveler server in NTSConfig.xml to reduce the maximum limit that the server will sync -- independent of the user settings.

<PROPERTY NAME="USER_EMAIL_LIMIT" VALUE="365"/>
Sets maximum mail filter window.

<PROPERTY NAME="USER_EVENTS_LIMIT" VALUE="365"/>
Sets maximum calendar filter window.

<PROPERTY NAME="USER_NOTES_LIMIT" VALUE="365"/>
Sets maximum journal filter window.

-- Example Output from dbusage command (my server has just two users) --


tell traveler dbusage
Lotus Traveler Database Statistics
Accounts        : 2
Devices         : 5
Device documents: 8814
Domino documents: 3492
Highest Total Usage                 Documents    Percentage
-----------------------------------------------------------------------
Daniel Nashed/NashCom/C=DE          2904         83,16    
mobile/NashCom-Net                  588          16,84    
Mail documents: 2131
Highest Mail usage                  Documents    Percentage EMail filter        
-----------------------------------------------------------------------
Daniel Nashed/NashCom/C=DE          2120         99,48      30 days            
mobile/NashCom-Net                  11           0,52       5 days              
Calendar documents: 63
Highest Calendar usage              Documents    Percentage Event filter        
-----------------------------------------------------------------------
Daniel Nashed/NashCom/C=DE          61           96,83      30 days            
mobile/NashCom-Net                  2            3,17       14 days            
Contacts documents: 1262
Highest Contacts usage              Documents    Percentage Contact filter      
-----------------------------------------------------------------------
Daniel Nashed/NashCom/C=DE          699          55,39      unlimited          
mobile/NashCom-Net                  563          44,61      unlimited          
To Do documents: 3
Highest To Do usage                 Documents    Percentage Task filter        
-----------------------------------------------------------------------
mobile/NashCom-Net                  2            66,67      incomplete only    
Daniel Nashed/NashCom/C=DE          1            33,33      incomplete only    
Notebook documents: 0
Folder documents: 26
Highest Folder usage                Documents    Percentage Folder filter      
-----------------------------------------------------------------------
Daniel Nashed/NashCom/C=DE          19           73,08      unlimited          
mobile/NashCom-Net                  7            26,92      unlimited          
Command DbUsage complete.