Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

RunFaster=1 for Domino on Linux

Daniel Nashed  30 April 2012 10:39:06

We have been tracing this problem together with IBM for half a year and finally have a solution for a performance problem that blocked us from migrating 8,500 users from GroupWise to Domino.
When we first started to run large-scale imports into Domino (creating many documents with attachments in parallel from multiple workstations), I noticed that some transactions slowed down by 100 ms or even multiples of 100 ms.
Part of the problem was caused by changes in the newer kernel that SuSE uses in SLES 11. The new process scheduler CFS, which I have blogged about earlier, needs some "tuning" to work nicely with Domino.
SLES 11 SP2 even ships with a 3.0 kernel with an updated CFS implementation that needs a different tuning setting than SP1 (you need to set echo NO_FAIR_SLEEPERS > /sys/kernel/debug/sched_features after every reboot).
The shipping 3.0 kernel also fixes another disk I/O issue, which further reduces the I/O load.
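
A quick sketch of the tuning (assuming debugfs is mounted at the usual /sys/kernel/debug; the boot.local hook is just one way to re-apply it):

# show the current scheduler features -- disabled ones carry a NO_ prefix
cat /sys/kernel/debug/sched_features
# disable the CFS "fair sleepers" heuristic (3.0 kernel / SLES 11 SP2 syntax)
echo NO_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

The setting does not survive a reboot, so re-apply it at boot time, for example from /etc/init.d/boot.local on SLES.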

Still, those settings did not completely solve the problem, and we continued tracing. I built test tools to run all sorts of load tests locally and remotely to figure out what could cause the delay.
And it turned out that there was an issue with the IOCP handling (sys-epoll) which caused delays in the thread pool coordination.

Last week we got a hotfix (853FP1HF85) that changes the way the worker threads are coordinated (SPR PHEY8RJHXR).
This SPR is already submitted for 8.5.4 and should hopefully make it into 8.5.3 FP2.

In testing, the response time for a normal workload was five times lower than without the fix!
In my testing with an attachment workload I got even better results.

Load Test Environment

- 30 threads
- 100 documents 
- attachment size 2 MB 
- create 8.5.3 mailfiles from template 
- search for all documents after writing them 

nshload -t30 -w100 -a /largefile.txt -s servername

Infrastructure 

- SLES 11 SP2 64-bit, Domino 8.5.3 FP1 
- fast quad-core CPU, fast, local RAID10 disks, 32 GB RAM 
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz 

Special Configuration 
- amgr disabled 
- transaction logging enabled 
- NO_FAIR_SLEEPERS set for all tests except one 
- noop scheduler instead of CFQ for disk devices (see the sketch below) 
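
A minimal sketch of the I/O scheduler switch (sda is just an example device name; repeat for every disk holding Domino data):

# show the available schedulers -- the active one is displayed in brackets
cat /sys/block/sda/queue/scheduler
# switch this device to the noop elevator
echo noop > /sys/block/sda/queue/scheduler

Like the sched_features tuning, this does not persist across reboots; adding elevator=noop to the kernel command line makes it the default.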

As a reference, I first tested without the network interface, creating the documents locally.
Then I tested without the fix but with NO_FAIR_SLEEPERS set.
After that I ran tests with the hotfix, both with and without NO_FAIR_SLEEPERS.

And I have got the following results:

                            no network   without fix   fix + fair sleepers   fix + no fair sleepers
elapsed time (sec)              56           828                78                     68
response time client (ms)        2          40-60                5                      3


Response time improvement: 16 times faster (reduced to 6%)
Elapsed time improvement: 12 times faster (reduced to 8%)
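
If you want to verify the math from the table (assuming the 40-60 ms range averages to roughly 50 ms):

echo "scale=2; 50/3" | bc      # 16.66 -> response time reduced to ~6%
echo "scale=2; 828/68" | bc    # 12.17 -> elapsed time reduced to ~8%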

We see a dramatic performance improvement with the hotfix.
You can also see that NO_FAIR_SLEEPERS is helpful, but the real performance boost comes from the hotfix :-)

Our current nightly migration runs for 24 hours. With this fix I would assume it would be done in at most 1/5 of the time (a really careful estimate, assuming the migration tool does not cause other delays).
So we would be done in 5 hours instead of 24 hours.
This solves our performance issues during migration and also dramatically improves response times during normal operations.

I have also run another test, creating documents with attachments in parallel with 100 threads.
There was no response time slowdown with the nshopen tests; I got the same 3 ms average response time.
The test creating 100 documents with 100 threads and a 1.5 MB attachment in parallel took 187 seconds!

If you are not running into critical performance issues, you should wait for the next fix pack to ship.
If you are having issues, first look into the other performance optimizations I have posted (the CFS tuning and switching from CFQ to the noop disk I/O scheduler).
If that does not help, you can open a PMR to get the hotfix :-)

I am looking forward to the Domino performance team shipping new performance benchmarks comparing Linux with Windows in the future :-)

-- Daniel



Comments

1. Ulrich Krause  30.04.2012 15:16:21  RunFaster=1 for Domino on Linux

Wow. Great finding. I adore you for your deep knowledge in all this stuff.

2. Daniel Nashed  30.04.2012 15:58:28  RunFaster=1 for Domino on Linux

@Jesper, you can either wait for the fix pack or open a PMR to ask for the hotfix.

Since the hotfix already exists, if you ask for a fix for 8.5.3 FP1 you should be able to get it from support asap.

Sharing a hotfix is not allowed. Every customer has to get the hotfix directly from IBM support so that it stays trackable.

-- Daniel

3. Jesper Kiaer  30.04.2012 16:09:23  RunFaster=1 for Domino on Linux

Great work! I have similar problems on my Linux servers, but where do I get hold of the hotfix?

4. Daniele Vistalli  30.04.2012 16:14:20  RunFaster=1 for Domino on Linux

Great stuff Daniel. Thank you for sharing

5. Jens Bruntt  01.05.2012 10:04:15  Thanks

It is important to tell people when they do good.

So, here is a pat on the back. The stuff you post here is really appreciated.

6. Stephan H. Wissel  01.05.2012 14:29:18  RunFaster=1 for Domino on Linux

Wow - nice one!

Do some of your findings have an impact on Notes clients running on Linux? I can observe big performance degradation when background replication is running over slow, high-latency networks.

7. Daniel Nashed  01.05.2012 21:33:28  RunFaster=1 for Domino on Linux

@Stephan, this can also be an issue on Linux clients and, as far as I understood, also for Mac clients.

But I don't expect a lot of performance improvement on the client side.

Your latency issues might be something completely different.

We should discuss and trace offline :-)

-- Daniel

8. Gustavo Frizzera  05.06.2012 22:53:08  RunFaster=1 for Domino on Linux

Hey Daniel,

Great job!

I tested these changes on my SLES 11.1 and I see great performance in NRPC communication.

9. Amin  27.01.2014 6:55:04  Migrating from SUSE Linux 10 to 11

Hey Daniel

Can you please explain how we should proceed regarding one of our cluster servers, which is running on SUSE 10?

We are now planning to upgrade it to SUSE 11.

Please advise.

Kind regards