Domino on Linux/Unix, Troubleshooting, Best Practices, Tips and more ...

 

Daniel Nashed

 

Nagle Algorithm causes Performance issues

Daniel Nashed  2 November 2009 10:38:40

I should have documented this performance issue a while ago, but we first ran into it before I started my blog.
The issue first appeared in Domino 7.x. In the current Domino 8.5 release the Nagle Algorithm is disabled by default for all Domino TCP/IP communication.
The parameter that solves the issue has since also been included in the DCT tuning settings.

The problem does not occur in all Domino on Linux/Unix environments; it depends on multiple factors in the customer's TCP/IP infrastructure.
Where the problem does occur, disabling Nagle can lead to much better client/server performance.
Everyone running earlier versions of Domino on Linux/Unix should disable the Nagle Algorithm manually.
The notes.ini parameter debug_pd_nagle_off=1 was documented a while ago, but there is still no detailed description of how performance is affected. Here is a short summary with the background.
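
A quick way to check whether the setting is already present is to scan the notes.ini content for it. The following Python sketch is only an illustration: the helper name nagle_disabled and the sample content are made up, and it assumes notes.ini is a plain name=value file in which parameter names can be compared case-insensitively.

```python
def nagle_disabled(notes_ini_text: str) -> bool:
    """Return True if debug_pd_nagle_off=1 is set in the given notes.ini content."""
    for line in notes_ini_text.splitlines():
        name, sep, value = line.partition("=")
        # Only lines of the form name=value are considered; the first match wins.
        if sep and name.strip().lower() == "debug_pd_nagle_off":
            return value.strip() == "1"
    return False


# Hypothetical notes.ini content, for illustration only
sample = "[Notes]\nDirectory=/local/notesdata\ndebug_pd_nagle_off=1\n"
print(nagle_disabled(sample))
```

In practice you would read the real notes.ini from the server's data directory and pass its content to the function.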


The Nagle Algorithm, named after the engineer John Nagle, is designed to combine multiple small packets into larger ones to improve the ratio between packet headers and packet content.
This was very important for applications like telnet, which send very small packets holding just one byte of data (the character you typed into your console window).
The algorithm combines small packets going to the same destination to save network bandwidth.
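
The core decision of the algorithm can be sketched in a few lines of Python. This is a simplified model and not the code of any real TCP/IP stack: it ignores window sizes, timers and push flags, and the function name and the segment size of 1460 bytes are assumptions chosen for illustration.

```python
def nagle_can_send_now(pending_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Simplified Nagle decision for a single connection."""
    if pending_bytes >= mss:
        return True   # a full-sized segment is sent right away
    if unacked_bytes == 0:
        return True   # nothing in flight, so a small segment may go out
    return False      # small segment while earlier data is unacknowledged: hold it back


MSS = 1460  # assumed maximum segment size, typical for Ethernet

print(nagle_can_send_now(1, MSS, 0))      # first keystroke: sent immediately
print(nagle_can_send_now(1, MSS, 1))      # next keystroke, previous one not yet acknowledged
print(nagle_can_send_now(MSS, MSS, 500))  # full segment: sent regardless
```

The second case is the one that matters here: a small piece of data is held back while earlier data is still unacknowledged.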

The algorithm works great if the client side of the application continues its work even when the last packet from the server has not been received yet.
But in the case of Notes, the client waits for the current transaction to complete before starting the next one.

Depending on the implementation, the delay can be up to 200 ms before the incomplete (not full) packet is sent back to the client.
During that time the client is idle, because it is waiting for the packet from the server before it starts the next transaction.
If your application uses a lot of short transactions, like opening a document or a design note, this can impact the end-user experience dramatically.
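
To get a feeling for the effect, the total wait time for a sequence of transactions can be estimated with a small Python calculation. All input values below (50 transactions, 5 ms for a fast reply, 200 ms for a delayed reply, 30 percent delayed) are hypothetical and only serve as an example; they are not measurements.

```python
def total_wait_ms(transactions: int, fast_ms: float, delayed_ms: float, delayed_percent: int) -> float:
    """Total time for sequential transactions when a share of them is delayed."""
    delayed = transactions * delayed_percent / 100
    fast = transactions - delayed
    # Transactions run one after another, so the individual times simply add up.
    return fast * fast_ms + delayed * delayed_ms


with_delay = total_wait_ms(50, fast_ms=5, delayed_ms=200, delayed_percent=30)
without_delay = total_wait_ms(50, fast_ms=5, delayed_ms=200, delayed_percent=0)
print(with_delay, without_delay)
```

Because the client runs the transactions strictly one after another, every delayed reply adds its full delay to the time the user has to wait.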

Here is an example:

OPEN_NOTE(REPC125725F:002B6DD3-NT00003A66,08400000): 202 ms. [50+1476=1526] 

- The client tries to open a document and sends an OPEN_NOTE transaction to the server via TCP/IP (actually the client uses the NTI layer, which in turn uses the TCP/IP stack of the client's OS).
- The server processes the open request and sends back some data to the client, which is smaller than the threshold.
- The TCP/IP stack of the server waits up to 200 ms before sending the data packets back to the client.
- No other data needs to be sent to the client, because the client is waiting for the current transaction to finish. As a result, in 30-40% of the cases the Notes client waits more than 200 ms instead of less than 20 ms until the transaction is complete.
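
If you collect many such trace lines, it helps to parse them instead of reading them by eye. The following Python sketch parses the example line shown above. The regular expression is derived from that single line only, so other transaction formats may need adjustments. The bracketed numbers are presumably byte counts (sent plus received equals total), but that reading is an assumption, so the sketch just keeps them as three plain numbers.

```python
import re

LINE = "OPEN_NOTE(REPC125725F:002B6DD3-NT00003A66,08400000): 202 ms. [50+1476=1526]"

# name(arguments): <elapsed> ms. [<first>+<second>=<total>]
PATTERN = re.compile(
    r"^(?P<name>\w+)\((?P<args>.*)\):\s*(?P<ms>\d+) ms\.\s*"
    r"\[(?P<first>\d+)\+(?P<second>\d+)=(?P<total>\d+)\]"
)


def parse_clock_line(line: str):
    """Return the transaction name, elapsed time and bracketed numbers, or None."""
    match = PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "transaction": match.group("name"),
        "elapsed_ms": int(match.group("ms")),
        "numbers": (int(match.group("first")), int(match.group("second")), int(match.group("total"))),
    }


print(parse_clock_line(LINE))
```

With the elapsed time available as a number, it is easy to filter for transactions above 200 ms.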

Disabling the Nagle Algorithm for Domino TCP/IP sockets eliminates the delay completely, and all replies from the server are sent directly without any wait time.
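
On the socket level, applications usually disable Nagle with the standard TCP_NODELAY socket option. The Python sketch below shows this for a plain TCP socket. It is a generic illustration and assumes nothing about how Domino implements the notes.ini parameter internally; the function name is made up.

```python
import socket


def open_socket_without_nagle() -> socket.socket:
    """Create a TCP socket with the Nagle Algorithm switched off."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 tells the TCP/IP stack to send small segments without waiting.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = open_socket_without_nagle()
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print(nodelay_enabled)
sock.close()
```

The option is set per socket, which is why an application has to apply it to every connection it opens or accepts.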

While troubleshooting this issue I wrote a little C-API based application to trace the performance of Note Open operations, because those turned out to be the most affected transactions at the first customer where we identified this issue. Other operations run into the same issue as well.

Here is an example with and without disabling the Nagle algorithm.

The test program opens 1000 documents from names.nsf on the server and analyses the response times.
You can see that in more than half of the cases (520 of 1000) the response time is below 1 ms.
In total about 70% of the Note Open transactions take 16 ms or less. Those are the normal expected response times for a client in the local LAN.
But you also see many transactions between 200 and 300 ms.
Most of those slow responses are caused by delays from the Nagle Algorithm.
This leads to an average response time of 60 ms for the 1000 note open operations.


29.10.2009 14:44:05 nshopen: Local Notes/Domino Release 8.5 QMR:1 QMU:0 Hotfix: 0 Fixpack: 0 (0)
29.10.2009 14:44:05 nshopen: Remote Notes/Domino Release 7.0 QMR:3 QMU:1 Hotfix: 0 Fixpack: 0 (0)
29.10.2009 14:44:05 nshopen: Database: 'customer_server!!names.nsf'
29.10.2009 14:44:05 nshopen: Count: 1000
29.10.2009 14:44:05 nshopen: Total: 60171
29.10.2009 14:44:05 nshopen: Minimum: 0
29.10.2009 14:44:05 nshopen: Maximum: 405
29.10.2009 14:44:05 nshopen: Average: 60
29.10.2009 14:44:05 nshopen: =0 : 520
29.10.2009 14:44:05 nshopen: =1 : 0
29.10.2009 14:44:05 nshopen: <=10 : 0
29.10.2009 14:44:05 nshopen: <=100 : 243
29.10.2009 14:44:05 nshopen: <=200 : 7
29.10.2009 14:44:05 nshopen: Slow : 230
29.10.2009 14:44:05 nshopen: [ 0] -> 520
29.10.2009 14:44:05 nshopen: [ 15] -> 75
29.10.2009 14:44:05 nshopen: [ 16] -> 115
29.10.2009 14:44:05 nshopen: [ 31] -> 20
29.10.2009 14:44:05 nshopen: [ 32] -> 6
29.10.2009 14:44:05 nshopen: [ 46] -> 4
29.10.2009 14:44:05 nshopen: [ 47] -> 12
29.10.2009 14:44:05 nshopen: [ 62] -> 3
29.10.2009 14:44:05 nshopen: [ 63] -> 3
29.10.2009 14:44:05 nshopen: [ 78] -> 3
29.10.2009 14:44:05 nshopen: [ 93] -> 1
29.10.2009 14:44:05 nshopen: [ 94] -> 1
29.10.2009 14:44:05 nshopen: [ 109] -> 1
29.10.2009 14:44:05 nshopen: [ 140] -> 1
29.10.2009 14:44:05 nshopen: [ 156] -> 2
29.10.2009 14:44:05 nshopen: [ 187] -> 2
29.10.2009 14:44:05 nshopen: [ 188] -> 1
29.10.2009 14:44:05 nshopen: [ 202] -> 4
29.10.2009 14:44:05 nshopen: [ 203] -> 16
29.10.2009 14:44:05 nshopen: [ 218] -> 74
29.10.2009 14:44:05 nshopen: [ 219] -> 43
29.10.2009 14:44:05 nshopen: [ 234] -> 39
29.10.2009 14:44:05 nshopen: [ 249] -> 7
29.10.2009 14:44:05 nshopen: [ 250] -> 18
29.10.2009 14:44:05 nshopen: [ 265] -> 6
29.10.2009 14:44:05 nshopen: [ 266] -> 4
29.10.2009 14:44:05 nshopen: [ 280] -> 1
29.10.2009 14:44:05 nshopen: [ 281] -> 2
29.10.2009 14:44:05 nshopen: [ 296] -> 4
29.10.2009 14:44:05 nshopen: [ 297] -> 2
29.10.2009 14:44:05 nshopen: [ 312] -> 3
29.10.2009 14:44:05 nshopen: [ 328] -> 2
29.10.2009 14:44:05 nshopen: [ 343] -> 1
29.10.2009 14:44:05 nshopen: [ 374] -> 1
29.10.2009 14:44:05 nshopen: [ 375] -> 1
29.10.2009 14:44:05 nshopen: [ 390] -> 1
29.10.2009 14:44:05 nshopen: [ 405] -> 1
29.10.2009 14:44:05 nshopen: Terminated
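
The summary numbers can be recomputed from the histogram lines in the log above. The following Python sketch does that for the run with Nagle enabled. The response times and counts are copied from the log; the function name and the two limits (16 ms for "fast", 200 ms for "slow") are choices made for this illustration and say nothing about how nshopen works internally.

```python
# Histogram copied from the log above: response time in ms -> number of note opens
RUN_WITH_NAGLE = {
    0: 520, 15: 75, 16: 115, 31: 20, 32: 6, 46: 4, 47: 12, 62: 3, 63: 3,
    78: 3, 93: 1, 94: 1, 109: 1, 140: 1, 156: 2, 187: 2, 188: 1,
    202: 4, 203: 16, 218: 74, 219: 43, 234: 39, 249: 7, 250: 18,
    265: 6, 266: 4, 280: 1, 281: 2, 296: 4, 297: 2, 312: 3, 328: 2,
    343: 1, 374: 1, 375: 1, 390: 1, 405: 1,
}


def summarize(histogram, fast_limit_ms=16, slow_limit_ms=200):
    """Recompute count, total time, and the number of fast and slow opens."""
    count = sum(histogram.values())
    total_ms = sum(ms * n for ms, n in histogram.items())
    fast = sum(n for ms, n in histogram.items() if ms <= fast_limit_ms)
    slow = sum(n for ms, n in histogram.items() if ms > slow_limit_ms)
    return {"count": count, "total_ms": total_ms, "fast": fast, "slow": slow}


print(summarize(RUN_WITH_NAGLE))
# {'count': 1000, 'total_ms': 60171, 'fast': 710, 'slow': 230}
```

The recomputed total of 60171 ms and the 230 slow opens match the Total and Slow lines of the log, and 710 of 1000 opens finish within 16 ms.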



After disabling the Nagle Algorithm for Domino TCP/IP the results look like this.

29.10.2009 14:53:37 nshopen: Local Notes/Domino Release 8.5 QMR:1 QMU:0 Hotfix: 0 Fixpack: 0 (0)
29.10.2009 14:53:37 nshopen: Remote Notes/Domino Release 7.0 QMR:3 QMU:1 Hotfix: 0 Fixpack: 0 (0)
29.10.2009 14:53:37 nshopen: Database: 'customer_server!!names.nsf'
29.10.2009 14:53:37 nshopen: Count: 1000
29.10.2009 14:53:37 nshopen: Total: 5853
29.10.2009 14:53:37 nshopen: Minimum: 0
29.10.2009 14:53:37 nshopen: Maximum: 78
29.10.2009 14:53:37 nshopen: Average: 5
29.10.2009 14:53:37 nshopen: =0 : 658
29.10.2009 14:53:37 nshopen: =1 : 0
29.10.2009 14:53:37 nshopen: <=10 : 0
29.10.2009 14:53:37 nshopen: <=100 : 342
29.10.2009 14:53:37 nshopen: <=200 : 0
29.10.2009 14:53:37 nshopen: Slow : 0
29.10.2009 14:53:37 nshopen: [ 0] -> 658
29.10.2009 14:53:37 nshopen: [ 15] -> 125
29.10.2009 14:53:37 nshopen: [ 16] -> 192
29.10.2009 14:53:37 nshopen: [ 31] -> 15
29.10.2009 14:53:37 nshopen: [ 32] -> 5
29.10.2009 14:53:37 nshopen: [ 47] -> 3
29.10.2009 14:53:37 nshopen: [ 62] -> 1
29.10.2009 14:53:37 nshopen: [ 78] -> 1
29.10.2009 14:53:37 nshopen: Terminated
 


So you see that the transactions around 200 ms are completely gone and that only very few transactions take longer than 16 ms.

The average dropped from 60 ms to 5 ms!!!
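
The averages follow directly from the Total and Count lines of the two logs. The short Python calculation below repeats the arithmetic. The logged averages of 60 and 5 appear to be the integer part of the result, which is an observation from the numbers and not a documented behaviour of the tool.

```python
def average_ms(total_ms: int, count: int) -> float:
    """Average response time in milliseconds."""
    return total_ms / count


before = average_ms(60171, 1000)  # Total and Count of the run with Nagle enabled
after = average_ms(5853, 1000)    # Total and Count of the run with Nagle disabled

print(f"{before:.1f} ms -> {after:.1f} ms, factor {before / after:.1f}")
# 60.2 ms -> 5.9 ms, factor 10.3
```

So in this test the 1000 note opens completed roughly ten times faster with Nagle disabled.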

Depending on the complexity of your application (even the mail file opens multiple design elements etc.) this could lead to a quite noticeable overall response time improvement.
In our tests it was mainly the Note Open operation that was analysed, but other transactions would also benefit from this change.
As I said in the introduction, this might not happen in all customer environments, and Nagle is disabled by default at least in 8.5 and above.
If you are running earlier releases you should set the parameter. We have only seen this issue on Linux, AIX and Solaris, and it is hard to tell whether it also affects Windows in some special cases.
Setting the parameter does no harm, so it is better to set it just in case.
We have also seen in our tests that enabling Notes port network compression made the problem disappear. But it is still safer to set the parameter.

I am still waiting for the overall performance improvement traces from the customer (checked via client_clock data annotation). But the first test results with the nshopen tool look promising in this case as well.


-- Daniel
