Daniel Nashed 23 April 2012 06:51:09
During a performance troubleshooting session at a customer we found a very interesting tuning option that can boost your storage performance dramatically.
This setting is really helpful for SAN environments, SSD disks, virtual servers and also physical servers using RAID storage.
Current Linux versions use an I/O scheduler called CFQ (Completely Fair Queuing) by default.
This scheduler is helpful when you have a simple local disk, where Linux can optimize the I/O requests by re-ordering them.
But this can introduce latency with modern I/O devices.
If you have a caching RAID controller, a SAN or an SSD, the hardware back-end is far more effective at optimizing the I/O.
Most disk sub-systems also include a larger amount of cache to optimize the I/O.
Changing the I/O scheduler from "CFQ" to "noop" can dramatically reduce I/O latency.
The "await" column in the iostat output shows the average time (in milliseconds) an I/O request needs from the application to the queue, to the disk and back to the application.
Even if the service time of the device (svctm) is good, the await can be a lot higher because CFQ tries to re-order the I/O requests.
If you switch from "CFQ" to "noop", all I/O requests are sent to the I/O device directly without any additional delay.
You will notice that the await is dramatically reduced.
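To watch just that value while you switch schedulers, a small filter over the iostat output can help. This is only a minimal sketch: the function name show_await is made up for this example, and it assumes your iostat prints a header line starting with "Device" and a column literally named "await". Newer sysstat versions split this into r_await and w_await, which is why the column name is a parameter.

```shell
#!/bin/bash
# Sketch: print device name and one latency column from "iostat -x" output.
# show_await is a hypothetical helper name, not part of sysstat.
# The column is located by its header name because the column order
# differs between sysstat versions.
show_await() {
    local column="${1:-await}"
    awk -v col="$column" '
        # A blank line ends the current device report.
        NF == 0 { in_dev = 0; next }
        # Header line of a device report: remember where the column is.
        $1 ~ /^Device/ {
            idx = 0
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            in_dev = (idx > 0)
            next
        }
        # Device lines: print the name and the selected column.
        in_dev { print $1, $idx }
    '
}
```

A possible call would be "iostat -x 1 5 | show_await" for five one-second samples, or "iostat -x 1 5 | show_await svctm" to compare against the service time.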
We have seen good results with SAN environments and also physical disks.
Here is an example of two tests that we did on a larger physical machine with RAID 10 disks.
I used 80 threads to read 32000 docs each (80 separate local databases on the server with small documents).
The results were as follows:
Read test:
51 sec with CFQ scheduler
28 sec with noop scheduler
19 sec with all data in cache
Write test (80 threads creating 2000 docs each):
132 sec with CFQ scheduler
42 sec with noop scheduler
You can see from the results that there is a dramatic difference in performance between CFQ and noop.
The server had a lot of RAM and repeating the test did not show any read I/O.
So all data has been served from cache in this case! That means the 19 seconds is without physical disk I/O. Just the processing time in the back-end.
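To put the timings into relation, here is a small sketch of the arithmetic. The helper name speedup is made up for this example. Subtracting the 19 second cached run to estimate the time spent waiting on disk is my own rough assumption based on the numbers above, not something that was measured separately.

```shell
#!/bin/bash
# Sketch: relative speedups from the timings quoted in the post (seconds).
# speedup is a hypothetical helper; awk is used for floating-point division.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

read_cfq=51; read_noop=28; read_cached=19
write_cfq=132; write_noop=42

echo "read speedup:  $(speedup "$read_cfq" "$read_noop")x"
echo "write speedup: $(speedup "$write_cfq" "$write_noop")x"
# Assumption: the cached run approximates pure processing time, so the
# remainder is a rough estimate of the time spent waiting on the disks.
echo "read I/O wait: $((read_cfq - read_cached)) s (CFQ) vs $((read_noop - read_cached)) s (noop)"
```

With the numbers above this gives roughly 1.8x for the read test and 3.1x for the write test, and an estimated 32 versus 9 seconds of disk wait for the read test.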
Side note for File-System Cache
Even in larger environments caching can affect the results. So for all tests except the run with all data in cache we flushed the file-system cache.
Here is the cache flush command on Linux:
echo 3 > /proc/sys/vm/drop_caches
So if you do your own testing you should flush the file-system cache before each test run.
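For repeated test runs the flush can be wrapped in a small function. This is a sketch only: flush_fs_cache is a made-up name, and the optional path argument exists purely so the function can be tried against a scratch file without root. The sync call is an addition to the command shown above, because drop_caches only frees clean pages and dirty pages stay in memory until they are written out.

```shell
#!/bin/bash
# Sketch: flush the file-system cache before a test run (needs root).
# flush_fs_cache is a hypothetical helper name; the optional argument
# exists only so the function can be tried against a scratch file.
flush_fs_cache() {
    local target="${1:-/proc/sys/vm/drop_caches}"
    # Write dirty pages to disk first; drop_caches only frees clean pages.
    sync
    # 3 = free the page cache plus dentries and inodes.
    echo 3 > "$target"
}
```

Call flush_fs_cache as root right before each timed run so that every run starts with a cold cache.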
Here is a short extract from an older IBM Linux redbook that is a great summary of the background:
Select the right I/O elevator in kernel 2.6
For most server workloads, the complete fair queuing (CFQ) elevator is an adequate choice
as it is optimized for the multiuser, multiprocess environment a typical server operates in.
However, certain environments can benefit from a different I/O elevator.
Intelligent disk subsystems
Benchmarks have shown that the NOOP elevator is an interesting alternative in high-end
server environments. When using IBM ServeRAID or TotalStorage DS class disk
subsystems, the lack of ordering capability of the NOOP elevator becomes its strength.
Intelligent disk subsystems such as IBM ServeRAID and TotalStorage DS class disk
subsystems feature their own I/O ordering capabilities. Enterprise class disk subsystems
may contain multiple SCSI or FibreChannel disks that each have individual disk heads and
data striped across the disks. It would be very difficult for an operating system to anticipate
the I/O characteristics of such complex subsystems correctly, so you might often observe
at least equal performance at less overhead when using the NOOP I/O elevator.
Virtual machines, regardless of whether in VMware or VM for zSeries®, may only
communicate through the virtualization layer with the underlying hardware. Hence a virtual
machine is not aware of the fact if the assigned disk device consists of a single SCSI
device or an array of FibreChannel disks on a TotalStorage DS8000. The virtualization
layer takes care of necessary I/O reordering and the communication with the physical
block devices. Therefore, we recommend using the NOOP elevator for virtual machines to
ensure minimal processor overhead.
How do you change the settings?
You can change the setting dynamically for testing with this manual command (example for the hda device):
echo noop > /sys/block/hda/queue/scheduler
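If you have more than one block device, it helps to check first which scheduler is active and which ones the kernel offers. The following is a minimal sketch; both function names are made up for this example, and the sysfs root is a parameter only so the functions can be tried on a test directory. Which schedulers are available depends on your kernel, so the second function checks the list the kernel prints instead of assuming that noop exists.

```shell
#!/bin/bash
# Sketch: report and change the active I/O scheduler of block devices.
# Both function names are hypothetical. The sysfs root is a parameter
# (default /sys/block) so the functions can be tried on a test directory.

# Print "<device> <active scheduler>" for every block device.
# The kernel marks the active scheduler with square brackets,
# e.g. "noop anticipatory deadline [cfq]".
list_schedulers() {
    local root="${1:-/sys/block}" file dev
    for file in "$root"/*/queue/scheduler; do
        [ -r "$file" ] || continue
        dev="${file#"$root"/}"
        dev="${dev%%/*}"
        echo "$dev $(sed -n 's/.*\[\(.*\)\].*/\1/p' "$file")"
    done
}

# Switch one device to the given scheduler (needs root on a real system).
set_scheduler() {
    local dev="$1" sched="$2" root="${3:-/sys/block}"
    local file="$root/$dev/queue/scheduler"
    # Refuse names the kernel does not offer for this device.
    if ! tr -d '[]' < "$file" | tr ' ' '\n' | grep -qx "$sched"; then
        echo "scheduler '$sched' not available for $dev" >&2
        return 1
    fi
    echo "$sched" > "$file"
}
```

For example, "list_schedulers" shows the current state and "set_scheduler hda noop" does the same as the echo command above, with a check in front.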
If you want to permanently change the scheduler for all block devices you can set it as a boot option.
Edit /boot/grub/grub.conf and add elevator=noop to the kernel line.
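If you prefer to script that edit, something like the following sketch could be used. The function name add_elevator_option is made up for this example, and it assumes a GRUB legacy style file in which the boot entries use lines starting with "kernel". It keeps a backup copy with the suffix .bak and skips lines that already have an elevator= option. Try it on a copy of the file first.

```shell
#!/bin/bash
# Sketch: append elevator=noop to every kernel line of a GRUB legacy config.
# add_elevator_option is a hypothetical helper name. It edits the file
# in place and keeps a backup with the suffix .bak.
add_elevator_option() {
    local conf="${1:-/boot/grub/grub.conf}"
    # Match lines that start with optional whitespace plus "kernel" and
    # do not already carry an elevator= option, then append the option
    # in place of any trailing whitespace.
    sed -i.bak -E \
        '/^[[:space:]]*kernel[[:space:]]/{/elevator=/!s/[[:space:]]*$/ elevator=noop/;}' \
        "$conf"
}
```

Running "add_elevator_option /tmp/grub.conf.copy" on a copy and comparing it with the .bak file shows exactly what would change.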
You should really look into this setting and try it in your environment.
Changing the setting dynamically in a production environment does not cause any issues, and you can watch with "iostat -x 1" how the "await" changes.
I am interested in your feedback from your environments.