Many servers, especially database servers such as MySQL, hit hard drive IO on every data insert, so to get decent performance out of a database handling a heavy insert load, it is critical to tune the IO writes.
Tuning IO is a tedious task which requires many iterations until you eventually reach your goals or see any results.
While tuning IO, I treat tuning for read performance as a different task from tuning for write performance. Combining the two can be one of the hardest tasks a SysAdmin can face.
I decided to focus on write performance in the first article.
Wearing both the SysAdmin and the developer hats, I like to apply development methodologies to system administration. This time it is the ‘top-down‘ methodology – analyzing every layer from the top – the application behavior – through the operating system and the filesystem, down to the bottom – the hardware.
When analyzing IO performance and designing for proper performance, always imagine IOPS as water running down a pipe – from your application to the hardware.
If any part of the pipe is narrower, the water won’t flow properly. Our task here is to examine this pipe and widen it where needed.
Speaking of write performance, your IO is usually either very sequential – such as writing video blocks one after the other – or fairly random – such as user-driven DB updates landing in places you can’t predict. Tuning the latter is the harder task, precisely because the input is random.
Developers are always afraid of using too much memory. Why? – I don’t know…
Memory today is cheap, and far too many times I have found developers investing countless hours to optimize something and save a “huge” 4 KB of memory. Memory is by far cheaper than labor.
Use as much memory as needed! I highly encourage developers to malloc() big chunks of memory to avoid disk access and reduce CPU time.
Especially when tuning for IO, just use more memory. If, for instance, you need to write data blocks to the disk, perhaps you can buffer them as much as you can – and engage in disk writes only when really necessary or when it is more convenient to do so.
The same goes for random write IO – buffer your requests and serve them to your DB in large chunks, letting the layer underneath handle the multiple IOs more efficiently. For example, if you have many IO write requests, instead of serializing them to disk in the order they arrived, cache many of them and let the operating system queue them in the most efficient way.
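To make the buffering argument concrete, here is a minimal shell sketch (paths are temporary, sizes arbitrary): the first dd forces every 4 KiB block to disk synchronously, the second hands the kernel a single large write and lets it batch and schedule the flush – compare the throughput dd reports for each.

```shell
# 4 MiB written as 1024 synchronous 4 KiB writes: every block waits for the disk.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/unbuffered" bs=4k count=1024 oflag=dsync

# The same 4 MiB as one buffered write: the OS batches and schedules the flush.
dd if=/dev/zero of="$tmp/buffered" bs=4M count=1
```

On a spinning disk the difference between the two reported rates is typically one to two orders of magnitude.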
If your application is actually a database – there are numerous parameters you can configure to get the database to work much better. Work them all with your loyal DBA for maximum performance.
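For MySQL specifically, a few of the write-related InnoDB knobs worth going over with your DBA look like this – the values below are illustrative, not recommendations:

```
# my.cnf – write-related InnoDB settings (illustrative values)
[mysqld]
innodb_buffer_pool_size        = 4G     # buffer as much data in memory as possible
innodb_log_file_size           = 512M   # larger redo logs smooth out write bursts
innodb_flush_log_at_trx_commit = 2      # flush the log once a second instead of per commit
innodb_flush_method            = O_DIRECT
```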
One very important concept I always follow while tuning an operating system is ‘don’t play god’. Operating systems usually know damn well what’s good for you. Usually. Fortunately, with today’s modern operating systems, such as Linux, you get to choose between a few IO schedulers.
An IO scheduler is the component in the operating system which queues and sorts IO requests, trying to optimize the order in which they will be served.
So which one is best? – I can’t answer that question for you. If you don’t have an IO tester for your system (a unit test that mimics the IO character of your application), do build one and experiment with the different IO schedulers Linux has to offer. Your answer is just around the corner.
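A quick way to see what your kernel offers per device is to read sysfs – the active scheduler appears in brackets, and the names vary by kernel generation (noop/deadline/cfq on older kernels, none/mq-deadline/bfq/kyber on newer ones). Switching is a one-line echo as root; sda below is just an example device:

```shell
# Show the available and active IO scheduler for every block device.
found=0
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}; dev=${dev%/queue/scheduler}
    printf '%s: %s\n' "$dev" "$(cat "$f")"
    found=$((found + 1))
done
echo "inspected $found device(s)"

# To activate a different scheduler (as root), e.g. on the example device sda:
#   echo deadline > /sys/block/sda/queue/scheduler
```

The change takes effect immediately and lasts until reboot, so it is cheap to try each scheduler against your IO tester.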
Some kernel parameters (tunable via sysctl) are also here to help. From my experience, playing with these yields very little of a performance boost – the smoking gun is usually the IO scheduler you choose.
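If you do experiment with sysctl, the writeback knobs are the usual suspects – the values below are illustrative, so measure before and after:

```
# /etc/sysctl.conf – dirty page writeback tuning (illustrative values)
vm.dirty_background_ratio = 5    # kernel starts background flushing at 5% dirty memory
vm.dirty_ratio = 20              # writers are throttled once 20% of memory is dirty
```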
Luckily with Linux, we have a vast choice of filesystems which can provide enormous boosts. Ext* (be it 2, 3 or 4) are definitely nice, but they are not performance-oriented filesystems.
You’re probably also wondering whether you want journaling at all, as it can harm performance. Yes, you do. If a hard reboot suddenly hits a massive 2TB filesystem, can you afford an endless e2fsck over all of it? – You can’t, trust me.
The performance boost you can gain by choosing the right filesystem is enormous. In my previous workplace we had severe IO write performance problems, and it was not until we migrated from the default ext3 to JFS that we got an approximate 50% performance boost – just from changing the filesystem!
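Trying a candidate filesystem on a spare partition is cheap, so there is little excuse not to benchmark it with your IO tester first – /dev/sdb1 and /data below are hypothetical:

```
# Create a JFS filesystem on a spare partition and mount it (as root):
mkfs.jfs -q /dev/sdb1
mount -o noatime /dev/sdb1 /data   # noatime skips access-time updates, saving writes
```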
Speaking of filesystems, sometimes you might not require a filesystem. A filesystem eventually slows things down. Do you need dates on your files? Permissions? Hard links? A hierarchy?
If you’re feeling hardcore – handing the application a raw block device can yield huge performance gains, at the price of increased management.
Have a look at Oracle RDBMS, for instance. It can work with raw devices – accessing a character/block device directly is much faster than going through files on a filesystem. It’s a headache for the DBA, though. But if squeezing out all the performance you can is your top priority, the extra management work fades into insignificance.
The hardware is perhaps the most important component of your system – if your system is IO bound.
It makes a huge difference whether your system will access a single, off-the-shelf, commodity HD or, on the other hand, a set of 128 disks in an enterprise storage machine.
Before assessing this situation, you should first ask yourself if you’d like to tackle the disk configuration on your own, or use an off-the-shelf solution.
Off-the-shelf solutions will usually be excellent in terms of performance, reliability and management. However, they’ll be very expensive – sometimes even 10 times more expensive for the same performance.
The task of designing the right disk configuration for your application still exists even if you bought the super-expensive off-the-shelf storage system from one of the big companies.
You should be highly familiar with RAID levels and what would work best for your application.
You should then ask yourself if the software RAID found in Linux is comprehensive enough for your system. I am the proud user of Linux software RAID on my home server, but for a proper enterprise system I would try to avoid it.
Speaking of RAID levels, RAID 4/5 will never give you good write performance compared to RAID 0 or RAID 10.
If write performance is what you are after, then for the same price a set of cheap, commodity HDs in RAID 10 might do much better than a set of high-end 15K RPM HDs in RAID 5. I am aware that RAID 10 needs double the number of disks compared to RAID 5, but as I said – use commodity ones!
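With Linux software RAID, assembling four commodity disks into a RAID 10 array is a few lines of mdadm – the device names below are hypothetical:

```
# Build a 4-disk RAID 10 array out of commodity drives (as root):
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.jfs -q /dev/md0
```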
If Google and Facebook are using the approach of commodity hardware (and probably commodity HDs) – then it can’t be wrong!
Another word about RAID controllers – they come in all shapes and colors. For proper performance, go for the branded ones, such as Adaptec or 3ware. Generally speaking, prefer the ones with open-source drivers or drivers in the vanilla kernel, so you are never locked to a specific kernel version.
Test, test and test some more!
Always profile your IO performance, using tools such as Monitis, Munin, Cacti and last but not least – iostat.
I really like iostat for that task; it gives you excellent counters such as:
- r/s and w/s – read and write requests per second
- rkB/s and wkB/s – kilobytes read and written per second
- avgqu-sz – the average length of the device’s request queue
- await – the average time a request spends queued and being served
- %util – how saturated the device is
And more counters, all measurable at a tight resolution.
Constantly monitoring your system will let you know whether tedious IO tuning is needed at all. If there are no IO performance problems, then chill, relax, have a beer – it’s much more fun.