Much has been written about how to set up different monitoring tools to look after the health of your Linux servers. This article attempts to present a concise overview of the most important metrics available on Linux and the associated tools.
CPU usage is usually the first place we look when a server shows signs of slowing down (although more often than not, the problem is elsewhere). The top command is arguably the most common performance-related utility in Linux when it comes to processes and CPU. By default, top displays summary percentages for all CPUs on the system. Most CPUs today have several cores, and the kernel treats each core as a separate CPU, so to view the statistics broken down by CPU (or core), use the “1” command in top. To sort processes by CPU usage, type “O” followed by “k” in the version of top shown here; typing “P” (Shift+P) has the same effect and also works in newer versions.
top - 16:16:16 up 8 days,  5:30,  3 users,  load average: 0.11, 0.12, 0.14
Tasks: 228 total,   1 running, 226 sleeping,   0 stopped,   1 zombie
Cpu0  :  2.7%us,  1.6%sy,  0.0%ni, 89.5%id,  5.9%wa,  0.1%hi,  0.1%si,  0.0%st
Cpu1  :  3.1%us,  1.7%sy,  0.0%ni, 95.0%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3337576k total,  2920168k used,   417408k free,   301852k buffers
Swap:  5439480k total,    22520k used,  5416960k free,  1313284k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17788 dkamenov  20   0  664m  82m  27m S  5.4  2.5  25:54.36 chrome
 2597 root      20   0  198m  26m  10m S  1.4  0.8  50:22.32 Xorg
 9028 dkamenov  20   0  162m  28m  14m S  0.4  0.9   0:32.57 ld-linux.so.2
 9667 dkamenov  20   0  957m  62m  19m S  0.3  1.9   0:17.93 chrome
17660 dkamenov  20   0  744m 129m  37m S  0.3  4.0  14:41.33 chrome
17765 dkamenov  20   0 1173m 332m  20m S  0.3 10.2   5:44.19 chrome
It is important to distinguish between two types of CPU metrics: load averages and percentages.
All UNIX-like systems traditionally display the CPU load as 1-minute, 5-minute and 15-minute load averages. Essentially, the load average represents the average number of processes that were running or waiting for a CPU over that interval (on Linux it also counts processes in uninterruptible sleep, typically waiting on disk I/O). Because waiting processes are counted, the load average can exceed 1.00 per CPU, which means the CPU is over-utilized. The “perfect” utilization point of 1.00 per CPU means that the CPU is executing 100% of the time and no processes are waiting for a CPU to become available. (On a machine with a single dual-core CPU that point would be 2.00, on a machine with two quad-core CPUs it would be 8.00, and so on.) Of course, a load of 1.00 per CPU would mean that there is no spare capacity to take an increased load, so most administrators start to worry when they see load averages consistently over 0.70 per CPU.
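To make the per-CPU arithmetic concrete, here is a minimal Python sketch. It relies only on the standard library calls os.getloadavg() and os.cpu_count(); the function names and the use of the 15-minute figure for the 0.70 check are choices made for this illustration, not a standard tool.

```python
import os

def per_cpu_load(load_averages, cpu_count):
    """Divide each load average by the number of CPUs (cores)."""
    return tuple(avg / cpu_count for avg in load_averages)

def is_overloaded(load_averages, cpu_count, threshold=0.70):
    """True if the 15-minute load per CPU is above the threshold."""
    return per_cpu_load(load_averages, cpu_count)[2] > threshold

def current_load():
    """Return (load averages, CPU count) for the machine running the script."""
    return os.getloadavg(), os.cpu_count() or 1  # cpu_count() can return None

# Load averages taken from the top output earlier in the article (2 CPUs).
sample = (0.11, 0.12, 0.14)
print(per_cpu_load(sample, 2))
print(is_overloaded(sample, 2))
```

On a live server, pass the two values returned by current_load() to the same functions, for example is_overloaded(*current_load()).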
Percentages break down how each CPU spends its time:
- %us – percentage of time spent executing processes in user mode
- %sy – time spent executing in system (kernel) mode
- %ni – time spent executing niced (reprioritized) processes in user mode
- %id – idle time
- %wa – time the CPU sat idle waiting for I/O to complete
- %hi – time spent servicing hardware interrupts
- %si – time spent servicing software interrupts
- %st – time stolen from this virtual machine by the hypervisor
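These percentages come from cumulative counters that the kernel exposes on the first line of /proc/stat. The sketch below shows one way to turn two samples of that line into percentages; the function names and the sample counter values are made up for illustration, and this is only an approximation of what top and mpstat do internally.

```python
import time

# Order of the first eight counters on a "cpu" line in /proc/stat:
# user, nice, system, idle, iowait, irq, softirq, steal.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")

def parse_cpu_line(line):
    """Return the first eight counters of a 'cpu' line as integers."""
    parts = line.split()
    return [int(value) for value in parts[1:1 + len(FIELDS)]]

def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = [new - old for old, new in
              zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in zip(FIELDS, deltas)}

def sample_cpu(interval=1.0):
    """Sample /proc/stat twice on a live Linux system (not called here)."""
    def first_line():
        with open("/proc/stat") as stat:
            return stat.readline()
    before = first_line()
    time.sleep(interval)
    return cpu_percentages(before, first_line())

# Hypothetical counter values, used only to demonstrate the calculation.
before = "cpu  1000 0 500 8000 400 50 50 0 0 0"
after = "cpu  1100 0 550 8800 440 55 55 0 0 0"
print(cpu_percentages(before, after))
```

Guest time is already included in the user counter, which is why the sketch stops after the steal column.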
Another command which displays CPU percentage statistics is mpstat:
$ mpstat -P ALL
Linux 2.6.32-220.el6.x86_64 (tramp)   04/16/2012   _x86_64_   (2 CPU)

05:02:16 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:02:16 PM  all    6.86    0.00    1.98    4.34    0.02    0.06    0.00    0.00   86.74
05:02:16 PM    0    6.78    0.00    2.01    7.60    0.04    0.11    0.00    0.00   83.46
05:02:16 PM    1    6.94    0.00    1.95    1.14    0.00    0.00    0.00    0.00   89.97
When a process asks the kernel to allocate memory and the system has run out of physical memory, the kernel starts paging out the least-used memory blocks to disk to free up some space. When the process that owns those blocks needs them back, the kernel has to find another least-used block, page it out, and page the original block back into physical memory. This mechanism means that more memory is available to applications than the physical memory installed on the server – this memory is known as virtual memory. The good thing is that your application doesn’t even know it is using virtual memory. But that doesn’t mean you should not keep track of memory usage, because nothing is free. Since disk access is much slower than RAM access, if your system starts paging excessively, virtual memory access will become a performance bottleneck. (A quick note: although the terms paging and swapping are often used interchangeably, strictly speaking paging refers to individual memory pages being loaded from or saved to disk, and swapping to the entire memory space of a Linux process being moved from memory to disk or vice versa.)
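To examine the virtual memory usage, use the vmstat command. When run without parameters, it prints a single report in which the memory columns show the current state, while the rate columns (si, so, bi, bo, in, cs and the CPU percentages) are averages since the last boot. To see current activity instead, pass a delay in seconds, for example vmstat 5.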
$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  24036 325876 314876 1244732    0    0     4    16   11   57  6  2 87  4  0
The most important entries here are as follows:
- free – amount of idle memory (in kilobytes by default)
- si – paged (swapped) in. This is the amount of memory read back from disk into physical memory per second
- so – paged (swapped) out – the amount of memory written to disk per second to make more room
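If you want to check these columns from a script rather than by eye, the following Python sketch parses a vmstat report into a dictionary keyed by column name. The function names are invented for this example, and the sample text is the vmstat report shown above.

```python
import subprocess

def parse_vmstat(output):
    """Map vmstat column names to the integer values of the last report line."""
    lines = [line.split() for line in output.splitlines() if line.strip()]
    # The column-name row is the one whose first token is "r".
    names = next(tokens for tokens in lines if tokens[0] == "r")
    values = [int(token) for token in lines[-1]]
    return dict(zip(names, values))

def is_paging(report):
    """True if the report shows any swap-in or swap-out activity."""
    return report["si"] > 0 or report["so"] > 0

def live_report(delay=5):
    """Run 'vmstat <delay> 2' and parse the second report (not called here)."""
    result = subprocess.run(["vmstat", str(delay), "2"],
                            capture_output=True, text=True, check=True)
    return parse_vmstat(result.stdout)

SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  24036 325876 314876 1244732    0    0     4    16   11   57  6  2 87  4  0
"""

report = parse_vmstat(SAMPLE)
print(report["free"], report["si"], report["so"], is_paging(report))
```

live_report() asks vmstat for two reports and keeps the second one, because the first report only contains averages since boot.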
The vmstat report above shows a system without any paging activity – the machine has enough physical memory to serve the needs of all running processes. While this is the situation you want to be in, most servers will exhibit some level of paging activity. To examine memory usage by individual processes, start top again, type the “O” interactive command to change the sort order and select “N” for memory (typing “M”, that is Shift+M, has the same effect and also works in newer versions of top).
top - 19:35:51 up 26 days, 21:29,  3 users,  load average: 0.00, 0.00, 0.00
Tasks: 173 total,   1 running, 172 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3107288k total,  2979052k used,   128236k free,   173988k buffers
Swap:  2818040k total,      144k used,  2817896k free,  1989964k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11906 root      19   0 1244m 464m  15m S  0.0 15.3   3:58.59 java
20832 dkamenov  15   0  280m  57m  24m S  0.0  1.9  13:24.33 firefox
20694 dkamenov  15   0  185m  50m  17m S  0.0  1.7   0:04.57 esc
20660 dkamenov  18   0 91568  28m  19m S  0.0  0.9   1:05.51 nautilus
 2916 root      15   0  107m  23m 9616 S  0.0  0.8   1:03.17 Xorg
20722 dkamenov  15   0 40304  21m  10m S  0.0  0.7   1:33.03 puplet
 2985 dkamenov  18   0 85172  15m  11m S  0.0  0.5   0:01.56 nautilus
 3098 root      34  19 28828  13m 2244 S  0.0  0.4   5:24.63 yum-updatesd
 2556 root      18   0 28780  12m 5900 S  0.0  0.4   0:05.54 asterisk
 4445 root      18   0 24504  11m 7216 S  0.0  0.4   0:00.30 httpd
If you discover that certain applications need too much memory and your system is paging more than it should, consider installing more memory or moving those applications to another machine. If you do install more memory, don’t forget to review the amount of swap space – as a traditional rule of thumb it should be at least equal to the amount of physical memory available.
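The rule of thumb above is easy to check from /proc/meminfo, which reports MemTotal and SwapTotal in kilobytes. The sketch below is one possible way to do it; the function names are invented for this example, and the sample text reuses the totals from the first top output in this article, laid out in /proc/meminfo format.

```python
def parse_meminfo(text):
    """Return {field: value} for the 'Name:  value kB' lines of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])
    return values

def swap_shortfall_kb(meminfo):
    """Kilobytes of swap missing for swap to equal physical memory (0 if none)."""
    return max(0, meminfo["MemTotal"] - meminfo["SwapTotal"])

def read_meminfo():
    """Read the real file on a live Linux system (not called here)."""
    with open("/proc/meminfo") as handle:
        return parse_meminfo(handle.read())

# Totals from the first top output in this article, in /proc/meminfo layout.
SAMPLE = """\
MemTotal:        3337576 kB
MemFree:          417408 kB
SwapTotal:       5439480 kB
SwapFree:        5416960 kB
"""

info = parse_meminfo(SAMPLE)
print(swap_shortfall_kb(info))
```

A result of zero means swap is already at least as large as physical memory; any other value is the amount of swap you would need to add to satisfy the rule.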
Whenever you suspect that disk I/O activity is the bottleneck, use the iostat command with the -x switch to examine disk activity (as with vmstat, a report produced without an interval shows averages since boot):
$ iostat -x /dev/sda
Linux 2.6.32-220.el6.x86_64 (tramp)   04/17/2012   _x86_64_   (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.44    0.00    1.99    4.35    0.00   87.22

Device:  rrqm/s  wrqm/s     r/s     w/s  rsec/s  wsec/s avgrq-sz avgqu-sz   await   svctm   %util
sda        0.05    1.92    0.13    0.92    4.98   21.44    25.25     0.07   67.53   30.49    3.19
The important values here are:
- r/s – read requests per second
- w/s – write requests per second
- await – average time to service a request (in milliseconds). This includes wait time as well as actual time spent servicing the request
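If these numbers point to the disk subsystem as the bottleneck, there are several ways to improve I/O performance:
- Use faster disks. Higher RPM means lower rotational latency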
- Use logical volumes with striping – this way a single request can be serviced by several disks in parallel
- Use a hardware RAID controller – avoid software RAID for data-intensive applications
- Add more memory to allow for larger buffers
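For readers who want to see where the r/s, w/s and await figures come from, the sketch below derives them from two samples of /proc/diskstats, the kernel's cumulative per-device counters. The function names and the sample counter values are made up for illustration, and the calculation is a simplified approximation of what iostat reports.

```python
def parse_diskstats(text, device):
    """Extract request counts and service times for one device.

    /proc/diskstats columns (0-based): 2 = device name, 3 = reads completed,
    6 = ms spent reading, 7 = writes completed, 10 = ms spent writing.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 14 and fields[2] == device:
            return {
                "reads": int(fields[3]),
                "read_ms": int(fields[6]),
                "writes": int(fields[7]),
                "write_ms": int(fields[10]),
            }
    raise ValueError(f"device {device!r} not found")

def io_rates(before, after, interval):
    """Compute r/s, w/s and await (ms) between two samples taken interval seconds apart."""
    reads = after["reads"] - before["reads"]
    writes = after["writes"] - before["writes"]
    busy_ms = ((after["read_ms"] - before["read_ms"])
               + (after["write_ms"] - before["write_ms"]))
    requests = reads + writes
    return {
        "r/s": reads / interval,
        "w/s": writes / interval,
        "await": busy_ms / requests if requests else 0.0,
    }

# Hypothetical counters for sda, sampled 10 seconds apart.
BEFORE = "   8       0 sda 1000 50 40000 6000 5000 900 160000 30000 0 20000 36000"
AFTER = "   8       0 sda 1010 50 40400 6300 5090 950 163000 33700 0 21000 40000"

rates = io_rates(parse_diskstats(BEFORE, "sda"), parse_diskstats(AFTER, "sda"), 10)
print(rates)
```

On a live system, read /proc/diskstats twice with a pause in between and pass both snapshots through the same functions.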
In this installment we explored CPU utilization, memory usage and the disk subsystem – the most likely suspects for decreased performance. We also looked at the associated ‘tools of the trade’. When investigating performance issues, it is often beneficial to put things into perspective – which is why we should collect performance-related data to use as benchmarks. In a future article we will look at sar and its associated commands to do just that.