Key Linux Performance Metrics

Much has been written about how to set up different monitoring tools to look after the health of your Linux servers. This article attempts to present a concise overview of the most important metrics available on Linux and the associated tools.

CPU utilization

CPU usage is usually the first place we look when a server shows signs of slowing down (although more often than not, the problem is elsewhere). The top command is arguably the most common performance-related utility in Linux when it comes to processes and CPU. By default, top displays summary percentages for all CPUs on the system. Most servers have multi-core CPUs, and the kernel treats each core as a separate CPU, so to view the statistics broken down by CPU (or core), use the “1” command in top. To sort processes by CPU usage, type “P” (in older versions of top, such as the one shown here, “O” followed by “k” does the same).

top - 16:16:16 up 8 days,  5:30,  3 users,  load average: 0.11, 0.12, 0.14
Tasks: 228 total,   1 running, 226 sleeping,   0 stopped,   1 zombie
Cpu0  :  2.7%us,  1.6%sy,  0.0%ni, 89.5%id,  5.9%wa,  0.1%hi,  0.1%si,  0.0%st
Cpu1  :  3.1%us,  1.7%sy,  0.0%ni, 95.0%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3337576k total,  2920168k used,   417408k free,   301852k buffers
Swap:  5439480k total,    22520k used,  5416960k free,  1313284k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17788 dkamenov  20   0  664m  82m  27m S  5.4  2.5  25:54.36 chrome
 2597 root      20   0  198m  26m  10m S  1.4  0.8  50:22.32 Xorg
 9028 dkamenov  20   0  162m  28m  14m S  0.4  0.9   0:32.57 ld-linux.so.2
 9667 dkamenov  20   0  957m  62m  19m S  0.3  1.9   0:17.93 chrome
17660 dkamenov  20   0  744m 129m  37m S  0.3  4.0  14:41.33 chrome
17765 dkamenov  20   0 1173m 332m  20m S  0.3 10.2   5:44.19 chrome

It is important to distinguish between two types of CPU metrics: load averages and percentages.

Load Averages

All UNIX-like systems traditionally display the system load as 1-minute, 5-minute and 15-minute load averages. The load average is not a percentage: it is the average number of processes that are either running on a CPU or waiting for one (on Linux, processes in uninterruptible sleep, typically waiting on disk I/O, are counted as well). Because waiting processes are included, the value can exceed the number of CPUs. A load of 1.00 per CPU means that, on average, every CPU is executing 100% of the time and no processes are waiting for a CPU to become available (on a machine with a single dual-core CPU that point would be 2.00, on a machine with two quad-core CPUs 8.00, and so on). Of course a load of 1.00 per CPU means that there is no spare capacity to take an increased load, so many administrators start to worry when they see load averages consistently over 0.70 per CPU.
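
To compare a load average against the per-CPU figures discussed above, it has to be divided by the number of CPUs. The following is a minimal Python sketch of that arithmetic, not part of any monitoring tool; the function names load_per_cpu and is_busy are made up for illustration, and the 0.70 threshold is simply the rule of thumb from the text.

```python
def load_per_cpu(load_averages, cpu_count):
    """Divide each load average by the number of CPUs (cores)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(load / cpu_count for load in load_averages)


def is_busy(load_averages, cpu_count, threshold=0.70):
    """True if the 15-minute load per CPU exceeds the threshold."""
    return load_per_cpu(load_averages, cpu_count)[2] > threshold


# Load averages from the top output above, taken on a 2-CPU machine.
sample = (0.11, 0.12, 0.14)
print(load_per_cpu(sample, 2))
print(is_busy(sample, 2))
```

On a live system, os.getloadavg() and os.cpu_count() from the Python standard library can supply the two arguments.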

Percentages

Percentages break down the time each CPU spends in each of the following states:

  • %us – time spent running processes in user mode
  • %sy – time spent in system (kernel) mode
  • %ni – time spent running niced (lowered-priority) processes in user mode
  • %id – idle time
  • %wa – idle time during which the CPU was waiting for outstanding I/O to complete
  • %hi – time spent servicing hardware interrupts
  • %si – time spent servicing software interrupts
  • %st – time stolen from this virtual machine by the hypervisor

Another command which displays CPU percentage statistics is mpstat (part of the sysstat package). When run without an interval, it reports averages since boot:

$ mpstat -P ALL
Linux 2.6.32-220.el6.x86_64 (tramp)     04/16/2012      _x86_64_        (2 CPU)

05:02:16 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:02:16 PM  all    6.86    0.00    1.98    4.34    0.02    0.06    0.00    0.00   86.74
05:02:16 PM    0    6.78    0.00    2.01    7.60    0.04    0.11    0.00    0.00   83.46
05:02:16 PM    1    6.94    0.00    1.95    1.14    0.00    0.00    0.00    0.00   89.97
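
The percentages that top and mpstat report are derived from cumulative tick counters that the kernel exposes in /proc/stat. The sketch below shows the idea in Python: take two samples of a cpu line and express the difference in each counter as a share of the total. It assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical and the two sample lines are invented, not taken from a real machine.

```python
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Return the first eight tick counters of a /proc/stat cpu line."""
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Very old kernels report fewer fields; treat the missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Invented sample lines standing in for two reads of /proc/stat.
first = parse_cpu_line("cpu0 1000 0 500 8000 400 10 20 0 0 0")
second = parse_cpu_line("cpu0 1030 0 520 8900 450 10 20 0 0 0")
for name, pct in cpu_percentages(first, second).items():
    print("%%%s %5.1f" % (name, pct))
```

Because the counters only ever grow, a single sample can only give averages since boot; two samples a few seconds apart are needed to see current activity.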

Memory Usage

When a process asks the kernel for memory and the system has run out of physical memory, the kernel starts paging out the least recently used memory pages to disk to free up space. When the process that owns those pages needs them back, the kernel has to find other rarely used pages, page them out and page the original ones back into physical memory. This mechanism makes more memory available to applications than is physically installed on the server; this memory is known as virtual memory. The good thing is that your application doesn’t even know it is using virtual memory. That doesn’t mean you can stop keeping track of memory usage, because nothing is free: disk access is far slower than RAM access, so if your system starts paging excessively, virtual memory access becomes a performance bottleneck.

(A quick note: although the terms paging and swapping are often used interchangeably, strictly speaking paging refers to individual memory pages being loaded from or saved to disk, and swapping to the entire memory space of a Linux process being moved from memory to disk or vice versa.)

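To examine virtual memory usage, use the vmstat command. When run without parameters, it prints a single report: the memory columns show the current state, while the rate columns (such as si and so) are averages since the last boot. To watch current activity, pass an interval in seconds (for example, vmstat 5). Here is the output without parameters: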

$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  24036 325876 314876 1244732    0    0     4    16   11   57  6  2 87  4  0

The most important entries here are as follows:

  • free – amount of idle memory (in KB by default)
  • si – paged (swapped) in: the amount of memory read back from disk into physical memory per second
  • so – paged (swapped) out: the amount of memory written to disk per second to make more room
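
The si and so rates can also be approximated by hand from the pswpin and pswpout counters in /proc/vmstat, which count pages swapped in and out since boot. The Python sketch below takes two samples and converts the difference to KB per second. The function names are hypothetical, the sample text is invented, and the 4 KB page size is an assumption that holds on typical x86 systems but not everywhere.

```python
def parse_vmstat(text):
    """Parse 'name value' lines, as found in /proc/vmstat, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates(before, after, interval_s, page_kb=4):
    """Return (si, so) in KB per second between two samples.

    page_kb is an assumed page size; adjust it for your architecture.
    """
    si = (after["pswpin"] - before["pswpin"]) * page_kb / interval_s
    so = (after["pswpout"] - before["pswpout"]) * page_kb / interval_s
    return si, so


# Invented counters standing in for two reads of /proc/vmstat, 5 seconds apart.
first = parse_vmstat("pswpin 1200\npswpout 5600\n")
second = parse_vmstat("pswpin 1200\npswpout 5725\n")
si, so = swap_rates(first, second, interval_s=5)
print("si=%.0f KB/s so=%.0f KB/s" % (si, so))
```

Sustained non-zero values for both rates at the same time are the pattern to watch for, since they suggest pages are being written out and read back repeatedly.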

The vmstat output above shows a system without any paging activity – the machine has enough physical memory to serve the needs of all running processes. While this is the situation you want to be in, most servers will exhibit some level of paging activity. To examine memory usage by individual processes, start top again and type “M” to sort by memory usage, or type the “O” interactive command to change the sort order and select “N” for memory.

top - 19:35:51 up 26 days, 21:29,  3 users,  load average: 0.00, 0.00, 0.00
Tasks: 173 total,   1 running, 172 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3107288k total,  2979052k used,   128236k free,   173988k buffers
Swap:  2818040k total,      144k used,  2817896k free,  1989964k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11906 root      19   0 1244m 464m  15m S  0.0 15.3   3:58.59 java
20832 dkamenov  15   0  280m  57m  24m S  0.0  1.9  13:24.33 firefox
20694 dkamenov  15   0  185m  50m  17m S  0.0  1.7   0:04.57 esc
20660 dkamenov  18   0 91568  28m  19m S  0.0  0.9   1:05.51 nautilus
 2916 root      15   0  107m  23m 9616 S  0.0  0.8   1:03.17 Xorg
20722 dkamenov  15   0 40304  21m  10m S  0.0  0.7   1:33.03 puplet
 2985 dkamenov  18   0 85172  15m  11m S  0.0  0.5   0:01.56 nautilus
 3098 root      34  19 28828  13m 2244 S  0.0  0.4   5:24.63 yum-updatesd
 2556 root      18   0 28780  12m 5900 S  0.0  0.4   0:05.54 asterisk
 4445 root      18   0 24504  11m 7216 S  0.0  0.4   0:00.30 httpd

If you discover that certain applications need too much memory and your system is paging more than it should, consider installing more memory or moving those applications to another machine. If you do install more memory, don’t forget to review the amount of swap space – a traditional rule of thumb is that it should be at least equal to the amount of physical memory available.

Disk Subsystem

Whenever you suspect that disk I/O activity is the bottleneck, use the iostat command with the -x switch to examine disk activity. When run without an interval, it reports averages since boot:

$ iostat -x /dev/sda
Linux 2.6.32-220.el6.x86_64 (tramp)     04/17/2012      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           6.44    0.00    1.99    4.35    0.00   87.22

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.05     1.92    0.13    0.92     4.98    21.44    25.25     0.07   67.53  30.49   3.19

The important values here are:

  • r/s – read requests per second
  • w/s – write requests per second
  • await – average time to service a request (in milliseconds). This includes wait time as well as actual time spent servicing the request
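
The values listed above can be reproduced, at least approximately, from two samples of the device’s line in /proc/diskstats. The Python sketch below assumes the classic 14-field layout (major, minor, device name, then eleven counters); the function and field names are made up for readability, and the two sample lines are invented rather than captured from a real disk.

```python
COUNTERS = ("reads", "reads_merged", "sectors_read", "ms_reading",
            "writes", "writes_merged", "sectors_written", "ms_writing",
            "in_flight", "ms_io", "weighted_ms_io")


def parse_diskstats_line(line):
    """Parse one /proc/diskstats line into a dict of counters."""
    parts = line.split()
    if len(parts) < 14:
        raise ValueError("expected at least 14 fields: %r" % line)
    stats = dict(zip(COUNTERS, (int(v) for v in parts[3:14])))
    stats["device"] = parts[2]
    return stats


def io_rates(before, after, interval_s):
    """Compute r/s, w/s, await (ms) and %util between two samples."""
    reads = after["reads"] - before["reads"]
    writes = after["writes"] - before["writes"]
    # Total time requests spent queued and being serviced, in milliseconds.
    wait_ms = ((after["ms_reading"] - before["ms_reading"])
               + (after["ms_writing"] - before["ms_writing"]))
    requests = reads + writes
    return {
        "r/s": reads / interval_s,
        "w/s": writes / interval_s,
        "await": wait_ms / requests if requests else 0.0,
        "%util": 100.0 * (after["ms_io"] - before["ms_io"]) / (interval_s * 1000.0),
    }


# Invented lines standing in for two reads of /proc/diskstats, 10 seconds apart.
first = parse_diskstats_line(
    "   8       0 sda 1000 50 40000 6000 9000 1900 210000 600000 0 30000 606000")
second = parse_diskstats_line(
    "   8       0 sda 1010 50 40400 6200 9090 1920 212000 606800 0 30500 613000")
for name, value in io_rates(first, second, interval_s=10).items():
    print("%-6s %8.2f" % (name, value))
```

A high await combined with a low request rate points to slow individual requests, whereas a high await with %util close to 100 suggests the device is saturated.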

The I/O subsystem is the most common bottleneck on database servers and file servers, because their workloads are dominated by reading and writing data. Some options to consider are:

  • Use faster disks. Higher RPM means lower rotational latency and higher throughput
  • Use logical volumes with striping – this way a single request can be serviced by several disks in parallel
  • Use a hardware RAID controller – avoid software RAID for data-intensive applications
  • Add more memory to allow for larger buffers

Conclusion

In this installment we explored CPU utilization, memory usage and the disk subsystem – the most likely suspects for decreased performance. We also looked at the associated ‘tools of the trade’. When investigating performance issues, it is often beneficial to put things into perspective, which is why we should collect performance-related data to use as a baseline. In a future article we will look at sar and its associated commands to do just that.

References:

  1. Linux Performance and Tuning Guidelines, IBM Redbooks: https://lenovopress.com/redp4285.pdf
  2. Examining Load Average, Linux Journal: http://www.linuxjournal.com/article/9001
  3. Monitoring Virtual Memory with vmstat, Linux Journal: http://www.linuxjournal.com/article/8178
