Performance Analysis


In 2015, Brendan Gregg posted two great articles on Netflix blog: Linux Performance Analysis in 60 seconds, and Linux Perfomrance Tools. They have great value when I was in a urgency to spot performance issues. The articles cover the essential tools for performance troubleshooting, including:

  • Check out load averages: w or uptime
  • Print kernel ring buffer: dmesg -T
  • Virtual memory status: vmstat 1
  • Multiple processor staticstics: mpstat -P ALL 1
  • Task status: pidstat 1
  • CPU and I/O status: iostat -xz 1
  • Free memory check: free -m
  • Network Activity record: sar -n DEV 1
  • TCP activity record: sar -n TCP,ETCP 1
  • Display processes: top

We will dive into each of them in the next section.

Basic Troubleshooting

The command w is equivalent of uptime (which shows uptime since boot) and who (which shows logged-in users). It also displays load average for the last 1 minute, 5 minutes and 15 minutes. The number of load average reflects the overall system load (CPU + disks), and it is further discussed in this post with a simple take away:

  • If the averages are 0.0, then your system is idle.
  • If the 1 minute average is higher than the 5 or 15 minute averages, then load is increasing.
  • If the 1 minute average is lower than the 5 or 15 minute averages, then load is decreasing.
  • If they are higher than your CPU count, then you might have a performance problem (it depends).

When Linux load averages increase, you know you have higher demand for resources (CPUs, disks, and some locks), but you aren’t sure which. You will need to switch to other metrics. Brendan recommend don’t spend more than 5 seconds on these numbers.

[dhunch@c7v-ghintapp01 ~]$ w
 12:14:10 up 46 days, 16:41,  3 users,  load average: 2.69, 2.44, 2.29
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
dhunch   pts/0    w6v-ghas01 24Jun20  3days  0.36s  0.30s ssh c7v-bastion
dhunch   pts/1    12:02    2.00s  0.00s  0.00s w
dhunch   pts/4    w6v-ghas01 17Jun20 15days  0.15s  0.07s view readme.txt

Before moving to more insightful metrics, it is also worth a quick look into the kernel ring buffer with dmesg command (dmesg -T | less +G). This will allow us to capture obvious issues such as oom-killer or TCP request dropping.

The vmstat tool reports the statistics of virtual memory, which we covered in previous posting. Servers have a fixed amount of physical memory, but they can run a set of applications that use a much larger amount of virtual memory. Application tend to reserve more memory than they need, and they usually operate on only a subset of their memory. In both cases, the operating system can keep the unused parts of memory on disk, and page it into physical memory only if it is needed. For the most part, this kind of memory management works well. But it doesn’t always with Java applications due to Java heap. Once a system start swapping – moving pages of data from main memory to disk, and vice versa, the performance tend to be bad. Systems must be configured so that swapping never occurs.

[dhunch@c7v-ghintapp01 ~]$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 239360 385928      0 36734692    0    0     1   200    0    0  6  1 93  0  0
 3  0 239360 387732      0 36734704    0    0     0    43 8017 8524 14  0 85  0  0
 3  0 239360 387608      0 36734904    0    0     0    57 6768 7680 14  1 86  0  0
 2  0 239360 389008      0 36734904    0    0     0    44 6366 7300 14  0 86  0  0
 3  0 239360 421728      0 36700144    0    0     0     0 8141 7957 13  0 86  0  0
 2  0 239360 421984      0 36702048    0    0     0   467 8994 8362 14  0 85  0  0

The tool prints key server statistics each line, with the first line showing the average since boot. Here lists the explanation of some columns:

  • r: number of processes running on CPU and waiting for a turn. This provides a better signal than load averages for determining CPU saturation, as it does not include I/O. To interpret: an โ€œrโ€ value greater than the CPU count is saturation.
  • swpd: the amount of virtual memory used. This number should align with the used column for Swap row from free command.
  • buff, cache: buffer and cache. They should align with the buff/cache column form Mem row from free command.
  • free: free memory in kilobytes. This number should align with the free column for Mem row from free command.
  • si, so: swap-ins and swap-outs. As mentioned, if these are non-zero, you’re out of memory.
  • bi, bo: blocks received from and sent to a blcok device (# of block per second)
  • in, cs: number of interrupt, and context switches per second.
  • us, sy, id, wa, st: user, system, idle, wait I/O and stolen times. These are breakdowns of CPU time, on average across all CPUs. They should add up to 100% (or close). stolen time is amount of CPU time needed by a guest virtual machine that is not provided by the host. IO wait time is the CPU time waiting for I/O activity. Idle time could be several things: the process may be waiting for something (e.g. a response from database); the process may be blocked by a thread lock; or the process simply has nothing to do. user and system times are CPU times spent on user tasks and kernel tasks, respectively.

Out of these columns, watch for columns r, free, buff, cache, us, sy, id and wa at minimum. The combination of us and sy confirms if CPUs are busy. A constant degree of wa points to a disk bottleneck with too much time spent on pending disk I/O. The sy (kernel time) is necessary for I/O processing but sy stays high (e.g. constantly over 20%), it becomes interesting. Perhaps the kernel is processing I/O inefficiently.

For further per-CPU stats, use mpstat command (-P ALL), to prind CPU time breakdowns per CPU and check for imbalance. A single host CPU can be evidence of a single-threaded application. Here is an example output from a system of 16 CPU cores.

[dhunch@c7v-ghintapp01 ~]$ mpstat -P ALL 1
Linux 3.10.0-1062.12.1.el7.x86_64 (   08/01/20        _x86_64_        (16 CPU)

13:47:23     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:47:24     all   13.77    0.00    0.19    0.00    0.00    0.00    0.00    0.00    0.00   86.05
13:47:24       0    2.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.98
13:47:24       1    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
13:47:24       2    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
13:47:24       3   98.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    2.00
13:47:24       4    2.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.98
13:47:24       5    2.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.98
13:47:24       6    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
13:47:24       7    2.94    0.00    0.98    0.00    0.00    0.00    0.00    0.00    0.00   96.08
13:47:24       8    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
13:47:24       9    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
13:47:24      10   97.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    3.00
13:47:24      11    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00    0.00   98.02
13:47:24      12    2.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.98
13:47:24      13    1.98    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.02
13:47:24      14    2.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
13:47:24      15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

For a per process summary of CPU consumption, use pidstat command. It can be thought of a periodical snapshot of top command, allowing you to watch for patterns. The %CPU column is the total across all CPUs so 5 CPUs have a maximum value of 500.

If vmstate indicates some I/O issue, iostat tool can help us understand block devices, on both the workload applied and the resulting performance. Key columns are:

  • r/s, w/s, rkB/s, wkB/s: delivered reads, writes, read Kbytes, and write Kbytes per second to the device. Use these for workload characterization. A performance problem may simply be due to an excessive load applied.
  • await: the average wait time for I/O in milliseconds. This is the time that the application suffers, as it includes both time queued and time being serviced. Larger than expected average times can be an indicator of device saturation, or malfunction.
  • avgqu-sz: the average number of requests issued to device. Values greater than 1 can be evidence of saturation (although devices can typically operate on requests in parallel, especially virtual devices which front multiple back-end disks.)
  • %util: device utilization. This is really a busy percent, showing the time each second that the device was doing work. Values greater than 60% typically lead to poor performance (which should be seen in await), although it depends on the device. Values close to 100% usually indicate saturation.

I/O problem may either be inefficiencies in application that issues I/O request, or slowing disk unable to keep up with I/O requests. We review two examples here to illustrate each situation. The first output is as follows:

% iostat -xm 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          23.45    0.00   37.89    0.10    0.00   38.56

          Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
          sda               0.00    11.60    0.60   24.20     0.02    0.14    13.35     0.15    6.06    5.33    6.08   0.42   1.04

In the first example, the disk stat loosk up at first glance. The w_await (time to service I/O write) is fairly low at 6.08ms. However, the system is spending 37.89% of its time in the kernel. If all that system time is from the application, it suggest something inefficient is happening. The fact that the system is doing 24.2 writes per second is another clue: that is alot when writing only 0.14 MB per second (MBps). I/O has become a bottleneck, and the next step would be to look into how the application is performing its writes.

The second example output is as follows:

% iostat -xm 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          35.05    0.00    7.85   47.89    0.00    9.20

          Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
          sda               0.00     0.20    1.00  163.40     0.00    81.09  1010.19   142.74  866.47   97.60  871.17   6.08 100.00

In this example, it tells us that processes are spending 47.89% of their time in iowait, and the data to complete the I/O (w_await) is 871ms, the queue size is large, and the disk is writing at 81MB per second. This all points to disk I/O as a problem and that the amount of I/O in the application (or elsewhere in the system) must be reduced.

Bear in mind that poor performing disk I/O isn’t necessarily an application issue. Many techniques are typically used to perform I/O asynchronously, so that the application doesn’t block and suffer the latency directly (e.g. read-ahead for reads, and buffering for writes, also refer to “tail latency“).

Note that the acceptable utilization metric depends on the configuration of block device. If the storage is a logical disk device fronting many back-end disks (e.g. RAID 0), then 100% utilization may just mean that some I/O is being processed 100% of the time, however, the back-end disks may be far from being saturated, and may even be able to handle more work.

The free command gives the breakdown of memory usage. The right two columns are:

  • buffers: for the buffer cache, used for block device I/O.
  • cached: for the page cache, used by file systems.

We just want to check that these aren’t near-zero in size, which can lead to higher disk I/O (confirm using iostat), and worse performance. Linux uses free memory for the caches, but can reclaim it quickly if applications need it. So in a way the cached memory should be included in the free memory column. In this case, it’s included in the available column. This website has further details.

To check interface stat, nicstat is a great tool but it isn’t available by default in Linux. Instead we can run sar (-n DEV) to retrieve stats.

[dhunch@c7v-ghintapp01 ~]$ sar -n DEV 1
Linux 3.10.0-1062.12.1.el7.x86_64 (   08/01/20        _x86_64_        (16 CPU)

16:51:25        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
16:51:26         eth0   3089.00    934.00   3815.33    834.61      0.00      0.00      0.00
16:51:26           lo    464.00    464.00   2289.07   2289.07      0.00      0.00      0.00

16:51:26        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
16:51:27         eth0    956.00    586.00    826.66    211.34      0.00      0.00      0.00
16:51:27           lo    213.00    213.00    196.00    196.00      0.00      0.00      0.00

16:51:27        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
16:51:28         eth0    349.00    181.00     52.32    147.19      0.00      0.00      0.00
16:51:28           lo    244.00    244.00     81.13     81.13      0.00      0.00      0.00

Here rxkB/s/s and txkB/s represents receive and transmission throughput, as a measure of workload. If they reach the limit then the interface is the bottleneck.

On top of interface is the TCP layer, which can be monitored with sar again (-n ECP, ETCP). The key metrics are:

  • active/s: number of locally-initiated (e.g. via connect()) TCP connections per second
  • passive/s: number of remotely-initiated (e.g. via accept()) TCP connections per second
  • retrans/s: number of TCP retransmits per second

The active and passive counts are often useful as a rough measure of server load. It might help to think of active as outbound, and passive as inbound, but this isn’t strictly true (e.g. consider a localhost to localhost connection). Retransmits are a sign of network or server issue; it may be an unreliable network (e.g. public Internet), or it may be due to a server being overloaded and dropping packets.

[dhunch@c7v-ghintapp01 ~]$ sar -n TCP,ETCP 1
Linux 3.10.0-1062.12.1.el7.x86_64 (   08/01/20        _x86_64_        (16 CPU)

16:52:29     active/s passive/s    iseg/s    oseg/s
16:52:30         0.00      1.00     28.00     35.00

16:52:29     atmptf/s  estres/s retrans/s isegerr/s   orsts/s
16:52:30         0.00      0.00      0.00      0.00      0.00

16:52:30     active/s passive/s    iseg/s    oseg/s
16:52:31         8.00      8.00    200.00    317.00

16:52:30     atmptf/s  estres/s retrans/s isegerr/s   orsts/s
16:52:31         0.00      1.00      1.00      0.00      3.00

Last but not least is our favourite command top, which includes many of the metrics covered in previous tools. The downside to top is it is harder to see patterns over time, which may be more clear in tools like vmstat and pidstat, both of which produce rolling output.

Several tools introduced here involves sar, which is a great monitoring tool on its own that we need to be familiar with.

System Activity Report (SAR)

Further to the basic tools, the sar command is very helpful as it stores historical stat every 10 minutes. Sar keeps 18 types of reports, all stored in /var/log/sa/. When viewing the report file, you may pipe the result to less command so it only prints header once. For example, if you would like to print CPU report for the 2nd of the month:

# sar -u -f /var/log/sa/sar02 | less

If you check NFS client statistics for the 31st

 # sar -n NFS -f /var/log/sa/sar31

If you need to check network server statistics for today

# sar -n NFS -f /var/log/sa/sar31

Below are all types of reports:

  • -u CPU utilization
  • -w task creation and system switching activity
  • -W swapping statistics
  • -B report paging
  • -b report I/O and transfer rate statistics
  • -R report memory statistics
  • -r memory utilization
  • -S swap space utilization
  • -H huge pages utilization statistics
  • -v inode
  • -q queue length and load average
  • -y TTY device activity
  • -d activity for each block device
  • -n network statistics, DEV (per interface)
  • -n network statistics, EDEV (error per interface)
  • -n network statistics, NFS (NFS client)
  • -n network statistics, NFSD (NFS server)
  • -n network statistics, SOCK (socket usage)

Berkeley Packet Filter (BPF) Compiler Collection (bcc) tools

For advanced, low-level performance troubleshooting, the BCC tools provide a suite of tools. Here we only cover the installation of it.

In CentOS, install bcc-tools package with yum. When you try to run a command, such as cachestat, if you come across this error:

-bash: cachestat: command not found

Then you will need to add its path to default:

export PATH=$PATH:/usr/share/bcc/tools

Now if you run into this error:

chdir(/lib/modules/3.10.0-1062.12.1.el7.x86_64/build): No such file or directory
Traceback (most recent call last):

The file listed is a symbolic link, and if it is missing, you just need to install kernel-headers that matches the kernel version:

yum install kernel-headers

Then you may use tools in /usr/share/bcc/tools. For example, cachestat help you display page cache hit ratio; gethostlatency shows DNS resolution latency; tcpconnect prints out active tcp connections (made via connect system call):

[root@dhunch ~]# /usr/share/bcc/tools/tcpconnect -t -P 8080 | gawk '{ print strftime("%F %T  "), $0 }'
2020-06-13 00:16:57   TIME(s)  PID    COMM         IP SADDR            DADDR            DPORT
2020-06-13 02:16:57   0.000    15241  QNetworkAcce 4    8080
2020-06-13 02:16:57   0.064    15241  QNetworkAcce 4    8080
2020-06-13 02:16:57   0.438    15241  QNetworkAcce 4    8080

The command above outputs a time and pid stamped log line every time a TCP connection is made to port 8080; tcpaccept traces passive tcp connections (via accept system call). These tools are not as intrusive as tcpdump.

It is beyond the purpose of this article to get into details of each tool in the BFP suite. The tools are covered in detail in books “BPF Performance Tools” and “Linux Observability with BPF”.