Last modified: June 06, 2026

Performance Monitoring

Performance monitoring is the process of observing how a system uses its resources.

The goal is to understand whether the system is healthy, overloaded, or waiting on a specific bottleneck.

A bottleneck is the resource that limits performance.

Common bottlenecks include the CPU, memory, disk I/O, and the network.

A system may feel slow for many reasons. Performance monitoring helps avoid guessing.

Instead of saying:

The server is slow.

we want to answer which resource is the bottleneck, which process is responsible, and whether the load is expected.

Basic Performance Model

A Linux system runs many processes. Those processes compete for CPU, memory, disk, and network resources.

+-------------------+
| Applications      |
| nginx, database,  |
| browser, scripts  |
+---------+---------+
          |
          v
+-------------------+
| Linux Kernel      |
| scheduler, memory |
| filesystem, I/O   |
+---------+---------+
          |
          v
+-------------------+
| Hardware          |
| CPU, RAM, disk,   |
| network card      |
+-------------------+

Monitoring tools observe these layers and show how busy they are.

Important Usage Statistics

The most common system usage statistics are CPU usage, RAM usage, swap usage, disk usage, disk I/O, and load average.

Each statistic tells a different part of the story.

CPU Usage

CPU usage shows how much processing work the system is doing.

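High CPU usage can mean that the system is handling a heavy but legitimate workload, or that a single process is consuming CPU unexpectedly.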

CPU usage is not automatically bad. A busy CPU may be normal if the system is doing useful work.

The important question is:

Is the CPU busy because of expected work,
or is one process consuming CPU unexpectedly?

RAM Usage

RAM is fast working memory.

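Linux uses RAM for application memory, kernel data structures, and the buffers and page cache that speed up file access.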

Linux often uses available RAM for cache. This is usually good.

A system can show little “free” memory and still be healthy because cached memory can be reclaimed when applications need it.

The better field to watch is usually:

available memory

not just:

free memory
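
As a rough sketch of watching available rather than free memory, the function below reads /proc/meminfo-style text and prints MemAvailable as a percentage of MemTotal. The name mem_available_pct is hypothetical, and the sketch assumes a kernel that exposes the MemAvailable field (Linux 3.14 or later).

```shell
# Print MemAvailable as a percentage of MemTotal.
# Reads /proc/meminfo-style text on stdin, so a saved sample works too.
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) {
                printf "%.1f\n", 100 * avail / total
            } else {
                print "no MemTotal line found" > "/dev/stderr"
                exit 1
            }
        }
    '
}
```

On a live system it could be called as mem_available_pct < /proc/meminfo.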

Swap Usage

Swap is disk space used as overflow memory.

Swap helps prevent immediate crashes when RAM is full, but it is much slower than RAM.

Heavy swap usage can make a system feel extremely slow.

RAM is fast.
Swap is much slower because it uses disk.

Some swap usage is not necessarily a problem. Continuous swap-in and swap-out activity is.
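
One way to tell occasional swap use from continuous swapping is to count how many vmstat samples show non-zero si or so values. The sketch below assumes the standard vmstat column order, with si and so as the seventh and eighth fields; count_swapping_samples is a hypothetical helper name.

```shell
# Count vmstat samples that show swap-in (si) or swap-out (so) activity.
# Reads vmstat output on stdin. Header lines are skipped because their
# first field is not a number.
count_swapping_samples() {
    awk '
        $1 ~ /^[0-9]+$/ {
            samples++
            if ($7 > 0 || $8 > 0) swapping++
        }
        END { printf "%d of %d samples show swap activity\n", swapping, samples }
    '
}
```

It could be fed with vmstat 1 10 | count_swapping_samples. The first vmstat line reports averages since boot, so a single non-zero sample there does not prove that the system is swapping right now.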

Disk Usage vs Disk I/O

Disk usage and disk I/O are different.

Disk usage means how much storage space is filled.

Example:

The filesystem is 95% full.

Disk I/O means how actively the disk is reading and writing.

Example:

The disk is writing 300 MB/s and is 100% busy.

A disk can be almost full but not busy.

A disk can have plenty of free space but still be overloaded with reads and writes.

Load Average

Load average shows how many processes are running or waiting to run.

It is shown over three time periods: the last 1, 5, and 15 minutes.

Example:

load average: 0.42, 0.35, 0.30

On a single-core system, a load of 1.00 roughly means the CPU is fully occupied.

On a four-core system, a load of 4.00 may be normal under full CPU use.

However, load average can also increase when processes are waiting on disk I/O, not just CPU.

So high load means:

There is work waiting.

It does not always mean:

The CPU is the bottleneck.
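
Because the same load value means different things on different core counts, it can help to normalize it. The sketch below divides a load average by the number of CPUs; load_per_core is a hypothetical helper, and with no arguments it assumes /proc/loadavg and nproc are available.

```shell
# Divide a load average by the number of CPUs.
# Usage: load_per_core [LOAD] [CPUS]
# Defaults: the 1-minute load from /proc/loadavg and the count from nproc.
load_per_core() {
    awk -v load="${1:-$(cut -d' ' -f1 /proc/loadavg)}" \
        -v cpus="${2:-$(nproc)}" \
        'BEGIN {
            if (cpus <= 0) {
                print "CPU count must be positive" > "/dev/stderr"
                exit 1
            }
            printf "%.2f\n", load / cpus
        }'
}
```

A result near or above 1.00 suggests there is at least as much runnable or blocked work as there are cores. It still does not say whether that work is waiting on CPU or on disk.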

Monitoring Workflow

A good performance investigation follows a structured path.

  1. Check load average
  2. Check CPU usage
  3. Check memory and swap
  4. Check disk I/O
  5. Check disk space
  6. Identify the responsible process
  7. Check logs for errors
  8. Decide whether the load is expected or abnormal

Useful starting commands:

uptime
top
free -h
vmstat 1
iostat -xz 1
df -h
ps aux --sort=-%cpu | head
ps aux --sort=-%mem | head

top

The top command provides a live view of system activity.

Run:

top

It shows two main sections: a system summary at the top and a process list below it.

The system summary shows CPU, memory, swap, load average, task count, and uptime.

The process list shows running processes, usually sorted by CPU usage.

Example top Output

top - 15:00:02 up 1 day,  4:03,  2 users,  load average: 0.42, 0.35, 0.30
Tasks: 180 total,   2 running, 178 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.1 us,  2.2 sy,  0.0 ni, 92.1 id,  0.4 wa,  0.0 hi,  0.2 si,  0.0 st
KiB Mem :  8026792 total,  123456 free,  2345678 used,  5460658 buff/cache
KiB Swap:  2048000 total,  1755000 free,   293000 used,  1234567 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1234 user1     20   0  162956   2212   1124 R  25.0  0.3   0:15.03 my_process
5678 user2     20   0  161256   2024   1028 S  12.5  0.2   1:20.03 another_process

Understanding the Top Summary

Understanding CPU Fields in top

Example:

%Cpu(s): 5.1 us, 2.2 sy, 92.1 id, 0.4 wa

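Interpretation:

- us (5.1%) is time spent running user-space code.
- sy (2.2%) is time spent in kernel code.
- id (92.1%) is idle time, so the CPU is mostly unused.
- wa (0.4%) is time spent waiting for I/O to complete, which is negligible here.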

Understanding Process Columns in top

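Important process states:

- R: running or runnable.
- S: sleeping (interruptible).
- D: uninterruptible sleep, usually waiting on I/O.
- T: stopped.
- Z: zombie.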

Useful top Keys

To monitor one process:

top -p 1234

htop

htop is an interactive and more user-friendly alternative to top.

It shows CPU and memory bars and a process list, and it supports searching, filtering, a tree view, and easier process management.

Install it on Debian or Ubuntu:

sudo apt install htop

On Red Hat or CentOS (htop is usually provided by the EPEL repository):

sudo yum install htop

On Fedora:

sudo dnf install htop

Run:

htop

Example htop View

1  [||||||||||| 34.5%]   Tasks: 65, 132 thr; 2 running
2  [||||||||||  28.7%]   Load average: 1.23 0.97 0.88
Mem[|||||||||||||||1.45G/3.84G]
Swp[|             0K/512M]

  PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
1287 root       20   0  256M  4980  3192 R 28.6  0.1  0:03.41 /usr/bin/Xorg
2905 user1      20   0  517M  3720  2012 S 14.0  0.1  1:13.69 gnome-terminal

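Interpretation:

- The two CPU bars show cores at 34.5% and 28.7% utilization.
- Memory use is 1.45G of 3.84G, and swap is unused.
- The 1-minute load average of 1.23 is below the core count of 2.
- Xorg is the top CPU consumer at 28.6%.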

htop is useful when you want to interactively inspect and manage processes.

free

The free command shows memory and swap usage.

Run:

free -h

The -h option shows human-readable units.

Example output:

               total        used        free      shared  buff/cache   available
Mem:              8G        3.2G        2.1G        101M        2.7G        4.4G
Swap:             2G        1.2G        800M

Understanding free -h

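Important memory fields:

- total: RAM visible to the system.
- used: memory in use by applications and the kernel.
- free: memory not used for anything at all.
- shared: memory used mostly by tmpfs.
- buff/cache: memory used for kernel buffers and the page cache.
- available: an estimate of how much memory can be given to applications without swapping.

Important swap fields:

- total: configured swap space.
- used: swap space currently holding pages.
- free: swap space not in use.

Interpretation of the example:

- Only 2.1G is free, but 4.4G is available, so the system is not short of memory.
- 2.7G is held as buff/cache and can largely be reclaimed.
- 1.2G of the 2G swap is in use, so it is worth checking with vmstat whether swapping is still active.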

The most important field for practical memory pressure is usually:

available

If available is low and swap activity is high, the system may be under memory pressure.

RSS and VSZ

Linux process memory can be confusing because there are multiple memory measurements.

Two important fields are RSS and VSZ.

RSS

RSS means Resident Set Size.

It is the amount of physical RAM currently used by the process.

RSS is usually more useful than VSZ when asking:

How much real RAM is this process using right now?

However, RSS includes shared memory pages, so adding RSS values for many processes can overcount total RAM.
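
To make the overcounting concrete, the sketch below adds up RSS values read from standard input. The name sum_rss_kib is hypothetical. The result should be read as an upper bound, because pages shared between the processes are counted once per process.

```shell
# Sum RSS values (in KiB), one per line, read from stdin.
# Shared pages are counted once per process, so the total is an upper bound.
sum_rss_kib() {
    awk '{ total += $1 } END { printf "%d\n", total }'
}
```

With procps ps it could be used as ps -o rss= -C nginx | sum_rss_kib, where rss= prints the column without a header.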

VSZ

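VSZ means virtual memory size (sometimes expanded as Virtual Set Size). It is the total address space the process has mapped, whether or not it is backed by RAM.

It includes memory that may be:

- swapped out to disk
- allocated but never touched
- mapped from shared libraries or files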

VSZ can look large even when actual RAM use is modest.

A common mistake is to treat VSZ as real RAM usage. For physical RAM pressure, check RSS and %MEM.

Example RSS and VSZ Calculation

Suppose a process has three memory regions with 450K, 800K, and 120K currently resident in RAM.

RSS is:

450K + 800K + 120K = 1370K

Suppose the same three regions have 600K, 2200K, and 150K of virtual address space allocated.

VSZ is:

600K + 2200K + 150K = 2950K

The process has a larger virtual memory footprint than physical resident memory.
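
On a real system, the kernel reports both values for each process in /proc/PID/status as VmSize (VSZ) and VmRSS (RSS). The sketch below extracts them from status-style text on standard input; rss_vsz_from_status is a hypothetical name, and kernel threads, which have no Vm lines, would print zeros.

```shell
# Print VSZ and RSS from /proc/PID/status-style text on stdin.
# VmSize is the virtual size and VmRSS the resident size, both in kB.
rss_vsz_from_status() {
    awk '
        $1 == "VmSize:" { vsz = $2 }
        $1 == "VmRSS:"  { rss = $2 }
        END { printf "VSZ=%d KiB RSS=%d KiB\n", vsz, rss }
    '
}
```

For the current shell it could be called as rss_vsz_from_status < /proc/$$/status.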

Finding Top Memory Processes

To show processes sorted by real physical memory percentage:

ps aux --sort=-%mem | head -n 10

Example output:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql     5678 12.0 18.5 2540000 1500000 ?     Sl   10:00   3:20 mysqld
java      1234 25.0 15.0 4096000 1210000 ?     Sl   09:50   8:10 java
postgres  1213  5.0  8.0 1500000  650000 ?     Sl   09:55   2:30 postgres

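Interpretation:

- mysqld uses the most physical memory: 18.5% of RAM, with an RSS of about 1.5 GB.
- java has the largest VSZ (about 4 GB) but a smaller RSS than mysqld.
- VSZ is much larger than RSS for every process, which is normal.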

To sort by VSZ instead:

ps -e -o pid,vsz,rss,comm --sort=-vsz | head -n 10

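Important note: a large VSZ does not by itself indicate a problem. Confirm with RSS and %MEM before treating a process as memory-heavy.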

Checking Memory for a Specific Process

Example for nginx:

ps -o %mem,rss,vsz,cmd -C nginx

Example output:

%MEM   RSS     VSZ     CMD
 2.3   12000   250000  nginx: master process /usr/sbin/nginx
 1.2    6000   150000  nginx: worker process
 1.2    6000   150000  nginx: worker process

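Interpretation:

- The master process has an RSS of 12000 KiB (2.3% of RAM).
- Each worker has an RSS of 6000 KiB (1.2% of RAM).
- VSZ is far larger than RSS for every process, so most of the mapped address space is not resident.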

vmstat

vmstat shows process, memory, swap, disk I/O, system, and CPU statistics.

Run a single snapshot:

vmstat

Run updates every second:

vmstat 1

Run three samples five seconds apart:

vmstat 5 3

Example vmstat Output

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 2723288 844288 5670316    0    0    14    42   49   39  7  5 88  0  0
 2  0      0 2729716 844296 5670332    0    0     0   387 8888 12065  3  6 90  0  0
 1  0      0 2735688 844304 5670364    0    0     0   436 9379 13069  4  6 90  0  0

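Important fields:

- r: processes running or waiting for CPU time.
- b: processes blocked in uninterruptible sleep, usually on I/O.
- swpd: swap space in use.
- free, buff, cache: idle memory, buffers, and page cache.
- si, so: memory swapped in from disk and out to disk per second.
- bi, bo: blocks read from and written to block devices per second.
- in, cs: interrupts and context switches per second.
- us, sy, id, wa, st: CPU time as percentages for user code, kernel code, idle, I/O wait, and time stolen by a hypervisor.

Interpretation of this example:

- swpd, si, and so are 0, so swap is not in use.
- b and wa are 0, so nothing is waiting on disk.
- id is about 90%, so the CPU is mostly idle.
- The first line reports averages since boot; the later lines are live samples.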

uptime

uptime is a quick way to check how long the system has been running and what the load average is.

Run:

uptime

Example output:

15:00:02 up 1 day, 4:03, 2 users, load average: 0.42, 0.35, 0.30

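Interpretation:

- The current time is 15:00:02.
- The system has been up for 1 day, 4 hours, and 3 minutes.
- 2 users are logged in.
- The 1-, 5-, and 15-minute load averages are 0.42, 0.35, and 0.30, so the system is lightly loaded.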

iostat

iostat reports CPU and disk I/O statistics.

Install it through sysstat if needed:

sudo apt install sysstat

Run:

iostat -xz 1

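Important disk fields:

- r/s, w/s: read and write requests completed per second.
- rkB/s, wkB/s: kilobytes read and written per second.
- await: average time in milliseconds for a request to be served, including time spent in the queue.
- aqu-sz: average number of requests in the queue.
- %util: percentage of time the device was busy handling requests.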

Example output:

Device            r/s     w/s     rkB/s     wkB/s   await  aqu-sz  %util
sda              1.00    2.00     50.00    100.00    2.20    0.01   0.15

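Interpretation:

- sda handles 1 read and 2 writes per second, about 150 kB/s in total.
- await is 2.20 ms and the queue is nearly empty.
- %util is 0.15%, so the disk is almost idle.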

iotop

iotop shows disk I/O by process.

Install:

sudo apt install iotop

Run:

sudo iotop -o

The -o option shows only processes currently doing I/O.

Example output:

Total DISK READ: 100.00 K/s | Total DISK WRITE: 50.00 K/s
PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
7890 be/4  user      50.00 K/s   25.00 K/s  0.00 %  10.00 %  process_a
5678 be/4  user      50.00 K/s   25.00 K/s  0.00 %   5.00 %  process_b

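Interpretation:

- Total disk activity is low: 100 K/s of reads and 50 K/s of writes.
- process_a and process_b each read 50 K/s and write 25 K/s.
- process_a spends more of its time waiting on I/O (10%) than process_b (5%).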

iotop is useful when you know the disk is busy and want to know which process is responsible.

df and du

df shows filesystem space usage.

Run:

df -h

Example:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       100G   92G  8.0G  92% /

Interpretation:

- The root filesystem on /dev/sda3 is 92% full.
- Only 8.0G of 100G remains available.

du shows directory usage.

Example:

sudo du -h --max-depth=1 /var | sort -h

Example output:

100M    /var/tmp
2.0G    /var/log
12G     /var/lib
15G     /var

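Interpretation:

- /var uses 15G in total.
- /var/lib is the largest subdirectory at 12G, followed by /var/log at 2.0G.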

Scenario 1: Simulate a CPU Bottleneck

Create high CPU usage and verify it with top, htop, and vmstat.

Simulate the Bottleneck

Install stress-ng if needed:

sudo apt install stress-ng

Run a CPU stress test:

stress-ng --cpu 4 --timeout 60s

This starts four CPU workers for 60 seconds.

Check with top

top

Example output:

%Cpu(s): 96.0 us,  3.0 sy,  0.0 ni,  1.0 id,  0.0 wa

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM COMMAND
4321 user      20   0   50000   8000   2000 R 399.0  0.1 stress-ng-cpu

Interpretation:

- CPU user time is very high.
- Idle time is almost zero.
- stress-ng is using about four CPU cores.
- I/O wait is zero, so this is not a disk bottleneck.

Check with vmstat

vmstat 1

Example output:

r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
5  0      0 800000  20000 500000    0    0     0     1 3000  6000 95  4  1  0  0

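Interpretation:

- r is 5, so several processes are competing for CPU time.
- us is 95% and id is 1%, so the CPU is saturated by user-space work.
- wa, si, and so are 0, so neither disk nor swap is involved.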

Possible Fixes

For example, run a non-urgent command at a lower CPU priority:

nice -n 10 command

Scenario 2: Simulate Memory Pressure

Create memory pressure and observe it with free, top, and vmstat.

Simulate the Bottleneck

Run:

stress-ng --vm 2 --vm-bytes 70% --timeout 60s

This starts two workers that allocate and repeatedly touch memory, sized as a percentage of available memory.

Check with free

free -h

Example output:

               total        used        free      shared  buff/cache   available
Mem:            8.0G        6.9G        250M        120M        850M        600M
Swap:           2.0G        100M        1.9G

Interpretation:

- Used memory is high.
- Available memory is low.
- Swap has started to be used.
- The system is under memory pressure.

Check with vmstat

vmstat 1

Example output:

r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
2  1 200000 100000  12000 200000  100  300  800  1500 2500 7000 30 15 45 10  0

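Interpretation:

- swpd is non-zero and both si and so are active, so pages are moving between RAM and swap.
- free memory is low.
- b is 1 and wa is 10%, so some processes are already waiting on disk.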

Possible Fixes

Scenario 3: Simulate Swap Thrashing

Show how heavy swap activity can slow a system.

Simulate Carefully

Use a stronger memory test only on a lab system:

stress-ng --vm 4 --vm-bytes 90% --timeout 60s

Check with vmstat

vmstat 1

Example output:

r  b   swpd    free   buff  cache    si    so     bi     bo   in    cs us sy id wa st
3  6 1500000  50000  8000  90000  5000  7000  12000  18000 5000 15000 15 20 20 45  0

Interpretation:

- swpd is high.
- si and so are very high.
- b is high, meaning blocked processes.
- wa is high, meaning the CPU waits on disk.
- This is swap thrashing.

The system may feel frozen because it is constantly moving memory pages between RAM and disk.

Possible Fixes

Scenario 4: Simulate Disk I/O Bottleneck

Create heavy disk writes and verify them with iostat, iotop, and vmstat.

Simulate the Bottleneck

Install tools:

sudo apt install fio sysstat iotop

Run a safe file-based write test:

mkdir -p ~/perf-lab

fio --name=write-test \
    --directory="$HOME/perf-lab" \
    --size=1G \
    --rw=write \
    --bs=1M \
    --direct=1 \
    --runtime=60 \
    --time_based

Check with iostat

iostat -xz 1

Example output:

Device            r/s     w/s     rkB/s     wkB/s   await  aqu-sz  %util
sda              0.00  350.00      0.00  350000.0   32.50   10.20  99.60

Interpretation:

- w/s and wkB/s are high.
- await is elevated.
- aqu-sz shows queueing.
- %util is close to 100%.
- The disk is saturated by writes.
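
The %util check can be automated with a small filter over iostat output. The sketch below is one possible approach: busy_devices is a hypothetical name, the default limit of 90 is an arbitrary assumption, and the %util column is located by its header because its position differs between sysstat versions.

```shell
# Print devices whose %util is at or above a limit (default 90).
# Reads `iostat -x` output on stdin.
busy_devices() {
    awk -v limit="${1:-90}" '
        $1 ~ /^avg-cpu/ { col = 0; next }    # CPU section: ignore its values
        $1 ~ /^Device/ {                      # device header: find %util
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%util") col = i
            next
        }
        col && NF >= col && $col + 0 >= limit { print $1, $col }
    '
}
```

It could be used as iostat -xz 1 5 | busy_devices 90 while a workload is running.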

Check with iotop

sudo iotop -o

Example output:

Total DISK WRITE: 340.00 M/s
TID  PRIO USER DISK READ DISK WRITE IO> COMMAND
5221 be/4 user 0.00 B/s  338.00 M/s 92% fio --name=write-test

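Interpretation:

- fio is writing 338 M/s, which accounts for almost all of the 340 M/s total.
- fio spends 92% of its time waiting on I/O.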

Possible Fixes

For example, run a backup in the idle I/O scheduling class so that it yields to other disk activity:

ionice -c3 backup-command

Scenario 5: Simulate High Disk Space Usage

Create a nearly full filesystem in a safe test directory and diagnose it.

Simulate the Problem

Create a large test file:

mkdir -p ~/perf-lab
fallocate -l 1G ~/perf-lab/bigfile.img

Check disk usage:

du -sh ~/perf-lab

Example output:

1.1G    /home/user/perf-lab

Check filesystem space:

df -h ~

Example output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       20G   18G  2.0G  90% /

Interpretation:

- The filesystem is 90% full.
- The test directory contributes about 1.1 GB.
- If this were production, the system could soon fail writes or logs.

Find Large Directories

du -h --max-depth=1 ~ | sort -h

Example output:

100M    /home/user/Documents
500M    /home/user/Downloads
1.1G    /home/user/perf-lab
2.0G    /home/user

Interpretation:

perf-lab is one of the largest directories under the home directory.

Clean Up

rm -rf ~/perf-lab

Scenario 6: Simulate High Load Average from CPU

Understand load average when CPU is the bottleneck.

Simulate

stress-ng --cpu 4 --timeout 120s

Check Load

uptime

Example output:

15:30:00 up 2 days,  1 user,  load average: 4.20, 2.10, 1.00

Check CPU count:

nproc

Example output:

4

Interpretation:

- The 1-minute load is about 4.20.
- The system has 4 CPUs.
- This indicates the CPU is near full utilization.

Confirm with top:

High us, low id, low wa = CPU-bound load.

Scenario 7: Simulate High Load Average from Disk Wait

Show that high load can come from I/O wait, not just CPU work.

Simulate

Run a disk-heavy workload:

mkdir -p ~/perf-lab

fio --name=randwrite-test \
    --directory="$HOME/perf-lab" \
    --size=1G \
    --rw=randwrite \
    --bs=4k \
    --numjobs=4 \
    --iodepth=32 \
    --direct=1 \
    --runtime=60 \
    --time_based

Check

uptime
vmstat 1
iostat -xz 1

Example vmstat output:

r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa st
1  8      0 500000  20000 700000    0    0     0 75000 3000 9000  5  8 15 72  0

Example iostat output:

Device            r/s     w/s    rkB/s    wkB/s   await  aqu-sz  %util
sda              0.00  5200.00   0.00  20800.0   48.30   25.60  99.90

Interpretation:

- b is high, meaning blocked processes.
- wa is high, meaning the CPU is waiting for I/O.
- Disk %util is near 100%.
- This high load is caused by disk I/O wait, not CPU computation.

Scenario 8: Identify a Memory-Heavy Process

Find which process is consuming RAM.

Simulate

Start a memory workload:

stress-ng --vm 1 --vm-bytes 1G --timeout 120s

Check with ps

ps aux --sort=-%mem | head -n 10

Example output:

USER       PID %CPU %MEM    VSZ     RSS COMMAND
user      7001 80.0 12.5 1200000 1024000 stress-ng-vm
mysql     5678 10.0  8.0 2500000  650000 mysqld

Interpretation:

- stress-ng-vm is using the most physical RAM.
- RSS is about 1 GB.
- This process is responsible for memory pressure.

Check Specific Process

ps -o pid,%mem,rss,vsz,cmd -p 7001

Example:

PID  %MEM     RSS     VSZ CMD
7001 12.5 1024000 1200000 stress-ng-vm

Scenario 9: Simulate a Zombie Process

Understand zombie processes and how to identify them.

A zombie process has finished running but still has an entry in the process table because its parent has not collected its exit status.

Simulate with a Small Script

Create a file:

cat > /tmp/make-zombie.py <<'EOF'
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)
else:
    time.sleep(60)
EOF

Run it:

python3 /tmp/make-zombie.py

In another terminal:

ps -eo pid,ppid,state,cmd | awk '$3 == "Z"'

Example output:

8123  8122 Z [python3] <defunct>

Interpretation:

- State Z means zombie.
- The child process exited.
- The parent process has not collected it yet.
- A few short-lived zombies are usually harmless.
- Many zombies may indicate a broken parent process.

Fix

The usual fix is to repair or restart the parent process so that it collects the exit status of its children.

In this simulation, wait 60 seconds or stop the parent script.
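
Since the fix targets the parent, it helps to know which parent owns the zombies. The sketch below counts zombies per parent PID from ps output; zombies_by_parent is a hypothetical name, and the input is assumed to have the PID, PPID, and state in the first three columns.

```shell
# Count zombie processes per parent PID.
# Reads `ps -eo pid,ppid,state,comm` output on stdin and prints "PPID COUNT".
zombies_by_parent() {
    awk '
        $3 == "Z" { count[$2]++ }
        END { for (ppid in count) print ppid, count[ppid] }
    '
}
```

It could be run as ps -eo pid,ppid,state,comm | zombies_by_parent. A parent with a steadily growing count is the one to investigate.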

Scenario 10: Script an Alert for Disk Usage Above 80%

Create a simple script that warns when a filesystem is too full and lists the largest directories.

Script

cat > ~/check-disk-usage.sh <<'EOF'
#!/bin/bash

THRESHOLD=80
TARGET="/"

USAGE=$(df -P "$TARGET" | awk 'NR==2 {gsub("%","",$5); print $5}')

if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "WARNING: $TARGET is ${USAGE}% full"
    echo
    echo "Top directories under $TARGET:"
    sudo du -xhd1 "$TARGET" 2>/dev/null | sort -h | tail -n 5
else
    echo "OK: $TARGET is ${USAGE}% full"
fi
EOF

chmod +x ~/check-disk-usage.sh

Run:

~/check-disk-usage.sh

Example output:

WARNING: / is 87% full

Top directories under /:
1.2G    /opt
2.5G    /home
4.0G    /var
8.0G    /usr
18G     /

Interpretation:

- The root filesystem is above the threshold.
- The largest top-level directories are listed.
- Investigate /var, /usr, or /home depending on what is unexpectedly large.
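
The same threshold check can be applied to every mounted filesystem rather than only /. The sketch below filters df -P output; full_filesystems is a hypothetical name, and it assumes mount points without spaces so that the mount point is the sixth field.

```shell
# Print mount points whose usage is at or above a limit (default 80).
# Reads `df -P` output on stdin; -P keeps each filesystem on one line.
full_filesystems() {
    awk -v limit="${1:-80}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)            # "90%" becomes "90"
            if (use + 0 >= limit) print $6, use "%"
        }
    '
}
```

It could be run as df -P | full_filesystems 80.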

Scenario 11: Gather Hourly Performance Logs

Collect simple performance statistics over time.

Create a Script

cat > ~/perf-snapshot.sh <<'EOF'
#!/bin/bash

LOG="$HOME/perf-history.log"

{
    echo "===== $(date) ====="
    echo "--- uptime ---"
    uptime
    echo "--- memory ---"
    free -h
    echo "--- disk space ---"
    df -h /
    echo "--- top CPU processes ---"
    ps aux --sort=-%cpu | head -n 6
    echo "--- top memory processes ---"
    ps aux --sort=-%mem | head -n 6
    echo
} >> "$LOG"
EOF

chmod +x ~/perf-snapshot.sh

Run manually:

~/perf-snapshot.sh

Add to cron:

crontab -e

Add:

0 * * * * /home/user/perf-snapshot.sh

Interpretation:

- The script records a basic hourly snapshot.
- After several days, compare timestamps to identify peak usage times.
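
Comparing timestamps by hand gets tedious once the log grows. As a rough sketch, the function below scans the log written by perf-snapshot.sh and reports the snapshot with the highest 1-minute load. The name peak_load_snapshot is hypothetical, and the parsing assumes uptime prints "load average:" with a decimal point, as in the examples in this article.

```shell
# Print the snapshot header with the highest 1-minute load average.
# Reads the perf-history.log format on stdin.
peak_load_snapshot() {
    awk '
        /^===== / { stamp = $0 }                    # remember the current header
        /load average:/ {
            load = $0
            sub(/.*load average: */, "", load)      # keep "0.42, 0.35, 0.30"
            sub(/,.*/, "", load)                    # keep the 1-minute value
            if (load + 0 > max) { max = load + 0; when = stamp }
        }
        END { if (when != "") printf "%s peak 1-minute load %.2f\n", when, max }
    '
}
```

It could be run as peak_load_snapshot < ~/perf-history.log.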

Performance Troubleshooting Decision Guide

Use this guide to interpret common patterns.

Pattern: High CPU, Low I/O Wait

Example:

top: us = 95%, id = 1%, wa = 0%

Likely cause:

CPU-bound workload

Check:

ps aux --sort=-%cpu | head

Pattern: High I/O Wait

Example:

top: wa = 70%
vmstat: b is high
iostat: %util is 99%

Likely cause:

disk I/O bottleneck

Check:

iostat -xz 1
sudo iotop -o

Pattern: Low Available Memory and Swap Activity

Example:

free: available memory is low
vmstat: si and so are high

Likely cause:

memory pressure or memory leak

Check:

ps aux --sort=-%mem | head

Pattern: High Load but CPU Idle

Example:

uptime: load average high
top: CPU mostly idle
vmstat: b high, wa high

Likely cause:

processes blocked on I/O

Check:

vmstat 1
iostat -xz 1
ps -eo pid,stat,cmd | awk '$2 ~ /D/ {print}'

Pattern: Disk Almost Full

Example:

df -h: Use% above 90%

Likely cause:

logs, cache, backups, database files, or user data consuming space

Check:

sudo du -xhd1 / | sort -h
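
The CPU, I/O wait, and swap patterns in this guide can be sketched as a filter over vmstat output. The thresholds below (any swap traffic, wa of 30 or more, us plus sy of 90 or more) are illustrative assumptions, not established limits, and classify_vmstat is a hypothetical name. The sketch assumes the standard vmstat column order.

```shell
# Label each vmstat sample with the pattern it most resembles.
# Fields used: si=$7 so=$8 us=$13 sy=$14 wa=$16
classify_vmstat() {
    awk '
        $1 !~ /^[0-9]+$/ { next }                          # skip header lines
        $7 + $8 > 0      { print "memory pressure"; next }
        $16 >= 30        { print "disk I/O bottleneck"; next }
        $13 + $14 >= 90  { print "CPU-bound workload"; next }
                         { print "no obvious bottleneck" }
    '
}
```

It could be run as vmstat 1 5 | classify_vmstat. Swap is checked first because heavy swapping also raises wa, which would otherwise be mistaken for an ordinary disk bottleneck.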

Useful Command Summary

General:

uptime
top
htop
vmstat 1
free -h

CPU:

ps aux --sort=-%cpu | head
top -p PID

Memory:

free -h
ps aux --sort=-%mem | head
ps -o pid,%mem,rss,vsz,cmd -p PID

Disk space:

df -h
du -h --max-depth=1 DIRECTORY | sort -h

Disk I/O:

iostat -xz 1
sudo iotop -o
vmstat 1

Process states:

ps -eo pid,ppid,state,cmd
ps -eo pid,stat,cmd | awk '$2 ~ /D/ {print}'

Stress testing in labs:

stress-ng --cpu 4 --timeout 60s
stress-ng --vm 2 --vm-bytes 70% --timeout 60s
fio --name=write-test --directory="$HOME/perf-lab" --size=1G --rw=write --bs=1M --direct=1 --runtime=60 --time_based

Safe Lab Rules

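Before simulating bottlenecks, use a lab or test machine rather than a production system, keep the --timeout and --runtime limits in place, and write test files only under a dedicated directory such as ~/perf-lab.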

Clean up test data:

rm -rf ~/perf-lab

Practical Challenges

  1. Run top during normal system use. Identify the top CPU-consuming process and explain whether its usage is expected.
  2. Run htop and sort by memory usage. Compare the top memory process with the output of ps aux --sort=-%mem.
  3. Use free -h to record total, used, free, buff/cache, available, and swap usage. Explain why available is more useful than free.
  4. Run vmstat 1 during normal use and during a CPU stress test. Compare r, us, sy, id, and wa.
  5. Simulate memory pressure with stress-ng and observe free -h and vmstat 1.
  6. Simulate disk I/O pressure with fio and observe iostat -xz 1 and iotop.
  7. Use df -h and du to identify the largest directories on a test filesystem.
  8. Create a disk usage alert script that warns when / is above 80% usage.
  9. Create a cron job that records uptime, memory, disk space, and top processes every hour.
  10. Write a short performance report for one simulated bottleneck. Include the command used to simulate it, tool output, interpretation, and recommended fix.