Why Does diamond blastp Stall on HDD? A Deep Dive into Random I/O

A conversation that started with “why is parallel never using all my CPU cores” and ended up teaching me more about hard disks than I expected.

The Setup

I have about 1,000 Bakta jobs for annotating MAGs (metagenome-assembled genomes). Each job runs on 8 threads, and I launch them like this:

parallel --progress --jobs 4 :::: bakta.sh

The machine has 32 threads. So 4 jobs × 8 threads = 32 threads, perfectly occupied — at least in theory.

In practice? parallel almost never saturates the CPU. The jobs drag on, and when I look at what’s happening, almost every task is stuck at the diamond blastp step. CPU usage near zero. Nothing moving.

So what’s going on?

The First Clue: `iowait`

Running top or iostat -x 1 tells the story immediately. Look at the iowait figure (wa in top’s CPU summary line, %iowait in iostat): it’s high. The CPU isn’t busy computing. It’s waiting. Waiting for the disk.
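
If you’d rather get a single number than eyeball top, the same figure can be derived from two samples of the aggregate cpu line in /proc/stat. This is a minimal sketch for Linux; the iowait_pct helper and the --live switch are names I made up for illustration.

```bash
#!/usr/bin/env bash
# iowait_pct BEFORE AFTER: percentage of CPU time spent in iowait between
# two samples of the aggregate "cpu" line from /proc/stat.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user .. steal
            t[NR] = total
            w[NR] = $6                              # iowait column
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / dt
        }'
}

# Live use: run the script with --live to sample twice, five seconds apart.
if [ -r /proc/stat ] && [ "${1:-}" = "--live" ]; then
    before=$(head -n 1 /proc/stat)
    sleep 5
    after=$(head -n 1 /proc/stat)
    echo "iowait over 5 s: $(iowait_pct "$before" "$after")%"
fi
```

A value that stays in the tens of percent while the jobs are running points at storage rather than at the CPU.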

The diamond database (psc.dmnd, 43 GB) lives on a mechanical hard disk. Every blastp search has to read from it. And that’s where things fall apart.

The figure below shows the root cause: 4 concurrent diamond tasks all competing for a single HDD, which can only serve ~207 random reads per second total. CPU utilization collapses to ~15% because the processes spend almost all their time blocked on I/O.

Figure 1. Four diamond tasks share a single mechanical HDD. The disk is saturated at ~95% utilisation while the CPU idles at ~15%, almost entirely blocked on iowait.

Sequential vs Random I/O: The Core of the Problem

Here’s the thing that confused me at first. When I run rsync to copy data from the same HDD, I get ~240 MB/s. So the disk isn’t slow, right?

Wrong. Or rather — it depends on what “slow” means.

There are two completely different ways to read data from a hard disk, and they have wildly different performance characteristics.

Sequential read (what rsync does): data is stored contiguously on disk. The read head positions itself at the start of the file and sweeps forward in one continuous arc, like a needle playing a vinyl record. It barely needs to move between reads. Result: 150–250 MB/s sustained.

Random read (what diamond does): the query jumps to scattered offsets across all 43 GB of psc.dmnd. Every jump requires the read head to physically move — a seek. On a mechanical HDD, seeking takes 5–15 ms. That’s not a software limit. That’s a spinning platter and a physical arm moving through space.

The figure below shows the difference visually:

Figure 2. Sequential reads let the head sweep continuously along a track (~200 MB/s). Random reads force the head to physically jump between scattered disk locations, capped at ~207 random reads per second on this drive no matter how high its sequential throughput is.

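Think about what that means arithmetically: 1000 ms ÷ 10 ms per seek = 100 seeks per second when requests are served one at a time. With several requests queued, the drive can reorder them to shorten head travel and squeeze out somewhat more (the benchmark below measures ~207 with 4 jobs), but each individual request then waits longer. Either way the ceiling is a few hundred reads per second, and every job hitting that disk has to share it.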
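
The same arithmetic as a tiny helper, in case you want to plug in your own drive’s numbers. It is just requests in flight divided by latency; max_iops is a name I picked, and the second call uses the latency and job count from the fio run later in this post.

```bash
# max_iops LATENCY_MS [OUTSTANDING]: requests per second a device completes
# when each request takes LATENCY_MS and OUTSTANDING requests are in flight.
max_iops() {
    awk -v lat="$1" -v q="${2:-1}" 'BEGIN { printf "%.0f\n", q * 1000 / lat }'
}

max_iops 10        # one request at a time, 10 ms per seek   -> 100
max_iops 19.3 4    # four requests in flight, 19.3 ms each   -> 207
```

Note what the second line says: more concurrency raised the request rate only by stretching each request’s latency to almost 20 ms.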

Benchmarking It: `fio` Tells the Truth

I used fio to measure both access patterns on my actual HDD (a LUKS-encrypted 3.6 TB drive mounted at /mnt/store):

# Random read — simulates how diamond reads the database
fio --name=rand-read \
    --filename=/mnt/store/testfile \
    --rw=randread \
    --bs=4k \
    --size=4G \
    --numjobs=4 \
    --runtime=30 \
    --group_reporting \
    --direct=1

HDD results:

IOPS=207, BW=828 KiB/s
Average latency: 19.3 ms

Then I tested my NVMe SSD (the system drive, mounted at /):

fio --name=rand-read \
    --filename=/tmp/testfile \
    --rw=randread \
    --bs=4k \
    --size=4G \
    --numjobs=4 \
    --runtime=30 \
    --group_reporting \
    --direct=1

NVMe results:

IOPS=79,700, BW=311 MiB/s
Average latency: 0.05 ms

The difference: 385× in IOPS, 386× in latency.

Figure 3. fio benchmark results on the actual hardware. HDD delivers 207 IOPS at 19 ms average latency; NVMe delivers 79,700 IOPS at 0.05 ms. The gap is entirely explained by physical seek time.
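
A quick sanity check on the reported numbers, as a sketch: bandwidth should equal IOPS times the 4 KiB block size, and the two ratios should come out near 385. The helper names bw_kib and ratio are mine.

```bash
# bw_kib IOPS BLOCK_KIB: bandwidth in KiB/s implied by an IOPS figure.
bw_kib() { awk -v n="$1" -v b="$2" 'BEGIN { printf "%.0f\n", n * b }'; }

# ratio A B: A divided by B, rounded to the nearest integer.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", a / b }'; }

echo "HDD  bandwidth: $(bw_kib 207 4) KiB/s"      # 828
echo "NVMe bandwidth: $(bw_kib 79700 4) KiB/s"    # 318800, i.e. about 311 MiB/s
echo "IOPS ratio    : $(ratio 79700 207)x"        # 385
echo "Latency ratio : $(ratio 19.3 0.05)x"        # 386
```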

With 4 jobs sharing 207 total IOPS, each job gets about 52 IOPS. That’s nowhere near enough to keep 8 threads per job busy: diamond mostly sits there, waiting for each tiny read to come back before it can continue.
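
To get a feel for what 52 IOPS means in wall-clock time, here is a rough sketch. The read count of 100,000 per job is a made-up illustration, not something I measured from diamond, and the helper names are mine.

```bash
# per_job_iops TOTAL_IOPS JOBS: even share of a device's IOPS per job.
per_job_iops() {
    awk -v t="$1" -v j="$2" 'BEGIN { printf "%.0f\n", t / j }'
}

# read_time_min IOPS READS: minutes needed to complete READS random reads.
read_time_min() {
    awk -v r="$1" -v n="$2" 'BEGIN { printf "%.1f\n", n / r / 60 }'
}

reads=100000   # hypothetical number of random 4K reads issued by one job
echo "HDD : $(per_job_iops 207 4) IOPS per job, $(read_time_min 51.75 "$reads") min of pure waiting"
echo "NVMe: $(per_job_iops 79700 4) IOPS per job, $(read_time_min 19925 "$reads") min of pure waiting"
```

With those assumed inputs the HDD line comes out at about half an hour of waiting per job, against a few seconds on NVMe.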

Why Does diamond Need Random Access At All?

This is the really interesting question. Why can’t diamond just read psc.dmnd sequentially like rsync does?

The answer lies in how diamond’s database is constructed — and what it’s optimized for.

What’s Inside `psc.dmnd`

As far as I understand it, a diamond database is not a flat list of protein sequences. It’s a preprocessed structure built for fast similarity search. In my simplified mental model (not checked against the DIAMOND source), diamond makedb does several things:

1. Chunks sequences into k-mer seeds: each protein sequence is broken into overlapping short fragments (seeds) of length ~5 amino acids.
2. Builds a seed index: a hash table mapping each seed to the list of database sequences that contain it.
3. Stores sequences in a compressed column format: optimized for SIMD vector comparison, not for human-readable sequential access.

The resulting .dmnd file looks roughly like this internally:

[Header / metadata]
[Seed hash table index]   ← sparse, randomly addressed
[Sequence data blocks]    ← addressed via offsets in the index
[Score matrix data]

What Happens During a Query

When you blast a query protein against the database:

1. Diamond extracts seeds from your query protein.
2. For each seed, it looks up the hash table to find candidate matching database sequences: a random access into the index region.
3. For each candidate, it fetches the sequence block from the sequence data region: another random access, to a different part of the file.
4. It runs a vectorized Smith-Waterman-style alignment on the fetched sequences, scored with a substitution matrix such as BLOSUM62.
5. Repeat for thousands of seeds per query protein.

Steps 2 and 3 are the killers. The index offsets for different seeds point to completely different parts of the 43 GB file. There’s no spatial locality — a seed from position 1 in your query protein might point to database offset 2 GB, while the next seed points to offset 38 GB. The disk head has to travel the full distance.

Why Not Just Cache It?

The OS does try to cache recently-read blocks in RAM. But psc.dmnd is 43 GB. Unless you have 43+ GB of RAM free for the page cache, cached blocks keep getting evicted as new ones come in. With 1,000 different MAG jobs each querying different proteins, there is little reuse for the cache to exploit, so most reads end up as cold misses going to the physical disk.
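
Whether the cache has a chance at all is easy to check by comparing the file size with MemAvailable from /proc/meminfo. A minimal sketch for Linux with GNU stat; the fits_in_cache helper and the DB override variable are my own names, and the comparison ignores the memory the four jobs need for themselves.

```bash
# fits_in_cache FILE_BYTES AVAILABLE_BYTES: succeeds when the file is no larger
# than the memory available to hold it.
fits_in_cache() {
    [ "$1" -le "$2" ]
}

db=${DB:-/mnt/store/omics/SDB/bakta/bakta_databases/psc.dmnd}
if [ -r "$db" ] && [ -r /proc/meminfo ]; then
    db_bytes=$(stat -L -c %s "$db")
    # MemAvailable is reported in kB; convert to bytes
    avail_bytes=$(awk '/^MemAvailable:/ { printf "%.0f\n", $2 * 1024 }' /proc/meminfo)
    if fits_in_cache "$db_bytes" "$avail_bytes"; then
        echo "database could fit in the page cache"
    else
        echo "database is larger than available memory"
    fi
fi
```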

A Computer Science Perspective: The Fundamental Mismatch

What we’re seeing is a classic storage hierarchy mismatch. Computer scientists think about memory and storage in terms of a hierarchy:

Registers         ~0.3 ns      (tiny, ultrafast)
L1 Cache          ~1 ns
L2 Cache          ~4 ns
L3 Cache          ~10 ns
DRAM              ~100 ns
NVMe SSD          ~50,000 ns   (0.05 ms)
SATA SSD          ~100,000 ns  (0.1 ms)
HDD (sequential)  ~5,000,000 ns (5 ms — mostly rotational latency)
HDD (random)      ~10,000,000 ns (10 ms — seek + rotational latency)

The steps are uneven: neighboring cache levels differ by a few ×, while DRAM to NVMe is roughly 500× and SATA SSD to random HDD access is roughly 100×. Random HDD access sits at the very bottom.
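
The ratios between neighboring rows can be computed straight from the table. A small sketch; the tier_ratios helper and the hyphenated row names are mine, the latencies are the ones listed above.

```bash
# tier_ratios: reads "NAME LATENCY_NS" lines on stdin and prints how much
# slower each tier is than the one listed before it.
tier_ratios() {
    awk '
        NR > 1 { printf "%s -> %s: %.1fx\n", prev_name, $1, $2 / prev_ns }
        { prev_name = $1; prev_ns = $2 }
    '
}

tier_ratios <<'EOF'
Registers 0.3
L1 1
L2 4
L3 10
DRAM 100
NVMe 50000
SATA-SSD 100000
HDD-sequential 5000000
HDD-random 10000000
EOF
```

The biggest single jump in the output is DRAM to NVMe at 500×, followed by SATA SSD to HDD at 50×.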

diamond is built assuming its database will live on a device with reasonable random I/O — originally, when psc.dmnd was smaller (maybe a few GB), even an HDD could keep up. But as protein databases have grown into the tens of gigabytes, the working set has outgrown what HDDs can serve with their seek-limited random I/O.

The Algorithm Is Right — The Storage Is Wrong

It’s worth emphasizing: diamond’s access pattern isn’t a bug. Hash-indexed seed lookup is the correct algorithm for fast sequence similarity search. The alternative — scanning the entire 43 GB database sequentially for every query — would be far worse. The algorithm assumes you have fast random access. SSDs and NVMe drives deliver that. HDDs don’t.

This is the same reason databases (PostgreSQL, MySQL, etc.) run much better on SSDs: they also use B-tree indexes that require random reads. It’s the same reason Redis keeps everything in RAM. The data structure is designed for a certain access latency, and putting it on slower storage breaks the design assumptions.

Does LUKS Encryption Make It Worse?

I wondered about this since my HDD is LUKS-encrypted (cryptstore). The answer is: barely, if at all.

LUKS decryption happens in the CPU, using AES-NI hardware acceleration. On modern CPUs that typically runs at several GB/s (cryptsetup benchmark reports the figure for your machine), far faster than the HDD can supply data. The bottleneck is the seek time, not the decryption. At ~207 IOPS of 4 KiB each, less than 1 MB/s is actually being decrypted, so the added cost should be negligible.

You can verify this by testing the raw block device directly:

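# Read-only random-read test on the raw device, below the LUKS layer.
# --readonly makes fio refuse to start any write workload against /dev/sda.
sudo fio --name=raw-rand \
    --filename=/dev/sda \
    --readonly \
    --rw=randread \
    --bs=4k \
    --size=4G \
    --numjobs=4 \
    --runtime=30 \
    --group_reporting \
    --direct=1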

If the IOPS on /dev/sda matches what you get through /mnt/store, encryption is not your problem.
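
To put a number on “matches”, the two IOPS figures can be turned into a percentage. A sketch only: overhead_pct is my own helper, and the raw-device figure of 210 below is a placeholder, not a measurement.

```bash
# overhead_pct RAW_IOPS ENCRYPTED_IOPS: percentage of IOPS lost when going
# through the encrypted path instead of the raw device.
overhead_pct() {
    awk -v raw="$1" -v enc="$2" 'BEGIN { printf "%.1f\n", 100 * (raw - enc) / raw }'
}

# 210 is a hypothetical raw-device result; 207 is the measured /mnt/store figure.
echo "LUKS overhead: $(overhead_pct 210 207)%"
```

A result within a few percent is inside the run-to-run noise of a 30-second fio test on a spinning disk.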

The Fix: Move `psc.dmnd` to NVMe

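My system has a 3.7 TB NVMe SSD as the root drive, with /tmp on it. The fix is simple in principle. Two cautions: ln -sf replaces its destination, so the originals on the HDD are renamed first rather than overwritten, and /tmp is cleared at boot or by a periodic cleaner on many distributions, so a persistent directory on the NVMe is the safer home if the copy should survive a reboot.

# Check available space first
df -h /tmp /

# Copy the two big files (82 GB total)
cp /mnt/store/omics/SDB/bakta/bakta_databases/psc.dmnd /tmp/
cp /mnt/store/omics/SDB/bakta/bakta_databases/bakta.db /tmp/

# Keep the HDD originals under a new name instead of letting ln replace them
mv /mnt/store/omics/SDB/bakta/bakta_databases/psc.dmnd \
    /mnt/store/omics/SDB/bakta/bakta_databases/psc.dmnd.hdd
mv /mnt/store/omics/SDB/bakta/bakta_databases/bakta.db \
    /mnt/store/omics/SDB/bakta/bakta_databases/bakta.db.hdd

# Symlink back so bakta config doesn't need to change
ln -s /tmp/psc.dmnd \
    /mnt/store/omics/SDB/bakta/bakta_databases/psc.dmnd
ln -s /tmp/bakta.db \
    /mnt/store/omics/SDB/bakta/bakta_databases/bakta.db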

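With 79,700 IOPS available on NVMe versus 207 on HDD, each of the 4 parallel jobs now gets ~20,000 IOPS instead of ~52 IOPS: a ~385× larger random-read budget for the step that was the bottleneck. The wall-clock speedup will be smaller than that, because once the reads are fast the job becomes limited by CPU instead.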

Once diamond is no longer stalling, the --jobs 4 setting can potentially be raised (since each job will actually use CPU now), and the 1,000-job queue should complete dramatically faster.

Key Takeaways

1. “Fast disk” means different things for sequential vs random workloads. Your HDD can copy files at 200+ MB/s. It can only do ~200 random 4K reads per second. These are not contradictory — they reflect the physical reality of spinning platters.

2. Bioinformatics databases are random-access data structures. Tools like diamond, HMMER, and BLAST are built on indexed data structures that jump around their database files. They assume fast random I/O. When that assumption breaks, you get near-zero CPU utilization despite the disk “working hard.”

3. iowait is the diagnostic signal. When your CPU shows low utilization but your tasks aren’t finishing, check wa in top or %iowait in iostat. High iowait = your program is spending most of its time waiting for the disk, not computing.

4. The storage hierarchy matters. NVMe → SATA SSD → HDD is not a small step: in my measurements the gap between NVMe and HDD was ~385× in random-read IOPS. For any workload that uses indexed data structures (databases, search indexes, genomics tools), this difference is often the dominant factor in runtime.

5. Encryption is not the culprit here. LUKS/dm-crypt encryption overhead is negligible compared to HDD seek latency. Don’t let it distract you from the real bottleneck.

Diagnostic Cheatsheet

# Check iowait (look for a high wa value in the %Cpu(s) line)
top

# Detailed I/O stats by device (watch %iowait, %util and the await columns)
iostat -x 2

# Find which process is doing I/O
iotop -o

# Test random read IOPS on a path
fio --name=test --filename=/your/path/testfile \
    --rw=randread --bs=4k --size=4G \
    --numjobs=4 --runtime=30 \
    --group_reporting --direct=1 && rm /your/path/testfile

# Check where a file lives
df -h /path/to/file

# Check disk type (ROTA=1 is HDD, ROTA=0 is SSD/NVMe)
# Older util-linux releases only know MOUNTPOINT, not MOUNTPOINTS
lsblk -o NAME,ROTA,SIZE,TYPE,MOUNTPOINTS

Written after debugging a bakta pipeline on a 32-thread workstation with mixed HDD + NVMe storage. The diagnosis took longer than the fix.