blockio-uring: Perform batches of asynchronous disk IO operations.

Versions 0.1.0.0
Change log CHANGELOG.md
Dependencies base (>=4.16 && <4.22), primitive (>=0.8 && <0.10), vector (>=0.13 && <0.14)
Tested with ghc ==9.2 || ==9.4 || ==9.6 || ==9.8 || ==9.10 || ==9.12
License BSD-3-Clause
Copyright (c) Well-Typed LLP 2022 - 2025
Author Duncan Coutts
Maintainer duncan@well-typed.com, joris@well-typed.com
Category System
Source repo head: git clone https://github.com/well-typed/blockio-uring
this: git clone https://github.com/well-typed/blockio-uring (tag blockio-uring-0.1.0.0)
Uploaded by jdral at 2025-06-09T10:11:34Z

Readme for blockio-uring-0.1.0.0


blockio-uring

This library supports disk I/O operations using the Linux io_uring API. It supports submitting large batches of I/O operations in one go, and submitting batches from multiple Haskell threads concurrently. The I/O blocks only the calling thread, not other Haskell threads. In this style, combining batching and concurrency, it is possible to saturate modern SSDs and thus achieve maximum I/O throughput. This is particularly helpful for performing lots of random reads.

The library only supports recent versions of Linux, because it uses the io_uring kernel API. It only supports disk operations, not socket operations. The library is tested only with Ubuntu (versions 22.04 and 24.04), but other Linux distributions should probably also work out of the box. Let us know if you run into any problems!
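
To give a flavour of this batched style, here is a minimal sketch of submitting one batch of random reads. It is a sketch only: the module System.IO.BlockIO and the names withIOCtx, defaultIOCtxParams, submitIO and IOOpRead, plus the IOOpRead field order, are assumptions based on the description above, so consult the Haddock documentation for the authoritative API.

  -- Sketch only: names, types and argument orders are assumptions;
  -- see the Haddocks of System.IO.BlockIO for the real API.
  import qualified Data.Vector as V
  import           Data.Primitive.ByteArray (newPinnedByteArray)
  import           System.IO.BlockIO (IOOp (..), defaultIOCtxParams,
                                      submitIO, withIOCtx)
  import           System.Posix.IO (OpenMode (ReadOnly), closeFd,
                                    defaultFileFlags, openFd)

  main :: IO ()
  main =
    withIOCtx defaultIOCtxParams $ \ctx -> do
      fd <- openFd "datafile" ReadOnly defaultFileFlags  -- unix >= 2.8 signature
      -- One pinned 4k buffer per read; the kernel fills these directly.
      bufs <- V.replicateM 32 (newPinnedByteArray 4096)
      -- A batch of 32 reads of 4096 bytes at 8k-spaced offsets; assumed
      -- fields: file, file offset, buffer, buffer offset, byte count.
      let ops = V.imap (\i buf -> IOOpRead fd (fromIntegral (i * 8192)) buf 0 4096) bufs
      -- Submit the whole batch in one go: only this Haskell thread blocks
      -- until the batch completes, other Haskell threads keep running.
      _results <- submitIO ctx ops
      closeFd fd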

Installation

  1. Install liburing version 2.1 or higher. Use your package manager of choice, or clone a recent version of https://github.com/axboe/liburing and install using make install (example commands below).
  2. Invoke cabal build or cabal run.
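
For example, liburing can be built and installed from source like so (liburing ships a configure script; exact steps may vary between versions):

❯ git clone https://github.com/axboe/liburing
❯ cd liburing
❯ ./configure && make && sudo make install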

Benchmarks

We can compare the I/O performance that can be achieved using the library against the best case of what the system can do. The most interesting comparison is for performing random 4k reads, and measuring the IOPS -- the number of I/O operations per second.

Baseline using fio

We can use the fio (flexible I/O tester) tool to give us a baseline for the best that the system can manage. While manufacturers often claim that SSDs can hit certain speeds or IOPS for specific workloads (like random reads at queue depth 32), fio gives a more realistic number that includes all the overheads that are present in practice, such as those induced by file systems, encrypted block devices, etc. This makes fio a sensible baseline.

The repo contains an fio configuration file for a random read benchmark using io_uring, which we can use like so:

❯ fio ./benchmark/randread.fio

This will produce a page full of output, like so:

  read: IOPS=234k, BW=915MiB/s (960MB/s)(1024MiB/1119msec)
    slat (usec): min=41, max=398, avg=70.98, stdev=15.76
    clat (usec): min=59, max=4611, avg=342.75, stdev=199.02
     lat (usec): min=121, max=4863, avg=413.75, stdev=200.25
    clat percentiles (usec):
     |  1.00th=[   86],  5.00th=[  102], 10.00th=[  120], 20.00th=[  174],
     | 30.00th=[  262], 40.00th=[  289], 50.00th=[  318], 60.00th=[  343],
     | 70.00th=[  371], 80.00th=[  490], 90.00th=[  586], 95.00th=[  652],
     | 99.00th=[  922], 99.50th=[ 1090], 99.90th=[ 1598], 99.95th=[ 2180],
     | 99.99th=[ 4015]
   bw (  KiB/s): min=938280, max=946296, per=100.00%, avg=942288.00, stdev=5668.17, samples=2
   iops        : min=234570, max=236574, avg=235572.00, stdev=1417.04, samples=2
  lat (usec)   : 100=4.32%, 250=21.96%, 500=54.23%, 750=16.10%, 1000=2.74%
  lat (msec)   : 2=0.60%, 4=0.05%, 10=0.01%
  cpu          : usr=10.20%, sys=64.67%, ctx=66125, majf=0, minf=138
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=100.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

The headline number to focus on is this bit:

  read: IOPS=234k

Caching

By default, the fio configuration file declares that the random read benchmark should use direct I/O, which bypasses any caching of file contents by the operating system. This means that fio measures the raw performance of the disk hardware. If you change the configuration file to disable direct I/O, then the reported statistics will be different because of caching. In that case, it is possible to drop the caches manually before running the benchmark to reduce the effects of caching by invoking:

❯ sudo sysctl -w vm.drop_caches=1

Haskell benchmarks

The bench executable expects three arguments: (1) a benchmark name, either low or high, (2) a caching mode, either Cache or NoCache, and (3) a filepath to a data file.

  • The low benchmark targets low-level, internal code of the library. The high benchmark uses the public API that is exposed from blockio-uring.

  • NoCache enables direct I/O, while Cache disables it.

  • The data file is the file from which bytes are read during the benchmark.

For a fair comparison with fio we use the same file that fio generated for its own benchmark, ./benchmark/benchfile.0.0. For example, we can invoke the high-level benchmark with direct I/O as follows:

❯ cabal run bench -- high NoCache ./benchmark/benchfile.0.0

This will report some statistics:

High-level API benchmark
File caching:    False
Capabilities:    1
Threads     :    4
Total I/O ops:   262016
Elapsed time:    1.216050853s
IOPS:            215465
Allocated total: 24883952
Allocated per:   95

Though the output prints a number of useful metrics, the headline number to focus on here is this bit:

IOPS:            215465

Comparing results

For comparisons, we are primarily interested in IOPS.

On a maintainer's laptop (@jdral), with direct I/O enabled and using a single OS thread, the numbers in question are:

  • fio: 234k IOPS
  • High-level Haskell benchmark: 215k IOPS
  • Low-level Haskell benchmark: 208k IOPS

So as a rough conclusion, we can get about 92% of the maximum IOPS using the high-level Haskell API, or about 89% when using the low-level internals. On some other machines, it was observed that the high-level benchmark could even match the IOPS reported by fio. It might seem unintuitive that the high-level benchmark gets higher numbers than the low-level benchmark. However, the high-level benchmark uses multiple lightweight Haskell threads, which allows it to have multiple I/O batches in flight at once even when using one OS thread, while the low-level benchmark only submits I/O batches serially. The sketch below illustrates this multi-batch pattern.
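
To illustrate, here is a sketch of several lightweight threads each submitting their own batch. It reuses the assumed submitIO API from the earlier sketch, and concurrentBatches is a hypothetical helper for illustration, not part of the library:

  import           Control.Concurrent (forkIO)
  import           Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
  import           Control.Monad (forM, forM_)
  import qualified Data.Vector as V
  import           System.IO.BlockIO (IOCtx, IOOp, submitIO)  -- assumed exports

  -- Hypothetical helper: submit several batches from lightweight threads.
  -- Each submitIO call blocks only its own Haskell thread, so all the
  -- batches are in flight in the kernel at once, even with one OS thread.
  concurrentBatches :: IOCtx -> [V.Vector (IOOp IO)] -> IO ()
  concurrentBatches ctx batches = do
    dones <- forM batches $ \ops -> do
      done <- newEmptyMVar
      _ <- forkIO (submitIO ctx ops >> putMVar done ())
      pure done
    forM_ dones takeMVar  -- wait for every batch to complete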

Multi-threading

Note that much higher IOPS can be achieved by running the benchmarks with more than one OS thread. This allows more I/O operations to be in flight at once, which better utilises the bandwidth of modern SSDs. For example, if we want to use 4 OS threads for our benchmarks, then:

  • In the fio configuration file we can add numjobs=4 to fork 4 identical jobs that perform I/O concurrently.

  • In the Haskell benchmarks we can configure the GHC run-time system to use 4 OS threads (capabilities) by passing +RTS -N4 -RTS to the benchmark executable, as shown below.
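
For example, to run the high-level Haskell benchmark with 4 capabilities (this assumes the benchmark executable is built with rtsopts enabled, so that RTS options can be passed on the command line):

❯ cabal run bench -- high NoCache ./benchmark/benchfile.0.0 +RTS -N4 -RTS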