Running with iodepth reduces the performance by 10x or more
grandsuri opened this issue
Running elbencho with iodepth=1 on our stack gives good performance, as expected. Running with iodepth=2 or more (in aio mode) reduces the read/write performance by 10x or more. We ran fio with the same iodepth values and got almost the same performance as without aio.
Is this a known issue with elbencho? Or should we pass additional flags to improve performance?
Hi @grandsuri,
can you please share an example of a fio and elbencho command line where you see this difference, so I can check that they are equivalent or whether I see the same difference?
In general, elbencho's iodepth implementation for async IO is based on libaio (similar to `fio --ioengine=libaio`). The exact effect depends on the filesystem, but typically libaio is only effective with direct IO, so `elbencho --direct` and `fio --direct=1`.
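For reference, a pair of roughly equivalent async direct-IO command lines might look like this (the path, size, and job name here are placeholders for illustration, not taken from your report):

```shell
# Hypothetical comparison: the same 4 KiB sequential direct write at queue
# depth 4, once via elbencho and once via fio. Path and size are placeholders.
elbencho -w -b 4k -s 1g -t 1 --iodepth 4 --direct /mnt/test/file

fio --name=qd4write --rw=write --bs=4k --size=1g --numjobs=1 \
    --iodepth=4 --ioengine=libaio --direct=1 --filename=/mnt/test/file
```

If the two still diverge with matching block size, queue depth, and direct IO, that would point at something stack-specific rather than at elbencho's libaio path.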
For spinning disks, depending on the IO scheduler settings, the problem with async IO is that it can turn sequential IO into non-sequential IO, so the spinning disks start seeking and get slow. With an SSD, this problem wouldn't exist.
As an example, here are 4 tests, which I ran on the internal SSD of a Linux host:
1) Sequential write of a single 1GB file in 4KB blocks using a single thread.
(Elbencho generates incompressible data by default, which means a bit of extra work for the CPU. Here I'm using the extra parameter `--blockvarpct 0` to disable this.)
```
$ elbencho -w -b 4k -s 1g --direct -t 1 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :      10691      10691
          IOPS             :      24518      24518
          Throughput MiB/s :         95         95
          Total MiB        :       1024       1024
```
...about 24K IOPS as baseline result.
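As a quick sanity check on those numbers (plain arithmetic, not elbencho output): 1 GiB written in 4 KiB blocks is 262144 IOs, and dividing by the elapsed time reproduces the reported values within rounding:

```shell
# Recompute test 1's IOPS and throughput from its inputs.
total_bytes=$((1024 * 1024 * 1024))      # -s 1g
block_size=4096                          # -b 4k
num_ios=$((total_bytes / block_size))    # number of 4 KiB writes
echo "IOs: $num_ios"                     # 262144
# 10691 ms elapsed; awk handles the floating-point division.
awk -v n="$num_ios" 'BEGIN { printf "IOPS: %d\n", n / 10.691 }'   # ~24520
awk 'BEGIN { printf "MiB/s: %d\n", 1024 / 10.691 }'               # ~95
```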
2) Same test as above, but with `--iodepth 4`:
```
$ elbencho -w -b 4k -s 1g --direct -t 1 --iodepth 4 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :       5137       5137
          IOPS             :      51030      51030
          Throughput MiB/s :        199        199
          Total MiB        :       1024       1024
```
...about 50K IOPS, so it's faster on the SSD, as expected.
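Comparing the elapsed times, iodepth 4 gives about a 2x speedup here rather than a full 4x; that's normal, since per-IO latency typically rises as the device queue fills (a quick calculation, not elbencho output):

```shell
# Speedup of test 2 (iodepth 4) over test 1 (iodepth 1), from elapsed ms.
awk 'BEGIN { printf "speedup: %.2fx\n", 10691 / 5137 }'   # ~2.08x
```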
3) For comparison, now with 4 threads instead of `--iodepth 4`:
```
$ elbencho -w -b 4k -s 1g --direct -t 4 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :       5086       5087
          IOPS             :      51529      51527
          Throughput MiB/s :        201        201
          Total MiB        :       1023       1024
```
...again about 50K IOPS, so this result is roughly equivalent to the single thread with iodepth 4, as expected.
4) And to confirm that things also don't get worse for iodepth with multiple threads, now with 4 threads and iodepth 4:
```
$ elbencho -w -b 4k -s 1g --direct -t 4 --iodepth 4 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :       3412       3416
          IOPS             :      76825      76735
          Throughput MiB/s :        300        299
          Total MiB        :       1024       1024
```
...76K IOPS, so the result again gets higher with more parallelism.
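To put the four runs in perspective (again just arithmetic, not elbencho output): test 4 keeps 4 threads x iodepth 4 = 16 IOs in flight and ends up roughly 3x faster than the single-threaded iodepth 1 baseline:

```shell
# Total in-flight IOs of test 4 and its speedup over the test 1 baseline.
echo "in-flight IOs: $((4 * 4))"                                     # 16
awk 'BEGIN { printf "speedup vs baseline: %.2fx\n", 10691 / 3412 }'  # ~3.13x
```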
Hi @grandsuri , I'm closing this due to no reply for several months. Please feel free to re-open if you have anything new to add.