Running with iodepth reduces the performance by 10x or more
grandsuri opened this issue
Running elbencho with iodepth=1 on our stack gives good performance, as expected. Running with iodepth=2 or more (in aio mode) reduces the read/write performance by 10x or more. We ran fio with the same iodepth values and got almost the same performance as without aio.
Is this a known issue with elbencho? Or should we pass additional flags to improve performance?
Hi @grandsuri,
can you please share an example of a fio and elbencho command line where you see this difference, so I can check that they are equivalent or whether I see the same difference?
In general, elbencho's iodepth implementation for async IO is based on libaio (similar to `fio --ioengine=libaio`). The exact effect depends on the filesystem, but typically libaio is only effective with direct IO, so `elbencho --direct` and `fio --direct=1`.
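For reference, a pair of roughly equivalent async direct-IO command lines might look like this (the path, size, and job name here are placeholders for illustration, not taken from your report):

```shell
# Hypothetical comparison: the same 4 KiB sequential direct write at queue
# depth 4, once via elbencho and once via fio. Path and size are placeholders.
elbencho -w -b 4k -s 1g -t 1 --iodepth 4 --direct /mnt/test/file

fio --name=qd4write --rw=write --bs=4k --size=1g --numjobs=1 \
    --iodepth=4 --ioengine=libaio --direct=1 --filename=/mnt/test/file
```

If the two still diverge with matching block size, queue depth, and direct IO, that would point at something stack-specific rather than at elbencho's libaio path.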
For spinning disks, depending on the IO scheduler settings, the problem with async IO is that it can turn sequential IO into non-sequential IO, so the spinning disks start seeking and get slow. With an SSD, this problem wouldn't exist.
As an example, here are 4 tests, which I ran on the internal SSD of a Linux host:
1) Sequential write of a single 1GB file in 4KB blocks using a single thread.
(Elbencho generates incompressible data by default, which means a bit of extra work for the CPU. Here I'm using the extra parameter `--blockvarpct 0` to disable this.)
```
$ elbencho -w -b 4k -s 1g --direct -t 1 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :      10691      10691
          IOPS             :      24518      24518
          Throughput MiB/s :         95         95
          Total MiB        :       1024       1024
```
...about 24K IOPS as baseline result.
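As a quick sanity check on those numbers (plain arithmetic, not elbencho output): 1 GiB written in 4 KiB blocks is 262144 IOs, and dividing by the elapsed time reproduces the reported values within rounding:

```shell
# Recompute test 1's IOPS and throughput from its inputs.
total_bytes=$((1024 * 1024 * 1024))      # -s 1g
block_size=4096                          # -b 4k
num_ios=$((total_bytes / block_size))    # number of 4 KiB writes
echo "IOs: $num_ios"                     # 262144
# 10691 ms elapsed; awk handles the floating-point division.
awk -v n="$num_ios" 'BEGIN { printf "IOPS: %d\n", n / 10.691 }'   # ~24520
awk 'BEGIN { printf "MiB/s: %d\n", 1024 / 10.691 }'               # ~95
```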
2) Same test as above, but with `--iodepth 4`:
```
$ elbencho -w -b 4k -s 1g --direct -t 1 --iodepth 4 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :       5137       5137
          IOPS             :      51030      51030
          Throughput MiB/s :        199        199
          Total MiB        :       1024       1024
```
...about 50K IOPS, so it's faster on the SSD, as expected.
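Comparing the elapsed times, iodepth 4 gives about a 2x speedup here rather than a full 4x; that's normal, since per-IO latency typically rises as the device queue fills (a quick calculation, not elbencho output):

```shell
# Speedup of test 2 (iodepth 4) over test 1 (iodepth 1), from elapsed ms.
awk 'BEGIN { printf "speedup: %.2fx\n", 10691 / 5137 }'   # ~2.08x
```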
3) For comparison, now with 4 threads instead of `--iodepth 4`:
```
$ elbencho -w -b 4k -s 1g --direct -t 4 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :       5086       5087
          IOPS             :      51529      51527
          Throughput MiB/s :        201        201
          Total MiB        :       1023       1024
```
...again about 50K IOPS, so this result is roughly equivalent to the single thread with iodepth 4, as expected.
4) And to confirm that things also don't get worse for iodepth with multiple threads, now with 4 threads and iodepth 4:
```
$ elbencho -w -b 4k -s 1g --direct -t 4 --iodepth 4 --blockvarpct 0 /tmp/testfile
OPERATION RESULT TYPE        FIRST DONE  LAST DONE
========= ================   ==========  =========
WRITE     Elapsed ms       :       3412       3416
          IOPS             :      76825      76735
          Throughput MiB/s :        300        299
          Total MiB        :       1024       1024
```
...76K IOPS, so the result again gets higher with more parallelism.
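To put the four runs in perspective (again just arithmetic, not elbencho output): test 4 keeps 4 threads x iodepth 4 = 16 IOs in flight and ends up roughly 3x faster than the single-threaded iodepth 1 baseline:

```shell
# Total in-flight IOs of test 4 and its speedup over the test 1 baseline.
echo "in-flight IOs: $((4 * 4))"                                     # 16
awk 'BEGIN { printf "speedup vs baseline: %.2fx\n", 10691 / 3412 }'  # ~3.13x
```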
Hi @grandsuri , I'm closing this due to no reply for several months. Please feel free to re-open if you have anything new to add.