scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra

Home Page: http://scylladb.com


Operations timeout while inserting data into ScyllaDB cluster at very low throughput

amitesh88 opened this issue

I have a 3-node ScyllaDB cluster:
32 CPUs, 64 GB RAM, Scylla version 5.4.3
io_properties.yaml:
read_iops: 36764
read_bandwidth: 769690880
write_iops: 42064
write_bandwidth: 767818944
When the application increased write operations from 1,200 to 10,000 TPS, which is still far less than the claimed write_iops, it started getting the error below:
Error inserting Data : Operation timed out for xxx_xxx.xxx_xxxxx_240512 - received only 1 responses from 2 CL=QUORUM.
On the ScyllaDB nodes, the only relevant log line is:
[shard 8:comp] large_data - Writing large partition xxx_xxx.xxx_xxxxx_240512: xxx (37041816 bytes) to me-3gg2_13mq_3jyhc2r2wxx7hvxxw4-big-Data.db
CPU utilisation on each node is barely 15%, yet the application fails to write.
Note: the RF of system_auth and the other keyspaces is already equal to the number of nodes.
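For reference, a minimal way to verify the replication settings from cqlsh (assuming cqlsh can reach one of the nodes):

cqlsh -e "DESCRIBE KEYSPACE system_auth;"
# the output should show the replication strategy with a replication factor
# equal to the number of nodes in each DC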

Need insights on this
Thanks in advance

Hi @amitesh88 - you can't compare the io_properties IOPS in any way to the CQL ops - ScyllaDB does a whole lot more 'raw' IOPS for every CQL transaction, for example commit log I/O or compaction.
However, I do encourage you to test the disk with fio - it may be that iotune is configuring vastly fewer IOPS than the disks can sustain, and you may be able to raise the numbers somewhat. Unsure if that will solve your issue, but it is worth a try.
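For example, a quick raw-write check might look like the sketch below (the test file path, size, and runtime are assumptions; adapt them to your environment):

fio --name=write_iops_check --filename=/var/lib/scylla/fio_test --direct=1 \
    --rw=randwrite --bs=4k --iodepth=64 --numjobs=8 --size=1G --runtime=60 \
    --time_based --group_reporting --ioengine=libaio

Compare the reported write IOPS and bandwidth against the values iotune wrote into io_properties.yaml.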

[screenshot: GCP VM SSD disk throughput and IOPS specifications]

We are using SSD disks on GCP VMs, which have good I/O throughput; see the image above.

Could writing large partitions be the issue, i.e. is the partition key not distributing the load properly?
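One way to check is Scylla's system.large_partitions table, which is populated when partitions like the one in the log above are flagged; a minimal query (exact columns may vary slightly by version):

cqlsh -e "SELECT keyspace_name, table_name, partition_key, partition_size FROM system.large_partitions;"

If a handful of partition keys dominate the sizes reported there, the key is concentrating the load on a few replicas.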

@amitesh88 - as you can see, the numbers quoted above and the iotune results are vastly different. I'd also compare with fio. If fio is substantially better, I'd change the numbers manually to higher values and try again. See scylladb/seastar#1297 for reference.
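A minimal sketch of raising the values by hand, assuming a package install where the file lives at /etc/scylla.d/io_properties.yaml (back it up first; a restart is needed for the change to take effect):

sudo cp /etc/scylla.d/io_properties.yaml /etc/scylla.d/io_properties.yaml.bak
sudo vi /etc/scylla.d/io_properties.yaml   # raise write_iops / write_bandwidth toward the fio numbers
sudo systemctl restart scylla-server       # restart so the new I/O properties take effect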

Using fio, I am getting the result below:
scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024
write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets


That's a bit low - I expected more. Can you share the full fio command line and results?

Below is the command with output:

fio --filename=/var/lib/scylla/a --direct=1 --rw=randrw --refill_buffers --size=1G --norandommap --randrepeat=0 --ioengine=libaio --bs=5kb --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=scylla_io_2
scylla_io_2: (g=0): rw=randrw, bs=(R) 5120B-5120B, (W) 5120B-5120B, (T) 5120B-5120B, ioengine=libaio, iodepth=16
...
fio-3.16
Starting 16 processes
scylla_io_2: Laying out IO file (1 file / 1024MiB)
Jobs: 16 (f=16): [w(16)][100.0%][w=179MiB/s][w=36.8k IOPS][eta 00m:00s]
scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024
write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets
slat (usec): min=3, max=1636, avg=10.87, stdev=13.92
clat (usec): min=391, max=26669, avg=6759.47, stdev=1146.82
lat (usec): min=549, max=26699, avg=6770.58, stdev=1146.98
clat percentiles (usec):
| 1.00th=[ 2245], 5.00th=[ 4621], 10.00th=[ 5932], 20.00th=[ 6390],
| 30.00th=[ 6587], 40.00th=[ 6783], 50.00th=[ 6915], 60.00th=[ 7046],
| 70.00th=[ 7177], 80.00th=[ 7373], 90.00th=[ 7635], 95.00th=[ 7963],
| 99.00th=[ 9110], 99.50th=[10421], 99.90th=[13566], 99.95th=[15008],
| 99.99th=[20579]
bw ( KiB/s): min=179810, max=321463, per=99.99%, avg=188943.08, stdev=1561.74, samples=1920
iops : min=35962, max=64289, avg=37788.25, stdev=312.34, samples=1920
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
lat (msec) : 2=0.62%, 4=3.40%, 10=95.36%, 20=0.58%, 50=0.01%
cpu : usr=1.39%, sys=3.04%, ctx=1606741, majf=0, minf=194
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2267856,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
WRITE: bw=185MiB/s (193MB/s), 185MiB/s-185MiB/s (193MB/s-193MB/s), io=10.8GiB (11.6GB), run=60008-60008msec

Disk stats (read/write):
sdb: ios=0/2266088, merge=0/0, ticks=0/15152759, in_queue=15152760, util=99.87%

Very strange. This is what I'm getting on my laptop:
Run status group 0 (all jobs):
WRITE: bw=3684MiB/s (3863MB/s), 3684MiB/s-3684MiB/s (3863MB/s-3863MB/s), io=16.0GiB (17.2GB), run=4447-4447msec

And of course, if I switch to 4KB bs, it's slightly better.
Run status group 0 (all jobs):
WRITE: bw=4011MiB/s (4206MB/s), 4011MiB/s-4011MiB/s (4206MB/s-4206MB/s), io=16.0GiB (17.2GB), run=4085-4085msec
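For comparison, the 4 KiB variant is the same invocation as the command above with only --bs changed (sketch):

fio --filename=/var/lib/scylla/a --direct=1 --rw=randrw --refill_buffers --size=1G --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=scylla_io_2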

Please check the advanced dashboard in per-shard view mode to see if some shard is the bottleneck.

Thanks a lot!
Can we check this on open-source Scylla?


Yes, you can use the monitoring stack with open-source Scylla.
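A minimal setup sketch, assuming the Docker-based scylla-monitoring stack from GitHub:

git clone https://github.com/scylladb/scylla-monitoring.git
cd scylla-monitoring
# point prometheus/scylla_servers.yml at your cluster nodes, then:
./start-all.sh

The advanced dashboard with per-shard view mentioned above is part of that stack.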

I found the issue: it was caused by the partition key, which was not letting data be divided equally across the nodes; that is why we were getting
large_data - Writing large partition
We have changed the key to a uuid, and now data is equally distributed across both DC1 and DC2.
Thanks for your time.
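For illustration, a sketch of the kind of schema this implies (the keyspace and table names here are hypothetical, not the redacted ones from the error above):

cqlsh -e "CREATE TABLE IF NOT EXISTS my_ks.events_by_id (
    id uuid,
    created_at timestamp,
    payload text,
    PRIMARY KEY (id, created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);"

With a uuid partition key, writes hash evenly across the token ring instead of accumulating in a few oversized partitions.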