scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra

Operations timeout while inserting data into ScyllaDB cluster at very low throughput

amitesh88 opened this issue · comments

I have a 3-node ScyllaDB cluster.
32 CPUs, 64 GB RAM, Scylla version: 5.4.3
io_properties.yaml:
read_iops: 36764
read_bandwidth: 769690880
write_iops: 42064
write_bandwidth: 767818944
When the application increased its write rate from 1,200 to 10,000 TPS, which is far below the write_iops value in io_properties.yaml, it started getting the error below:
Error inserting Data : Operation timed out for xxx_xxx.xxx_xxxxx_240512 - received only 1 responses from 2 CL=QUORUM.
The only log entry that can be seen on the ScyllaDB nodes is:
[shard 8:comp] large_data - Writing large partition xxx_xxx.xxx_xxxxx_240512: xxx (37041816 bytes) to me-3gg2_13mq_3jyhc2r2wxx7hvxxw4-big-Data.db
CPU utilisation on each node is barely 15%, yet the application fails to write.
Note: the RF of system_auth and the other keyspaces is already equal to the number of nodes.

I need insights on this.
Thanks in advance.

Hi @amitesh88 - you can't compare the io_properties IOPS to CQL operations in any way. ScyllaDB issues many more 'raw' I/O operations for every CQL transaction, for example for commit log I/O and compaction.
However, I do encourage you to test the disk with fio. It may be that iotune is configuring far fewer IOPS than the disks can sustain, in which case you may be able to raise the numbers somewhat. I'm unsure whether that will solve your issue, but it's worth a try.


We are using SSD disks on GCP VMs, which have good I/O throughput; refer to the image above.

Could the large-partition writes be the issue, i.e. the partition key not distributing the load properly?

@amitesh88 - as you can see, the numbers quoted above and the iotune numbers are vastly different. I'd also compare with fio. If fio is substantially better, I'd manually raise the values and try again. See scylladb/seastar#1297 for reference.

Using fio, I am getting the result below:
scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024
write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets

That's a bit low - I expected more. Can you share the full fio command line and results?

Below is the command with output:

fio --filename=/var/lib/scylla/a --direct=1 --rw=randrw --refill_buffers --size=1G --norandommap --randrepeat=0 --ioengine=libaio --bs=5kb --rwmixread=0 --iodepth=16 --numjobs=16 --runtime=60 --group_reporting --name=scylla_io_2
scylla_io_2: (g=0): rw=randrw, bs=(R) 5120B-5120B, (W) 5120B-5120B, (T) 5120B-5120B, ioengine=libaio, iodepth=16
Starting 16 processes
scylla_io_2: Laying out IO file (1 file / 1024MiB)
Jobs: 16 (f=16): [w(16)][100.0%][w=179MiB/s][w=36.8k IOPS][eta 00m:00s]
scylla_io_2: (groupid=0, jobs=16): err= 0: pid=16100: Wed May 15 16:28:36 2024
write: IOPS=37.8k, BW=185MiB/s (193MB/s)(10.8GiB/60008msec); 0 zone resets
slat (usec): min=3, max=1636, avg=10.87, stdev=13.92
clat (usec): min=391, max=26669, avg=6759.47, stdev=1146.82
lat (usec): min=549, max=26699, avg=6770.58, stdev=1146.98
clat percentiles (usec):
| 1.00th=[ 2245], 5.00th=[ 4621], 10.00th=[ 5932], 20.00th=[ 6390],
| 30.00th=[ 6587], 40.00th=[ 6783], 50.00th=[ 6915], 60.00th=[ 7046],
| 70.00th=[ 7177], 80.00th=[ 7373], 90.00th=[ 7635], 95.00th=[ 7963],
| 99.00th=[ 9110], 99.50th=[10421], 99.90th=[13566], 99.95th=[15008],
| 99.99th=[20579]
bw ( KiB/s): min=179810, max=321463, per=99.99%, avg=188943.08, stdev=1561.74, samples=1920
iops : min=35962, max=64289, avg=37788.25, stdev=312.34, samples=1920
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.02%
lat (msec) : 2=0.62%, 4=3.40%, 10=95.36%, 20=0.58%, 50=0.01%
cpu : usr=1.39%, sys=3.04%, ctx=1606741, majf=0, minf=194
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,2267856,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
WRITE: bw=185MiB/s (193MB/s), 185MiB/s-185MiB/s (193MB/s-193MB/s), io=10.8GiB (11.6GB), run=60008-60008msec

Disk stats (read/write):
sdb: ios=0/2266088, merge=0/0, ticks=0/15152759, in_queue=15152760, util=99.87%
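
As a quick sanity check on the fio output above, the reported figures can be cross-checked against each other. The sketch below uses only numbers copied from that output (IOPS, block size, job count, queue depth) and assumes steady state so that Little's law applies. It is an illustration of the arithmetic, not a statement about what the disk should deliver.

```python
# Figures copied from the fio command line and output above.
iops = 37_800            # write IOPS reported by fio
block_size = 5 * 1024    # --bs=5kb, i.e. 5120 bytes per request
jobs, iodepth = 16, 16   # --numjobs and --iodepth

# Bandwidth is simply IOPS times request size.
bandwidth_mib_s = iops * block_size / 2**20

# Little's law (steady state assumed): in-flight requests = rate * mean latency.
in_flight = jobs * iodepth
mean_latency_ms = in_flight / iops * 1000

print(f"bandwidth ~ {bandwidth_mib_s:.0f} MiB/s")
print(f"in-flight requests = {in_flight}")
print(f"expected mean latency ~ {mean_latency_ms:.2f} ms")
```

The computed bandwidth and mean latency line up with the `BW=185MiB/s` and `lat ... avg=6770.58` (usec) values fio printed. Together with `util=99.87%` for sdb, this suggests the result is internally consistent and that the device was kept busy for the whole run at this queue depth.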

Very strange. This is what I'm getting on my laptop:
Run status group 0 (all jobs):
WRITE: bw=3684MiB/s (3863MB/s), 3684MiB/s-3684MiB/s (3863MB/s-3863MB/s), io=16.0GiB (17.2GB), run=4447-4447msec

And of course, if I switch to 4KB bs, it's slightly better.
Run status group 0 (all jobs):
WRITE: bw=4011MiB/s (4206MB/s), 4011MiB/s-4011MiB/s (4206MB/s-4206MB/s), io=16.0GiB (17.2GB), run=4085-4085msec

Please check the advanced dashboard in per-shard view mode to see if some shard is the bottleneck.

Thanks a lot.
Can we check this on open-source Scylla?

Yes, you can use the monitor with open source Scylla.

I found the issue. It was caused by the partition key, which was not letting the data be divided equally across the nodes; that is why we were getting
large_data - Writing large partition
We have changed it to a uuid, and the data is now equally distributed across both DC1 and DC2.
Thanks for your time.
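
For readers who hit the same symptom, the effect of the partition key can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the schema from this issue: the key names are hypothetical, `SHARDS = 32` assumes one shard per CPU, and an MD5 hash taken modulo the shard count stands in for ScyllaDB's real Murmur3 token-to-shard mapping.

```python
import hashlib
import uuid
from collections import Counter

SHARDS = 32  # assumption: one shard per CPU on a 32-CPU node


def shard_for(partition_key: str) -> int:
    """Map a key to a shard. MD5 modulo SHARDS is a stand-in for the real mapping."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % SHARDS


def busiest_share(keys) -> float:
    """Return the fraction of writes that land on the single busiest shard."""
    counts = Counter(shard_for(k) for k in keys)
    return max(counts.values()) / sum(counts.values())


WRITES = 10_000
# Hypothetical low-cardinality key: only three distinct partition keys.
skewed = [f"bucket-{i % 3}" for i in range(WRITES)]
# One partition per row, as with a uuid partition key.
spread = [str(uuid.uuid4()) for _ in range(WRITES)]

print(f"low-cardinality key: busiest shard takes {busiest_share(skewed):.0%} of writes")
print(f"uuid key:            busiest shard takes {busiest_share(spread):.0%} of writes")
```

With only a few distinct partition keys, at least a third of the writes go to one shard regardless of how many shards exist, which would be consistent with low overall CPU utilisation alongside write timeouts. With a uuid key the busiest shard receives only a small share, close to 1/32.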