gluesql / gluesql

GlueSQL is quite sticky. It attaches to anywhere.

Home Page: https://gluesql.org/docs

Performance with persistent tables (Sled)

kanekoshoyu opened this issue

Hi,

lately I've been doing quite a lot of stuff with GlueSQL, and one of those things is storing financial data. However, I find that GlueSQL's performance is not sufficient to fulfill my requirements. Namely, the persistent tables are really slow, so I was wondering if there is any optimization available for either SharedMemoryStorage or SledStorage that you would recommend.

Currently I have about 100 (symbols) * 24 * 60 * 60 (seconds) of data that I have to query quite often, with each data point being a row in the table. I was expecting to run multiple queries per second, but apparently even a single SELECT already takes a lot of time, as below.

Timings for a single SELECT:
in-memory DB 1hr data: 700ms
persistent DB 1hr data: 2700ms
in-memory DB 24hr data: 23000ms
persistent DB 24hr data: 82500ms

I changed the sled Config to use a 2 GB data cache, which is exactly the amount of data stored, and set the mode to HighThroughput. It improved performance a bit.
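For reference, the tuning I mean looks roughly like this. This is a config sketch only, assuming the current sled `Config` builder methods and that GlueSQL's `SledStorage` still implements `TryFrom<sled::Config>`; the path is hypothetical:

```rust
// Config fragment (untested sketch, assumes the sled crate's builder API).
let config = sled::Config::default()
    .path("financial.db") // hypothetical path, substitute your own
    .cache_capacity(2 * 1024 * 1024 * 1024) // 2 GB page cache
    .mode(sled::Mode::HighThroughput);

// GlueSQL's SledStorage can (as far as I know) be built from a sled::Config:
let storage = gluesql::prelude::SledStorage::try_from(config)?;
```

The cache only helps reads that repeat; it does not change how many rows a query has to visit.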

persistent DB 24hr data (cache set to 2GB): 65000ms

I wonder if any proper use of primary key within GlueSQL would help with the performance. Any optimizable options would be great.

Did you mean that selecting 360,000 (100 * 60 * 60) rows from sled storage took 2.7 seconds, or selecting 1 row out of 360,000 rows?
Did that include the client latency to fetch all the rows, or just the execution time of the query?

And I wonder what the full query text was.

https://gist.github.com/JakkuSakura/4bb9678501dbabf56c1b6d95269740aa
This is the source code used for the benchmark.
It's not the exact one, but it still shows something.

persistent database insertion for 1s data: 7ms
volatile database insertion for 1s data: 1ms
volatile shared database selection with 1 symbols 24hr data: 1581ms, 1hr=false
volatile shared database selection with 1 symbols 24hr data: 1587ms, 1hr=true
persistent database selection with 1 symbols 1hr data: 93ms, 1hr=false
volatile shared database selection with 1 symbols 1hr data: 75ms, 1hr=false
persistent database selection with 1 symbols 24hr data: 2067ms, 1hr=false
persistent database selection with 1 symbols 24hr data: 2071ms, 1hr=true
It's completely useless.
There's no simple way we can improve it.
1hr=true means that we select one hour's data from the whole dataset.

Did you mean that selecting 360,000 (100 * 60 * 60) rows from sled storage took 2.7 seconds, or selecting 1 row out of 360,000 rows? Did that include the client latency to fetch all the rows, or just the execution time of the query?

And I wonder what the full query text was.

It is 3,600 rows out of 360,000, i.e. getting one symbol out of 100 symbols for 1 hr of data.

Jakku set up a test with a slight modification to the above, changing the symbol string to a symbol ID. It helped quite a lot, so my next optimization guess is changing Decimal to f64.

persistent database insertion for 1s data: 2ms
volatile database insertion for 1s data: 1ms
volatile shared database selection with 1 symbols 24hr data: 380ms, 1hr=false
volatile shared database selection with 1 symbols 24hr data: 385ms, 1hr=true
persistent database selection with 1 symbols 1hr data: 36ms, 1hr=false
volatile shared database selection with 1 symbols 1hr data: 17ms, 1hr=false
persistent database selection with 1 symbols 24hr data: 826ms, 1hr=true
persistent database selection with 1 symbols 24hr data: 838ms, 1hr=false

But again, a query selecting 86,400 rows out of 8,640,000 in a persistent table taking 800+ ms is quite slow; we are aiming for somewhere below 50 ms.

I'm wondering what the limiting factors on performance are here. I get that changing the symbol string to an ID improves performance because it reduces the time spent on symbol comparison. But does the size of each row make a big difference to GlueSQL's selection performance?

Roughly speaking, sled is structured as a BTreeMap, so the best possible query should be O(log N). However, 1 hr vs 24 hr data currently takes 36 ms vs 838 ms: roughly the 24x scaling of O(N) rather than the ~4.5x of O(log N).

The same applies to SharedMemoryStorage: 17 ms vs 385 ms is around the same 24x.

The per-row operation is taking so much time that what should be an O(log N) issue becomes an O(N) issue.