surrealdb / surrealdb

A scalable, distributed, collaborative, document-graph database, for the realtime web

Home Page: https://surrealdb.com

Bug: Paginating data (START / OFFSET) gets exponentially slower with bigger datasets (from 380ms to 468970ms)

kerkmann opened this issue · comments

Describe the bug

I am using SurrealDB in production with a fairly large dataset (162k+ entries) in a relatively simple table: just an id, which is a string. I now need to fetch the entire dataset from that table. Instead of fetching everything at once, I used pagination (looping through the entire dataset with a page size of 100). Selecting the entire dataset without pagination took 380ms, but selecting it with pagination took around 468970ms (yes ... ms; that's 7 minutes, 48 seconds and 970 milliseconds in total).

The code I wrote for fetching the entire dataset (without pagination):

    // Fetch every id in a single query (no pagination).
    let response = surreal
        .query(
            r#"
            SELECT id
            FROM my_table
            "#,
        )
        .await?
        .take::<Vec<Thing>>(0)?;

    let things = response.into_iter().collect::<HashSet<_>>();

    Ok(things)

This took around 380ms.

The code I wrote for fetching the entire dataset (with pagination):

    let mut things = HashSet::new();

    let mut page = 0;
    loop {
        let response = surreal
            .query(
                r#"
                SELECT id
                FROM my_table
                LIMIT $page_size
                START ($page_size * $page)
                "#,
            )
            .bind(("page_size", 100))
            .bind(("page", page))
            .await?
            .take::<Vec<Thing>>(0)?;

        if response.is_empty() {
            break;
        }

        things.extend(response);

        page += 1;
    }

    Ok(things)

This took around 468970ms (yes ... ms; these queries took 7 minutes, 48 seconds and 970 milliseconds in total).

I've created an example with the demo dataset. There the dataset is only 1000 records, yet the delta between the two variants is already 372ms (from 8ms to 380ms). From what I can tell, a bigger dataset makes it much worse; it scales ridiculously fast.. ^^"

Steps to reproduce

I've loaded the SurrealDB demo dataset in Surrealist to create an easy entry point for reproducing this bug. Below are two queries: one selects the entire dataset without pagination, and the other selects the entire dataset with pagination.

The query without pagination:

-- This query takes ~8ms
select id from review;

The query with pagination (for this example the pages are a hard-coded array, since the review table contains 1000 entries; in an application it's usually just a loop iterating through the dataset):

-- This query takes ~380ms
for $page in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99] {
    select id from review limit 10 start $page * 10;
};

Expected behaviour

Using START/LIMIT should NOT get exponentially slower as you page further into the dataset (or at least not by this much). Working with a bigger dataset is sadly close to impossible right now. :(

SurrealDB version

surreal 1.4.2 on linux x86_64

Contact Details

daniel@kerkmann.dev

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct

Hi @kerkmann thanks for submitting this issue.

When you use START in your statement, the query doesn't actually start at that record. We still have to scan from the beginning of the table, skipping the records that are processed, until we reach the record that should be at the 'start'. As a result, what you are effectively doing is running a limited full table scan many, many times - and each scan gets more expensive as the number of records to skip grows.
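
To put rough numbers on that (my own back-of-the-envelope estimate, not a measurement from this thread): with a page size of P, page k has to skip roughly k * P records before it can return anything, so paging through an N-record table touches on the order of N² / (2P) records in total. A minimal sketch of that arithmetic, using the figures from the report above:

    // Rough estimate of how many records the START-based pagination scans in total.
    // Figures taken from the report: 162k records, page size 100.
    fn main() {
        let n: u64 = 162_000; // records in the table
        let p: u64 = 100;     // page size
        let pages = n.div_ceil(p);
        // Page k skips k * p records, then reads up to p more.
        let scanned: u64 = (0..pages).map(|k| k * p + p).sum();
        println!("single unpaginated scan: ~{n} records");
        println!("paginated scan total:    ~{scanned} records"); // ~131 million
    }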

The good news: there are a couple of things that can be done to make this faster! This is where record ranges and complex record ids come into play...

  1. Record ranges
    You can use record ranges to start a query from a specific record (a Rust sketch applying this to the pagination loop from the report follows after the examples below).
-- Create 1000 new records
CREATE |person:1..1000|;
-- Select the first 5 records - the last id from this result is person:5
SELECT * FROM person LIMIT 5;
-- Use an unbounded record range to start selecting immediately from the person:5 record. We still need to use START 1, because we want to skip the starting record.
SELECT * FROM person:5.. LIMIT 5 START 1;
-- We can improve on this by starting from the next record after the starting record in the range, and by doing it this way, we can drop the START 1 clause.
SELECT * FROM person:5>.. LIMIT 5;
  2. Complex record ids
    If you know in advance what your record ids might look like, then you could use complex record ids in combination with the approach above.
-- Create records with a specific ID structure
CREATE person:['UK', 'marcus@gmail.com'];
CREATE person:['UK', 'simon@gmail.com'];
CREATE person:['UK', 'charles@gmail.com'];
CREATE person:['US', 'mike@gmail.com'];
CREATE person:['US', 'trevor@gmail.com'];
CREATE person:['US', 'adam@gmail.com'];
CREATE person:['US', 'george@gmail.com'];
CREATE person:['US', 'obama@gmail.com'];
-- Select all records in the table
SELECT * FROM person LIMIT 5;
-- Select only from a specific set of records, or a specific 'shard' of the table
SELECT * FROM person:['UK', NONE]..['US', NONE];
-- SELECT a specific set of records starting at a specific point and continuing to the end of the table
SELECT * FROM person:['UK', NONE]..;
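
Putting technique 1 back into the Rust loop from the original report, here is a minimal keyset-style sketch. This is my adaptation and not verified against 1.4.2: the table name my_table and the page size come from the issue, I've switched to SELECT VALUE id so the result deserializes straight into Thing, and the record id is interpolated into the query text for brevity rather than bound as a parameter.

    use std::collections::HashSet;
    use surrealdb::sql::Thing;
    use surrealdb::{Connection, Surreal};

    // Keyset-style pagination: instead of START (which rescans from the beginning
    // of the table on every page), each iteration continues from the last id seen
    // using an open record range (`my_table:<last>>..`), as in example 1 above.
    async fn fetch_all_ids<C: Connection>(
        surreal: &Surreal<C>,
    ) -> Result<HashSet<Thing>, surrealdb::Error> {
        let mut things = HashSet::new();
        let mut last: Option<Thing> = None;

        loop {
            let query = match &last {
                // First page: plain LIMIT over the whole table.
                None => "SELECT VALUE id FROM my_table LIMIT $page_size".to_string(),
                // Later pages: start right after the previous page's final record.
                Some(last) => format!("SELECT VALUE id FROM {last}>.. LIMIT $page_size"),
            };

            let page: Vec<Thing> = surreal
                .query(query)
                .bind(("page_size", 100))
                .await?
                .take(0)?;

            if page.is_empty() {
                break;
            }

            last = page.last().cloned();
            things.extend(page);
        }

        Ok(things)
    }

Each query then scans at most page_size records from the range start, so the total work stays roughly linear in the table size; the composite-id 'shards' from example 2 can be combined with the same loop by narrowing the range bounds.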