penberg / limbo

Limbo is a work-in-progress, in-process OLTP database management system, compatible with SQLite.

External sorting for `ORDER BY`

penberg opened this issue · comments

All the sorting now happens in-memory, which won't work well for large data sets. Let's add support for external sorting too.

I would like to take this up. This would be my first Rust project, so I expect a steep learning curve. Is there a specific document I should look at that would help with implementing this feature?

It's good to study what SQLite does and check alternatives like https://duckdb.org/2021/08/27/external-sorting.html.

Going through some resources on how external sorting is implemented, I can see that it differs quite a bit between OLTP and OLAP databases, mostly because of how tuples are stored on pages. But the essence is the same:

  1. split the rows into chunks small enough to fit in memory
  2. sort each chunk in memory and write it to a temporary file as a sorted run
  3. merge the sorted runs (k-way merge is the classic choice; DuckDB radix-sorts its runs and then merge-sorts them), using one thread or several in parallel depending on the host machine's specs.

ref:

  1. Paper on External Sorting in DuckDB
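
The three steps above can be sketched in a few lines. This is only an illustration in Python, not how SQLite, DuckDB, or Limbo implement it: the function name `external_sort`, the `chunk_size` parameter, and the one-integer-per-line run format are all assumptions made for the example, and the merge is a single-threaded k-way merge via the standard library's `heapq.merge`.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(rows, chunk_size):
    """Yield the integers in `rows` in ascending order.

    Hypothetical sketch: at most `chunk_size` rows are sorted in memory
    at a time; each sorted run is spilled to its own temporary file.
    """
    run_paths = []
    it = iter(rows)
    try:
        # Steps 1 and 2: cut the input into chunks, sort each chunk in
        # memory, and write it out as a sorted run.
        while True:
            chunk = sorted(islice(it, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w") as f:
                f.writelines(f"{value}\n" for value in chunk)
            run_paths.append(path)

        # Step 3: k-way merge. heapq.merge reads the runs lazily, so only
        # one pending value per run is held in memory at a time.
        files = [open(path) for path in run_paths]
        try:
            runs = [(int(line) for line in f) for f in files]
            yield from heapq.merge(*runs)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary run files once the merge is finished.
        for path in run_paths:
            os.remove(path)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))` writes three runs and merges them back into one sorted sequence.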

@penberg is there a way I could force SQLite to use external sort? I wanted to see in the EXPLAIN output which opcodes SQLite executes to perform the external sort. I tried adding the following PRAGMAs

PRAGMA cache_size = -2000;
PRAGMA temp_store_directory = '/tmp/sqlite_sort';
PRAGMA temp_store = 2;
PRAGMA sort_mem = 1000000;

but could not find anything in my /tmp/sqlite_sort directory.
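
A few things about these PRAGMAs may be worth checking, going by SQLite's pragma documentation: `temp_store = 2` selects MEMORY (1 or FILE selects temporary files), `sort_mem` is not a documented SQLite pragma (unknown pragmas are ignored without an error), and `temp_store_directory` is documented as deprecated. Also, on Unix SQLite unlinks its temporary files right after opening them, so an empty directory listing does not by itself prove that nothing was spilled. The snippet below is a small sketch using Python's `sqlite3` module (chosen only for illustration, it is not part of this thread) that reads the settings back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# temp_store = 2 means MEMORY: temporary tables and indices stay in RAM.
conn.execute("PRAGMA temp_store = 2")
temp_store_after_2 = conn.execute("PRAGMA temp_store").fetchone()[0]

# temp_store = FILE (numeric value 1) asks for temporary files instead.
conn.execute("PRAGMA temp_store = FILE")
temp_store_after_file = conn.execute("PRAGMA temp_store").fetchone()[0]

# An unknown pragma raises no error and returns no rows.
unknown_pragma_rows = conn.execute("PRAGMA sort_mem = 1000000").fetchall()

print(temp_store_after_2, temp_store_after_file, unknown_pragma_rows)
conn.close()
```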

My understanding is that SQLite sorts by inserting into a temporary b+ tree index and reading it out in order.

Yes, this is even what the opcodes showed. SorterInit would open up a sorter, and an OpenPseudo would create a temp B-tree index and use it for sorting.

No, that's not exactly how it works. There is a SorterOpen instruction that opens a transient index that a VDBE program uses for sorting. The OpenPseudo instruction, on the other hand, opens a cursor to a fake table that only ever has one row. That one row is determined by the contents of the register specified in operand P2.

If you look at the bytecode of a simple ORDER BY query:

sqlite> EXPLAIN SELECT id FROM users ORDER BY zipcode;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     16    0                    0   Start at 16
1     SorterOpen     1     3     0     k(1,B)         0
2     OpenRead       0     2     0     9              0   root=2 iDb=0; users
3     Rewind         0     9     0                    0
4       Rowid          0     2     0                    0   r[2]=users.rowid
5       Column         0     8     1                    0   r[1]= cursor 0 column 8
6       MakeRecord     1     2     3                    0   r[3]=mkrec(r[1..2])
7       SorterInsert   1     3     1     2              0   key=r[3]
8     Next           0     4     0                    1
9     OpenPseudo     2     4     3                    0   3 columns in r[4]
10    SorterSort     1     15    0                    0
11      SorterData     1     4     2                    0   r[4]=data
12      Column         2     1     2                    0   r[2]=id
13      ResultRow      2     1     0                    0   output=r[2]
14    SorterNext     1     11    0                    0
15    Halt           0     0     0                    0
16    Transaction    0     0     2     0              1   usesStmtJournal=0
17    Goto           0     1     0                    0

You will see how there are two loops: the first loop starting at address 4 walks the "users" table b-tree and uses SorterInsert to insert to the transient index. When the walking is done, you can see that the second loop starting at address 11 loads a row from the transient index with SorterData and stores it in register r[4]. And if you go up a bit to address 9, you see how OpenPseudo opens a single-row fake table layered on top of register r[4].
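
To make the two loops concrete, here is a toy model of that bytecode in Python. It is a sketch for reading along with the listing, not SQLite or Limbo code: the function name `order_by_zipcode` and the `(rowid, zipcode)` input shape are made up for the example, and a plain list stands in for the transient index.

```python
def order_by_zipcode(users):
    """Toy model of the ORDER BY program above.

    `users` is an iterable of (rowid, zipcode) pairs standing in for the
    "users" table b-tree (hypothetical input shape for this sketch).
    """
    sorter = []                       # SorterOpen: stand-in for the transient index

    # Loop 1 (addresses 4-8): walk the table and feed the sorter.
    for rowid, zipcode in users:      # Rewind / Next on cursor 0
        r3 = (zipcode, rowid)         # MakeRecord: r[3] = mkrec(r[1..2])
        sorter.append(r3)             # SorterInsert: key = r[3]

    # SorterSort: order the records by the zipcode field.
    sorter.sort(key=lambda record: record[0])

    # Loop 2 (addresses 11-14): read the sorted records back out.
    output = []
    for record in sorter:             # SorterNext
        r4 = record                   # SorterData: r[4] = data
        r2 = r4[1]                    # Column via the pseudo cursor: column 1 of the row in r[4]
        output.append(r2)             # ResultRow: output = r[2]
    return output
```

The pseudo cursor shows up here only as the indexing into `r4`: it has no storage of its own and always exposes whatever record currently sits in that register.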

In SQLite the transient index lives in src/sorter.c and in Limbo it's in core/vdbe/sorter.rs.

Did you mean src/vdbesort.c in the case of SQLite? I could not find any src/sorter.c in the source repo.

@rajivharlalka Yes, correct, that's the one.

@penberg I was trying to reproduce a case where SQLite uses external sort but could not find a way to do so. I created a 10 GB database (my laptop has 8 GB of RAM) to see whether the SQLite bytecode changes when the table to sort does not fit in memory, but I could not find any difference that would help me tell an external sort from an in-memory sort.
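
As far as I understand, whether the sorter spills to disk is decided at run time inside src/vdbesort.c rather than at code generation, so EXPLAIN would not be expected to change with the size of the database. One way to compare the bytecode under different settings is to dump just the opcode column. The sketch below uses Python's `sqlite3` module; the helper name `sort_opcodes` and the table schema are assumptions made for the example.

```python
import sqlite3


def sort_opcodes(cache_size):
    """Return the opcode names SQLite generates for a simple ORDER BY.

    Hypothetical helper: builds an empty `users` table in an in-memory
    database, applies the given cache_size, and reads the EXPLAIN output.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(f"PRAGMA cache_size = {int(cache_size)}")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, zipcode TEXT)"
    )
    rows = conn.execute("EXPLAIN SELECT id FROM users ORDER BY zipcode").fetchall()
    conn.close()
    # EXPLAIN columns are: addr, opcode, p1, p2, p3, p4, p5, comment.
    return [row[1] for row in rows]


small_cache = sort_opcodes(-2000)
large_cache = sort_opcodes(-2000000)
print(small_cache)
print("same opcodes:", small_cache == large_cache)
```

If the two lists match, the difference between an in-memory and an external sort has to be looked for in the sorter implementation itself rather than in the EXPLAIN output.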