Sopel97 / chess_pos_db

Database software for chess position statistics. Designed to provide high performance and handle billions of games.

Multithreaded import

Sopel97 opened this issue 4 years ago

Tomasz Sobczyk commented 4 years ago

General thoughts follow:

We can't do multithreading for databases that support first/last game lookup without major structural changes
- Defer this to a separate database format type. To be designed (gonna be fun :D)
- We actually can. But it requires a different approach
  - Divide the files among the threads.
    - if the order of games matters, this has to be done a priori; otherwise we couldn't keep the order of files the same as the order of games (or we would need one file per pgn, which is potentially too fragmented)
    - if the order of games DOESN'T matter (for example formats without first/last game, or formats that can order games by something other than their location), then we can and should use a task queue (but we have to order the files from largest to smallest, otherwise we risk a long single-threaded task at the end)
  - Each thread creates files with ids 1000000*thread_id + i
  - Store all entries from a single game temporarily
  - After the whole game is processed add it to the game headers
  - Set game index in all entries
  - Copy the entries to the output buffer
  - Sorting has to use stable_sort if the order of games matters, because the only way to order the games by time is by entry location.
    - and we have to go back to combining entries such that the first/last games are taken from entry location rather than computed as min/max
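
A minimal sketch of the last few points, assuming a hypothetical `Entry` layout (a position key plus a game index) that is not taken from the actual code base. It shows the per-thread file id scheme and why a stable sort lets first/last be read from entry location instead of from min/max of the game index. `fileId`, `sortEntries` and `firstLastByLocation` are illustrative names only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical entry layout: position key plus the index of the source game.
struct Entry {
    std::uint64_t positionKey;
    std::uint32_t gameIndex;
};

// Id scheme from the list above: each thread owns a block of 1000000 ids.
constexpr std::uint32_t fileId(std::uint32_t threadId, std::uint32_t i) {
    return 1000000u * threadId + i;
}

inline bool byPosition(const Entry& a, const Entry& b) {
    return a.positionKey < b.positionKey;
}

// Sort by position only. std::stable_sort keeps entries with equal keys in
// their input order, so the input (time) order of games survives the sort.
void sortEntries(std::vector<Entry>& entries) {
    std::stable_sort(entries.begin(), entries.end(), byPosition);
}

struct FirstLast {
    std::uint32_t first;
    std::uint32_t last;
};

// First/last game for a key, taken from location within the equal-key run
// rather than from min/max of gameIndex.
// Precondition: `sorted` is sorted by position and contains `key`.
FirstLast firstLastByLocation(const std::vector<Entry>& sorted, std::uint64_t key) {
    const auto range =
        std::equal_range(sorted.begin(), sorted.end(), Entry{key, 0}, byPosition);
    return {range.first->gameIndex, std::prev(range.second)->gameIndex};
}
```

With game indices that are not monotonic in time (as could happen when several threads assign them), min/max and first/last by location give different answers, which is the reason for the stable sort.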

Some old points that may apply:

We CAN easily do multithreading for databases that don't store header data
- fit into current database format code
- constexpr bool gameOrderMatters = hasAnyHeader
  - this also affects mergemode
- constexpr bool parallelizable = !gameOrderMatters
- if parallelizable && config.max_num_import_threads > normal_num_import_threads (we have a pipeline so there is more than 1 thread by default): parallelImport()
- here we completely disregard game header code
- we have a single-producer, multiple-consumer concurrent queue of input files
- list of database files has to be atomic again (inherit from Lockable? if constexpr then lock())
- the number of buffers in flight will grow by a factor of parallelization, that's fine
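
The points above could be sketched roughly as follows. This is an assumption-laden illustration, not the project's code: `FormatTraits`, `InputFile`, `FileQueue` and this `parallelImport` signature are all hypothetical, and the queue is pre-filled and mutex-guarded rather than a true single-producer, multiple-consumer queue. Files are handed out largest first so that no big file is left for a single thread at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical compile-time traits mirroring the two constexpr flags above.
template <bool HasAnyHeader>
struct FormatTraits {
    static constexpr bool gameOrderMatters = HasAnyHeader;
    static constexpr bool parallelizable = !gameOrderMatters;
};

struct InputFile {
    std::string path;
    std::uint64_t sizeBytes;
};

// Mutex-guarded queue of input files, handed out largest first.
class FileQueue {
public:
    explicit FileQueue(std::vector<InputFile> files) : m_files(std::move(files)) {
        // Ascending sort puts the largest file at the back, where pop() takes it.
        std::sort(m_files.begin(), m_files.end(),
                  [](const InputFile& a, const InputFile& b) {
                      return a.sizeBytes < b.sizeBytes;
                  });
    }

    std::optional<InputFile> pop() {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (m_files.empty()) return std::nullopt;
        InputFile file = std::move(m_files.back());
        m_files.pop_back();
        return file;
    }

private:
    std::mutex m_mutex;
    std::vector<InputFile> m_files;
};

// Runs numThreads consumers over the queue. The returned vector stands in for
// the shared list of database files, which is why it is guarded by a lock.
template <typename ImportFn>
std::vector<std::string> parallelImport(std::vector<InputFile> files,
                                        std::size_t numThreads,
                                        ImportFn importOne) {
    FileQueue queue(std::move(files));
    std::mutex doneMutex;
    std::vector<std::string> done;
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            while (auto file = queue.pop()) {
                importOne(*file);
                std::lock_guard<std::mutex> lock(doneMutex);
                done.push_back(file->path);
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return done;
}
```

A caller would branch with `if constexpr (FormatTraits<hasAnyHeader>::parallelizable)` and fall back to the sequential pipeline otherwise; the order of the returned paths is not deterministic, which is acceptable only because game order does not matter in this case.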

The general direction should be re-adding parallelisation to the current database class; don't separate sequential from parallel.