shabados / database

A digital representation of Sikh Bani and other Panthic texts with a public logbook of sangat-sourced corrections.

Home Page: https://docs.shabados.com/database


RFC: Offline DB/API revamp

Harjot1Singh opened this issue · comments

Problem

The repository is built into a relational database: an SQLite file, plus Docker containers for Postgres and MariaDB.

This is great for a huge number of use cases. However, the data (imo) is intrinsically document-based. That is not to say that relations do not exist, but that they are only useful for gleaning insights if you were querying the database blind, without knowing what it contains, which is rarely how this data is consumed.

In practice, this means that we transform the data into a relational format, only to join it back together at runtime in order to consume it in a useful manner. This seems a little pointless and is a source of slowness.

Proposition

I can see the flexibility of the relational database being used in applications to do with data exploration and visualisation, so there is definitely a valid use-case to keep supporting this. But the majority of the time, the database is leveraged in a search-oriented fashion.

So, I propose that we provide a NoSQL/document-based format, which will facilitate high-speed querying offline, whilst maintaining our relational formats.
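As a rough sketch of what a document-oriented record could look like, everything needed to render a search result can live in one self-contained document, with the former relations embedded. Field names here are hypothetical, not the actual @shabados/database schema:

```javascript
// Illustrative shape of a fully denormalised line document.
// All field names are hypothetical, not the real schema.
const lineDoc = {
  id: 'line-0001',
  shabadId: 'shabad-0001',
  gurmukhi: 'ਸਤਿ ਨਾਮੁ',
  firstLetters: 'sn',
  writer: { id: 1, name: 'Guru Nanak Dev Ji' },
  translations: [
    { language: 'english', text: 'True is the Name' },
  ],
  transliterations: [
    { language: 'english', text: 'sat naam' },
  ],
}

// A consumer can render a result without any runtime joins.
console.log(`${lineDoc.gurmukhi}: ${lineDoc.translations[0].text}`)
```

The trade-off is exactly the one discussed below: storage grows with the embedded duplication, but reads become single fetches.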

We'd want to expose this functionality through @shabados/database and expose any of our search features as wrappers, as we currently do. I would be personally happy with dropping the SQLite dependency from the JS library entirely, if need be.

Other options

  • Denormalisation: I did consider producing a denormalised SQLite output, but I suspect it'd be far too large for offline storage on consumer devices.

Design Considerations

  • Doesn't require too much more space than our current DB (~150MB normalised)
  • Fixed, local dataset, so write speed/concurrency/scale issues do not matter
  • Full-text/fuzzy search is a bonus (but a full API should really deal with this)
  • Prioritise ease of use and query speed for shabads, banis, and lines; anyone who'd like to use the data for other things can use the SQLite build
  • Differential updates/sync in the future, so an API server or master instance could sit somewhere and stream DB updates to everyone
  • Usable on desktop and mobile
  • Local JS API publishable in a similar manner to the existing @shabados/database
  • Partial sync

Please post any additional considerations.

Solutions

Comments for suitable technologies and products would be appreciated.

I'm a big fan of this. Have you also considered a supplementary LMDB for enhanced performance? Something specific to desktop/mobile, and perhaps not for other database users.

Check this out: https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database. It's a key-value store, but with improvements that narrow down the key faster. If we used it for all potential search queries it would be gross; we'd be talking gigabytes of storage. However, if we used it smartly (perhaps using serialized IDs/numbers for values, and incomplete lists), we might squeeze out even more performance than a document-based DB would.
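Using an in-memory Map as a stand-in for an LMDB environment, the serialized-IDs idea might look like this. The encoding scheme (packing numeric line IDs into a `Uint32Array` buffer) is a hypothetical choice, not anything the project has settled on:

```javascript
// Stand-in for an LMDB store: keys are first-letter queries,
// values are compact serialized arrays of numeric line IDs.
// The Uint32Array packing below is a hypothetical encoding.
const store = new Map()

const encodeIds = (ids) => Buffer.from(Uint32Array.from(ids).buffer)
const decodeIds = (buf) =>
  Array.from(new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4))

// Pre-generate: map a first-letters query to the IDs of matching lines.
store.set('hggm', encodeIds([101, 2045, 30977]))

// Lookup is a single key fetch: no joins, no LIKE scans.
const ids = decodeIds(store.get('hggm')) // ids is [101, 2045, 30977]
```

Storing only IDs (an "incomplete list") keeps values tiny; the full documents would be fetched from a second store keyed by ID.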

> Differential updates/sync in the future, so there could be an API server or master instance sitting somewhere, and streams DB updates to everyone

Will we control which sources are included in this? Can the developer update/sync only a few sources/publications for a variable number of compositions? Can the user take only a few sources offline? Etc.

Notes so far

  • Search indexes use a LOT of storage if you pregenerate the possible results for every query (about 86MB just for the SGGS, with only line IDs attached).
    • 2.5ms for fetching 50 results from the search table in SQLite. About 33ms with a LIKE query instead.
    • Compression would make repeating/denormalisation much more palatable.
  • A PouchDB is a single index. You need multiple to store different "views".
    • It's not amazingly fast (40ms) to fetch the list of results as ids from the index and then request the full docs from another PouchDB.
    • It is very fast to fetch everything in one shot (5ms?)
  • Denormalising SQLite entirely to reduce joins speeds up fetches (6-15ms) at the expense of storage: 600MB including transliterations + translations, 158MB when joining only translations and transliterations to the denormalised lines, and 20MB for just Gurbani.
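The pregenerated-index idea in the notes above can be sketched as follows: for every line, emit each prefix of its first-letters string as a lookup key pointing at the matching line IDs. Whether to index prefixes only or all substrings is a design choice that drives the storage cost; this sketch uses prefixes, and the data is illustrative:

```javascript
// Build a pregenerated search index: every prefix of a line's
// first-letters string becomes a key mapping to matching line IDs.
// A real build would write this out to SQLite or a key-value store.
const lines = [
  { id: 1, firstLetters: 'hggm' },
  { id: 2, firstLetters: 'hgs' },
  { id: 3, firstLetters: 'sn' },
]

const buildIndex = (lines) => {
  const index = new Map()
  for (const { id, firstLetters } of lines) {
    for (let len = 1; len <= firstLetters.length; len++) {
      const query = firstLetters.slice(0, len)
      if (!index.has(query)) index.set(query, [])
      index.get(query).push(id)
    }
  }
  return index
}

const index = buildIndex(lines)
// 'hg' resolves to lines 1 and 2 with a single key lookup.
const matches = index.get('hg')
```

Attaching only line IDs (as in the 86MB figure above) keeps each entry small; the documents themselves are fetched in a second step.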

Perhaps we only need to solve one issue, aside from dropping the ORM: the smart sorting.

```sql
select * from `line_content`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `line_content`.`first_letters` like '%a%'
limit 50
```

op: 12.902ms

vs

```sql
select * from `line_content`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `line_content`.`first_letters` like '%a%'
order by CASE WHEN first_letters LIKE 'a' THEN 0 ELSE 1 END asc
limit 50
```

op: 297.052ms

The CASE WHEN sorting is pretty slow... Maybe giving 100-150MB to a search index isn't such a bad idea...
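One hedged alternative to the ORDER BY CASE WHEN, before reaching for a dedicated index: split the search into two cheap queries, exact first-letter matches first, then the broader LIKE matches to fill the page. The `queryDb` helper below is hypothetical, standing in for whatever synchronous query function wraps SQLite:

```javascript
// Sketch: replace the slow ORDER BY CASE WHEN with two cheap queries.
// Exact matches come back first; the broader LIKE query tops up the
// page. `queryDb(sql, params)` is a hypothetical query helper that
// returns an array of rows.
const searchLines = (queryDb, letters, limit = 50) => {
  const exact = queryDb(
    'select * from line_content where first_letters = ? limit ?',
    [letters, limit],
  )
  if (exact.length >= limit) return exact

  const rest = queryDb(
    'select * from line_content where first_letters like ? and first_letters != ? limit ?',
    [`%${letters}%`, letters, limit - exact.length],
  )
  return [...exact, ...rest]
}
```

Both queries keep their LIMIT, so neither has to sort the full match set; the "exact first" ordering falls out of the concatenation.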

With a generated search index:

```sql
select * from `search_only`
inner join `line_content` on `line_content`.`line_id` = `search_only`.`line_id`
inner join `lines` on `lines`.`id` = `line_content`.`line_id`
inner join `shabads` on `shabads`.`id` = `lines`.`shabad_id`
inner join `writers` on `writers`.`id` = `shabads`.`writer_id`
inner join `translations` on `translations`.`line_id` = `line_content`.`line_id`
inner join `transliterations` on `transliterations`.`line_id` = `line_content`.`line_id`
where `search_only`.`query` = 'hggm'
limit 50
```
op: 11.707ms

> Maybe giving 100-150mb to a search index isn't such a bad idea

For offline support, if we plan to bundle this in @shabados/mobile, wouldn't that just make the app size even bigger? Is there a way we can split it into smaller chunks to keep it down?

Indeed! Which is why Partial Sync support is planned for the next release!