wz1000 / HieDb

Generates a references DB from .hie files

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Indexing `typerefs` produces duplicates

josephsumabat opened this issue · comments

Indexing type refs seems to produce complete duplicate rows in the database.

Reproducible on https://github.com/josephsumabat/static-ls by running hiedb -D .hiedb index .hiefiles after building.

sqlite> select count(*) from typerefs;
3413
sqlite> select count(*) from (select distinct * from typerefs);
2090

On larger projects this can be an over 10x difference which greatly degrades the performance of indexing since so many unnecessary inserts are performed.

This bug report really piqued my interest.
I tried indexing our modest codebase (~40k LOC of Haskell)
After indexing we have
1 million rows in typerefs, but only 50k actually distinct rows.

Where is all of that duplication coming from?

This screenshot illustrates the issue for the most often repeated typeref within our codebase.
It represents 2384 references to the type ghc-prim:GHC.Types:Type, that are all hiding behind this small "deriving JSON" src span.

Screenshot from 2024-02-07 16-14-05

Do you think it would be possible to tweak the indexing process to avoid all this duplication (because it only bloats the db, makes the inserts and lookups unnecessarily slower)?

I can look into that if you consider that as potentially promissing direction.

It would certainly be worthwhile to look into this.

We noticed something similar on Mercury's code base as well (similar with something like 17 million rows but similar order of magnitude of thousands of unique rows). I did take some time to look into it a year ago but the --skip-typerefs feature (https://github.com/wz1000/HieDb/pulls?q=is%3Apr+is%3Aclosed) was originally motivated by this being a big bottleneck. Notably most ide features can be preserved without this table though.