Indexing `typerefs` produces duplicates

Question

josephsumabat opened this issue 8 months ago

Indexing type refs seems to produce exact duplicate rows in the database.

Reproducible on https://github.com/josephsumabat/static-ls by running `hiedb -D .hiedb index .hiefiles` after building.

sqlite> select count(*) from typerefs;
3413
sqlite> select count(*) from (select distinct * from typerefs);
2090

On larger projects the difference can exceed 10x, which greatly degrades indexing performance because so many unnecessary inserts are performed.
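To check how bad the duplication is on another project, the two counts above can be wrapped in a small script. This is a minimal Python sketch using the standard `sqlite3` module; the function name `duplication_stats` and the toy table columns are hypothetical and are not the real hiedb schema.

```python
import sqlite3


def duplication_stats(conn: sqlite3.Connection, table: str):
    """Return (total rows, distinct rows, total/distinct ratio) for a table."""
    # The table name is interpolated directly, so only pass trusted names.
    total = conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT count(*) FROM (SELECT DISTINCT * FROM {table})"
    ).fetchone()[0]
    ratio = total / distinct if distinct else 0.0
    return total, distinct, ratio


if __name__ == "__main__":
    # Toy in-memory table with hypothetical columns, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE typerefs (type_id INTEGER, file TEXT, line INTEGER)")
    conn.executemany(
        "INSERT INTO typerefs VALUES (?, ?, ?)",
        [(1, "A.hie", 3), (1, "A.hie", 3), (1, "A.hie", 3), (2, "B.hie", 7)],
    )
    print(duplication_stats(conn, "typerefs"))
```

Against a real index, one would presumably open the database file passed to `hiedb -D` instead of `:memory:`.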

Jan Hrcek · Answer 1 · Wed Feb 07 2024 23:25:17 GMT+0800 (China Standard Time)

This bug report really piqued my interest.
I tried indexing our modest codebase (~40k LOC of Haskell).
After indexing, we have 1 million rows in typerefs, but only 50k of them are actually distinct.

Where is all of that duplication coming from?

This screenshot illustrates the issue for the most frequently repeated typeref within our codebase.
It represents 2384 references to the type ghc-prim:GHC.Types:Type, all hiding behind this small "deriving JSON" source span.
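For anyone who wants to find the most repeated rows in their own index, grouping by every column and sorting by count is one way to do it. A rough Python sketch, assuming only that the table lives in a SQLite file; `most_repeated_rows` is a hypothetical helper, and the column names are read from the table itself rather than assumed.

```python
import sqlite3


def most_repeated_rows(conn: sqlite3.Connection, table: str, limit: int = 5):
    """Return the most frequently repeated rows as (column values..., count)."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [info[1] for info in conn.execute(f"PRAGMA table_info({table})")]
    column_list = ", ".join(f'"{name}"' for name in columns)
    # Group by all columns so that only fully identical rows are counted together.
    query = (
        f"SELECT {column_list}, count(*) FROM {table} "
        f"GROUP BY {column_list} "
        f"HAVING count(*) > 1 "
        f"ORDER BY count(*) DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()
```

Each returned tuple holds the row's column values followed by how many times that exact row occurs.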

Do you think it would be possible to tweak the indexing process to avoid all this duplication? It only bloats the db and makes inserts and lookups unnecessarily slow.

I can look into that if you consider it a potentially promising direction.
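One possible shape for such a tweak is to drop identical rows in memory before they reach the database. The following is only an illustrative Python sketch of that idea, not how hiedb (which is written in Haskell) does or would implement it; the three-column `typerefs` table and the `insert_distinct` helper are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical row shape: (type id, file, line). Not the real hiedb schema.
Row = Tuple[int, str, int]


def insert_distinct(conn: sqlite3.Connection, rows: Iterable[Row]) -> int:
    """Insert each distinct row once and return how many rows were inserted."""
    # dict.fromkeys drops repeats while keeping first-seen order.
    unique_rows = list(dict.fromkeys(rows))
    conn.executemany("INSERT INTO typerefs VALUES (?, ?, ?)", unique_rows)
    return len(unique_rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE typerefs (type_id INTEGER, file TEXT, line INTEGER)")
    rows = [(1, "A.hie", 3)] * 1000 + [(2, "B.hie", 7)]
    print(insert_distinct(conn, rows))
```

This only removes duplicates within one batch of rows; whether that matches where the duplication actually arises during indexing would need to be checked against the indexer itself.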

wz1000 · Answer 2 · Thu Feb 08 2024 00:00:35 GMT+0800 (China Standard Time)

It would certainly be worthwhile to look into this.

josephsumabat · Answer 3 · Thu Feb 08 2024 03:18:36 GMT+0800 (China Standard Time)

We noticed something similar on Mercury's codebase as well (something like 17 million rows in total, but unique rows only on the order of thousands). I did take some time to look into it a year ago; the --skip-typerefs feature (https://github.com/wz1000/HieDb/pulls?q=is%3Apr+is%3Aclosed) was originally motivated by this being a big bottleneck. Notably, most IDE features can be preserved without this table.