Performance ideas
eatonphil opened this issue
Catchall for now for potential improvements to datastation/dsq.
- SQL pre-processing
  - Import only used fields (see #71)
  - Do pre-filtering of data in SQLiteWriter: only insert rows that match the WHERE clause
- Support more input types using SQLiteWriter; this basically requires supporting expanded nested objects (see notes in #67)
- Maybe handle jsonl in parallel, since a newline cannot appear inside an individual JSON record and so every newline is a safe split point
- Get rid of map[string]any inside datastation
  - At the very least put WriteRecord into the ResultWriter interface so SQLiteWriter can avoid map[string]any, which it converts from anyway
- CSV parser improvements
  - Find a simdcsv Go implementation (https://github.com/minio/simdcsv is abandoned) or write a wrapper around https://github.com/geofflangdale/simdcsv
  - Maybe an easier first step: write a parser that handles CSVs when there are no quotes and falls back to encoding/csv otherwise
  - Or actually investigate why encoding/csv is slow
- Add benchmarks for every file format, not just CSV. Basically every file format needs to be worked on individually.