multiprocessio / dsq

Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

dsq --schema missing array in 11GB file

mccorkle opened this issue

Describe the bug and expected behavior

In my testing with large datasets, there is at least one array of objects that --schema does not report: the array begins on line 1,326,612,715 of the 1,495,055,188 lines in an 11GB file.

Is it possible that --schema only reviews the first X lines or bytes of a file? If so, is there any way to override that?
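
One way to test that hypothesis is to generate a file where a key appears only in the very last element and see whether --schema reports it. Here is a rough Go sketch of such a generator (the file name late_field.json, the key late_field, and the row count are all made up for the experiment):

    // Writes a JSON array in which one key appears only in the final
    // element, to check whether `dsq --schema` inspects the whole file.
    package main

    import (
        "bufio"
        "fmt"
        "os"
    )

    func main() {
        f, err := os.Create("late_field.json")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        w := bufio.NewWriter(f)
        defer w.Flush()

        const rows = 10_000_000 // increase until the file reaches several GB
        w.WriteString("[\n")
        for i := 0; i < rows; i++ {
            if i < rows-1 {
                fmt.Fprintf(w, `{"id": %d, "name": "row-%d"},`+"\n", i, i)
            } else {
                // Only the last element carries this key; a prefix-only
                // scan will never see it.
                fmt.Fprintf(w, `{"id": %d, "name": "row-%d", "late_field": true}`+"\n", i, i)
            }
        }
        w.WriteString("]\n")
    }

If dsq --schema --pretty late_field.json omits late_field, that would confirm only a prefix of the file is inspected.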

Reproduction steps
With an 11GB (or larger) file:
dsq --schema --pretty LARGE_FILE.json

Versions

  • OS: Ubuntu 22.04 LTS, AMD EPYC 7R32
  • Shell: bash
  • dsq version: dsq 0.20.2 from apt

Hey! Thanks for the report. Yeah, datastation/dsq does sampling to get reasonable performance. It might make sense to sample more of a larger file, but then performance would get much worse. Overall I don't yet have a great strategy for dealing with very large files.
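
For context, the tradeoff conceptually looks like the sketch below. This is illustrative only, not dsq's actual code, and the sampleSize cutoff is invented: the schema is unioned over the first N elements of a top-level array, so anything after the cutoff is invisible.

    // Conceptual sketch of prefix sampling for schema inference (not
    // dsq's real implementation): union the keys of the first
    // sampleSize elements, then stop reading.
    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    func main() {
        const sampleSize = 1000 // hypothetical cutoff

        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()

        dec := json.NewDecoder(f)
        if _, err := dec.Token(); err != nil { // consume the opening '['
            panic(err)
        }

        schema := map[string]string{} // key -> type of the first value seen
        for i := 0; i < sampleSize && dec.More(); i++ {
            var row map[string]any
            if err := dec.Decode(&row); err != nil {
                panic(err)
            }
            for k, v := range row {
                if _, seen := schema[k]; !seen {
                    schema[k] = fmt.Sprintf("%T", v)
                }
            }
        }
        fmt.Println(schema)
    }

The cost stays proportional to sampleSize no matter how big the file is, which is why an array that first appears over a billion lines in is never seen.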

Before I discovered DataStation, the way I had imagined building my own tool was to stream-read the file and, whenever I see an array, read only the first 3 of the array's children into memory, counting but discarding all other objects in the array until I capture the last 3.

The flaw in my plan was that if an array child didn't conform to the structure of the first and last 3 children, my report would not include it in the schema -- but the approach would still have found the schema element that datastation/dsq is missing here.
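
Roughly what I had in mind, as a sketch (illustrative only; it assumes a top-level JSON array and hard-codes the keep count of 3 from my description):

    // Stream a top-level JSON array: keep the first 3 elements verbatim,
    // count but discard the middle, and keep a rolling window of the
    // last 3.
    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    const keep = 3

    func main() {
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()

        dec := json.NewDecoder(f)
        if _, err := dec.Token(); err != nil { // consume the opening '['
            panic(err)
        }

        var head, tail []json.RawMessage
        total := 0
        for dec.More() {
            var elem json.RawMessage
            if err := dec.Decode(&elem); err != nil {
                panic(err)
            }
            total++
            if len(head) < keep {
                head = append(head, elem)
                continue
            }
            // Rolling window over the tail; everything in between is
            // counted and then dropped, so memory use stays bounded.
            tail = append(tail, elem)
            if len(tail) > keep {
                tail = tail[1:]
            }
        }
        fmt.Printf("elements=%d, kept=%d\n", total, len(head)+len(tail))
    }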

Perhaps a hybrid of your approach and mine, activated by an --array_depth=3 argument?