multiprocessio / dsq

Commandline tool for running SQL queries against JSON, CSV, Excel, Parquet, and more.

Regression between old and new version for loading parquet

Shogan opened this issue

Previously with an older release of dsq I could do a basic SQL select on a parquet file.

With the latest release (0.2.0), I get this error:

panic: Missing type equality condition for unknown merge.

Command:

./dsq ~/Downloads/part-00030.snappy.parquet "SELECT * FROM {}"

If you can't reproduce this easily I can see if I can get a sample parquet file together and attached to this.

Stacktrace:

goroutine 1 [running]:
github.com/multiprocessio/datastation/runner.shapeMerge({{0x4c4111c, 0x7}, 0x0, 0x0, 0x0, 0x0}, {{0x4c4111c, 0x7}, 0x0, 0x0, ...})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:248 +0x648
github.com/multiprocessio/datastation/runner.objectMerge({0x4bd1da0}, {0x4bd1da0})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:209 +0x1b9
github.com/multiprocessio/datastation/runner.shapeMerge({{0x4c3f476, 0x6}, 0x0, 0xc000010168, 0x0, 0x0}, {{0x4c3f476, 0x6}, 0x0, 0xc000010198, ...})
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:232 +0x115
github.com/multiprocessio/datastation/runner.getArrayShape({0x7ffeefbff919, 0x30}, {0xc0005183c0, 0x3, 0x6}, 0x16)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:268 +0x3e5
github.com/multiprocessio/datastation/runner.GetShape({0x7ffeefbff919, 0x4ac43e0}, {0x4ad1700, 0xc00051c4f8}, 0x0)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:277 +0x245
github.com/multiprocessio/datastation/runner.ShapeFromFile({0xc0005bb570, 0x4adfe60}, {0x7ffeefbff919, 0x30}, 0x2710, 0x7ffeefbff919)
	/Users/runner/go/pkg/mod/github.com/multiprocessio/datastation/runner@v0.0.0-20220121201025-e665cd7ac0fc/shape.go:328 +0x16c
main.getShape({0xc0005bb570, 0x30}, {0x7ffeefbff919, 0x30})
	/Users/runner/work/dsq/dsq/main.go:46 +0x4e
main.main()
	/Users/runner/work/dsq/dsq/main.go:202 +0xac7

Aha! Yes please do send me a sample. I thought that code path wasn't possible.

Hey @Shogan ping on a sample to help me reproduce this :/

What I'm going to do in the meantime is drop this panic. It's demonstrating a real bug but maybe you don't care about this particular column.

Instead it will just log some info about the column and you still won't be able to query that column until I fix the bug.

The main fix happens in this PR: multiprocessio/datastation#162.
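For readers following along, the "log instead of panic" behaviour described in this comment could look roughly like the sketch below. This is not the actual datastation code: the `shape` type, the `mergeColumn` and `mergeRows` functions, and the "unknown"/"varied" kinds are all hypothetical names made up for illustration. It only shows the general idea of skipping a column whose type cannot be merged rather than aborting the whole run.

```go
package main

import (
	"fmt"
	"log"
)

// shape is a hypothetical, simplified description of a column's type.
type shape struct {
	Kind string // e.g. "string", "number", "unknown"
}

// mergeColumn combines the shapes seen for one column in two rows.
// Instead of panicking on a combination it cannot handle, it returns
// ok=false so the caller can decide to skip the column.
func mergeColumn(a, b shape) (merged shape, ok bool) {
	if a.Kind == "unknown" || b.Kind == "unknown" {
		return shape{}, false
	}
	if a.Kind == b.Kind {
		return a, true
	}
	return shape{Kind: "varied"}, true
}

// mergeRows merges per-column shapes from two rows. Columns that cannot
// be merged are logged and left out of the result, so they are simply
// not available for querying.
func mergeRows(a, b map[string]shape) map[string]shape {
	out := map[string]shape{}
	for name, sa := range a {
		sb, present := b[name]
		if !present {
			out[name] = sa
			continue
		}
		m, ok := mergeColumn(sa, sb)
		if !ok {
			log.Printf("skipping column %q: cannot merge %q with %q", name, sa.Kind, sb.Kind)
			continue
		}
		out[name] = m
	}
	for name, sb := range b {
		if _, present := a[name]; !present {
			out[name] = sb
		}
	}
	return out
}

func main() {
	rowA := map[string]shape{"id": {Kind: "number"}, "ts": {Kind: "unknown"}}
	rowB := map[string]shape{"id": {Kind: "number"}, "ts": {Kind: "unknown"}}
	// The "ts" column is logged and dropped; "id" survives.
	fmt.Println(mergeRows(rowA, rowB))
}
```

The trade-off is the one stated above: the query still runs, but the skipped column cannot be selected until the underlying type is handled properly.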

Just started updating 0.6.0 for the AUR, and I'm getting a potential regression that may be related to this. For reference, the version I am updating from (0.5.0) passes all tests successfully.

Here's test output from ./scripts/test.py

STARTING: SQL count for csv pipe
  SUCCESS

STARTING: SQL count for csv file
  SUCCESS

STARTING: SQL count for tsv pipe
  SUCCESS

STARTING: SQL count for tsv file
  SUCCESS

STARTING: SQL count for parquet pipe
  FAILURE
1c1,31
< 1000
\ No newline at end of file
---
> panic: runtime error: index out of range [576457816924784844] with length 115816
>
> goroutine 1 [running]:
> github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000f30ee0, 0x562e4598cb2c)
>       github.com/goccy/go-json@v0.9.4/internal/encoder/compiler_norace.go:11 +0x1df
> github.com/goccy/go-json.encode(0xc0018ec000, {0xc000caf9e0, 0xc001898b60})
>       github.com/goccy/go-json@v0.9.4/encode.go:224 +0xd0
> github.com/goccy/go-json.marshal({0xc000caf9e0, 0xc001898b60}, {0x0, 0x0, 0x1})
>       github.com/goccy/go-json@v0.9.4/encode.go:148 +0xba
> github.com/goccy/go-json.MarshalWithOption(...)
>       github.com/goccy/go-json@v0.9.4/json.go:186
> github.com/goccy/go-json.Marshal({0xc000caf9e0, 0xc001898b60})
>       github.com/goccy/go-json@v0.9.4/json.go:171 +0x2a
> github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc00047b5c0, {0xc000caf9e0, 0xc001898b60})
>       github.com/multiprocessio/go-json@v0.0.0-20220308002443-61d497dd7b9e/encoder.go:57 +0x1dd
> github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:120 +0xc6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x562e47751fe0, 0xc000131040}, 0xc000f311d8)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:36 +0xf6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:51
> github.com/multiprocessio/datastation/runner.transformParquet({0x562e47787be8, 0xc0006cc000}, {0x562e47751fe0, 0xc000131040})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:105 +0xd8
> github.com/multiprocessio/datastation/runner.transformParquetFile({0xc000044140, 0x562e4774cee0}, {0x562e47751fe0, 0xc000131040})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:142 +0xec
> github.com/multiprocessio/datastation/runner.TransformReader({0x562e4774cee0, 0xc00047a000}, {0x0, 0x0}, {{0x562e46b7d36e, 0x562e46b79c73}, {0x0, 0x0}}, {0x562e47751fe0, 0xc000131040})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/http.go:262 +0x325
> main._main()
>       github.com/multiprocessio/dsq/main.go:211 +0x968
> main.main()
>       github.com/multiprocessio/dsq/main.go:376 +0x19
\ No newline at end of file


STARTING: SQL count for parquet file
  FAILURE
1c1,33
< 1000
\ No newline at end of file
---
> panic: runtime error: index out of range [576457833567912774] with length 115816
>
> goroutine 1 [running]:
> github.com/goccy/go-json/internal/encoder.CompileToGetCodeSet(0xc000ebef68, 0x55b244f0fb2c)
>       github.com/goccy/go-json@v0.9.4/internal/encoder/compiler_norace.go:11 +0x1df
> github.com/goccy/go-json.encode(0xc001546a90, {0xc000627920, 0xc0015095f0})
>       github.com/goccy/go-json@v0.9.4/encode.go:224 +0xd0
> github.com/goccy/go-json.marshal({0xc000627920, 0xc0015095f0}, {0x0, 0x0, 0x1})
>       github.com/goccy/go-json@v0.9.4/encode.go:148 +0xba
> github.com/goccy/go-json.MarshalWithOption(...)
>       github.com/goccy/go-json@v0.9.4/json.go:186
> github.com/goccy/go-json.Marshal({0xc000627920, 0xc0015095f0})
>       github.com/goccy/go-json@v0.9.4/json.go:171 +0x2a
> github.com/multiprocessio/go-json.(*StreamEncoder).EncodeRow(0xc000592600, {0xc000627920, 0xc0015095f0})
>       github.com/multiprocessio/go-json@v0.0.0-20220308002443-61d497dd7b9e/encoder.go:57 +0x1dd
> github.com/multiprocessio/datastation/runner.transformParquet.func1(0x0)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:120 +0xc6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriter({0x55b246cd4fe0, 0xc00015a740}, 0xc000ebf260)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:36 +0xf6
> github.com/multiprocessio/datastation/runner.withJSONArrayOutWriterFile(...)
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/json.go:51
> github.com/multiprocessio/datastation/runner.transformParquet({0x55b246d0abe8, 0xc000152ab0}, {0x55b246cd4fe0, 0xc00015a740})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:105 +0xd8
> github.com/multiprocessio/datastation/runner.transformParquetFile({0x7ffc520f99c6, 0x1b}, {0x55b246cd4fe0, 0xc00015a740})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:142 +0xec
> github.com/multiprocessio/datastation/runner.TransformFile({0x7ffc520f99c6, 0x1b}, {{0x0, 0x1ff}, {0x0, 0xc000b7f428}}, {0x55b246cd4fe0, 0xc00015a740})
>       github.com/multiprocessio/datastation/runner@v0.0.0-20220308165251-fb5006e70c36/file.go:554 +0x1ab
> main.evalFileInto({0x7ffc520f99c6, 0x1b}, 0x0)
>       github.com/multiprocessio/dsq/main.go:47 +0xc5
> main._main()
>       github.com/multiprocessio/dsq/main.go:236 +0xb29
> main.main()
>       github.com/multiprocessio/dsq/main.go:376 +0x19
\ No newline at end of file

Rats! No, I don't think it's related. I redid the way JSON encoding/decoding works, so it's not surprising there's a bug. But it is surprising that it's in one of the files that are tested in automated testing!

I'm having trouble reproducing this though. In GitHub Actions this test passes, as it does on my MBP and my Fedora Linux dev machine.

I also tried building dsq and running the tests in an archlinux container and the tests passed.

Can you tell me any more about your machine/environment? I'm surprised it worked for you before and now breaks.

Hi @eatonphil , sorry it took me so long - I was on vacation for a while and not checking notifications. I've had a look at the parquet I was querying and unfortunately I can't provide it here easily as it has sensitive data.

However, in trying to load it up and edit it to remove said data using a viewer tool, I got an exception about INT96 being unsupported. So this parquet data I'm working with uses INT96. Could that be an issue? I believe it is incompatible with Avro.

I used the same version of dsq that I used when I started this thread to load some other parquet file examples I found here and it worked fine for these.
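As background on the INT96 question (the thread does not confirm that INT96 is the cause of the panic): INT96 is a deprecated 12-byte Parquet physical type that older writers commonly use for timestamps. In that convention the first 8 bytes hold the nanoseconds within the day and the last 4 bytes hold the Julian day number, both little-endian. The sketch below decodes such a value with the Go standard library only. The function name `int96ToTime` is made up for illustration and is not part of dsq or any Parquet library.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// julianDayOfUnixEpoch is the Julian day number of 1970-01-01.
const julianDayOfUnixEpoch = 2440588

// int96ToTime decodes a 12-byte INT96 timestamp laid out as
// 8 bytes of nanoseconds-within-day followed by 4 bytes of Julian day,
// both little-endian. The name and signature are illustrative only.
func int96ToTime(b [12]byte) time.Time {
	nanos := binary.LittleEndian.Uint64(b[0:8])
	julianDay := binary.LittleEndian.Uint32(b[8:12])
	days := int64(julianDay) - julianDayOfUnixEpoch
	return time.Unix(days*86400, int64(nanos)).UTC()
}

func main() {
	// Julian day 2440588 (0x253D8C) with zero nanoseconds is the Unix epoch.
	epoch := [12]byte{0, 0, 0, 0, 0, 0, 0, 0, 0x8C, 0x3D, 0x25, 0x00}
	fmt.Println(int96ToTime(epoch))
}
```

A reader that has no mapping for this type would have to either reject the column or skip it, which is consistent with the viewer tool's "unsupported" exception mentioned above.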

Ok I'll try making a dataset with INT96 in it and see if that causes an issue.

But also if you don't need that column right now newer versions of dsq won't crash when this happens. They'll just not be able to load that column for querying.

Nice! Confirmed version 0.6.0 works. Thanks @eatonphil 🎉

I can reproduce it on two different machines, both running Arch Linux with Go 1.18, covering both Intel and AMD CPUs.

Basically just running in a clean chroot (based on systemd-nspawn) following our Go packaging guidelines, so the flags could be a thing (but I doubt it).

I've currently skipped the failing parquet tests.

Gotcha! It's the -buildmode=pie flag that is exposing this crash (I don't know whether or not to say it's causing the crash).

I'm going to make a separate issue about -buildmode=pie. I don't know whether I'll be able to fix it though.
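When comparing environments for a flag-dependent crash like this one, it can help to have the binary report how it was built. The sketch below uses the standard library's `runtime/debug.ReadBuildInfo`, which is available from Go 1.18. Whether a `-buildmode` entry is actually recorded depends on the toolchain version, so the helper returns an empty string when it is missing. The `buildMode` helper is a hypothetical name and nothing here is part of dsq.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// buildMode returns the value of the "-buildmode" build setting,
// or "" if the toolchain did not record one.
func buildMode(settings []debug.BuildSetting) string {
	for _, s := range settings {
		if s.Key == "-buildmode" {
			return s.Value
		}
	}
	return ""
}

func main() {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		fmt.Println("no build info embedded in this binary")
		return
	}
	fmt.Printf("go version: %s, buildmode: %q\n", info.GoVersion, buildMode(info.Settings))
}
```

Building the same program once with the default settings and once with `go build -buildmode=pie` and comparing the output is one way to confirm which mode a packaged binary was built with.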