jcrobak / parquet-python

python implementation of the parquet columnar file format.

Infinite loop for Impala-generated file

spaztic1215 opened this issue

Hi there,

I was wondering what condition would cause an infinite loop in this while-loop block: https://github.com/jcrobak/parquet-python/blob/master/parquet/__init__.py#L354-L360

We're using the following file, which we generated from Impala: https://www.dropbox.com/s/kah986gqjt7mrnr/movies.0.parquet. At the point where it reads bytes 65278 -> 112466, the reader gets stuck in an endless loop because the values stop updating. We've been able to read smaller Impala-generated files, so we're not sure whether this is a limitation related to file size (the file is 100MB+, but there are only 5 columns of data).
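One way a loop of this kind can hang is when an iteration consumes no bytes, so the stream position never reaches the end offset. Here is a minimal sketch of a guard that fails loudly instead of spinning. This is not the actual parquet-python code; `read_values`, `read_one`, and `end_offset` are hypothetical names used only for illustration.

```python
import io


def read_values(stream, end_offset, read_one):
    """Call read_one(stream) until the stream position reaches end_offset.

    Hypothetical helper: raises instead of looping forever when an
    iteration does not advance the stream position.
    """
    values = []
    while stream.tell() < end_offset:
        before = stream.tell()
        values.append(read_one(stream))
        if stream.tell() == before:
            # No bytes were consumed, so the loop condition can never change.
            raise RuntimeError(
                "no progress at byte %d (expected to reach %d)"
                % (before, end_offset)
            )
    return values


# A reader that consumes one byte per call finishes normally.
data = io.BytesIO(b"\x01\x02\x03")
print(read_values(data, 3, lambda s: s.read(1)[0]))  # [1, 2, 3]
```

With a guard like this, a stalled read would surface as an exception that names the byte offset, which might make it easier to narrow down.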

Any insight would be hugely appreciated, thanks!

Jenny

Hi Jenny, thanks for the report. The problem seems to be that I haven't implemented support for null values (via definition_levels) for the encoding used by the rating column in that file.

I should have a fix shortly. I'd like to add some regression tests to ensure this bug doesn't pop up again.
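To illustrate what the definition levels do: for a flat optional column, Parquet stores one definition level per row but only stores the non-null values, so a reader that expects one stored value per row will ask for more values than the page contains. Here is a rough sketch of the reassembly step, assuming a non-nested optional column whose maximum definition level is 1. `assemble_column` is a hypothetical name, not a function in parquet-python, and the eventual fix may look different.

```python
def assemble_column(definition_levels, values, max_definition_level=1):
    """Merge definition levels with the stored non-null values.

    Assumes a flat (non-nested) optional column: a level equal to
    max_definition_level means the value is present, anything lower
    means the row is null.
    """
    remaining = iter(values)
    column = []
    for level in definition_levels:
        if level == max_definition_level:
            column.append(next(remaining))  # value is present in the page
        else:
            column.append(None)  # null row: no value was stored
    return column


# Five rows, but only three values are actually stored.
print(assemble_column([1, 0, 1, 1, 0], [4.5, 3.0, 5.0]))
# [4.5, None, 3.0, 5.0, None]
```

In this example the page holds three values for five rows, which is the kind of mismatch that a reader without null support would not account for.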