jcrobak / parquet-python

python implementation of the parquet columnar file format.

Structures cause error

SergeNov opened this issue

Hi Joe and others,
I am trying to use your module to read a parquet file, and I ran into a problem here:
schema.py, line 21:
assert len(self.schema_elements) == len(self.schema_elements_by_name)
Apparently the init method assumes that element names are unique, but my schema has multiple elements with the same name, so the assertion fails. The module works correctly if you comment out this line, though.
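A minimal sketch of why an assertion of this shape would trip, assuming `schema_elements_by_name` is a plain dict keyed by the bare element name (an assumption; the module's real construction is not shown in this thread). `Element` is a hypothetical stand-in for the Thrift `SchemaElement` objects, and the names are taken from the `campaignsinfo` part of the schema listed further down.

```python
from collections import namedtuple

# Hypothetical stand-in for the Thrift SchemaElement objects the module reads.
Element = namedtuple("Element", ["name", "num_children"])

# The list-related part of the schema, in file order.
schema_elements = [
    Element("campaignsinfo", 1),
    Element("bag", 1),
    Element("array_element", 6),
    Element("media_types", 1),
    Element("bag", 1),
    Element("array_element", None),
]

# Keying by bare name keeps only the last element seen for each name,
# so the two 'bag' and two 'array_element' entries collapse.
schema_elements_by_name = {se.name: se for se in schema_elements}


def names_are_unique(elements, by_name):
    """Mirror of the check in the reported assert."""
    return len(elements) == len(by_name)


print(len(schema_elements), len(schema_elements_by_name))
print(names_are_unique(schema_elements, schema_elements_by_name))
```

Commenting out the assert hides the mismatch, but any lookup by bare name would still only reach one of the two `bag` or `array_element` entries.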
Originally these files were used by Hive, and here is the list of fields in the table:

fileid bigint,
version bigint,
ip_geocode struct<countrycode:string,regionname:string,city:string,postalcode:string,metrocode:string,dmacode:string>,
timestamp bigint,
region bigint,
pixel bigint,
uuid bigint,
uuid_exists boolean,
referingurl string,
useragent string,
ip string,
querystring string,
campaignsinfo array<struct<campaign_id:bigint,media_types:array<bigint>,advertiser_id:bigint,funnel_step_id:bigint,funnel_step_value:bigint,track_conversion:boolean>>,
opted_out boolean,
event_id string

Here is the list of fields that the module sees:

name=u'hive_schema', field_id=None, repetition_type=None, type_length=None, precision=None, num_children=17, converted_type=None, type=None
name=u'fileid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'version', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'ip_geocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'countrycode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'regionname', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'city', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'postalcode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'metrocode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dmacode', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'timestamp', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'region', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'pixel', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'uuid_exists', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'referingurl', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'useragent', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'ip', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'querystring', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'campaignsinfo', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=6, converted_type=None, type=None
name=u'campaign_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'media_types', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=1, converted_type=3, type=None
name=u'bag', field_id=None, repetition_type=2, type_length=None, precision=None, num_children=1, converted_type=None, type=None
name=u'array_element', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'advertiser_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'funnel_step_value', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=2
name=u'track_conversion', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'opted_out', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=0
name=u'event_id', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=6
name=u'dt', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
name=u'hr', field_id=None, repetition_type=1, type_length=None, precision=None, num_children=None, converted_type=None, type=1
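The numeric codes in this dump can be read against the enums in the Parquet Thrift definition. The sketch below decodes one element into a readable line; the `describe` helper and its parameter names are hypothetical and not part of parquet-python, and the `CONVERTED_TYPE` table is deliberately partial.

```python
# Enum values from the Parquet Thrift definition (parquet.thrift).
PHYSICAL_TYPE = {
    0: "BOOLEAN", 1: "INT32", 2: "INT64", 3: "INT96",
    4: "FLOAT", 5: "DOUBLE", 6: "BYTE_ARRAY", 7: "FIXED_LEN_BYTE_ARRAY",
}
REPETITION = {0: "REQUIRED", 1: "OPTIONAL", 2: "REPEATED"}
CONVERTED_TYPE = {0: "UTF8", 1: "MAP", 2: "MAP_KEY_VALUE", 3: "LIST"}  # partial


def describe(name, repetition_type=None, type_=None,
             converted_type=None, num_children=None):
    """Hypothetical helper: render one schema element as a readable string."""
    parts = []
    if repetition_type is not None:
        parts.append(REPETITION[repetition_type])
    # An element with no physical type is a group (struct, list wrapper, or root).
    parts.append(PHYSICAL_TYPE[type_] if type_ is not None else "group")
    parts.append(name)
    if converted_type is not None:
        parts.append("(" + CONVERTED_TYPE.get(converted_type, str(converted_type)) + ")")
    if num_children:
        parts.append("children=%d" % num_children)
    return " ".join(parts)


print(describe("fileid", repetition_type=1, type_=2))
print(describe("campaignsinfo", repetition_type=1, converted_type=3, num_children=1))
print(describe("bag", repetition_type=2, num_children=1))
```

Read this way, `campaignsinfo` and `media_types` are LIST-annotated groups, and each one wraps a REPEATED group named `bag`.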

Apparently there are two elements named 'bag' and two named 'array_element'. I assume these fields just come with structures.
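One way to tell the repeated names apart is to key elements by their full path instead of the bare name. The sketch below rebuilds dotted paths from the flattened, depth-first list using `num_children`; it is an illustration under that assumption, not how parquet-python resolves names, and `dotted_paths` is a hypothetical helper.

```python
def dotted_paths(elements):
    """Return the dotted path of every non-root element.

    elements: list of (name, num_children) tuples in file order, root first.
    """
    paths = []
    pos = 1  # index 0 is the root, which is left out of the paths

    def visit(prefix):
        nonlocal pos
        name, num_children = elements[pos]
        pos += 1
        path = prefix + (name,)
        paths.append(".".join(path))
        # Children follow their parent directly in the flattened list.
        for _ in range(num_children or 0):
            visit(path)

    _, root_children = elements[0]
    for _ in range(root_children or 0):
        visit(())
    return paths


# A fragment of the schema above, with the root reduced to two children.
elements = [
    ("hive_schema", 2),
    ("campaignsinfo", 1),
    ("bag", 1),
    ("array_element", 6),
    ("campaign_id", None),
    ("media_types", 1),
    ("bag", 1),
    ("array_element", None),
    ("advertiser_id", None),
    ("funnel_step_id", None),
    ("funnel_step_value", None),
    ("track_conversion", None),
    ("opted_out", None),
]

paths = dotted_paths(elements)
for p in paths:
    print(p)
```

With path keys, the outer `campaignsinfo.bag` and the inner `campaignsinfo.bag.array_element.media_types.bag` no longer collide, so a uniqueness check on paths would pass for this schema.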

@SergeNov thanks for the report. I'll attempt to reproduce and fix the issue.

@SergeNov I've started to work on support for schemas like these. The first step is in #45, if you want to give it a try. Unfortunately, I don't think your schema is fully supported yet because it includes an array.

Still experiencing this issue in version 1.3.1.