bxparks / bigquery-schema-generator

Generates the BigQuery schema from newline-delimited JSON or CSV data records.

schema inference involving nulls and arrays produces inconsistent results

dannynoar opened this issue

The following call:
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':None}, {'1':['c','d','e']} ])
Produces OrderedDict([('1', None)])

Other calls of a similar nature produce inconsistent results:
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':['c','d','e']} ])
Produces OrderedDict([('1', OrderedDict([('status', 'hard'), ('filled', True), ('info', OrderedDict([('mode', 'REPEATED'), ('name', '1'), ('type', 'STRING')]))]))])

And
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':None} ])
Produces OrderedDict([('1', OrderedDict([('status', 'soft'), ('filled', False), ('info', OrderedDict([('mode', 'NULLABLE'), ('name', '1'), ('type', 'STRING')]))]))])

The specific issue I have involves a column that is about 90% nulls and 10% string arrays. It produces the third result above (mode 'NULLABLE'), when I'd hoped it would result in something with a mode of 'REPEATED'.

Very interesting.
You are getting different results because the script is choosing to ignore different rows due to what it considers to be incompatible types between various rows. If you print out the errors array that is returned by deduce_schema(), it should tell you which lines are being ignored.

But the real question is, should null values be allowed in fields which are supposed to be REPEATED? My memory is hazy, but I could have sworn that when I created bigquery-schema-generator back in 2017/2018, BigQuery did not allow null values for a repeated field (because everything is eventually stored as protocol buffers inside Google, and protobufs don't allow nullable repeated fields).

But I tried uploading your data samples to BigQuery using bq load --autodetect, and what do you know, it actually supports nulls in repeated JSON fields. For example, I tried loading:

{"array":null}
{"array":["a","b"]}
{"array":null}
{"array":["c","d","e"]}

with

$ bq load --source_format NEWLINE_DELIMITED_JSON --replace --autodetect tmp.nulls nulls.json

and bq load deduces the schema to be:

[
  {
    "mode": "REPEATED",
    "name": "array",
    "type": "STRING"
  }
]

As far as I can tell, BigQuery seems to consider null to be identical to the empty array [], so the following JSON data produces an identical schema and loads the data records into the database. The [] values are displayed as null in the BigQuery Cloud Console:

{"array":[]}
{"array":["a","b"]}
{"array":[]}
{"array":["c","d","e"]}
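Given that equivalence, one possible workaround on older versions (a hypothetical helper, not part of the library) is to normalize nulls to empty lists before deducing the schema, so every row in a known-array column consistently looks REPEATED:

```python
def normalize_nulls(rows, array_columns):
    """Return copies of rows with None replaced by [] in the given
    array columns; the input rows are left untouched."""
    normalized = []
    for row in rows:
        fixed = dict(row)
        for col in array_columns:
            if fixed.get(col) is None:
                fixed[col] = []
        normalized.append(fixed)
    return normalized

rows = [{'1': None}, {'1': ['a', 'b']}, {'1': None}, {'1': ['c', 'd', 'e']}]
clean = normalize_nulls(rows, ['1'])
# Every value in column '1' is now a list, so a schema deducer that
# treats lists as REPEATED no longer sees conflicting row types.
```

This mirrors what BigQuery itself appears to do when it treats null and [] as the same thing for repeated fields.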

So I'm going to upload some code which I think fixes this. More details in the next post.

Can you sync to the latest develop branch and try out this code? I have to be honest and say that I don't work with BigQuery anymore, and I'm finding it quite difficult to understand and maintain my own code, especially when we get into these edge cases. But my tests seem to pass, and this seems to handle your situation.

BTW, I don't think BigQuery column names can be a digit like "1". I'm pretty sure that it needs to start with a letter.

Will test it out soon when I get some free time. Thanks for working on it.

Fixed with v1.6.0.