bxparks / bigquery-schema-generator

Generates the BigQuery schema from newline-delimited JSON or CSV data records.

schema inference involving nulls and arrays produces inconsistent results

dannynoar opened this issue

The following call:
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':None}, {'1':['c','d','e']} ])
Produces OrderedDict([('1', None)])

Other calls of a similar nature produce inconsistent results:
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':['c','d','e']} ])
Produces OrderedDict([('1', OrderedDict([('status', 'hard'), ('filled', True), ('info', OrderedDict([('mode', 'REPEATED'), ('name', '1'), ('type', 'STRING')]))]))])

And
generator.deduce_schema([ {'1':None}, {'1':['a','b']}, {'1':None} ])
Produces OrderedDict([('1', OrderedDict([('status', 'soft'), ('filled', False), ('info', OrderedDict([('mode', 'NULLABLE'), ('name', '1'), ('type', 'STRING')]))]))])

The specific issue I have involves a column that is about 90% nulls and 10% string arrays. It produces the third result above (mode 'NULLABLE'), when I'd hoped it would result in something with a mode of 'REPEATED'.

Very interesting.
You are getting different results because the script is choosing to ignore different rows due to what it considers to be incompatible types between various rows. If you print out the errors array that is returned by deduce_schema(), it should tell you which lines are being ignored.

But the real question is, should null values be allowed in fields which are supposed to be REPEATED? My memory is hazy, but I could have sworn that when I created bigquery-schema-generator back in 2017/2018, BigQuery did not allow null values for a repeated field (because everything is eventually stored as protocol buffers inside Google, and protobufs don't allow nullable repeated fields).

But I tried uploading your data samples to BigQuery using bq load --autodetect, and what do you know, it actually supports nulls in repeated JSON fields. For example, I tried loading:

{"array":null}
{"array":["a","b"]}
{"array":null}
{"array":["c","d","e"]}

with

$ bq load --source_format NEWLINE_DELIMITED_JSON --replace --autodetect tmp.nulls nulls.json

and bq load deduces the schema to be:

[
  {
    "mode": "REPEATED",
    "name": "array",
    "type": "STRING"
  }
]

As far as I can tell, BigQuery seems to consider null to be identical to the empty array [], so the following JSON data produces an identical schema and loads the data records into the database. The [] values are displayed as null in the BigQuery Cloud Console:

{"array":[]}
{"array":["a","b"]}
{"array":[]}
{"array":["c","d","e"]}
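Given that equivalence, one possible workaround on older versions (a hypothetical helper, not part of the library) is to normalize nulls to empty lists before deducing the schema, so every row in a known-array column consistently looks REPEATED:

```python
def normalize_nulls(rows, array_columns):
    """Return copies of rows with None replaced by [] in the given
    array columns; the input rows are left untouched."""
    normalized = []
    for row in rows:
        fixed = dict(row)
        for col in array_columns:
            if fixed.get(col) is None:
                fixed[col] = []
        normalized.append(fixed)
    return normalized

rows = [{'1': None}, {'1': ['a', 'b']}, {'1': None}, {'1': ['c', 'd', 'e']}]
clean = normalize_nulls(rows, ['1'])
# Every value in column '1' is now a list, so a schema deducer that
# treats lists as REPEATED no longer sees conflicting row types.
```

This mirrors what BigQuery itself appears to do when it treats null and [] as the same thing for repeated fields.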

So I'm going to upload some code which I think fixes this. More details in the next post.

Can you sync to the latest develop branch and try out this code? I have to be honest and say that I don't work with BigQuery anymore, and I'm finding it quite difficult to understand and maintain my own code, especially when we get into these edge cases. But my tests seem to pass, and this seems to handle your situation.

BTW, I don't think BigQuery column names can be a digit like "1". I'm pretty sure that it needs to start with a letter.

Will test it out soon when I get some free time. Thanks for working on it.

Fixed with v1.6.0.