bxparks / bigquery-schema-generator

Generates the BigQuery schema from newline-delimited JSON or CSV data records.

[FEATURE] When a nested field has mismatched type print the full path to that nested field

abroglesc opened this issue

Summary

In a complex structure like the following:

{
  "source_machine": {
    "port": 80
  },
  "dest_machine": {
    "port": "http-port"
  }
}

If another log record then had dest_machine.port as an integer, the script would report the mismatch and simply state something like:
Ignoring field with mismatched type: old=(hard,port,NULLABLE,STRING); new=(hard,port,NULLABLE,INTEGER)

At this point you are left to figure out which structure this port column actually belongs to. This is a simple example, but as the schema grows and becomes more complex, the problem gets harder to resolve manually.

Ideally, we can track the path to the field using a JSON path or dpath expression, something like dest_machine.port. This would likely require adding an argument to the recursive function merge_schema_entry, something like base_path=None, and building up that base_path string on each recursive call so that it can be used in the errors, like "{}.{}".format(base_path, new_name) and "{}.{}".format(base_path, old_name).
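To make the idea concrete, here is a minimal sketch of threading a base_path through a recursive merge. It is not the project's actual merge_schema_entry: the names merge_fields and full_path are hypothetical, and the schema is simplified to a dict mapping field names to either a type string or a nested dict.

```python
def full_path(base_path, name):
    """Join the parent path and a field name with a dot (hypothetical helper)."""
    return name if base_path is None else "{}.{}".format(base_path, name)


def merge_fields(old_fields, new_fields, base_path=None, errors=None):
    """Merge two simplified schemas, recording type mismatches by full path.

    Each schema is a dict of {field_name: type_string_or_nested_dict}.
    This is a stand-in for illustration, not the library's data model.
    """
    if errors is None:
        errors = []
    merged = dict(old_fields)  # shallow copy so the caller's dict is untouched
    for name, new_type in new_fields.items():
        path = full_path(base_path, name)
        if name not in merged:
            merged[name] = new_type
            continue
        old_type = merged[name]
        if isinstance(old_type, dict) and isinstance(new_type, dict):
            # Nested record: recurse, passing the extended path down.
            merged[name], _ = merge_fields(old_type, new_type, path, errors)
        elif old_type != new_type:
            # Keep the old type and report the mismatch with the full path.
            errors.append(
                "Ignoring field with mismatched type: "
                "old=({},{}); new=({},{})".format(path, old_type, path, new_type)
            )
    return merged, errors


old = {"source_machine": {"port": "INTEGER"}, "dest_machine": {"port": "STRING"}}
new = {"source_machine": {"port": "INTEGER"}, "dest_machine": {"port": "INTEGER"}}
merged, errors = merge_fields(old, new)
for message in errors:
    print(message)
```

With the two records above, the only reported mismatch names dest_machine.port rather than a bare port, which is the behavior this request is asking for.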

I can't remember, does the script print out the line number of the record with the error? Does that help?

The JSON path to the error is a reasonable idea. I'm happy to review a PR if you have something in mind. Otherwise, it might take me a while to get to this, since it won't rise high on my priority list...

Oh, I understand your problem, you have 2 port fields, so the line number does not help.

@bxparks created #53 to address this.

Fixed

Pushed v1.2 to PyPI.