bxparks / bigquery-schema-generator

Generates the BigQuery schema from newline-delimited JSON or CSV data records.

Can the schema generation tool suppress case-insensitive duplicates that are not accepted by BigQuery?

deepakmalpote opened this issue

Hi

I have been trying to export asset metadata to GCS. The idea is to load the exported asset metadata into BigQuery and then visualize it in Data Studio.

However, whenever I use the Cloud Asset API (either with curl or the 'gcloud asset export' command), the generated raw JSON data file contains two fields whose names differ only in case, 'IPProtocol' and 'ipProtocol'.

Because of this, when I try to load this data into BigQuery (with either the bq mk or the bq load command), I get the following error.

$ bq mk inventory_dataset.2019_09_20_11_00_00 schema.json
BigQuery error in mk operation: Field resource.data.allowed.ipProtocol already exists in schema

Is this a bug, or am I doing something wrong?

I am using the bigquery-schema-generator tool to generate the schema (https://pypi.org/project/bigquery-schema-generator/).

Please help.

I have only limited cell phone internet access right now. As you discovered, BigQuery is case-insensitive with regard to column names. Maybe you can try renaming the duplicate column name with a 'sed' script? I can't help any further for about 1-2 weeks.

My suggestion remains the same: use a pre-processor to resolve the conflicting name. In some cases, the two field names may represent the same thing, so they should be collapsed together. In other cases, the two field names represent two different things, so they should be separated by renaming one of the fields. The bigquery-schema-generator script cannot determine which of these cases applies, so the user needs to figure this out with a separate pre-filter step. Closing.
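
For anyone landing here later, a pre-filter along these lines might look like the sketch below. It is not part of bigquery-schema-generator. The function names (resolve_case_conflicts, filter_lines), the "merge" and "rename" modes, and the "_dup" suffix are all made up for illustration. It assumes newline-delimited JSON input and that, when merging, the first spelling seen is the one to keep.

```python
import json


def resolve_case_conflicts(value, mode="merge", suffix="_dup"):
    """Recursively resolve object keys that differ only by case.

    mode="merge":  keep the first spelling seen; a later duplicate only
                   fills in when the kept value is None.
    mode="rename": keep both fields, appending `suffix` to later duplicates.
    """
    if isinstance(value, list):
        return [resolve_case_conflicts(v, mode, suffix) for v in value]
    if not isinstance(value, dict):
        return value

    result = {}
    seen = {}  # lowercased key -> spelling stored in result
    for key, child in value.items():
        child = resolve_case_conflicts(child, mode, suffix)
        folded = key.lower()
        if folded not in seen:
            seen[folded] = key
            result[key] = child
        elif mode == "rename":
            # Keep appending the suffix until the name is unique, ignoring case.
            new_key = key + suffix
            while new_key.lower() in seen:
                new_key += suffix
            seen[new_key.lower()] = new_key
            result[new_key] = child
        elif result[seen[folded]] is None:
            # Merge mode: the kept field was empty, so take the duplicate's value.
            result[seen[folded]] = child
    return result


def filter_lines(lines, mode="merge"):
    """Yield cleaned JSON strings for each non-blank input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.dumps(resolve_case_conflicts(json.loads(line), mode))


record = {"allowed": [{"IPProtocol": "tcp", "ipProtocol": "tcp"}]}
merged = resolve_case_conflicts(record)            # collapses the two fields
renamed = resolve_case_conflicts(record, "rename")  # keeps both, renames one
```

To use it as a filter, feed sys.stdin to filter_lines, write each result to stdout followed by a newline, and run the data file through it before passing the output to the schema generator. Whether to merge or rename still depends on what the two fields mean in your data.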