bxparks / bigquery-schema-generator

Generates the BigQuery schema from newline-delimited JSON or CSV data records.

CSV Delimiter Option

XenoLight opened this issue · comments

I haven't seen an option for specifying the CSV delimiter, and the schema is not generated for a pipe-delimited CSV. I was wondering if this is something that could be added.

I might be able to do the code edit myself and push it, but this would be the first project I would be contributing to, so I would want to take the time to look at all the code first.

Hi. Currently, it does not support custom delimiters for CSV, but I think it would be useful to add. If you want to take a crack at it, go for it and send me a PR. You probably want to start at the DictReader on line 174 of generate_schema.py. Add a parser.add_argument() in main() and call the flag something like --csv_delimiter. Pass the flag value into the SchemaGenerator constructor, then pass that value into the DictReader.
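
Very roughly, the plumbing could look something like the standalone sketch below. It is only the shape of the change (the real main() and SchemaGenerator constructor take many more options), not a drop-in patch:

```python
import argparse
import csv
import sys


class SchemaGenerator:
    """Stand-in for the real SchemaGenerator; only the delimiter plumbing is shown."""

    def __init__(self, csv_delimiter=','):
        self.csv_delimiter = csv_delimiter

    def deduce_schema(self, file):
        # Forward the configured delimiter to DictReader (the real call is
        # around line 174 of generate_schema.py).
        reader = csv.DictReader(file, delimiter=self.csv_delimiter)
        for row in reader:
            # The real code merges each row into a schema_map here; this
            # sketch just yields the parsed rows to show the delimiter works.
            yield row


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--csv_delimiter',
        default=',',
        help='Single-character field delimiter for CSV input (default: ",")')
    args = parser.parse_args()

    generator = SchemaGenerator(csv_delimiter=args.csv_delimiter)
    for row in generator.deduce_schema(sys.stdin):
        print(row)


if __name__ == '__main__':
    main()
```

The idea being that something like --csv_delimiter '|' would then handle your pipe-delimited files.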

Just a heads-up, I'm a bit particular about coding style and unit testing. Please be sure to run the flake8 validator and the unit tests. You can provide the most value by testing this code with actual CSV input files and verifying the expected schema output. I don't use this project these days, and I've actually never used the CSV input format (it was a contribution from someone else). We can add your test CSV samples to the tests/testdata.txt file.

So, I have made the code change locally and just need to find the time to finish the implementation, add tests, and push.

The real reason I want to use this is to build an Apache Beam pipeline that is triggered when a file is placed in a specific folder structure in GCS.

EX:
Landingzone/DatasetName/TableName/File.csv

The pipeline would run, check the landing zone, then check whether the dataset and table exist; if not, it would create them and then load the data.
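
For context, a rough sketch of that step (outside of this package) using the google-cloud-bigquery client might look like the following; the bucket, project, and path names are placeholders, and the schema list would come from this package:

```python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound


def ensure_and_load(client, gcs_uri, schema):
    """Create the dataset/table for a landed CSV if needed, then load it.

    gcs_uri looks like gs://my-bucket/Landingzone/DatasetName/TableName/File.csv
    and schema is a list of bigquery.SchemaField deduced from the file.
    """
    _, _, _, _, dataset_name, table_name, _ = gcs_uri.split('/', 6)
    dataset_id = '{}.{}'.format(client.project, dataset_name)
    table_id = '{}.{}'.format(dataset_id, table_name)

    # Create the dataset if it does not exist yet.
    try:
        client.get_dataset(dataset_id)
    except NotFound:
        client.create_dataset(dataset_id)

    # Create the table with the deduced schema if it does not exist yet.
    try:
        client.get_table(table_id)
    except NotFound:
        client.create_table(bigquery.Table(table_id, schema=schema))

    # Load the CSV file from GCS into the table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
    )
    client.load_table_from_uri(gcs_uri, table_id, job_config=job_config).result()
```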

Using this package, I was able to get the pipeline to generate the schema_map for each row. I am now looking at your code to complete the next portion, which is merging the schema of each row into one final schema. Would the function merge_schema_entry be a good starting point, since I would have to reimplement that logic using Apache Beam patterns?

Not sure I understand exactly what you want to do. Are you saying that you have multiple CSV files that are ingested into the given TableName over time, and you want to incrementally update its schema as new files come in, then import those new files into the existing TableName? If so, then you want to look at the --existing_schema_path flag, which allows you to specify the schema of the existing table. You can extract the existing schema using a bq command (that I cannot remember, but I think I mention it in the README.md), then you run generate_schema.py on the new file, passing the --existing_schema_path flag.

Oh no, I want to make a pipeline where, no matter where we get the CSV from, as long as it is properly formatted (UTF-8, quoted, etc.), I can place it in this folder structure and the pipeline will create the dataset and tables. The CSVs can come from anywhere and are not specific to one table. The goal is to speed up the process of getting the data loaded, so there is less work on our end to review each file.

Then I'm confused by what you wrote: "I was able to get the pipeline to generate the schema_map for each row." You shouldn't be generating a schema_map for just one row of a CSV; you generate the schema for the entire CSV file. You shouldn't have to deal with calling merge_schema_entry() manually, because it is called automatically for every row in your CSV file.
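
If you are calling it as a library from your Beam pipeline, a minimal sketch for one whole file looks something like this (assuming the SchemaGenerator / deduce_schema() / flatten_schema() usage; double-check the constructor options against the README):

```python
import json
from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(input_format='csv')
with open('File.csv') as f:
    # deduce_schema() iterates over every row of the file and merges each
    # row's schema entry into a single schema_map internally.
    schema_map, error_logs = generator.deduce_schema(f)

for error in error_logs:
    print(error)

# Flatten the internal schema_map into the JSON schema that BigQuery expects.
schema = generator.flatten_schema(schema_map)
print(json.dumps(schema, indent=2))
```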

Closing due to lack of activity.