
properties-dataflow

Tiny example for querying a BigQuery table and sending enrichment requests to the Google Places API. External data is landed in raw JSON form in the specified output table in BigQuery for further processing later on. The URL from the source table is used as an identifier in the output table so that the two tables can be joined.
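
Conceptually, the enrichment step looks something like the following. This is a minimal sketch rather than the code in process.py: the googlemaps client library and the url/address column names are assumptions.

import json
import apache_beam as beam


class EnrichWithPlaces(beam.DoFn):
    """Calls the Places API for each source row and emits the raw JSON keyed by URL."""

    def __init__(self, gmaps_key):
        self.gmaps_key = gmaps_key

    def setup(self):
        # Create the client once per worker rather than once per element.
        import googlemaps
        self.client = googlemaps.Client(key=self.gmaps_key)

    def process(self, row):
        # 'url' and 'address' are hypothetical column names from the source table.
        response = self.client.places(query=row['address'])
        yield {'url': row['url'], 'raw_json': json.dumps(response)}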

Running it

In Cloud Shell:

cd properties-dataflow
virtualenv -p python3.7 venv
source venv/bin/activate

pip install -r requirements.txt

To run locally (replacing the values in <>, e.g. --output test_project:mydata.output):

python process.py \
  --output <your project>:<your dataset>.<output_table_name> \
  --project <your project> \
  --gmaps_key <your API key>

Note that the local Apache Beam runner is only suitable for testing a very small number of records.

When ready to execute it on Cloud Dataflow, follow this guide.

python process.py \
  --output <your project>:<your dataset>.<output_table_name> \
  --project <your project> \
  --gmaps_key <your API key> \
  --runner DataflowRunner \
  --temp_location gs://<your temp bucket>/dataflow/

This will provision several VMs to perform the batch job. You can monitor progress by following the URL that appears in the terminal after issuing the above command.

But first

Read the small Beam program and amend the input query, at least replacing the source table name. Remember to add a LIMIT to your query when testing.
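
For reference, the read step might end up looking roughly like this. It is only a sketch: process.py may use a different source transform or SQL dialect, and the table and column names here are placeholders.

import apache_beam as beam

# Placeholder query: swap in your own source table and keep a LIMIT while testing.
QUERY = """
    SELECT url, address
    FROM `my_project.my_dataset.source_table`
    LIMIT 100
"""

with beam.Pipeline() as pipeline:
    rows = pipeline | 'ReadSource' >> beam.io.ReadFromBigQuery(
        query=QUERY, use_standard_sql=True)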

Improvements

  • Raw data is landed as a blob of JSON. This can be queried with JSON_EXTRACT and so on, but it is a little awkward to work with. You could extract the relevant parts in this program and populate an easier-to-use table with a more specialised schema, as in the sketch below.
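
A rough sketch of that idea, assuming the output table has url and raw_json columns and that the stored Places response contains a results list with name and formatted_address fields (verify this against the actual payload before relying on it):

import json
import apache_beam as beam


class ExtractPlaceFields(beam.DoFn):
    """Parses the raw Places JSON and emits rows for a narrower, typed schema."""

    def process(self, row):
        payload = json.loads(row['raw_json'])
        # Hypothetical extraction: take the first result, if any.
        results = payload.get('results', [])
        if results:
            first = results[0]
            yield {
                'url': row['url'],
                'name': first.get('name'),
                'formatted_address': first.get('formatted_address'),
            }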
