Read data from Google BigQuery with Dask.
This package uses the BigQuery Storage API. Please refer to the data extraction pricing table for the costs associated with using Dask-BigQuery.
`dask-bigquery` can be installed with pip:

```shell
pip install dask-bigquery
```

or with conda:

```shell
conda install -c conda-forge dask-bigquery
```
Default credentials can be provided by setting the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of the credentials file:

```shell
$ export GOOGLE_APPLICATION_CREDENTIALS=/home/<username>/google.json
```
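If it is more convenient, the same variable can be set from within Python before any Google client library looks up default credentials (the path below is just an illustrative placeholder):

```python
import os

# Equivalent to the `export` command above; Google client libraries
# read this variable when resolving default credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/username/google.json"
```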
For information on obtaining the credentials, see the Google API documentation. `dask-bigquery` assumes that you are already authenticated.
```python
import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)

ddf.head()
```
With default credentials:

```python
import dask
import dask_bigquery

ddf = dask.datasets.timeseries(freq="1min")

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)
```
With explicit credentials:

```python
import dask_bigquery
from google.oauth2.service_account import Credentials

# Build credentials from a service account info dict
creds_dict = {"type": ..., "project_id": ..., "private_key_id": ...}
credentials = Credentials.from_service_account_info(info=creds_dict)

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    credentials=credentials,
)
```
Before loading data into BigQuery, `to_gbq` writes intermediary Parquet files to a Google Cloud Storage bucket. The default bucket name is `dask-bigquery-tmp`. You can provide a different bucket name by setting the parameter `bucket="my-gs-bucket"`. After the job is done, the intermediary data is deleted.

If you're using a persistent bucket, we recommend configuring a retention policy that ensures the data is cleaned up even in case of job failures.
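One way to set up such a policy is with a Cloud Storage lifecycle rule that deletes objects after a fixed age. A minimal sketch (the one-day age and the bucket name are only examples, not defaults of this package):

```shell
# lifecycle.json: delete objects older than 1 day
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 1}}
  ]
}
EOF

# Apply the rule to the bucket passed to to_gbq
gsutil lifecycle set lifecycle.json gs://my-gs-bucket
```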
To run the tests locally you need to be authenticated and have a project created on that account. If you're using a service account, it needs the "BigQuery Admin" role, selected in the "Grant this service account access to project" section when the account is created.

You can run the tests with

```shell
$ pytest dask_bigquery
```

if your default `gcloud` project is set, or manually specify the project ID with

```shell
$ DASK_BIGQUERY_PROJECT_ID=your_project_id pytest dask_bigquery
```
This project stems from the discussion in this Dask issue and this initial implementation developed by Brett Naul, Jacob Hayes, and Steven Soojin Kim.