Read data from Google BigQuery with Dask.
This package uses the BigQuery Storage API. Please refer to the data extraction pricing table for the costs associated with using Dask-BigQuery.
`dask-bigquery` can be installed with pip:

```shell
pip install dask-bigquery
```

or with conda:

```shell
conda install -c conda-forge dask-bigquery
```
Default credentials can be provided by setting the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of the credentials file:

```shell
$ export GOOGLE_APPLICATION_CREDENTIALS=/home/<username>/google.json
```
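If it is more convenient, the same variable can be set from within Python before any Google client library looks up default credentials (the path below is just an illustrative placeholder):

```python
import os

# Equivalent to the `export` command above; Google client libraries
# read this variable when resolving default credentials.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/username/google.json"
```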
For information on obtaining the credentials, see the Google API documentation. `dask-bigquery` assumes that you are already authenticated.
```python
import dask_bigquery

ddf = dask_bigquery.read_gbq(
    project_id="your_project_id",
    dataset_id="your_dataset",
    table_id="your_table",
)

ddf.head()
```
With default credentials:

```python
import dask
import dask_bigquery

ddf = dask.datasets.timeseries(freq="1min")

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
)
```
With explicit credentials:

```python
import dask_bigquery
from google.oauth2.service_account import Credentials

# Build credentials from a service account info dict
creds_dict = {"type": ..., "project_id": ..., "private_key_id": ...}
credentials = Credentials.from_service_account_info(info=creds_dict)

res = dask_bigquery.to_gbq(
    ddf,
    project_id="my_project_id",
    dataset_id="my_dataset_id",
    table_id="my_table_name",
    credentials=credentials,
)
```
Before loading data into BigQuery, `to_gbq` writes intermediary Parquet files to a Google Cloud Storage bucket. The default bucket name is `dask-bigquery-tmp`. You can provide a different bucket name by setting the parameter `bucket="my-gs-bucket"`. After the job is done, the intermediary data is deleted.

If you're using a persistent bucket, we recommend configuring a retention policy that ensures the data is cleaned up even in case of job failures.
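One way to set up such a policy is with a Cloud Storage lifecycle rule that deletes objects after a fixed age. A minimal sketch (the one-day age and the bucket name are only examples, not defaults of this package):

```shell
# lifecycle.json: delete objects older than 1 day
cat > lifecycle.json <<'EOF'
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 1}}
  ]
}
EOF

# Apply the rule to the bucket passed to to_gbq
gsutil lifecycle set lifecycle.json gs://my-gs-bucket
```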
To run the tests locally you need to be authenticated and have a project created on that account. If you're using a service account, it needs the "BigQuery Admin" role, selected in the "Grant this service account access to project" section when the account is created.

You can run the tests with

```shell
$ pytest dask_bigquery
```

if your default `gcloud` project is set, or manually specify the project ID with

```shell
$ DASK_BIGQUERY_PROJECT_ID=your_project_id pytest dask_bigquery
```
This project stems from the discussion in this Dask issue and this initial implementation developed by Brett Naul, Jacob Hayes, and Steven Soojin Kim.