This repo contains the code for the demo presented at Snowpark Day on 2024-01-31.
The demo shows how to run an MLOps pipeline with Astronomer and Snowpark, analysing information about trial customers of an up-and-coming online cookie shop. The pipeline trains several models to predict how many cookies of each type a customer will order, and then uses the best model to make predictions on new customers.
The pipeline consists of 3 Airflow DAGs:

- `snowpark_etl_train_set`: extracts data from S3, transforms it with Snowpark, and loads it into Snowflake. This DAG runs via an external trigger, for example a call to the Airflow REST API.
- `snowpark_ml_train`: trains a model with Snowpark ML and stores it in the Snowflake Model Registry. This DAG runs as soon as the previous one finishes, using a Dataset.
- `snowpark_predict`: retrieves the model from the Snowflake Model Registry and uses it to make predictions on new data. This DAG is scheduled to run once per day.
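The external trigger for the ETL DAG can be issued, for example, against Airflow's stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`). Below is a minimal sketch using only the Python standard library; the endpoint, port, and `admin:admin` credentials assume the local `astro dev start` setup described in this README, so adjust them for your deployment:

```python
import base64
import json
import urllib.request

# Build (but do not yet send) a request that triggers one run of the ETL DAG.
# URL and credentials assume the local Astro CLI deployment at localhost:8080.
url = "http://localhost:8080/api/v1/dags/snowpark_etl_train_set/dagRuns"
payload = json.dumps({"conf": {}}).encode("utf-8")
auth = base64.b64encode(b"admin:admin").decode("ascii")

request = urllib.request.Request(
    url,
    data=payload,
    method="POST",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {auth}",
    },
)

# Sending the request starts snowpark_etl_train_set once; snowpark_ml_train
# then follows automatically via its Dataset schedule.
# response = urllib.request.urlopen(request)
```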
This demo uses the following tools:

- Snowpark Airflow Provider
- Snowflake ML
- Snowpark API
- Snowflake Airflow Provider
- Amazon Airflow Provider
- Matplotlib
> **Warning**: The Snowpark Airflow Provider is currently in beta and is not yet recommended for production use. Please report any issues you encounter.
1. Install the Astro CLI. The Astro CLI is an open source tool and the easiest way to run Airflow locally.
2. Clone this repository.
3. Create a `.env` file in the root of the repository and copy the contents of `.env.example` into it. Fill in the values for your Snowflake and AWS accounts.
4. Copy the contents of `include/data/` into an S3 bucket. You can generate more customer data by running `include/generate_customer_info.py`.
5. Make sure to provide your values for the global variables defined at the start of the DAG files. These are:
   - `SNOWFLAKE_CONN_ID`: name of your Snowflake connection in Airflow. This needs to be the same connection in all DAGs.
   - `AWS_CONN_ID`: name of your AWS connection in Airflow.
   - `MY_SNOWFLAKE_DATABASE`: name of the database in Snowflake where you want to store the data and models. This needs to be an existing database and the same one in all DAGs.
   - `MY_SNOWFLAKE_SCHEMA`: name of the schema in Snowflake where you want to store the data and models. This needs to be an existing schema and the same one in all DAGs.
   - `TRAIN_DATA_TABLE_RAW`: name of the table in Snowflake where you want to store the raw training data. This table will be created by the DAG.
   - `TRAIN_DATA_TABLE_PROCESSED`: name of the table in Snowflake where you want to store the processed training data. This table will be created by the DAG.
   - `TEST_DATA_TABLE_PROCESSED`: name of the table in Snowflake where you want to store the processed test data. This table will be created by the DAG.
   - `TABLE_PREDICT`: name of the table in Snowflake where you want to store the predictions. This table will be created by the DAG.
   - `USE_SNOWPARK_WAREHOUSE`: toggle to true if you want to use a Snowpark warehouse to run the Snowpark jobs. If you do this, you will need to provide your values for `MY_SNOWPARK_WAREHOUSE` and `MY_SNOWFLAKE_REGULAR_WAREHOUSE`.
6. Run `astro dev start` to start Airflow locally. You can log into the Airflow UI at `localhost:8080` with the username `admin` and password `admin`.
7. Unpause the `snowpark_etl_train_set` and `snowpark_ml_train` DAGs by clicking the toggle to the left of the DAG name in the Airflow UI.
8. Trigger the `snowpark_etl_train_set` DAG by clicking the play button to the right of the DAG name in the Airflow UI. This will run the DAG once, and the `snowpark_ml_train` DAG will start automatically once the ETL DAG finishes.
9. Unpause the `snowpark_predict` DAG by clicking the toggle to the left of the DAG name in the Airflow UI. This DAG will run once per day and make predictions on new data.
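For reference, `include/generate_customer_info.py` (step 4 above) produces the synthetic trial-customer data. A simplified, hypothetical sketch of what such a generator might look like — the column names, cookie types, and value ranges here are illustrative assumptions, not the script's actual schema:

```python
import csv
import random

# Cookie types and column names below are illustrative assumptions.
COOKIE_TYPES = ["chocolate_chip", "oatmeal", "ginger"]

def generate_customers(n_customers: int, seed: int = 42) -> list[dict]:
    """Generate synthetic trial-customer rows with random cookie order counts."""
    rng = random.Random(seed)
    rows = []
    for customer_id in range(n_customers):
        row = {"customer_id": customer_id}
        for cookie in COOKIE_TYPES:
            row[f"{cookie}_ordered"] = rng.randint(0, 20)
        rows.append(row)
    return rows

def write_csv(rows: list[dict], path: str) -> None:
    """Write the generated rows to a CSV file, e.g. for upload to S3."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

rows = generate_customers(5)
write_csv(rows, "customer_info.csv")
```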
At `include/streamlit_app.py` you can find a simple Streamlit app that displays the predictions. Use Streamlit in Snowflake and copy the contents of `include/streamlit_app.py` into the Streamlit app.
By default, Airflow stores data passed between tasks in an XCom table in the metadata database. The Snowpark Airflow provider includes the functionality to store this data in Snowflake instead.
To use this functionality, you need to create a table in Snowflake to store the XCom data. You can do this by running the following SQL query in Snowflake:
```sql
create or replace TABLE AIRFLOW_XCOM_DB.AIRFLOW_XCOM_SCHEMA.XCOM_TABLE (
    DAG_ID VARCHAR(16777216) NOT NULL,
    TASK_ID VARCHAR(16777216) NOT NULL,
    RUN_ID VARCHAR(16777216) NOT NULL,
    MULTI_INDEX NUMBER(38,0) NOT NULL,
    KEY VARCHAR(16777216) NOT NULL,
    VALUE_TYPE VARCHAR(16777216) NOT NULL,
    VALUE VARCHAR(16777216) NOT NULL
);
```
You will also need to create a stage in Snowflake in the same schema.
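For example, assuming the database and schema from the query above, and the stage name referenced by the environment variables below:

```sql
create or replace stage AIRFLOW_XCOM_DB.AIRFLOW_XCOM_SCHEMA.XCOM_STAGE;
```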
Afterwards, uncomment the XCom-related environment variables you copied from the `.env.example` file in your `.env` file:
```text
AIRFLOW__CORE__XCOM_BACKEND=snowpark_provider.xcom_backends.snowflake.SnowflakeXComBackend
AIRFLOW__CORE__XCOM_SNOWFLAKE_TABLE='AIRFLOW_XCOM_DB.AIRFLOW_XCOM_SCHEMA.XCOM_TABLE'
AIRFLOW__CORE__XCOM_SNOWFLAKE_STAGE='AIRFLOW_XCOM_DB.AIRFLOW_XCOM_SCHEMA.XCOM_STAGE'
AIRFLOW__CORE__XCOM_SNOWFLAKE_CONN_NAME='snowflake_default'
```