RAPIDS Notebooks-Contrib

Intro
Installation
Exploring the Repo

Notebooks:

Getting Started
Intermediate
Advanced
Blogs
Conference

Introduction

Welcome to the community-contributed notebooks repo (formerly known as Notebooks-Extended)!

The purpose of this collection of notebooks is to help users understand what RAPIDS has to offer; learn why, how, and when including RAPIDS in a data science pipeline makes sense; and collect community contributions of RAPIDS knowledge. The differences between this repo and the Notebooks Repo are:

These are vetted, community-contributed notebooks (includes RAPIDS team member contributions).
These notebooks won't run on air-gapped systems, and running air-gapped is one of our container requirements. Many RAPIDS notebooks use additional PyData ecosystem packages and include code for downloading datasets, so they require network connectivity. If you are running on a system with no network access, please download all the data that you plan to use ahead of time, or simply use the core notebooks repo.
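For systems without network access, the advice to download data ahead of time can be sketched as a small prefetch script run on a connected machine. This is a minimal sketch using only the Python standard library; the `prefetch` helper, the `data` directory default, and the example URL are assumptions for illustration and are not part of this repo.

```python
import urllib.request
from pathlib import Path


def prefetch(datasets, data_dir="data"):
    """Download each dataset into data_dir unless it is already there.

    `datasets` maps a local file name to its source URL. Returns the list
    of paths that were downloaded during this call.
    """
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    fetched = []
    for name, url in datasets.items():
        dest = target / name
        if dest.exists():
            continue  # already cached, so skip the download
        urllib.request.urlretrieve(url, dest)
        fetched.append(dest)
    return fetched


# Example call (the URL is a placeholder, not a real dataset location):
# prefetch({"example.csv": "https://example.com/data/example.csv"})
```

After the script has run, the populated directory can be copied to the offline system. Individual notebooks may expect their data in different locations, so check each notebook's download cell for the path it uses.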

Installation

Please use the BUILD.md to check the pre-requisite packages and installation steps.

Contributing

Please see our guide for contributing to notebooks-contrib.

Once you've followed our guide, please don't forget to test your notebooks before making a PR!

Exploring the Repo

Folders

getting_started_notebooks - “how to start using RAPIDS”. Contains notebooks showing "hello worlds", getting started with RAPIDS libraries, and tutorials around RAPIDS concepts.
intermediate_notebooks - “how to accomplish your workflows with RAPIDS”. Contains notebooks showing algorithm and workflow examples, benchmarking tools, and some complete end-to-end (E2E) workflows.
advanced_notebooks - "how to master RAPIDS". Contains notebooks showing kernel customization and advanced end-to-end workflows.
blog notebooks - contains shared notebooks mentioned and used in blogs that showcase RAPIDS workflows and capabilities.
conference notebooks - contains notebooks used in conferences, such as GTC.
data - contains small data samples used for purely functional demonstrations. Some notebooks include cells that download larger datasets from external websites.

Lists

multimedia_links.md is a list of videos by RAPIDS or our community talking about or showing how to use RAPIDS. Feel free to contribute your videos and RAPIDS themed playlists as well!
competition_notebooks.md - contains archived notebooks that were used in competitions, such as Kaggle. Some of these notebooks were blogged about and can also be found in our blog notebooks folder.

Our Notebooks

Below is a listing of the notebooks in this repository. Each row gives the notebook's:

Location, in Folder
Title and direct link, in Notebook Title
Description, in Description
Target hardware, a single GPU (SG) or multiple GPUs (MG), in GPU (don't worry, you can still run the multi-GPU notebooks with a single GPU)
Data source, in Dataset Used
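If you keep a copy of the listing as tab-separated text, the five columns described above can be filtered programmatically, for example to pick out the single-GPU notebooks. The following is a minimal sketch using Python's standard `csv` module; the `notebooks_for` helper is hypothetical, and the three sample rows are copied from the Getting Started table with shortened descriptions.

```python
import csv
import io

# Sample rows in the listing's column order:
# Folder, Notebook Title, Description, GPU, Dataset Used.
LISTING = (
    "Folder\tNotebook Title\tDescription\tGPU\tDataset Used\n"
    "basics\tGetting_Started_with_cuDF\tGPU DataFrames with cuDF\tSG\tSelf Generated\n"
    "basics\tDask_Hello_World\tDask Hello World\tMG\tSelf Generated\n"
    "intro_tutorials\t02_Introduction_to_cuDF\tcuDF DataFrames\tSG\tSelf Generated\n"
)


def notebooks_for(listing, gpu):
    """Return (folder, title) pairs whose GPU column equals gpu ('SG' or 'MG')."""
    reader = csv.DictReader(io.StringIO(listing), delimiter="\t")
    return [
        (row["Folder"], row["Notebook Title"])
        for row in reader
        if row["GPU"] == gpu
    ]


single_gpu = notebooks_for(LISTING, "SG")
print(single_gpu)
```

Filtering on `"MG"` instead selects the multi-GPU notebooks, which, as noted above, can still be run with a single GPU.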

Getting Started Notebooks:

Folder	Notebook Title	Description	GPU	Dataset Used
basics	Getting_Started_with_cuDF	This notebook shows how to get started with GPU DataFrames (single GPU only) using cuDF in RAPIDS.	SG	Self Generated
basics	Dask_Hello_World	This notebook shows how to quickly set up Dask and run a "Hello World" example.	MG	Self Generated
basics	Getting_Started_with_Dask	This notebook shows how to get started with multi-GPU DataFrames using Dask and cuDF in RAPIDS.	MG	Self Generated
basics	hello_streamz	This notebook demonstrates use of cuDF to perform streaming word-count using a small portion of the Streamz API.	SG	Self Generated
basics -> blazingsql	Getting Started with BlazingSQL	How to set up and get started with BlazingSQL and the RAPIDS AI suite.	SG	Music Dataset
basics -> blazingsql	Federated Query Demo	In a single query, join an Apache Parquet file, a CSV file, and a GPU DataFrame (GDF) in GPU memory.	SG	Breast Cancer Diagnostic
intro_tutorials	01_Introduction_to_RAPIDS	This notebook shows at a high level what each of the packages in RAPIDS is and what it does.	MG	Self Generated
intro_tutorials	02_Introduction_to_cuDF	This notebook shows how to work with cuDF DataFrames in RAPIDS.	SG	Self Generated
intro_tutorials	03_Introduction_to_Dask	This notebook shows how to work with Dask using basic Python primitives like integers and strings.	MG	Self Generated
intro_tutorials	04_Introduction_to_Dask_using_cuDF_DataFrames	This notebook shows how to work with cuDF DataFrames using Dask.	MG	Self Generated
intro_tutorials	06_Introduction_to_Supervised_Learning	This notebook shows how to do GPU accelerated Supervised Learning in RAPIDS.	SG	Self Generated
intro_tutorials	07_Introduction_to_XGBoost	This notebook shows how to work with GPU accelerated XGBoost in RAPIDS.	SG	Self Generated
intro_tutorials	08_Introduction_to_Dask_XGBoost	This notebook shows how to work with Dask XGBoost in RAPIDS.	MG	Self Generated
intro_tutorials	09_Introduction_to_Dimensionality_Reduction	This notebook shows how to do GPU accelerated Dimensionality Reduction in RAPIDS.	SG	Self Generated
intro_tutorials	10_Introduction_to_Clustering	This notebook shows how to do GPU accelerated Clustering in RAPIDS.	SG	Self Generated

Intermediate Notebooks:

Folder	Notebook Title	Description	GPU	Dataset Used
examples	linear_regression_demo.ipynb	This notebook demos how to implement simple and multiple linear regression with cuML to predict median housing price on sklearn's Boston Housing dataset. With corresponding Medium Story.	SG	SKLearn Boston Housing
examples	umap_demo_full	In this notebook we will show how to use UMAP and its GPU accelerated implementation present in RAPIDS.	SG	Fashion MNIST
examples	rf_demo	Demonstration of using both cuml and sklearn to train a RandomForestClassifier on the Higgs dataset.	SG	Higgs Boson
examples	weather	Demonstration of using Dask and cuDF to process and analyze weather history	MG	NOAA Annual Weather Data
examples -> blazingsql	BlazingSQL vs Spark	Analyze 73 million rows of net flow data. Compare BlazingSQL and Apache Spark timings for the same workload.	SG	University of New South Wales LanL Dataset
examples -> blazingsql	Taxi Fare Prediction	Build & test a cuML Linear Regression model to predict the cost of a ride from 20 million rows of NYC Taxi data.	SG	NYC Taxi Dataset
examples -> custreamz	parsing_haproxy_logs	This notebook builds upon the weblogs streaming notebook and demonstrates more advanced features for parsing HAProxy logs.	SG	Self Generated
examples -> cugraph	MG Pagerank	Analyze a Twitter dataset (26GB on disk) with 41.7 million users with 1.47 billion social relations (edges) to find out the most influential profiles.	MG	Twitter
E2E -> taxi	NYCTaxi	Demonstrates multi-node ETL for cleanup of raw data into cleaned train and test dataframes. Shows how to run multi-node XGBoost training with dask-xgboost. Please Note: requires Google Dataproc to run! Blog	MG	Google Dataproc Hosted NYC Taxi Data
E2E -> synthetic_3D	rapids_ml_workflow_demo	A 3D visual showcase of a machine learning workflow with RAPIDS (load data, transform/normalize, train XGBoost model, evaluate accuracy, use model for inference). Along the way we compare the performance gains of RAPIDS [GPU] vs sklearn/pandas methods [CPU].	SG	SciKit-Learn's demo datasets
E2E -> census	census_education2income_demo	In this notebook we use 50 years of census data to see how education affects income.	SG	Custom IPUMS Data pull
E2E -> mortgage	mortgage_e2e	This notebook demonstrates multi-GPU ETL and XGBoost for data preprocessing and training on 17 years of Fannie Mae’s Single-Family Loan Performance Data.	MG	Mortgage Loan Data
benchmarks	cuml_benchmarks	The purpose of this notebook is to extensively benchmark all of the single GPU cuML algorithms against their skLearn counterparts, while also providing the ability to find and verify upper bounds. Note: Best on large memory GPUs	SG	Self Generated
benchmarks	rapids_decomposition	This notebook benchmarks and visualizes RAPIDS decomposition methods against each other. You can also compare them yourself against CPU speeds and methods.	SG	SciKit-Learn's demo datasets
benchmarks -> cugraph_benchmarks	louvain_benchmark	This notebook benchmarks performance improvement of running the Louvain clustering algorithm within cuGraph against NetworkX.	SG	Sparse collection
benchmarks -> cugraph_benchmarks	pagerank_benchmark	This notebook benchmarks performance improvement of running PageRank within cuGraph against NetworkX.	SG	Sparse collection
benchmarks -> cugraph_benchmarks	BFS benchmark	This notebook benchmarks performance improvement of running BFS within cuGraph against NetworkX.	SG	Sparse collection
benchmarks -> cugraph_benchmarks	SSSP_benchmark	This notebook benchmarks performance improvement of running SSSP within cuGraph against NetworkX.	SG	Sparse collection
benchmarks -> cugraph_mg_hibench	MG pagerank_benchmark	This notebook runs cuGraph's multi-GPU PageRank on a 300GB dataset. It is designed for DGX-2 machines.	MG	HiBench

Advanced Notebooks:

Folder	Notebook Title	Description	GPU	Dataset Used
tutorials	rapids_customized_kernels	Archive Only. This notebook shows how to create customized kernels using CUDA to make your workflow in RAPIDS even faster.	SG	Self Generated

Blog Notebooks:

Folder	Notebook Title	Description	GPU	Dataset Used
cyber	flow_classification_rapids	Archive Only. The `cyber` folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to load netflow data into cuDF and create a multiclass classification model using XGBoost. Uses run_raw_data_generator	SG	University of New South Wales LanL Dataset
cyber	lanl_network_mapping_using_rapids	Archive Only. The `cyber` folder contains the associated companion files for the blog GPU Accelerated Cyber Log Parsing with RAPIDS, by Bianca Rhodes US, Bhargav Suryadevara, and Nick Becker. This notebook demonstrates how to parse raw windows event logs using cudf and uses cuGraph's pagerank model to build a network graph. Uses run_raw_data_generator	SG	University of New South Wales LanL Dataset
databricks	RAPIDS_PCA_demo_avro_read	The `databricks` folder is the companion file repository to the blog RAPIDS can now be accessed on Databricks Unified Analytics Platform by Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer. This notebook's purpose is to showcase RAPIDS on Databricks using their sample datasets and to show the CPU vs GPU comparison for the PCA algorithm. There is also an accompanying HTML file for easy Databricks import. This notebook is for illustrative purposes only! Do not expect this notebook to run successfully on its own; its code replicates a workflow meant to run on a specific platform, `Databricks`.	SG	RAPIDS Toy Data
plasticc	rapids_lsst_full_demo	Archive Only. This notebook demos the full CPU and GPU implementation of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. Updated notebooks found here	MG	Kaggle PLAsTiCC-2018 dataset
plasticc	rapids_lsst_gpu_only_demo	Archive Only. This GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. Updated notebooks found here	MG	Kaggle PLAsTiCC-2018 dataset
santander	cudf_tf_demo	Archive Only. This financial industry facing notebook is the cudf-tensorflow approach from the RAPIDS.ai team for Santander Customer Transaction Prediction. Placed 17/8808. Blog	SG	Kaggle Santander Customer Transaction Prediction Dataset
santander	E2E_santander_pandas	Archive Only. This financial data modelling notebook is the Pandas based version of the RAPIDS.ai team's best single model for the Santander Customer Transaction Prediction competition. Placed 17/8808. Blog	SG	Kaggle Santander Customer Transaction Prediction Dataset
santander	E2E_santander	Archive Only. This financial data modelling notebook is the cuDF based version of the RAPIDS.ai team's best single model for Santander Customer Transaction Prediction competition. It allows you to compare cuDF performance to the Pandas version. Placed 17/8808. Blog.	SG	Kaggle Santander Customer Transaction Prediction Dataset
regression	regression_blog_notebook	This is the companion notebook for the blog Essential Machine Learning with Linear Models in RAPIDS: part 1 of a series by Paul Mahler. It showcases an end to end notebook using the Bike Share dataset and cuML's implementation of ridge regression.	SG	Bike Share Dataset
regression	regression_2_blog	This is the companion notebook for the blog Regression Blog 2: We’re Practically Giving These Regressions Away by Paul Mahler. It showcases an end to end notebook using the Black Friday dataset and cuML's implementations of L1 and L2 regularizations using Ridge, Lasso, and ElasticNet regression techniques.	SG	Analytics Vidhya Black Friday Hackathon Dataset
NLP	show_me_the_word_count_gutenberg	This is the notebook for blog Show Me The Word Count by Vibhu Jawa, Nick Becker, David Wendt, and Randy Gelhausen. This notebook showcases NLP pre-processing capabilities of nvstrings+cudf on the Gutenberg dataset.	SG	Gutenberg Dataset
cuspatial	accelerate_geospatial_processing	This is the notebook for blog cuSpatial Accelerates Geospatial and Spatiotemporal Processing by Milind Naphade, Jianting Zhang, Shuo Wang, Thomson Comer, Josh Paterson, Keith Kraus, Mark Harris, and Sujit Biswas. This notebook showcases cuSpatial benchmarking of directed Hausdorff distance for computing trajectory clustering on a large dataset.	SG	Trajectories Data and target_intersection.png
randomforest	fruits_rf_notebook	This is the notebook for blog GPU-accelerated Random Forest by Vishal Mehta, Myrto Papadopoulou, Thejaswi Rao. This notebook showcases how to use GPU accelerated Random Forest Classification in cuML. The fruit dataset used is Self generated and used as an example in the Blog	SG	Self Generated
mortgage deep learning	mortgage_e2e_deep_learning	Archive Only. This end to end notebook for the blog, Using RAPIDS with PyTorch, by Even Oldridge, combines the RAPIDS GPU data processing with a PyTorch deep learning neural network to predict mortgage loan delinquency.	MG	Fannie Mae Mortgage Dataset
svm	svc_covertype	This notebook provides supplementary information for the Benchmark section of the RAPIDS cuML SVC blog post.	SG	UCI Forest covertype dataset

Conference Notebooks:

Folder	Notebook Title	Description	GPU	Dataset Used
GTC_SJ_2019	GTC_tutorial_instructor	This is the instructor notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019. It contains all the demonstrated solutions.	SG	Analytics Vidhya Black Friday Hackathon Dataset
GTC_SJ_2019	GTC_tutorial_student	This is the exercise-filled student notebook for the hands on RAPIDS tutorial presented at San Jose's GTC 2019	SG	Analytics Vidhya Black Friday Hackathon Dataset

KDD_2019	Cybersecurity_KDD	Using RAPIDS on network traffic and metadata, we demonstrate how to: 1. Triage and perform data exploration, 2. Model network data as a graph, 3. Perform graph analytics on the graph representation of the cyber network data, and 4. Prepare the results in a way that is suitable for visualization.	SG	IDS 2018 dataset
KDD_2019	MiningFrequentPatternsFromGraphs	This notebook uses PC failure metadata, turns it into a coordinate list, and uses cugraph to find frequent patterns about the population that has failed	SG	Microsoft PC Failure Metadata Graph
KDD_2019	Part 1.1 RNN Feature Engineering	Part 1.1 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here	MG	Kaggle PLAsTiCC-2018 dataset
KDD_2019	Part 1.2 RNN Extract Bottleneck	Part 1.2 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here	MG	Kaggle PLAsTiCC-2018 dataset
KDD_2019	Part 2.1 Feature Engineering	Part 2.1 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here	MG	Kaggle PLAsTiCC-2018 dataset
KDD_2019	Part 2.2 Train XGBoost & MLP	Part 2.2 of this GPU only based notebook shows the RAPIDS speedup of the RAPIDS.ai team's model that placed 8/1094 in the PLAsTiCC Astronomical Classification competition. Blog. - Introduction found here. - Exercise Answers found here - Original submission found here	MG	Kaggle PLAsTiCC-2018 dataset

SCIPY_2019	SCIPY_2019 Tutorial Index	This index outlines the "getting started" style tutorials within the folder. The tutorials cover cudf, cuml, and cugraph. These tutorials were presented at SCIPY 2019	SG	Various Self Generated datasets and Zachary Karate Club Data Set

ASONAM 2019	Cyber	Example notebook using RAPIDS to let an organization's security and forensics experts collect vast amounts of network traffic and network metadata and perform fast triage, processing, modeling, and visualization capabilities.	MG	IDS 2018 dataset from the Canadian Institute for Cybersecurity
ASONAM 2019	Spotify Playlist	Shows how you can quickly use RAPIDS to explore the Spotify Million Playlist Dataset, which was created for the RecSys 2018 competition, and build a playlist recommender Note: this dataset requires an independent user download and cannot be pulled from the notebook	MG	RecSys 2018 competition
ASONAM 2019	Weighted Link Prediction	This notebook uses cuGraph for Weighted Link Prediction to mitigate uncertainty on the Epinions Trust Network Dataset to predict the likelihood of trust or distrust between vertices. Note: this dataset requires an independent user download and cannot be pulled from the notebook	SG	Epinions Trust Network Dataset

KDD 2020	KDD 2020	Conference material for the KDD 2020 hands-on tutorial	SG
KDD 2020	Taxi	Analysis of the New York City Taxi dataset. Introductory notebook showing ETL, Statistical Analysis, Machine Learning, Graph, and Visualization	SG	2016 New York Taxi Data
KDD 2020	Tabular	Perform store sales prediction using tabular deep learning	SG	Kaggle Rossmann Store Sales competition
KDD 2020	Cell RNA	Single-Cell RNA Sequencing Analysis	SG	human lung cells from Travaglini et al. 2020
KDD 2020	Parking	Analyzing Seattle Parking data and determining the best parking spot within a walkable distance from Space Needle	SG
KDD 2020	CyBERT	Cyber Log Parsing using Neural Networks and Language Based Model	SG

Additional Information

The data folder also includes the full image set from the Fashion MNIST dataset.
utils: contains a set of useful scripts for interacting with RAPIDS Notebooks-Contrib
For the notebook examples and tutorials found in our standard containers, please see the Notebooks Repo.
