spark-rapids-examples

A repository of Spark-related utilities and examples that use the RAPIDS Accelerator, including ETL, ML/DL, and more.

Enterprise AI is built on ETL pipelines and relies on AI infrastructure to integrate and process large amounts of data effectively. One of the fundamental purposes of the RAPIDS Accelerator is to integrate large ETL and ML/DL pipelines. The RAPIDS Accelerator for Apache Spark offers seamless integration with machine learning frameworks such as XGBoost and PCA. Users can leverage an Apache Spark cluster with NVIDIA GPUs to accelerate ETL pipelines and then use the same infrastructure to load the data frame into single or multiple GPUs across multiple nodes to train with GPU-accelerated XGBoost or PCA. In addition, if you are using a deep learning framework to train on tabular data with the same Apache Spark cluster, we have leveraged NVIDIA’s NVTabular library to load and train the data across multiple nodes with GPUs. NVTabular is a feature engineering and preprocessing library for tabular data, designed to quickly and easily manipulate terabyte-scale datasets used to train deep learning based recommender systems. We have also added MIG support to YARN so that CSPs can split an A100/A30 into multiple MIG devices and have each one appear as a normal GPU.

Please see the RAPIDS Accelerator for Apache Spark documentation for supported Spark versions and requirements. We recommend setting up the Spark cluster with JDK 8.
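As a rough illustration of what enabling the accelerator on a cluster involves, the following minimal sketch collects a handful of Spark settings and renders them as `spark-submit` flags. The helper names `rapids_conf` and `to_submit_flags`, the jar path, and the resource amounts are assumptions made for this sketch, not values taken from this repository; consult the RAPIDS Accelerator documentation for the settings that match your Spark version.

```python
# Minimal sketch: gather Spark settings that load the RAPIDS Accelerator
# plugin and render them as spark-submit flags.
# Assumptions: the jar path and the resource amounts are placeholders.


def rapids_conf(plugin_jar, gpus_per_executor=1, tasks_per_gpu=4):
    """Return a dict of Spark settings that enable GPU-accelerated SQL."""
    return {
        # Location of the RAPIDS Accelerator jar (hypothetical path).
        "spark.jars": plugin_jar,
        # Plugin class that activates the accelerator.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        # Standard Spark 3 GPU scheduling settings.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_flags(conf):
    """Turn a settings dict into a flat list of --conf key=value flags."""
    flags = []
    for key, value in sorted(conf.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    conf = rapids_conf("/opt/sparkRapidsPlugin/rapids-4-spark.jar")
    print(" ".join(to_submit_flags(conf)))
```

Keeping the settings in one dict makes it easy to reuse the same values for `spark-submit`, `spark-shell`, or a notebook session.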

Getting Started Guides

1. Microbenchmark guide

The microbenchmark for the RAPIDS Accelerator for Apache Spark is used to identify, test, and analyze the queries that benefit most from GPU acceleration. For detailed information, please refer to this guide.

2. XGBoost examples guide

We provide three similar XGBoost benchmarks: Mortgage, Taxi, and Agaricus. Try one of the "Getting Started Guides". Please note that the guides target the Mortgage dataset as written; with a few changes to EXAMPLE_CLASS and dataPath, they can easily be adapted to the other datasets.
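The adaptation described above amounts to swapping two values per benchmark. The sketch below shows one way to keep them together and build a `spark-submit` command from them. Every class name, path, and jar name in it is a hypothetical placeholder, and the `-dataPath=` argument format is assumed for illustration; take the real values from the getting-started guide you follow.

```python
# Minimal sketch: switch between the three benchmarks by changing only
# EXAMPLE_CLASS and dataPath. All class names, paths, and the jar name
# are hypothetical placeholders, not values from this repository.
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    example_class: str  # value to use for EXAMPLE_CLASS
    data_path: str      # value to use for dataPath


EXAMPLES = {
    "mortgage": Example("example.mortgage.Main", "/data/mortgage/train"),
    "taxi": Example("example.taxi.Main", "/data/taxi/train"),
    "agaricus": Example("example.agaricus.Main", "/data/agaricus/train"),
}


def submit_command(name, app_jar="sample_xgboost_apps.jar"):
    """Build a spark-submit argument list for the chosen benchmark."""
    example = EXAMPLES[name]
    return [
        "spark-submit",
        "--class", example.example_class,
        app_jar,
        # Application argument format is an assumption for this sketch.
        f"-dataPath={example.data_path}",
    ]


if __name__ == "__main__":
    print(" ".join(submit_command("taxi")))
```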

3. TensorFlow training on Horovod Spark example guide

We provide a Criteo benchmark to demonstrate ETL and deep learning training on Horovod Spark. Please refer to this guide.

4. PCA example guide

This is an example of the GPU-accelerated PCA algorithm running on Spark. For detailed information, please refer to this guide.

5. MIG support

We provide guides on the Multi-Instance GPU (MIG) feature available on NVIDIA Ampere architecture GPUs (such as the NVIDIA A100 and A30).

6. Spark RAPIDS UDF examples

These are examples of GPU-accelerated UDFs. Please refer to this guide.

API

1. XGBoost examples API

These guides focus on the GPU-related Scala and Python APIs.

Troubleshooting

You can troubleshoot issues using the following guides.

Troubleshooting XGBoost

Contributing

See the Contributing guide.

Contact Us

Please see the RAPIDS website for contact information.

License

This content is licensed under the Apache License 2.0.
