DEPRECATED. This repository is no longer maintained. The materials have been ported over to spark-bootcamp.
This repository contains code samples for the Apache Spark/Spark MLlib Workshop @ SDSC 2018.
- Scala Version: 2.11.12 (You can downgrade, but I recommend 2.11.8 or higher)
- Recommended IDE: IntelliJ
- `Dataset` Join Examples
- Machine Learning: A simple `NaiveBayes`-based spam detector
  - Note that the dataset for training is in the resources folder.
- Spark Streaming Example: A simple streaming job that counts the number of occurrences of each word in a stream.
- Task Not Serializable Example
- More examples in Databricks Notebooks:
  - Scala Examples: Contains some Scala basics.
  - `RDD` Examples
  - `DataFrame` Examples
  - `Dataset` Examples
  - Machine Learning: Same as the machine learning example in this repository.
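
For readers who want a feel for the Spark Streaming example listed above, here is a minimal sketch of a streaming word count. It is not the code from this repository: the object name, the `localhost:9999` socket source, and the one-second batch interval are all assumptions chosen for illustration, and it assumes the `spark-streaming` dependency is on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative sketch only; names, host, and port are assumptions.
object StreamingWordCountSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the socket receiver, one for processing.
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("StreamingWordCountSketch")

    // Group incoming data into one-second micro-batches (assumed interval).
    val ssc = new StreamingContext(conf, Seconds(1))

    // Read lines of text from a TCP socket (assumed source).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words and count each word within the current batch.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Print the first few counts of every batch to the console.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

To try it, something like `nc -lk 9999` in another terminal can serve as the text source. Note that the counts in this sketch are per batch, not running totals.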
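
The "Task Not Serializable" item above refers to a common Spark error: a closure passed to a transformation captures an object that Spark cannot serialize and ship to the executors. The following is a small sketch of how the error typically arises and one common way around it. The `Multiplier` class and the `scaleAll` helper are hypothetical names made up for this illustration and are not taken from the repository.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper class; it deliberately does NOT extend Serializable.
class Multiplier(val factor: Int)

object TaskNotSerializableSketch {

  // Multiplies the numbers 1 to 5 by the multiplier's factor on the cluster.
  def scaleAll(sc: SparkContext, multiplier: Multiplier): Array[Int] = {
    val numbers = sc.parallelize(1 to 5)

    // Problem: this closure would capture `multiplier`, which is not serializable,
    // so Spark would fail with "SparkException: Task not serializable".
    // numbers.map(_ * multiplier.factor).collect()

    // One common fix: copy the needed field into a local val,
    // so the closure captures only a plain Int.
    val factor = multiplier.factor
    numbers.map(_ * factor).collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("TaskNotSerializableSketch")
    val sc = new SparkContext(conf)

    println(scaleAll(sc, new Multiplier(3)).mkString(", "))

    sc.stop()
  }
}
```

Other fixes exist as well, such as making the class extend `Serializable` or constructing the object inside the closure. Which one fits best depends on what the object holds.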