Spark’s library for machine learning is called MLlib (Machine Learning library). Its design borrows heavily from scikit-learn’s ideas on pipelines.
For this example, we will use a very basic dataset: the Titanic dataset. Hopefully you are all familiar with the case and the data. To start, we have to download the data from Kaggle, from the competition “Titanic: Machine Learning from Disaster” (www.kaggle.com).
Just download the “train.csv” file; it is also available in this repo.
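Before handing the file to Spark, it can be worth confirming that the download landed where you expect and that it parses as CSV. The snippet below is a minimal sketch using only the Python standard library. The helper name `peek_csv` is hypothetical, and the script assumes “train.csv” sits in the current working directory.

```python
import csv
from pathlib import Path


def peek_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) of a CSV file.

    Hypothetical helper for a quick sanity check; not part of Spark or MLlib.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download it first")
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        # First line is treated as the header; an empty file yields [].
        header = next(reader, [])
        # Read at most n_rows data rows without loading the whole file.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    # Assumption: train.csv was saved next to this script / in the cwd.
    header, rows = peek_csv("train.csv")
    print("columns:", header)
    for row in rows:
        print(row)
```

The `csv` module is used instead of splitting on commas because a quoted field may itself contain a comma, which a naive `line.split(",")` would break apart.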