Spark: The Definitive Guide

This is the central repository for all materials related to Spark: The Definitive Guide by Bill Chambers and Matei Zaharia.

This repository is a work in progress, and new material will be added over time.

Code from the book

You can find the code from the book in the code subfolder, where it is organized by language and chapter.

How to run the code

Run on your local machine

To run the examples on your local machine, either copy all of the data in the data subfolder to /data on your computer, or change the path in each example to point to wherever that dataset lives on your machine.
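
One way to keep both options open is to resolve the data directory in one place and build every dataset path from it. The sketch below is only an illustration and is not code from the book: the helper name dataset_path, the environment variable SPARK_GUIDE_DATA, and the relative path used in the usage note are all assumptions made for this example.

```python
import os
import posixpath

# Default location suggested in this README.
DEFAULT_DATA_ROOT = "/data"


def dataset_path(relative, data_root=None):
    """Build the full path to a dataset under the chosen data directory.

    Precedence: the data_root argument, then the SPARK_GUIDE_DATA
    environment variable (a hypothetical name), then /data.
    """
    root = data_root or os.environ.get("SPARK_GUIDE_DATA") or DEFAULT_DATA_ROOT
    # posixpath.join keeps forward slashes regardless of platform.
    return posixpath.join(root, relative.lstrip("/"))
```

With a helper like this, an example would load its data from dataset_path("example-dataset/file.csv") (a placeholder path) instead of a hard-coded location, so moving the data only means changing one value.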

Run on Databricks

Databricks is a zero-management cloud platform that provides:

  • Fully managed Spark clusters
  • An interactive workspace for exploration and visualization
  • A production pipeline scheduler
  • A platform for powering your favorite Spark-based applications

All the examples run on Databricks Runtime 3.1 and above. To get a free Databricks account, go to Try Databricks.

On a Databricks cluster, the data required to run the examples is available in /databricks-datasets/definitive-guide/data.

To run a notebook from the code folder on Databricks, import it, update the data path to /databricks-datasets/definitive-guide/data, and attach it to a cluster. For details, see Notebooks.
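
The path update amounts to swapping the local /data prefix for the Databricks one. Here is a minimal sketch, assuming the notebook refers to its data with a leading /data prefix. The function name to_databricks_path is hypothetical and is not part of any Databricks API.

```python
DATABRICKS_DATA_ROOT = "/databricks-datasets/definitive-guide/data"
LOCAL_DATA_ROOT = "/data"


def to_databricks_path(path, local_root=LOCAL_DATA_ROOT):
    """Rewrite a path under local_root to the Databricks dataset location.

    Paths that are not under local_root are returned unchanged.
    """
    root = local_root.rstrip("/")
    if path == root:
        return DATABRICKS_DATA_ROOT
    # Match on "root/" so that look-alike prefixes such as /database are left alone.
    if path.startswith(root + "/"):
        return DATABRICKS_DATA_ROOT + path[len(root):]
    return path
```

Matching on the prefix followed by a slash, rather than on the bare string /data, avoids accidentally rewriting paths that are already under /databricks-datasets.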
