wa5znu / linux-jupyter-pyspark

Run PySpark with Jupyter on Linux, using containers.

Jupyter PySpark Notebook

Introduction

  • Following this
  • Target local Linux
  • Future: hopefully make it easier to run Jupyter or a similar notebook with PySpark on AWS/GC
  • Non-target: Mac laptop

Requirements

  • Recent Linux server with four or more cores.
  • Podman, aliased to docker
  • A directory to save state into
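The requirements assume `docker` resolves to Podman. One way to arrange that is sketched below (a shell function is an assumption on my part; `install-podman.sh` or a distro `podman-docker` compatibility package may already handle this):

```shell
# Make "docker" dispatch to podman in the current shell.
# (A shell function, unlike an alias, also works inside scripts.)
docker() { podman "$@"; }
```

Persist it in your shell rc file if you want it in every session.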

Installation

  • mkdir $WORKBOOK
  • ed conf.sh # add $WORKBOOK
  • sudo ./install-podman.sh
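conf.sh itself is not reproduced here; a minimal sketch of what it plausibly contains (the variable name comes from the steps above, the path is an assumption):

```shell
# conf.sh -- hypothetical contents; point WORKBOOK at your state directory
WORKBOOK="$HOME/workbook"
```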

Running

  • ./doit.sh
  • Visit the 127.0.0.1 link it prints out
  • Add these cells
# Setup
from pyspark.sql import SparkSession

# local = run on this host only; * = use all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()

# add your imports here
from pyspark.sql.functions import *
# Read your data (first row is a header)
df = spark.read.option("header", True).csv("10000.csv")
# Analyze your data
df.count()
# df.limit(10).toPandas()
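doit.sh is not shown in this snapshot; a plausible sketch of what it does, assuming the stock jupyter/pyspark-notebook image (the image name, port, and mount point are guesses, not taken from the repo):

```shell
# doit.sh sketch -- start a containerized Jupyter+PySpark, saving state
# into $WORKBOOK (":Z" relabels the volume on SELinux systems)
. ./conf.sh
podman run --rm -p 8888:8888 \
  -v "$WORKBOOK:/home/jovyan/work:Z" \
  docker.io/jupyter/pyspark-notebook
```

The container prints a tokenized 127.0.0.1 URL on startup, which matches the "Visit the 127.0.0.1 link" step above.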


License: MIT License


Languages

Language: Shell 100.0%