CorrelAid / spark_workshop


Spark Workshop

This workshop is scheduled for July 9th, 2024.

Table of Contents
  1. About The Workshop
  2. Getting Started
  3. Workshop - Tips and Tricks
  4. License
  5. Data
  6. Contact
  7. Acknowledgments

About The Workshop

Join us for a beginner-friendly Spark workshop covering both theory and practice. We'll start with the fundamental principles of Spark: its architecture, its data processing capabilities, and its use cases. Then you'll roll up your sleeves for hands-on exercises, applying what you've learned to real-world scenarios. By the end of the workshop, you'll have a basic understanding of Spark and the confidence to start building your own data processing pipelines.

(back to top)

Built With

This section lists the main tools used to run the workshop.

  • Docker
  • GitHub

(back to top)

Getting Started

Follow these steps to set up your local environment for the workshop.

Prerequisites

Please install the following software beforehand:

Windows

Two methods are available for installing Docker Desktop on Windows: using WSL or Hyper-V. Both approaches are outlined below in Step 1.

  1. Choose one of the options below and enable the required Windows features.

    (Recommended) Running Docker Desktop on Windows with WSL

    Open Windows PowerShell as Administrator and run the following command:

    wsl --install

    The first time you launch a newly installed Linux distribution, a console window will open and you'll be asked to wait for files to decompress and be stored on your machine. Once you have installed WSL, you will need to create a user account and password for your newly installed Linux distribution.

    ⚠️ Occasionally, the Ubuntu installation may terminate unexpectedly with an error code (e.g., Error code: Wsl/InstallDistro/E_UNEXPECTED). In that case, you may need to enable virtualization (e.g., Intel VT-x / AMD-V) in your BIOS/UEFI settings or activate the "Virtual Machine Platform" option in Windows Features.


    Running Docker Desktop on Windows with Hyper-V backend

    Open Windows PowerShell as Administrator and run the following commands:

    Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V -All
    Enable-WindowsOptionalFeature -Online -FeatureName Containers -All
    Enable-WindowsOptionalFeature -Online -FeatureName VirtualMachinePlatform -All

  2. Download the Docker Desktop installer from the Docker website and follow the instructions.

  3. After installation, open Docker Desktop.

Mac/Linux

  1. Download the Docker Desktop installer from the Docker website and follow the instructions.

  2. After installation, open Docker Desktop.

Installation

Please follow these instructions to set up your local programming environment. They will ensure that you have all the necessary tools and configurations in place for a smooth and efficient development experience during the workshop.

  1. Obtain a free Personal Access Token (PAT) from https://docs.github.com/PAT and use it to authenticate yourself locally.

  2. Clone the repository of our spark workshop.

    git clone https://github.com/CorrelAid/spark_workshop.git
  3. Open a bash terminal and navigate to the root directory of your locally cloned repository. From there, install and activate the PySpark environment with the following command:

    sh run_setup.sh
  4. Open the following URL in a web browser:

    http://localhost:10001/?token=<token>
  5. Enter the token as the password. You can find the token displayed in the terminal, as illustrated in the image below:

    If you are not sure what the token is you can open another bash terminal and execute:

    docker logs pyspark_workshop | grep -o 'token=[^ ]*'

    This command filters the logs of your newly created Docker container and prints only the matches containing the token; look for the part after "token=".
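    The same idea in Python, as a small stdlib-only sketch (the `extract_token` helper and the sample log line are hypothetical, for illustration):

    ```python
    # Hypothetical helper: pull the Jupyter access token out of captured
    # container log text, mirroring the grep one-liner above.
    import re
    from typing import Optional

    def extract_token(logs: str) -> Optional[str]:
        """Return the first token=... value found in the log text, or None."""
        match = re.search(r"token=(\S+)", logs)
        return match.group(1) if match else None

    sample_log = "http://127.0.0.1:10001/?token=abc123def456"
    print(extract_token(sample_log))  # -> abc123def456
    ```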


Sample output: terminal display after running run_setup.sh, showing the token and port information.

(back to top)

Workshop - Tips and Tricks

This section features helpful links aimed at assisting with workshop tasks.

PySpark Documentation: DataFrame Functions
PySpark by Examples
PySpark: The 9 most useful functions to know
PySpark YouTube Tutorial: DataFrame Functions

When all else fails, just ask: either directly to us or to ChatGPT. 😉

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Data

The data originates from a freely available dataset on Kaggle: https://kaggle/grocery-dataset

(back to top)

Contact

Pia Baronetzky

Daniel Manny

Jie Bao - @jbao

Luisa-Sophie Gloger - @LAG1819

Project Link: https://github.com/CorrelAid/spark_workshop

(back to top)

Acknowledgments

(back to top)
