danicat / oxford-ai-gcp

This is the supporting material for my talk "Data Engineering on GCP"


oxford-ai-gcp

This is the supporting material for my talk "Data Engineering on GCP", presented at Oxford University as part of the course Artificial Intelligence: Cloud and Edge Implementations.

Presentation Slides

Available here.

Getting Started

This repo has been tested on Ubuntu Linux and Mac OS X.

If you are using Windows 10, you can run Ubuntu through the Windows Subsystem for Linux (WSL).

The following steps assume you have python3 installed.

Setup

Google Cloud SDK

Follow the instructions at https://cloud.google.com/sdk/docs/downloads-interactive.

Linux (Ubuntu)

Install Java 8, which is required to run PySpark locally:

sudo apt install openjdk-8-jdk
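If you want to confirm the JDK is picked up before launching PySpark, a quick check along these lines works. Note that the JAVA_HOME path below is the default install location for openjdk-8-jdk on Ubuntu and may differ on your system:

```shell
# Check which Java the shell resolves; PySpark needs a Java 8 JDK
java -version

# Point JAVA_HOME at the OpenJDK 8 install (Ubuntu default path; adjust if needed)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
echo "JAVA_HOME set to $JAVA_HOME"
```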

Create a python3 virtual env before running any of the sample code:

python3 -m venv venv

If the venv module is not available, you may need to install the python3-venv package:

sudo apt-get install python3-venv
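Once the package is installed, you can verify that the environment is created correctly; the directory name venv here matches the command above, but any name works:

```shell
# Create the environment and confirm it has its own interpreter
# and activate script under venv/bin
python3 -m venv venv
ls venv/bin/activate
venv/bin/python --version
```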

Mac OS X

TBD

Running the Examples

To activate the environment, use:

source venv/bin/activate

With the virtual env activated, install the dependencies from the requirements file:

pip install -r requirements.txt
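To confirm the packages landed inside the virtual env rather than the system Python, you can check where pip resolves from (the expected path assumes the env was created as ./venv, as above):

```shell
# With the venv active, pip should resolve from inside ./venv
which pip      # expect a path ending in venv/bin/pip
pip list       # shows the packages installed from requirements.txt
```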

To deactivate the environment when you have finished working, run the command deactivate.

References

  1. Mining of Massive Datasets
  2. Why performance matters
  3. Spark BigQuery connector

About

License: MIT License


Languages

Shell 55.4%, Python 44.6%