xuwenyihust / DataPulse

DataPulse is a platform for developers to build, schedule and monitor data pipelines.


DataPulse: Platform For Big Data & AI



Features

  • Spark Application Deployment
    • Jar Application Submission
    • PySpark Application Submission
    • Jupyter Notebook
      • Customized Integration with PySpark
  • Monitoring
    • Spark UI
    • History Server

Supported Versions

  • Apache Spark: 3.5.0
  • Scala: 2.12
  • Python: 3.11
  • GCS Connector: hadoop3-2.2.0

Prerequisites

  • GCP account
    • Kubernetes Engine
    • Cloud Storage
  • gcloud SDK
  • kubectl
  • helm
  • docker
  • python3
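Before starting the Quickstart, it can help to confirm that the command-line tools listed above are installed. The following is a minimal sketch, not part of the project: the function name `check_tools` is hypothetical, and it only checks that each tool is on `PATH`, not its version or your GCP account setup.

```shell
# Hypothetical helper: report whether each named command is on PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Check the CLI tools from the Prerequisites list.
check_tools gcloud kubectl helm docker python3 \
  || echo "Install the missing tools before running bin/setup.sh."
```

The GCP services (Kubernetes Engine, Cloud Storage) are not covered by this check and still need to be enabled in your GCP project.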

Quickstart

Notebook

Step 1: Set Up Configuration

cp bin/env_template.yaml bin/env.yaml

Fill in bin/env.yaml with your own configuration values.

Step 2: Create a Kubernetes cluster on GCP

source bin/setup.sh

Step 3: Create a Jupyter Notebook

A service named notebook will be created on the Kubernetes cluster.
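To confirm that the service exists before opening the notebook, a small helper like the following may be useful. It is a sketch only: it assumes the service is named `notebook` as described above and lives in the `default` namespace, and the function name `show_notebook_service` is hypothetical. Adjust both defaults to match your cluster.

```shell
# Hypothetical helper: list a Kubernetes service so its type, cluster IP,
# external IP and ports can be inspected.
# Arguments (both optional): service name, namespace. Defaults are assumptions.
show_notebook_service() {
  local name="${1:-notebook}"
  local namespace="${2:-default}"
  kubectl get service "$name" --namespace "$namespace" -o wide
}
```

Calling `show_notebook_service` with no arguments uses the assumed defaults; pass a name and namespace (for example `show_notebook_service notebook my-namespace`) if your setup differs.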

Step 4: Check Spark Integration


Check Spark information by running the following code in a notebook cell:

start()

Step 5: Check Spark UI


Check Spark UI by clicking the link in the notebook cell output.

License

This project is licensed under the terms of the MIT license.

