salesforce / pyplyn

ETL tool that allows you to visualize the health of historical time-series data in real-time

Home Page: https://salesforce.github.io/pyplyn/

Introduction

Pyplyn: a scalable time-series data collector

Pyplyn (meaning pipeline in Afrikaans) is an open source tool that extracts data from various sources, transforms it to give it meaning, and sends it to other systems for consumption.

Pyplyn was written to allow teams in Salesforce to power real-time health dashboards that go beyond one-to-one mapping of time-series data to visual artifacts.

One example of such a use case is ingesting time-series data stored in Argus and Refocus, processing multiple metrics together (context), and displaying the result in Refocus (as red, yellow, or green lights).

[Figure: Pyplyn system diagram]

Features

  • Simple and reliable data pipeline with support for various transformations
  • No code required, JSON-based syntax
  • Flexible multi-stage source/transformation/destination logic
  • Developed with support for extension via easy-to-grasp Java code
  • Highly available and scalable (the pipeline can be partitioned across multiple nodes)
  • Configurations can be added/updated/removed without restarting the process
  • Publishes operational metrics (errors, p95, etc.) for monitoring service health

Improvements from release 9.x

  • Faster processing speed with the use of RxJava (4.3x faster, tested on our reference dataset)
  • Cleaner code, mainly after converting models to Immutables-annotated abstract classes
  • Support mutual TLS authentication for endpoints, by specifying a Java keystore and password
  • Connect, read, and write timeouts can now be specified for each connector
  • All Jackson-based models can now be serialized (with the type specifier field)
  • AppConfig.Global.minRepeatIntervalMillis was deprecated (replaced with AppConfig.Global.runOnce)
  • Bash script for managing the service's lifecycle (start, stop, restart, logs, etc.)
  • Since 10.0.0, Pyplyn releases follow Semantic Versioning guidelines

Roadmap

We welcome ideas for improvement and bug reports, and we encourage you to submit them by opening new issues on GitHub!

Running pyplyn

Pyplyn uses Maven for its build lifecycle. At a minimum, you will need Maven and Java 8 installed on your host OS.

Consult the full prerequisites section to find out more.
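
Before building, it can help to confirm that both tools are available on the PATH. The following is a minimal sketch and not part of Pyplyn itself: the `require_cmd` helper name is hypothetical, and it only checks that a command exists, not that the installed Java is version 8.

```shell
# Hypothetical helper (not shipped with Pyplyn): succeeds if a command is on the PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Typical use before building:
#   require_cmd mvn && require_cmd java && mvn clean package
```

Chaining the checks with `&&` means the build only starts when both commands were found.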

# Clone the Pyplyn repository
git clone https://github.com/salesforce/pyplyn /tmp/pyplyn

# Build the project with Maven
cd /tmp/pyplyn
mvn clean package

# Navigate to Pyplyn's build location
cd target/

# Create a new directory for your configurations (leave empty for now)
mkdir configurations

# Rename app-config.example.json and make the required changes
mv config/app-config.example.json config/pyplyn-config.json

# Rename connectors.example.json, then edit it to configure your endpoints
mv config/connectors.example.json config/connectors.json

# Edit bin/pyplyn.sh and set _LOCATION_ to the absolute path of the build directory
#   LOCATION=/tmp/pyplyn/target

# Start pyplyn
bash bin/pyplyn.sh start

# Check the logs to confirm that the program started without throwing any exceptions
bash bin/pyplyn.sh logs

A full step-by-step explanation (including how to write configurations) can be found in the Pyplyn documentation.
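
The manual edit of `bin/pyplyn.sh` described in the steps above could also be scripted. This is a minimal sketch that assumes the script contains a line beginning with `LOCATION=`; the `set_location` function name is hypothetical and not part of Pyplyn.

```shell
# Hypothetical helper (not part of Pyplyn): rewrite the LOCATION= line of a script.
# Assumes the target file has a line starting with "LOCATION=" and that the new
# path contains no "|", "&", or backslash characters (they are special to sed here).
set_location() {
  script="$1"
  location="$2"
  tmp="$(mktemp)" || return 1
  # Write to a temporary file first, then copy the result back so the
  # original file keeps its permissions.
  sed "s|^LOCATION=.*|LOCATION=${location}|" "$script" > "$tmp" && cat "$tmp" > "$script"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example, run from the build directory:
#   set_location bin/pyplyn.sh "$(pwd)"
```

Passing `$(pwd)` from the build directory supplies the absolute path that the script expects.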

Next steps?

Consult the Pyplyn Documentation for an in-depth explanation of Pyplyn's features.

Generate Javadocs by running the following Maven target: mvn package.

If you would like to contribute to Pyplyn, please read the contributor guide!

Thank you!

About

License: BSD 3-Clause "New" or "Revised" License


Languages

Java 99.6%, Shell 0.4%