ETL9 makes data pipelining fun.
Simple all-in-one container for FaaS-based ETL pipelines.
```bash
docker-compose up
```
- Simple Declarative Interface
- Automatically generated web-based documentation
- Input/Output Type Checking
- GUI for configuring pipelines & types
- API for type-checking and running pipeline jobs
- Hot reloading YAML or JSON configuration
- One docker run and you're ready to start building
- Easy visual debugging, inspection and "pipeline breakpoints"
- Progressive data input (partially completed stages)
- Built to scale for large workloads
- endpoint: A service endpoint, usually a URL
- stage: A type of data transformation
- stage function: The function executed to progress a stage. Every stage function takes 2 arguments, the input and the current state of the stage, and returns 2 objects, one indicating the new state and one indicating the output (see the sketch after this list).
- pipeline: Configuration of stage(s) and connections to produce a desired output
- instance: A running pipeline. Many instances can run from a single pipeline definition.
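Expressed as a rough TypeScript signature (these type names are illustrative, not part of ETL9):

```typescript
// Illustrative only; ETL9 does not ship these type definitions.
type StageInputs = { [inputKey: string]: { value: any } }
type StageOutputs = { [outputKey: string]: { value: any } }

type StageFunction = (
  inputs: StageInputs, // the stage's current input values
  state: any           // the state returned by the previous invocation
) => {
  state?: any           // the new state to persist
  output?: StageOutputs // the output to pass to connected stages
}
```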
- **Minimize side effects**: A stage function should not maintain state beyond what it receives as input and what it returns. State held in other services should be avoided (though it is sometimes necessary, e.g. when uploading to an S3 bucket).
- **Keep payloads small**: Don't store large amounts of data (e.g. file contents) within the state of each stage.
- **Return quickly**: A stage function should complete some amount of progress and return immediately, rather than doing all available work before returning. In some cases this can create a lot of network overhead; if network overhead is an issue, complete as much work as possible in 10 seconds, then return.
Often data is completed incrementally as a stage progresses. ETL9 will move progressive data onto subsequent stages that support it, resulting in faster pipeline completion.
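As a concrete illustration of these practices, here is a minimal sketch of a progressive stage function in TypeScript. It follows the request/response contract described later in this doc; `listNextBatch` and its paging behavior are assumptions for illustration only:

```typescript
// Hypothetical helper: returns one page of S3 listings plus a cursor.
declare function listNextBatch(
  source: string,
  cursor: string | null
): Promise<{ files: Array<{ url: string }>; nextCursor: string | null }>

// One small unit of work per invocation ("return quickly"), emitting
// progressive output so downstream stages can start early.
async function pullFromS3(
  inputs: { s3_source: { value: string } },
  state: { cursor: string | null } | null
) {
  const { files, nextCursor } = await listNextBatch(
    inputs.s3_source.value,
    state?.cursor ?? null
  )

  return {
    // Persist only a small cursor, never the file contents themselves.
    state: { cursor: nextCursor },
    // Progressive output: downstream stages that support progressive
    // data can begin consuming these files immediately.
    output: { files: { value: files } },
    complete: nextCursor === null
  }
}
```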
The config directory will automatically be generated with some boilerplate code. This doc uses YAML, but any configuration file can be written in JSON as well.
A configuration file can be placed anywhere in the config directory (e.g. within subdirectories)
```yaml
kind: PipelineTemplate
name: S3TestPipeline
stages:
  pull_s3:
    type: PullFromS3
    inputs:
      s3_source:
        value: s3://example-bucket
  output_logger:
    type: LogOutput
    inputs:
      input:
        node: pull_s3
        output: files
---
kind: Stage
name: PullFromS3
inputs:
  s3_source:
    type: string
    regex_validator: ^s3://.*
  s3_creds:
    type: S3Credentials
    optional: yes
outputs:
  files:
    type: FileList
    progressive: yes
---
kind: Type
name: S3Credentials
superstruct: |
  {
    secretAccessKey: 'string',
    accessKeyId: 'string'
  }
---
kind: Type
name: FileList
superstruct: |
  [{
    url: 'string'
  }]
```
In production, you'll want to use a persistent database external to the container. Use the `PG_*` environment variables to configure your own database (see the example after the table).
| Environment Variable | Purpose |
| --- | --- |
| `PG_HOST` | Postgres database host, e.g. `localhost` |
| `PG_USER` | Database user |
| `PG_PASS` | Database password |
| `PG_PORT` | Database port |
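For example, a minimal sketch of wiring these variables into a compose file (the service and image names here are assumptions; adapt them to your setup):

```yaml
# Hypothetical compose file; service and image names are assumptions.
version: "3"
services:
  etl9:
    image: etl9
    environment:
      PG_HOST: db.example.com
      PG_USER: etl9
      PG_PASS: ${PG_PASS} # read the secret from the host environment
      PG_PORT: "5432"
```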
This repository is made up of several services managed by lerna. Here are the main services and their descriptions...
| Service | Port/Endpoint | Description |
| --- | --- | --- |
| gui | `:9100`, `/*` | NextJS user interface for managing pipelines |
| instance-controller | `:9101`, `/api/instance-controller` | Controls instance lifecycle |
| database | | Database holding the state of all active pipelines/stages |
| database-rest-api | `:9102`, `/api/db` | A REST API for the database. Does not perform type checking. |
| stage-api | `:9103`, `/api/stage` | Invoke stage functions or create stages. |
| typecheck-api | `:9104`, `/api/typecheck` | Type-checking API |
| builtin-stages | `:9105`, `/api/builtin-stages` | Use ETL9 builtin stages. |
| config-sync | | Monitors the filesystem and loads configuration files |
| reverse-proxy | `:9123`, `*` | Reverse proxy; routes requests to the correct service |
An instance is created from a pipeline definition. The instance will perform the following steps until it is complete:

1. Iterate over all of its stages.
2. If input is available for a stage, execute the stage function.
3. Store the resulting state and output of each stage function.
4. If all stages are complete, the instance is done. If not, repeat from step 1.
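A simplified sketch of that loop (all helper functions here are assumptions standing in for the instance-controller's internals):

```typescript
interface StageNode {
  name: string
  complete: boolean
}

// Assumed helpers; placeholders for the instance-controller's internals.
declare function inputAvailable(stage: StageNode): boolean
declare function invokeStageFunction(
  stage: StageNode
): Promise<{ state: any; output: any; complete?: boolean }>
declare function persist(
  stage: StageNode,
  result: { state: any; output: any }
): Promise<void>

async function runInstance(stages: StageNode[]) {
  // Repeat until every stage reports completion (step 4).
  while (!stages.every((s) => s.complete)) {
    for (const stage of stages) {                     // step 1
      if (stage.complete || !inputAvailable(stage)) continue
      const result = await invokeStageFunction(stage) // step 2
      await persist(stage, result)                    // step 3
      stage.complete = result.complete ?? false
    }
  }
}
```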
- LogOutput: Logs output to the ETL9 log database for viewing in the ETL9 GUI.
- RunContainer (*in progress*): Runs a (possibly long-running) docker container process. Requires the container to have access to the docker daemon.
A stage function receives a `POST` request with a JSON body of the following form...
```js
{
  // ID of instance being handled
  instance_id: string,
  // Input stages
  inputs: {
    [InputKey: string]: {
      value: any
    }
  },
  // State returned by the last successful request for this instance
  state: any
}
```
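For example, the `PullFromS3` stage defined earlier might receive a body like the following (the ID and state values are invented for illustration):

```json
{
  "instance_id": "inst_01",
  "inputs": {
    "s3_source": { "value": "s3://example-bucket" }
  },
  "state": { "cursor": "page-2" }
}
```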
The stage function should return an object of the following form...
```js
{
  // Updated state
  state?: any,
  // Updated output
  output?: {
    [OutputKey: string]: {
      value: any
    }
  },
  // Number between 0 and 1 indicating closeness to completion
  progress?: number,
  // Boolean indicating whether or not this stage is complete
  complete?: boolean
}
```
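Putting both shapes together, a stage function can be a small HTTP handler. Here is a minimal sketch using Node's built-in `http` module; the counting logic and port are made up, and only the request/response shapes come from the contract above:

```typescript
import * as http from "http"

// Minimal stage function sketch: counts to 10 across invocations.
const server = http.createServer((req, res) => {
  let body = ""
  req.on("data", (chunk) => (body += chunk))
  req.on("end", () => {
    const { state } = JSON.parse(body)

    // One small unit of work per request ("return quickly").
    const count = (state?.count ?? 0) + 1

    res.setHeader("Content-Type", "application/json")
    res.end(
      JSON.stringify({
        state: { count },                     // persisted for the next call
        output: { result: { value: count } }, // made available downstream
        progress: count / 10,                 // 0..1 completion estimate
        complete: count >= 10
      })
    )
  })
})

server.listen(8080)
```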