Data Factory

Data Factory is my attempt at building my own Python library for creating data pipelines with ease in my personal projects, inspired by Azure Data Factory.

Table of Contents

  • Features
  • Installation
  • Examples
  • License

Features

  • Easy-to-use Python API for defining and executing data pipelines.
  • Support for defining activities and orchestrating them in a pipeline.

Installation

Data Factory is managed with Poetry. Clone the repository and install its dependencies:

poetry install

To verify the installation works, run the unit tests:

pytest
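
If Poetry manages the project's virtual environment, you can run the same tests inside it:

poetry run pytest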

Examples

Activities and Pipelines

from data_factory.pipeline import Pipeline, Activity

# Define your activities
def activity1():
    print("Executing Activity 1")

def activity2():
    print("Executing Activity 2")

# Create activities
activity_1 = Activity("Activity 1", activity1)
activity_2 = Activity("Activity 2", activity2)

# Create a pipeline and add activities
my_pipeline = Pipeline("My Pipeline", activities=[activity_1, activity_2])

# Run the pipeline
my_pipeline.run()
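
The Activity wrapper above takes a zero-argument callable. If an activity needs parameters, a common Python pattern is to bind them up front with functools.partial; here is a minimal sketch, assuming Activity simply invokes the callable it is given (load_csv and the file path are illustrative, not part of the library):

from functools import partial

from data_factory.pipeline import Activity, Pipeline

# A parameterised task: the argument is bound before the activity is created
def load_csv(path):
    print(f"Loading {path}")

# partial(load_csv, ...) yields a zero-argument callable, as Activity expects
load_sales = Activity("Load Sales", partial(load_csv, "data/sales.csv"))

etl_pipeline = Pipeline("ETL", activities=[load_sales])
etl_pipeline.run()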

Orchestrating Pipelines

from data_factory.orchestrator import PipelineOrchestrator
from data_factory.pipeline import Pipeline, Activity

# Define one more activity for the second pipeline
def activity3():
    print("Executing Activity 3")

activity_3 = Activity("Activity 3", activity3)

# Create pipelines from Activity objects
pipeline_1 = Pipeline("Pipeline 1", activities=[activity_1, activity_2])
pipeline_2 = Pipeline("Pipeline 2", activities=[activity_3])

# Create an orchestrator and add pipelines
orchestrator = PipelineOrchestrator([pipeline_1, pipeline_2])

# Run all pipelines sequentially
orchestrator.run_pipelines(verbose=True)

# Get the run statuses of all pipelines
pipeline_statuses = orchestrator.get_run_statuses()
print(pipeline_statuses)
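
get_run_statuses() makes it possible to react to failures after a run. The exact return shape isn't shown here; the sketch below assumes it returns a mapping of pipeline name to a status string (check the library source for the actual keys and values):

# Assumption: statuses is a dict like {"Pipeline 1": "Success", ...}
statuses = orchestrator.get_run_statuses()
failed = [name for name, status in statuses.items() if status != "Success"]
if failed:
    print(f"Pipelines to investigate: {failed}")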

License

This project is licensed under the MIT License - see the LICENSE file for details.
