davitacols / dataDisk

dataDisk is a Python package designed to simplify the creation and execution of data processing pipelines. It provides a flexible framework for defining sequential tasks, applying transformations, and validating data. Additionally, it includes a ParallelProcessor for efficient parallel execution.

Home Page: https://pypi.org/project/dataDisk/

Key Features

  • DataPipeline: Define a sequence of data processing tasks in a straightforward manner.
  • Transformation: Apply custom transformations to your data easily.
  • Validator: Ensure your data meets specific conditions.
  • ParallelProcessor: Execute pipeline tasks in parallel for improved performance.
  • Data Sinks: Save processed data to various formats like CSV, Excel, and SQLite.
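The features above can be pictured with a small standalone sketch. It does not use dataDisk's API: the names `run_pipeline`, `make_validator`, and `run_parallel` are hypothetical stand-ins, and only the Python standard library is assumed. It shows tasks applied in sequence, a validator that rejects bad data, and independent tasks fanned out to a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(tasks, data):
    """Apply each task to the output of the previous one (sequential pipeline)."""
    for task in tasks:
        data = task(data)
    return data


def make_validator(condition, message):
    """Build a task that passes data through unchanged or raises ValueError."""
    def validate(data):
        if not condition(data):
            raise ValueError(message)
        return data
    return validate


def run_parallel(tasks, data, max_workers=4):
    """Run independent tasks on the same input concurrently; results keep task order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task, data) for task in tasks]
        return [future.result() for future in futures]


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


# Hypothetical validation rule: the input must contain at least one item.
non_empty = make_validator(lambda values: len(values) > 0, "data must not be empty")

sequential_result = run_pipeline([non_empty, double, add_one], [1, 2, 3])
parallel_results = run_parallel([double, add_one], [1, 2, 3])
print(sequential_result)  # [3, 5, 7]
print(parallel_results)   # [[2, 4, 6], [2, 3, 4]]
```

In this sketch a thread pool is used for simplicity. Threads mainly help with I/O-bound tasks in CPython, so CPU-bound work would more likely call for a process pool.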

Installation

Install the package using pip:

pip install dataDisk

Transformations

Transformations allow you to apply various operations to your data. Here's a brief overview of available transformations:

  • Standardize: Scale features to have zero mean and unit variance.
  • Normalize: Scale features to a fixed range, typically [0, 1].
  • Label Encode: Convert categorical labels to numeric values.
  • OneHot Encode: Convert categorical labels to one-hot encoded vectors.
  • Data Cleaning: Fill missing values and encode categories.
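To make the difference between these operations concrete, here is a minimal pure-Python sketch. It is not dataDisk's implementation, and the function names are hypothetical. Normalization is taken here in its common min-max sense, and label indices are assigned in sorted order purely for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation; assumes values are not all equal
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Min-max rescale into the range [0, 1]; assumes values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels]


def one_hot_encode(labels):
    """Turn each label into a 0/1 vector with a single 1 at the label's index."""
    categories = sorted(set(labels))
    return [[1 if label == category else 0 for category in categories] for label in labels]


heights = [2.0, 4.0, 6.0]
colors = ["red", "blue", "red"]

print(standardize(heights))    # centered on 0 with unit variance
print(normalize(heights))      # [0.0, 0.5, 1.0]
print(label_encode(colors))    # [1, 0, 1]
print(one_hot_encode(colors))  # [[0, 1], [1, 0], [0, 1]]
```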

Example of a custom transformation:

from dataDisk.transformation import Transformation

# A custom transformation is an ordinary function wrapped in Transformation.
def double(x):
    return x * 2

transformation = Transformation(double)

Data Sinks

Data sinks allow you to save processed data to various formats:

  • CSVDataSink: Save data to a CSV file.
  • ExcelDataSink: Save data to an Excel file.
  • SQLiteDataSink: Save data to an SQLite database.

Example of using a data sink:

from dataDisk.data_sinks import CSVDataSink

# `data` is the processed output of your pipeline (not defined in this snippet).
csv_data_sink = CSVDataSink('output.csv')
csv_data_sink.save(data)
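For readers curious what a sink does underneath, the following standalone sketch writes the same rows to a CSV file and to a SQLite table using only the standard library. It is an illustration under assumptions, not dataDisk's code: `save_csv`, `save_sqlite`, the `results` table, and its two-column schema are all hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]


def save_csv(rows, path):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def save_sqlite(rows, path, table="results"):
    """Write the same rows into a SQLite table (hypothetical table name and schema)."""
    connection = sqlite3.connect(path)
    try:
        connection.execute(f"CREATE TABLE IF NOT EXISTS {table} (name TEXT, value INTEGER)")
        connection.executemany(
            f"INSERT INTO {table} (name, value) VALUES (:name, :value)", rows
        )
        connection.commit()
    finally:
        connection.close()


# Write both outputs into a temporary directory so the sketch leaves no files behind in the cwd.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "output.csv")
db_path = os.path.join(out_dir, "output.db")
save_csv(rows, csv_path)
save_sqlite(rows, db_path)
```

Both helpers take a target path and the data to write, roughly the same shape as the sink example above, where the path goes to the constructor and the data to `save`.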


License: Other

Languages

Python 85.5%, C++ 9.5%, Cython 3.3%, C 1.1%, HTML 0.3%, CSS 0.1%, JavaScript 0.1%, Fortran 0.1%, PowerShell 0.0%, Smarty 0.0%, Forth 0.0%, Assembly 0.0%, Meson 0.0%, Batchfile 0.0%, CMake 0.0%, Makefile 0.0%