thoppe / dspipe

Easy to use data science pipes

pip install dspipe

Shuffling your data from one directory into another directory is easier than before!

On its own, a Pipe calls a function in parallel using joblib. By default it uses all available cores:

from dspipe import Pipe

def add2(x):
    return x + 2

result = Pipe(range(3))(add2)

# result = [2, 3, 4]
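For comparison, here is a rough sketch of the same computation written directly against joblib. It only illustrates the kind of call a Pipe wraps; it is not dspipe's actual implementation, and the helper name parallel_map is made up for this example.

```python
from joblib import Parallel, delayed


def add2(x):
    return x + 2


def parallel_map(func, items, n_jobs=-1):
    # Hypothetical helper, not part of dspipe.
    # n_jobs=-1 asks joblib to use every available core;
    # n_jobs=1 runs the calls one after another in the current process.
    return Parallel(n_jobs=n_jobs)(delayed(func)(x) for x in items)


if __name__ == "__main__":
    # joblib returns results in the same order as the inputs
    print(parallel_map(add2, range(3)))
```

The Pipe version above gets the same result without the Parallel/delayed boilerplate.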

But it can do so much more! For a more involved example, say you have a bunch of XML files and you'd like to parse them into clean CSVs.

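import pandas as pd
from dspipe import Pipe

def compute(f0, f1):
    # Read the XML file f0 into a dataframe.
    # Assumes flat, record-style XML; pandas.read_xml needs pandas 1.3+.
    df = pd.read_xml(f0)
    # Write the dataframe to the output path f1
    df.to_csv(f1)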

P = Pipe("data/xml", "data/parsed_csv", input_suffix='.xml', output_suffix='.csv')

P(compute, -1)

This runs the pipe for every XML file in data/xml and creates the directory data/parsed_csv. For each input file, compute receives the input path as f0 and a new output filename as f1, with .xml replaced by .csv. The second argument sets the number of parallel jobs: -1 uses all cores, and 1 runs single-threaded.
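The filename mapping described above can be pictured with a small standalone sketch. This is an assumption about the behaviour based on the description, not dspipe's own code, and output_path is a hypothetical helper.

```python
from pathlib import Path


def output_path(f0, dest_dir, input_suffix=".xml", output_suffix=".csv"):
    # Hypothetical helper, not part of dspipe.
    # Keep only the filename, drop the input suffix if it is present,
    # then place the renamed file inside the destination directory.
    name = Path(f0).name
    if input_suffix and name.endswith(input_suffix):
        name = name[: -len(input_suffix)]
    return Path(dest_dir) / (name + output_suffix)


if __name__ == "__main__":
    # data/xml/report.xml maps to report.csv inside data/parsed_csv
    print(output_path("data/xml/report.xml", "data/parsed_csv"))
```

In this picture, compute would be called once per input file with the pair (f0, f1).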

Other options and defaults for Pipe

  • shuffle = True: shuffles the input data before compute
  • progressbar = True: turns the progress bar on/off
  • prefilter = True: runs through the input filenames ahead of time
  • autorename = True: turns the output renaming on/off
