jeremyjordan / megatron

Lightweight, customizable feature pipeline represented as a directed acyclic graph.

Home page: https://megatron.readthedocs.io


Megatron: Machine Learning Pipelines

Megatron is a Python module for building data pipelines that encapsulate the entire machine learning process, from raw data to predictions.

The advantages of using Megatron:

  • A wide array of data transformations can be applied, including:
    • Built-in preprocessing transformations such as one-hot encoding, whitening, time-series windowing, etc.
    • Any custom transformations you want, provided they take in Numpy arrays and output Numpy arrays.
    • Sklearn preprocessors, unsupervised models (e.g. PCA), and supervised models. Basically, anything from sklearn.
    • Keras models.
  • For Keras users, the API will feel familiar: Megatron's API is heavily inspired by the Keras functional API, where each data transformation (whether a simple one-hot encoding or an entire neural network) is applied as a Layer.
  • Since all datasets should be versioned, Megatron allows you to name and version your pipelines and associated output data.
  • Pipeline outputs can be cached and looked up easily for each pipeline and version.
  • The pipeline can be elegantly visualized as a graph, showing connections between layers similar to a Keras visualization.
  • Data and input layer shapes can be loaded from structured data sources including:
    • Pandas dataframes.
    • CSVs.
    • SQL database connections and queries.
  • Pipelines can either take in and produce full datasets, or take in and produce batch generators, for maximum flexibility.
  • Pipelines support eager execution for immediate examination of data and simpler debugging.
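The "pipeline as a directed acyclic graph" idea above can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Megatron's actual API (Megatron uses Keras-style Layer calls on Input nodes); the `Node` class and names here are hypothetical:

```python
import numpy as np

# Minimal sketch of a feature pipeline as a DAG: each node either
# reads a named raw input or applies a function to its parents' outputs.
class Node:
    def __init__(self, fn, *parents):
        self.fn = fn          # input name (leaf) or transformation (internal)
        self.parents = parents

    def run(self, feed):
        if not self.parents:
            # Leaf nodes read raw data from the feed dict by name.
            return feed[self.fn]
        # Internal nodes apply their function to their parents' outputs.
        return self.fn(*(p.run(feed) for p in self.parents))

# Build a tiny graph: standardize one input, then add it to another.
x = Node('x')
y = Node('y')
standardized = Node(lambda a: (a - a.mean()) / a.std(), x)
combined = Node(lambda a, b: a + b, standardized, y)

result = combined.run({'x': np.array([1.0, 2.0, 3.0]), 'y': np.zeros(3)})
```

Running the graph pulls data through every transformation from the raw inputs to the requested output, which is the execution model a pipeline library manages for you.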

Installation

To install megatron, just grab it from pip:

pip install megatron

There's also a Docker image available with all dependencies and optional dependencies installed:

docker pull ntaylor22/megatron

Optional Dependencies

  • Scikit-Learn
    • If you'd like to use Sklearn transformations as Layers.
  • Keras
    • If you'd like to use Keras models as Layers.
  • Pydot
    • If you'd like to be able to visualize pipelines.
    • Note: requires GraphViz to run.

Tutorial

See the project documentation for an in-depth tutorial where you can build this simple example:

Example pipeline visualization: https://raw.githubusercontent.com/ntaylorwss/megatron/master/img/keras.png

Also check out the example notebooks, which you can conveniently launch from Binder by clicking on the badge below.

Custom Layers

If you have a function that takes in Numpy arrays and produces Numpy arrays, you have two possible paths to adding it as a Layer in a Pipeline:

  1. The function has no parameters to learn, and will always return the same output for a given input. We refer to this as a "stateless" Layer.
  2. The function learns parameters (i.e. needs to be "fit"). We refer to this as a "stateful" Layer.
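The stateless/stateful distinction can be made concrete with plain NumPy. These are hypothetical standalone examples (a log transform and a min-max scaler), not Megatron Layers:

```python
import numpy as np

# Stateless: always returns the same output for a given input;
# there is nothing to learn, so no fit step is needed.
def log1p_transform(X):
    return np.log1p(X)

# Stateful: must be fit first to learn parameters (here, min and range),
# which are then reused when transforming new data.
class MinMaxScale:
    def fit(self, X):
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        return (X - self.min_) / self.range_

train = np.array([[0.0], [5.0], [10.0]])
scaler = MinMaxScale().fit(train)
scaled = scaler.transform(np.array([[2.5]]))  # scaled with train's min/range
```

The key point is that the stateful transform's output on new data depends on what it saw during fit, which is why the two cases get different treatment as Layers.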

Custom Stateful Layers

To create a custom stateful Layer, inherit from the StatefulLayer base class and implement two methods: fit (or partial_fit) and transform. Here's an example with a whitening Layer:

class Whiten(megatron.layers.StatefulLayer):
    def fit(self, X):
        # Learned parameters go in the provided self.metadata dict.
        self.metadata['mean'] = X.mean(axis=0)
        self.metadata['std'] = X.std(axis=0)

    def transform(self, X):
        # Retrieve the learned parameters to standardize new data.
        return (X - self.metadata['mean']) / self.metadata['std']

There are a couple of things to know here:

  • When you calculate parameters during the fit, you store them in the provided dictionary self.metadata. You then retrieve them from this dictionary in your transform method.
  • If your Layer can be fit iteratively, you can override partial_fit rather than fit; if your transformation cannot be fit iteratively, override fit. Note that Layers without a partial_fit cannot be used with data generators, and will raise an error in that situation.
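Why partial_fit enables generators: an iterative fit accumulates running statistics batch by batch, so the full dataset never has to be in memory at once. A hedged NumPy sketch of the idea (a running-mean centering layer with a metadata dict mirroring the class above; not Megatron's implementation):

```python
import numpy as np

# Sketch of an iteratively fittable layer: accumulates sufficient
# statistics (sum and count) from each batch, so it works with
# generators that yield one batch at a time.
class RunningMeanCenter:
    def __init__(self):
        self.metadata = {'sum': 0.0, 'count': 0}

    def partial_fit(self, X):
        # Each call folds one batch into the running statistics.
        self.metadata['sum'] += X.sum(axis=0)
        self.metadata['count'] += X.shape[0]

    def transform(self, X):
        mean = self.metadata['sum'] / self.metadata['count']
        return X - mean

layer = RunningMeanCenter()
for batch in (np.array([[1.0], [2.0]]), np.array([[3.0], [6.0]])):
    layer.partial_fit(batch)  # overall mean across both batches is 3.0
centered = layer.transform(np.array([[5.0]]))
```

A transformation like a median, by contrast, cannot be computed this way from batch-level statistics alone, which is why some Layers only offer fit.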

Custom Stateless Layers

To create a custom stateless Layer, you can simply define your function and wrap it in megatron.layers.Lambda. For example:

import numpy as np

def dot_product(X, Y):
    return np.dot(X, Y)

dot_xy = megatron.layers.Lambda(dot_product)([X_node, Y_node], 'dot_product_result')

That's it, a simple wrapper.
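Because the wrapped function is stateless, it can be sanity-checked on plain arrays before going into a pipeline; the same inputs always give the same output, with no fit step:

```python
import numpy as np

def dot_product(X, Y):
    return np.dot(X, Y)

a = np.array([[1.0, 2.0]])
b = np.array([[3.0], [4.0]])
out1 = dot_product(a, b)  # 1*3 + 2*4 = 11
out2 = dot_product(a, b)  # identical: nothing was learned between calls
```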

Why is it called Megatron?

Because the layers are data transformers!

That's... that's about it.

License

MIT.
