Tensorpack DataFlow

Tensorpack DataFlow is an efficient and flexible data loading pipeline for deep learning, written in pure Python.

Its main features are:

  1. Highly optimized for speed. Parallelization in Python is hard, and most libraries get it wrong. DataFlow implements highly optimized parallel building blocks that give you an easy interface for parallelizing your workload (a sketch follows this list).

  2. Written in pure Python. This allows it to be used together with any other Python-based library.
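
As a minimal sketch of what those parallel building blocks look like in practice, the snippet below assumes the DataFromList, MultiProcessMapData, and MultiProcessRunnerZMQ classes from the built-in DataFlows; slow_transform is a hypothetical stand-in for a CPU-heavy step:

import dataflow as D

def slow_transform(dp):
    # hypothetical CPU-heavy step (decoding, augmentation, ...)
    img, label = dp
    return [img * 2, label]

ds = D.DataFromList([[i, i % 10] for i in range(1000)], shuffle=False)

# Run the map function in 4 worker processes:
ds = D.MultiProcessMapData(ds, num_proc=4, map_func=slow_transform)

# Alternative: replicate the whole upstream pipeline in 4 processes
# and pull results over ZMQ (each worker then iterates its own copy):
# ds = D.MultiProcessRunnerZMQ(ds, num_proc=4)

ds.reset_state()
for img, label in ds:
    pass  # consume datapoints

MultiProcessMapData parallelizes only the mapping step, while MultiProcessRunnerZMQ runs the entire upstream pipeline in each worker; the Parallel DataFlow tutorial listed below discusses when each is appropriate.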

DataFlow was originally part of the tensorpack library and has been through many years of polishing. Given its independence from the rest of tensorpack, it is now a separate library whose source code is synced with tensorpack's. Please use tensorpack issues for support.

Why would you want to use DataFlow instead of a platform-specific data loading solution? We recommend reading Why DataFlow?.

Install:

pip install --upgrade git+https://github.com/tensorpack/dataflow.git
# or add `--user` to install to user's local directories

You may also need to install OpenCV, which is used by many built-in DataFlows.
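
For example, assuming the prebuilt opencv-python wheel works for your platform:

pip install opencv-python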

Examples:

import dataflow as D
d = D.ILSVRC12('/path/to/imagenet')  # produce [img, label]
d = D.MapDataComponent(d, lambda img: some_transform(img), index=0)
# map_func receives the whole datapoint (a list of components):
d = D.MultiProcessMapData(d, num_proc=10,
                          map_func=lambda dp: other_transform(dp[0], dp[1]))
d = D.BatchData(d, 64)
d.reset_state()
for img, label in d:
  pass  # train with the batched img and label
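
Beyond the built-in loaders, any iterator of datapoints can be wrapped as a DataFlow. Below is a minimal sketch of a custom DataFlow, assuming the base-class interface described in the "Write a DataFlow" tutorial; the CSV file format and path here are hypothetical:

import dataflow as D

class MyCSVFlow(D.DataFlow):
    """Yield [feature_list, label] datapoints from a CSV file."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                # each row: comma-separated features, label last
                *features, label = line.strip().split(',')
                yield [[float(x) for x in features], int(label)]

ds = MyCSVFlow('/path/to/data.csv')
ds = D.BatchData(ds, 32, remainder=True)
ds.reset_state()

Because a DataFlow is just a Python iterator, it composes with the same wrappers (MapDataComponent, MultiProcessMapData, BatchData, ...) shown above.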

Documentation:

Tutorials:

  1. Basics
  2. Why DataFlow?
  3. Write a DataFlow
  4. Parallel DataFlow
  5. Efficient DataFlow

APIs:

  1. Built-in DataFlows
  2. Built-in Datasets

Support & Contributing

Please send issues and pull requests (for the dataflow/ directory) to the tensorpack project, where the source code is developed.

License: Apache License 2.0