omartarek206/towhee

x2vec, Towhee is all you need!

What is Towhee?

Towhee is a flexible, application-oriented framework for generating embedding vectors via a pipeline of ML models and other operations. It aims to democratize x2vec, allowing everyone - from beginner developers to large organizations - to generate dense embeddings with just a few lines of code.

To accomplish this, we provide pre-built pipelines for a variety of tasks, including audio/music embeddings, image embeddings, celebrity recognition, and more. For a full list of pipelines, feel free to visit our Towhee hub. Although our initial focus is on generating embeddings, Towhee can also be used to create and share generic machine learning pipelines such as this image animation pipeline.

Key features

Easy embedding for everyone: Run a local embedding pipeline with as little as three lines of code.
Rich operators and pipelines: No more reinventing the wheel! Collaborate and share pipelines with the open source community.
Automatic versioning: Our versioning mechanism for pipelines and operators ensures that you never run into dependency hell.
Support for fine-tuning models: Feed your dataset into our Trainer and get a new model in just a few easy steps.
Deploy to cloud*: Ready-made pipelines can be deployed to the cloud with minimal effort.

Features marked with a star (*) are on our roadmap and have not yet been implemented. Help is always appreciated, so come join our Slack or check out our docs for more information.

Getting started

Towhee requires Python 3.6+. Towhee can be installed via pip:

% pip install -U pip  # if you run into installation issues, try updating pip
% pip install towhee

Towhee provides a variety of pre-built pipelines. For example, generating an image embedding takes as few as three lines of code:

>>> from towhee import pipeline

# Load the built-in image embedding pipeline and run it on an image URL
>>> embedding_pipeline = pipeline('image-embedding')
>>> embedding = embedding_pipeline('https://docs.towhee.io/img/logo.png')

Your image embedding is now stored in embedding. It's that simple.

For image datasets, users can also build their own pipelines with the DataCollection API:

import towhee

towhee.glob('./*.jpg') \
      .image_decode() \
      .image_embedding.timm(model_name='resnet50') \
      .to_list()

Here, image_decode and image_embedding.timm are operators from the Towhee hub. The method-chaining programming interface also supports parallel execution and exception handling.
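
To make the method-chaining idea concrete, here is a minimal standard-library sketch of a chainable collection with optional parallel mapping and per-item exception capture. The class ChainSketch and its methods map, safe_map, and to_list are hypothetical names chosen for illustration; this is not Towhee's DataCollection implementation, and its real API may behave differently.

```python
from concurrent.futures import ThreadPoolExecutor


class ChainSketch:
    """Hypothetical stand-in for a method-chaining collection (not Towhee's DataCollection)."""

    def __init__(self, items):
        self._items = list(items)

    def map(self, fn, workers=1):
        # Apply fn to every item; use a thread pool when workers > 1.
        # pool.map preserves the input order of the results.
        if workers > 1:
            with ThreadPoolExecutor(max_workers=workers) as pool:
                return ChainSketch(pool.map(fn, self._items))
        return ChainSketch(fn(x) for x in self._items)

    def safe_map(self, fn):
        # Capture exceptions per item instead of aborting the whole chain.
        def wrapped(x):
            try:
                return ("ok", fn(x))
            except Exception as exc:
                return ("error", exc)

        return ChainSketch(wrapped(x) for x in self._items)

    def to_list(self):
        return list(self._items)


# Each call returns a new ChainSketch, so steps can be chained one after another.
lengths = (
    ChainSketch(["a.jpg", "bb.jpg", "ccc.jpg"])
    .map(str.upper, workers=2)
    .map(len)
    .to_list()
)
```

Because every step returns a new collection, a failing item in safe_map is recorded as an ("error", exception) pair while the remaining items continue through the chain.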

Dive deeper

If you find that one of our default embedding pipelines does not suit you, you can also specify a custom pipeline from the hub as follows:

>>> embedding_pipeline = pipeline('towhee/image-embedding-convnext-base')

For a full list of supported pipelines, visit our docs page.

Custom machine learning pipelines can be defined in a YAML file or via the DataCollection API. The first time you instantiate and use a pipeline, all Python functions, configuration files, and model weights are automatically downloaded from the Towhee hub. To ease the development process, pipelines that already exist in the local Towhee cache ($HOME/.towhee/pipelines) are loaded automatically:

# This will load the pipeline defined at $HOME/.towhee/pipelines/fzliu/my-embedding-pipeline.yaml
>>> embedding_pipeline = pipeline('fzliu/my-embedding-pipeline')
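
The cache-first behaviour described above can be pictured with a small sketch. The helper cached_pipeline_path is hypothetical, and the layout it assumes (cache root, then author, then pipeline name with a .yaml suffix) is inferred only from the example path in the comment above; Towhee's actual loader may resolve pipelines differently.

```python
from pathlib import Path


def cached_pipeline_path(name, cache_root=None):
    """Return the local YAML path for a pipeline name if it is cached, else None.

    Hypothetical helper: the layout <cache_root>/<author>/<pipeline>.yaml mirrors
    the example path above and is an assumption, not Towhee's documented behaviour.
    """
    root = Path(cache_root) if cache_root else Path.home() / ".towhee" / "pipelines"
    candidate = root / f"{name}.yaml"
    # A missing file means the pipeline would have to be fetched from the hub.
    return candidate if candidate.is_file() else None
```

Calling cached_pipeline_path('fzliu/my-embedding-pipeline') returns a path only when the YAML file is already present locally; a result of None corresponds to the case where a download from the hub would be needed.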

Architecture overview

Towhee is composed of four main building blocks - Pipelines, Operators, the DataCollection API, and a singleton Engine.

Pipeline: A Pipeline is a single embedding generation task that is composed of several operators. Operators are connected together within the pipeline via a directed acyclic graph.
Operator: An Operator is a single node within a pipeline. An operator can be a machine learning model, a complex algorithm, or a Python function. All files needed to run the operator (code, configs, models, etc.) are contained within a directory.
DataCollection: A Pythonic, method-chaining style API for building custom unstructured data processing pipelines. DataCollection is designed to behave like a Python list or iterator, so it is easy for Python users to understand and is compatible with most popular data science toolkits. Function and Operator invocations can be chained one after another, making your code clean and fluent.
Engine: The Engine sits at Towhee's core. Given a Pipeline, the Engine will drive dataflow between individual operators, schedule tasks, and monitor compute resource (CPU/GPU/etc) usage. We provide a basic Engine within Towhee to run pipelines on a single-instance machine - K8s and other more complex Engine implementations are coming soon.
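
As a rough illustration of how operators connected in a directed acyclic graph can be driven in dependency order, here is a minimal sketch built on the standard library's graphlib module (Python 3.9+, which is newer than Towhee's stated minimum). The function run_dag and the toy decode, embed, and normalize operators are hypothetical; the sketch runs everything sequentially and does not reflect how Towhee's Engine schedules tasks or monitors resources.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(operators, dependencies, source):
    """Run operators in dependency order and return every operator's output.

    operators:    name -> callable taking a list of upstream outputs
    dependencies: name -> list of upstream operator names (empty for entry nodes)
    source:       value handed to operators that have no upstream
    """
    outputs = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = [outputs[dep] for dep in dependencies.get(name, [])]
        outputs[name] = operators[name](upstream or [source])
    return outputs


# Toy operators standing in for real models (assumptions for illustration only).
operators = {
    "decode": lambda ins: ins[0].lower(),
    "embed": lambda ins: [float(len(ins[0])), float(ins[0].count("a"))],
    "normalize": lambda ins: [v / sum(ins[0]) for v in ins[0]],
}
dependencies = {"decode": [], "embed": ["decode"], "normalize": ["embed"]}

result = run_dag(operators, dependencies, "BANANA")
```

TopologicalSorter raises a CycleError if the dependency map contains a cycle, which matches the requirement that the operator graph be acyclic.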

For a deeper dive into Towhee and its architecture, check out the Towhee docs.

Contributing

Remember that writing code is not the only way to contribute! Submitting issues, answering questions, and improving documentation are some of the many ways you can join our growing community. Check out our contributing page for more information.

Special thanks go to these folks for contributing to Towhee, either on GitHub, our Towhee Hub, or elsewhere: