fabienfrfr / PixelBytes

Catching Insights in Unified Multimodal Sequences

PixelBytes+

PixelBytes+ is a Python project that generates and processes multimodal sequences, combining pixels/video, audio, action-states, and text in a single unified representation.

Installation

Requires Python 3.8+. Install directly from GitHub:

pip install git+https://github.com/fabienfrfr/PixelBytes.git@main

Overview

PixelBytes+ builds on theoretical foundations including Image Transformers, PixelRNN/PixelCNN, Bi-Mamba+, and MambaByte to create a unified representation for coherent multimodal generation and processing. It handles:

  • Pixel/video sequences
  • Audio data
  • Action-state control
  • Text

The model seamlessly manages transitions between modalities and maintains dimensional consistency.
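To make the unified representation concrete, here is a minimal tokenization sketch. The import path, the encode call, and its input format are illustrative assumptions, not the package's confirmed API:

import numpy as np
from pixelbytes import ActionPixelBytesTokenizer  # import path assumed

tokenizer = ActionPixelBytesTokenizer()

frames = np.zeros((4, 16, 16, 3), dtype=np.uint8)  # placeholder pixel frames (T, H, W, C)
waveform = np.zeros(8000, dtype=np.float32)        # placeholder audio samples

# Every modality is mapped into one discrete vocabulary, so a single
# autoregressive model can predict the next token whatever its modality.
tokens = tokenizer.encode({"image": frames, "audio": waveform, "text": "Pikachu"})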

Usage

Basic usage (training pipeline):

from torch.utils.data import DataLoader

# Uppercase names are user-defined hyperparameters.
tokenizer = ActionPixelBytesTokenizer(data_slicing=DATA_REDUCTION)
config = ModelConfig(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS, pxby_dim=PXBY_DIM, auto_regressive=AR, model_type=MODEL_TYPE)
model = aPxBySequenceModel(config).to(DEVICE)
dataset = TokenPxByDataset(ds, tokenizer, SEQ_LENGTH, STRIDE)
# Build val_dataloader the same way from a validation split.
train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
model.train_model(train_dataloader, val_dataloader, optimizer, criterion, DEVICE, scaler, EPOCHS, ACCUMULATION_STEPS)
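
After training, generation follows the usual autoregressive loop. The sketch below is illustrative only: the generate and decode signatures are assumptions, not the confirmed API.

model.eval()
prompt = tokenizer.encode(seed_sample)                   # seed_sample: any multimodal sample, encode API assumed
generated = model.generate(prompt, max_new_tokens=256)   # hypothetical signature
output = tokenizer.decode(generated)                     # back to pixels/audio/text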

For detailed documentation, see the docs folder.

Dataset

Use the PixelBytes-Pokemon dataset from Hugging Face: ffurfaro/PixelBytes-Pokemon
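
The ds object passed to TokenPxByDataset in the usage snippet can be loaded with the Hugging Face datasets library (the split name below is an assumption):

from datasets import load_dataset

ds = load_dataset("ffurfaro/PixelBytes-Pokemon", split="train")
print(ds)  # inspect the features before wrapping it in TokenPxByDataset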

Cloud Deployment

Build and push the Docker image:

docker build -t $USER/img_name .
docker push $USER/img_name

Alternatively, run locally with Docker Compose:

docker-compose up --build

Deploy to your preferred cloud provider (OVH, Azure, AWS, Google Cloud).

Contributing

Contributions welcome. Fork, create a feature branch, and submit a pull request.

License

MIT License

Contact

fabien.furfaro_at_gmail.com

Citation

@article{furfaro:hal-04683349,
  TITLE = {{PixelBytes: Catching Unified Representation for Multimodal Generation}},
  AUTHOR = {Furfaro, Fabien},
  URL = {https://hal.science/hal-04683349},
  NOTE = {working paper or preprint},
  YEAR = {2024},
  KEYWORDS = {Embedding ; Multimodal representation learning ; Sequence generation},
  HAL_ID = {hal-04683349},
}

@misc{furfaro2024pixelbytes_project,
  author = {Furfaro, Fabien},
  title = {PixelBytes: A Unified Multimodal Representation Learning Project},
  year = {2024},
  howpublished = {
    GitHub: \url{https://github.com/fabienfrfr/PixelBytes},
    Models: \url{https://huggingface.co/ffurfaro/PixelBytes-Pokemon} and \url{https://huggingface.co/ffurfaro/aPixelBytes-Pokemon},
    Datasets: \url{https://huggingface.co/datasets/ffurfaro/PixelBytes-Pokemon} and \url{https://huggingface.co/datasets/ffurfaro/PixelBytes-PokemonAll}
  },
  note = {GitHub repository, Hugging Face Model Hub, and Datasets Hub},
}
