PixelBytes+ is a Python project that generates and processes multimodal sequences (pixels/video, audio, action-states, and text) in a unified representation.
Requires Python 3.8+. Install from GitHub with pip:
pip install git+https://github.com/fabienfrfr/PixelBytes.git@main
PixelBytes+ builds on theoretical foundations including Image Transformers, PixelRNN/PixelCNN, Bi-Mamba+, and MambaByte to create a unified representation for coherent multimodal generation and processing. It handles:
- Pixel/video sequences
- Audio data
- Action-state control
- Text
The model handles transitions between modalities within a sequence and keeps dimensions consistent across them.
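One way to picture a unified representation is a single token-id space in which each modality owns a contiguous block of ids. The sketch below illustrates that general idea only; it is not the PixelBytes tokenizer. The names `MODALITY_SIZES`, `build_offsets`, `encode`, and `decode`, and all block sizes, are hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical block sizes per modality (illustrative assumptions only,
# not the vocabulary used by PixelBytes).
MODALITY_SIZES: Dict[str, int] = {
    "text": 256,   # one id per byte value
    "pixel": 64,   # e.g. a 64-colour palette
    "audio": 32,   # e.g. 32 quantized amplitude levels
    "action": 16,  # e.g. 16 discrete action-states
}


def build_offsets(sizes: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Assign each modality a contiguous id block; return offsets and total size."""
    offsets: Dict[str, int] = {}
    start = 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start


OFFSETS, TOTAL_VOCAB_SIZE = build_offsets(MODALITY_SIZES)


def encode(modality: str, values: List[int]) -> List[int]:
    """Map modality-local values to ids in the shared token space."""
    size = MODALITY_SIZES[modality]
    base = OFFSETS[modality]
    for value in values:
        if not 0 <= value < size:
            raise ValueError(f"{value} is out of range for modality {modality!r}")
    return [base + value for value in values]


def decode(token_id: int) -> Tuple[str, int]:
    """Recover (modality, local value) from a shared token id."""
    for name, base in OFFSETS.items():
        if base <= token_id < base + MODALITY_SIZES[name]:
            return name, token_id - base
    raise ValueError(f"{token_id} is outside the shared vocabulary")


# A single sequence that mixes text bytes, pixel values, and an action-state.
sequence = encode("text", list(b"hi")) + encode("pixel", [0, 63]) + encode("action", [3])
```

Because every id maps back to exactly one modality, a sequence model can move from one modality to another without changing its input or output shape.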
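Basic usage:
from torch.utils.data import DataLoader
# ActionPixelBytesTokenizer, ModelConfig, aPxBySequenceModel and TokenPxByDataset are provided by PixelBytes.
# Upper-case names (DATA_REDUCTION, VOCAB_SIZE, ...) are user-defined settings.
tokenizer = ActionPixelBytesTokenizer(data_slicing=DATA_REDUCTION)
config = ModelConfig(vocab_size=VOCAB_SIZE, embed_size=EMBED_SIZE, hidden_size=HIDDEN_SIZE,
                     num_layers=NUM_LAYERS, pxby_dim=PXBY_DIM, auto_regressive=AR, model_type=MODEL_TYPE)
model = aPxBySequenceModel(config).to(DEVICE)
# Build the training loader; val_dataloader is built the same way from a validation split.
dataset = TokenPxByDataset(ds, tokenizer, SEQ_LENGTH, STRIDE)
train_dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn, shuffle=True)
# optimizer, criterion and scaler must be created before calling train_model.
model.train_model(train_dataloader, val_dataloader, optimizer, criterion, DEVICE, scaler, EPOCHS, ACCUMULATION_STEPS)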
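The dataset above takes a sequence length and a stride, which suggests fixed-length windows cut from a longer token stream. The source does not describe how `TokenPxByDataset` does this internally, so the following is only a minimal sketch of sliding-window slicing under that assumption. The helpers `sliding_windows` and `next_token_pair` are hypothetical names, not part of PixelBytes.

```python
from typing import List, Sequence, Tuple


def sliding_windows(tokens: Sequence[int], seq_length: int, stride: int) -> List[List[int]]:
    """Cut a token stream into windows of seq_length, advancing by stride each time."""
    if seq_length <= 0 or stride <= 0:
        raise ValueError("seq_length and stride must be positive")
    windows: List[List[int]] = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(tokens) - seq_length + 1, stride):
        windows.append(list(tokens[start:start + seq_length]))
    return windows


def next_token_pair(window: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Split a window into (inputs, targets) shifted by one position."""
    return list(window[:-1]), list(window[1:])


tokens = list(range(10))
windows = sliding_windows(tokens, seq_length=4, stride=3)
inputs, targets = next_token_pair(windows[0])
```

A stride smaller than the sequence length makes consecutive windows overlap, which yields more training samples from the same data at the cost of redundancy.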
For detailed documentation, see the docs folder.
Use the PixelBytes-Pokemon dataset from Hugging Face: ffurfaro/PixelBytes-Pokemon
Build and push the Docker image, then start the services with Docker Compose:
docker build -t $USER/img_name .
docker push $USER/img_name
docker-compose up --build
Deploy to your preferred cloud provider (OVH, Azure, AWS, Google Cloud).
Contributions welcome. Fork, create a feature branch, and submit a pull request.
MIT License
fabien.furfaro_at_gmail.com
@article{furfaro:hal-04683349,
TITLE = {{PixelBytes: Catching Unified Representation for Multimodal Generation}},
AUTHOR = {Furfaro, Fabien},
URL = {https://hal.science/hal-04683349},
NOTE = {working paper or preprint},
YEAR = {2024},
KEYWORDS = {Embedding ; Multimodal representation learning ; Sequence generation},
HAL_ID = {hal-04683349},
}
@misc{furfaro2024pixelbytes_project,
author = {Furfaro, Fabien},
title = {PixelBytes: A Unified Multimodal Representation Learning Project},
year = {2024},
howpublished = {
GitHub: \url{https://github.com/fabienfrfr/PixelBytes},
Models: \url{https://huggingface.co/ffurfaro/PixelBytes-Pokemon} and \url{https://huggingface.co/ffurfaro/aPixelBytes-Pokemon},
Datasets: \url{https://huggingface.co/datasets/ffurfaro/PixelBytes-Pokemon} and \url{https://huggingface.co/datasets/ffurfaro/PixelBytes-PokemonAll}
},
note = {GitHub repository, Hugging Face Model Hub, and Datasets Hub}
}