pgosar / mamba.cpp

Repository from GitHub: https://github.com/pgosar/mamba.cpp

To run:

  1. python3 scripts/download_models.py -m 370m --bits 32 -md models/370m_32bit.bin
  2. make fast
  3. ./build/mamba models/370m_32bit.bin -n 20 -i "Customer Support should" -t 0.0

Command-line arguments control inference, for example the quantization level, debugging verbosity, and input prompt.

You can use the model download script to fetch useful configurations for testing, including tokenizers.

TODO

Model configuration will be handled through model_config.yaml, covering, for example, temperature (text diversity), the amount of generated text, and batch size. Multiple named configurations may be defined; one is selected through the command-line arguments.
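Since the file does not exist yet, the following is only a sketch of what model_config.yaml might look like; every key name here is an assumption:

```yaml
# Hypothetical model_config.yaml -- all keys are illustrative guesses.
configs:
  default:
    temperature: 1.0   # text diversity; 0.0 = greedy decoding
    n_tokens: 256      # amount of generated text
    batch_size: 1
  deterministic:
    temperature: 0.0
    n_tokens: 20
    batch_size: 1
```

A named configuration (e.g. `deterministic`) would then be picked via a command-line argument.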


  • Initial C++ implementation

  • C++ memory optimization

  • Quantization

  • Speculative decoding

  • Flash memory

    • Neuron activation data
    • Hot and cold neuron prediction
    • Loading a partial model
  • Matrix multiplication and overall optimization

Helpful references:

Models

Jamba

Mamba Variants

Model Configuration

https://ivibudh.medium.com/a-guide-to-controlling-llm-model-output-exploring-top-k-top-p-and-temperature-parameters-ed6a31313910

Implementations:

Implementation of some optimization techniques

https://github.com/MDK8888/GPTFast/tree/master

Mamba LLM

https://github.com/redotvideo/mamba-chat

Using ReLU instead of SiLU (Mamba's default):

https://arxiv.org/abs/2310.04564

Flash memory:

https://arxiv.org/abs/2312.11514

Speculative Streaming:

https://arxiv.org/abs/2402.11131

Speculative Decoding:

https://arxiv.org/abs/2211.17192

1 bit model variant:

https://arxiv.org/abs/2402.17764

Quantization:

state-spaces/mamba#133 (only quantize nn.Linear)

https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/quantization

https://leimao.github.io/article/Neural-Networks-Quantization/

Fast matrix mult:

https://coffeebeforearch.github.io/2020/06/23/mmul.html

https://justine.lol/matmul/
