w32zhong / blackmamba-fork


My notes

Add compute capability

When using my Titan-X, I see a runtime error:

CUDA error: no kernel image is available for execution on the device

It turns out the causal-conv1d package does not pass an nvcc -gencode flag that covers my Titan-X (compute capability 6.1). As a result, I forked the causal-conv1d package as a submodule in this repo.

The submodule is patched with

cc_flag.append("-gencode")
cc_flag.append("arch=compute_61,code=sm_61") # support my Titan-X!

Testing

python test/test_causal_conv1d.py
python test/test.py
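
If you want to sanity-check the rebuilt kernel directly, one option is to compare it against a plain PyTorch reference. This is a sketch assuming the package's causal_conv1d_fn interface (x of shape (batch, dim, seqlen), weight of shape (dim, width)), not the repo's actual test:

import torch
import torch.nn.functional as F
from causal_conv1d import causal_conv1d_fn

def causal_conv1d_ref(x, weight, bias=None):
    # Depthwise causal conv: pad left/right by width-1, keep the first seqlen outputs.
    dim, width = weight.shape
    out = F.conv1d(x, weight.unsqueeze(1), bias, padding=width - 1, groups=dim)
    return out[..., : x.shape[-1]]

x = torch.randn(2, 64, 128, device="cuda", dtype=torch.float16)
w = torch.randn(64, 4, device="cuda", dtype=torch.float16)
b = torch.randn(64, device="cuda", dtype=torch.float16)

assert torch.allclose(causal_conv1d_fn(x, w, b), causal_conv1d_ref(x, w, b), atol=1e-2)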

What tokenizer is used?

Short answer: stick with the original Mamba's EleutherAI/gpt-neox-20b tokenizer, whose vocabulary is extended with padding tokens so the embedding/output matrix sizes run more efficiently on the underlying GPU GEMM kernels.

See: Zyphra/BlackMamba#6
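
For reference, loading that tokenizer is one line with transformers; the padding to a multiple of 128 below is a hypothetical illustration of GEMM-friendly vocab sizing, not a value taken from this repo:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

# Pad the vocab up to a multiple of 128 (hypothetical choice) so the
# embedding/output GEMM dimensions stay friendly to the GPU kernels.
multiple = 128
padded_vocab_size = ((len(tok) + multiple - 1) // multiple) * multiple
print(len(tok), "->", padded_vocab_size)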

BlackMamba


BlackMamba: Mixture of Experts for State-space models
Quentin Anthony*, Yury Tokpanov*, Paolo Glorioso*, Beren Millidge*
Paper: https://arxiv.org/abs/2402.01771

About

In this repository we provide inference code for our BlackMamba model.

BlackMamba is a novel architecture which combines state-space models (SSMs) with mixture of experts (MoE). It uses Mamba as its SSM block and a Switch-Transformer-style MoE as its expert block. BlackMamba has extremely low latency for generation and inference, providing significant speedups over classical transformers, MoE models, and Mamba SSM models. Additionally, due to its SSM sequence mixer, BlackMamba retains linear computational complexity in the sequence length.
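
As a rough illustration of that block layout, here is a minimal sketch with hypothetical module names (SwitchMoE, BlackMambaLayerSketch), not the repository's implementation: each layer applies an SSM sequence mixer, then a Switch-style top-1 MoE MLP, each behind a residual.

import torch
import torch.nn as nn

class SwitchMoE(nn.Module):
    # Top-1 (Switch-style) mixture-of-experts MLP; a sketch, not the repo's kernel.
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                        # x: (batch, seq, d_model)
        probs = self.router(x).softmax(dim=-1)   # gate probabilities per token
        top1 = probs.argmax(dim=-1)              # each token goes to one expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                # Scale by the gate probability, as in Switch Transformer.
                out[mask] = expert(x[mask]) * probs[..., i][mask].unsqueeze(-1)
        return out

class BlackMambaLayerSketch(nn.Module):
    # One layer: SSM sequence mixer, then a MoE MLP, each with a residual.
    def __init__(self, d_model, n_experts, ssm_block):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.ssm = ssm_block     # e.g. mamba_ssm.Mamba(d_model) in a real stack
        self.moe = SwitchMoE(d_model, n_experts)

    def forward(self, x):
        x = x + self.ssm(self.norm1(x))
        return x + self.moe(self.norm2(x))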

Requirements

pip install "causal-conv1d>=1.1.0" (quoted so the shell does not treat >= as a redirect): required for Mamba. The rest of the kernels are built locally.

Other requirements:

  • Linux
  • NVIDIA GPU
  • PyTorch 1.12+
  • CUDA 11.6+
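
A quick way to confirm the environment meets these requirements:

import torch

print(torch.__version__)          # want 1.12 or newer
print(torch.version.cuda)         # want 11.6 or newer
print(torch.cuda.is_available())  # want True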

Quick installation in a fresh Python environment

  • pip install torch packaging
  • pip install . to install this repository from source

Pretrained Models

Our pretrained models are uploaded to our HuggingFace organization:

*Since the models are MoE, they are named according to (Forward Pass Parameters) / (Total Parameters) for clarity.

Usage

from mamba_model import MambaModel
import torch

# Download and load the 2.8B checkpoint from HuggingFace, in fp16 on GPU.
model = MambaModel.from_pretrained(pretrained_model_name="Zyphra/BlackMamba-2.8B")
model = model.cuda().half()

# A toy input: a batch of one sequence containing two token ids.
inputs = torch.tensor([1, 2]).cuda().long().unsqueeze(0)
out = model(inputs)
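
Assuming out follows the usual (batch, seq, vocab) logits layout (an assumption; check the model code), a greedy next-token step would be:

# Pick the most likely next token from the last position's logits.
next_token = out[:, -1].argmax(dim=-1)
print(next_token)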
