This is a ggml implementation of the BERT embedding architecture. It supports inference on both CPU and CUDA, in floating point as well as a wide variety of quantization schemes, and includes Python bindings for batched inference.
This repo is a fork of the original bert.cpp as well as embeddings.cpp. Thanks to the authors of both!
Fetch this repository, then download submodules and install packages with
git submodule update --init --recursive
pip install -r requirements.txt
To fetch models from Hugging Face and convert them to gguf format, run the following
cd models
python download-repo.py BAAI/bge-base-en-v1.5 # or any other model
python convert-to-ggml.py BAAI/bge-base-en-v1.5 f16
python convert-to-ggml.py BAAI/bge-base-en-v1.5 f32
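If you want to prepare several models at once, one option is a small Python driver that shells out to the same scripts. This is only a sketch: it assumes it is run from the models directory and that download-repo.py and convert-to-ggml.py accept exactly the arguments shown above.

import subprocess

# sketch: fetch and convert a few models in one go (run from the models/ directory)
models = ["BAAI/bge-base-en-v1.5"]  # any other Hugging Face repo id works too

for repo in models:
    subprocess.run(["python", "download-repo.py", repo], check=True)
    for dtype in ["f16", "f32"]:
        subprocess.run(["python", "convert-to-ggml.py", repo, dtype], check=True)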
To build the dynamic library for use from Python, run
cmake -B build .
make -C build -j
If you're compiling for CUDA GPU support, you should run
cmake -DGGML_CUBLAS=ON -B build .
make -C build -j
On some distros, you also need to specify the host C++ compiler. To do this, I suggest setting the CUDAHOSTCXX environment variable to your C++ bindir.
And for Apple Metal, you should run
cmake -DGGML_METAL=ON -B build .
make -C build -j
All executables are placed in build/bin. To run inference on a given text, run
build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "Hello world"
To force CPU usage, add the flag -c.
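You can also invoke the CLI from a script. The sketch below just reruns the command above with subprocess and prints whatever main writes to stdout, without parsing it.

import subprocess

# sketch: call the main binary with the same flags shown above (-m for the model, -p for the prompt)
result = subprocess.run(
    ["build/bin/main", "-m", "models/bge-base-en-v1.5/ggml-model-f16.gguf", "-p", "Hello world"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # output format is whatever main prints; not parsed here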
You can also run everything through Python, which is particularly useful for batch inference. For instance,
import bert
mod = bert.BertModel('models/bge-base-en-v1.5/ggml-model-f16.gguf')
emb = mod.embed(batch)
where batch is a list of strings and emb is a numpy array of embedding vectors.
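As a quick usage example, you can compare two texts by the cosine similarity of their embeddings. This is just a sketch; it normalizes the vectors explicitly rather than assuming the model returns unit-length embeddings.

import numpy as np
import bert

# sketch: compare two sentences by cosine similarity of their embeddings
mod = bert.BertModel('models/bge-base-en-v1.5/ggml-model-f16.gguf')
emb = mod.embed(['the cat sat on the mat', 'a feline rested on the rug'])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # normalize each row
print(float(emb[0] @ emb[1]))  # values closer to 1 mean more similar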
You can quantize models with the command
build/bin/quantize models/bge-base-en-v1.5/ggml-model-f32.gguf models/bge-base-en-v1.5/ggml-model-q8_0.gguf q8_0
or whatever your desired quantization level is. Currently supported values are: q8_0, q5_0, q5_1, q4_0, and q4_1. You can then pass these model files directly to main as above.
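A quick way to sanity-check a quantized model is to embed the same texts with both the f16 and quantized files and compare the outputs. The sketch below assumes you have produced the q8_0 file with the quantize command above; it is illustrative only.

import numpy as np
import bert

texts = ['Hello world', 'The quick brown fox jumps over the lazy dog']

# sketch: embed the same texts with the f16 and q8_0 models
full = bert.BertModel('models/bge-base-en-v1.5/ggml-model-f16.gguf').embed(texts)
quant = bert.BertModel('models/bge-base-en-v1.5/ggml-model-q8_0.gguf').embed(texts)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# cosine similarity between corresponding rows; values near 1 mean little quality loss
sims = (normalize(full) * normalize(quant)).sum(axis=1)
print(sims)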