ggerganov/bert.cpp

GGML implementation of BERT model with Python bindings and quantization.

bert.cpp

This is a ggml implementation of the BERT embedding architecture. It supports inference on both CPU and CUDA in floating point, along with a wide variety of quantization schemes, and includes Python bindings for batched inference.

This repo is a fork of the original bert.cpp as well as embeddings.cpp. Thanks to the authors of both!

Install

Fetch this repository, then download the submodules and install the required packages with

git submodule update --init --recursive
pip install -r requirements.txt

To fetch models from Hugging Face and convert them to GGUF format, run the following

cd models
python download-repo.py BAAI/bge-base-en-v1.5 # or any other model
python convert-to-ggml.py BAAI/bge-base-en-v1.5 f16 # f16 weights for inference
python convert-to-ggml.py BAAI/bge-base-en-v1.5 f32 # f32 weights, used below for quantization

Build

To build the dynamic library for use from Python, run

cmake -B build .
make -C build -j

If you're compiling for an NVIDIA GPU with CUDA, you should instead run

cmake -DGGML_CUBLAS=ON -B build .
make -C build -j

On some distros, you also need to specify the host C++ compiler. To do this, I suggest setting the CUDAHOSTCXX environment variable to your host C++ compiler (for example, CUDAHOSTCXX=/usr/bin/g++).

And for Apple Metal, you should run

cmake -DGGML_METAL=ON -B build .
make -C build -j

Execute

All executables are placed in build/bin. To run inference on a given text, run

build/bin/main -m models/bge-base-en-v1.5/ggml-model-f16.gguf -p "Hello world"

To force CPU usage, add the flag -c.

Python

You can also run everything through Python, which is particularly useful for batch inference. For instance,

import bert
# load a converted gguf model
mod = bert.BertModel('models/bge-base-en-v1.5/ggml-model-f16.gguf')
# embed a batch of strings in one call
batch = ['Hello world', 'Goodbye world']
emb = mod.embed(batch)

where batch is a list of strings and emb is a numpy array of embedding vectors, one row per input string.
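
As a quick sanity check on the embeddings, you can compare them with cosine similarity. Below is a minimal sketch using numpy; the sentences are arbitrary examples, and the normalization step is a no-op if the model already returns unit-length vectors.

import numpy as np
import bert

mod = bert.BertModel('models/bge-base-en-v1.5/ggml-model-f16.gguf')
emb = mod.embed([
    'the cat sat on the mat',      # semantically close to the next line
    'a feline rested on the rug',
    'stock prices fell sharply',   # unrelated
])

# normalize rows so that cosine similarity reduces to a dot product
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(emb @ emb.T)  # the first two sentences should score highest together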

Quantize

You can quantize models with the command

build/bin/quantize models/bge-base-en-v1.5/ggml-model-f32.gguf models/bge-base-en-v1.5/ggml-model-q8_0.gguf q8_0

or whichever quantization level you prefer. Currently supported values are: q8_0, q5_0, q5_1, q4_0, and q4_1. You can then pass these model files directly to main as above.
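
If you want to produce every supported quantization level at once, a short Python loop over the quantize binary does the job. This is a minimal sketch, assuming you have already built build/bin/quantize and converted the f32 model as above; the output filenames are just one possible naming convention.

import subprocess

levels = ['q8_0', 'q5_0', 'q5_1', 'q4_0', 'q4_1']  # currently supported types
src = 'models/bge-base-en-v1.5/ggml-model-f32.gguf'

for lvl in levels:
    dst = src.replace('f32', lvl)
    # usage: quantize <input.gguf> <output.gguf> <type>
    subprocess.run(['build/bin/quantize', src, dst, lvl], check=True)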

License: MIT

