LLaMA.go

llama.go is like llama.cpp in pure Golang!

The Goal

We dream of a world where ML hackers are able to grok REALLY BIG GPT models without GPU clusters consuming shit tons of $$$ - using only the machines in their own homelabs.

The code of the project is based on the legendary ggml framework of Georgi Gerganov, written in C++.

We hope that using our beloved Golang instead of a so-powerful but too-low-level language will allow much greater adoption of the NoGPU ideas.

V1 supports only FP32 math, so you'll need at least 32GB RAM to work even with the smallest LLaMA-7B model. As a preliminary step, you should have binary files converted from the original LLaMA model locally.
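
A quick back-of-the-envelope estimate (ours, not an official figure) shows where that requirement comes from:

7B params x 4 bytes per FP32 weight ≈ 28 GB for the weights alone, leaving only a little headroom within 32GB for context buffers and activations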

V0 Roadmap

  • Run tensor math in pure Golang based on C++ source
  • Implement LLaMA neural net architecture and model loading
  • Run smaller LLaMA-7B model
  • Be sure Go inference works the EXACT SAME way as C++
  • Let Go shine! Enable multi-threading and boost performance (see the sketch after this list)
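
Below is a minimal sketch of the multi-threading idea (our illustration, not the project's actual code): a matrix-vector multiply, the dominant operation in transformer inference, split across goroutines. matVec and its arguments are hypothetical names.

// Hypothetical illustration (not llama.go's actual code): splitting a
// matrix-vector multiply across goroutines, one chunk of rows per worker.
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// matVec computes out = m * v, sharing rows between nWorkers goroutines.
func matVec(m [][]float32, v, out []float32, nWorkers int) {
    var wg sync.WaitGroup
    rows := len(m)
    chunk := (rows + nWorkers - 1) / nWorkers
    for w := 0; w < nWorkers; w++ {
        start, end := w*chunk, (w+1)*chunk
        if end > rows {
            end = rows
        }
        if start >= end {
            break
        }
        wg.Add(1)
        go func(start, end int) {
            defer wg.Done()
            for i := start; i < end; i++ {
                var sum float32
                for j, x := range v {
                    sum += m[i][j] * x
                }
                out[i] = sum
            }
        }(start, end)
    }
    wg.Wait()
}

func main() {
    m := [][]float32{{1, 2}, {3, 4}, {5, 6}}
    v := []float32{10, 1}
    out := make([]float32, len(m))
    matVec(m, v, out, runtime.NumCPU())
    fmt.Println(out) // [12 34 56]
}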

V1 Roadmap

  • Cross-platform compatibility with Mac, Linux and Windows
  • Release first stable version for ML hackers
  • Support bigger LLaMA models: 13B, 30B, 65B
  • ARM NEON support on Apple Silicon (modern Macs) and ARM servers
  • Performance boost with x64 AVX2 support for Intel and AMD
  • Speed-up AVX2 with memory aligned tensors
  • INT8 quantization to allow 4x bigger models to fit in the same memory (see the sketch after this list)
  • Enable interactive mode for real-time chat with GPT
  • Allow automatic download of converted model weights from the Internet
  • Implement metrics for RAM and CPU usage
  • Server Mode for use in Clouds as part of Microservice Architecture
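
Here is a minimal sketch of the INT8 idea (our illustration, not the project's actual code): absmax quantization, where each float32 weight becomes a single int8 plus a per-block scale. quantize and dequantize are hypothetical names.

// Hypothetical illustration (not llama.go's actual code): absmax INT8
// quantization - each float32 weight becomes one int8 plus a per-block
// scale, cutting weight memory roughly 4x.
package main

import (
    "fmt"
    "math"
)

// quantize maps a block of float32 weights to int8 with a shared scale.
func quantize(block []float32) (q []int8, scale float32) {
    var max float32
    for _, w := range block {
        if a := float32(math.Abs(float64(w))); a > max {
            max = a
        }
    }
    q = make([]int8, len(block))
    scale = max / 127
    if scale == 0 {
        return // all-zero block
    }
    for i, w := range block {
        q[i] = int8(math.Round(float64(w / scale)))
    }
    return
}

// dequantize restores approximate float32 values from the int8 block.
func dequantize(q []int8, scale float32) []float32 {
    out := make([]float32, len(q))
    for i, v := range q {
        out[i] = float32(v) * scale
    }
    return out
}

func main() {
    block := []float32{0.12, -0.98, 0.44, 0.01}
    q, scale := quantize(block)
    fmt.Println(q, scale)             // quantized weights and their scale
    fmt.Println(dequantize(q, scale)) // approximate originals
}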

V2 Roadmap

  • Allow plugins and external APIs for complex projects
  • AVX512 support - yet another performance boost for AMD Epyc
  • FP16 and BF16 support where the hardware provides it
  • Support INT4 and GPTQ quantization

How to Run

go run main.go \
    --model ~/models/7B/ggml-model-f32.bin \
    --temp 0.80 \
    --context 128 \
    --predict 128 \
    --prompt "Why Golang is so popular?"

Or build it with the Makefile and then run the binary.
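
For example (the target and binary names here are assumptions - check the Makefile in your checkout):

make
./llama \
    --model ~/models/7B/ggml-model-f32.bin \
    --prompt "Why Golang is so popular?"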

Useful CLI parameters:

--prompt   Text prompt from user to feed the model input
--model    Path and file name of converted .bin LLaMA model
--threads  Adjust to the number of CPU cores you want to use [ all cores by default ]
--predict  Number of tokens to predict [ 64 by default ]
--context  Context size in tokens [ 64 by default ]
--temp     Model temperature hyper parameter [ 0.8 by default ]
--silent   Hide welcome logo and other output [ show by default ]
--chat     Chat with user in interactive mode instead of compute over static prompt
--profile  Profile CPU performance while running and store results to the [cpu.pprof] file
--avx      Enable x64 AVX2 optimizations for Intel and AMD machines
--neon     Enable ARM NEON optimizations for Apple Macs and ARM server
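
The cpu.pprof file produced by --profile should open with Go's standard pprof tooling (an assumption based on the file name):

go tool pprof cpu.pprof
(pprof) top10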

FAQ

1] Where might I get the original LLaMA model files?

Contact Meta directly or look around for torrent alternatives.

2] How to convert the original LLaMA files into the supported format?

You'll need the original FP16 files placed in the models directory, then convert them with this command:

python3 ./scripts/convert.py ~/models/LLaMA/7B/ 0
