My ollama notes

[Figure from the Ollama blog post about embeddings]

Ollama is one of the simplest ways to run Large Language Models (LLMs) on your hardware.

Follow the installation guide on the official website or, if you are on Linux, simply download the single binary and make it executable:

curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
chmod +x ollama

Then start the Ollama server (and keep it running in the background):

./ollama serve

Now you can interact with the Ollama server in various ways: through the ollama CLI (e.g. ollama run mistral), through the REST API exposed on localhost:11434, or through client libraries such as the official Python and JavaScript ones.
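For example, here is a minimal sketch of calling the REST API from Python (assuming the third-party requests package is installed and that a model such as mistral has already been pulled; the server listens on localhost:11434 by default):

import requests

# Ask the server for a completion. "mistral" is just an example tag;
# substitute any model you have pulled locally.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Why is the sky blue?",
        "stream": False,  # return one JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])

The ollama run CLI and the official client libraries talk to this same HTTP API.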

Notes

For fast inference, the model must fit into GPU memory. (The underlying inference engine is llama.cpp, which can also run a model entirely or partially on the CPU, but that is orders of magnitude slower.)

Size: The most limiting factor in the choice of model is the amount of VRAM available on the GPU. For practical purposes, models can be grouped into three categories based on their parameter count. Small models run on a single consumer GPU (possibly locally on a laptop). Medium models usually require dedicated hardware with a decent amount of VRAM (e.g. > 40 GB). Large models require high-end GPUs (80 GB or more of combined VRAM). We assume the models are quantized to 4-bit precision (see Quantization below).
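As a back-of-the-envelope check (a rough rule of thumb, not an exact formula: real memory usage also depends on context length, KV cache, and runtime overhead), the memory needed for the weights is roughly the parameter count times the bits per weight:

def approx_weight_memory_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough estimate of the memory needed just for the model weights."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# With 4-bit quantization:
for params in (7, 70):
    print(f"{params}B parameters -> ~{approx_weight_memory_gb(params):.1f} GB of weights")
# 7B  -> ~3.5 GB of weights (fits comfortably on a consumer GPU)
# 70B -> ~35 GB of weights (plus overhead, hence the > 40 GB of VRAM above)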

MoE vs Dense: Some models express their parameter count as a product (e.g. mixtral:8x7b). These are Mixture of Experts (MoE) models: at inference time, a routing network selects a subset of the experts (e.g. 2 out of 8) to run for each token. This reduces the number of parameters used in each forward pass, making token generation faster with minimal loss in quality compared to dense models of similar total size.
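As a rough illustration (the numbers below are simplified: real MoE models such as Mixtral share attention and embedding weights across experts, so the true totals are lower), top-2 routing over 8 experts means only a fraction of the weights participate in each forward pass:

# Back-of-the-envelope for an "8x7b"-style MoE with top-2 routing.
n_experts = 8
active_experts = 2
params_per_expert_b = 7  # billions, per the "8x7b" naming

stored_b = n_experts * params_per_expert_b        # ~56B must fit in memory
active_b = active_experts * params_per_expert_b   # ~14B are used per token

print(f"stored: ~{stored_b}B params, active per token: ~{active_b}B params")
# Memory footprint scales with the stored total, per-token compute with the active part.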

Quantization: To run reasonably large models on a single GPU, the models are quantized with various precisions and methods. When pulling a model by its bare name (e.g. ollama pull mistral), Ollama defaults to a 4-bit quantized version. See the Ollama website for all the available tags for each model.
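To check which quantization (and size) a local model actually has, you can list the installed models through the REST API, roughly the same information that ollama list prints on the command line. A sketch using the requests package; the exact field names may vary slightly between Ollama versions:

import requests

resp = requests.get("http://localhost:11434/api/tags", timeout=10)
resp.raise_for_status()

# Print name, approximate size on disk, and quantization level of each local model.
for model in resp.json().get("models", []):
    details = model.get("details", {})
    print(
        f"{model.get('name', '?'):<40}"
        f"{model.get('size', 0) / 1e9:6.1f} GB  "
        f"{details.get('quantization_level', '?')}"
    )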

Model Format: Under the hood, Ollama uses llama.cpp, which requires its own format for model weights and metadata: GGUF (.gguf files). The model tags follow the GGUF naming scheme.

License: Different models are released under different licenses. For example, models from Mistral are released under the Apache 2.0 license, which is very permissive, while models from the Command-R series are released under the more restrictive CC-BY-NC license, which rules out commercial use (unless you purchase a commercial license).

Capabilities: Different models excel at different tasks: code generation, multi-language understanding, reasoning capabilities, RAG performance, tool usage, and context length, just to name a few. Some respond in a more casual and engaging manner, while others are more formal and informative.

Benchmarks

Benchmarking model performance and capabilities is quite challenging. In addition to the usual metrics on popular datasets (MMLU, GPQA, HumanEval, GSM8K, etc.), the Chatbot Arena provides an Elo-based ranking derived from human evaluations of generated text. Another resource for gauging public perception of LLMs is the LocalLLaMA subreddit.

Below are some speed benchmarks for several base models available on Ollama. The plots were produced by running benchmarks.py on the following machines (a sketch of how these speeds can be computed from the API's timing fields is shown after the plots):

APPLE M1 MAX
  • CPU: Apple M1 Max
  • RAM: 32 GB

Eval Speed

How many tokens per second can the model generate in an autoregressive setting?

[Plot: eval speed]

Prompt Eval Speed

How fast (in tokens/s) can the model process a given prompt?

[Plot: prompt eval speed]

NVIDIA 1080 TI
  • CPU: Intel Xeon E5-2620 v3 @ 2.40 GHz (24 cores)
  • GPU: NVIDIA GeForce GTX 1080 Ti (11 GB)
  • RAM: 126 GB

Eval Speed

How many tokens per second can the model generate in an autoregressive setting?

[Plot: eval speed]

Prompt Eval Speed

How fast (in tokens/s) can the model process a given prompt?

[Plot: prompt eval speed]

NVIDIA A6000
  • CPU: AMD EPYC-Rome (14 cores)
  • GPU: NVIDIA A6000 (48 GB)
  • RAM: 92 GB

Eval Speed

How many tokens per second can the model generate in an autoregressive setting?

[Plot: eval speed]

Prompt Eval Speed

How fast (in tokens/s) can the model process a given prompt?

[Plot: prompt eval speed]
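For reference, the two speeds above can be computed from the timing fields that the Ollama API returns with each generation (durations are reported in nanoseconds). This is a minimal sketch of the idea, not necessarily what benchmarks.py does:

import requests

def measure(model: str, prompt: str) -> None:
    """Report prompt-eval and eval speed (tokens/s) for a single generation."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    data = resp.json()
    # prompt_eval_* fields may be missing if the prompt was served from cache.
    prompt_tokens = data.get("prompt_eval_count", 0)
    prompt_ns = max(data.get("prompt_eval_duration", 1), 1)
    prompt_speed = prompt_tokens / prompt_ns * 1e9
    eval_speed = data["eval_count"] / data["eval_duration"] * 1e9
    print(f"{model}: prompt eval ~{prompt_speed:.1f} tok/s, eval ~{eval_speed:.1f} tok/s")

measure("mistral", "Write a short paragraph about llamas.")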

Model Choice

Even though the model zoo can be overwhelming, with new models released regularly, choosing a good model for your use case can be broken down into a few steps:

  1. Technical Limitations: It must run at the required speed (size, MoE vs Dense).
  2. License Limitations: It can be used for the intended purpose (license).
  3. Capabilities: It should be good at the intended task (capabilities).

Once you have identified a model or models that meet your requirements, you can further "optimize" your choice by considering the following:

  • Quantization: Move to a less quantized model if you have spare VRAM (while keeping an eye on the speed).
  • Fine-tuned version: The LLM community often releases fine-tuned versions of the base models for specific tasks (e.g., uncensored models, improved prompt following capabilities, etc.).
  • Inference engine: Ollama is easy to start with, but more performant inference engines than llama.cpp exist (e.g. ExLlamaV2, vLLM).
