[BUG] ollama models context size not properly imported/reflected

XReyRobert opened this issue 6 months ago · comments

Describe the bug
ollama models context size not properly imported/reflected

Where is it happening?
To Reproduce
Import a 128K Ollama model (e.g. yarn-mistral:7b-128k), then show the model details / max model tokens in the UI.

Enrico Ros · Answer 1 · Thu Dec 28 2023 07:18:37 GMT+0800 (China Standard Time)

Thanks @XReyRobert. Unfortunately Ollama does not usually provide the context size, so it's assumed to be 4k across the board.

The /models API does not provide it, and neither does the models list.

In your particular case, the name of the model has the context size, but that's a rarity.

What's the best way to deal with this, or to get context sizes for all models?
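For the rare case where the tag does carry the size, a name-based guess could look roughly like the sketch below. This is only a heuristic, not what the app does: the helper name `guess_context_from_name` and the "<N>k means N * 1024 tokens" convention are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches a standalone "<digits>k" token such as "128k" inside a tag.
# The lookarounds keep it from firing on fragments like "q4_K_M" or "7b".
_CTX_HINT = re.compile(r"(?<![0-9a-z])(\d+)k(?![0-9a-z])", re.IGNORECASE)


def guess_context_from_name(model_name: str) -> Optional[int]:
    """Hypothetical helper: guess a context size from an Ollama model tag.

    Returns None when the tag carries no "<N>k" hint.
    """
    # Only inspect the tag (the part after ":"), where size hints usually live.
    _, _, tag = model_name.partition(":")
    match = _CTX_HINT.search(tag)
    if match is None:
        return None
    # Assumption: "128k" means 128 * 1024 tokens.
    return int(match.group(1)) * 1024
```

Since most tags have no such hint, this could only ever be a fallback behind whatever the server reports.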

XRR · Answer 2 · Thu Dec 28 2023 07:30:46 GMT+0800 (China Standard Time)

Hi @enricoros,

There's a "show" endpoint that returns additional parameters when available.
For example, mistrallite:latest and yarn-mistral:7b-128k both report a "num_ctx" parameter:

curl http://localhost:11434/api/show -d '{
  "name": "mistrallite:latest"
}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   992  100   958  100    34   656k  23876 --:--:-- --:--:-- --:--:--  968k
{
  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM mistrallite:latest\n\nFROM /usr/share/ollama/.ollama/models/blobs/sha256:fcfc737faf6b2bb5050752602ca341e92ec4d8208f2b5762bd656d447be9910e\nTEMPLATE \"\"\"<|prompter|>{{ .System }} {{ .Prompt }}</s><|assistant|>\n\"\"\"\nPARAMETER num_ctx 32768\nPARAMETER stop \"<|prompter|>\"\nPARAMETER stop \"<|assistant|>\"\nPARAMETER stop \"</s>\"",
  "parameters": "num_ctx                        32768\nstop                           <|prompter|>\nstop                           <|assistant|>\nstop                           </s>",
  "template": "<|prompter|>{{ .System }} {{ .Prompt }}</s><|assistant|>\n",
  "details": {
    "format": "gguf",
    "family": "llama",
    "families": null,
    "parameter_size": "7B",
    "quantization_level": "Q4_0"
  }
}

curl http://localhost:11434/api/show -d '{
  "name": "yarn-mistral:7b-128k"
}' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   568  100   532  100    36   423k  29315 --:--:-- --:--:-- --:--:--  554k
{
  "modelfile": "# Modelfile generated by \"ollama show\"\n# To build a new Modelfile based on this one, replace the FROM line with:\n# FROM yarn-mistral:7b-128k\n\nFROM /usr/share/ollama/.ollama/models/blobs/sha256:14f2e225961b80d791d14c88def05fca31abc44ab1a7a12ba8e8f2365442e6e6\nTEMPLATE \"\"\"{{ .Prompt }}\"\"\"\nPARAMETER num_ctx 131072",
  "parameters": "num_ctx                        131072",
  "template": "{{ .Prompt }}",
  "details": {
    "format": "gguf",
    "family": "llama",
    "families": null,
    "parameter_size": "7B",
    "quantization_level": "Q4_0"
  }
}
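The "parameters" field in both responses is plain text: one parameter per line, name and value separated by whitespace, with names such as "stop" repeated. A minimal sketch of pulling num_ctx out of it, assuming that layout holds for other models too (the function names here are made up for illustration):

```python
from typing import Dict, List, Optional


def parse_show_parameters(parameters: str) -> Dict[str, List[str]]:
    """Split the whitespace-aligned "parameters" text into name -> values."""
    parsed: Dict[str, List[str]] = {}
    for line in parameters.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        # Names such as "stop" can repeat, so collect values in a list.
        parsed.setdefault(name, []).append(value.strip())
    return parsed


def context_size_from_show(show_response: dict) -> Optional[int]:
    """Return num_ctx from an /api/show response, or None if it is absent."""
    params = parse_show_parameters(show_response.get("parameters") or "")
    values = params.get("num_ctx")
    if not values:
        return None
    try:
        return int(values[-1])  # if repeated, take the last value listed
    except ValueError:
        return None
```

Fed the two responses above, this yields 32768 for mistrallite:latest and 131072 for yarn-mistral:7b-128k; a model whose response has no "parameters" field yields None.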

Giulio De Pasquale · Answer 3 · Wed Jan 10 2024 21:03:04 GMT+0800 (China Standard Time)

I confirm the bug. Also, for what it's worth, this Ollama release changelog specifies how to pass a 32k context window to Mixtral (and I suppose other models as well). https://github.com/jmorganca/ollama/releases/tag/v0.1.19

Enrico Ros · Answer 4 · Thu Jan 11 2024 03:02:14 GMT+0800 (China Standard Time)

Thanks! I'll prioritize this issue. I can quickly fix it as far as knowing the context size.

For the "32k Mixtral" case, the odd part is that the developer should not have to tell the API what the context window is; it should be the other way around. APIs commonly accept a "max_tokens" parameter as a hard limit on the response length. I'm sure the Ollama folks will make the API more standard; their recent /chat endpoint shows that they're on a good path.
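To make the distinction concrete, here is a rough sketch of a request that sets both values. As far as I understand the Ollama API docs, per-request settings go in an "options" object, where "num_ctx" is the context window and "num_predict" is the closest analogue to "max_tokens". The helper names, model name and numbers are only illustrative.

```python
import json
import urllib.request
from typing import Any, Dict


def build_generate_payload(model: str, prompt: str,
                           num_ctx: int, num_predict: int) -> Dict[str, Any]:
    """Build a body for POST /api/generate (hypothetical helper)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {
            "num_ctx": num_ctx,          # context window the caller requests
            "num_predict": num_predict,  # cap on the number of generated tokens
        },
    }


def send_generate(payload: Dict[str, Any],
                  base_url: str = "http://localhost:11434") -> Dict[str, Any]:
    """POST the payload to a local Ollama server and return the parsed JSON."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example body: a 32k context window with a 512-token response limit.
payload = build_generate_payload("mixtral", "Hello", num_ctx=32768, num_predict=512)
```

Nothing is sent until `send_generate(payload)` is called against a running server.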

Prioritized.

Enrico Ros · Answer 5 · Fri Jan 26 2024 17:50:05 GMT+0800 (China Standard Time)

@XReyRobert implemented, releasing in 3 hours in 1.12.0. The context size is now inferred from num_ctx where available and set correctly. Please refer to Ollama / Jeffrey's post (https://github.com/jmorganca/ollama/releases/tag/v0.1.19) to change that value for your Ollama models.
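For anyone who wants the same behaviour in their own scripts, the "num_ctx where available, otherwise the default" logic can be sketched as below. This is not the code shipped in 1.12.0; the function names are made up, and reading "4k" as 4096 tokens is an assumption.

```python
import json
import re
import urllib.request

# Assumption: the "4k across the board" default means 4096 tokens.
DEFAULT_CONTEXT_TOKENS = 4096

# A "num_ctx <number>" line inside the "parameters" text of /api/show.
_NUM_CTX = re.compile(r"^num_ctx\s+(\d+)\s*$", re.MULTILINE)


def resolve_context_size(show_response: dict,
                         default: int = DEFAULT_CONTEXT_TOKENS) -> int:
    """Use num_ctx when the model reports it, otherwise fall back to default."""
    match = _NUM_CTX.search(show_response.get("parameters") or "")
    return int(match.group(1)) if match else default


def fetch_show(model_name: str,
               base_url: str = "http://localhost:11434") -> dict:
    """POST to /api/show, mirroring the curl calls earlier in the thread."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"name": model_name}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a local server running, `resolve_context_size(fetch_show("yarn-mistral:7b-128k"))` should give 131072 for the response shown earlier in the thread.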