nomic-ai / gpt4all

gpt4all: run open-source LLMs anywhere

Home Page: https://gpt4all.io

Python Bindings: Model no longer kept in cache

woheller69 opened this issue

Bug Report

I just compiled the updated Python bindings, v2.7.0.
When I now terminate and restart my GUI, the whole model has to be loaded from disk again, which can take a long time.
In previous versions only the first start was slow; subsequent starts with the same model were fast.

Steps to Reproduce

Use CLI:

python3 app.py repl --model dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf

/exit

python3 app.py repl --model dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf

-> the model is loaded from disk again (slow)
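For reference, here is a minimal timing sketch (my own helper, not part of the bindings; it assumes the model file is already in the default model directory, so no download is triggered). Running it twice in a row makes the difference measurable:

# time_load.py - run twice in a row; with v2.6.0 the second run is much
# faster because the model file is still in the OS page cache
import time
from gpt4all import GPT4All

start = time.monotonic()
model = GPT4All("dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf")
print(f"model loaded in {time.monotonic() - start:.1f} s")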

Expected Behavior

After restarting the CLI, the model file should still be in the OS cache, so the second load should be fast.

Your Environment

  • Bindings version: v2.7.0 (compiled from source)
  • Operating System: Ubuntu
  • Chat model used: Dolphin 2.7 Mixtral 8x7b

I uninstalled v2.7.0, downgraded to v2.6.0, and caching works again.

This does not happen with smaller models such as Llama 3 8B Instruct Q8, which is 8.5 GB in size; Dolphin 2.7 Mixtral 8x7b Q4_K_M is 26 GB.

I have 36 GB of RAM, so this should not be a problem, and it worked perfectly in v2.6.0.

The behaviour in the resource monitor is also strange: the load first fills the cache and then moves data from cache into process memory.
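To put numbers on what the resource monitor shows, a small sampling script can log used vs. cached memory while the model loads in another terminal (a sketch assuming psutil is installed; the "cached" field is Linux-specific):

# mem_watch.py - sample used vs. cached memory once per second
import time
import psutil

GIB = 1024 ** 3
for _ in range(120):
    vm = psutil.virtual_memory()
    cached = getattr(vm, "cached", 0)  # "cached" is a Linux-only field
    print(f"used: {vm.used / GIB:5.1f} GiB   cached: {cached / GIB:5.1f} GiB")
    time.sleep(1)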

Loading to cache:
[screenshot: Screenshot from 2024-05-16 21-30-05]
Moving from cache to memory:
[screenshot: Screenshot from 2024-05-16 21-30-18]

For the smaller model only the cache grows (model fully loaded):
[screenshot: Screenshot from 2024-05-16 21-33-32]

With v2.6.0, Dolphin 2.7 is held in cache and reloads quickly:
[screenshot: Screenshot from 2024-05-17 07-50-40]
I notice the same with llama-cpp-python. Has there been a regression in llama.cpp?
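If memory-mapping is the difference, that would explain it: with mmap the model's pages live in the OS page cache and survive process exit, while a plain read into process memory is discarded on exit. With llama-cpp-python this can be forced explicitly (a sketch using its documented use_mmap parameter; I have not confirmed whether a default changed between versions):

# with use_mmap=True the model file is memory-mapped, so its pages stay
# in the OS page cache after the process exits; with use_mmap=False the
# file is read into anonymous memory that is freed on exit
from llama_cpp import Llama

llm = Llama(
    model_path="dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf",
    use_mmap=True,
)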