Python Bindings: Model no longer kept in cache
woheller69 opened this issue
Bug Report
I just compiled the updated Python bindings (V2.7.0).
Now, whenever I terminate my GUI and start it again, the whole model has to be loaded again, which can take a long time.
In previous versions only the first start was slow; subsequent starts with the same model were fast.
Steps to Reproduce
Use CLI:
python3 app.py repl --model dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
/exit
python3 app.py repl --model dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf
-> model loads again
Expected Behavior
When the CLI is restarted, the model should already be in the cache.
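A rough way to check whether the model file is still in the operating system's file cache after `/exit` is to time a plain sequential read of it. This is only a sketch using the Python standard library, not part of the bindings; the helper names `time_sequential_read` and `report` are made up for illustration, and the path passed in is whatever model file is being tested.

```python
import time


def time_sequential_read(path, chunk_size=1 << 20):
    """Read the whole file once and return (bytes_read, elapsed_seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids an extra Python-level buffer; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def report(path):
    """Format the read size, duration and throughput for one pass over the file."""
    size, elapsed = time_sequential_read(path)
    rate = size / max(elapsed, 1e-9) / 1e6
    return f"{size} bytes in {elapsed:.2f} s ({rate:.0f} MB/s)"
```

Calling `print(report("dolphin-2.7-mixtral-8x7b.Q4_K_M.gguf"))` right after exiting the CLI gives a throughput figure. A rate far above what the disk can deliver suggests the file was served from cache; a rate close to disk speed suggests it was evicted.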
Your Environment
- Bindings version: V2.7.0 (compiled)
- Operating System: Ubuntu
- Chat model used: Dolphin 2.7 Mixtral 8x7b
I uninstalled V2.7.0 and downgraded to V2.6.0, and the cache works again.
This does not happen with smaller models, such as Llama 3 8B Instruct Q8, which is 8.5 GB in size.
Dolphin 2.7 Mixtral 8x7b Q4_K_M is 26 GB.
I have 36 GB of RAM, so this should not be a problem, and it worked perfectly in V2.6.0.
The behaviour in the resource monitor is also strange: it first fills the cache and then moves data from the cache to memory.
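To put numbers on what the resource monitor shows, the `Cached` and `AnonPages` fields of `/proc/meminfo` on Linux can be sampled while the model loads. The sketch below only parses that text; the function names are hypothetical, and it assumes the usual `Name:   value kB` line layout.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name:" prefix
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def cache_vs_anon_gib(text):
    """Return (page cache GiB, anonymous memory GiB) from meminfo text."""
    info = parse_meminfo(text)
    kib_per_gib = 1024 * 1024
    cached = info.get("Cached", 0) / kib_per_gib
    anon = info.get("AnonPages", 0) / kib_per_gib
    return cached, anon
```

Passing in the contents of `/proc/meminfo` (for example `cache_vs_anon_gib(open("/proc/meminfo").read())`) every few seconds during a load would show whether the cached figure drops while the anonymous figure grows. That pattern would match the observation above, although it does not by itself explain the cause.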