n4ze3m / page-assist

Use your locally running AI models to assist you in your web browsing

Home Page: https://chromewebstore.google.com/detail/page-assist-a-web-ui-for/jfgfiigpkhlkbnfnbobbkinehhfdhndo


[NECESSARY CHANGE NEEDED] Regarding how the model is loaded and used by Page Assist for conversations

HakaishinShwet opened this issue · comments

Whenever we ask Page Assist a question, it first loads the model and then unloads it after answering, which is not what many users want. People often want to continue the conversation, and if the model is loaded and unloaded for every single message, which is what happens right now, the result is an honestly very delayed conversation experience.
This happens even if I try to keep the model loaded permanently by setting the keep_alive parameter with a curl command:

curl http://localhost:11434/api/generate -d '{"model": "llama3:8b-instruct-q5_K_M", "keep_alive": -1}'

I generally use this command to preload a particular model, in this case the llama3 8B instruct quant, into GPU VRAM so that I get faster responses while chatting in any Docker web app or other tool I use with Ollama.
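(For anyone reproducing this: as far as I know, you can check which models are actually resident in memory with the ollama ps CLI command or the /api/ps endpoint, so you can confirm the preload worked before and after talking to Page Assist.)

# list models currently loaded in memory (CLI)
ollama ps

# same information over the HTTP API
curl http://localhost:11434/api/ps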
But as I explained above, even if I preload the model into VRAM with that command, Page Assist will still remove my preloaded model from VRAM, load it again to write the message, and unload it from VRAM again once the message is written. This is a serious issue for me and for many other users who rely on local Ollama models constantly for many use cases.
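I do not know exactly what requests Page Assist sends, but as far as I understand the Ollama docs, keep_alive is a per-request setting: whatever value the most recent request for a model carries is the one Ollama honours, so a preload with keep_alive: -1 does not stick if a later request for the same model sends keep_alive: 0 or just falls back to the default. The behaviour I am seeing could be reproduced with something like this (the model name is only my example):

# an empty /api/generate request with keep_alive 0 evicts the model from VRAM immediately
curl http://localhost:11434/api/generate -d '{"model": "llama3:8b-instruct-q5_K_M", "keep_alive": 0}'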
I believe that while building this you probably wanted to avoid keeping VRAM occupied and wasting power unnecessarily, so you designed it to unload after writing the message. Still, people rarely get a satisfactory answer from a local model on the first try, and even when they do, they often want to ask follow-up questions, so either way they will keep the conversation going. If the model is loaded and unloaded again and again after every generated message, you can imagine how much the unnecessary delay will frustrate users; I ran into exactly this, which is why I am raising it. Please change this logic and give users a choice: either keep the model permanently loaded so they can converse comfortably without loading/unloading delays, or load it for a single message and unload it afterwards, as it works right now.
That way users can switch between the two options according to their needs, which should resolve the issue as far as I can tell. If you want to ask more about this, we can discuss it further, but I hope it is clear what I mean.
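To sketch what I mean (this is only a rough illustration, not the extension's actual code): the regular Ollama /api/chat request already accepts keep_alive, so a settings toggle could simply be forwarded into the request body. A value of -1 keeps the model in VRAM indefinitely, 0 unloads it right after the reply, and leaving it out uses Ollama's default of about five minutes. The model name and message below are just placeholders:

# chat request that keeps the model loaded after responding (keep_alive: -1)
curl http://localhost:11434/api/chat -d '{"model": "llama3:8b-instruct-q5_K_M", "messages": [{"role": "user", "content": "Summarize this page"}], "keep_alive": -1}'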