COST TO RUN & DEPENDENCE on OpenAI
jayfalls opened this issue
When watching Dave's demo of the project, one big standout was his remark about the API timing out after running the demo only briefly, along with the sheer number of inferences that will need to be generated.
I don't think this limitation is necessary, and depending on a third party is far from ideal. The limiting factor should instead be the amount of compute available, and the best outcome would be getting this to run on consumer hardware.
As such, I suggest using the dolphin-2.1-mistral-7b model, specifically a quantised version that can run with a maximum RAM requirement of only 7.63 GB and a download size of only 5.13 GB.
It can be loaded through the llama-cpp-python bindings, which keeps the project's requirement of being pure Python.
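For illustration, here is a minimal sketch of what loading it could look like (the GGUF filename and prompts are placeholders; I'm assuming the Q5_K_M quant from one of the public GGUF repacks):

```python
from llama_cpp import Llama

# Load a local quantised GGUF model; the filename is illustrative and
# assumes the Q5_K_M quant of dolphin-2.1-mistral-7b has been downloaded.
llm = Llama(
    model_path="./dolphin-2.1-mistral-7b.Q5_K_M.gguf",
    n_ctx=4096,            # context window size
    n_gpu_layers=0,        # CPU-only; raise this to offload layers onto a GPU
    chat_format="chatml",  # dolphin-2.1 is tuned on the ChatML prompt format
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": "Summarise the current task state."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```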
There are benefits to doing it this way:
- No dependence on a third party for the LLM (THE MOST ESSENTIAL COMPONENT)
- No cost besides the electricity bill and, of course, the upfront hardware cost
And benefits to this model specifically:
- Higher benchmark performance than Llama 2 70B
- Apache 2.0 licensed, meaning it is commercially viable
- Completely uncensored, which gives it higher performance and better compliance with the system and user prompts
- Small model, which means faster inference and lower memory requirements
- Quantised, which means it can run with a maximum RAM requirement of 7.63 GB
- GGUF format, which has broad support across many different bindings, running on CPU, GPU, or a mixed CPU+GPU split
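On the integration side, one possible path (just a sketch, not something the project has committed to): llama-cpp-python also ships an OpenAI-compatible server (`python -m llama_cpp.server --model <path-to-gguf>`), so the existing OpenAI client code could in principle stay untouched and simply point at localhost. Using the openai>=1.0 client style:

```python
# Hedged sketch: assumes the bundled server is already running, e.g.
#   python -m llama_cpp.server --model ./dolphin-2.1-mistral-7b.Q5_K_M.gguf
# It speaks the OpenAI wire format, so only the base URL changes.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local server; 8000 is the default port
    api_key="sk-local-placeholder",       # no real key is needed or sent anywhere
)
reply = client.chat.completions.create(
    model="local-model",  # largely ignored when a single model is loaded
    messages=[{"role": "user", "content": "Hello from a local LLM"}],
)
print(reply.choices[0].message.content)
```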
This is just a suggestion, and this particular model will probably be outdated within the week.
But I think that running locally is truly the right way to go.
This does not belong here. Please move it to the Discussions tab.