Provide pruned version for weaker hardware
CommanderTvis opened this issue
It would be really useful to have a pruned version of the model (like Balaboba) so it can run on less powerful GPUs.
Quantization, even down to 4 bits, may also be feasible, as has been done successfully for LLaMA: https://github.com/ggerganov/llama.cpp
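To illustrate the idea, here is a minimal sketch of blockwise absmax 4-bit quantization in NumPy. This is a simplification, not the exact ggml/llama.cpp storage format (which packs two 4-bit values per byte and uses its own block layouts such as Q4_0); the block size of 32 and the symmetric [-7, 7] range are assumptions for the example.

```python
import numpy as np

def quantize_q4(block: np.ndarray) -> tuple[float, np.ndarray]:
    """Absmax quantization of one block of weights to a signed
    4-bit range [-7, 7] plus one float scale per block (a rough
    sketch of the scheme llama.cpp uses, not its exact format)."""
    scale = float(np.abs(block).max()) / 7.0
    if scale == 0.0:
        return 0.0, np.zeros_like(block, dtype=np.int8)
    q = np.clip(np.round(block / scale), -7, 7).astype(np.int8)
    return scale, q

def dequantize_q4(scale: float, q: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from the quantized block."""
    return q.astype(np.float32) * scale

# Quantize a toy weight tensor in blocks of 32 values
weights = np.random.randn(1024).astype(np.float32)
restored = np.concatenate(
    [dequantize_q4(*quantize_q4(b)) for b in weights.reshape(-1, 32)]
)
# Per-element reconstruction error stays within half a quantization step
print(float(np.abs(weights - restored).max()))
```

Each float32 weight collapses to 4 bits plus a shared per-block scale, so memory drops roughly 8x, which is what makes running large models on weaker GPUs plausible.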
+1. This distributed-inference technique might also be very applicable here: https://petals.ml