2-bit llama70b with quip-sharp
the-crypt-keeper opened this issue
- Requires CUDA 12.
- Attempting to use `nvcr.io/nvidia/pytorch:23.06-py3` as the base, but something is wrong with transformer-engine inside that image and it crashes on load.
- pip installing `git+https://github.com/NVIDIA/TransformerEngine.git@main` fixes the crash.
- After that, the model loads, but the generate call never returns.
Closing out all old 2-bit quants.