2-bit llama70b with quip-sharp
the-crypt-keeper opened this issue
- Requires CUDA 12.
- Attempting to use `nvcr.io/nvidia/pytorch:23.06-py3` as the base, but something is wrong with transformer-engine inside that image and it crashes on load.
- pip installing `git+https://github.com/NVIDIA/TransformerEngine.git@main` fixes the crash.
- After that, the model loads, but the generate call never returns.
Closing out all old 2-bit quants.