AnswerDotAI / fsdp_qlora

Training LLMs with QLoRA + FSDP

AnswerDotAI/fsdp_qlora Issues

Dual GPU training instantly powers off my desktop
Closed a month ago, 6 comments
train.py
Updated a month ago
Request for Scripts to Merge QDoRA Adapters with Base Model for vLLM Inference
Updated a month ago, 2 comments
ValueError report
Updated a month ago
Question about GPU memory usage.
Updated a month ago
DeepSeek VL support
Updated a month ago
train.py script crashes when using HQQ
Updated a month ago, 3 comments
How does one load and do inference on fine-tuned Llama 3 using bnb_dora train script?
Updated a month ago
BOFT support?
Updated a month ago
Can I use this script to pre-train models?
Updated a month ago
Fine tuning only runs on CPU
Updated a month ago, 4 comments
Issues with LLaMA-3-70B
Closed a month ago, 1 comment
ProcessExitedException: process 0 (2x 4090)
Updated a month ago, 39 comments
llama3?
Updated 2 months ago
What if I have three graphics cards?
Updated 2 months ago, 1 comment
Results after running
Updated 2 months ago
How to load the saved model?
Updated 2 months ago
process 0 terminated with signal SIGKILL
Updated 2 months ago, 4 comments
nan when the input length is large
Updated 2 months ago, 5 comments
Question about adding / training Mixtral
Updated 2 months ago, 1 comment
How to run inference using 70B? Or do we need to implement it ourselves, the same way as training?
Updated 2 months ago, 1 comment
Why is o_proj not targeted?
Updated 2 months ago
Q on comparison with SFTTrainer
Updated 2 months ago
/opt/conda/conda-bld/pytorch_1708025847130/work/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [14,0,0] Assertion `t >= 0 && t < n_classes` failed.
Updated 2 months ago
Bigger context size?
Updated 2 months ago
Torch Compile?
Updated 3 months ago
Example with AMD ROCm/HIP
Closed 3 months ago, 4 comments
RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase
Closed 3 months ago, 3 comments
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Updated 3 months ago
Running into CUDA out of memory with hqq_lora
Closed 3 months ago, 3 comments
Bugs when fine-tuning with FSDP multi-node
Updated 3 months ago, 1 comment
NCCL issue training with two GPUs
Updated 3 months ago, 2 comments
Training from e
Closed 3 months ago, 1 comment
License
Closed 3 months ago