Requirements.txt and Trained weights
abdur75648 opened this issue · comments
Thanks for the good quality work.
It would be great if you could kindly upload a requirements.txt (or specify the important library versions).
Also, could the trained weights be released as well?
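For reference, a minimal requirements.txt might look like the sketch below. Only the DeepSpeed version is confirmed (it appears as 0.11.1 in the log later in this thread); the other packages and their unpinned versions are assumptions based on the repo being a LLaVA-style project.

```
# Sketch only: deepspeed version taken from the training log in this thread;
# the remaining entries are assumed and unpinned.
deepspeed==0.11.1
torch
transformers
```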
Thanks for your attention. We will upload requirements.txt and release the pre-trained weights.
Thanks a lot @sunsmarterjie
Somehow, I was able to set up the environment and run the training script.
However, after loading the dataset and the model, when initializing deepspeed distributed training with backend nccl, I'm getting the following error:
[2024-02-15 22:41:11,208] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.11.1, git-hash=unknown, git-branch=unknown
[2024-02-15 22:41:11,209] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-02-15 22:41:11,209] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-02-15 22:41:21,484] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 4989
[2024-02-15 22:41:38,785] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 4990
[2024-02-15 22:41:38,785] [ERROR] [launch.py:321:sigkill_handler] ['/home/chemical/dual/ch7190150/.conda/envs/chatterbox/bin/python3.1', '-u', 'train_custom1.py', '--local_rank=1', '--version', 'llava-llama-2-13b-chat-lightning-preview'] exits with return code = -11
I found a similar issue here but still couldn't solve it. It seems to be an NCCL-related problem: the user there says the NCCL backend is not implemented separately in DeepSpeed (DeepSpeed goes through torch.distributed and uses its TorchBackend), but in your code I see NCCL being used.
Kindly help if you have any idea about this; I'd be thankful.
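One detail worth decoding from the log above: the launcher reports `exits with return code = -11`, and by the Python `subprocess` convention a negative return code means the worker was killed by that signal number, here signal 11 (SIGSEGV, a segmentation fault) rather than a clean Python exception. A small helper (hypothetical, not from the repo) shows the convention:

```python
import signal

def describe_exit(returncode):
    """Interpret a subprocess return code.

    Following the Python subprocess convention on Unix, a negative
    value means the process was terminated by that signal number;
    a non-negative value is an ordinary exit code.
    """
    if returncode < 0:
        return f"killed by signal {-returncode} ({signal.Signals(-returncode).name})"
    return f"exited with code {returncode}"

# The return code from the DeepSpeed launcher log above:
print(describe_exit(-11))  # -> killed by signal 11 (SIGSEGV)
```

A segfault during `init_distributed` often points to a native-library problem (mismatched CUDA/NCCL/PyTorch builds) rather than a Python-level bug, which is consistent with pinning library versions being the fix.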
We did not encounter this issue. You can try the LLaVA model we used, available at: https://huggingface.co/sunsmarterjieleaf/ChatterBox/tree/main/llava-llama-2-13b-chat-lightning-preview
Thank you very much