unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory

Home Page: https://unsloth.ai


CUDA_VISIBLE_DEVICES not functioning

xtchen96 opened this issue · comments

I saw an error message when trying to do supervised fine-tuning with 4xA100 GPUs. So the free version cannot be used on multiple GPUs?

RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Oh, currently Unsloth does not support multi-GPU, sorry - our enterprise plans have it for now. We're currently concentrating on adding Ollama support, Llama-3 bug fixes, all-model support, and more in the OSS version.

@danielhanchen Is there a way to run Unsloth on only 1 GPU when I have a 2-GPU node? I get the same error, and I want to use only 1 GPU since the model easily fits on it.
I tried

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

But it did not work

Export it via the shell before running the Python script

Yep you have to set the env variable before running Unsloth
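For reference, the usual gotcha here is that CUDA_VISIBLE_DEVICES is only honoured if it is set before CUDA is initialised - in practice, before torch (and therefore unsloth) is imported at the top of the script. A minimal sketch of the in-script variant, assuming the model fits on the chosen GPU:

import os

# Must be set before importing torch or unsloth, otherwise CUDA has already
# enumerated every GPU and the variable is ignored for this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from unsloth import FastLanguageModel  # import only after the env var is set

print(torch.cuda.device_count())  # should report 1

Exporting the variable in the shell before launching the script achieves the same thing and avoids the import-ordering issue entirely.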

Setting the env variable before running Unsloth still does not resolve the problem.

Used: export CUDA_VISIBLE_DEVICES=0 but it still comes up with the error:
RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Also used: export CUDA_VISIBLE_DEVICES=1 but same problem.

@danielhanchen I am confused.
Kindly help.
The error is asking me to get a commercial license.

@miary what GPUs are you using and are they already running another job?

@danielhanchen I have 2 GPUs, both RTX 3090. This runtime error about more than 1 GPU is a brand new issue that came from Unsloth 2024.6.

I have a project that is using Unsloth 2024.5 and it works just fine.

It is completely fine if Unsloth wants to charge for environments with more than one GPU. However, the option should be given to use only one GPU, which is what setting the CUDA_VISIBLE_DEVICES env variable is supposed to do, but it's apparently broken. It looks like a really bad bug because it breaks the entire project.

Hmm I shall investigate this hmmm.

How do you all call Unsloth? Via the terminal as a Python script? Via Jupyter?

I am using a Python script and had the same issue when trying to run on GPU 1 (if I set the code to have visibility only on GPU 0, it works fine).

I am using these as the first lines in my main code:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # enumerate GPUs in PCI bus order, matching nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = "1"        # expose only physical GPU 1 to this process
os.environ["GRADIO_SHARE"] = "1"                # Gradio setting, unrelated to GPU selection
os.environ["WORLD_SIZE"] = "1"                  # tell distributed launchers this is a single-process run

@Chirobocea So do you use python train.py or like torchrun?

Usually I use python train.py. However, I just tried to launch it with torchrun and it has the same issue.
Also, I checked with the debugger that torch indeed sees only one GPU, which is remapped to id 0 for the running code, while during model loading the VRAM is taken only from GPU 1, as expected (per nvidia-smi).

Ok thanks for the info! Running in runpod to see what I can do! :)

@miary @Chirobocea @aflah02 Just fixed it! Hopefully it now can work! Apologies on the issues! Please update Unsloth via

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Thanks for all your work, btw! Killer project!

==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\ /| Num examples = 1,029 | Num Epochs = 1
O^O/ _/ \ Batch size per device = 2 | Gradient Accumulation steps = 4
\ / Total batch size = 8 | Total steps = 30
"-____-" Number of trainable parameters = 41,943,040
Traceback (most recent call last):
File "/home/matto/projects/baby-code/workspace/unsloth-orpo.py", line 128, in
orpo_trainer.train()
File "/home/matto/miniconda3/envs/unsloth_env/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
return inner_training_loop(
File "", line 226, in _fast_inner_training_loop
RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Can confirm it does not occur with unsloth-2024.5 but does with unsloth-2024.6.
If necessary, one can downgrade via:
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git@f9689b1

@molander Do you know if my latest fix fixes stuff?

@danielhanchen no, as soon as I uninstall and pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git, it comes back :(

@miary @Chirobocea @aflah02 Just fixed it! Hopefully it now can work! Apologies on the issues! Please update Unsloth via

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

This patch did not solve the problem. Same error:
RuntimeError: Error: More than 1 GPUs have a lot of VRAM usage. Please obtain a commercial license.

Hmmm, weird - I tried it in Runpod with 4x GPUs and it worked. I shall retry fixing this! Sorry everyone on the issue!

@miary @molander I updated the package again! Apologies on the issues!

I found the below to work (change 1 to any device id)

export CUDA_VISIBLE_DEVICES=1 && python train_file.py

Likewise torchrun also works with that approach.

Hope this works! Thank you for your patience!

@miary @Chirobocea @aflah02 Just fixed it! Hopefully it now can work! Apologies on the issues! Please update Unsloth via

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

I tried this one hour ago and checked.
It seems that the main problem is when I have something running on the other GPU as well.
For example, if I have another job with another env running on GPU 0, I can't run Unsloth on GPU 1.
The error is the same as before.

Confirmed not working as intended. With nothing running on GPU 1, it still will not run, even though Num GPUs shows as 1 in the Unsloth banner below.

But on GPU 0, after I closed everything but Xorg, it worked.

So at first glance it would appear it must use GPU 0, in which case you have a legitimate workaround; ticket closed, back to the real work ;)

Thank you for open-sourcing. I know that it takes big balls, and I assure you, it's worth it all the way around ;)

max_steps is given, it will override any value given in num_train_epochs
==((====))== Unsloth - 2x faster free finetuning | Num GPUs = 1
\ /| Num examples = 1,029 | Num Epochs = 1
O^O/ _/ \ Batch size per device = 2 | Gradient Accumulation steps = 4
\ / Total batch size = 8 | Total steps = 30
"-____-" Number of trainable parameters = 41,943,040

@miary @molander I updated the package again! Apologies on the issues!

I found the below to work (change 1 to any device id)

export CUDA_VISIBLE_DEVICES=1 && python train_file.py

Likewise torchrun also works with that approach.

Hope this works! Thank you for your patience!

@danielhanchen Just wanted to confirm that your patch by including export CUDA_VISIBLE_DEVICES=1 works!!! Thanks for all the good work, greatly appreciated!

@miary Great it worked!

@molander Thanks glad it's a workaround - I'll see what I can do. So export CUDA_VISIBLE_DEVICES=1 && python train_file.py still does not work? Do you use torchrun or python or accelerate?

@danielhanchen Good to go here! I made a new conda env and conda installed pytorch, transformers, etc and it's working like a mule at the grand canyon! Thanks!

Hi @danielhanchen,

I'm experiencing a similar issue. I want to load Llama 3 on two separate A6000 GPUs. When I set CUDA_VISIBLE_DEVICES to "0,1", it works fine on CUDA 0. However, when I try to load the model on CUDA 1 and generate responses, it fails. I want to have the full model on each GPU, not distribute it across them. Any suggestions on how to resolve this?
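One common pattern for this - a full copy of the model on each GPU - is to run one process per GPU, each pinned with its own CUDA_VISIBLE_DEVICES, rather than exposing both GPUs to a single process. A rough sketch of such a launcher (worker.py is a placeholder for whatever script loads the model and generates):

import os
import subprocess

# Launch one independent worker per physical GPU; each child process sees
# exactly one device and loads its own full copy of the model.
procs = []
for gpu_id in ("0", "1"):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = gpu_id  # pin this worker to a single GPU
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))

for p in procs:
    p.wait()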

This still isn't working for me. @danielhanchen can you please remove the exception that fires when more than one GPU has over 4GB of memory in use?

@molander Did you do anything custom on your end? Looking at the main branch, the code is still there.

I'm on a node with multiple GPUs, but I only have one in CUDA_VISIBLE_DEVICES.

The issue I'm having is with these lines in the patch_sft_trainer_tokenizer() function of tokenizer_utils.py:

"import subprocess, re\n"\
"output = subprocess.check_output(\n"\
" 'nvidia-smi --query-gpu=memory.used --format=csv', shell = True)\n"\
"output = re.findall(rb'([\\d]{1,})[\\s]{1,}M', output)\n"\
"output = sum(int(x.decode('utf-8'))/1024 > 4 for x in output)\n"\
"if output > 1: raise RuntimeError(\n"\
" 'Unsloth currently does not work on multi GPU setups - sadly we are a 2 brother team so '\\\n"\
" 'enabling it will require much more work, so we have to prioritize. Please understand!\\n'\\\n"\
" 'We do have a separate beta version, which you can contact us about!\\n'\\\n"\
" 'Thank you for your understanding and we appreciate it immensely!')\n"\

The check for multiple GPUs here is really a count of how many GPUs on the node are using more than 4 GB of memory. This is going to fail for anyone on a busy shared node.

I removed that check, and a similar check in llama.py:

import subprocess, re, gc
output = subprocess.check_output(
    'nvidia-smi --query-gpu=memory.used --format=csv', shell = True)
output = re.findall(rb'([\d]{1,})[\s]{1,}M', output)
output = sum(int(x.decode('utf-8'))/1024 > 4 for x in output)
if output > 1: raise RuntimeError(
    'Unsloth currently does not work on multi GPU setups - sadly we are a 2 brother team so '\
    'enabling it will require much more work, so we have to prioritize. Please understand!\n'\
    'We do have a separate beta version, which you can contact us about!\n'\
    'Thank you for your understanding and we appreciate it immensely!')

Then I was able to run unsloth on my node.

Many apologies for the delay! My brother and I just relocated to SF, so I've only just got back to GitHub issues!

As per the discussion here, I will instead convert it to a warning telling people that Unsloth is not yet functional on multi-GPU setups, and will still allow the finetuning process to go through (especially for shared servers).

As requested, I made it into a warning instead and not an error :) Please update Unsloth and try it out! Hope it works now!

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
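For anyone curious, the behavioural change described above amounts to something along these lines (a sketch only, not the actual diff; check_multi_gpu and its argument are placeholder names):

import warnings

# Sketch of the change: if the nvidia-smi scan quoted earlier in this thread
# finds more than one GPU with heavy VRAM usage, warn and continue instead of
# raising a RuntimeError, so runs on busy shared nodes are not blocked.
def check_multi_gpu(busy_gpu_count: int) -> None:
    if busy_gpu_count > 1:
        warnings.warn(
            "Multiple GPUs appear to be in use, but Unsloth does not yet "
            "support multi-GPU finetuning - continuing on a single GPU."
        )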

Same error in

Name: unsloth
Version: 2024.8

script:
export CUDA_VISIBLE_DEVICES=0 && python naive_train.py --model gemma_2b --dataset alpaca_gpt4 --lora

As requested, I made it into a warning instead and not an error :) Please update Unsloth and try it out! Hope it works now!

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

Also tried this :(

Oh no, it still doesn't work? I'll look into it, sorry!