bmaltais / kohya_ss

"Raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) " happened when starting studing

Teriss opened this issue

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Loading binary D:\kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1500
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1500
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1500
steps: 0%| | 0/1500 [00:00<?, ?it/s]epoch 1/1
Traceback (most recent call last):
File "D:\kohya\kohya_ss\python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\kohya\kohya_ss\python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\kohya\kohya_ss\venv\Scripts\accelerate.exe_main
.py", line 7, in
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
simple_launcher(args)
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\kohya\kohya_ss\venv\Scripts\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=C:/Users/PC/Desktop/test/v1-5-pruned-emaonly.safetensors', '--train_data_dir=C:/Users/PC/Desktop/test/input', '--resolution=512,512', '--output_dir=C:/Users/PC/Desktop/test/output', '--logging_dir=C:/Users/PC/Desktop/test/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=150', '--train_batch_size=1', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 3221225477.

When I restarted and trained again, a new error happened.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
For effortless bug reporting copy-paste your error into this form: https://docs.google.com/forms/d/e/1FAIpQLScPB8emS3Thkp66nvqwmjTEgxp8Y9ufuWTzFyr9kJ5AoI47dQ/viewform?usp=sf_link

CUDA SETUP: Loading binary D:\kohya\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1500
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1500
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1500
steps: 0%| | 0/1500 [00:00<?, ?it/s]epoch 1/1
Traceback (most recent call last):
Traceback (most recent call last):
File "D:\kohya\kohya_ss\train_network.py", line 659, in
File "", line 1, in
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\spawn.py", line 116, in spawn_main
train(args)
File "D:\kohya\kohya_ss\train_network.py", line 488, in train
exitcode = _main(fd, parent_sentinel)
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\spawn.py", line 126, in _main
for step, batch in enumerate(train_dataloader):
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\data_loader.py", line 372, in iter
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
dataloader_iter = super().__iter__()
File "D:\kohya\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 444, in __iter__
return self._get_iterator()
File "D:\kohya\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 390, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "D:\kohya\kohya_ss\venv\lib\site-packages\torch\utils\data\dataloader.py", line 1077, in init
w.start()
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\popen_spawn_win32.py", line 93, in init
reduction.dump(process_obj, to_child)
File "D:\kohya\kohya_ss\Python310\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
MemoryError
steps: 0%| | 0/1500 [00:45<?, ?it/s]
Traceback (most recent call last):
File "D:\kohya\kohya_ss\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "D:\kohya\kohya_ss\Python310\lib\runpy.py", line 86, in run_code
exec(code, run_globals)
File "D:\kohya\kohya_ss\venv\scripts\accelerate.exe_main
.py", line 7, in
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
args.func(args)
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
simple_launcher(args)
File "D:\kohya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\kohya\kohya_ss\venv\scripts\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=C:/Users/PC/Desktop/test/v1-5-pruned-emaonly.safetensors', '--train_data_dir=D:/kohya/test/input', '--resolution=512,512', '--output_dir=D:/kohya/test/output', '--logging_dir=D:/kohya/test/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=last', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=150', '--train_batch_size=1', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--cache_latents', '--optimizer_type=AdamW8bit', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale']' returned non-zero exit status 1.

Try training with AdamW instead of AdamW8bit. I think your card can't use the bitsandbytes module required for AdamW8bit.
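In case it is unclear what that flag change amounts to, here is a minimal sketch of the two choices (the bitsandbytes import path is the standard one; the dummy params are only there so the snippet runs standalone):

```python
import torch

# dummy parameter list so the snippet runs on its own
params = [torch.nn.Parameter(torch.zeros(4, 4))]

# --optimizer_type=AdamW -> plain PyTorch AdamW; no bitsandbytes required
optimizer = torch.optim.AdamW(params, lr=1e-4)

# --optimizer_type=AdamW8bit -> bitsandbytes, which needs its CUDA binary to load:
# import bitsandbytes as bnb
# optimizer = bnb.optim.AdamW8bit(params, lr=1e-4)
```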

Thanks for the suggestion, but it didn't work.
I tried restarting the computer and running it again, and it worked. But after the computer was on standby overnight, I got this error again, sometimes a new one: "Memory allocation failure" or "Out Of Memory". My card is an RTX 3080 with 12GB of memory. It seems to use only 6GB, yet it says it is OOM.
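For what it's worth, the MemoryError in the trace above is raised inside ForkingPickler.dump, i.e. while pickling dataloader worker state into system RAM, so the GPU really can sit at ~6GB while the run dies. A quick sketch to check the GPU side (assuming a recent PyTorch that has torch.cuda.mem_get_info):

```python
import torch

if torch.cuda.is_available():
    # free/total memory on the whole device, in bytes
    free, total = torch.cuda.mem_get_info()
    print(f"GPU free/total: {free / 2**30:.1f} / {total / 2**30:.1f} GiB")
    # memory held by this process's tensors
    print(f"allocated by this process: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")
else:
    print("CUDA not available")
```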

So it's something related to Windows and possibly Windows drivers... Hard to fix those.

Hi, I found that the error occurred when loading the data, so I tried turning down the num_workers parameter of torch.utils.data.DataLoader, and it worked. When I set it to 0, training was the fastest… I think Python's multiprocessing may not be very efficient on Windows.
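For anyone asking how to apply this, a toy sketch of the idea (the dataset here is a stand-in; in the training script the change goes where its DataLoader is constructed):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-in for the training dataset
dataset = TensorDataset(torch.zeros(8, 3))

# num_workers=0 loads batches in the main process: no worker is spawned and
# nothing is pickled, which avoids the spawn-time MemoryError/EOFError above
loader = DataLoader(dataset, batch_size=1, num_workers=0)

for (batch,) in loader:
    pass  # the training step would run here
```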

Thank you for the update. I will update the default value in the GUI to set it to 0 to avoid similar issues for other users!

Could you tell me how you solved this problem in detail? Thanks!

You can fix it by updating to the latest version; the author has put this setting in the GUI.