Fails with cannot import colo_set_process_memory_fraction in Docker
Adrian-1234 opened this issue · comments
On a multi-GPU A100 system:
$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE
SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1
parallel = dict(
    tensor=dict(mode="1d", size=4),
)
model = dict(
    type="palm_small",
    # use_grad_checkpoint=False,
    # use_act_offload=False,
)
fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
export DATA=wiki_dataset/
export TOKENIZER=tokenizer/
$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py
Traceback (most recent call last):
File "train.py", line 18, in <module>
from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python
Traceback (most recent call last):
File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2022-05-29_08:01:15
host : 1d3306a6abee
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 11)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
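An `ImportError` like this usually means the `colossalai` version inside the Docker image differs from the one `train.py` was written against — `colo_set_process_memory_fraction` may have been moved, renamed, or removed between releases, so it is worth checking `pip show colossalai` inside the container. As a diagnostic workaround (not a fix for the version mismatch itself), the import in `train.py` can be guarded so the script reports which symbols are missing instead of crashing in torchrun. Below is a minimal sketch; the helper `try_import` is hypothetical, not part of colossalai:

```python
import importlib


def try_import(module_name, *names):
    """Return the requested attributes from module_name,
    substituting None for anything missing, instead of
    letting ImportError/AttributeError propagate."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return tuple(None for _ in names)
    return tuple(getattr(mod, n, None) for n in names)


# Guarded version of the failing import in train.py.
# If the installed colossalai no longer exports these names,
# both come back as None and we can print a clear message.
colo_set_process_memory_fraction, colo_device_memory_capacity = try_import(
    "colossalai.utils",
    "colo_set_process_memory_fraction",
    "colo_device_memory_capacity",
)

if colo_set_process_memory_fraction is None:
    print(
        "colossalai.utils does not export colo_set_process_memory_fraction; "
        "the installed colossalai version likely differs from the one "
        "train.py expects."
    )
```

Pinning the `colossalai` version in the Dockerfile to the one the PaLM example was developed against should make the guard unnecessary.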