hpcaitech / PaLM-colossalai

Scalable PaLM implementation in PyTorch

Fails with cannot import colo_set_process_memory_fraction in Docker

Adrian-1234 opened this issue

On a multi-GPU A100 system:

$ cat CONFIG_FILE.py
from colossalai.amp import AMP_TYPE

SEQ_LENGTH = 512
BATCH_SIZE = 8
NUM_EPOCHS = 10
WARMUP_EPOCHS = 1

parallel = dict(
    tensor=dict(mode="1d", size=4),
)

model = dict(
    type="palm_small",
    # use_grad_checkpoint=False,
    # use_act_offload=False,
)

fp16 = dict(mode=AMP_TYPE.NAIVE)
clip_grad_norm = 1.0
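
For reference, a config like this is not imported manually; train.py is expected to hand it to Colossal-AI at launch. Below is a minimal sketch of how such a config file is typically consumed, assuming the Colossal-AI 0.1.x API (colossalai.launch_from_torch and gpc.config), not necessarily the repo's exact train.py:

# Sketch only: how a Colossal-AI training script typically loads CONFIG_FILE.py.
# Assumes the Colossal-AI 0.1.x API; the repo's actual train.py may differ.
import argparse

import colossalai
from colossalai.core import global_context as gpc

parser = argparse.ArgumentParser()
parser.add_argument("--config", type=str, required=True)
parser.add_argument("--from_torch", action="store_true")
args = parser.parse_args()

# torchrun sets RANK/WORLD_SIZE/etc.; launch_from_torch reads them and parses the config file.
colossalai.launch_from_torch(config=args.config)

# Values defined in CONFIG_FILE.py then become attributes of the global context.
print(gpc.config.BATCH_SIZE, gpc.config.SEQ_LENGTH, gpc.config.NUM_EPOCHS)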

export DATA=wiki_dataset/
export TOKENIZER=tokenizer/


$ docker run -ti --gpus all --rm palm torchrun --nproc_per_node 1 train.py --from_torch --config CONFIG_FILE.py

Traceback (most recent call last):
  File "train.py", line 18, in <module>
    from colossalai.utils import colo_set_process_memory_fraction, colo_device_memory_capacity
ImportError: cannot import name 'colo_set_process_memory_fraction' from 'colossalai.utils' (/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/colossalai/utils/__init__.py)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11) of binary: /root/miniconda3/envs/pytorch/bin/python

Traceback (most recent call last):
  File "/root/miniconda3/envs/pytorch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.10.1', 'console_scripts', 'torchrun')())
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-05-29_08:01:15
  host      : 1d3306a6abee
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
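
The root cause is the import at train.py line 18: the Colossal-AI build installed inside the Docker image does not export colo_set_process_memory_fraction from colossalai.utils. One possible stop-gap, assuming the per-process GPU memory cap is optional, is to guard that import; this is a sketch, not the project's official fix:

# Sketch of a guarded import for train.py. Assumes it is acceptable to skip
# capping per-process GPU memory when the helper is missing from the installed
# colossalai build.
try:
    from colossalai.utils import (
        colo_set_process_memory_fraction,
        colo_device_memory_capacity,
    )
except ImportError:
    # Helper was moved or removed in this Colossal-AI version.
    colo_set_process_memory_fraction = None
    colo_device_memory_capacity = None


def maybe_cap_gpu_memory(fraction: float = 0.9) -> None:
    # Apply the cap only when the helper is available.
    if colo_set_process_memory_fraction is not None:
        colo_set_process_memory_fraction(fraction)

Matching the colossalai version inside the Docker image to the one the repository's requirements pin is likely the cleaner resolution.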