stanford-crfm / mistral

Mistral: A strong, northwesterly wind: Framework for transparent and accessible large-scale language model training, built with Hugging Face 🤗 Transformers.

torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file

yandachen opened this issue · comments

Hello, I installed your package using setup/setup.sh. The single-GPU command in the tutorial works fine, but when I run the multi-GPU command

deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json --run_id tutorial-gpt2-micro-multi-node

I get the following error:

File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1166, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: ~/.cache/torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory.
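(A common cause of this ImportError is a stale or half-built extension left over from an earlier run. One workaround — my assumption, not something confirmed in this thread — is to delete the cached build so DeepSpeed recompiles fused_adam on the next launch:)

```shell
# Remove the cached fused_adam build; DeepSpeed will JIT-rebuild it on the next run.
rm -rf ~/.cache/torch_extensions/py38_cu113/fused_adam
```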

I also tried running the same code in the same environment but on a different machine, and this time I get the error message

File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1494, in verify_ninja_availability
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
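(For reference, a quick way to check whether the ninja build tool that torch.utils.cpp_extension requires is visible on a given machine, using only the standard library — a sketch, assuming the mistral conda environment is active:)

```python
import shutil

# torch.utils.cpp_extension raises "Ninja is required to load C++ extensions"
# when the ninja binary cannot be found; shutil.which reports whether it is
# on PATH (None means it is not, and `pip install ninja` would be needed).
ninja_path = shutil.which("ninja")
print("ninja found at:", ninja_path)
```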

Do you have any idea how to resolve this issue? I installed all packages using setup/setup.sh, so my package versions should match the ones pinned in your requirements files. Thanks!

commented

This worked for me today:

# create new conda environment
conda create -n mistral-march-2023 python=3.8.12 pytorch=1.11.0 torchdata cudatoolkit=11.3 -c pytorch
conda activate mistral-march-2023
pip install -r setup/pip-requirements.txt

# install flash attention
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install

# clone mistral
cd ..
git clone https://github.com/stanford-crfm/mistral.git
cd mistral
git checkout mistral-flash-dec-2022

# install modified transformers
cd ..
git clone https://github.com/huggingface/transformers.git
cd transformers
# copy the modified modeling_gpt2.py into the transformers repo before installing
cp ../mistral/transformers/models/gpt2/modeling_gpt2.py src/transformers/models/gpt2/modeling_gpt2.py
pip install -e .

# run demo
# note: some of the default configurations may be broken; you'll likely need to modify them for your experiment, but that is easy to do
cd ..
cd mistral
deepspeed --hostfile hostfile --num_gpus 8 --num_nodes 1 --master_addr sphinx6 train.py --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 8 --training_arguments.per_device_train_batch_size 4 --training_arguments.deepspeed conf/deepspeed/z2-small-bf16-conf.json --run_id mistral-w-flash-demo
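(The --hostfile flag above expects a DeepSpeed hostfile: one line per machine, giving its hostname and the number of GPU slots on it. A minimal example for a two-machine, 8-GPU-per-node setup — the hostnames here are placeholders:)

```
machine1 slots=8
machine2 slots=8
```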

Thanks for your prompt response. I ran the above code but received this error message:

CerberusError: config could not be validated against schema. The errors are,
{'training_arguments': [{'gradient_checkpointing': ['unknown field']}]}

Can you let me know how to fix this?

By the way, what command are you using to install the python packages needed for mistral? Are you using pip install -r setup/pip-requirements.txt? Just want to confirm so that I'm using the same version.

Thanks.

commented

I updated the branch to fix those configuration issues.
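(If you hit the Cerberus error before pulling the updated branch, a stopgap — my assumption, not an official fix — is to remove the key the schema rejects from the training_arguments section of your config, e.g.:)

```yaml
# hypothetical excerpt from conf/<your-config>.yaml
training_arguments:
  # gradient_checkpointing: true   # <- comment out / delete the unknown field
  per_device_train_batch_size: 4
```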

And yes, I ran that pip install as well and forgot to include it in the instructions above.

commented

I'm probably going to update the main branch to do this and move the current changes on main into a separate branch.

commented

So main should be sort of like mistral-flash-dec-2022 ...

Hello, thanks so much for working on this! The code you provided above works!