Error (and crash) when using EFA from docker running on ubuntu AMI
yukunlin opened this issue · comments
Overview of issue
I have a Docker image with EFA and aws-ofi-nccl installed. This image "works" with EFA when running on an AL2 AMI (but is slow, see #106). However, when the same image is run on an Ubuntu AMI, we get an error message:
[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
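A hedged aside on reading that code: libfabric generally reports failures as negated errno-style values, so the -12 here most likely corresponds to errno 12, which is ENOMEM on Linux. The sketch below only decodes the number using the Python standard library; the helper name `describe_fabric_error` is hypothetical, and the mapping from the provider's code to a system errno is an assumption rather than something confirmed by the log.

```python
import errno
import os


def describe_fabric_error(code: int) -> str:
    """Map a negated errno-style code (e.g. -12) to its symbolic name and message.

    Assumption: the code follows the "negative errno" convention, so the
    absolute value is looked up in the local errno table.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    # Decode the value reported in the error line above.
    print(describe_fabric_error(-12))
```

If the result is a memory-allocation error, limits on pinned (locked) memory inside the container are a reasonable place to look first.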
Repro Steps
Ubuntu Setup
- Instance type: p3dn.24xlarge (EFA enabled)
- AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403 (ami-061dac75dbd529aef in us-west-2)
- Nvidia driver version: 510.47.03
- CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)
- /opt/amazon/efa/bin/fi_info -p efa:
    provider: efa
        fabric: EFA-fe80::da:b9ff:fe04:8af
        domain: rdmap0s6-rdm
        version: 114.10
        type: FI_EP_RDM
        protocol: FI_PROTO_EFA
    provider: efa
        fabric: EFA-fe80::da:b9ff:fe04:8af
        domain: rdmap0s6-dgrm
        version: 114.10
        type: FI_EP_DGRAM
        protocol: FI_PROTO_EFA
- nvidia-docker version: 20.10.14
Training command (executed on both training nodes):
nvidia-docker run \
--mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
--network host \
--device /dev/infiniband/uverbs0 \
--env FI_PROVIDER=EFA \
--env NCCL_SOCKET_IFNAME=ens5 \
--env LOGLEVEL=INFO \
--env NCCL_PROTO=simple \
--env NCCL_DEBUG=INFO \
919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
--master_addr=$MASTER_IP --master_port=12345 \
fairseq_train_wrapped \
--task language_modeling \
/job/fairseq/data-bin/wikitext-103 \
--save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu" \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
--tokens-per-sample 512 --sample-break-mode none \
--max-tokens 2048 \
--max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt
Full log: https://gist.github.com/yukunlin/dd5e5ee6e41f84a696e76b74e75c65d0
Other Observations
This seems related to #44. I did follow #44 (comment) and ran nccl-test on the AMI successfully, which indicates EFA is working (on the AMI at least).
Note that the docker image works (it doesn't crash) when running from an AL2 AMI (see #106).
Have you tried increasing the memory that can be pinned? This limit can be set to unlimited by passing --ulimit memlock=-1 to the docker command. The full stanza I use personally is:
--ulimit nofile=50000:50000 \
--ulimit stack=67108864 \
--ulimit memlock=-1 \
--ulimit core=-1
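To check whether the memlock setting actually reached the process, one option is to print the limit from inside the container before launching training. This is a minimal sketch using Python's standard `resource` module; the helper name `memlock_limit` is hypothetical, and it assumes a Linux container where `RLIMIT_MEMLOCK` is available.

```python
import resource


def memlock_limit():
    """Return the (soft, hard) locked-memory limits as human-readable strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY is what --ulimit memlock=-1 is expected to produce.
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

Running this with and without the --ulimit flags should show whether the container inherits a small default limit. If both values already read "unlimited", pinned memory is probably not the cause.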