aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as the network provider while running NCCL applications.

Error (and crash) when using EFA from docker running on an Ubuntu AMI

yukunlin opened this issue

Overview of issue

I have a docker image with EFA and aws-ofi-nccl installed. This image "works" with EFA when running on an AL2 AMI (though it is slow, see #106). However, when the same image is run on an Ubuntu AMI, we get an error message:

[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
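For reference, the "-12" here is a negated Linux errno: 12 is ENOMEM ("Cannot allocate memory"), which suggests libfabric failed to allocate or pin memory for the event queue. A quick way to decode it (a diagnostic aside, not part of the original report):

```shell
# Decode errno 12; libfabric reports it negated as -12
python3 -c 'import errno, os; print(errno.errorcode[12], "-", os.strerror(12))'
# ENOMEM - Cannot allocate memory
```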

Repro Steps

Ubuntu Setup

  • Instance type: p3dn.24xlarge (EFA enabled)
  • AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403, (ami-061dac75dbd529aef in us-west-2)
    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-rdm
          version: 114.10
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-dgrm
          version: 114.10
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA
      
    • nvidia-docker version: 20.10.14

Training command (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env FI_PROVIDER=EFA \
   --env NCCL_SOCKET_IFNAME=ens5 \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   --master_addr=$MASTER_IP --master_port=12345 \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt

Full log: https://gist.github.com/yukunlin/dd5e5ee6e41f84a696e76b74e75c65d0

Other Observations

This seems related to #44. Following #44 (comment), I ran nccl-tests on the AMI successfully, which indicates EFA is working (on the AMI itself, at least).

Note that the docker image works (it doesn't crash) when run from an AL2 AMI (see #106).

nzmsv commented

Have you tried increasing the amount of memory that can be pinned? This limit can be set to unlimited by passing --ulimit memlock=-1 to the docker command. The full stanza I use personally is:

    --ulimit nofile=50000:50000 \
    --ulimit stack=67108864 \
    --ulimit memlock=-1 \
    --ulimit core=-1
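A quick way to check whether the locked-memory limit is the problem is to compare ulimit -l inside a container with and without the flag (a sketch; plain docker is shown, and the stock ubuntu:20.04 image stands in for the training image):

```shell
# Without the flag, containers inherit a small default memlock limit
# (reported in KiB), which is too low for EFA's pinned RDMA buffers
docker run --rm ubuntu:20.04 bash -c 'ulimit -l'

# With the flag, the limit is lifted entirely
docker run --rm --ulimit memlock=-1 ubuntu:20.04 bash -c 'ulimit -l'
# unlimited
```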
yukunlin commented

@nzmsv thanks for the suggestion! Adding --ulimit memlock=-1 to the docker command fixes this.