aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as the network provider while running NCCL applications.

Error (and crash) when using EFA from docker running on an Ubuntu AMI

yukunlin opened this issue

Overview of issue

I have a docker image with EFA and aws-ofi-nccl installed. This image "works" with EFA when running on an AL2 AMI (though it is slow, see #106). However, when the same image is run on an Ubuntu AMI, we get an error message:

[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
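For reference, the "-12" here is a negated Linux errno: 12 is ENOMEM ("Cannot allocate memory"), which suggests libfabric failed to allocate or pin memory for the event queue. A quick way to decode it (a diagnostic aside, not part of the original report):

```shell
# Decode errno 12; libfabric reports it negated as -12
python3 -c 'import errno, os; print(errno.errorcode[12], "-", os.strerror(12))'
# ENOMEM - Cannot allocate memory
```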

Repro Steps

Ubuntu Setup

  • Instance type: p3dn.24xlarge (EFA enabled)
  • AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403, (ami-061dac75dbd529aef in us-west-2)
    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-rdm
          version: 114.10
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-dgrm
          version: 114.10
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA
      
    • nvidia-docker version: 20.10.14

Training command (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env FI_PROVIDER=EFA \
   --env NCCL_SOCKET_IFNAME=ens5 \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   --master_addr=$MASTER_IP --master_port=12345 \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt

Full log: https://gist.github.com/yukunlin/dd5e5ee6e41f84a696e76b74e75c65d0

Other Observations

This seems related to #44. Following #44 (comment), I ran nccl-tests on the AMI successfully, which indicates EFA is working (on the AMI itself, at least).

Note that the docker image works (it doesn't crash) when run from an AL2 AMI (see #106).

nzmsv commented

Have you tried increasing the amount of memory that can be pinned? This limit can be set to unlimited by passing --ulimit memlock=-1 to the docker command. The full stanza I use personally is:

    --ulimit nofile=50000:50000 \
    --ulimit stack=67108864 \
    --ulimit memlock=-1 \
    --ulimit core=-1
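A quick way to check whether the locked-memory limit is the problem is to compare ulimit -l inside a container with and without the flag (a sketch; plain docker is shown, and the stock ubuntu:20.04 image stands in for the training image):

```shell
# Without the flag, containers inherit a small default memlock limit
# (reported in KiB), which is too low for EFA's pinned RDMA buffers
docker run --rm ubuntu:20.04 bash -c 'ulimit -l'

# With the flag, the limit is lifted entirely
docker run --rm --ulimit memlock=-1 ubuntu:20.04 bash -c 'ulimit -l'
# unlimited
```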
yukunlin commented

@nzmsv thanks for the suggestion! Adding --ulimit memlock=-1 to the docker command fixes this.