aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider while running NCCL applications.

Couldn't allocate endpoint. RC: -22, ERROR: Invalid argument

flyingdown opened this issue · comments

I get this error and would like to know how to solve it.

$ mpirun -n 2 --host node13,node14 ./nccl_message_transfer
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1033: NET/OFI Selected Provider is psm2
INFO: Function: main Line: 69: NET/OFI Process rank 0 started. NCCLNet device used on node13 is AWS Libfabric.
INFO: Function: main Line: 73: NET/OFI Received 1 network devices
INFO: Function: ofi_pciPath Line: 1094: NET/OFI No NIC info for dev 0
INFO: Function: ofi_getProperties Line: 1194: NET/OFI No NIC info for dev 0. Supplying default values for NIC properties.
TRACE: Function: print_dev_props Line: 78: NET/OFI ****************** Device 0 Properties ******************
TRACE: Function: print_dev_props Line: 79: NET/OFI hfi1_0;hfi1_1: PCIe Path: (null)
TRACE: Function: print_dev_props Line: 80: NET/OFI hfi1_0;hfi1_1: Plugin Support: 1
TRACE: Function: print_dev_props Line: 81: NET/OFI hfi1_0;hfi1_1: Device GUID: 0
TRACE: Function: print_dev_props Line: 82: NET/OFI hfi1_0;hfi1_1: Device Speed: 0
TRACE: Function: print_dev_props Line: 83: NET/OFI hfi1_0;hfi1_1: Device Port: 1
TRACE: Function: print_dev_props Line: 84: NET/OFI hfi1_0;hfi1_1: Device Maximum Communicators: 65535
TRACE: Function: main Line: 104: NET/OFI Rank 0 uses 0 device for communication
INFO: Function: main Line: 114: NET/OFI Server: Listening on dev 0
WARN: Function: create_nccl_ofi_component Line: 708: NET/OFI Couldn't allocate endpoint. RC: -22, ERROR: Invalid argument
WARN: Function: main Line: 115: NET/OFI OFI NCCL failure: 2

$ fi_info -p psm2
provider: psm2
fabric: psm2
domain: psm2
version: 1.6
type: FI_EP_RDM
protocol: FI_PROTO_PSMX2
provider: psm2
fabric: psm2
domain: psm2
version: 1.6
type: FI_EP_RDM
protocol: FI_PROTO_PSMX2
provider: psm2
fabric: psm2
domain: psm2
version: 1.6
type: FI_EP_RDM
protocol: FI_PROTO_PSMX2

The following error message

NET/OFI Couldn't allocate endpoint. RC: -22, ERROR: Invalid argument

indicates that the plugin wasn't able to create an endpoint with the psm2 provider. Unfortunately, I do not have access to a system with PSM devices to debug the issue myself. Could you enable libfabric logging and see whether the PSM2 provider gives more details on "Invalid argument"? I think you can use FI_LOG_LEVEL=debug FI_LOG_PROV=psm2
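
For reference, here is a rough sketch of how those two variables could be applied to the run from the original report. It assumes the launcher is Open MPI's mpirun, where -x forwards an environment variable to the remote ranks; other MPI launchers use a different option for this. The log file name psm2_debug.log is only a placeholder.

```shell
# Ask libfabric for debug-level logs, limited to the psm2 provider.
export FI_LOG_LEVEL=debug
export FI_LOG_PROV=psm2

# Assumption: Open MPI's mpirun, where -x NAME forwards NAME to every rank.
# libfabric writes its log to stderr, so merge it into stdout and keep a copy
# in psm2_debug.log (placeholder name).
mpirun -n 2 --host node13,node14 \
    -x FI_LOG_LEVEL -x FI_LOG_PROV \
    ./nccl_message_transfer 2>&1 | tee psm2_debug.log
```

The psm2 lines printed just before the "Couldn't allocate endpoint" warning are the ones most likely to say which argument the provider rejected.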

No follow-up