aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider when running NCCL applications.


DataLoader crash when using FI_EFA_USE_DEVICE_RDMA=1

tohaowu opened this issue · comments

Our AWS p4d.24xlarge job passed on 08/24, and the throughput was 3511 samples/second.
We used two p4d.24xlarge instances with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1 set.
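As a reference sketch (not our exact launch command), these variables have to be present in every rank's environment before NCCL initializes libfabric; they are typically exported by the launcher (e.g. mpirun -x FI_PROVIDER -x FI_EFA_USE_DEVICE_RDMA), but setting them at the very top of the training script should also work as long as nothing has initialized NCCL yet:

    import os

    # Hypothetical in-script alternative to exporting via the launcher:
    # libfabric reads these when NCCL brings up the aws-ofi-nccl plugin,
    # so they just need to be set before torch.distributed initializes NCCL.
    os.environ["FI_PROVIDER"] = "efa"
    os.environ["FI_EFA_USE_DEVICE_RDMA"] = "1"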

This test failed recently. The error message is as follows:

File "/workspace/bert/run_pretraining.py", line 1592, in
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1344, in main
for step, batch in enumerate(train_dataloader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in iter
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in init
self._reset(loader, first_iter=True)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
self._try_put_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
index = self._next_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in iter
for idx in self.sampler:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in iter
yield from torch.randperm(n, generator=generator).tolist()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.

When we test without FI_EFA_USE_DEVICE_RDMA=1, the test passes, but the throughput is only 1673 samples/sec.

This is the Dockerfile we used:
https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base

I have seen these errors in the past when my training data wasn't behind a fast enough link (I had to switch from NFS-based storage to local storage). Do you still experience this error?

No follow-up

I've replicated this on a recent set of p4d.24xlarges

Some info here: I've found that FI_EFA_USE_DEVICE_RDMA is quite unstable (causing segfaults) pretty much whenever there's a forked subprocess, so DataLoaders with num_workers>0 have issues. The mitigation is either to disable RDMA or to avoid forks (sketched below).

I would hope this would be addressed with RDMAV_FORK_SAFE=1 but that doesn't seem to help. If there is guidance, I would greatly appreciate it.
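For what it's worth, here is a minimal sketch of the fork-avoidance side of that mitigation, using the standard torch.utils.data API (TensorDataset is just a stand-in for the real dataset):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def make_loaders():
        dataset = TensorDataset(torch.arange(64).float())  # placeholder dataset

        # Option 1: no worker processes at all, so nothing is forked.
        no_fork_loader = DataLoader(dataset, batch_size=8, num_workers=0)

        # Option 2: keep background workers but spawn them instead of forking,
        # so the children do not inherit the parent's RDMA-registered mappings.
        spawn_loader = DataLoader(dataset, batch_size=8, num_workers=4,
                                  multiprocessing_context="spawn")
        return no_fork_loader, spawn_loader

    if __name__ == "__main__":  # guard required when using the spawn context
        for (batch,) in make_loaders()[1]:
            pass  # training step would go here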

This should be addressed by #77

Ah I tried but that doesn't seem to help. Not sure.

Does this mean you tried the patch in #77 and it didn't help? In that case I am very interested!

Correct. I checked out #77 and recompiled aws-ofi-nccl and found it didn't mitigate my segfaults.

I've gotten the workers to dump their cores and done some inspecting, and they're all crashing in pretty innocuous/random spots. Something is messing with their memory space underneath their feet.

Thinking about this some more, the problem could be caused by registering any buffer that is not page aligned. There is definitely some work that needs to be done in this area.
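As a rough illustration of the alignment property in question (this only inspects addresses from the Python side; it is not how the plugin actually registers memory):

    import mmap
    import torch

    PAGE_SIZE = mmap.PAGESIZE  # typically 4096 bytes on x86_64

    def is_page_aligned(t: torch.Tensor) -> bool:
        # data_ptr() is the raw address of the tensor's first element; a buffer
        # whose start (and end) does not land on a page boundary is the kind of
        # registration being discussed here.
        return t.data_ptr() % PAGE_SIZE == 0

    x = torch.empty(1000)  # whether this is aligned depends on the allocator
    print(is_page_aligned(x))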

What does your application use fork() for?

What does your application use fork() for?

Pytorch DataLoaders use fork under the hood. The main purpose is to load/preprocess data in a background process so that the GPUs don't have to be blocked waiting on data I/O.
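A tiny sketch of that behaviour, assuming Linux where the default multiprocessing start method is fork; it just shows that batches are produced in forked child processes:

    import os
    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader, Dataset

    print("start method:", mp.get_start_method())  # "fork" by default on Linux

    class PidDataset(Dataset):
        def __len__(self):
            return 8
        def __getitem__(self, i):
            # Returning the PID shows the item was built in a worker process,
            # i.e. a fork of the training process (and of any memory it has
            # registered with libfabric/EFA).
            return i, os.getpid()

    if __name__ == "__main__":
        for idx, pid in DataLoader(PidDataset(), num_workers=2):
            print(int(idx), int(pid), "parent:", os.getpid())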

The only remaining potential path I haven't explored is recompiling pytorch. The latest UltraCluster AMIs bundle cuda 11.4, but pytorch has only been compiled with 11.3 (or lower in my case; I'm on pytorch 1.9.1 for unrelated reasons).


@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs and try this variable if it is not set?
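For a PyTorch launch, a sketch of one way to make sure the variable is in place before NCCL initializes (assuming the usual torchrun/MPI rank environment is already set up; exporting it from the launcher with -x NCCL_PROTO=simple is equivalent):

    import os

    # NCCL reads NCCL_PROTO when the communicator is created, so it only has
    # to be in the environment before torch.distributed brings up the NCCL
    # backend.
    os.environ.setdefault("NCCL_PROTO", "simple")

    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # needs MASTER_ADDR/RANK etc. set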

I have this issue happening intermittently as well. It only happens while the dataloaders are being created; once they are created successfully, the job will run to completion. There's about a 30%-50% chance that a given job will fail, and I just have to keep restarting (on the same nodes) until it works. I also notice that the larger the model size, the more likely we are to hit this issue.

I was not setting NCCL_PROTO at all. I tried it set to simple and it did not help.

@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs and try this variable if it is not set?

Explicitly setting NCCL_PROTO did not resolve the issue.

My models are larger (3-13B parameters) and roughly comparable to the megatron-zero3 setups, so there's a lot of complexity and memory thrashing within my models. I also find things are less predictable: sometimes I crash on the first SGD step, sometimes after a few hundred steps, sometimes in the middle of validation.

That said, @alexeib witnesses it on smaller models (600M) that are closer to standard vision transformers, and in a reasonably distinct codebase.


I'm indeed using the newest AMI version: https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-cuda-11-4-ubuntu-18-04/. Since my instances are a managed resource, our best options are to find mitigations, or to release new AMIs that don't have the issue (if indeed this is an AMI issue). Please contact me directly at roller@fb.com if you would prefer to coordinate with existing official support channels.

I understand there are internal issues with that image being tracked (case 9248945321), but I'm unsure what other issues are being tracked right now (e.g. segfaults).

Tracking in T108814700 on the Meta side.

I'll come up with a patch that warns when unaligned memory regions are registered to see if we get any hints.

Also, the compatibility between fork and RDMA has been significantly improved in newer Linux kernels (though the plug-in will need some work to take advantage of these features). Which kernel version are you using?

These segfaults could be related to non-aligned buffer registrations done in Pytorch's DataLoader library, but we would need to confirm whether any such registrations are happening (more likely via libfabric than the plugin). I understand from your conversation that you are using Pytorch 1.9.1 with CUDA 11.4.1 (but the pytorch is compiled with 11.3.1). Is that right?

If yes, we will try to reproduce with these versions and debug it.

Also, if misaligned buffers are actually causing this issue, then it should be independent of toggling FI_EFA_USE_DEVICE_RDMA=1 (though it might be more likely to be triggered when using CUDA buffers?).

If it helps, I had this (or a similar) issue with cuda 11.1 and pytorch 1.9.1 as well as pytorch 1.10.0 (both compiled to target cuda 11.1 and nccl 2.8.4). The error message was different then: we got "RuntimeError: CUDA error: unspecified launch failure" with some cryptic cuda stacktrace and Xid nvidia driver errors in the logs.

We then upgraded to the new AMI with cuda 11.4 and compiled pytorch 1.10.1 to target cuda 11.4, and now we get the same problem, but this time the segfault is in the dataloader instead. Same symptoms and same resolution (keep restarting jobs until it works).

Another training task used to work on the cuda 11.1 / pytorch 1.9.1 setup but stopped working on the new AMI with the same issue, which only appeared after training for one or a couple of epochs. The fix was to enable "persistent_workers" in the dataloader so workers don't get re-created every epoch (sketched below); that seemed to make things a bit more stable.

All of these issues go away if RDMA is disabled.
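A minimal sketch of that persistent_workers configuration (TensorDataset is just a stand-in for the real dataset; the point is that the workers survive across epochs instead of being torn down and re-forked each epoch):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.arange(256).float())  # placeholder dataset

    # persistent_workers=True keeps the forked workers alive between epochs,
    # avoiding the per-epoch teardown/re-fork at which the crashes appeared.
    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        persistent_workers=True)

    for epoch in range(3):
        for (batch,) in loader:
            pass  # training step would go here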

It's probably an issue with cuda 11.4 + efa.

Environments that fail:

  • pytorch 1.9.1 (compiled for 11.1) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)
  • pytorch 1.9.1 (compiled for 11.1) + self compiled cuda 11.4, nccl and aws-ofi-nccl
  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)

Did not try:

  • pytorch 1.10.1 (self compiled for 11.4) + cuda 11.4, etc. with drivers 470 (Alexei did though, also saw issues)

Environments that succeed (470 drivers):

  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.3 + nccl 2.11.4+cuda11.4 (manually downloaded cuda 11.3, set it)

To try on Monday:

  • new AMI with cuda 11.3, 460 drivers, and pytorch 1.10.1 compiled for 11.3

I think my question got buried, re-raising. What is the Linux kernel version?

$ uname -a
Linux ip-[redacted] 5.4.0-1060-aws #63~18.04.1-Ubuntu SMP Mon Nov 15 14:31:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@stephenroller

Can you confirm that the segfault happens in the child process?

Also, can you provide the backtrace from the core dump?

Yes, the child process segfaults.

I have lost the core dumps :( We wiped the cluster today to change to the new AMI with CUDA 11.3. I'm verifying it works now.

When I analyzed the core dumps, they were generally in innocuous places inside standard libraries (once in vec::fold inside a rust library; once deep inside the Python interpreter, etc.). They were inconsistent and happening in well-tested places, so I was led to believe something was touching memory underneath me.

Is upgrading the kernel a possibility or should we be looking at alternative solutions? Kernel 5.15 (LTS) includes all the changes to make fork work well with RDMA.

@nzmsv Kernel upgrade would only help if this indeed is a page-aligned memory registration issue. I think we should try to reproduce it at our end and verify if that's the case. We can propose solutions based on our findings. Please keep in mind that customers ingest new AMIs to do kernel upgrades.

Okay I'm on the brand new AMIs and replicated the issue again.

With pytorch 1.10.1, launching with the variables set from /etc/profile.d/dlami.sh, I observe that EFA/RDMA is not enabled. I manually added
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_HOME/efa/lib"
to my environment and relaunched, and confirmed EFA/RDMA is being used correctly (this should really be done in dlami.sh for me).
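As a quick sanity check that the EFA libfabric provider is visible after adjusting LD_LIBRARY_PATH, something like the following sketch can be used (fi_info ships with libfabric; NCCL_DEBUG=INFO in the job logs is the authoritative confirmation that the aws-ofi-nccl plugin loaded):

    import subprocess

    # List only the "efa" provider; a non-empty listing means libfabric can
    # see the EFA devices from this environment.
    result = subprocess.run(["fi_info", "-p", "efa"],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)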

I observe, however, that it's currently using the NCCL version bundled with pytorch (2.10.3). When I manually set
export LD_PRELOAD=$CUDA_HOME/lib/libnccl.so.2.11.4
to absolutely force the NCCL version, then I get the segfaults (and logs report that I'm using 2.11.4+cuda11.5, so I guess that's what's bundled in the newest AMI).

So the issues seem to be tied to newer versions of CUDA (11.4 or 11.5), or to NCCL compiled with newer versions of CUDA. It could be an environment mismatch, but my previous attempts to compile EVERYTHING myself to force alignment didn't really seem to help.

EDIT:
Okay, I observe the crash with the NCCL bundled with pytorch as well.
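A quick sketch for checking which NCCL build PyTorch reports; the version actually loaded at runtime (for example one forced in via LD_PRELOAD) shows up in the NCCL_DEBUG=INFO banner at communicator creation:

    import torch

    print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
    print("NCCL version PyTorch links against:", torch.cuda.nccl.version())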

We'll work on an internal reproduction. Thank you!

So far what has worked for me is downloading NCCL 2.11.4+cuda11.4 directly from NVIDIA and forcing that to be used at runtime via LD_PRELOAD:

export LD_PRELOAD=/data/home/roller/lib/nccl_2.11.4-1+cuda11.4_x86_64/lib/libnccl.so.2.11.4

This unblocks my research for now.

Does this combination stop segfaults in the child process?

Okay the latest AMI from yesterday doesn't seem to have issues.

I was asked to put together a proxy workload that triggers these behaviors.

Here's a rough proxy of our workload. If you can't replicate with this public benchmark, we can start bringing in some of our more complex behavior we have implemented in our private repo.

Replicate the 13B model from here. Increase the number of GPUs to increase the probability of the issue:
https://github.com/pytorch/fairseq/tree/main/examples/fully_sharded_data_parallel

But you may need to add something like --num-workers 8 to explicitly turn on background workers.

You can download a public dataset compatible with fairseq with the instructions here:
https://github.com/pytorch/fairseq/tree/main/examples/language_model#1-preprocess-the-data

Thank you! I am still trying to reproduce the crash. Does the crash reproduce with this proxy workload in your environment?

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?

Unfortunately, in order to unblock the research, I presently only have access to a cluster with the latest 11.3 image (which doesn't crash). We'll need to work with Six Nines to stand up a cluster with 11.4 in order to test this proxy workload on my end.

I want to keep this discussion mostly on official support channels (cc @AWSNB: emailed you; could you add @nzmsv and the other persons we met this week?), but wanted to leave some info here.

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?

"bare" metal

Thanks Stephen. I will open another support channel with you.

This issue was noticed by a separate AWS team in April 2021. It is still a problem, unfortunately.

https://github.com/aws/sagemaker-training-toolkit/releases/tag/v3.9.2

This was found to be caused by an issue in Libfabric. Resolved by ofiwg/libfabric#7431