aws / aws-ofi-nccl

This is a plugin that lets EC2 developers use libfabric as a network provider when running NCCL applications.


DataLoader crash when using FI_EFA_USE_DEVICE_RDMA=1

tohaowu opened this issue · comments

Our AWS p4d.24xlarge job passed on 08/24, and the throughput was 3511 samples/second.
We used two p4d.24xlarge instances with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1 set.
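As a reference sketch (not our exact launch command), these variables have to be present in every rank's environment before NCCL initializes libfabric; they are typically exported by the launcher (e.g. mpirun -x FI_PROVIDER -x FI_EFA_USE_DEVICE_RDMA), but setting them at the very top of the training script should also work as long as nothing has initialized NCCL yet:

    import os

    # Hypothetical in-script alternative to exporting via the launcher:
    # libfabric reads these when NCCL brings up the aws-ofi-nccl plugin,
    # so they just need to be set before torch.distributed initializes NCCL.
    os.environ["FI_PROVIDER"] = "efa"
    os.environ["FI_EFA_USE_DEVICE_RDMA"] = "1"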

This test failed recently. The error message is as follows:

File "/workspace/bert/run_pretraining.py", line 1592, in
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1344, in main
for step, batch in enumerate(train_dataloader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in iter
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in init
self._reset(loader, first_iter=True)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
self._try_put_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
index = self._next_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in iter
for idx in self.sampler:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in iter
yield from torch.randperm(n, generator=generator).tolist()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.

When we test without FI_EFA_USE_DEVICE_RDMA=1, the test passes, but the throughput is only 1673 samples/sec.

This is the Dockerfile we used:
https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base

I have seen these errors in the past when my training data wasn't behind a fast enough link (I had to switch from NFS-based storage to local storage). Do you still experience this error?

No follow-up

I've replicated this on a recent set of p4d.24xlarges

Some info here: I've found that FI_EFA_USE_DEVICE_RDMA is quite unstable (causing segfaults) pretty much whenever there's a forked subprocess, so DataLoaders with num_workers>0 have issues. The mitigation is either to disable RDMA or to avoid forks (sketched below).

I would hope this would be addressed with RDMAV_FORK_SAFE=1 but that doesn't seem to help. If there is guidance, I would greatly appreciate it.
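For what it's worth, here is a minimal sketch of the fork-avoidance side of that mitigation, using the standard torch.utils.data API (TensorDataset is just a stand-in for the real dataset):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def make_loaders():
        dataset = TensorDataset(torch.arange(64).float())  # placeholder dataset

        # Option 1: no worker processes at all, so nothing is forked.
        no_fork_loader = DataLoader(dataset, batch_size=8, num_workers=0)

        # Option 2: keep background workers but spawn them instead of forking,
        # so the children do not inherit the parent's RDMA-registered mappings.
        spawn_loader = DataLoader(dataset, batch_size=8, num_workers=4,
                                  multiprocessing_context="spawn")
        return no_fork_loader, spawn_loader

    if __name__ == "__main__":  # guard required when using the spawn context
        for (batch,) in make_loaders()[1]:
            pass  # training step would go here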

This should be addressed by #77

Ah I tried but that doesn't seem to help. Not sure.

Does this mean you tried the patch in #77 and it didn't help? In that case I am very interested!

Correct. I checked out #77 and recompiled aws-ofi-nccl and found it didn't mitigate my segfaults.

I've gotten the workers to dump their cores and done some inspecting, and they're all crashing in pretty innocuous/random spots. Something is messing with their memory space underneath their feet.

Thinking about this some more, the problem could be caused by registering any buffer that is not page aligned. There is definitely some work that needs to be done in this area.
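As a rough illustration of the alignment property in question (this only inspects addresses from the Python side; it is not how the plugin actually registers memory):

    import mmap
    import torch

    PAGE_SIZE = mmap.PAGESIZE  # typically 4096 bytes on x86_64

    def is_page_aligned(t: torch.Tensor) -> bool:
        # data_ptr() is the raw address of the tensor's first element; a buffer
        # whose start (and end) does not land on a page boundary is the kind of
        # registration being discussed here.
        return t.data_ptr() % PAGE_SIZE == 0

    x = torch.empty(1000)  # whether this is aligned depends on the allocator
    print(is_page_aligned(x))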

What does your application use fork() for?

What does your application use fork() for?

Pytorch DataLoaders use fork under the hood. The main purpose is to load/preprocess data in a background process so that the GPUs don't have to be blocked waiting on data I/O.
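A tiny sketch of that behaviour, assuming Linux where the default multiprocessing start method is fork; it just shows that batches are produced in forked child processes:

    import os
    import torch.multiprocessing as mp
    from torch.utils.data import DataLoader, Dataset

    print("start method:", mp.get_start_method())  # "fork" by default on Linux

    class PidDataset(Dataset):
        def __len__(self):
            return 8
        def __getitem__(self, i):
            # Returning the PID shows the item was built in a worker process,
            # i.e. a fork of the training process (and of any memory it has
            # registered with libfabric/EFA).
            return i, os.getpid()

    if __name__ == "__main__":
        for idx, pid in DataLoader(PidDataset(), num_workers=2):
            print(int(idx), int(pid), "parent:", os.getpid())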

The only remaining potential path I haven't explored is recompiling pytorch. The latest UltraCluster AMIs bundle cuda 11.4, but pytorch has only been compiled with 11.3 (or lower in my case; I'm on pytorch 1.9.1 for unrelated reasons).


@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs and try this variable if it is not set?
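For a PyTorch launch, a sketch of one way to make sure the variable is in place before NCCL initializes (assuming the usual torchrun/MPI rank environment is already set up; exporting it from the launcher with -x NCCL_PROTO=simple is equivalent):

    import os

    # NCCL reads NCCL_PROTO when the communicator is created, so it only has
    # to be in the environment before torch.distributed brings up the NCCL
    # backend.
    os.environ.setdefault("NCCL_PROTO", "simple")

    import torch.distributed as dist

    dist.init_process_group(backend="nccl")  # needs MASTER_ADDR/RANK etc. set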

I have this issue happening intermittently as well. It only happens while the dataloaders are being created; once they are created successfully, the job will run to completion. There's about a 30%-50% chance that a given job will fail, and I just have to keep restarting (on the same nodes) until it works. I also notice that the larger the model size, the more likely we are to hit this issue.

I was not setting NCCL_PROTO at all. I tried it set to simple and it did not help.

@stephenroller we recently discovered that NCCL may enable some modes that EFA doesn't support, which could lead to inconsistency and potentially corruption. We updated the documentation to require setting the env variable: -x NCCL_PROTO=simple

Could you check what the current setting is in your runs and try this variable if it is not set?

Explicitly setting NCCL_PROTO did not resolve the issue.

My models are larger (3-13B parameters) and roughly comparable to the megatron-zero3 setups, so there's a lot of complexity and memory thrashing within my models. I also find things are less predictable: sometimes I crash on the first SGD step, sometimes after a few hundred steps, sometimes in the middle of validation.

That said, @alexeib witnesses it on smaller models (600M) that are closer to standard vision transformers, and in a reasonably distinct codebase.


I'm indeed using the newest AMI version: https://aws.amazon.com/releasenotes/aws-deep-learning-ami-gpu-cuda-11-4-ubuntu-18-04/. Since my instances are a managed resource, our best options are to find mitigations, or to release new AMIs that don't have the issue (if indeed this is an AMI issue). Please contact me directly at roller@fb.com if you would prefer to coordinate with existing official support channels.

I understand there are internal issues with that image being tracked (case 9248945321), but I'm unsure what other issues are being tracked right now (e.g. segfaults).

Tracking in T108814700 on the Meta side.

I'll come up with a patch that warns when unaligned memory regions are registered to see if we get any hints.

Also, the compatibility between fork and RDMA has been significantly improved in newer Linux kernels (though the plug-in will need some work to take advantage of these features). Which kernel version are you using?

These segfaults could be related to non-aligned buffer registrations done in Pytorch's DataLoader library, but we would need to confirm whether any such registrations are happening (more likely via libfabric than the plugin). I understand from your conversation that you are using Pytorch 1.9.1 with CUDA 11.4.1 (but the pytorch is compiled with 11.3.1). Is that right?

If yes, we will try to reproduce with these versions and debug it.

Also, if misaligned buffers are actually causing this issue, then it should be independent of toggling FI_EFA_USE_DEVICE_RDMA=1 (though it might be more likely to be triggered when using CUDA buffers?).

If it helps, I had this (or a similar) issue with cuda 11.1 and pytorch 1.9.1 as well as pytorch 1.10.0 (both compiled to target cuda 11.1 and nccl 2.8.4). The error message was different then: we got "RuntimeError: CUDA error: unspecified launch failure" with some cryptic cuda stacktrace and Xid nvidia driver errors in the logs.

We then upgraded to the new AMI with cuda 11.4 and compiled pytorch 1.10.1 to target cuda 11.4, and now we get the same problem, but this time the segfault is in the dataloader instead. Same symptoms and same resolution (keep restarting jobs until it works).

Another training task used to work on the cuda 11.1 / pytorch 1.9.1 setup but stopped working on the new AMI with the same issue, which only appeared after training for one or a couple of epochs. The fix was to enable "persistent_workers" in the dataloader so workers don't get re-created every epoch (sketched below); that seemed to make things a bit more stable.

All of these issues go away if RDMA is disabled.
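A minimal sketch of that persistent_workers configuration (TensorDataset is just a stand-in for the real dataset; the point is that the workers survive across epochs instead of being torn down and re-forked each epoch):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.arange(256).float())  # placeholder dataset

    # persistent_workers=True keeps the forked workers alive between epochs,
    # avoiding the per-epoch teardown/re-fork at which the crashes appeared.
    loader = DataLoader(dataset, batch_size=32, num_workers=4,
                        persistent_workers=True)

    for epoch in range(3):
        for (batch,) in loader:
            pass  # training step would go here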

It's probably an issue with cuda 11.4 + efa.

Environments that fail:

  • pytorch 1.9.1 (compiled for 11.1) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)
  • pytorch 1.9.1 (compiled for 11.1) + self compiled cuda 11.4, nccl and aws-ofi-nccl
  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.4 + nccl 2.11.4+cuda11.4 (as shipped in the AMI referenced)

Did not try:

  • pytorch 1.10.1 (self compiled for 11.4) + cuda 11.4, etc. with drivers 470 (Alexei did though, also saw issues)

Environments that succeed (470 drivers):

  • pytorch 1.10.1 (compiled for 11.3) + cuda 11.3 + nccl 2.11.4+cuda11.4 (manually downloaded cuda 11.3, set it)

To try on Monday:

  • new AMI with cuda 11.3, 460 drivers, and pytorch 1.10.1 compiled for 11.3

I think my question got buried, re-raising. What is the Linux kernel version?

$ uname -a
Linux ip-[redacted] 5.4.0-1060-aws #63~18.04.1-Ubuntu SMP Mon Nov 15 14:31:31 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

@stephenroller

Can you confirm that the segfault happens in the child process?

Also, can you provide the backtrace from the core dump?

Yes, the child process segfaults.

I have lost the core dumps :( We wiped the cluster today to change to the new AMI with CUDA 11.3. I'm verifying it works now.

When I analyzed the core dumps, they were generally in innocuous places inside standard libraries (once in vec::fold inside a rust library; once deep inside the Python interpreter, etc.). They were inconsistent and happening in well-tested places, so I was led to believe something was touching memory underneath me.

Is upgrading the kernel a possibility or should we be looking at alternative solutions? Kernel 5.15 (LTS) includes all the changes to make fork work well with RDMA.

@nzmsv Kernel upgrade would only help if this indeed is a page-aligned memory registration issue. I think we should try to reproduce it at our end and verify if that's the case. We can propose solutions based on our findings. Please keep in mind that customers ingest new AMIs to do kernel upgrades.

Okay I'm on the brand new AMIs and replicated the issue again.

With pytorch 1.10.1, launching with the variables set from /etc/profile.d/dlami.sh, I observe that EFA/RDMA is not enabled. I manually added
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$CUDA_HOME/efa/lib"
to my environment and relaunched, and confirmed EFA/RDMA is being used correctly (this should really be done in dlami.sh for me).
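As a quick sanity check that the EFA libfabric provider is visible after adjusting LD_LIBRARY_PATH, something like the following sketch can be used (fi_info ships with libfabric; NCCL_DEBUG=INFO in the job logs is the authoritative confirmation that the aws-ofi-nccl plugin loaded):

    import subprocess

    # List only the "efa" provider; a non-empty listing means libfabric can
    # see the EFA devices from this environment.
    result = subprocess.run(["fi_info", "-p", "efa"],
                            capture_output=True, text=True)
    print(result.stdout or result.stderr)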

I observe, however, that it's currently using the NCCL version bundled with pytorch (2.10.3). When I manually set
export LD_PRELOAD=$CUDA_HOME/lib/libnccl.so.2.11.4
to absolutely force the NCCL version, then I get the segfaults (and logs report that I'm using 2.11.4+cuda11.5, so I guess that's what's bundled in the newest AMI).

So the issues seem to be tied to newer versions of CUDA (11.4 or 11.5), or to NCCL compiled with newer versions of CUDA. It could be an environment mismatch, but my previous attempts to compile EVERYTHING myself to force alignment didn't really seem to help.

EDIT:
Okay, I observe the crash with the NCCL bundled with pytorch as well.
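A quick sketch for checking which NCCL build PyTorch reports; the version actually loaded at runtime (for example one forced in via LD_PRELOAD) shows up in the NCCL_DEBUG=INFO banner at communicator creation:

    import torch

    print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
    print("NCCL version PyTorch links against:", torch.cuda.nccl.version())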

We'll work on an internal reproduction. Thank you!

So far what has worked for me is downloading NCCL 2.11.4+cuda11.4 directly from NVIDIA and forcing that to be used at runtime via LD_PRELOAD:

export LD_PRELOAD=/data/home/roller/lib/nccl_2.11.4-1+cuda11.4_x86_64/lib/libnccl.so.2.11.4

This unblocks my research for now.

Does this combination stop segfaults in the child process?

Okay the latest AMI from yesterday doesn't seem to have issues.

I was asked to put together a proxy workload that triggers these behaviors.

Here's a rough proxy of our workload. If you can't replicate with this public benchmark, we can start bringing in some of our more complex behavior we have implemented in our private repo.

Replicate the 13B model from here. Increase the number of GPUs to increase the probability of the issue:
https://github.com/pytorch/fairseq/tree/main/examples/fully_sharded_data_parallel

But you may need to add something like --num-workers 8 to explicitly turn on background workers.

You can download a public dataset compatible with fairseq with the instructions here:
https://github.com/pytorch/fairseq/tree/main/examples/language_model#1-preprocess-the-data

Thank you! I am still trying to reproduce the crash. Does the crash reproduce with this proxy workload in your environment?

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?

Unfortunately, in order to unblock the research, I presently only have access to a cluster with the latest 11.3 image (which doesn't crash). We'll need to work with Six Nines to stand up a cluster with 11.4 in order to test this proxy workload on my end.

I want to keep this discussion mostly on official support channels (cc @AWSNB: emailed you; could you add @nzmsv and the other persons we met this week?), but wanted to leave some info here.

One more question: for all the environments described above (good and bad) were these running in Docker or directly on the hosts?

"bare" metal

Thanks Stephen. I will open another support channel with you.

This issue was noticed by a separate AWS team in April 2021. It is still a problem, unfortunately.

https://github.com/aws/sagemaker-training-toolkit/releases/tag/v3.9.2

This was found to be caused by an issue in Libfabric. Resolved by ofiwg/libfabric#7431