Using a high worker count with the ImageNet script results in segmentation faults
Coderx7 opened this issue
Context
I was trying to train on ImageNet using the provided script when I noticed that using worker counts (worker threads, -j) greater than 4 results in segmentation faults and/or other strange errors, such as not finding the images on disk (they are there; the script seems to fail to find them mid-training).
Your Environment
- PyTorch version: 1.11.0+cu113
- Operating System and version: Ubuntu 20.04.4
- Installed using source? [yes/no]: no
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: GPU
- Which example are you using: imagenet resnet18
- Link to code or data to repro [if any]: this repository (everything is default)
Expected Behavior
When using a high worker count, there shouldn't be any issues.
Current Behavior
I have seen two outcomes so far:
1. Mid-training (usually starting at the second epoch, after a few iterations), a segmentation fault occurs.
2. Mid-training (again usually starting at the second epoch, after a few iterations), it suddenly reports that the file paths to the images are invalid; several of these errors show up and then a crash happens (look at the error logs below, it's evident there).
Possible Solution
I have no idea what the underlying cause is. I have 32 GB of system RAM and a 32 GB swap file on my NVMe drive.
- I am using the latest NVIDIA driver as of now, 515.48.07 (with CUDA 11.7 installed), on my RTX 3080.
- A segmentation fault due to lack of memory should be out of the question: while this happens, I can still see a lot of unused RAM and swap (only about 23/32 GB of RAM and 4/32 GB of swap are in use). Swappiness is set to 10 as well.
- I also ran several tests to ensure the RAM, NVMe, and CPU are OK:
  - I ran the AIDA64 cache and memory stress test for 2 hours successfully, followed by a full MemTest86 run (13 test cases with 4/4 passes); all the default tests passed.
  - My i7-12700K was also tested with AIDA64 (CPU/FPU/cache) for more than 4 hours without any issues, followed by Prime95 v30.1 for 22 minutes.
- My motherboard's (Asus ROG Strix Z690-A Gaming WiFi D4) BIOS is updated to the latest version (1504).
- The NVMe drive is a Samsung 980 1TB, which Samsung Magician, Hard Disk Sentinel, and CrystalDisk all report as 100% healthy and without any issues whatsoever. I also used another SSD (EVO 860 500GB) and saw the very same result.
So I'm at a loss myself as to why this is happening.
Steps to Reproduce
1. Clone this repo.
2. Install Anaconda3 (mine is Anaconda3-2021.11-Linux-x86_64.sh).
3. cd into the imagenet subdirectory and run this command:
python main.py /media/hossein/SSD1/ImageNet_DataSet/ -a resnet18 -p 200 -j 20
...
Failure Logs [if any]
With 20 worker threads:
(base) hossein@hossein-pc:~/examples/imagenet$ python main.py /media/hossein/SSD1/ImageNet_DataSet/ -a resnet18 -p 200 -j 20
=> creating model 'resnet18'
Epoch: [0][ 1/5005] Time 7.392 ( 7.392) Data 5.281 ( 5.281) Loss 6.9868e+00 (6.9868e+00) Acc@1 0.00 ( 0.00) Acc@5 0.39 ( 0.39)
Epoch: [0][ 201/5005] Time 0.187 ( 0.223) Data 0.000 ( 0.026) Loss 6.5352e+00 (6.7885e+00) Acc@1 1.17 ( 0.44) Acc@5 4.30 ( 1.96)
Epoch: [0][ 401/5005] Time 0.186 ( 0.206) Data 0.000 ( 0.014) Loss 6.1321e+00 (6.5707e+00) Acc@1 2.34 ( 0.82) Acc@5 8.59 ( 3.32)
Epoch: [0][ 601/5005] Time 0.186 ( 0.200) Data 0.000 ( 0.009) Loss 5.9468e+00 (6.3950e+00) Acc@1 1.95 ( 1.24) Acc@5 8.98 ( 4.75)
Epoch: [0][ 801/5005] Time 0.187 ( 0.197) Data 0.000 ( 0.007) Loss 5.6806e+00 (6.2534e+00) Acc@1 5.08 ( 1.72) Acc@5 16.02 ( 6.18)
Epoch: [0][1001/5005] Time 0.188 ( 0.195) Data 0.000 ( 0.006) Loss 5.5401e+00 (6.1349e+00) Acc@1 5.86 ( 2.17) Acc@5 16.02 ( 7.46)
Epoch: [0][1201/5005] Time 0.201 ( 0.194) Data 0.000 ( 0.005) Loss 5.3751e+00 (6.0282e+00) Acc@1 6.25 ( 2.63) Acc@5 19.14 ( 8.74)
Epoch: [0][1401/5005] Time 0.205 ( 0.193) Data 0.000 ( 0.004) Loss 5.2670e+00 (5.9327e+00) Acc@1 4.69 ( 3.11) Acc@5 16.41 ( 9.98)
Epoch: [0][1601/5005] Time 0.199 ( 0.193) Data 0.000 ( 0.004) Loss 5.1215e+00 (5.8440e+00) Acc@1 8.59 ( 3.61) Acc@5 24.61 ( 11.19)
Epoch: [0][1801/5005] Time 0.203 ( 0.192) Data 0.000 ( 0.004) Loss 5.1571e+00 (5.7612e+00) Acc@1 7.03 ( 4.13) Acc@5 21.48 ( 12.41)
Epoch: [0][2001/5005] Time 0.187 ( 0.192) Data 0.000 ( 0.003) Loss 4.9309e+00 (5.6844e+00) Acc@1 11.33 ( 4.63) Acc@5 26.56 ( 13.57)
Epoch: [0][2201/5005] Time 0.184 ( 0.191) Data 0.000 ( 0.003) Loss 5.0364e+00 (5.6117e+00) Acc@1 9.38 ( 5.14) Acc@5 25.00 ( 14.71)
Epoch: [0][2401/5005] Time 0.185 ( 0.191) Data 0.000 ( 0.003) Loss 4.7977e+00 (5.5432e+00) Acc@1 11.72 ( 5.65) Acc@5 28.91 ( 15.80)
Epoch: [0][2601/5005] Time 0.183 ( 0.191) Data 0.000 ( 0.003) Loss 4.7133e+00 (5.4774e+00) Acc@1 13.67 ( 6.16) Acc@5 32.42 ( 16.87)
Epoch: [0][2801/5005] Time 0.198 ( 0.191) Data 0.000 ( 0.003) Loss 4.4756e+00 (5.4146e+00) Acc@1 16.02 ( 6.66) Acc@5 33.59 ( 17.89)
Epoch: [0][3001/5005] Time 0.185 ( 0.190) Data 0.000 ( 0.002) Loss 4.3845e+00 (5.3551e+00) Acc@1 16.41 ( 7.16) Acc@5 37.50 ( 18.89)
Epoch: [0][3201/5005] Time 0.185 ( 0.190) Data 0.000 ( 0.002) Loss 4.3374e+00 (5.2978e+00) Acc@1 14.06 ( 7.66) Acc@5 36.33 ( 19.86)
Epoch: [0][3401/5005] Time 0.195 ( 0.190) Data 0.000 ( 0.002) Loss 4.4920e+00 (5.2429e+00) Acc@1 17.58 ( 8.15) Acc@5 33.20 ( 20.79)
Epoch: [0][3601/5005] Time 0.185 ( 0.190) Data 0.000 ( 0.002) Loss 4.2876e+00 (5.1908e+00) Acc@1 19.53 ( 8.63) Acc@5 36.72 ( 21.70)
Epoch: [0][3801/5005] Time 0.186 ( 0.190) Data 0.000 ( 0.002) Loss 3.9208e+00 (5.1397e+00) Acc@1 23.83 ( 9.12) Acc@5 46.09 ( 22.58)
Epoch: [0][4001/5005] Time 0.186 ( 0.190) Data 0.000 ( 0.002) Loss 4.1425e+00 (5.0907e+00) Acc@1 16.41 ( 9.59) Acc@5 40.62 ( 23.44)
Epoch: [0][4201/5005] Time 0.193 ( 0.190) Data 0.000 ( 0.002) Loss 4.3411e+00 (5.0447e+00) Acc@1 15.23 ( 10.06) Acc@5 35.55 ( 24.26)
Epoch: [0][4401/5005] Time 0.185 ( 0.190) Data 0.000 ( 0.002) Loss 3.9242e+00 (4.9993e+00) Acc@1 24.22 ( 10.52) Acc@5 41.41 ( 25.07)
Epoch: [0][4601/5005] Time 0.191 ( 0.190) Data 0.000 ( 0.002) Loss 3.9165e+00 (4.9562e+00) Acc@1 22.27 ( 10.95) Acc@5 44.53 ( 25.84)
Epoch: [0][4801/5005] Time 0.187 ( 0.190) Data 0.000 ( 0.002) Loss 4.0109e+00 (4.9140e+00) Acc@1 21.48 ( 11.40) Acc@5 39.45 ( 26.59)
Epoch: [0][5001/5005] Time 0.188 ( 0.190) Data 0.000 ( 0.002) Loss 3.6265e+00 (4.8737e+00) Acc@1 27.73 ( 11.82) Acc@5 50.00 ( 27.32)
Test: [ 1/196] Time 4.802 ( 4.802) Loss 2.1619e+00 (2.1619e+00) Acc@1 50.00 ( 50.00) Acc@5 79.69 ( 79.69)
* Acc@1 23.730 Acc@5 47.718
Epoch: [1][ 1/5005] Time 5.212 ( 5.212) Data 4.916 ( 4.916) Loss 3.8498e+00 (3.8498e+00) Acc@1 22.27 ( 22.27) Acc@5 44.53 ( 44.53)
Epoch: [1][ 201/5005] Time 0.185 ( 0.215) Data 0.000 ( 0.025) Loss 4.1774e+00 (3.8481e+00) Acc@1 17.19 ( 22.85) Acc@5 41.02 ( 45.71)
Epoch: [1][ 401/5005] Time 0.186 ( 0.203) Data 0.000 ( 0.013) Loss 3.6431e+00 (3.8265e+00) Acc@1 26.56 ( 23.16) Acc@5 50.78 ( 46.17)
Epoch: [1][ 601/5005] Time 0.174 ( 0.198) Data 0.000 ( 0.009) Loss 3.7591e+00 (3.8062e+00) Acc@1 23.05 ( 23.48) Acc@5 45.70 ( 46.58)
Epoch: [1][ 801/5005] Time 0.190 ( 0.196) Data 0.000 ( 0.007) Loss 3.7421e+00 (3.7850e+00) Acc@1 21.88 ( 23.77) Acc@5 50.39 ( 46.95)
Epoch: [1][1001/5005] Time 0.190 ( 0.195) Data 0.000 ( 0.006) Loss 3.7866e+00 (3.7657e+00) Acc@1 25.39 ( 24.11) Acc@5 48.05 ( 47.32)
Epoch: [1][1201/5005] Time 0.186 ( 0.194) Data 0.000 ( 0.005) Loss 3.3603e+00 (3.7503e+00) Acc@1 32.42 ( 24.32) Acc@5 56.25 ( 47.59)
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/hossein/examples/imagenet/main.py", line 482, in <module>
File "/home/hossein/examples/imagenet/main.py", line 115, in main
def main_worker(gpu, ngpus_per_node, args):
File "/home/hossein/examples/imagenet/main.py", line 256, in main_worker
acc1 = validate(val_loader, model, criterion, args)
File "/home/hossein/examples/imagenet/main.py", line 310, in train
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 10250) is killed by signal: Segmentation fault.
With 8 worker threads:
(base) hossein@hossein-pc:~/examples/imagenet$ python main.py /media/hossein/SSD1/ImageNet_DataSet/ -a resnet18 -p 200 -j 8
=> creating model 'resnet18'
Epoch: [0][ 1/5005] Time 3.748 ( 3.748) Data 2.065 ( 2.065) Loss 7.0358e+00 (7.0358e+00) Acc@1 0.00 ( 0.00) Acc@5 1.17 ( 1.17)
Epoch: [0][ 201/5005] Time 0.187 ( 0.210) Data 0.000 ( 0.010) Loss 6.6022e+00 (6.7987e+00) Acc@1 1.56 ( 0.46) Acc@5 2.34 ( 1.76)
Epoch: [0][ 401/5005] Time 0.188 ( 0.202) Data 0.000 ( 0.006) Loss 6.2603e+00 (6.5814e+00) Acc@1 0.39 ( 0.82) Acc@5 5.47 ( 3.10)
Epoch: [0][ 601/5005] Time 0.203 ( 0.199) Data 0.000 ( 0.004) Loss 5.9890e+00 (6.4140e+00) Acc@1 1.95 ( 1.24) Acc@5 7.81 ( 4.54)
Epoch: [0][ 801/5005] Time 0.208 ( 0.200) Data 0.000 ( 0.004) Loss 5.7437e+00 (6.2739e+00) Acc@1 4.30 ( 1.64) Acc@5 9.77 ( 5.86)
Epoch: [0][1001/5005] Time 0.201 ( 0.201) Data 0.000 ( 0.003) Loss 5.4019e+00 (6.1487e+00) Acc@1 4.69 ( 2.11) Acc@5 16.41 ( 7.23)
Epoch: [0][1201/5005] Time 0.206 ( 0.201) Data 0.000 ( 0.003) Loss 5.4279e+00 (6.0416e+00) Acc@1 5.08 ( 2.57) Acc@5 14.06 ( 8.54)
Epoch: [0][1401/5005] Time 0.194 ( 0.201) Data 0.000 ( 0.002) Loss 5.2668e+00 (5.9460e+00) Acc@1 8.20 ( 3.05) Acc@5 19.53 ( 9.79)
Epoch: [0][1601/5005] Time 0.195 ( 0.200) Data 0.000 ( 0.002) Loss 5.3094e+00 (5.8585e+00) Acc@1 5.08 ( 3.52) Acc@5 15.23 ( 10.99)
Epoch: [0][1801/5005] Time 0.200 ( 0.200) Data 0.000 ( 0.002) Loss 4.8217e+00 (5.7789e+00) Acc@1 12.89 ( 4.00) Acc@5 28.52 ( 12.14)
Epoch: [0][2001/5005] Time 0.193 ( 0.200) Data 0.000 ( 0.002) Loss 4.8928e+00 (5.7039e+00) Acc@1 9.77 ( 4.48) Acc@5 26.17 ( 13.27)
Epoch: [0][2201/5005] Time 0.187 ( 0.200) Data 0.000 ( 0.002) Loss 4.8277e+00 (5.6317e+00) Acc@1 13.67 ( 4.98) Acc@5 26.17 ( 14.38)
Epoch: [0][2401/5005] Time 0.207 ( 0.199) Data 0.000 ( 0.002) Loss 4.9947e+00 (5.5636e+00) Acc@1 8.59 ( 5.48) Acc@5 24.22 ( 15.46)
Epoch: [0][2601/5005] Time 0.195 ( 0.199) Data 0.000 ( 0.002) Loss 4.7661e+00 (5.4994e+00) Acc@1 12.50 ( 5.98) Acc@5 30.47 ( 16.51)
Traceback (most recent call last):
File "/home/hossein/examples/imagenet/main.py", line 479, in <module>
main()
File "/home/hossein/examples/imagenet/main.py", line 112, in main
main_worker(args.gpu, ngpus_per_node, args)
File "/home/hossein/examples/imagenet/main.py", line 253, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "/home/hossein/examples/imagenet/main.py", line 292, in train
for i, (images, target) in enumerate(train_loader):
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
return self._process_data(data)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
data.reraise()
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/_utils.py", line 457, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 1.
Original Traceback (most recent call last):
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 230, in __getitem__
sample = self.loader(path)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 269, in default_loader
return pil_loader(path)
File "/home/hossein/anaconda3/lib/python3.9/site-packages/torchvision/datasets/folder.py", line 247, in pil_loader
with open(path, "rb") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/media/hossein/SSD1/ImageNet_DataSet/train/n01930112/n01930112_6877.JPEG'
I can't use a small number of workers because, instead of 10 minutes per epoch, I have to wait around 20 minutes, and it takes ages to train and experiment. Needless to say, GPU utilization is extremely poor with a small number of workers (I get around 97% GPU utilization with 20 workers, while the GPU stays idle most of the time with 4), so a high worker count is a must for my case.
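For context on why one bad worker takes down the whole run: the -j workers are separate OS processes, and the parent can only observe that a child died from a signal. A minimal stdlib sketch of that pattern (a hypothetical stand-in using multiprocessing.Pool, not the actual torch.utils.data internals):

```python
import multiprocessing as mp

def load_sample(path):
    # Runs inside a worker process, like a DataLoader worker decoding an
    # image. If the kernel kills this process (e.g. SIGSEGV caused by
    # flaky RAM), the parent only sees a dead child, hence the generic
    # "DataLoader worker (pid ...) is killed by signal" error.
    with open(path, "rb") as f:
        return len(f.read())

if __name__ == "__main__":
    paths = [__file__] * 8               # stand-in for image file paths
    with mp.Pool(processes=4) as pool:   # roughly analogous to -j 4
        sizes = pool.map(load_sample, paths)
    print(all(s > 0 for s in sizes))
```

Ordinary Python exceptions raised inside the worker (like the FileNotFoundError above) are pickled and re-raised in the parent, which is why the second failure mode surfaces as a readable traceback while the segfault does not.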
OK, it turned out to be a RAM frequency issue. My RAM was set to 2133 MHz by default, and this seemingly resulted in crashes and a slew of other issues, especially under heavy load; it would not show itself otherwise, and no amount of stress testing would reveal the problem. Setting it to 3000/3200 MHz thankfully fixed my issues.
This might be an issue for 12th-gen CPU/motherboard combinations that use DDR4 memory, so if you have one and experience a similar issue, definitely make sure to check your RAM frequencies.