liznerski / fcdd

Repository for the Explainable Deep One-Class Classification paper

Can I use multi-gpu?

ODFG34 opened this issue · comments

Hello.

First of all, thank you for the paper and the code release.

I am writing because I have a question about applying it to custom data.

I have two RTX 2080 Ti GPUs.

When running the model, is it possible to use GPU parallel processing?

I am asking because I looked through the code and could not find any model parallelism.

If there is no multi-GPU code, should I add it myself?

When I run the code on my custom data, I get an error.

  1. I used other custom image data but got a CUDA memory error, so I am using sample image data for now.

  2. My custom data
    Labels: normal, Center, Donut, Edge-Loc, Edge-Ring, Loc, Near-full, Near-full, Scratch (9 labels in total)
    Total image count: 172,950

  3. My goals

    1. Nine-label multi-class classification with explainable heatmap images.
    2. I want to use multiple GPUs.
    3. I want the nominal label to be 'normal' (I think 'normal' would be number 6, e.g., --nominal-label 6), but --nominal-label only accepts [0, 1].

  4. Code error
    sudo python runners/run_custom.py --logdir /mnt/c/JSJ/FCDD_Wafer_0930_1st --datadir /mnt/c/JSJ/Wafer_denoising_data_split --objective fcdd -b 1 -e 1 --it 1 -d custom -n FCDD_CNN28 --preproc none --supervise-mode unsupervised --noise-mode cifar100 --nominal-label 0 -ovr
    normal_class : 0
    Plotting ROC for completed classes up to 0...
    Traceback (most recent call last):
    File "runners/run_custom.py", line 48, in
    runner.run()
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 203, in run
    self.run_classes(**vars(self.args))
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 221, in run_classes
    res = self.run_seeds(
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 181, in run_seeds
    res = self.run_one(
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 100, in run_one
    setup = trainer_setup(
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/training/setup.py", line 104, in trainer_setup
    ds = load_dataset(
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/init.py", line 67, in load_dataset
    dataset = ADImageFolderDataset(
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/image_folder.py", line 92, in init
    self.mean, self.std = self.extract_mean_std(self.trainpath, normal_class)
    File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/image_folder.py", line 193, in extract_mean_std
    for x, _ in loader:
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in next
    data = self._next_data()
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
    return self._process_data(data)
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
    data.reraise()
    File "/home/psm/.local/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
    raise exception
    RuntimeError: Caught RuntimeError in pin memory thread for device 0.
    Original Traceback (most recent call last):
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data, device)
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in pin_memory
    return type(data)([pin_memory(sample, device) for sample in data]) # type: ignore[call-arg]
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in
    return type(data)([pin_memory(sample, device) for sample in data]) # type: ignore[call-arg]
    File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory(device)
    RuntimeError: CUDA error: out of memory
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Exception in thread Thread-1:
Traceback (most recent call last):
Fatal Python error: could not acquire lock for <_io.BufferedWriter name=''> at interpreter shutdown, possibly due to daemon threads
Python runtime state: finalizing (tstate=0xe89e30)

Thread 0x00007f313cff9700 (most recent call first):
File "/usr/lib/python3.8/threading.py", line 1202 in invoke_excepthook
File "/usr/lib/python3.8/threading.py", line 934 in _bootstrap_inner
File "/usr/lib/python3.8/threading.py", line 890 in _bootstrap

Current thread 0x00007f31de32d740 (most recent call first):

[1] 9072 abort sudo python runners/run_custom.py --logdir /mnt/c/JSJ/FCDD_Wafer_0930_1st

or

sudo python runners/run_custom.py --logdir /mnt/c/JSJ/FCDD_Wafer_0930_1st --datadir /mnt/c/JSJ/Wafer_denoising_data_split --objective fcdd -b 1 -e 1 --it 1 -d custom -n FCDD_CNN224 --preproc none --supervise-mode noise --noise-mode cifar100 --nominal-label 0 -ovr
[sudo] password for psm:
normal_class : 0
Plotting ROC for completed classes up to 0...
Traceback (most recent call last):
File "runners/run_custom.py", line 48, in
runner.run()
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 203, in run
self.run_classes(**vars(self.args))
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 221, in run_classes
res = self.run_seeds(
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 181, in run_seeds
res = self.run_one(
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/runners/bases.py", line 100, in run_one
setup = trainer_setup(
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/training/setup.py", line 104, in trainer_setup
ds = load_dataset(
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/init.py", line 67, in load_dataset
dataset = ADImageFolderDataset(
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/image_folder.py", line 92, in init
self.mean, self.std = self.extract_mean_std(self.trainpath, normal_class)
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/image_folder.py", line 193, in extract_mean_std
for x, _ in loader:
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 652, in next
data = self._next_data()
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1347, in _next_data
return self._process_data(data)
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1373, in _process_data
data.reraise()
File "/home/psm/.local/lib/python3.8/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
data = pin_memory(data, device)
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in pin_memory
return type(data)([pin_memory(sample, device) for sample in data]) # type: ignore[call-arg]
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 65, in
return type(data)([pin_memory(sample, device) for sample in data]) # type: ignore[call-arg]
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
return data.pin_memory(device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/psm/.local/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 28, in _pin_memory_loop
r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
File "/home/psm/.local/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 297, in rebuild_storage_fd
fd = df.detach()
File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/usr/lib/python3.8/multiprocessing/resource_sharer.py", line 87, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 502, in Client
c = SocketClient(address)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 630, in SocketClient
s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

Hey.

When running the model, is it possible to use GPU parallel processing?

No, this is not supported. Feel free to implement it and post the code here (e.g., in a pull request).

I used other custom image data but got a CUDA memory error.
[...]
sudo python runners/run_custom.py --logdir /mnt/c/JSJ/FCDD_Wafer_0930_1st --datadir /mnt/c/JSJ/Wafer_denoising_data_split --objective fcdd -b 1 -e 1 --it 1 -d custom -n FCDD_CNN28 --preproc none --supervise-mode unsupervised --noise-mode cifar100 --nominal-label 0 -ovr
[...]
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/init.py", line 67, in load_dataset
dataset = ADImageFolderDataset(
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/image_folder.py", line 92, in init
self.mean, self.std = self.extract_mean_std(self.trainpath, normal_class)
File "/usr/local/lib/python3.8/dist-packages/fcdd-1.1.0-py3.8.egg/fcdd/datasets/image_folder.py", line 193, in extract_mean_std
for x, _ in loader:
[...]
RuntimeError: CUDA error: out of memory

You seem to have changed the code here as, in the current release, line 193 contains return all_x.permute(1, 0, 2, 3).flatten(1).mean(1), all_x.permute(1, 0, 2, 3).flatten(1).std(1) instead of for x, _ in loader:. The extract_mean_std function's purpose is to get the empirical mean and variance of the training dataset to later use it for standardizing the samples during training. The current implementation iterates through the complete training set, gathers all data, and then computes the mean and variance. Note that it doesn't put the tensors to the GPU as the complete training set typically doesn't fit in there. Your code, however, seems to actually move the tensors to GPU because otherwise it shouldn't throw a CUDA out-of-memory error. Can you check that?
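For reference, the gist of the released version is roughly the following (a simplified sketch, not the exact code; the dataset construction is omitted):

import torch
from torch.utils.data import DataLoader

def extract_mean_std_sketch(ds):
    # Iterate over the complete (normal) training set on the CPU; nothing is moved to the GPU.
    loader = DataLoader(ds, batch_size=50, shuffle=False, num_workers=4)
    all_x = []
    for x, _ in loader:
        all_x.append(x)                             # gather all batches in main memory
    all_x = torch.cat(all_x)                        # shape (N, C, H, W)
    flat = all_x.permute(1, 0, 2, 3).flatten(1)     # shape (C, N*H*W)
    return flat.mean(1), flat.std(1)                # per-channel mean and std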

Nine-label multi-class classification with explainable heatmap images.

FCDD is used to perform AD and not multi-classification. I guess you want to perform AD and evaluate it using the one vs. rest protocol?

I want the nominal label to be 'normal' (I think 'normal' would be number 6, e.g., --nominal-label 6), but --nominal-label only accepts [0, 1].

Can you rephrase this? I'm not sure what you mean.

First of all, thank you very much for your reply.

Please understand that my reply was delayed because I was trying out several things, including what you suggested.

  1. Thank you for confirming once again that multi-GPU is not supported.

When I was training the FCDD_CNN224 model,

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)

I tried to apply the above code since I had used it in other models before, but I could not use it because it broke the connection to the heatmap generation.

"""
You seem to have changed the code here as, in the current release, line 193 contains return all_x.permute(1, 0, 2, 3).flatten(1).mean(1), all_x.permute(1, 0, 2, 3).flatten(1).std(1) instead of for x, _ in loader:. The extract_mean_std function's purpose is to get the empirical mean and variance of the training dataset to later use it for standardizing the samples during training. The current implementation iterates through the complete training set, gathers all data, and then computes the mean and variance. Note that it doesn't put the tensors to the GPU as the complete training set typically doesn't fit in there. Your code, however, seems to actually move the tensors to GPU because otherwise it shouldn't throw a CUDA out-of-memory error. Can you check that?
"""
Following your advice, instead of gathering all the tensors into one and computing the statistics in a single pass, I now update the mean and standard deviation batch by batch, weighted by the number of samples per batch.
I modified the code as below.
I think there is no problem calculating the mean and standard deviation of the data.

line 171

    def extract_mean_std(self, path: str, cls: int) -> Tuple[Tuple[float, float, float], Tuple[float, float, float]]:
        transform = transforms.Compose([
            transforms.Resize((self.shape[-2], self.shape[-1])),
            transforms.ToTensor(),
        ])
        ds = ImageFolderDataset(
            path, 'unsupervised', self.raw_shape, self.ovr, self.nominal_label, self.anomalous_label,
            normal_classes=[cls], transform=transform, target_transform=transforms.Lambda(
                lambda x: self.anomalous_label if x in self.outlier_classes else self.nominal_label
            )
        )
        ds = Subset(
            ds,
            np.argwhere(
                np.isin(ds.targets, np.asarray([cls])) * np.isin(ds.anomaly_labels, np.asarray([self.nominal_label]))
            ).flatten().tolist()
        )

        batch_size_bs = 50
        loader = DataLoader(dataset=ds, batch_size=batch_size_bs, shuffle=False, num_workers=4, pin_memory=True)

        batch_cnt = -1
        temp_mean = 0
        temp_std = 0

        for x, _ in loader:
            batch_cnt = batch_cnt + 1
            temp_previous_data_cnt = batch_size_bs * batch_cnt

            # per-channel statistics of the current batch, shape (C, B*H*W)
            temp_flat = x.permute(1, 0, 2, 3).flatten(1)

            # running (weighted) update of the mean and std over the batches seen so far
            temp_mean = (temp_mean * temp_previous_data_cnt + temp_flat.mean(1) * batch_size_bs) / (temp_previous_data_cnt + batch_size_bs)
            temp_std = np.sqrt(((temp_std ** 2) * (temp_previous_data_cnt - 1) + (temp_flat.std(1) ** 2) * batch_size_bs) / (temp_previous_data_cnt + batch_size_bs - 1))

        tensor_mean = torch.Tensor(temp_mean)
        tensor_std = torch.Tensor(temp_std)

        return tensor_mean, tensor_std

However, due to the RAM capacity (48 GB), I still cannot use all of the data.

Q1. Do you happen to know what the problem is?

When I use less than the total data, I encounter this error.

TEST 13191/14384 ID fcdd.training.fcdd.FCDDTrainer NCLS (8,)
[1] 19327 killed sudo python runners/run_custom.py --logdir /mnt/c/JSJ/FCDD_Wafer

I do see the error message, and I have been looking for a solution for the past few days, but I could not find one.

  1. Nine-label multi-class classification with explainable heatmap images.
  2. I want the nominal label to be 'normal' (I think 'normal' would be number 6, e.g., --nominal-label 6), but --nominal-label only accepts [0, 1].

Problems '2.' and '3.' have been solved; there was a slight mistake on my part.

Q2. The performance is low for classes other than 'normal'; could this be a problem with the amount of data for those classes?

I think there is no problem calculating the mean and standard deviation of the data.

I'm not entirely sure if your implementation is accurate. Have a look at this for sophisticated solutions: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
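For instance, the "parallel algorithm" from that page can merge per-batch statistics exactly, one batch at a time, without ever holding the full dataset in memory. A minimal sketch of it, applied per channel (illustrative only, not code from this repository):

import torch
from torch.utils.data import DataLoader

def streaming_mean_std(loader: DataLoader):
    # Chan et al.'s pairwise update from the Wikipedia page above, applied per channel.
    # Only one batch is held in memory at a time.
    n = 0            # number of pixels seen so far (per channel)
    mean = None      # running per-channel mean, shape (C,)
    m2 = None        # running sum of squared deviations from the mean, shape (C,)
    for x, _ in loader:                                # x: (B, C, H, W), kept on the CPU
        flat = x.permute(1, 0, 2, 3).flatten(1)        # (C, B*H*W)
        n_b = flat.shape[1]
        mean_b = flat.mean(1)
        m2_b = ((flat - mean_b.unsqueeze(1)) ** 2).sum(1)
        if mean is None:
            n, mean, m2 = n_b, mean_b, m2_b
        else:
            delta = mean_b - mean
            m2 = m2 + m2_b + delta ** 2 * n * n_b / (n + n_b)
            mean = mean + delta * n_b / (n + n_b)
            n = n + n_b
    std = (m2 / (n - 1)).sqrt()                        # unbiased, matching torch.std's default
    return mean, std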

Q1. Do you happen to know what the problem is?

Well, during testing all relevant data (inputs, outputs, anomaly scores, etc.) are gathered in main memory to quickly generate heatmaps and scores. This creates problems if the data is too large or the main memory is too small. I would guess that your program is killed because you're out of memory. If that's the case, you'll need to optimize the code to only load the required data into main memory ("on demand"). However, it's weird that it only happens if you use less data.

Q2. The performance is low for classes other than 'normal'; could this be a problem with the amount of data for those classes?

What do you mean by "the performance is low"? The AUC metric requires two classes (in our case: normal vs. anomalous), so there is no separate performance measurement for the anomalous samples only. Are you talking about the loss?

Thank you very much for your quick response.

> When I use less than the total data, I encounter this error.

> However, it's weird that it only happens if you use less data.

Sorry, I said the wrong thing.

To clarify: the error above occurs when the total data is used, and it still occurs even when slightly less than the total data is used.

In other words, I can only use up to 35% of the total data.

Q1. Is there a way to save memory in the code instead of adding RAM?

> What do you mean by "the performance is low"?

I applied the 'ovr' (one vs. rest) method.
I have 9 classes: 'normal, 1, ..., 8'.
('1, ..., 8' are various kinds of anomaly data, e.g., crack, scratch, ...)
The AUC is good when 'normal' is treated as the nominal class and the rest as anomalous (147,431 'normal' images out of 172,950 in total).
However, when one of '1, ..., 8' is treated as nominal and the rest as anomalous, the AUC is not good.

I also just found out that, in addition to the AUC, accuracy, recall, and precision are not logged, so I am going to modify the code and run it again, because I need those metrics (accuracy, recall, and precision).
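Roughly, I plan to compute them from the image-wise anomaly scores like this (just a sketch; it assumes the binary labels and scores are available as arrays and that a threshold has been chosen, e.g., from the ROC curve):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

def threshold_metrics(labels: np.ndarray, scores: np.ndarray, threshold: float):
    # labels: 0 = nominal, 1 = anomalous; scores: higher means more anomalous.
    preds = (scores >= threshold).astype(int)          # binarize the anomaly scores
    return (
        accuracy_score(labels, preds),
        precision_score(labels, preds),
        recall_score(labels, preds),
    )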

> Q2. (new question) I want to convert the FCDD model from anomaly detection (binary classification) to multi-class classification.

For each class (e.g., my custom data has 9 classes), a model is trained using the 'ovr' method. Then, for each image, the nine models are used for anomaly detection, and the class whose model assigns the highest probability (of the image being nominal) is adopted as the prediction.

I would like to ask your opinion on the above.
I think it will require a lot of computing resources, such as memory.

Q1. Is there a way to save memory in the code instead of adding RAM?

Yes, you'll need to change the code so that it doesn't load all data into RAM. Essentially, you only need the labels and the final image-wise anomaly scores (a scalar per image) at the same time in RAM to compute the AUC (line 353). Therefore, instead of also putting the images into RAM (line 286), you'll just remember the paths and load them on demand when creating the heatmaps (here). That might already be enough to solve the problem.
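A rough sketch of that idea (illustrative only, not the repository's API; heatmap_for is a placeholder for the actual heatmap routine):

from PIL import Image
import torch
from torchvision import transforms

# Keep only lightweight data in RAM: per-image paths, labels, and scalar anomaly scores.
records = []  # list of (image_path, label, anomaly_score) tuples

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def load_on_demand(path: str) -> torch.Tensor:
    # Load a single image from disk only when its heatmap is actually generated.
    return to_tensor(Image.open(path).convert('RGB'))

# The AUC only needs the labels and scores, so the images never have to sit in RAM all at once.
# When creating heatmaps, iterate over the records and load each image individually, e.g.:
# for path, label, score in records:
#     img = load_on_demand(path)
#     heatmap = heatmap_for(img)  # placeholder for the actual heatmap routine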

Q2. (new question) I want to convert the FCDD model from anomaly detection (binary classification) to multi-class classification.

Ah. Okay, I get what you want to do. Well... FCDD builds on top of HSC, which is known to be more robust to the choice of training outliers (see this paper). Thus, it could have some benefits over classical binary cross-entropy when, e.g., the training "outliers" (in your case, just the other classes) are a poor fit or their number is very small. In your case, however, the "outliers" should fit well. So I'm unsure whether you will see an improvement, but I would be very curious to see whether this works out. So, go ahead!
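If you try it, the prediction step could look roughly like this (a sketch under the assumption that each of the nine trained models provides a function returning an image-wise anomaly score; score_fns is a placeholder list of such functions):

import torch

def predict_class(x: torch.Tensor, score_fns: list) -> int:
    # One-vs-rest prediction: each model scores how anomalous the image is with respect
    # to "its" class; the class whose model is least alarmed wins.
    scores = torch.tensor([fn(x) for fn in score_fns])  # one anomaly score per class-specific model
    return int(scores.argmin())                          # lowest anomaly score -> predicted class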