When using FMix, my dataloader is unable to mix the data in the training loop
IamSparky opened this issue
I have used the following training loop for my plant image dataset:
```python
def train_loop_fn(data_loader, model, optimizer, device, scheduler=None):
    running_loss = 0.0
    running_corrects = 0
    model.train()
    alpha, decay_power = 1.0, 3.0
    for batch_index, dataset in enumerate(data_loader):
        image = dataset["image"]
        label = dataset["label"]
        image = image.to(device, dtype=torch.float)
        label = label.to(device, dtype=torch.float)
        # FMix: mix the batch, returning the permutation and mixing coefficient
        image, perm, lambda_value = sample_and_apply(image, alpha, decay_power, (224, 224))
        optimizer.zero_grad()
        outputs = model(image)
        # mix the loss between the original and permuted labels
        loss = loss_fn(outputs, label) * lambda_value + loss_fn(outputs, label[perm]) * (1 - lambda_value)
        loss.backward()
        xm.optimizer_step(optimizer)
        running_loss += loss.item()
    scheduler.step()
    train_loss = running_loss / float(len(train_dataset))
    xm.master_print('training Loss: {:.4f} '.format(train_loss))
```
and my dataset class looks like this:
```python
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset
from torchvision import transforms
import albumentations
from PIL import Image

class leaf_classification(Dataset):
    def __init__(self, ids, image_id, label, mean, std, is_valid):
        self.ids = ids
        self.image_id = image_id
        self.label = label
        self.is_valid = is_valid
        if self.is_valid == 1:  # transforms for validation images
            self.aug = albumentations.Compose([
                albumentations.Normalize(mean, std, always_apply=True)
            ])
        else:  # transforms for training images
            self.aug = albumentations.Compose([
                albumentations.Normalize(mean, std, always_apply=True),
                albumentations.ShiftScaleRotate(shift_limit=0.0625,
                                                scale_limit=0.1,
                                                rotate_limit=5,
                                                p=0.9)
            ])

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, index):
        # convert the jpg image to a numpy array
        img = np.array(Image.open('../input/cassava-leaf-disease-classification/train_images/' + self.image_id[index]))
        img = cv2.resize(img, dsize=(224, 224), interpolation=cv2.INTER_CUBIC)
        img = self.aug(image=img)['image']
        # (2, 0, 1) because PyTorch expects the image channels first
        img = np.transpose(img, (2, 0, 1)).astype(np.float32)
        return {
            'image': torch.tensor(img, dtype=torch.float),
            'label': torch.tensor(self.label[index], dtype=torch.float)
        }
```
And while training it's generating an error. Please help me resolve it. Here's the link to my notebook
Hi, looks like you're on the right lines. I've created a copy of your notebook and made a few changes here: https://www.kaggle.com/ethanwharris/fmix-cassava-leaf-disease-classification
Changes made:
1. Used `from FMix.fmix import ...` instead of `cd FMix` (which caused the file error, since after the `cd` everything was one level lower)
2. Used `sample_mask` instead of `sample_and_apply`, which used a mask from numpy and didn't seem to work with the XLA device (a minimal sketch of the new pattern follows this list)
3. Moved `model.to(device)`, as the `torchsummary` package was moving the model to CPU
4. Added `train_loss` to the `scheduler.step` call (although this should probably be a val loss)
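For reference, here's a minimal sketch of that `sample_mask` pattern, pulled out into a helper (`fmix_forward` is just an illustrative name, not part of FMix; the `sample_mask` call matches the one used in the notebook):

```python
import torch
from FMix.fmix import sample_mask

def fmix_forward(model, loss_fn, image, label, device,
                 alpha=1.0, decay_power=3.0, shape=(224, 224)):
    """One FMix forward pass: mix the batch with a sampled mask, then mix the loss."""
    # sample a low-frequency mask on the host (it comes back as a numpy array)
    lam, mask = sample_mask(alpha, decay_power, shape, 0.0, False)
    mask = torch.from_numpy(mask).to(dtype=image.dtype)

    # pair each image with a random other image from the batch; mix on the host,
    # then move everything to the XLA device in one go
    perm = torch.randperm(image.size(0))
    mixed = (image * mask + image[perm] * (1 - mask)).to(device, dtype=torch.float)
    label, shuffled_label = label.to(device, dtype=torch.float), label[perm].to(device, dtype=torch.float)

    outputs = model(mixed)
    # weight the loss between the original and permuted targets by lambda
    return loss_fn(outputs, label) * lam + loss_fn(outputs, shuffled_label) * (1 - lam)
```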
There are still some errors in `eval_loop_fn`, but these aren't related to FMix.
Hope that helps!
Thanks brother, really appreciate your work as well as your help; it started working after I made the necessary changes. But I just want to know: am I going wrong with regard to your point number 4?
Hello Ethan,
I have again started facing an error on this line: `x1, x2 = image * mask, image[perm] * (1 - mask)` in the function defining the training loop:
```python
def train_loop_fn(data_loader, model, optimizer, device, scheduler=None):
    running_loss = 0.0
    running_corrects = 0
    model.train()
    alpha, decay_power = 1.0, 3.0
    for batch_index, dataset in enumerate(data_loader):
        image = dataset["image"]
        label = dataset["label"]
        # FMix: sample the mask on the host, move it to the device, then mix
        lambda_value, mask = sample_mask(alpha, decay_power, (224, 224), 0.0, False)
        mask = torch.from_numpy(mask).to(device)
        perm = torch.randperm(image.size(0))
        x1, x2 = image * mask, image[perm] * (1 - mask)
        image = x1 + x2
        image = image.to(device, dtype=torch.float)
        label = label.to(device, dtype=torch.float)
        optimizer.zero_grad()
        outputs = model(image)
        loss = loss_fn(outputs, label) * lambda_value + loss_fn(outputs, label[perm]) * (1 - lambda_value)
        # loss = loss_fn(outputs, label)
        loss.backward()
        xm.optimizer_step(optimizer)
        running_loss += loss.item()
    train_loss = running_loss / float(len(train_data))
    scheduler.step(train_loss)
    return train_loss
```
Don't know why; it was working fine earlier.
Hi, sorry I missed this.
Not sure what the error was here; it looks like the tensors are the wrong sizes, so you might need to squeeze / unsqueeze in places to get it to work, as in the sketch below. Closing this issue as it looks like it's not a bug in our code.
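For example, something like this would make the dtype and rank of the mask explicit before the multiply (an untested sketch; `apply_fmix_mask` is just an illustrative name, and it assumes `image` is a (B, C, H, W) batch still on the CPU and `mask` is the numpy array returned by `sample_mask`):

```python
import torch

def apply_fmix_mask(image, mask, perm, device):
    """Mix a batch on the host, then move the mixed result to the device."""
    # match the mask's dtype to the images so the multiply doesn't promote to float64
    mask = torch.from_numpy(mask).to(dtype=image.dtype)
    # pad the mask's rank, e.g. (H, W) or (1, H, W) -> (1, 1, H, W), so it broadcasts
    while mask.dim() < image.dim():
        mask = mask.unsqueeze(0)
    x1, x2 = image * mask, image[perm] * (1 - mask)
    return (x1 + x2).to(device, dtype=torch.float)
```

Mixing on the host and moving the mixed batch across afterwards also avoids multiplying a CPU tensor by one that's already on the XLA device.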