ayoolaolafenwa / PixelLib

Visit PixelLib's official documentation https://pixellib.readthedocs.io/en/latest/

Training time does not decrease after increasing batch_size on a 32 GB CPU instance

suyogkute opened this issue

With batch_size = 1, the ETA is about 1 hour; with batch_size = 16, the ETA is about 10 hours.

This behaviour is the same on a 32 GB RAM CPU instance and on my local 16 GB CPU system.
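
A likely factor: a CPU has little spare parallel capacity for a larger batch to exploit, so per-step time tends to grow roughly in proportion to batch size instead of the batch finishing in the same wall time. A rough, PixelLib-independent sketch (synthetic tensors, arbitrary shapes) that makes this visible:

import time
import tensorflow as tf

def time_step(batch_size, steps=10):
    x = tf.random.normal((batch_size, 256, 256, 3))  # synthetic "images"
    conv = tf.keras.layers.Conv2D(32, 3)
    conv(x)  # build the layer weights outside the timed loop
    start = time.perf_counter()
    for _ in range(steps):
        conv(x)
    per_step = (time.perf_counter() - start) / steps
    print(f"batch_size={batch_size}: {per_step:.3f} s/step, "
          f"{per_step / batch_size:.3f} s/image")

for bs in (1, 16):
    time_step(bs)

If seconds per image stay flat (or rise) between batch sizes, a larger batch cannot shorten the epoch ETA on this hardware.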

@elbruno @mmphego @fmorenovr @prateekralhan

Any answer on this from the team would be highly appreciated. I am currently training on A100s in Google Colab and I am seeing the same behaviour. I have 12 classes, including the background. The screenshot is below:

[screenshot]

The output from the train_model method is:
Train 808 images
Validate 350 images
Applying augmentation on dataset
Checkpoint Path: /content/drive/MyDrive/computer_vision_model
Selecting layers to train
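
Before digging into speed, it is worth confirming that the reinstalled TensorFlow build actually sees the A100; a quick check with standard TensorFlow calls, nothing PixelLib-specific:

import tensorflow as tf

print(tf.test.is_built_with_cuda())            # should be True on a GPU build
print(tf.config.list_physical_devices('GPU'))  # an empty list means training runs on CPU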

The training code follows what is discussed in the documentation, I applied all the suggested software downgrades, and I am currently running it on TensorFlow 2.8.0.
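
One way to confirm the downgrades actually took effect in the current runtime (a small sketch; importlib.metadata is in the standard library on Python 3.8+):

import importlib.metadata as metadata

for pkg in ('tensorflow', 'pixellib', 'scikit-image', 'imgaug', 'labelme2coco'):
    print(pkg, metadata.version(pkg))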

Code below:
from google.colab import drive
drive.mount('/content/drive')

!pip3 uninstall -y tensorflow
!pip3 install tensorflow==2.8.0  # downgrade from the Colab-preinstalled 2.14.0
# tensorflow 2.x already ships with GPU support; a separate tensorflow-gpu install is unnecessary
# (restart the Colab runtime after reinstalling TensorFlow, before importing it)

!pip install requests numpy pillow scipy scikit-image==0.18.3 imgaug matplotlib labelme2coco==0.1.0 pixellib==0.5.2

import pixellib
from pixellib.custom_train import instance_custom_training

import json
import numpy as np
import pandas as pd
import os
import tensorflow as tf
print(tf.__version__)  # tf.version is a module; __version__ holds the version string

# Paths are assumptions for illustration: models_dir matches the checkpoint path
# in the log above; exports_dir is a hypothetical path to the labelme2coco export
models_dir = '/content/drive/MyDrive/computer_vision_model'
exports_dir = '/content/drive/MyDrive/dataset_exports'

train_maskrcnn = instance_custom_training()
train_maskrcnn.modelConfig(network_backbone = 'resnet101', num_classes = 12, batch_size = 4)
train_maskrcnn.load_pretrained_model(models_dir + '/mask_rcnn_coco.h5')
train_maskrcnn.load_dataset(exports_dir)
train_maskrcnn.train_model(num_epochs = 100, augmentation = True, path_trained_models = models_dir)
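
For isolating the batch-size effect, a one-epoch timing harness built from the same calls as above can help; this is a sketch, reusing the assumed models_dir and exports_dir paths:

import time

for bs in (1, 4, 16):
    trainer = instance_custom_training()
    trainer.modelConfig(network_backbone = 'resnet101', num_classes = 12, batch_size = bs)
    trainer.load_pretrained_model(models_dir + '/mask_rcnn_coco.h5')
    trainer.load_dataset(exports_dir)
    start = time.perf_counter()
    trainer.train_model(num_epochs = 1, augmentation = False, path_trained_models = models_dir)
    print(f'batch_size={bs}: {time.perf_counter() - start:.0f} s for one epoch')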