BMIRDS / deepslide

Code for the Nature Scientific Reports paper "Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks." A sliding window framework for classification of high-resolution whole-slide images, often microscopy or histopathology images.

Home Page: https://www.nature.com/articles/s41598-019-40041-7

IndexError: max(): Expected reduction dim 1 to have non-zero size

carloelle opened this issue

Hi,

I followed your code and generated patches from your sample data.
Now that I have train_folder, I try to run 3_train.py, but I get the following error:

Traceback (most recent call last):
  File "/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/3_train.py", line 6, in <module>
    train_resnet(batch_size=config.args.batch_size,
  File "/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/utils_model.py", line 483, in train_resnet
    train_helper(model=model,
  File "/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/utils_model.py", line 265, in train_helper
    __, train_preds = torch.max(train_outputs, dim=1)
IndexError: max(): Expected reduction dim 1 to have non-zero size.

Do you have any suggestions on how to proceed?

best,
carlo
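
A note for anyone searching for this error: it means the tensor being reduced has size zero along dimension 1, i.e. the model is producing outputs over zero classes. A minimal sketch, using nothing beyond stock PyTorch, that reproduces the same exception:

import torch

# Mimic a model whose final layer was built with num_classes == 0:
# a batch of 32 outputs over 0 classes.
train_outputs = torch.empty(32, 0)

# Reducing over the empty class dimension raises:
# IndexError: max(): Expected reduction dim 1 to have non-zero size.
_, train_preds = torch.max(train_outputs, dim=1)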

Can you please run the following commands and provide the output?

wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

Here it is:


(deepslide_env) leonardi.carlo@dgx01:/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/DigitalPathology$ wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
--2023-07-14 14:10:46--  https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21653 (21K) [text/plain]
Saving to: ‘collect_env.py’

collect_env.py                                        100%[=======================================================================================================================>]  21.15K  --.-KB/s    in 0.03s   

2023-07-14 14:10:47 (675 KB/s) - ‘collect_env.py’ saved [21653/21653]

(deepslide_env) leonardi.carlo@dgx01:/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/DigitalPathology$ python collect_env.py
Collecting environment information...
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.26.4
Libc version: glibc-2.27

Python version: 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)  [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-136-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.1.74
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 455.23.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  2
Core(s) per socket:  20
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             2408.587
CPU max MHz:         3600.0000
CPU min MHz:         1200.0000
BogoMIPS:            4389.88
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            51200K
NUMA node0 CPU(s):   0-19,40-59
NUMA node1 CPU(s):   20-39,60-79
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

Versions of relevant libraries:
[pip3] numpy==1.25.0
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.15.2
[pip3] triton==2.0.0
[conda] cudatoolkit               11.3.1              h9edb442_11    conda-forge
[conda] mkl                       2022.2.1         h84fe81f_16997    conda-forge
[conda] numpy                     1.25.0           py39h6183b62_0    conda-forge
[conda] torch                     2.0.1                    pypi_0    pypi
[conda] torchaudio                2.0.2                    pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi

P.S.: I even tried reinstalling all the packages from conda_env.yaml and pip_requirements.txt, but I still get the same error.

That all looks fine. Can you provide the entire printout from our code, including the configuration?

Sure:

(deepslide_env) leonardi.carlo@dgx01:/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/DigitalPathology$ CUDA_VISIBLE_DEVICES=0 python /beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/3_train.py --batch_size 32 --num_epochs 100 --save_interval 5
###############     CONFIGURATION     ###############
all_wsi:	all_wsi
val_wsi_per_class:	20
test_wsi_per_class:	30
keep_orig_copy:	True
num_workers:	8
patch_size:	224
wsi_train:	wsi_train
wsi_val:	wsi_val
wsi_test:	wsi_test
labels_train:	labels_train.csv
labels_val:	labels_val.csv
labels_test:	labels_test.csv
train_folder:	train_folder
patches_eval_train:	patches_eval_train
patches_eval_val:	patches_eval_val
patches_eval_test:	patches_eval_test
num_train_per_class:	80000
type_histopath:	True
purple_threshold:	100
purple_scale_size:	15
slide_overlap:	3
gen_val_patches_overlap_factor:	1.5
image_ext:	jpg
by_folder:	True
color_jitter_brightness:	0.5
color_jitter_contrast:	0.5
color_jitter_saturation:	0.5
color_jitter_hue:	0.2
num_epochs:	100
num_layers:	18
learning_rate:	0.001
batch_size:	32
weight_decay:	0.0001
learning_rate_decay:	0.85
resume_checkpoint:	False
save_interval:	5
checkpoints_folder:	checkpoints
checkpoint_file:	xyz.pt
pretrain:	False
log_folder:	logs
auto_select:	True
preds_train:	preds_train
preds_val:	preds_val
preds_test:	preds_test
inference_train:	inference_train
inference_val:	inference_val
inference_test:	inference_test
vis_train:	vis_train
vis_val:	vis_val
vis_test:	vis_test
device:	cuda:0
classes:	[]
num_classes:	0
train_patches:	train_folder/train
val_patches:	train_folder/val
path_mean:	[-3.221987121548864e-10, 4.5612265013772796e-41, -3.2248959058733817e-10]
path_std:	[nan, 1.1290033796740317e-07, nan]
resume_checkpoint_path:	checkpoints/xyz.pt
log_csv:	logs/log_7142023_162337.csv
eval_model:	checkpoints/xyz.pt
threshold_search:	(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
colors:	('red', 'white', 'blue', 'green', 'purple', 'orange', 'black', 'pink', 'yellow')

#####################################################





+++++ Running 3_train.py +++++
/home/leonardi.carlo/.conda/envs/deepslide_env/lib/python3.9/site-packages/torch/utils/data/dataloader.py:560: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
0 classes: []
num train images 3481312
num val images 615488
CUDA is_available: True
/home/leonardi.carlo/.conda/envs/deepslide_env/lib/python3.9/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
  warnings.warn("Initializing zero-element tensors is a no-op")
train_folder: train_folder
num_epochs: 100
num_layers: 18
learning_rate: 0.001
batch_size: 32
weight_decay: 0.0001
learning_rate_decay: 0.85
resume_checkpoint: False
resume_checkpoint_path (only if resume_checkpoint is true): checkpoints/xyz.pt
save_interval: 5
output in checkpoints_folder: checkpoints
pretrain: False
log_csv: logs/log_7142023_162337.csv


Traceback (most recent call last):
  File "/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/3_train.py", line 6, in <module>
    train_resnet(batch_size=config.args.batch_size,
  File "/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/utils_model.py", line 483, in train_resnet
    train_helper(model=model,
  File "/beegfs/scratch/ric.ostuni/ric.ostuni/DP_Carlo/deepslide/code/utils_model.py", line 265, in train_helper
    __, train_preds = torch.max(train_outputs, dim=1)
IndexError: max(): Expected reduction dim 1 to have non-zero size.

I see the problem: the folders were not moved correctly, and indeed the number of classes was not recognised correctly (0).
Thank you!
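
For anyone landing here with the same symptom: deepslide builds the model from the detected class list, so classes: [] and num_classes: 0 in the configuration printout are the real failure, and the IndexError is only a downstream effect. A minimal sketch of a pre-flight check, assuming the default folder names from the configuration above (the exact folder deepslide derives the class names from may differ in your setup):

from pathlib import Path

# Default folder names taken from the configuration printout above;
# adjust if your --all_wsi / --train_folder arguments differ.
for folder in (Path("all_wsi"), Path("train_folder") / "train"):
    if folder.is_dir():
        classes = sorted(p.name for p in folder.iterdir() if p.is_dir())
        print(f"{folder}: {len(classes)} classes: {classes}")
    else:
        print(f"{folder}: folder not found")

# If this prints "0 classes: []", the per-class subfolders are missing or
# misplaced, and 3_train.py will build a model with a zero-output final
# layer and fail with the IndexError shown above.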