MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.

Example training dataset causing error

AceMcAwesome77 opened this issue

Hi, I am trying to follow along with the example dataset training given on the page, but I am getting an error at the stage where I run "nndet_train 000". Here is my trace. Do you know what could be causing this?

19.1 M Trainable params
0 Non-trainable params
19.1 M Total params
76.244 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]INFO Using validation DataLoader3DOffset with {}
INFO Building Sampling Cache for Dataloder
Sampling Cache: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 8665.92it/s]
INFO Using 5 num_processes and 2 num_cached_per_queue for augmentation.
INFO VALIDATION KEYS:
odict_keys(['case_0', 'case_7'])
Validation sanity check: 0%| | 0/10 [00:00<?, ?it/s]using pin_memory on device 0
Traceback (most recent call last):
File "/opt/conda/bin/nndet_train", line 33, in <module>
sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_train')())
File "/opt/code/nndet/nndet/utils/check.py", line 62, in wrapper
return func(*args, **kwargs)
File "/opt/code/nndet/scripts/train.py", line 70, in train
_train(
File "/opt/code/nndet/scripts/train.py", line 290, in _train
trainer.fit(module, datamodule=datamodule)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1031, in _run_train
self._run_sanity_check(self.lightning_module)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1115, in _run_sanity_check
self._evaluation_loop.run()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
dl_outputs = self.epoch_loop.run(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 110, in advance
output = self.evaluation_step(batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 154, in evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 211, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 178, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/opt/code/nndet/nndet/ptmodule/retinaunet/base.py", line 172, in validation_step
losses, prediction = self.model.train_step(
File "/opt/code/nndet/nndet/core/retina.py", line 146, in train_step
prediction = self.postprocess_for_inference(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/opt/code/nndet/nndet/core/retina.py", line 187, in postprocess_for_inference
boxes, probs, labels = self.postprocess_detections(
File "/opt/code/nndet/nndet/core/retina.py", line 326, in postprocess_detections
boxes, probs, labels = self.postprocess_detections_single_image(boxes, probs, image_shape)
File "/opt/code/nndet/nndet/core/retina.py", line 375, in postprocess_detections_single_image
keep = box_utils.batched_nms(boxes, probs, labels, self.nms_thresh)
File "/opt/code/nndet/nndet/core/boxes/nms.py", line 106, in batched_nms
return nms(boxes_for_nms, scores, iou_threshold)
File "/opt/conda/lib/python3.8/site-packages/torch/autocast_mode.py", line 198, in decorate_autocast
return func(*args, **kwargs)
File "/opt/code/nndet/nndet/core/boxes/nms.py", line 78, in nms
return nms_fn(boxes.float(), scores.float(), iou_threshold)
TypeError: 'NoneType' object is not callable

The compilation of the GPU code failed during installation, but you are trying to run nnDetection on a GPU. Please also check the FAQ for common questions regarding the installation. If the FAQ doesn't help, please provide the full installation log and environment information (and try the Docker installation if possible).

Best,
Michael

Thanks for the reply - this is confusing to me because the logs right after running appear to show that the GPU is being used, although they also say "nnDetection was not build with GPU support". See the "nnDetection was not build with GPU support!" and "GPU available: True, used: True" lines below for the conflicting information - why would it say "GPU available: True, used: True" if the GPU was not set up properly?

root@ebd7ev5t1fe7:/opt/data/Task000D3_Example# nndet_train 000
2023-02-17 21:54:57.520 | WARNING | nndet.core.boxes.nms:<module>:26 - nnDetection was not build with GPU support!
Overwrites: None
Found existing folder /opt/models/Task000D3_Example/RetinaUNetV001_D3V001_3d/fold0, this run will overwrite the results inside that folder
INFO Log file at /opt/models/Task000D3_Example/RetinaUNetV001_D3V001_3d/fold0/train.log
ERROR Was not able to read git information, trying to continue without.
INFO Using splits /opt/data/Task000D3_Example/preprocessed/splits_final.pkl with fold 0
INFO Architecture overwrites: {} Anchor overwrites: {}
INFO Building architecture according to plan of RetinaUNetV001
INFO Start channels: 32; head channels: 128; fpn channels: 128
INFO Discarding anchor generator kwargs {'stride': 1}
INFO Building:: encoder Encoder: {}
INFO Building:: decoder UFPNModular: {'min_out_channels': 8, 'upsampling_mode': 'transpose', 'num_lateral': 1, 'norm_lateral': False, 'activation_lateral': False, 'num_out': 1, 'norm_out': False, 'activation_out': False}
INFO Running ATSS Matching with num_candidates=4 and center_in_gt False.
INFO Building:: classifier BCECLassifier: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'mean', 'loss_weight': 1.0, 'prior_prob': 0.01}
INFO Init classifier weights: prior prob 0.01
INFO Building:: regressor GIoURegressor: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'sum', 'loss_weight': 1.0, 'learn_scale': True}
INFO Learning level specific scalar in regressor
INFO Overwriting regressor conv weight init
INFO Building:: head DetectionHeadHNMNative: {} sampler HardNegativeSamplerBatched: {'batch_size_per_image': 32, 'positive_fraction': 0.33, 'pool_size': 20, 'min_neg': 1}
INFO Sampling hard negatives on a per batch basis
INFO Building:: segmenter DiCESegmenterFgBg {'dice_kwargs': {'batch_dice': True}}
INFO Running batch dice True and do bg False in dice loss.
INFO Model Inference Summary:
detections_per_img: 100
score_thresh: 0
topk_candidates: 10000
remove_small_boxes: 0.01
nms_thresh: 0.6
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:446: UserWarning: Checkpoint directory /opt/models/Task000D3_Example/RetinaUNetV001_D3V001_3d/fold0 exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
INFO Using 1 GPUs for training
INFO Using None plugins for training
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
INFO Initialize SWA with swa epoch start 49
INFO Augmentation: BaseMoreAug transforms and base_more params
INFO Loading network patch size [160 128 128] and generator patch size [289, 249, 272]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO Running: initial_lr 0.01 weight_decay 3e-05 SGD with momentum 0.9 and nesterov True

  | Name                 | Type          | Params | In sizes              | Out sizes
0 | model                | BaseRetinaNet | 19.1 M | [1, 1, 160, 128, 128] | ['?', [[1263600, 6]], '?']
1 | model.encoder        | Encoder       | 14.0 M | [1, 1, 160, 128, 128] | [[1, 32, 160, 128, 128], [1, 64, 80, 64, 64], [1, 128, 40, 32, 32], [1, 256, 20, 16, 16], [1, 320, 10, 8, 8], [1, 320, 5, 4, 4]]
2 | model.encoder.stages | ModuleList    | 14.0 M | ?                     | ?

nnDetection has CUDA code which is compiled during installation. If the environment is not configured correctly, the compilation is skipped and the NMS operation will not work on the GPU. That is the reason for both the error and the warning "2023-02-17 21:54:57.520 | WARNING | nndet.core.boxes.nms:<module>:26 - nnDetection was not build with GPU support!"
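To illustrate how a skipped compilation can surface as "'NoneType' object is not callable", here is a minimal sketch of the general pattern. It is not nnDetection's actual code: the module name `hypothetical_compiled_ext` and both function names are made up for illustration, and the explicit guard is one possible way to get a clearer error.

```python
import warnings


def load_compiled_nms():
    """Return the compiled NMS kernel, or None when the extension is missing."""
    try:
        # Hypothetical module name standing in for a CUDA extension
        # that is only available if it was compiled at install time.
        from hypothetical_compiled_ext import nms as compiled_nms
    except ImportError:
        warnings.warn("extension was not built with GPU support")
        return None
    return compiled_nms


# Evaluated once at import time; stays None if the build was skipped.
nms_fn = load_compiled_nms()


def nms(boxes, scores, iou_threshold):
    """Call the compiled kernel, failing with a readable message if it is absent."""
    if nms_fn is None:
        raise RuntimeError(
            "Compiled NMS kernel is unavailable; reinstall with a working CUDA toolkit."
        )
    return nms_fn(boxes, scores, iou_threshold)
```

Without the guard, calling `nms_fn(...)` directly while it is `None` raises the same `TypeError` seen at the bottom of the trace above.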

Please check the README for all the requirements of a source installation. Only having a GPU is not sufficient for a correct installation (except if Docker is used); a CUDA installation is needed for the compilation step.
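Whether a CUDA toolkit (not just a GPU driver) is visible can be checked before reinstalling. The following is a rough sketch using only the standard library. It assumes the build locates the toolkit through the `CUDA_HOME` or `CUDA_PATH` environment variables or an `nvcc` binary on `PATH`, which is how PyTorch's extension builder usually searches; the function name `cuda_toolchain_report` is hypothetical.

```python
import os
import shutil


def cuda_toolchain_report(env=None, which=shutil.which):
    """Collect hints about whether a CUDA toolkit is visible to a source build.

    `env` and `which` are injectable so the check can be exercised without a GPU.
    """
    env = os.environ if env is None else env
    report = {
        "CUDA_HOME": env.get("CUDA_HOME"),
        "CUDA_PATH": env.get("CUDA_PATH"),
        "nvcc": which("nvcc"),  # None when no compiler is on PATH
    }
    # Any one of the three hints is treated as "a toolkit may be present".
    report["toolkit_visible"] = any(report.values())
    return report


if __name__ == "__main__":
    for key, value in cuda_toolchain_report().items():
        print(f"{key}: {value}")
```

If `toolkit_visible` is False, the compilation step most likely has nothing to compile with, regardless of what `GPU available` reports at training time.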

Thanks for attaching the log, but that is the training log, not the installation log. Your environment information is also required (see the corresponding question in the FAQ section).

Thanks - I am going through the FAQ now and trying to match the PyTorch version information as closely to your example as I can. My current problem is that I am told about this requirement:

nndet 0.1 requires pytorch_lightning<=1.4.2,>=1.3.1

However if I use any of the pytorch_lightning versions in that range, I get this error:

root@ebd7eab81fe7:/opt/data/Task000D3_Example# nndet_train 000
Traceback (most recent call last):
File "/opt/conda/bin/nndet_train", line 33, in <module>
sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_train')())
File "/opt/conda/bin/nndet_train", line 25, in importlib_load_entry_point
return next(matches).load()
File "/opt/conda/lib/python3.8/importlib/metadata.py", line 77, in load
module = import_module(match.group('module'))
File "/opt/conda/lib/python3.8/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
File "<frozen importlib._bootstrap>", line 991, in _find_and_load
File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 783, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/opt/code/nndet/scripts/train.py", line 27, in <module>
import pytorch_lightning as pl
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/__init__.py", line 20, in <module>
from pytorch_lightning import metrics # noqa: E402
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/__init__.py", line 15, in <module>
from pytorch_lightning.metrics.classification import ( # noqa: F401
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/__init__.py", line 14, in <module>
from pytorch_lightning.metrics.classification.accuracy import Accuracy # noqa: F401
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/classification/accuracy.py", line 18, in <module>
from pytorch_lightning.metrics.utils import deprecated_metrics
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/metrics/utils.py", line 29, in <module>
from pytorch_lightning.utilities import rank_zero_deprecation
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/__init__.py", line 18, in <module>
from pytorch_lightning.utilities.apply_func import move_data_to_device # noqa: F401
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/apply_func.py", line 29, in <module>
from torchtext.legacy.data import Batch
ModuleNotFoundError: No module named 'torchtext.legacy'

It appears the torchtext.legacy module has been removed from the torchtext version I have installed. What version of pytorch_lightning are you using on a working system with PyTorch 1.11.0+cu113? A pip freeze of all your packages from the working system could be helpful if other dependency issues arise.

Thanks again!

The issue is caused by the old Lightning version and is not an nnDetection problem per se.
(Most likely the error only occurs if torchtext is installed. It is not required, so please uninstall it.)

Here is the pip list of a fresh environment which trains without error:
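A quick way to confirm whether torchtext is present in the active environment is sketched below. It relies only on the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("torchtext"):
        # The old Lightning release imports torchtext.legacy whenever torchtext exists.
        print("torchtext is installed; removing it (pip uninstall torchtext) may avoid the import error")
    else:
        print("torchtext is not installed")
```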

Package                  Version    Editable project location
------------------------ ---------- ------------------------------------------------
absl-py                  1.4.0
aiohttp                  3.8.4
aiosignal                1.3.1
alembic                  1.9.4
antlr4-python3-runtime   4.9.3
async-timeout            4.0.2
attrs                    22.2.0
batchgenerators          0.21
bayesian-optimization    1.4.2
cachetools               5.3.0
certifi                  2022.12.7
charset-normalizer       3.0.1
click                    8.1.3
cloudpickle              2.2.1
cma                      3.3.0
colorama                 0.4.6
contourpy                1.0.7
cycler                   0.11.0
databricks-cli           0.17.4
dicom2nifti              2.4.7
docker                   6.0.1
entrypoints              0.4
Flask                    2.2.3
fonttools                4.38.0
frozenlist               1.3.3
fsspec                   2023.1.0
future                   0.18.3
gitdb                    4.0.10
GitPython                3.1.31
google-auth              2.16.1
google-auth-oauthlib     0.4.6
greenlet                 2.0.2
grpcio                   1.51.3
gunicorn                 20.1.0
hydra-core               1.3.1
idna                     3.4
imageio                  2.25.1
importlib-metadata       5.2.0
importlib-resources      5.12.0
itsdangerous             2.1.2
Jinja2                   3.1.2
joblib                   1.2.0
kiwisolver               1.4.4
linecache2               1.0.0
llvmlite                 0.39.1
loguru                   0.6.0
Mako                     1.2.4
Markdown                 3.4.1
MarkupSafe               2.1.2
matplotlib               3.7.0
MedPy                    0.4.0
mlflow                   2.1.1
multidict                6.0.4
networkx                 3.0
nevergrad                0.4.2
nibabel                  5.0.1
nndet $path to git dir
nnunet                   1.6.6
numba                    0.56.4
numpy                    1.23.5
nvidia-cublas-cu11       11.10.3.66
nvidia-cuda-nvrtc-cu11   11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11        8.5.0.96
oauthlib                 3.2.2
omegaconf                2.3.0
packaging                22.0
pandas                   1.5.3
Pillow                   9.4.0
pip                      22.3.1
protobuf                 4.22.0
pyarrow                  10.0.1
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pyDeprecate              0.3.1
pydicom                  2.3.1
PyJWT                    2.6.0
pyparsing                3.0.9
python-dateutil          2.8.2
python-gdcm              3.0.21
pytorch-lightning        1.4.2
pytz                     2022.7.1
PyWavelets               1.4.1
PyYAML                   6.0
querystring-parser       1.2.4
requests                 2.28.2
requests-oauthlib        1.3.1
rsa                      4.9
scikit-image             0.19.3
scikit-learn             1.2.1
scipy                    1.10.1
seaborn                  0.12.2
setuptools               65.6.3
shap                     0.41.0
SimpleITK                2.0.2
six                      1.16.0
sklearn                  0.0.post1
slicer                   0.0.7
smmap                    5.0.0
SQLAlchemy               1.4.46
sqlparse                 0.4.3
tabulate                 0.9.0
tensorboard              2.12.0
tensorboard-data-server  0.7.0
tensorboard-plugin-wit   1.8.1
threadpoolctl            3.1.0
tifffile                 2023.2.3
torch                    1.13.1
torchmetrics             0.7.3
torchvision              0.14.1
tqdm                     4.64.1
traceback2               1.4.0
typing_extensions        4.5.0
unittest2                1.1.0
urllib3                  1.26.14
websocket-client         1.5.1
Werkzeug                 2.2.3
wheel                    0.38.4
yarl                     1.8.2
zipp                     3.14.0

Thanks for the pip freeze list - I have updated all my package versions to match yours. However, I am still getting the "not built with GPU support" error. Here is my nndet_env readout:

----- PyTorch Information -----
PyTorch Version: 1.13.1+cu117
PyTorch Debug: False
PyTorch CUDA: 11.7
PyTorch Backend cudnn: 8500
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
PyTorch Current Device Capability: (6, 1)
PyTorch CUDA available: True

----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

System Arch List: 6.1
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 20
Python Version: 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]

----- nnDetection Information -----
det_num_threads 6
det_data is set True
det_models is set True

I am attempting to train on a Quadro P5000, which has compute capability 6.1 according to the link provided in the FAQ. I see 6.1 in the System Arch List but not in the PyTorch CUDA Arch List, which might be my problem. I ran TORCH_CUDA_ARCH_LIST=6.1 in my Linux console because I thought that would update the PyTorch CUDA Arch List, but it updated the System Arch List instead (as you can see in my printout; there used to be additional values besides 6.1 before I ran that line).

What do I run to update the PyTorch CUDA Arch List to include 6.1? Or could there be a different problem?

I have already deleted the "build" folder from /opt/code/nndet as suggested in the FAQ.

I also tried running "TORCH_CUDA_ARCH_LIST="6.1" pip install -e ." and it rebuilt but still shows:

PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
PyTorch Current Device Capability: (6, 1)

i.e. still missing 6.1 from the PyTorch CUDA Arch List.

Thanks!

AFAIK the only way to update the PyTorch arch list is to compile PyTorch from source. It is the arch list for which PyTorch was built, so there is nothing nnDetection can do about that. You could try an older PyTorch version (changes might be necessary for versions below 1.10) to check whether it supports your arch.
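As a side note on reading the two arch lists: NVIDIA's documented binary-compatibility rule says that code built for compute capability X.y can run on devices with capability X.z where z >= y. The sketch below applies that rule to the values from this thread. It is only an illustration; the helper names are hypothetical, `compute_*` (PTX) entries are ignored, and nothing here establishes which setting actually mattered for this installation.

```python
def parse_sm(arch):
    """Turn an entry such as 'sm_61' into a (major, minor) tuple, e.g. (6, 1)."""
    digits = arch.split("_", 1)[1]
    return int(digits[:-1]), int(digits[-1])


def binary_compatible_archs(arch_list, capability):
    """List the sm_* entries whose binaries can run on a device of `capability`.

    Rule applied: same major version, and minor version not above the device's.
    """
    major, minor = capability
    compatible = []
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip PTX-only entries such as 'compute_86'
        arch_major, arch_minor = parse_sm(arch)
        if arch_major == major and arch_minor <= minor:
            compatible.append(arch)
    return compatible


if __name__ == "__main__":
    # Values copied from the nndet_env readout above.
    pytorch_arch_list = ["sm_37", "sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]
    device_capability = (6, 1)
    print(binary_compatible_archs(pytorch_arch_list, device_capability))
```

Under this rule the `sm_60` entry covers a 6.1 device, so a missing `sm_61` in the PyTorch list does not by itself imply that the GPU is unsupported.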

Well, I tried every PyTorch version between 1.6.0 and 1.13.1 and none of them has 'sm_61' in the PyTorch CUDA Arch List. So I think I am out of luck with my Quadro P5000 there.

Do you know how I would go about compiling PyTorch from source with 6.1 in the arch list? I have not compiled anything from source before.

Could you still share the nnDetection installation log to make sure that the error is not related to something else?

PyTorch has instructions for compiling from source in their GitHub repo; I would recommend following them.

Actually, I ran nndet_train 000 this morning and the GPU suddenly works! Something must have clicked into place during all my package variation. Here is a printout of my successful nndet_env:

----- PyTorch Information -----
PyTorch Version: 1.13.1+cu117
PyTorch Debug: False
PyTorch CUDA: 11.7
PyTorch Backend cudnn: 8500
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
PyTorch Current Device Capability: (6, 1)
PyTorch CUDA available: True

----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0

System Arch List: 5.2 6.0 6.1 7.0 7.5 8.0 8.6+PTX
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 20
Python Version: 3.8.5 (default, Sep 4 2020, 07:30:14)
[GCC 7.3.0]

So now I am trying to run the "nndet_train 000 --sweep" on the small test set of only 10 train and 10 test, and it says it will take about 2 hours per epoch, does that seem right? The progress bar looks like this:

Epoch 0: 21%|████████████████████████▉ | 554/2600 [26:38<1:38:11, 2.88s/it, loss=0.0264, v_num=3aa8]

Where is the 2600 number coming from if I only have 10 samples? My real dataset is about 1,000 samples so 200 hours per epoch would not be feasible. Running nvidia-smi does show that my GPU is being used: 9733MiB / 16384MiB.
Here is my full training printout:

root@eqd7eaej1fe7:/opt/code/nndet# nndet_train 000 --sweep
Overwrites: None
Found existing folder /opt/models/Task000D3_Example/RetinaUNetV001_D3V001_3d/fold0, this run will overwrite the results inside that folder
INFO Log file at /opt/models/Task000D3_Example/RetinaUNetV001_D3V001_3d/fold0/train.log
ERROR Was not able to read git information, trying to continue without.
INFO Using splits /opt/data/Task000D3_Example/preprocessed/splits_final.pkl with fold 0
INFO Architecture overwrites: {} Anchor overwrites: {}
INFO Building architecture according to plan of RetinaUNetV001
INFO Start channels: 32; head channels: 128; fpn channels: 128
INFO Discarding anchor generator kwargs {'stride': 1}
INFO Building:: encoder Encoder: {}
INFO Building:: decoder UFPNModular: {'min_out_channels': 8, 'upsampling_mode': 'transpose', 'num_lateral': 1, 'norm_lateral': False, 'activation_lateral': False, 'num_out': 1, 'norm_out': False, 'activation_out': False}
INFO Running ATSS Matching with num_candidates=4 and center_in_gt False.
INFO Building:: classifier BCECLassifier: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'mean', 'loss_weight': 1.0, 'prior_prob': 0.01}
INFO Init classifier weights: prior prob 0.01
INFO Building:: regressor GIoURegressor: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'sum', 'loss_weight': 1.0, 'learn_scale': True}
INFO Learning level specific scalar in regressor
INFO Overwriting regressor conv weight init
INFO Building:: head DetectionHeadHNMNative: {} sampler HardNegativeSamplerBatched: {'batch_size_per_image': 32, 'positive_fraction': 0.33, 'pool_size': 20, 'min_neg': 1}
INFO Sampling hard negatives on a per batch basis
INFO Building:: segmenter DiCESegmenterFgBg {'dice_kwargs': {'batch_dice': True}}
INFO Running batch dice True and do bg False in dice loss.
INFO Model Inference Summary:
detections_per_img: 100
score_thresh: 0
topk_candidates: 10000
remove_small_boxes: 0.01
nms_thresh: 0.6
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:446: UserWarning: Checkpoint directory /opt/models/Task000D3_Example/RetinaUNetV001_D3V001_3d/fold0 exists and is not empty.
rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
INFO Using 1 GPUs for training
INFO Using None plugins for training
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
INFO Initialize SWA with swa epoch start 49
INFO Augmentation: BaseMoreAug transforms and base_more params
INFO Loading network patch size [160 128 128] and generator patch size [289, 249, 272]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO Running: initial_lr 0.01 weight_decay 3e-05 SGD with momentum 0.9 and nesterov True

[model architecture]

19.1 M Trainable params
0 Non-trainable params
19.1 M Total params
76.244 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]INFO Using validation DataLoader3DOffset with {}
INFO Building Sampling Cache for Dataloder
Sampling Cache: 100%|██████████████████████████| 2/2 [00:00<00:00, 7966.39it/s]
INFO Using 5 num_processes and 2 num_cached_per_queue for augmentation. | 0/2 [00:00<?, ?it/s]
INFO VALIDATION KEYS:
odict_keys(['case_0', 'case_7'])
Validation sanity check: 0%| | 0/10 [00:00<?, ?it/s]using pin_memory on device 0
Validation sanity check: 100%|███████████████████████| 10/10 [00:17<00:00, 1.62s/it]INFO Val loss reached: 1.82636
/opt/conda/lib/python3.8/site-packages/sklearn/metrics/_ranking.py:1029: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
warnings.warn(
INFO mAP@0.1:0.5:0.05: 0.000 AP@0.1: 0.000 AP@0.5: 0.000
INFO Proxy FG Dice: 0.017
INFO This epoch took 17 s
INFO Using training DataLoader3DOffset with {}
INFO Building Sampling Cache for Dataloder
Sampling Cache: 100%|████████████████████████████| 8/8 [00:00<00:00, 12056.93it/s]
INFO Using 5 num_processes and 2 num_cached_per_queue for augmentation.
INFO TRAINING KEYS:
odict_keys(['case_1', 'case_2', 'case_3', 'case_4', 'case_5', 'case_6', 'case_8', 'case_9'])
Epoch 0: 0%| | 0/2600 [00:00<00:00, 3521.67it/s]using pin_memory on device 0
Epoch 0: 0%|▌ | 11/2600 [00:39<2:21:00, 3.27s/it, loss=1.52, v_num=433e]

Alright, glad to see that it works now :) I'll close the Issue for now, feel free to reopen or open a new one if anything else comes up.

Could you please reopen it actually - I did have a question buried in my last comment - copying it here:

So now I am trying to run the "nndet_train 000 --sweep" on the small test set of only 10 train and 10 test, and it says it will take about 2 hours per epoch, does that seem right? The progress bar looks like this:

Epoch 0: 21%|████████████████████████▉ | 554/2600 [26:38<1:38:11, 2.88s/it, loss=0.0264, v_num=3aa8]

Where is the 2600 number coming from if I only have 10 samples? My real dataset is about 1,000 samples so 200 hours per epoch would not be feasible. Running nvidia-smi does show that my GPU is being used: 9733MiB / 16384MiB.

nnDetection samples a predefined number of batches per epoch, which is independent of your dataset size. Please check some of the other issues, since there are already a few about slow training times where some of the parameters were optimised (the FAQ contains information as well). Given the type of GPU (high VRAM, no mixed precision support, compute capabilities), a training time of 2 hours/epoch might be normal though ...
(Note: the full training schedule is way too long for the example; the example is really only intended for testing.)
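The arithmetic behind the progress bar can be checked with a short sketch. It uses only the numbers shown in the thread (2600 iterations per epoch at about 2.88 s/it); the function names are hypothetical. Since the batch count is fixed, the estimate does not scale with the number of cases.

```python
def epoch_seconds(batches_per_epoch, seconds_per_batch):
    """Estimated wall-clock time of one epoch, rounded to whole seconds."""
    return round(batches_per_epoch * seconds_per_batch)


def format_hms(total_seconds):
    """Format a number of seconds as H:MM:SS."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # 2600 iterations and 2.88 s/it are taken from the progress bar above.
    print(format_hms(epoch_seconds(2600, 2.88)))  # 2:04:48
```

This matches the progress bar (26:38 elapsed plus 1:38:11 remaining) and stays the same whether the dataset has 10 or 1,000 cases, as long as the time per iteration is unchanged.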

If something else comes up, please prefer opening a new Issue so the problems remain searchable.