MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.


[Question] Two doubts about the Test Pool and its evaluation results in the nnDetection paper

ClimberXIE opened this issue

Hello~
Thanks for your excellent work and for sharing it!
While reading the nnDetection paper, I came across two questions that I wanted to ask:
(1) The paper mentions that the TCIA Lymph Node dataset belongs to the Test Pool, so I assumed it does not participate in the k-fold training process at all and is only used as test data to evaluate the generalization ability of the model. However, in Table 1 of the Supplementary Material, the column "Number of Scans (Tr/Ts)" lists "63/27" for the mediastinal Lymph Nodes dataset. Does this mean that 63 scans from the TCIA Lymph Node dataset were involved in training? Or what does "63/27" mean?

(2) In Fig. 3 of the paper, both "Five-Fold Cross Validation Results" and "Hold-out Test Split Results" are reported for the mediastinal lymph nodes. I noticed that the five-fold cross-validation results are slightly worse than the hold-out test split results. What is the difference between the models used for these two results? Or do the results differ simply because different samples are being tested?

I am not sure whether I have misunderstood something.
Looking forward to your reply. Thanks again!
Best wishes.

Dear @ClimberXIE ,

nnDetection has two generalisation axes:
(1) generalisation to a new dataset: this evaluates the suitability of the method, i.e. how well the rules for the parameters generalise, whether the fixed parameters really fit a new dataset, etc. Note: nnDetection is not a foundation model, which means it requires a training dataset for the new problem.
(2) generalisation of a model: this is the classic (ML) generalisation of a trained model to unseen test data from the same distribution as the training data.

This reflects the typical use case: a new detection problem arrives -> training + test data is collected and annotated -> a model is trained on the new training data (this is 1) -> the trained model is tested on unseen test data (this is 1 + 2).

The lymph node dataset belongs to the test pool since it was not part of the development of nnDetection, i.e. nnDetection was not actively tailored for detecting lymph nodes at any point. We split the lymph node dataset into a training+validation part (63 images) and a test set (27 images); these are the two numbers you mentioned.
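For illustration, here is a minimal sketch of that split scheme (not nnDetection code; it assumes scikit-learn is available and uses made-up case IDs): 27 of the 90 scans are held out for testing, and five-fold cross-validation runs only inside the remaining 63.

```python
# Illustrative split scheme (not nnDetection's own code), assuming scikit-learn.
from sklearn.model_selection import KFold, train_test_split

case_ids = [f"case_{i:03d}" for i in range(90)]  # 90 lymph node scans (hypothetical IDs)

# Hold out 27 cases as the test split; the remaining 63 form the train+val pool.
trainval_ids, test_ids = train_test_split(case_ids, test_size=27, random_state=0)

# Five-fold cross-validation is performed inside the 63-case train+val pool only.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(trainval_ids)):
    train_cases = [trainval_ids[i] for i in train_idx]
    val_cases = [trainval_ids[i] for i in val_idx]
    print(f"fold {fold}: {len(train_cases)} train / {len(val_cases)} val")

print(f"hold-out test set: {len(test_ids)} cases")
```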

The cross-validation results tend to be slightly worse for most problems since each validation case is only predicted by a single model, while for testing the ensemble of the cross-validation models (typically 5) is used.
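As a toy illustration of this difference (the scores below are made up and this is not nnDetection's inference code): a cross-validation case receives the prediction of its single fold model, whereas a hold-out test case receives the ensemble of all five fold models.

```python
# Illustrative only: single fold model vs. five-model ensemble prediction.
import numpy as np

# Hypothetical confidence scores of the five fold models for one candidate detection.
fold_scores = np.array([0.62, 0.71, 0.68, 0.74, 0.66])

single_model_score = float(fold_scores[0])   # what a CV validation case is scored with
ensemble_score = float(fold_scores.mean())   # what a hold-out test case is scored with

print(f"single fold model: {single_model_score:.2f}, ensemble of 5: {ensemble_score:.2f}")
```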

Best,
Michael


Thanks for your patient explanation; I understand now.
Best wishes!