deep-learning-with-pytorch / dlwpt-code

Code for the book Deep Learning with PyTorch by Eli Stevens, Luca Antiga, and Thomas Viehmann.

Home Page: https://www.manning.com/books/deep-learning-with-pytorch


Data preprocessing in Part 2

kasozivincent opened this issue

Hello everyone.
I am a bit stuck in part 2, especially on the data preparation.

  1. Why do some nodules occur more than once in candidates.csv?
  2. What specifically are image coordinates? I don't understand the logic of converting from world coordinates (what are they, anyway? Is it the Cartesian x, y, z system?) to voxel coordinates, or why we should care to convert between the two.
  3. Why are we manually splitting the dataset? Can't we rely on PyTorch or scikit-learn to randomly split it for us?

Thank you!

Don't despair! Data is a mess, and so this is a bit closer to real-world data than, say, ImageNet.

These are good questions (and see the end of chapter 14 for some of the hiccups we had with the data).

  1. I don't know off the top of my head how candidates.csv was created, but two possible reasons would be
    • they are from different annotators annotating the same nodule,
    • they have been annotated slice by slice and whatever cleanup procedure has been used has not combined them yet.
  2. In the real world, you and I are measured in mm (see the picture of the person in the CT scanner), and the CT scan we get is in voxels. The CSV (which is our input data) has been supplied in mm coordinates.
    Different CT scanners will have different resolutions, and depending on the application we might even want to rescale, etc. Also, when we find results, an "x mm nodule" is likely much more interesting to a radiologist than a "y pixel nodule". In this way, the patient mm coordinates are much more objective. (See the conversion sketch after this list.)
    Keep in mind that this is not only the situation for us here; it comes up generally when working with imaging, and medical imaging in particular.
    If the 3D nature adds a certain unwieldiness, you might also think of a digital photograph. Whatever edge you see has a size in pixels in the photo and a size in mm in the real world. There is a difference between that model Lamborghini I might just barely afford and a real one, even if the model has been done really well. In a way the CT situation is simpler than the photograph one, as we don't have a (not quite known) perspective and projection.
  3. Usually one has constraints (such as splitting between patients rather than within one patient's data, stratifying by something, ...), and it's satisfying the constraints that is hard; the actual splitting is easy. This makes it more natural to view the split as something you do yourself rather than delegating it to a library. (See the split sketch after this list.)
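In code, converting from patient (mm) coordinates to voxel indices boils down to undoing the scan's origin offset, axis orientation, and voxel size in turn. This is a simplified sketch, not the book's exact util.py helper (which, if memory serves, also flips the result into index/row/column order); `direction` here is the 3x3 orientation matrix, e.g. SimpleITK's GetDirection() reshaped:

```python
import numpy as np

def xyz_to_voxel(coord_xyz, origin_xyz, spacing_xyz, direction):
    # Invert xyz = direction @ (voxel * spacing) + origin:
    # subtract the origin, undo the orientation, divide by the voxel size.
    offset = np.array(coord_xyz) - np.array(origin_xyz)
    voxel = np.linalg.inv(direction) @ offset / np.array(spacing_xyz)
    return np.round(voxel).astype(int)  # nearest whole voxel index
```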
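And for the split, a version that keeps all of a patient's data on the same side might look like this (a sketch; the candidate objects and the `series_uid` attribute are illustrative, not the book's exact types):

```python
import random

def split_by_series(candidates, val_fraction=0.1, seed=42):
    # Split at the series level so no patient's data leaks
    # between training and validation.
    uids = sorted({c.series_uid for c in candidates})
    random.Random(seed).shuffle(uids)
    val_uids = set(uids[:int(len(uids) * val_fraction)])
    train = [c for c in candidates if c.series_uid not in val_uids]
    val = [c for c in candidates if c.series_uid in val_uids]
    return train, val
```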

Thank you! So can I assume that the world coordinate system is the same as the patient coordinate system, and that it is the coordinate system we use in calculus lessons?
And for different resolutions, why should we care? Here is an analogy: say I take a photo of a dog using a 5 MP camera and my friend uses an iPhone X to shoot the same dog. If we fed those photos to a CNN, would it fail to identify the dog?
Thank you, Mr. Thomas

Yes, the real-world coordinates are always in the patient coordinate system shown in figure 10.6 and converted as in figure 10.7.

CNN detection accuracy is indeed resolution dependent; this is why we often use RandomResizedCrop for augmentation, so that crops appear at a range of resolutions (the default parameters span roughly an order of magnitude here).
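For 2D images, that augmentation is a one-liner with torchvision (the 224-pixel output size is just an example):

```python
from torchvision import transforms

# The default scale=(0.08, 1.0) samples crop areas spanning roughly an
# order of magnitude; every crop is then resized to the same output size,
# so the network sees objects at many effective resolutions.
augment = transforms.RandomResizedCrop(224, scale=(0.08, 1.0))
```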

Also, scale matters a lot in the real world. If you take a peek ahead at chapter 14, you'll see that size is a fairly strong baseline for determining whether a nodule would likely be considered malignant by a radiologist.
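In its simplest form, such a baseline is just a threshold on the nodule diameter (a toy sketch; the 10 mm cutoff is illustrative, not the book's tuned value):

```python
def size_baseline(diameter_mm, threshold_mm=10.0):
    # Predict "likely malignant" purely from the nodule's size.
    return diameter_mm >= threshold_mm
```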

Hello Mr. Thomas.
Have you considered adding a section to the book that fully describes the structure of a DICOM file and the attributes we need for our data processing? The book really flows well until this part, but attributes like GetSpacing just pop up, and the explanations in the book are quite general, in my opinion, and don't seem to address the problem at hand (it's just my suggestion! I'm not the smartest).
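For instance, these are the SimpleITK calls where those attributes show up when a series is loaded (a minimal sketch; the path is a placeholder):

```python
import SimpleITK as sitk

ct_mhd = sitk.ReadImage('path/to/series.mhd')  # placeholder path
ct_a = sitk.GetArrayFromImage(ct_mhd)          # voxel data as a numpy array

origin_xyz = ct_mhd.GetOrigin()    # mm position of voxel (0, 0, 0) in patient space
spacing_xyz = ct_mhd.GetSpacing()  # mm per voxel along each axis
direction = ct_mhd.GetDirection()  # flattened 3x3 matrix relating voxel axes to patient axes
```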

I also read somewhere that the array data in the file are not HU values, and that we always need to convert them back to HU using the RescaleSlope and RescaleIntercept attributes of the DICOM file (I assume they are in the .mhd files too). I am still really struggling with the CT scan details. If you have time, I'd suggest writing a well-detailed article, to save you from repeating the same explanation in the future. It seems you get a lot of questions about it; you knew from Twitter that this was the same problem I had.
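For example, something like this with pydicom (a sketch; the path is a placeholder):

```python
import pydicom

ds = pydicom.dcmread('path/to/slice.dcm')  # placeholder path
# Stored pixel values -> Hounsfield units via the DICOM rescale attributes.
hu = ds.pixel_array * float(ds.RescaleSlope) + float(ds.RescaleIntercept)
# (whether the .mhd files need this step too is exactly my question)
```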
Thank you, and I wish you a happy new year.

Hello!
As @kasozivincent said, the CT scan details get a little bit short-changed in this book (otherwise: great code and explanations!). Could we have a more detailed explanation of the conversion between the different coordinate systems, and also of the .mhd files?
Thanks a lot!