nyukat / breast_cancer_classifier

Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening

Home Page: https://ieeexplore.ieee.org/document/8861376


some images don't work in crop_single_mammogram.py

nightandweather opened this issue

Thank you for sharing your work.
I'm trying to adapt your repository to our dataset.

```python
# Note: os, numpy (np), cv2, matplotlib.pyplot (plt), LinearSegmentedColormap,
# PATH, png_list and shared_parameters are assumed to be defined in earlier cells.
score_heat_list = []
import glob

def make_dir(name):
    if not os.path.isdir(name):
        os.makedirs(name)
        print(name, "folder has been created.")
    else:
        print("The folder already exists.")

make_dir('save_imageheatmap_model_figure_folder')

def json_extract_feature(json_data):
    patient = json_data['case_id']

    # read_all_data:
    """
    components
    'user id' = no
    'case_id' = split.('_')[1] = patients number
    'contour_list' = dict('image_type', dict())
    """
    temp_image_type = []
    temp_image_type1 = []
    temp_image_type2 = []
    temp_image_type3 = []

    temp_key = []

    temp_contour = []
    temp_contour1 = []
    temp_contour2 = []
    temp_contour3 = []

    for image_type in json_data['contour_list']['cancer']:
        # print(image_type)
        if image_type == 'lcc':
            temp_image_type.append(image_type)
        if image_type == 'lmlo':
            temp_image_type1.append(image_type)
        if image_type == 'rcc':
            temp_image_type2.append(image_type)
        if image_type == 'rmlo':
            temp_image_type3.append(image_type)

        for key in json_data['contour_list']['cancer'][image_type]:
            # print(key)
            for contour in json_data['contour_list']['cancer'][image_type][key]:
                # print(contour)
                # print(contour.get('x'))
                # print(contour.get('y'))
                bin_list = [contour.get('y'), contour.get('x')]
                if image_type == 'lcc':
                    temp_contour.append(bin_list)
                if image_type == 'lmlo':
                    temp_contour1.append(bin_list)
                if image_type == 'rcc':
                    temp_contour2.append(bin_list)
                elif image_type == 'rmlo':
                    temp_contour3.append(bin_list)

    return temp_image_type, temp_image_type1, temp_image_type2, temp_image_type3, temp_contour, temp_contour1, temp_contour2, temp_contour3

from skimage import draw

def polygon2mask(image_shape, polygon):
    """Compute a mask from polygon.

    Parameters
    ----------
    image_shape : tuple of size 2.
        The shape of the mask.
    polygon : array_like.
        The polygon coordinates of shape (N, 2) where N is
        the number of points.

    Returns
    -------
    mask : 2-D ndarray of type 'bool'.
        The mask that corresponds to the input polygon.

    Notes
    -----
    This function does not do any border checking, so all
    the vertices need to be within the given shape.

    Examples
    --------
    >>> image_shape = (128, 128)
    >>> polygon = np.array([[60, 100], [100, 40], [40, 40]])
    >>> mask = polygon2mask(image_shape, polygon)
    >>> mask.shape
    (128, 128)
    """
    polygon = np.asarray(polygon)
    vertex_row_coords, vertex_col_coords = polygon.T
    fill_row_coords, fill_col_coords = draw.polygon(
        vertex_row_coords, vertex_col_coords, image_shape)
    mask = np.zeros(image_shape, dtype=bool)  # np.bool is deprecated; use the builtin bool
    mask[fill_row_coords, fill_col_coords] = True
    return mask
##############################################################################################################################
from tqdm import tqdm
from src.heatmaps.run_producer_single import produce_heatmaps
import json
import pickle
from PIL import Image

annotation_folder = r'/home/ncc/Desktop/2020_deep_learning_breastcancer/annotation_SN/'

# crop_single_mammogram, get_optimal_center_single, load_inputs, load_model,
# process_augment_inputs and batch_to_tensor are assumed to have been imported
# from the repository in earlier cells.
for png in tqdm(png_list[0:8]):
    print(PATH + png)
    crop_single_mammogram(PATH + png, horizontal_flip='NO',
                          view=png.split('_')[1].split('.')[0],
                          cropped_mammogram_path=PATH + 'cropped_image/' + png,
                          metadata_path=PATH + png.split('.')[0] + '.pkl',
                          num_iterations=100, buffer_size=50)
    print(PATH + 'cropped_image/' + png)
    get_optimal_center_single(PATH + 'cropped_image/' + png, PATH + png.split('.')[0] + '.pkl')
    model_input = load_inputs(
        image_path=PATH + 'cropped_image/' + png,
        metadata_path=PATH + png.split('.')[0] + '.pkl',
        use_heatmaps=False,
    )
####################################################################################################################################
    parameters = dict(
        device_type='gpu',
        gpu_number='0',

        patch_size=256,

        stride_fixed=20,
        more_patches=5,
        minibatch_size=10,
        seed=np.random.RandomState(shared_parameters["seed"]),

        initial_parameters="/home/ncc/Desktop/breastcancer/nccpatient/breast_cancer_classifier/models/sample_patch_model.p",
        input_channels=3,
        number_of_classes=4,

        cropped_mammogram_path=PATH + 'cropped_image/' + png,
        metadata_path=PATH + png.split('.')[0] + '.pkl',
        heatmap_path_malignant=PATH + png.split('.')[0] + '_malignant_heatmap.hdf5',
        heatmap_path_benign=PATH + png.split('.')[0] + '_benign_heatmap.hdf5',

        heatmap_type=[0, 1],  # 0: malignant, 1: benign

        use_hdf5="store_true"
    )
###########################################################################################################################

    # read annotation SN00000016_L-CC.png

    # Reading the code, the JSON file with the same name is read 4 times; this should be fixed when the code is cleaned up
    # The annotations are defined on the ORIGINAL image, not the cropped one, yet the image displayed here is the cropped one

    # print(png.split('_')[0])
    with open(PATH + png.split('.')[0] + '.pkl', 'rb') as f:
        location_data = pickle.load(f)
    print(location_data)
    start_point1 = list(location_data['window_location'])[0]
    endpoint1 = list(location_data['window_location'])[1]
    start_point2 = list(location_data['window_location'])[2]
    endpoint2 = list(location_data['window_location'])[3]
    print(start_point1, start_point2)
    with open(annotation_folder + 'Cancer_' + png.split('_')[0] + '.json') as json_file:
        json_data = json.load(json_file)

    temp_image_type, temp_image_type1, temp_image_type2, temp_image_type3, temp_contour, temp_contour1, temp_contour2, temp_contour3 = json_extract_feature(json_data)

    import operator
    if png.split('_')[1].split('.')[0] == 'L-CC':
        new_contour_list = temp_contour
    if png.split('_')[1].split('.')[0] == 'L-MLO':
        new_contour_list = temp_contour1
    if png.split('_')[1].split('.')[0] == 'R-CC':
        new_contour_list = temp_contour2
    if png.split('_')[1].split('.')[0] == 'R-MLO':
        new_contour_list = temp_contour3

    im = Image.open(PATH + png)
    im_cropped = Image.open(PATH + 'cropped_image/' + png)
    print('original image:', im.size, 'cropped image:', im_cropped.size)
    new_contour = []
    for image_list in new_contour_list:
        # print('_', image_list)
        new_temp_contour = map(operator.add, image_list, reversed(list(np.array(im.size) / 2)))
        new_contour.append(list(new_temp_contour))
        # print(new_contour)
    try:
        # 'window_location': (103, 2294, 0, 1041)
        img = polygon2mask(im.size[::-1], np.array(list(new_contour)))
        img_cropped = img[start_point1:endpoint1, start_point2:endpoint2]
        im = cv2.imread(PATH + png)
        im_cropped = cv2.imread(PATH + 'cropped_image/' + png)
    except ValueError as e:
        img = np.zeros(im.size)
###########################################################################################################################
    random_number_generator = np.random.RandomState(shared_parameters["seed"])

    produce_heatmaps(parameters)
    image_heatmaps_parameters = shared_parameters.copy()
    image_heatmaps_parameters["view"] = png.split('_')[1].split('.')[0]
    image_heatmaps_parameters["use_heatmaps"] = True
    image_heatmaps_parameters["model_path"] = "/home/ncc/Desktop/breastcancer/nccpatient/breast_cancer_classifier/models/ImageHeatmaps__ModeImage_weights.p"

    model, device = load_model(image_heatmaps_parameters)

    model_input = load_inputs(
        image_path=PATH + 'cropped_image/' + png,
        metadata_path=PATH + png.split('.')[0] + '.pkl',
        use_heatmaps=True,
        benign_heatmap_path=PATH + png.split('.')[0] + '_benign_heatmap.hdf5',
        malignant_heatmap_path=PATH + png.split('.')[0] + '_malignant_heatmap.hdf5')

    batch = [
        process_augment_inputs(
            model_input=model_input,
            random_number_generator=random_number_generator,
            parameters=image_heatmaps_parameters,
        ),
    ]

    tensor_batch = batch_to_tensor(batch, device)
    y_hat = model(tensor_batch)
###############################################################
    fig, axes = plt.subplots(1, 5, figsize=(16, 4))
    x = tensor_batch[0].cpu().numpy()
    axes[0].imshow(im, cmap="gray")
    axes[0].imshow(img, cmap='autumn', alpha=0.4)
    axes[0].set_title("OG_Image")

    axes[1].imshow(im_cropped, cmap="gray")
    axes[1].imshow(img_cropped, cmap='autumn', alpha=0.4)
    axes[1].set_title("Image")

    axes[2].imshow(x[0], cmap="gray")
    axes[2].imshow(img_cropped, cmap='autumn', alpha=0.4)
    axes[2].set_title("Image")

    axes[3].imshow(x[1], cmap=LinearSegmentedColormap.from_list("benign", [(0, 0, 0), (0, 1, 0)]))
    axes[3].set_title("Benign Heatmap")

    axes[4].imshow(x[2], cmap=LinearSegmentedColormap.from_list("malignant", [(0, 0, 0), (1, 0, 0)]))
    axes[4].set_title("Malignant Heatmap")
    plt.savefig('save_imageheatmap_model_figure_folder' + '/' + png.split('.')[0] + '.png')
    ################################################################
    predictions = np.exp(y_hat.cpu().detach().numpy())[:, :2, 1]
    predictions_dict = {
        "image": png,
        "benign": float(predictions[0][0]),
        "malignant": float(predictions[0][1]),
    }

    print(predictions_dict)
    score_heat_list.append(predictions_dict)
```

Screenshot from 2020-11-25 11-38-47

The attached file shows cropped mammograms produced by this code.
The issue is that some mammograms don't crop well. Am I doing something wrong?

Hi @nightandweather,

It seems that the issue you are experiencing is that the crop_single_mammogram function from src/cropping/crop_single.py does not remove the background for some of your images. Is this correct?

Our cropping algorithm makes some strict assumptions about the data. If your dataset violates any of these assumptions, the cropping algorithm will not work well:

  • the background is strictly 0-valued everywhere
  • if there are nonzero artifacts in the background, they are thin enough to disappear after 100 iterations of erosion.

For more information, please refer to Algorithm 1 in our data report.

I would recommend trying any one of the following to address the issue:

  • Using a threshold greater than 0 when generating masks. Instead of img_mask = img > 0, try img_mask = img > thres where thres is some small positive integer. Try several different values until it works well (a sketch of this idea follows this list).
  • Contrast-adjusting your input images such that the background has strictly 0 pixel values.
  • Increasing the --num-iterations value above 100. Try several different values until it works well. Note that if you increase this value too much, the algorithm will start to fail entirely at capturing the breast area.
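
Below is a minimal sketch of the thresholded-masking idea from the first bullet. It is not the repository's actual cropping code; the function name `make_foreground_mask`, the threshold values, and the erosion/dilation steps are only illustrative of what to change at the masking step in src/cropping.

```python
# Illustrative sketch only: in the repository, the line to change is the one
# equivalent to `img_mask = img > 0` inside the cropping code (src/cropping).
import imageio
import numpy as np
from scipy import ndimage


def make_foreground_mask(image_path, threshold=10, num_iterations=100):
    """Separate breast from background using a small positive threshold instead of 0."""
    img = imageio.imread(image_path)              # a 16-bit png loads as uint16
    img_mask = img > threshold                    # instead of img > 0
    # Erode to remove thin background artifacts, then dilate back,
    # roughly in the spirit of Algorithm 1 in the data report.
    img_mask = ndimage.binary_erosion(img_mask, iterations=num_iterations)
    img_mask = ndimage.binary_dilation(img_mask, iterations=num_iterations)
    return img_mask


# Try several thresholds and visually check which mask covers the breast cleanly:
# for thres in (5, 10, 25, 50):
#     mask = make_foreground_mask("SN00000016_L-CC.png", threshold=thres)
```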

Thank you for your comment!
Just as you said, the image thresholding method works!

Actually, there are more issues in adapting this github code to our dataset.

The attached ROC curve is the result of following your Github repository.

Screenshot from 2020-11-26 16-46-54

I used 10 patients with malignant annotations (a single breast diagnosed as malignant, left or right) × 4 standard mammography views to get the ROC curve, and the annotation label list is

[1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

e.g., if the left breast is malignant, it is assumed to be malignant in both the L-MLO and L-CC views.

and the model got a low score.

test_image_model.zip

Maybe I made a mistake in preprocessing. As stated in the paper, the pixel array of each dcm was saved as a uint16 png using the standardization code provided on github.
This is my dicom-to-16-bit-png code:

```python
# It seems the 16-bit conversion did not happen correctly. Let's rebuild the pngs.
"""
cv2.imread reads images as 8-bit by default.
"""
#########################################################################
import natsort
import imageio
import numpy as np

def standard_normalize_single_image(image):
    """
    Standardizes an image in-place
    """
    image = image - np.mean(image)  # passing through np.mean seems to convert it to float64..
    image /= np.maximum(np.std(image), 10**(-5))
    return image

def load_dcm_data(path):
    shpfiles = []
    labels = []
    # annotation_files = pd.read_excel(csv)
    for dirpath, subdirs, files in os.walk(path):
        for x in files:
            if x.endswith(".dcm"):
                shpfiles.append(os.path.join(dirpath, x))

    return natsort.natsorted(shpfiles)

# Convert the dicom files to png again.

# This is a time-consuming job, so pick out only a few to convert.

#######################################################################################
import pydicom
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

base_malignant_folder = r'/home/ncc/Desktop/2020_deep_learning_breastcancer/submit_breast_SN/malignant'
test_folder_path = r'/home/ncc/Desktop/2020_deep_learning_breastcancer/test_malignant_folder/'
sample_malignant_list = load_dcm_data(base_malignant_folder)[0:40]
sample_malignant_list
threshold = 100
for f in sample_malignant_list:
    file_name = os.path.basename(f)

    ds = pydicom.dcmread(f)   # read dicom image
    img = ds.pixel_array      # get image array (0 ~ 4095)
    img[img < 150] = 0

    plt.hist(img.ravel())
    plt.show()

    # What the paper expects is a standardized 16-bit mammography png
    img = standard_normalize_single_image(img)
    img2 = (65535 * (img - img.min()) / img.ptp()).astype(np.uint16)  # (0 ~ 65535)

    # img_threshold = img_threshold.reshape(img2.shape)
    # Get the information of the incoming image type

    print('scaler applied', img2.max(), img2.min())
    imageio.imwrite(test_folder_path + file_name.split('.')[0] + '.png', img2.astype(np.uint16))  # write png image
```

And while I was looking at past issues,
#9 (comment)

do I have to change the annotated labels according to the results of the model in order to raise the AUC score?

Also, I'd like to know the parameter values needed to get the 3-channel (gray, benign, malignant) image shown in the paper (the image-heatmap model parameters).
I tried various parameter values through dictionaries and carried out a grid search, but found no satisfactory figure.

Thank you for taking my question!

It seems that you are trying to load the images and feed them to the model on your own, instead of using our pipeline. I noticed some misunderstandings and problematic lines of code in your custom pipeline.

  • img[img<150]=0 is problematic because you are modifying the image directly, treating all values below 150 as background. This might or might not be a good threshold for the background, but even if it is, it could also erase some regions inside the breast. In addition, even if this thresholding works reasonably for some of your images, it might not be an ideal value for the others. What I meant was to use thresholding when generating the mask in the cropping algorithm.
  • The standard_normalize_single_image function does not return anything; it modifies the given array in-place. Therefore, if you did not change our function, img = standard_normalize_single_image(img) will set img to None. It seems that you changed the behavior of our function so that it returns a value. This might be problematic later when some of our code calls this function.
  • We do not expect the saved png images to have already been standardized. The standardization happens as part of the image loading pipeline. It suffices to save the original dicom images as 16-bit png files. You don't even necessarily have to rescale the image to 0~65535, as the images will later be standardized.
  • As you said, the default behavior of cv2.imread is to read the saved 16-bit png files as 8-bit images. You should be using the provided read_image_png function, which uses imageio.imread instead (see the quick check below).
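
As a quick sanity check on that last point, you can compare how the same file comes back from different readers; the path below is just a placeholder for one of your saved 16-bit pngs:

```python
import cv2
import imageio

path = "SN00000016_L-CC.png"  # placeholder: one of your saved 16-bit pngs

img_cv2_default = cv2.imread(path)                    # default flag: 8-bit, 3-channel
img_cv2_raw = cv2.imread(path, cv2.IMREAD_UNCHANGED)  # keeps the original bit depth
img_imageio = imageio.imread(path)                    # what read_image_png relies on

print(img_cv2_default.dtype)   # uint8  -> precision already lost
print(img_cv2_raw.dtype)       # uint16
print(img_imageio.dtype)       # uint16
print(int(img_imageio.max()))  # should exceed 255 for a true 16-bit image
```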

As @kjgeras mentioned in #9 (comment), even the slightest differences in the loading pipeline can lead to random predictions.

Even if all the preprocessing is done correctly, if your dataset itself is drastically different from our own, it could also affect the performance, as mentioned in #19.

  • When I look at the screenshot of some of the cropped mammograms, I notice that you have different sets of images which have undergone different contrast adjustments. Specifically, images from SN0000006 look like the image intensity has been adjusted with some windowing, but other images from SN0000001, SN0000002, SN0000004, SN0000005 look like no contrast adjustment has been performed on them. They look like they might have an ImageType value of 'ORIGINAL' in the dicom metadata, which we reject in our filtering process.
  • There could be other differences in dicom metadata as well between the datasets. You can get more information by looking at section 2.C in our data report.
  • The difference might also come from the manufacturer of the scanner. You can look into table 4 in our data report.

I suggest you look at the debugging strategy written by @kjgeras which can be found in #9 (comment)

At this stage, I recommend that you clone our repository again without any modification, fix the cropping algorithm to use a nonzero masking threshold, and use the provided pipeline as-is (run.sh or run_single.sh). For the images, you can just save the dicom pixel_array as 16-bit png files without any standardization or normalization (a minimal example is sketched below). This way, you can be more sure that you are preprocessing the images the way we expect. If you still do not get reasonable performance, you can try examining the dicom metadata to see if you applied the same filtering criteria as we did.
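
For reference, a minimal sketch of that conversion, assuming pydicom and imageio are available; the file paths are placeholders:

```python
import pydicom
import imageio
import numpy as np

dcm_path = "SN00000001_L-CC.dcm"   # placeholder input path
png_path = "SN00000001_L-CC.png"   # placeholder output path

ds = pydicom.dcmread(dcm_path)
pixels = ds.pixel_array.astype(np.uint16)  # no thresholding, standardization, or rescaling
imageio.imwrite(png_path, pixels)          # saved as a 16-bit grayscale png
```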

And while I was looking at past issues,
#9 (comment)

do I have to change the annotated labels according to the results of the model in order to raise the AUC score?

No, you must not change the labels according to the results of the model when calculating AUC. What that comment was discussing is how to set the decision threshold, which has nothing to do with AUC calculation. AUC, by contrast, is a metric that measures the model's ability to distinguish between malignant and non-malignant cases across all threshold values. To learn more about AUC, I recommend reading this article about the receiver operating characteristic curve.
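
In other words, AUC is computed from the fixed ground-truth labels and the model's continuous scores; nothing is thresholded or relabeled. A minimal sketch with scikit-learn (the numbers below are made up for illustration):

```python
from sklearn.metrics import roc_auc_score

# Ground-truth labels stay fixed: 1 = malignant view, 0 = not malignant.
labels = [1, 1, 0, 0, 1, 0, 1, 0]
# Model outputs, e.g. the "malignant" probabilities collected in predictions_dict.
scores = [0.62, 0.45, 0.12, 0.30, 0.81, 0.05, 0.40, 0.22]

# roc_auc_score sweeps over every possible threshold internally.
print(roc_auc_score(labels, scores))
```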

Also, I'd like to know the parameter values needed to get the 3-channel (gray, benign, malignant) image shown in the paper (the image-heatmap model parameters).
I tried various parameter values through dictionaries and carried out a grid search, but found no satisfactory figure.

I am not sure what you mean by this. Do you mean you changed heatmap parameters such as stride_fixed and patch_size from src/heatmaps/run_producer.py or src/heatmaps/run_producer_single.py? You should not have changed these parameters, as the classifier model expects heatmaps generated with the predefined parameters we provided.

The reason why you are not seeing satisfactory heatmaps might also be related to the differences between the datasets and the issues in your custom pipeline I explained in #42 (comment). If you are not feeding the images the way we do, you cannot expect reasonable heatmaps.