keras-team / keras

Deep Learning for humans

Home Page: http://keras.io/

Dense Prediction API Design, Including Segmentation and Fully Convolutional Networks

ahundt opened this issue

This issue is to develop an API design for dense prediction tasks such as Segmentation, which includes Fully Convolutional Networks (FCN), and was based on the discussion at #5228 (comment). The goal is to ensure Keras incorporates best practices by default for this sort of problem. Community input, volunteers, and implementations will be very welcome. #6655 is where preprocessing layers can be discussed.

Motivating Tasks and Datasets

Reference Materials

Feature Requests

These are ideas rather than a finalized proposal so input is welcome!

  • Input data: Support one or more Images as input + Supplemental data (ex: image + vector)
  • Augmentation of Input Data and Dense Labels
    • Example: Both image and label must be zoomed & translated equally in Pascal VOC
  • Input image dimensions should be able to vary
    • Ideally by height, width & number of channels
  • Loss function "2D" support, such as single and multi label results for each pixel in an image
  • class_weight support for dense labels
    • Example: Single class weight value for each class in an image segmentation task such as in Pascal VOC 2012.
  • Sparse to Dense Prediction weight transfer
  • Automatic Sparse to Dense Model conversion (advanced; see the sketch after this list)
    • configuration at each downsampling stage
    • remove pooling layers and apply an equivalent atrous dilation in the next convolution layer
    • add an upsampling layer for each downsampling stage
  • SegmentationTop Layer?
    • Sigmoid single class predictions
    • Spatial Softmax argmax multi class predictions
    • Multi Label Predictions (sigmoid?)
  • "Upsample" Layer?
    • like "Activation" layer, where reasonable upsampling approaches can be defined with a simple string parameter
  • Example implementation training & testing on MSCOCO & Pascal VOC 2012 + extended berkeley labels
    • (advanced) pretrain on COCO, then fine-tune on Pascal VOC
  • COCO pycocotools JSON format dataset support, used by several datasets
    • supports multi-label segmentation, keypoint data, image descriptions, and more
  • TFRecord dataset support (probably TensorFlow only, perhaps only in the TensorFlow implementation of Keras)
  • flow_from_directory & Segmentation Data Generator
    • Keras-FCN,
    • Single class label support
    • Multi class label support
  • mean Intersection over Union (mIoU) utility (Keras-FCN)
  • Image and label masks
  • Proper palette handling for png based labels
  • sparse label format for multi-label data?
  • debugging utilities
    • save predictions to file
  • Iterative training of partial networks at varying strides, as described in the FCN paper (advanced, may not be necessary as per Keras-FCN performance)
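A minimal sketch of the sparse-to-dense conversion idea in the list above (the layer sizes and the 21-class Pascal VOC output are illustrative assumptions, not part of the proposal):

from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

inputs = Input(shape=(None, None, 3))
x = Conv2D(64, 3, padding='same', activation='relu')(inputs)

# Variant 1: keep the pooling stage but add an upsampling layer to recover
# the original resolution at the output.
a = MaxPooling2D(2)(x)
a = Conv2D(64, 3, padding='same', activation='relu')(a)
a = UpSampling2D(2)(a)

# Variant 2: remove the pooling layer and apply an equivalent atrous (dilated)
# convolution in the next convolution layer instead.
b = Conv2D(64, 3, padding='same', dilation_rate=2, activation='relu')(x)

# Per-pixel class scores, e.g. for the 21 Pascal VOC classes.
dense_logits = Conv2D(21, 1)(b)
model = Model(inputs, dense_logits)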

Existing Keras Utilities with compatible license

Questions

  • Is something as clear as 30 seconds to keras segmentation possible?
  • Is anything above missing, redundant, or out of date compared to the state of the art?
  • Should the current ImageDataGenerator be extended or is a separate class like Keras-FCN's SegDataGenerator clearer?
  • Should there be a guide of some sort?
  • What will make for useful training progress and debugging data? (sparse mIOU?, something else?)
  • What is needed to handle large datasets quickly and efficiently? (should this be out of scope?)

Should the current ImageDataGenerator be extended or is a separate class like Keras-FCN's SegDataGenerator clearer?

Depends; if you were to implement it as a subclass, which methods would be reused and which would have to be overridden?

Should there be a guide of some sort?
Is something as clear as 30 seconds to keras segmentation possible?

Sure

What is needed to handle large datasets quickly and efficiently? (should this be out of scope?)

Reading images from disk with ImageDataGenerator using multiprocessing and several processes is already pretty quick and efficient.
The HDF5Matrix can be made more efficient via use of multiprocessing (or at least threading) to avoid IO being a bottleneck.

Really interested in helping you! Maybe we should have a dedicated slack channel so we could all discuss.

I had a Mean IoU implemented somewhere, I'll try to find it!
There are a lot of formats/thoughts on how to specify bounding boxes / segmentation maps. SSD uses prior shapes, Faster R-CNN uses anchor boxes, YOLO v1 uses nothing.
Could get quite crowded.

SSD Keras has some data augmentation for boxes. We could probably use it.

@Dref360 the semantic_segmentation slack channel would work. Bounding box design input would be great because I'm not currently using them.

I would say that predicting a bounding box is a significantly different task from segmentation, in particular you may need a complicated loss function to handle many boxes. I'm also not sure if best practices are well established enough for this.

For upscaling operations popular choices include:

  • Conv2DTranspose. Already implemented in Keras.
  • Resize interpolation. See tf.image.resize and the discussion on distill.pub. The current UpSampling layer, I think, does nearest neighbor with integer upsampling factors.
  • PixelShuffle/Subpixel Convolution. See this repo for discussion. Now easy in Tensorflow with tf.depth_to_space and some Keras+Theano implementations exist.

Also pix2pix is a popular variant using adversarial training that would be nice to have as an example. There are several Keras implementations out there.

For FCNs I've found base Keras to be pretty usable, but one sticking point is that it's not easy to replace a fixed-size model or Input layer with one that has None for all the spatial dimensions, which is all you really need for an FCN that accepts inputs at multiple scales. I think the best way to do this now is to create a new instance of the same model, except for the Input layer, and use get_weights + set_weights. It would be nice if there were a convenient way to just resize the model's input spatial dimensions and have it propagate to all layers, raising an error if that's not possible, e.g. if there's a Dense layer.
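A minimal sketch of that get_weights + set_weights workaround, assuming a purely convolutional toy model (the layer sizes are placeholders):

from keras.layers import Input, Conv2D
from keras.models import Model

def build_fcn(input_shape):
    # Any model without Dense layers works; this toy network is just for illustration.
    inputs = Input(shape=input_shape)
    x = Conv2D(32, 3, padding='same', activation='relu')(inputs)
    outputs = Conv2D(21, 1)(x)  # per-pixel class scores
    return Model(inputs, outputs)

fixed = build_fcn((256, 256, 3))
# ... train `fixed` at the fixed 256x256 size ...

# Rebuild the same architecture with free spatial dimensions and copy the weights over.
flexible = build_fcn((None, None, 3))
flexible.set_weights(fixed.get_weights())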

I'd be interested in contributing as well! However, keep in mind that there are a few subtasks within the segmentation problem and that makes the task harder.

For example, all that semantic segmentation networks such as FCN, SegNet, ENet, ICNet, etc. do is pixel classification. They cannot detect objects and therefore can't differentiate between distinct instances of the same class in an image.

Other works, such as DeepMask/SharpMask/FastMask, output mask proposals for each object they detect but they do not do classification. This means that in theory they can detect objects that belong in classes they have not seen before.

Finally, Instance Segmentation does both (e.g. Instance-FCN, FCIS, Mask R-CNN). It can tell where a person ends and another begins and also outputs a class label for each instance it detects.

Detection is an inherent part of the pipeline for two of the subtasks, so if we plan to cover all three cases, I don't think we can get away with not discussing it.

@PavlosMelissinos good points; training on varied tasks like instance recognition and mask proposals should also be considered. What are the best practices for that type of data? How is it typically formatted? Masks are also sometimes useful for segmentation, such as the Pascal VOC "ambiguous regions".

@Dref360 I thought about the bounding box issue some more and I agree with @allanzelener that the tools will be significantly different for bounding boxes. Unless there is a compelling reason I've missed to keep it here, I think bounding box algorithms should be considered out of scope for this issue and should be handled as a separate github issue.

For segmentation training it will be important to support loading data from a directory, and to support the most common dataset formats, which to my knowledge are the Pascal VOC format and the COCO json format. This post goes into loading from a directory in a reasonable way, including support for Pascal VOC.

  • I think ImageDataGenerator could simply be updated with a couple of additional options and parameters
  • What would be a good design for loading supplementary data?
  • PNG is the obvious choice for multi class single label data
  • what support should be provided for multi class multi label data? .npy format seems to work but is too huge for sparse data. Perhaps .mat and .tfrecord formats?

Here is how SegDataGenerator works in Keras-FCN:

seg_aug_generator = SegDataGenerator(
                 featurewise_center=False,
                 samplewise_center=False,
                 featurewise_std_normalization=False,
                 samplewise_std_normalization=False,
                 channelwise_center=False,
                 rotation_range=0.,
                 width_shift_range=0.,
                 height_shift_range=0.,
                 shear_range=0.,
                 zoom_range=0.,
                 zoom_maintain_shape=True,
                 channel_shift_range=0.,
                 fill_mode='constant',
                 cval=0.,
                 label_cval=255,
                 crop_mode='none',
                 crop_size=(0, 0),
                 pad_size=None,
                 horizontal_flip=False,
                 vertical_flip=False,
                 rescale=None,
                 data_format='default')

generator = seg_aug_generator.flow_from_directory(
                         file_path, data_dir, data_suffix,
                         label_dir, label_suffix, classes,
                         ignore_label=255,
                         target_size=None, color_mode='rgb',
                         class_mode='sparse',
                         batch_size=32, shuffle=True, seed=None,
                         save_to_dir=None, save_prefix='', save_format='jpeg',
                         loss_shape=None)

model.fit_generator(generator=generator, ...)

# Some internal details for the directory iterator:
    '''
    Users need to ensure that all files exist.
    Label images should be png images where pixel values represents class number.
    find images -name *.jpg > images.txt
    find labels -name *.png > labels.txt
    for a file name 2011_002920.jpg, each row should contain 2011_002920
    file_path: location of train.txt, or val.txt in PASCAL VOC2012 format,
        listing image file path components without extension
    data_dir: location of image files referred to by file in file_path
    label_dir: location of label files
    data_suffix: image file extension, such as `.jpg` or `.png`
    label_suffix: label file suffix, such as `.png`, or `.npy`
    loss_shape: shape to use when applying loss function to the label data
    '''

I think much of this functionality can be added directly to ImageDataGenerator.

file_path becomes file_list

SegDataGenerator.flow_from_directory's file_path parameter should be replaced with a file_list parameter, and a separate function should be added alongside ImageDataGenerator that can load a Pascal VOC formatted .txt file listing filenames without extensions and return a list of them.
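A hypothetical helper along those lines (the name and signature are illustrative only): read a Pascal VOC style imageset .txt file and return the listed basenames.

def load_voc_file_list(txt_path):
    # Each row contains an id such as 2011_002920, one per line.
    with open(txt_path) as f:
        return [line.strip() for line in f if line.strip()]

# file_list = load_voc_file_list('VOC2012/ImageSets/Segmentation/train.txt')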

Dense class_mode options

class_mode should add new options to specify that it is a dense prediction task. What should these be named: pixel_categorical, pixel_binary, pixel_multilabel, etc., or perhaps dense instead of pixel? Perhaps it should take a tuple or something else indicating the data dimensionality?

Image Dimensions

Does anyone have design suggestions for dealing with the dimension issue detailed by @allanzelener? The SegDataGenerator design is to simply pad images with mask pixels to the maximum expected image size. This seems to work okay, but can probably have significant computational cost.

new augmentation options

I think most of the new augmentation options in SegDataGenerator also look good and can simply be added directly.

Supplementary Data

Supplementary Data is also likely necessary (definitely in my case), I think it may be wise to allow a second list of input files to be supplied in a different format, which can be simple vectors or images stored in a .mat or .npy, or some other format. However, perhaps this should be a separate class? If so, how would consistency of indexes be ensured? Can two different generators be chained together in a manner analogous to zip() for lists?

loss_shape

loss_shape is a workaround, because the output dimensions will vary based on the model, and we will want the loss function to operate on the output data as it is. Can it be avoided?

File Formats

Common Formats

In the current ImageDataGenerator:

Any PNG, JPG or BMP images inside each of the subdirectories directory tree will be included in the generator.

I think the addition of data_dir, data_suffix, label_dir, label_suffix is a good decision that does not need to conflict with this, they can simply default to None which retains the current behavior.

Arbitrary Formats

The API could easily support arbitrary file formats with a function or object that opens the files in the directory and returns an appropriate numpy array. Should this exist? Which parameter should accept these? Perhaps instead of *_suffix the parameter could be *_format, which can take these classes and/or functions? Design suggestions are welcome.

@ahundt as a longtime Keras user that is now figuring out my way through multiclass semantic segmentation with sample weighting plus data augmentation via the ImageDataGenerator, I fully support your initiative. I believe you've covered the majority of needs above, and can't think of anything smart to add.

What I can tell you is that the ImageDataGenerator API for my specific need (above) is a bit opaque. I have issues with the generator objects (for images and masks) built using .fit and .flow and then passed to .fit_generator, where an error is raised because it expects a tuple (when in fact a tuple is being provided).

#2971 discusses some problems with ordered lists being needed for y but not X. I'm not sure that's the case, but I'm hacking/resolving it with reshapes. A subsequent roadblock is a mismatch between image and mask batches, although both are set at 32.

My impression is these hardships are more due to the design not being for my case use, but given that it is one that seems quite popular, it would be beneficial to expose/redesign the API appropriately.

@mptorr if you're using ImageDataGenerator, I believe the mismatches are due to each object generating random numbers separately, and the workaround is to provide the same random seed to each so they access indices in the same order. SegDataGenerator resolves this by accepting image and label dirs in a single object.
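A minimal sketch of that same-seed workaround with the existing ImageDataGenerator (the directory layout is an assumption; masks live in a parallel directory tree):

from keras.preprocessing.image import ImageDataGenerator

aug_args = dict(rotation_range=10., zoom_range=0.1, horizontal_flip=True)
image_gen = ImageDataGenerator(**aug_args)
mask_gen = ImageDataGenerator(**aug_args)

seed = 1
image_flow = image_gen.flow_from_directory('data/images', class_mode=None,
                                           batch_size=32, seed=seed)
mask_flow = mask_gen.flow_from_directory('data/masks', class_mode=None,
                                         color_mode='grayscale',
                                         batch_size=32, seed=seed)

# Both flows draw the same random transforms, so zipping yields aligned
# (image_batch, mask_batch) pairs.
train_flow = zip(image_flow, mask_flow)
# model.fit_generator(train_flow, steps_per_epoch=..., epochs=...)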

@ahundt thanks for the suggestion—in fact I am using a fixed and identical seed for both generators, but still get the error. Anyway, I don't want to hijack this thread with my travails... at some point I hope to figure this out.

I was going to try your SegDataGenerator however wanted to ask 2 things about it, as they may pertain to your request for features/suggestions:

[1] it appears it currently does not support pixelwise weighting to compensate for class imbalance. This would be an important feature to me, as most of my segmentation tasks will have disproportionately over/under-represented classes. Currently I balance classes using Keras' sample weighting with sample_weight_mode='temporal', but without data augmentation due to my issues above. Since the sample weighting matches each pixel in the image/mask pair, it would need to be appropriately transformed to match the augmented images/masks. Let me know if I'm overlooking this feature in SegDataGenerator.

[2] I'm a bit confused on how SegDataGenerator loads images. The comment in the class perhaps could be reworded (or have examples) for the most important arguments. I also didn't understand how to use this info: for a file name 2011_002920.jpg, each row should contain 2011_002920. Of course, this lack of understanding may reflect my own limitations, but just thought it could help development.

I'll be glad to give it a spin, especially if there's an option for sample weighting. Glad to continue this conversation elsewhere if more appropriate than on this thread.

I'm sorry for taking this long to comment but I just found the time to do so and I think there's too much stuff to discuss here. Should we split the issue into multiple threads maybe?

I recognize the following parts of the pipeline as separate entities regarding standardization and support for different implementations:

  • Dataset format
  • Data preprocessing and augmentation
  • Architecture IO - what kind of input/output should the actual computational graph expect? In other words, what should be the output of preprocessing/augmentation, and what should be the desired output of the network and therefore the input of evaluation?
  • Evaluation - MS-COCO uses a large set of metrics (stricter variants of IoU) and in that way it's elaborate. I also think it's one of the only ones that report AR@IoU0.5-0.95, which is supposed to correlate well with real world performance. For that reason, I have created a script, based on the one found in the FastMask repo, that runs inference on multiple images and converts the results to the MS-COCO evaluation format, a json file that looks like the one used to store ground truth annotations. Finally, should there also be support for auxiliary loss evaluation, such as the one used in many frameworks after detection? I mostly agree with you that this last one seems out of scope for this issue but maybe it deserves some discussion.

Preprocessing

Imho, this is the stinkiest part of the pipeline and usually goes like this in most projects:

Semantic segmentation

  1. Get each image and its ground truth from some data source
  2. resize both
  3. Use resized rgb and gt as input to network

Mask proposal networks / instance segmentation

  1. Get each image and its ground truth pixel labels from some data source
  2. Get ground truth bounding boxes around each instance
  3. jiggle them a bit
  4. retrieve rgb and target mask crops for the selected bounding box
  5. resize crops
  6. Use resized crops as input to network

I believe semantic segmentation and anything that deals with bounding box should be considered separate tasks and be built upon gradually. Semantic segmentation is relatively simple, therefore maybe let's consider that first but acknowledge that it only covers a part of the wider task. Object detection networks are not yet standardized on keras, so we probably should take it one step at a time.

Resizing

I'm taking the initiative to start with one term that is ambiguous, resizing. I'm not sure what the proper terminology is for some of this stuff, so please bear with me:

Resizing can either be achieved through stretching (with pixel interpolation), padding (e.g. with zeros) or cropping.

Padding gives the worst results, as it somewhat skews the statistics of the image and wastes network capacity at the same time.

Cropping, on the other hand, may remove too much context from the image, which is also undesirable. Furthermore, on prediction, using crops means that only part of the image area will be seen by the network in each pass. Therefore multiple passes over the image are required in order to cover the whole area.

In general it seems obvious to me that some kind of stretching is necessary. However, it is problematic when used alone in the case of multi-label, one hot targets (most popular option in segmentation datasets, e.g. MS-COCO). An easy solution would be to convert each one hot vector to a class index vector, then to PIL.Image (or equivalent), do the resize there and then convert back to one hot and feed that into the network. This however forces the selection of a single label for each pixel. Is this an important issue or should we safely assume that it's due to labeling error (annotations are not exact)?
Converting it back and forth is also slightly slow. scipy.ndimage.zoom can resize a numpy array natively but interpolation is done on all dimensions of the array, as far as I remember.
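A minimal sketch of that round trip, assuming single-label (one-hot) targets; nearest-neighbour interpolation guarantees no new labels are invented:

import numpy as np
from PIL import Image

def resize_one_hot_labels(one_hot, size, num_classes):
    # one_hot: (height, width, num_classes) array; size: (width, height).
    class_indices = np.argmax(one_hot, axis=-1).astype(np.uint8)
    resized = Image.fromarray(class_indices).resize(size, Image.NEAREST)
    return np.eye(num_classes, dtype=one_hot.dtype)[np.asarray(resized)]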

multiscale training

This is also an important feature, since CNNs are not completely scale invariant. YOLOv2, for instance, changes the shape of the input every few batches in order to learn to detect objects at various scales. In Keras, this is not exactly easy. I think TensorFlow only allows one dimension of the input to be unspecified (None), so this might not be Keras' fault. I have no idea whether it works with Theano as a backend.

Data generation

As far as data loading goes, I suggest that some variant of the MSCOCO class I have created for the enet-keras repository be used. It definitely needs quite some cleaning up and unit tests of course as it's a bit clumsy right now but I believe the set of operations is valid. Any kind of feedback is welcome obviously.

The logic of the class could be standardized (I have added a dummy Dataset class which I will populate as soon as I'm a little bit more confident about the layout) and easily extended in order to allow custom datasets and/or loading from disk.

For FCNs I've found base Keras to be pretty usable, but one sticking point is that it's not easy to replace a fixed-size model or Input layer with one that has None for all the spatial dimensions, which is all you really need for an FCN that accepts inputs at multiple scales. I think the best way to do this now is to create a new instance of the same model, except for the Input layer, and use get_weights + set_weights. It would be nice if there were a convenient way to just resize the model's input spatial dimensions and have it propagate to all layers, raising an error if that's not possible, e.g. if there's a Dense layer.

Does anyone have design suggestions for dealing with the dimension issue detailed by @allanzelener? The SegDataGenerator design is to simply pad images with mask pixels to the maximum expected image size. This seems to work okay, but can probably have significant computational cost.

@allanzelener @ahundt I think there's a messy workaround for that using Permute and a TimeDistributed wrapper but it's not exactly a solution.

Supplementary Data is also likely necessary (definitely in my case), I think it may be wise to allow a second list of input files to be supplied in a different format, which can be simple vectors or images stored in a .mat or .npy, or some other format. However, perhaps this should be a separate class? If so, how would consistency of indexes be ensured? Can two different generators be chained together in a manner analogous to zip() for lists?

@ahundt Can you explain what you mean here by "supplementary data" and what the use case is? I don't quite get it. For the zipping part maybe you're looking for this? EDIT: It's not a big deal though, why not just write a function that calls next for both generators and yields the pairs in a tuple?
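A tiny sketch of that suggestion, analogous to zip() for lists:

def pair_generators(image_generator, supplement_generator):
    # Advance both generators in lockstep and yield paired batches.
    while True:
        yield next(image_generator), next(supplement_generator)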

Options for input data to SegDataGenerator style ImageDataGenerator could either:

  1. stay simple and only allow 1-2 data input sources and a label input for the first version
  2. allow arbitrary inputs just like the Model class.

I'm leaning towards option 1 because it would maximize compatibility with the existing ImageDataGenerator, would be easy to understand, and would work for many use cases. More complex use cases could reasonably write their own augmentation class and call the basic functions (translate, zoom, etc) with reasonable ease.

@PavlosMelissinos thanks for the feedback, replies below.

@ahundt Can you explain what you mean here by "supplementary data" and what the use case is? I don't quite get it. For the zipping part maybe you're looking for this?

My use case is a vector that represents how a robot arm in the scene will move, plus an image of that robot. So the input data is an image and a vector, while the label is a 2D image containing scores for how successful the motions will be if they are made relative to each x,y coordinate in the image.

Another example would be input text and an image. Ex: "the person on the right" and an image of two people side by side. The labeled data would be the same dimensions as the original image, with the right person's pixels labeled as 1 and all other pixels labeled as 0.

EDIT: It's not a big deal though, why not just write a function that calls next for both generators and yields the pairs in a tuple?

Sounds like a reasonable possibility. How would performing or not performing zoom/translation be specified for each input?

Padding gives the worst results as it messes up somewhat with the statistics of the image and wastes network capacity at the same time.

Padding definitely requires extra memory and processing power, but are the results really that bad? I think it might depend on the network design. Resnet specifies zero padding and is particularly effective, for example.

multi-label, one hot targets, [...] class index

We should support each of these modes because each makes sense for a variety of reasonable applications.

resize crops

How about the SegDataGenerator API definition above? It lets the user specify the range of crop, translation, and resizing they would prefer.

Just a thought.

If you want to handle every case on earth (multi-label, one hot targets, [...] class index), maybe Keras is not the best place to do it? In the same way that TensorFlow has tensorflow-transform, Keras could have a keras-transform that would be a dependency of Keras. Keras is a deep-learning library, not a preprocessing one.

Anyway, the Pytorch way to do data augmentation sounds pretty cool with transform.compose

@Dref360
Duh, yes you're right, I got carried away a bit, sorry about that. :) Preprocessing is technically out of scope for keras. On the other hand, segmentation is a popular task and standardizing preprocessing (like ImageDataGenerator does for classification) by adding support for some basic operations would be useful. The basic problem is that images in MS-COCO do not have a constant size, like ImageNet. The purpose is to decide on a design for a class like SegDataGenerator that might fix some of the shortcomings of ImageDataGenerator.

@ahundt
re: supplementary data - Ah, I see. I don't think it's possible to formulate that in a way that is relevant to keras. A custom implementation, depending on the case seems like a cleaner approach. Covering every single combination of inputs does not seem feasible.

Padding definitely requires extra memory and processing power, but are the results really that bad? I think it might depend on the network design. Resnet specifies zero padding and is particularly effective, for example.

I was referring to padding in the context of preprocessing (where it takes up a sizable portion of the input image), it's my mistake for not making that clear. Zero padding within a CNN is not that bad (still skews statistics but it's not so big a deal and we don't really have a viable alternative).

How about the SegDataGenerator API definition above? It lets the user specify the range of crop, translation, and resizing they would prefer.

Say the user has an image/label pair that is originally 486px in height and 220px in width; the shape of the input tensor is (None, 256, 256, 3) for the image and (None, 65536, 81) for the label. How does SegDataGenerator deal with the conversion? One-hot labels are tricky to resize in this case because numpy arrays do not properly support the operation (scipy has ndimage.zoom, though, which might be worth a shot), label 'bleeding' across non-spatial dimensions should not be allowed, and NEAREST interpolation returns very pixelated ground truth masks.

I think I'm in favor of using some presets (e.g. instance segmentation needs each sample to be a pair; a crop within a ground truth bounding box and the binary mask of that object) and leaving the rest up to the user.

If you want to handle every case on earth (multi-label, one hot targets, [...] class index), maybe keras is not the best place to do it?

I'd hardly suggest every case on earth, haha. It is very reasonable to let a user select from both the sets {single label, multi label} and {single class, multi class} for dense prediction tasks as they require. That means the following four options:

  • sigmoid (single label, single class)
  • class index (single label, multi class)
  • one hot (single label, multi class)
  • multi label (multi label, multi class)

Keras already supports those cases listed above for simple label prediction.

Anyway, the Pytorch way to do data augmentation sounds pretty cool with transform.compose

That led me to an interesting idea, rather than the sequential model style of pytorch's transform.compose, perhaps this could work like the functional API? That could potentially make arbitrary application of augmentation much simpler! It could eventually also make it possible to use the TF backend image augmentation APIs, but I'll keep backends out of scope for now.

That said, selling a major API change is much more difficult than a minor extension of ImageDataGenerator so I'll stick with the minor extension option for here, and I created #6655 where preprocessing layers can be discussed.

@allanzelener @ahundt I think there's a messy workaround for that using Permute and a TimeDistributed wrapper but it's not exactly a solution.

@PavlosMelissinos Could you elaborate on this?

Say the user has an image/label pair that is originally 486px in height and 220 in width; the shape of the input tensor is (None, 256, 256, 3) for the image and (None, 65536, 81) for the label. How does SegDataGenerator deal with the conversion?

This is one of the key changes I'm hoping we can make, where 2D labels are directly supported, in other words the label would be the same dimensions as the input data.

SegDataGenerator image/label transform code:

        x = apply_transform(x, transform_matrix, img_channel_index,
                            fill_mode=self.fill_mode, cval=self.cval)
        y = apply_transform(y, transform_matrix, img_channel_index,
                            fill_mode='constant', cval=self.label_cval)

Remember that labels cannot and should not be interpolated! Average of labels 1 and 3 is not the label 2. :-) You have to pick from 1 or 3 so while it isn't as smooth you've got to use an algorithm like nearest.

Could you elaborate on this?

From my experience, the problem in the arbitrary input shape scenario in a Fully Convolutional Network (no Dense layers) is at the end of the network, when you need to Flatten the output and compare it to the targets. I'm not confident that hack would work (it was actually suggested by a colleague as a temporary workaround), so I'll reproduce it tomorrow at work and get back to you.

This is one of the key changes I'm hoping we can make, where 2D labels are directly supported, in other words the label would be the same dimensions as the input data.

That's not a problem, after all reshaping is trivial.

Remember that labels cannot and should not be interpolated! Average of labels 1 and 3 is not the label 2. :-) You have to pick from 1 or 3 so while it isn't as smooth you've got to use an algorithm like nearest.

That's the actual problem (it's noticeably less smooth with nearest neighbor). Maybe there is a better solution?

EDIT: In semantic segmentation there is a direct association between a rgb pixel and the ground truth label pixel at the same position. If the annotation is done in a specific size and then that image is resized, there is information distortion because the pixels are moved and some unseen values might appear (especially in the case of bilinear, bicubic or lanczos antialiasing). I guess what I'm saying is that the pixel values of the resized target labels should be dependent on the values of the pixels in the rgb image and more specifically on the way the value of each pixel in the resized rgb image was produced from the original. Does that make sense?

Remember that labels cannot and should not be interpolated! Average of labels 1 and 3 is not the label 2. :-) You have to pick from 1 or 3 so while it isn't as smooth you've got to use an algorithm like nearest.

I can think of two sensible ways to handle this.

  • Convert labels to a one-hot encoding. Treat each dimension as a binary mask for that label. Rescale each binary mask independently. These can be your targets or you can normalize them and get a target distribution for each pixel or take the max value.
  • For many datasets you start with polygons from the labeling tool (e.g. see LabelMe), or you can do some work to generate a polygon for each connected component. In either case you can just rescale the vertices of the polygons by the same scaling factors as the image, and the rescaled polygons should give good masks.

First approach is O(unique labels in image) and second approach is O(connected components).

@allanzelener Both nice ideas, especially the second one!

@allanzelener That's what I do, I rescale the polygons and then use OpenCV to draw the rescaled polygons. Works great and fast.
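A minimal sketch of that polygon-rescaling approach, here using PIL's ImageDraw rather than OpenCV to keep dependencies light (the vertex scaling is the same either way):

import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(polygon, orig_size, target_size):
    # polygon: [(x1, y1), (x2, y2), ...] in original image coordinates;
    # orig_size and target_size are (width, height) tuples.
    sx = target_size[0] / float(orig_size[0])
    sy = target_size[1] / float(orig_size[1])
    scaled = [(x * sx, y * sy) for x, y in polygon]
    mask = Image.new('L', target_size, 0)
    ImageDraw.Draw(mask).polygon(scaled, outline=1, fill=1)
    return np.asarray(mask)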

Here is my idea for a generate_samples_from_disk API to replace ImageDataGenerator.flow_from_directory that should still be clear but now work for a wider cross-section of applications:

def generate_samples_from_disk(sample_sets, callbacks=load_image, batch_size=1, data_dirs=None):
    """Generate numpy arrays from files on disk in groups, such as single images or pairs of images.

    # Arguments
        sample_sets: A list of lists, each containing the data's filenames, such as
            [['img1.jpg', 'img2.jpg'], ['label1.png', 'label2.png']].
            Also supports a list of txt files, each containing the list of filenames in each set,
            such as ['images.txt', 'labels.txt'].
            If None, all images in the folders specified in data_dirs are loaded in lexicographic order.
        callbacks: One callback that loads data from the specified file path into a numpy array,
            `load_image` by default. Either a single callback should be specified, or one callback
            must be provided for each sample set (same length as sample_sets).
        batch_size: Number of samples in a batch.
        data_dirs: Directory or list of directories to load.
            Default None means each entry in sample_sets contains the full path to each file.
            Specifying a directory means filenames in sample_sets can be found in that directory.
            Specifying a list of directories means each sample set is in its own separate directory
            (same length as sample_sets).

    # Returns
        Yields batch_size data points from each list provided.
    """

To do that I believe the python unpack mechanism would be the thing to use, but otherwise the implementation shouldn't be too complicated. It should also be set up so it can work with PASCAL VOC easily and cleanly.
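Here is a rough sketch of how the body might look, under simplifying assumptions: the txt-file form of sample_sets is resolved into lists first, callbacks is required rather than defaulting to load_image, and error handling is omitted.

import os
import numpy as np

def generate_samples_from_disk(sample_sets, callbacks, batch_size=1, data_dirs=None):
    # Resolve the txt-file form of sample_sets into lists of file path components.
    resolved = []
    for entries in sample_sets:
        if isinstance(entries, str):
            with open(entries) as f:
                entries = [line.strip() for line in f if line.strip()]
        resolved.append(entries)
    sample_sets = resolved

    # Broadcast a single callback or directory to every sample set.
    if not isinstance(callbacks, (list, tuple)):
        callbacks = [callbacks] * len(sample_sets)
    if not isinstance(data_dirs, (list, tuple)):
        data_dirs = [data_dirs or ''] * len(sample_sets)

    num_samples = len(sample_sets[0])
    while True:
        for start in range(0, num_samples, batch_size):
            batch = []
            for names, load, directory in zip(sample_sets, callbacks, data_dirs):
                paths = [os.path.join(directory, name)
                         for name in names[start:start + batch_size]]
                batch.append(np.array([load(p) for p in paths]))
            yield tuple(batch)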

Example usage with layout as downloaded by #6665:

# pascal voc + berkeley semantic contours annotations
train_file_path = os.path.expanduser('~/.keras/datasets/VOC2012/combined_imageset_train.txt') #Data/VOClarge/VOC2012/ImageSets/Segmentation
val_file_path   = os.path.expanduser('~/.keras/datasets/VOC2012/combined_imageset_val.txt')
data_dir        = os.path.expanduser('~/.keras/datasets/VOC2012/VOCdevkit/VOC2012/JPEGImages')
label_dir       = os.path.expanduser('~/.keras/datasets/VOC2012/combined_annotations')
def open_png(path):
    path = path + '.png'
    # ... open and return 1 channel uint8 numpy array ...

def open_jpg(path):
    path = path + '.jpg'
    # ... open and return 3 channel uint8 numpy array ...

seg_gen = generate_samples_from_disk([train_file_path, train_file_path], 
    callbacks=[open_jpg, open_png], 
    data_dirs=[data_dir, label_dir])
# now apply augmentation then fit

Any thoughts or details that are missing, such as how it would work with multiple input and label files per sample?

#6538 (comment) @allanzelener sounds like a nice approach, could you suggest an API design or have any reference code?

@allanzelener That's what I do, I rescale the polygons and then use OpenCV to draw the rescaled polygons. Works great and fast.

@Dref360 Do you have a link, or is that private? I'm guessing OpenCV won't be permitted as a new dependency; it carries a lot of baggage and dramatic version differences across OSes, and I haven't seen an API that's clean the way Keras is.

Okay, it looks like sample_weight and class_weight shouldn't be too difficult to update for segmentation: the various training.py _standardize*() functions will need to be updated so they accept 2D (or more) data, which means replacing functions like len() with ones that go over each entry in the size/shape instead.

However, some indicator, member variable, or parameter may need to be carried so the difference between one_hot data and dense segmentation labels can be accounted for. Additional investigation needed on that front.

What about adding a parameter to all the relevant layers and other APIs which either:

  • class_dimensions or channel_dimensions which defaults to 1 where 2 would indicate 2D and so on
  • class_shape or channel_shape which explicitly specifies the shape of the data that should be operated on by operations like loss functions

This could disambiguate the purpose of each data segment. It would work in a manner analogous to channels_first and channels_last, but on a per-layer basis, could be inherited from previous layers by default, and would specify which of the dimensions are class/channel dimensions. Thoughts?

Here is iteration 3.0 of this idea. I think this generalizes better to other non-segmentation problems. This is in addition to the extended segmentation data generator/augmentation, not instead of it. Comments are welcome!

data_spec list parameter for layers

What do you think of a data_spec list parameter for Layer which is essentially an improved (and local) image_data_format to resolve data ambiguity?

Example of data ambiguity

2D classes with dense prediction vs depth in a 3D CNN with single class prediction.

data_spec supported entries

['height', 'width', 'channel', 'length', 'time', 'class', 'depth', 'category']

Examples of data_spec

dense_prediction_input = ['batch', 'height', 'width', 'channel']
dense_prediction_output = ['batch', 'height', 'width', 'class']
imagenet_prediction_channel = ['batch', 'height', 'width', 'channel']
imagenet_prediction_output = ['batch', 'class']
label_3d_input = ['batch', 'height', 'width', 'depth', 'channel']
label_3d_output = ['batch', 'height', 'width', 'depth', 'channel']

# Usage:

dense_prediction_input = ['batch', 'height', 'width', 'channel']
# stored internally as a configuration setting of the input
x_input = Input(data_spec=dense_prediction_input)
# rest of cnn here...
x_output = Dense(data_spec=dense_prediction_output)(x)
categorical_output = K.to_categorical(x_output)
# Automatically known:
# categorical_dense_prediction_output = ['batch', 'height', 'width', 'category', 'class']
K.categorical_crossentropy(categorical_output)

If a Layer has this spec, an implementation like categorical_crossentropy can automatically reshape the data, run the algorithm correctly, then reshape it back to the original shape.
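A hypothetical illustration of that point (data_spec does not exist in Keras; this only shows the reshape-apply-restore pattern for a loss, assuming the class dimension is statically known):

from keras import backend as K

def dense_categorical_crossentropy(y_true, y_pred, data_spec):
    # Flatten every non-class dimension, apply the standard loss, then average.
    num_classes = K.int_shape(y_pred)[data_spec.index('class')]
    flat_true = K.reshape(y_true, (-1, num_classes))
    flat_pred = K.reshape(y_pred, (-1, num_classes))
    return K.mean(K.categorical_crossentropy(flat_true, flat_pred))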

There might need to be input_data_spec and output_data_spec to handle the changes in dimension a layer might cause.

TBD: 'class' might instead be 'label' or 'target', and 'string' outputs could work too.

Multi GPU Bonus

'device' trivially extends this to GPUs

Distributed training Bonus

'host' might extend this to distributed training

Alternate: shape tuples

input_shape exists and could be extended to accept
[(10, 'batch'), (None, 'height'), (None, 'width'), (10, 'classes')].

Backwards compatibility

Backwards compatibility should be very achievable with each of these options! Just default to the current behavior if no data_spec is supplied.

Hi,

I'm experimenting with Keras implementations of Yolo and SSD (https://github.com/lhk/object_detection).
So far my code is very much just a toy project.
But there is one feature that doesn't seem to be used so far:

For augmentation, the papers on object detection use variations of crops and color changes.
I haven't seen the usual range of rotations/zooms/shifts so far. This is probably because you have to update the bounding box annotations to keep them in sync with the image.

I've implemented a basic prototype for automatic augmentation of images with bounding boxes: https://github.com/lhk/bbox_augmentations/blob/master/showcase.ipynb

This integrates nicely with Keras, I've actually used parts of your image preprocessing pipeline.
Since this seems to be the issue for general discussion about API design in the direction of object detection / segmentation, I would like to propose this feature:

Reimplementing the current flow_from functionality to work on images annotated with bounding boxes.

I would very much like to work on this. Could you point me in the right direction to get started?
How can I productively contribute to this?

For example, I could try to recreate the current infrastructure of generators for the new annotated data type. Would that be useful?

It would be an awesome addition. Algorithm-wise, zooms and shifts are straightforward, but the way you do rotations is wrong in principle. For example, if you rotate a circle around its center, the bounding box doesn't change, while with your approach it does change. Although for small rotations, and if bounding boxes weren't all that tight to begin with, it wouldn't matter. In this case, bounding boxes should be jittered anyway, and the sampling can take the original+rotated into account?
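A small illustration of that point: rotating a box's corners and re-fitting an axis-aligned box always grows the box (except at multiples of 90 degrees), even when the object's true bounding box is unchanged.

import numpy as np

def rotate_box(xmin, ymin, xmax, ymax, angle_deg):
    theta = np.deg2rad(angle_deg)
    corners = np.array([[xmin, ymin], [xmax, ymin], [xmax, ymax], [xmin, ymax]])
    center = corners.mean(axis=0)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rotated = (corners - center).dot(rot.T) + center
    # Axis-aligned box enclosing the rotated corners.
    return rotated.min(axis=0).tolist() + rotated.max(axis=0).tolist()

print(rotate_box(0, 0, 10, 10, 45))  # ~[-2.07, -2.07, 12.07, 12.07]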

@lhk I'd suggest starting with the SegDataGenerator class in Keras-FCN, and create a pull request for the official keras-contrib repository that trains on pascal voc, a dataset already in keras-contrib. If you want to go that route you should also be aware of this PR which has some first steps (but also bugs in the example at the time of writing): keras-team/keras-contrib#152

To add some other resources:

Deeplabv3 Keras
PSPNet Keras

Thanks! I've been slowly integrating some functionality into github.com/keras-team/keras-contrib as well, there are several open pull requests.

@ahundt, I am interested to help you in reinforcement learning with OpenAI gym. Please let me know, how should I proceed.

@Luffy1996 this issue is about image segmentation rather than RL so I'll message separately.

I'll close this issue for now since this thread didn't have any updates for quite a while. Please open another one if necessary.