
A-Train Cloud Segmentation

Official GitHub repository for the A-Train Cloud Segmentation Dataset.


Overview

The A-Train Cloud Segmentation Dataset (ATCS) is a large-scale, volumetric cloud segmentation dataset designed to make it easy for computer vision and remote sensing researchers to improve their cloud-finding models. This repository contains everything you need to interface with the dataset, to train or evaluate our baseline models, and even to generate your own version of the dataset.

If you're interested in machine learning or computer vision, you may find this dataset interesting because:

  • It's large scale, with over 20k instances.
  • It's easy to use; all instances are pre-processed numpy / pickle objects.
  • Dataloaders and models are implemented in PyTorch.
  • Input images have 288 channels, posing a unique challenge.
  • Labels are rich, containing cloud types for 99 altitude bins.
  • Labels are sparse, making this an interesting test-case for sparse supervision.

If you come from the atmospheric remote sensing community, you may find this dataset interesting because:

  • Multi-angle inputs and rich labels enable estimation of full vertical cloud profiles from passive sensor data. Not only does this help with parallax problems during Level 1-C processing, but it also has implications for climate modeling.
  • Input contains 9 spectral bands, 3 of which have polarization values. As far as we know, this is the only similarly formatted large-scale volumetric cloud segmentation dataset containing polarization.

Dataset

ATCS consists of data from multiple satellites in the A-Train Constellation, synchronized spatially and temporally. The data is sourced from the publicly available ICARE Data and Services Center.

Instances are formatted as input/output pairs. The dataset contains 16,448 training and 4,112 validation instances.

Data is sampled uniformly from CloudSat orbits. The global coverage of our dataset is valuable, as the challenges of cloud segmentation vary strongly based on location. For instance, equatorial clouds are higher than polar clouds. Cloud segmentation can also be challenging over reflective surfaces, such as ice and snow, which makes the polar latitude samples in our dataset particularly valuable.

Dataset Geospatial Coverage

Data is sampled randomly from 11/27/2007 to 12/2/2009, as this is the intersection of the operational lifetimes of the relevant missions, before orbital corrections caused an increase in the time delay between sensors.

Dataset Time Coverage

Input

Input comes from the POLDER sensor on the PARASOL satellite. Input arrays are formatted as 100x100 patches, which are equirectangular projections centered on the patch center. Ground sample distance is approximately 6 km. Note: the angles are imaged at different times, so clouds may drift between angles. Because of the coarse ground sample distance, this drift should rarely exceed 1 or 2 pixels.

Each pixel contains normalized reflectance values for 9 different spectral bands: 443, 490, 565, 670, 763, 765, 865, 910, and 1020 nm. In addition, the bands 490, 670, and 865 contain polarization information, represented as the Q, U values of the Stokes parameterization. Finally, there are three geometric values: relative azimuth angle, solar zenith angle, and view zenith angle, giving 18 total values, per angle, per pixel. With 16 angles and 18 values, inputs have a total of 288 channels. At inference time, model inputs have shape:

input_shape = (batch_size, num_channels=288, height=100, width=100)
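As a sanity check on the channel arithmetic, the sketch below shows how the 288 channels decompose into 16 angles times 18 values. The angle-major channel ordering here is an assumption for illustration, not the repository's documented layout:

```python
import numpy as np

NUM_ANGLES = 16   # viewing angles per pixel
NUM_VALUES = 18   # 9 reflectances + 3 x (Q, U) Stokes values + 3 geometry values
NUM_CHANNELS = NUM_ANGLES * NUM_VALUES  # = 288

# A toy batch with the stated input shape (batch, channels, height, width).
batch = np.zeros((2, NUM_CHANNELS, 100, 100), dtype=np.float32)

# Assuming channels are grouped by angle (an assumption, not the repo's
# documented convention), a per-angle view can be recovered like this:
per_angle = batch.reshape(2, NUM_ANGLES, NUM_VALUES, 100, 100)
print(per_angle.shape)  # (2, 16, 18, 100, 100)
```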

Some example images are shown below. These images are averaged over the 4 central angles, interpolated to simulate true color, and min-max normalized:

Output

Outputs are not hand-labeled; instead, they come from the CloudSat satellite. More specifically, we source the data from a CALTRACK product: 2B-CLDCLASS (vertical cloud profiles). These data are spatially aligned with the PARASOL/POLDER data to sub-pixel accuracy. The CloudSat data in this product are already time-synchronized with CALIPSO, so we use another CALTRACK product, which contains time offsets between PARASOL and CALIPSO, to ensure the records match within a maximum time offset of 10 minutes.
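The matching step can be sketched as a nearest-neighbor search over timestamps followed by a threshold check. The timestamp arrays below are toy values, not the products' actual fields:

```python
import numpy as np

MAX_OFFSET = 10 * 60.0  # 10 minutes, in seconds

# Hypothetical record timestamps (seconds); not the products' real data.
parasol_t = np.array([0.0, 500.0, 1300.0, 4000.0])
caltrack_t = np.array([30.0, 450.0, 2500.0])  # must be sorted

# For each PARASOL record, find the nearest CALTRACK record in time.
idx = np.searchsorted(caltrack_t, parasol_t)
idx = np.clip(idx, 1, len(caltrack_t) - 1)
left, right = caltrack_t[idx - 1], caltrack_t[idx]
idx -= (parasol_t - left) < (right - parasol_t)  # step back if left is closer
offset = np.abs(parasol_t - caltrack_t[idx])

# Keep only pairs whose time offset is within the threshold.
keep = offset <= MAX_OFFSET
print(keep)  # [ True  True False False]
```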

Semantic segmentation involves predicting the class-membership mask of a single 'layer' of an image. However, our dataset contains vertical cloud profiles over 99 altitude bins. You can think of this as asking the network to simultaneously perform 99 (highly correlated) semantic segmentation tasks.

There are two main limitations of these outputs.

  1. Since CloudSat's radar only acquires vertical cloud profiles in a narrow band, the output is only defined for a sparse set of locations in the input grid. This tends to look like a line running (mostly) through the input image from north to south. We constrain sampled locations so that there are at least 10 pixels between the output locations and the east/west borders of the image.
  2. There is a temporal delay between POLDER and CLDCLASS measurements, so cloud locations may have slightly changed. Temporal offset tends to be within 3 minutes, so only the fastest-moving clouds will drift more than 1 pixel.

At inference time, outputs have shape:

output_shape = (batch_size, num_classes=9, num_altitude_bins=99, height=100, width=100)
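One way to make the "99 parallel segmentation tasks" framing concrete is per-voxel cross-entropy, with the class axis as dimension 1 and the loss averaged over batch, altitude, and spatial dimensions. This NumPy sketch is illustrative only; it is not the repository's actual loss, which first interpolates predictions to the labeled locations as described below:

```python
import numpy as np

B, C, A, H, W = 2, 9, 99, 100, 100
rng = np.random.default_rng(0)

logits = rng.normal(size=(B, C, A, H, W))          # model output
target = rng.integers(0, C, size=(B, A, H, W))     # class index per voxel

# Log-softmax over the class axis (numerically stabilized).
z = logits - logits.max(axis=1, keepdims=True)
log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))

# Negative log-likelihood of the true class at every
# (batch, altitude, height, width) voxel, then the mean.
nll = -np.take_along_axis(log_probs, target[:, None], axis=1)[:, 0]
loss = nll.mean()
```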

An example label is displayed below:

The same example, displayed over its associated image. The green line shows the set of points for which we have supervision:

3D in/out viz

Ground truth is only defined on a sparse set of locations, with varying amounts of labeled locations per image. Additionally, these locations are not quantized to the same grid as the input. Therefore, before applying the loss function, we interpolate and reshape the output:

interpolated_shape = (num_pixels_in_batch, num_classes=9, num_altitude_bins=99)

Note: the interpolation function is differentiable, and doesn't interfere with backpropagation.
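As an illustration of that interpolation step (written in NumPy for clarity; the actual pipeline uses differentiable torch ops), here is bilinear sampling of an output volume at a few hypothetical fractional label locations:

```python
import numpy as np

B, C, A, H, W = 1, 9, 99, 100, 100
rng = np.random.default_rng(0)
out = rng.normal(size=(B, C, A, H, W))  # toy model output

# Hypothetical fractional (row, col) locations of labeled CloudSat pixels;
# real coordinates come from the dataset's geolocation, not from here.
coords = np.array([[10.3, 49.8], [11.7, 50.1], [13.2, 50.4]])

r0 = np.floor(coords[:, 0]).astype(int)
c0 = np.floor(coords[:, 1]).astype(int)
dr = coords[:, 0] - r0
dc = coords[:, 1] - c0

def gather(r, c):
    # Values at integer grid points, reshaped to (num_pixels, C, A).
    return out[0][:, :, r, c].transpose(2, 0, 1)

# Bilinear blend of the four surrounding grid points.
interp = ((1 - dr) * (1 - dc))[:, None, None] * gather(r0, c0) \
       + ((1 - dr) * dc)[:, None, None] * gather(r0, c0 + 1) \
       + (dr * (1 - dc))[:, None, None] * gather(r0 + 1, c0) \
       + (dr * dc)[:, None, None] * gather(r0 + 1, c0 + 1)
print(interp.shape)  # (3, 9, 99), i.e. (num_pixels, num_classes, num_altitude_bins)
```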

Tasks

We support three different tasks. The most challenging is volumetric semantic segmentation ("seg_3d"), which requires the model to predict a cloud type for every voxel in the output grid. Binary volumetric segmentation ("bin_seg_3d") requires the model to predict whether each voxel in the output grid is a cloud or not, but does not require prediction of cloud type. You can think of this as: bin_seg_3d = (seg_3d > 0). Flattened binary segmentation ("bin_seg_2d") collapses the output over the altitude dimension, simply asking the model to predict if each pixel has any clouds in its vertical profile. You can think of this as: bin_seg_2d = bin_seg_3d.any(axis=1).
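The relationships between the three task labels can be written directly as array operations on a toy label volume (the shape and the convention that class 0 means clear sky are assumptions for illustration):

```python
import numpy as np

# Toy seg_3d labels: shape (batch, altitude, height, width),
# class 0 = clear, classes 1-8 = cloud types (assumed convention).
rng = np.random.default_rng(0)
seg_3d = rng.integers(0, 9, size=(2, 99, 100, 100))

# Binary volumetric segmentation: is each voxel cloudy at all?
bin_seg_3d = seg_3d > 0

# Flattened binary segmentation: does any altitude bin contain cloud?
bin_seg_2d = bin_seg_3d.any(axis=1)
```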


Models

(TODO: experiments running)


Installation

These installation instructions assume a Linux system.

  1. Clone this repository:

     git clone https://github.com/seanremy/atrain-cloudseg
    
  2. Install python dependencies:

     conda create --name atrain python==3.9.5
     conda activate atrain
     python -m pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
    
  3. Download and extract the dataset. You can download the dataset from this (temporary) link. Once you have the dataset downloaded, you can extract it with:

     tar -xzf <path/to/the/downloaded/file> -C <path/to/where/you/want/the/dataset>
    
  4. Make the data and experiments directories, plus make a symbolic link to the dataset directory:

     mkdir data && mkdir experiments
     ln -s <path-to-extracted-dataset> data/atrain
    

Usage

Training

In order to see usage options, run:

    python scripts/train.py -h

Evaluation

TODO: this is a work in progress.

Leaderboard

TODO: the leaderboard is a work in progress.

Dataset Generation

You can generate your own version of the dataset, if you'd like to get more data, or use a different set of generation hyperparameters. The dataset generation script manages a session with the ICARE FTP server, where the original data lives. These files are in HDF5 or HDF4 formats, and in the case of the PARASOL data, the files are quite large.

In order to use this script, you first need to register with ICARE. The script will ask you for your username and password at runtime. If you want to avoid constantly re-entering your credentials, you can store them in 'icare_credentials.txt', with username on the first line, and password on the second. CAUTION: If you do this, ensure that your ICARE password is unique, as storing unhashed passwords is extremely insecure.
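A minimal sketch of that credential lookup (the function name and exact behavior are illustrative, not the script's actual implementation; only the filename comes from this README):

```python
import os
from getpass import getpass

def get_icare_credentials(path="icare_credentials.txt"):
    # Read 'icare_credentials.txt' if present: username on the first
    # line, password on the second. Otherwise, prompt interactively.
    if os.path.exists(path):
        with open(path) as f:
            username, password = f.read().splitlines()[:2]
        return username, password
    return input("ICARE username: "), getpass("ICARE password: ")
```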

You can configure patch size, sampling rates, maximum time offset, minimum number of angles, and which fields to take. In addition, the script is resumable, since it can crash if your connection fluctuates or if occasional I/O file-lock errors occur. In order to see usage options, run:

    python scripts/generate_atrain_dataset.py -h

After generating the dataset, you will need to create an appropriately located symbolic link to the dataset location, which you can do with:

    ln -s <path-to-created-dataset> data/atrain

You still need to generate one or more train/val splits. To see usage for the split generation script, run:

    python scripts/generate_split.py -h

Contributing

Feel free to open issues or contribute a pull request to this project! I appreciate any help. If you do wish to contribute, please set up pre-commit.


Acknowledgements

Data provided by CNES and NASA. We thank the AERIS/ICARE Data and Services Center for providing access to the data used in this study.

The dataset and baselines were initially developed during my Summer 2021 internship at NASA's Goddard Space Flight Center. None of this work would be possible without the insights of my supervisor Kirk Knobelspiesse, as well as Andy Sayer, Bastiaan van Diedenhoven, Carlos Del Castillo, and Jason Xuan. I'd also like to thank my manager at SAIC, Fred Patt, and my advisors, James Hays and Judy Hoffman.


License: GNU General Public License v3.0

