MarcCoru / MTLCC

Multi-temporal land cover classification. Source code and evaluation of the IJGI 2018 journal publication


Dataset train/eval/test partitions

michaeltrs opened this issue · comments

Hi,

For the provided dataset, I noticed there is more data saved on disk than the total of the partitions found in the tileids folder. For example, for the 48x48 pixel data there are a total of 28515 .tfrecord.gz files, while eval.tileids, train_fold*.tileids, and test_fold*.tileids collectively contain 10494 samples per year. That leaves 28515 - 2*10494 = 7527 samples which are not assigned to train/eval/test for 2016 and 2017.
Is there something wrong in the above description? If not, how should we treat the unassigned data?
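
For reference, a rough sketch of how these counts can be reproduced; the directory layout below is an assumption, so the paths may need adjusting for the extracted dataset:

```python
import glob

# count all serialized samples on disk (directory layout is an assumption)
n_records = len(glob.glob("data/**/*.tfrecord.gz", recursive=True))

# count the ids listed in the partition files of the tileids folder
partition_files = (glob.glob("tileids/eval.tileids")
                   + glob.glob("tileids/train_fold*.tileids")
                   + glob.glob("tileids/test_fold*.tileids"))
n_partitioned = sum(len(open(f).read().split()) for f in partition_files)

print(n_records, n_partitioned, n_records - 2 * n_partitioned)
```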

Many thanks,
Michael

Hi Michael,

Thanks for your issue and your patience.

The tileids files are used for the results in the paper. All results are obtained from the tiles of tileids/eval.tileids.

The number of tfrecord files can differ from the number of tileids in the data splits due to two effects: 1) preprocessing failed for a tile (its tileid is listed in failedtiles201*.txt), and 2) the tile lies in the margin region between the train/valid/eval blocks, as shown in Figure 4 in the paper.

Overall the preprocessing chain looked like this:

a) for each tile within the AOI: crop the images and store them as a tfrecord; if an error occurs, add the tile id to failedtiles201*.txt

b) separate the area of interest into train/valid/eval blocks with a margin between them, and store the ids of the tiles that lie within the respective blocks in the tileids folder.

Since b) defines the split, not all tiles that were processed in a) are used by the training script, so the number of tfrecord files and the number of tileids can differ.
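
To make b) concrete, here is a minimal sketch of a block-based split with a margin; the block size, margin width, and the assignment scheme are placeholder assumptions, not the exact parameters used for the dataset:

```python
def assign_split(tile_x, tile_y, block_size=10, margin=1):
    """Assign a tile, given by its grid position, to a train/valid/eval
    block; tiles that fall within the margin of a block boundary are
    dropped, so they appear in no tileids file at all."""
    # position of the tile inside its block
    bx, by = tile_x % block_size, tile_y % block_size
    if min(bx, by) < margin or max(bx, by) >= block_size - margin:
        return None  # margin tile: processed in step a), but never listed in b)
    # alternate blocks between the three splits (placeholder scheme)
    block_id = (tile_x // block_size) + (tile_y // block_size)
    return ("train", "valid", "eval")[block_id % 3]
```

Tiles for which this returns None are written to tfrecord in a) but never listed in a tileids file, which, together with the failed tiles, accounts for the difference you observed.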

We decided to separate the tileids from the actual data samples to allow for different folds and for experiments with different data splits, similar to what we did in the CVPR paper (east vs. west, size of blocks). In the end, we did not include these experiments in the IJGI paper.

I hope this clarifies things.
We quantitatively evaluated the models on the eval.tileids of the 24px-by-24px tiles. These are the tiles on which you can compare your method directly with the results of the paper.
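
If it helps, a minimal sketch of how those evaluation samples could be selected, assuming a layout like data24/<year>/<tileid>.tfrecord.gz for the 24px records (the actual layout of the download may differ):

```python
import os

# tile ids used for the quantitative evaluation in the paper
with open("tileids/eval.tileids") as f:
    eval_ids = [line.strip() for line in f if line.strip()]

# hypothetical layout: one .tfrecord.gz per tile id and year
eval_files = [os.path.join("data24", "2016", tileid + ".tfrecord.gz")
              for tileid in eval_ids]
eval_files = [p for p in eval_files if os.path.exists(p)]  # skips failed tiles
```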