Potential train-test leakage

KarolisRam opened this issue 8 months ago

Karolis Ramanauskas commented 8 months ago

The original paper says "... distinct training and test sets can be generated for each environment". This might not be the case for some of the environments.

Because of limited degrees of freedom in some games, for example maze with 3x3 or 5x5 sizes, some levels will repeat between train and test sets. I wanted to investigate how big of an issue this is. Below are my findings. Code to reproduce them is here: https://github.com/KarolisRam/procgen-level-overlap.

TL;DR - three games are affected - coinrun, maze and ninja. The overlap mostly happens in the simplest levels. The overlap rates for easy / hard difficulties are:

Coinrun 24% / 1%.
Maze 20% / 11%.
Ninja 4% / 3%.

Methodology

For each game, each difficulty level and both settings of center_agent, I saved the first observation (the agent's view) for each of the first 200,000 seeds. Uncentering the agent makes the whole level visible. I then measured what percentage of the images for seeds 100,000-199,999 had already appeared among seeds 0-99,999.

This can only prove train/test overlap for some games, and only in uncentered mode (which doesn't work for starpilot). For example, in bigfish the behaviour of the other fish cannot be determined from the first observation.
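The comparison described above could be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository: `overlap_rate`, `first_obs` and `toy_first_obs` are hypothetical names, and `first_obs` stands in for whatever routine returns the raw bytes of a seed's first observation.

```python
import hashlib
from typing import Callable


def overlap_rate(first_obs: Callable[[int], bytes], n_train: int, n_test: int) -> float:
    """Fraction of test seeds whose first observation already appeared
    among the train seeds. Seeds 0..n_train-1 are treated as train,
    the next n_test seeds as test."""
    # Hash each train observation so only digests are kept in memory.
    seen = {hashlib.sha256(first_obs(seed)).digest() for seed in range(n_train)}
    hits = sum(
        hashlib.sha256(first_obs(seed)).digest() in seen
        for seed in range(n_train, n_train + n_test)
    )
    return hits / n_test


def toy_first_obs(seed: int) -> bytes:
    """Toy stand-in (an assumption, not a real game): only 50 distinct
    'levels' exist, so seeds must collide."""
    return bytes([seed * 7919 % 50])


if __name__ == "__main__":
    print(overlap_rate(toy_first_obs, 100, 100))
```

With the settings from this issue, the call would be `overlap_rate(first_obs, 100_000, 100_000)`. Exact-match hashing only counts pixel-identical observations, so it says nothing about levels that differ in parts not visible in the first frame.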

Results

env	easy overlap	hard overlap	centering matters?	fully described by first obs?
bigfish	3.57%	5.54%	no	no
bossfight	0.00%	0.00%	no	no
caveflyer	0.00%	0.00%	yes	yes?
chaser	0.00%	0.00%	no	no
climber	0.00%	0.00%	yes	yes?
coinrun	24.31%	1.05%	yes	yes?
dodgeball	0.00%	0.00%	no	yes
fruitbot	0.00%	0.00%	yes	yes
heist	0.00%	0.00%	no	yes
jumper	0.00%	0.00%	yes	yes
leaper	23.18%	0.82%	no	no
maze	20.22%	10.67%	no	yes
miner	0.00%	0.00%	no	yes
ninja	4.06%	2.92%	yes	yes?
plunder	12.89%	1.13%	no	no
starpilot	0.37%	0.37%	no	no

For bigfish, leaper, plunder and starpilot the first observation overlap doesn't mean level overlap, because the other objects of the game are not visible yet. For coinrun, maze and ninja there is clear overlap, mostly in the simplest levels. Other games show no overlap in first observation.

In a sample of 1,000 overlaps in maze hard, 20% were 5x5, 80% were 3x3. In maze easy, 0.1% was 7x7, 28% were 5x5, 72% were 3x3.

Caveats

While I'm confident about the maze overlaps, the coinrun and ninja ones are less certain, because the downsampled uncentered view makes the objects tiny. Some examples of observations from hard difficulty below:

Coinrun: [image]

Maze: [image]

Ninja: [image]