openai / procgen

Procgen Benchmark: Procedurally-Generated Game-Like Gym-Environments

Home Page: https://openai.com/blog/procgen-benchmark/

Potential train-test leakage

KarolisRam opened this issue

The original paper says "... distinct training and test sets can be generated for each environment". This might not be the case for some of the environments.

Some games have limited degrees of freedom (for example, maze at 3x3 or 5x5 sizes), so some levels will repeat between the train and test sets. I wanted to investigate how big an issue this is. My findings are below. Code to reproduce them is here: https://github.com/KarolisRam/procgen-level-overlap.

TL;DR: three games are affected: coinrun, maze and ninja. The overlap mostly happens in the simplest levels. The overlap rates for easy / hard difficulties are:

  • Coinrun 24% / 1%.
  • Maze 20% / 11%.
  • Ninja 4% / 3%.

Methodology

For each game, each difficulty level and both settings of center_agent, I saved an image of the first observation (the agent's view) for each of the first 200,000 seeds. Uncentering the agent makes the whole level visible. I then checked what percentage of the images from seeds 100,000-199,999 had already appeared among seeds 0-99,999.

This can only prove train/test overlap for some games, and only in uncentered mode (which doesn't work for starpilot). For example, in bigfish the behaviour of the other fish cannot be determined from the first observation.
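The comparison described above can be sketched roughly as follows. This is not the code from the linked repository: it hashes the raw observation bytes instead of saving images, and the helper names (`digest`, `overlap_rate`, `procgen_first_obs`) are made up for illustration. The procgen wiring assumes the `procgen` package with the older gym API, where `reset()` returns only the observation, and uses `start_level=seed, num_levels=1` to pin each environment to a single seed.

```python
import hashlib
from typing import Callable, Iterable


def digest(obs) -> str:
    """Hash the raw bytes of an observation (a NumPy array, bytes, or any buffer)."""
    return hashlib.sha256(memoryview(obs).tobytes()).hexdigest()


def overlap_rate(first_obs: Callable[[int], object],
                 train_seeds: Iterable[int],
                 test_seeds: Iterable[int]) -> float:
    """Fraction of test seeds whose first observation also occurs among the train seeds."""
    seen = {digest(first_obs(s)) for s in train_seeds}
    test_digests = [digest(first_obs(s)) for s in test_seeds]
    if not test_digests:
        raise ValueError("test_seeds must not be empty")
    return sum(d in seen for d in test_digests) / len(test_digests)


def procgen_first_obs(env_name: str, distribution_mode: str, center_agent: bool):
    """Assumed wiring to procgen; building one env per seed is simple but slow."""
    import gym  # imported lazily; requires the `procgen` package to be installed

    def first_obs(seed: int):
        env = gym.make(
            f"procgen:procgen-{env_name}-v0",
            start_level=seed,
            num_levels=1,
            distribution_mode=distribution_mode,
            center_agent=center_agent,
        )
        try:
            return env.reset()  # older gym API: returns the observation only
        finally:
            env.close()

    return first_obs


# Example call (slow; shown for shape only):
# rate = overlap_rate(procgen_first_obs("maze", "easy", False),
#                     range(100_000), range(100_000, 200_000))
```

Passing the observation source as a callable keeps the overlap logic separate from procgen, so it can be checked on toy data before running 200,000 seeds.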

Results

| env | easy | hard | centering matters? | fully described by first obs? |
|---|---|---|---|---|
| bigfish | 3.57% | 5.54% | no | no |
| bossfight | 0.00% | 0.00% | no | no |
| caveflyer | 0.00% | 0.00% | yes | yes? |
| chaser | 0.00% | 0.00% | no | no |
| climber | 0.00% | 0.00% | yes | yes? |
| coinrun | 24.31% | 1.05% | yes | yes? |
| dodgeball | 0.00% | 0.00% | no | yes |
| fruitbot | 0.00% | 0.00% | yes | yes |
| heist | 0.00% | 0.00% | no | yes |
| jumper | 0.00% | 0.00% | yes | yes |
| leaper | 23.18% | 0.82% | no | no |
| maze | 20.22% | 10.67% | no | yes |
| miner | 0.00% | 0.00% | no | yes |
| ninja | 4.06% | 2.92% | yes | yes? |
| plunder | 12.89% | 1.13% | no | no |
| starpilot | 0.37% | 0.37% | no | no |

For bigfish, leaper, plunder and starpilot, first-observation overlap doesn't imply level overlap, because the other objects in the game are not visible yet. For coinrun, maze and ninja there is clear overlap, mostly in the simplest levels. The other games show no overlap in the first observation.

In a sample of 1,000 overlaps in maze hard, 20% were 5x5, 80% were 3x3. In maze easy, 0.1% was 7x7, 28% were 5x5, 72% were 3x3.
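As rough intuition for why the smallest levels dominate the overlaps, here is a back-of-the-envelope sketch. It assumes that a given level size admits some number of distinct layouts, all equally likely, which is an idealisation (the real generator need not be uniform). Under that assumption, the chance that a test level already appeared among N training seeds is 1 - (1 - 1/K)^N for K layouts. The layout counts in the loop are hypothetical and chosen only to show the trend; they are not measured from procgen.

```python
def expected_overlap(num_layouts: int, num_train_seeds: int) -> float:
    """Chance that a fresh level matches at least one training level,
    assuming all `num_layouts` layouts are equally likely (an idealisation)."""
    if num_layouts < 1:
        raise ValueError("num_layouts must be at least 1")
    return 1.0 - (1.0 - 1.0 / num_layouts) ** num_train_seeds


# Hypothetical layout counts, for illustration only.
for k in (10**3, 10**6, 10**9):
    print(f"{k:>13,} layouts -> {expected_overlap(k, 100_000):.4f}")
```

With 100,000 training seeds, the expected overlap is near certain when the layout count is far below the seed count and falls off quickly once it is far above it.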

Caveats

While I'm confident of the maze overlaps, the coinrun and ninja ones are less certain, because the downsampled uncentered view makes the objects tiny. Some examples of observations from hard difficulty below:

Coinrun:
[images: coinrun-seed-00000000, coinrun-seed-00000001, coinrun-seed-00000004]

Maze:
[images: maze-seed-00000008, maze-seed-00000011, maze-seed-00000014]

Ninja:
[images: ninja-seed-00000004, ninja-seed-00000007, ninja-seed-00000015]