danijar / dreamerv2

Mastering Atari with Discrete World Models

Home Page: https://danijar.com/dreamerv2

Questions about Atari evaluation protocol

jmkim0309 opened this issue · comments

Hi @danijar, thank you for this great work.

I have some questions about the evaluation protocol used in this code and in the DreamerV2 paper.

  • Q1. In the paper, it is mentioned that you followed the evaluation protocol of Machado et al. (2018), which uses "evaluation during training", meaning the average score of the last 100 training episodes before the agent reaches 200M frames, without an explicit evaluation phase. Is this "evaluation during training" protocol used for DreamerV2 as well, or did you run separate evaluation episodes?
  • Q2. Is there any standard Atari evaluation protocol? For instance, the IMPALA paper states that they used a standard evaluation protocol in which the scores over 200 evaluation episodes are averaged. So they used a separate evaluation phase, while Machado et al. did not. Also, sticky actions are not applied in the IMPALA evaluation, while they are used in Machado et al. and DreamerV2. So I wonder whether there is any evaluation protocol we can call "standard". What is your opinion on this, and why does DreamerV2 follow the evaluation protocol of Machado et al. rather than the one of IMPALA?
  • Q3. In Machado et al., 5 to 24 different trials are averaged for evaluation. How many trials did you use for DreamerV2?
  • Q4. In the code (https://github.com/danijar/dreamerv2/blob/main/dreamerv2/train.py#L153-L155), the number of evaluation episodes is 1 and the evaluation interval (config.eval_every) is 1e5. How can I relate these settings to the standard evaluation protocol of Machado et al.?

Hi, thanks for your question. What we mean by the standard evaluation protocol of Machado et al. (2018) is that we use sticky actions (with 25% probability, the agent's action is ignored and the previous action is repeated instead), we use the full action space (rather than a reduced action space containing only the useful actions of each game), and we do not use the life-loss heuristic.
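
In code, the sticky-action rule above amounts to something like the following minimal sketch. It assumes a Gym-style environment interface; the `StickyActions` wrapper name and its structure are illustrative, not the repo's own implementation.

```python
import random
import gym


class StickyActions(gym.Wrapper):
    """With probability p, ignore the agent's action and repeat the previous one."""

    def __init__(self, env, p=0.25):
        super().__init__(env)
        self.p = p
        self.prev_action = 0

    def reset(self, **kwargs):
        self.prev_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.p:
            action = self.prev_action  # repeat the previous action instead
        self.prev_action = action
        return self.env.step(action)
```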

We are using separate evaluation episodes where the mode of the policy is used instead of a sample, but the difference from the training episode scores is small. We are running 1 such evaluation episode every 1e5 training steps, as you pointed out. The plots in the paper are binned with a bin size of 1e6, which means the scores are averages over 10 evaluation episodes.
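
For concreteness, here is a small sketch of the binning described above. The function name and inputs are hypothetical rather than taken from the repo; it just averages evaluation returns within 1e6-step bins, so with one evaluation episode every 1e5 steps each plotted point aggregates 10 episodes.

```python
import numpy as np


def binned_eval_scores(steps, scores, bin_size=1e6):
    """Average evaluation returns within right-inclusive bins of environment steps,
    e.g. steps 1..1e6 fall into bin 1 and steps 1e6+1..2e6 into bin 2."""
    steps = np.asarray(steps, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    bins = np.ceil(steps / bin_size).astype(int)
    return {int(b): float(scores[bins == b].mean()) for b in np.unique(bins)}


# With one evaluation episode every 1e5 steps, each 1e6-step bin averages 10 episodes.
steps = np.arange(1e5, 2e6 + 1, 1e5)            # 20 evaluation points
scores = np.random.uniform(0, 100, len(steps))  # placeholder returns
print(binned_eval_scores(steps, scores))        # {1: ..., 2: ...}
```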