tinkoff-ai / lb-sac

Official implementation for "Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size", NeurIPS 2022, Offline RL Workshop

Some questions about the results of the LB-SAC paper report:

WuTi0525 opened this issue

commented
  1. When I ran the walker2d-full-replay-v2 experiment, my final normalized score was around 106.5, while 109.1 is reported in the paper. The gap is small, but it matters for my work: with my modification of the algorithm the score is around 108, which is above my own reproduction of the original algorithm but still slightly below the number reported in the paper. Should I treat this as a genuine improvement, or should I report my reproduced baseline alongside the published numbers (similar to the halfcheetah-medium-expert-v2 case, where my reproduction is also slightly worse than the paper)? Since LB-SAC is still among the strongest model-free algorithms, the room for improvement is very limited, so even small gains are important to me.
  2. About how the normalized scores are computed for the two numbers you report (the D4RL score convention I assume is sketched right after this list): I understand the final normalized score as the score of the final policy, i.e. the result at the last step averaged over the 4 random seeds. For the normalized maximum score, I see two possible interpretations:
    1. Record the best score on each seed, then average.
    2. Take the maximum of the seed-averaged score over the training steps.
    Do you average first and then take the maximum, or take the best score per seed first and then average?
    Looking forward to your reply!
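
For concreteness, here is a minimal sketch of the D4RL normalized-score convention assumed in the questions above: raw episode returns are mapped onto the 0-100 scale via the environment's reference scores. The episode returns below are purely illustrative numbers.

```python
import gym
import numpy as np
import d4rl  # noqa: F401, registers the offline MuJoCo environments

env = gym.make("walker2d-full-replay-v2")

# Raw returns from a handful of evaluation episodes (illustrative values only).
episode_returns = np.array([4800.0, 4950.0, 5010.0])

# D4RL maps a raw return to [0, 1] via dataset reference scores;
# multiplying by 100 gives the usual 0-100 normalized score.
normalized_score = env.get_normalized_score(episode_returns.mean()) * 100.0
print(f"normalized score: {normalized_score:.1f}")
```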

Hi @WuTi0525! Thank you for the interest in our work.

Regarding the first question: when I reproduced the algorithm for CORL, I also got numbers very close to 106, so the values close to 109 were probably a lucky run. All of my reproduction results are publicly available as a wandb report; you can see it here and also use them in your work, since they can be trusted. I would therefore suggest adding your improvements to lb_sac.py in CORL (yup, it's only one file!), comparing against the same configs, and importing the wandb graphs from the report into your project.
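
If it helps, here is a rough sketch of pulling those runs via the wandb public API for a side-by-side comparison. The project path and metric key below are placeholders, since the actual ones live in the linked report.

```python
import wandb

api = wandb.Api()
# Placeholder project path: substitute the entity/project from the CORL report linked above.
runs = api.runs("<entity>/<project>")

for run in runs:
    # The metric key is an assumption about how scores are logged; adjust it to the report's key.
    history = run.history(keys=["eval/normalized_score"])
    if not history.empty:
        print(run.name, history["eval/normalized_score"].max())
```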

As for the second question, first we averaged, and then we took the maximum.
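
A small numpy sketch of the two aggregation orders discussed above; per the reply, the paper's maximum score averages over seeds first and then takes the maximum over evaluation steps. The array shape and values are illustrative.

```python
import numpy as np

# scores[seed, eval_step]: normalized scores for 4 seeds logged at each evaluation (illustrative data).
rng = np.random.default_rng(0)
scores = rng.uniform(80.0, 110.0, size=(4, 300))

final_score = scores[:, -1].mean()        # "final normalized score": mean over seeds at the last evaluation
avg_then_max = scores.mean(axis=0).max()  # paper's convention: average over seeds, then max over steps
max_then_avg = scores.max(axis=1).mean()  # the other interpretation: per-seed best, then average

print(final_score, avg_then_max, max_then_avg)
```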

commented

@Howuhh Thank you very much for your timely reply. Another question: how many epochs did you run to obtain the final reported score? I took the result after 300 epochs, although the default in the code is 350. Also, in configs/lb-sac/halfcheetah/halfcheetah_medium_replay.yaml, num_critics should be 4; it looks like 10 was written there by accident.

@WuTi0525 Yeah, it is set to 4 in the CORL configs, so this is a typo.
We did not have one fixed number of epochs, as we trained to convergence. However, for your work you can choose whatever is convenient for you.
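
To make "trained to convergence" concrete, here is one simple heuristic on the logged evaluation scores; it is only an illustration, not the exact stopping rule used by the authors.

```python
import numpy as np

def has_converged(eval_scores, window=5, tol=0.5):
    """Illustrative stopping heuristic (not the authors' exact rule):
    stop once the mean normalized score over the last `window` evaluations
    improves by less than `tol` points relative to the previous window."""
    if len(eval_scores) < 2 * window:
        return False
    recent = np.mean(eval_scores[-window:])
    previous = np.mean(eval_scores[-2 * window:-window])
    return recent - previous < tol
```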

commented

So, are the results reported in the paper the final scores obtained after a different number of epochs for each dataset, as long as training has converged?

Yup, for the exact numbers you can look at the CORL configs for lb-sac.
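
For reference, a small sketch of pulling the training-length settings out of one of those YAML configs; num_critics = 4 comes from the discussion above, while the other field names are assumptions about the config layout rather than a guaranteed schema.

```python
import yaml

# Path from the discussion above; adjust to the CORL repo layout if it differs.
with open("configs/lb-sac/halfcheetah/halfcheetah_medium_replay.yaml") as f:
    cfg = yaml.safe_load(f)

# num_critics = 4 is confirmed above; the remaining keys are assumptions.
assert cfg["num_critics"] == 4, f"expected 4 critics, got {cfg['num_critics']}"
for key in ("num_epochs", "batch_size"):
    print(key, cfg.get(key, "<not present in this config>"))
```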

commented

Thank you for your detailed reply!