ikostrikov / walk_in_the_park

Question about the paper/implementation

araffin opened this issue

Hello,
thanks for sharing and open-sourcing the work.
After a quick read of the paper, I had several questions:

  • Did you ablate over different UTD (update-to-data) ratios, or was utd=20 a fixed choice?
  • Did you try TQC in addition to DroQ?
  • Do you apply a low-pass filter to the actions sent to the robot?

I have a working implementation of TQC + DroQ using Stable-Baselines3 that I can also share ;) (I can do a PR on request, and it will probably be part of SB3 soon)
SB3 branch: https://github.com/DLR-RM/stable-baselines3/tree/feat/dropq
SB3 contrib branch: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/tree/feat/dropq
Training script: https://github.com/araffin/walk_in_the_park/blob/feat/sb3/train_sb3.py

EDIT: SBX = SB3 + Jax is available here: https://github.com/araffin/sbx (with TQC, DroQ and SAC-N)
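
For context, a minimal sketch of how such a run might be launched, assuming sbx exposes TQC with SB3's usual constructor API; the env id and hyperparameters below are illustrative stand-ins, not the exact config of the linked training script:

```python
# Hedged sketch, not the exact train_sb3.py above: a TQC run via SBX
# (SB3 + JAX), assuming it mirrors SB3's standard off-policy arguments.
from sbx import TQC

model = TQC(
    "MlpPolicy",
    "Pendulum-v1",      # stand-in env; the thread trains on the A1 walking task
    gradient_steps=20,  # DroQ-style high UTD: 20 gradient updates per env step
    gamma=0.98,         # discount factor mentioned later in this thread
    verbose=1,
)
model.learn(total_timesteps=10_000)
```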

W&B example run: https://wandb.ai/araffin/a1/runs/2ln32rqx?workspace=user-araffin

Hello,

  • We ablated over different UTD ratios and found that utd=20 works best. See this figure.
  • TQC is an exciting algorithm. However, we didn't try it specifically for this work.
  • We ran experiments with and without a low-pass filter; however, in our specific setup, we didn't notice a significant difference, probably due to larger damping values. At the same time, I think the low-pass filter can be useful in many scenarios (a minimal sketch of such a filter follows this list).
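
To make the low-pass filter concrete, here is a minimal sketch of a first-order (exponential moving average) filter on the policy's actions; the smoothing coefficient and the 12-dimensional action (one entry per A1 joint) are illustrative assumptions, not the paper's values:

```python
# Hedged sketch: first-order low-pass (EMA) filter on actions, the usual way
# to smooth RL commands before sending them to real motors. alpha is an
# illustrative value, not taken from the paper.
import numpy as np

class LowPassActionFilter:
    def __init__(self, action_dim: int, alpha: float = 0.8):
        self.alpha = alpha                 # closer to 1.0 -> smoother but laggier
        self.prev = np.zeros(action_dim)

    def __call__(self, action: np.ndarray) -> np.ndarray:
        # y_t = alpha * y_{t-1} + (1 - alpha) * a_t
        self.prev = self.alpha * self.prev + (1.0 - self.alpha) * action
        return self.prev

filt = LowPassActionFilter(action_dim=12)  # 12 joint targets on an A1 quadruped
smoothed = filt(np.random.uniform(-1.0, 1.0, size=12))
```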

Results for TQC+DroQ look interesting! However, we do not plan to expand this repository and intend to keep it frozen to ensure the reproducibility of the results reported in the paper.

Thanks for the swift answer =)

We ablated over different UTD ratios and found that utd=20 works best. See this figure.

Given how fast the implementation is, it would make sense to try even UTD > 20, no?

Btw, what makes it so fast? Just JAX, or additional special tricks?

Did you consider running the training for longer than 20 minutes, or does it plateau/break? (let's say 1h for the easiest setup)
Because the learned policies walk forward, but one can tell it's an RL controller... (the gaits are not so natural/good-looking)

Our laptop could run training only with utd=20 in real time, so we didn't try larger values :)
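
For concreteness, a toy sketch of what the UTD (update-to-data) ratio controls, namely the number of gradient updates performed per environment step; every component below is a stand-in, not the repo's code:

```python
# Hedged toy sketch: UTD = gradient updates per environment step.
# The buffer, "environment", and "update" are stand-ins for the real thing.
import random

UTD = 20
BATCH_SIZE = 32
replay_buffer = []

def env_step():
    return random.random()          # stand-in transition

def gradient_update(batch):
    pass                            # stand-in for one critic/actor SGD step

for step in range(1_000):
    replay_buffer.append(env_step())            # one new transition per step...
    if len(replay_buffer) >= BATCH_SIZE:
        for _ in range(UTD):                    # ...but UTD gradient updates
            batch = random.sample(replay_buffer, BATCH_SIZE)
            gradient_update(batch)
```

With a real-time control loop, all 20 updates must fit inside one control interval, which is why larger ratios did not run in real time on a laptop.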

Yes, it's just jax.jit. Otherwise, it's a vanilla implementation without any additional engineering.
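
To illustrate where the speed comes from, here is a minimal, self-contained sketch of jitting an entire update step, so gradient computation and the parameter update compile into a single XLA call; the toy loss and linear "critic" are stand-ins, not the repo's networks:

```python
# Hedged sketch: jax.jit compiles the whole update (grads + SGD step) into
# one XLA computation; the toy linear "critic" is a stand-in.
import jax
import jax.numpy as jnp

def critic_loss(params, batch):
    preds = batch["obs"] @ params["w"] + params["b"]
    return jnp.mean((preds - batch["target"]) ** 2)

@jax.jit  # first call traces and compiles; later calls run the compiled code
def update(params, batch, lr=1e-3):
    grads = jax.grad(critic_loss)(params, batch)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (17, 1)), "b": jnp.zeros(1)}
batch = {"obs": jnp.ones((256, 17)), "target": jnp.zeros((256, 1))}
params = update(params, batch)
```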

In the wild, we were constrained by the battery capacity :) With more training it gets better and better.

In the wild, we were constrained by the battery capacity :) With more training it gets better and better.

Alright... still curious to see what it could do in the simplest setting (indoor, no battery constraint, flat ground).

fyi, I created a small report for the runs I did today with TQC ;) https://wandb.ai/araffin/a1/reports/TQC-with-DropQ-config-on-walk-in-the-park-env--VmlldzoyNTQxMzgz
After minor tuning of the discount factor (gamma=0.98), it consistently reaches a return > 3700 in only 8k env interactions =) (sometimes in only 5k)

As a follow-up, I've got a working version of TQC + DroQ in JAX here (I borrowed some code from your implementation ;)): vwxyzjn/cleanrl#272
(also a version of TQC + TD3 + DroQ; I still need to polish everything)