Questions about the Go2 Distillation Training Performance
JLCucumber opened this issue
Hi all,
I managed to complete the distillation step and obtained a policy trained with 256 num_envs after 60000 iterations, which took me three days. However, when I visualized it with play.py, the results aren't ideal. Here are some of my findings:
Good side:
- Compared with my previous minimal policy trained with 32 num_envs, which showed no reaction even after bumping into an obstacle, this 256-env version demonstrates a tendency to retry over and over, and sometimes it manages to overcome the jump terrain.
Bad side:
- Still, the depth-camera information isn't being used effectively by the student agent. Evidence: the Go2 runs into a very low jump obstacle without noticing it, and only after a number of attempts does it manage to get over. I think this indicates the student policy is actually relying on proprioception rather than exteroception (the depth image) to sense the environment. Here are some cases:
- The policy hasn't converged yet, even after 60000 iterations. I checked TensorBoard and noticed that the distillation loss and estimator loss both drop abruptly at around the 87k-th iteration (shown in the figure below). I will certainly resume the distillation training from the latest policy, but I'm not sure how exactly to do it: should I delete all the collected trajectories in tmp_data, or keep training with all of them? The README says the stored trajectories need to be deleted before a new training run, but what if I want to keep training from the latest policy?
- About the forward camera settings: I noticed the camera model clips through the robot, so part of the Go2's head is visible in the depth image. I suspect the default Go2 depth-camera configuration may need some adjustment. I noticed that in the Go1 version you tuned the camera settings quite a lot. Could you share some suggestions for tuning the forward camera on Go2, such as its position, pose, and extrinsics? Besides, do you have any other suggestions for tuning the camera? (To make the question concrete, a rough sketch of what I mean is below.)
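This is purely illustrative; the field names and values are my own guesses at what a forward-camera config section might look like, not the actual Go2 config:

class forward_camera:
    # Illustrative values only: pose of the depth camera relative to the robot base.
    resolution = [60, 106]        # depth image height x width after downsampling
    position = [0.27, 0.0, 0.03]  # meters: x forward, y left, z up from the base link
    rotation = [0.0, 0.52, 0.0]   # radians: roll, pitch, yaw (about 30 deg pitched down)
    # Moving the camera slightly forward/up or pitching it down may keep the
    # robot's head out of the rendered depth image.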
Thanks in advance for any suggestions!
Cheers,
Hongbo
Hello,
I ran into a similar problem: the model I got after training did not perform well either. My loss curve looks similar to the one you posted on July 27, and the reward also increased (reaching about 10 by around 80k iterations, before training stopped at the iteration limit), but the final visualized behavior was still poor.
My main problem now, though, is that I can't even reproduce this experiment: the reward only hovers around or below 0. If possible, could you give me some advice on what adjustments I should make so that the model can accumulate positive reward?
Sincerely and gratefully,
Mu Zihang
Hi,
Sorry for the long silence! Fortunately, I later found the root cause — there’s a buggy variable in the go2_distill config file. The parameter called ckpt_manipulator should be commented out when you play your student policy. Otherwise, it will “blind” your student’s eye by replacing the trained encoder with a randomly initialized one, leading to poor performance even on the simplest terrain.
Once you comment out this line, the student policy should perform well after around 10,000 iterations (roughly at the 50,000th iteration mark).
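Concretely, the only change needed before running play.py is to comment out that single line (shown in context in the full config below):

# In class runner of the go2_distill config, comment this out when playing the student policy:
# ckpt_manipulator = "replace_encoder0" if "field_go2" in load_run else None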
As for the reward curve, I didn’t modify any reward functions or their weights, so unfortunately I don’t have much insight on that part.
By the way, I think it’s worth mentioning this issue in the README — I double-checked and found that it’s not documented there. I’d really appreciate it if you could add a short note about this minor problem in the README, @ZiwenZhuang.
Here is the code:
class runner( Go2FieldCfgPPO.runner ):
    policy_class_name = "EncoderStateAcRecurrent"
    algorithm_class_name = "EstimatorTPPO"
    experiment_name = "distill_go2"
    num_steps_per_env = 32
    if multi_process_:
        pretrain_iterations = -1
        class pretrain_dataset:
            data_dir = "{A temporary directory to store collected trajectory}"
            dataset_loops = -1
            random_shuffle_traj_order = True
            keep_latest_n_trajs = 1500
            starting_frame_range = [0, 50]
    resume = True
    load_run = osp.join(logs_root, "field_go2",
        "{Your trained oracle parkour model directory}",
    )
    ckpt_manipulator = "replace_encoder0" if "field_go2" in load_run else None  # <<<< HERE !!!
    run_name = "".join(["Go2_",
        ("{:d}skills".format(len(Go2DistillCfg.terrain.BarrierTrack_kwargs["options"]))),
        ("_noResume" if not resume else "_from" + "_".join(load_run.split("/")[-1].split("_")[:2])),
    ])

Here are videos of my reproduced student policy at 80k~100k iterations (resuming from the 40k teacher checkpoint):
https://github.com/user-attachments/assets/1f7fc240-7d5f-4da8-848b-c7b1b8c094f5
https://github.com/user-attachments/assets/a36c9259-634d-4725-966b-3e7aa7d1a531
Thanks a lot for sharing this fix and the clear explanation about the ckpt_manipulator issue! Your note really helps clarify things. This will definitely be useful for others as well.
Sorry for the confusion. I will add a note about this to the README.


