real-stanford / scalingup

[CoRL 2023] This repository contains data generation and training code for Scaling Up & Distilling Down

Home Page: https://www.cs.columbia.edu/~huy/scalingup/

Problems When using CACHE

Louis-ZhangLe opened this issue

Thanks for your great work. When I was reproducing your work using the cache, I could not find a matching hash key in the responses file. Looking forward to your reply, thank you.

Hey! I'm able to reproduce your error. When I roll back to the first commit, 3d2f43c, I no longer get the same error. My guess is that this bug was introduced in 218a618.

Unfortunately, I won't have time to fix this issue for another week. For now, if you don't need the FR5 robot, could you also use 3d2f43c? Thanks!
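For reference, pinning the repo to that commit should just be `git checkout 3d2f43c` from a standard clone (and reinstalling the package afterwards, if dependencies changed between commits).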

Thank you for your answer. I'll give it a try first. Also, the reason I didn't use the OpenAI API is that there were no logprobs in the response. May I ask if it is possible to remove logprobs from the code? Looking forward to your reply, thank you.

Ah right, the OpenAI API removed logprobs recently. Anyways, you should be able to remove the logprobs from the completion sampling procedure without affecting the results too much!
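As a rough illustration, removing logprobs from the sampling call could look something like the sketch below. It targets the legacy openai-python (< 1.0) Completion endpoint that the `openai.error` traceback later in this thread comes from; the model name and sampling parameters are placeholders, not the repo's actual values:

```python
import openai  # legacy openai-python (< 1.0), matching the openai.error usage in this thread

def sample_completions(prompt: str, num_samples: int) -> list[str]:
    # Placeholder model/parameters; the actual values live in the scalingup configs.
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.8,
        n=num_samples,
        # logprobs=5,  # dropped: newer API responses no longer include logprobs here
    )
    return [choice.text for choice in response.choices]
```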

Hello, I have successfully run the first-commit version you pointed to. I am first doing the reproduction work on the transport task, but the training results cannot match the results in the paper; the gap is too big. Can you provide more details on model training, such as the training parameters (num_steps_per_update, batch_size, and number of epochs) for each domain? Looking forward to your reply, thank you.

Hey! The default training parameters are the ones I used (batch size of 1024, 10 epochs, 1 num steps per update, etc.).
How many datapoints were used for training?

The transport task has 52,133 datapoints. I found that the default value of num_steps_per_update is 10,000, which would mean the model is updated 10,000 times per epoch. Are you sure you set it to 1?

Also, isn't it necessary to test during training, or is validation enough? In other words, should evaluation.num_episodes=0?

In addition, I found that inference with the diffusion policy is slow and keeps printing warnings such as "WARNING Failed to converge after 299 steps: err_norm=0.104888". Is this normal?

Finally, I would like to ask about the best checkpoint, named "last.ckpt". Why can't I load it? It says there is no such file, even though the path is correct, and other checkpoints can be loaded. Looking forward to your reply, thank you.

Sorry, I was referring to "num_steps_per_update", not "num_timesteps_per_batch".
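If it helps, parameters like these are typically passed as Hydra-style command-line overrides, in the same form as the `evaluation.num_episodes=0` override mentioned above (e.g. appending `num_steps_per_update=1` to the training command). The exact entrypoint and key paths depend on the repo's configs, so treat these key names as illustrative rather than a verified invocation.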

Hi! I just rolled back to the first commit and downloaded the responses files, but I still hit this error: 'openai.error.AuthenticationError: No API key provided. You can set your API key in code using 'openai.api_key = <API-KEY>', or you can set the environment variable OPENAI_API_KEY=<API-KEY>). If your API key is stored in a file, you can point the openai module at it with 'openai.api_key_path = <PATH>'. ...' I have no idea what the reason might be. Looking forward to a reply from both of you, thank you!
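For what it's worth, the error message itself lists the ways the legacy openai-python (< 1.0) library can be given a key; a minimal version in code (names taken straight from the message, the file path is a placeholder) would be:

```python
import os
import openai  # legacy openai-python (< 1.0), matching the openai.error module in the traceback

# Any one of these is sufficient:
openai.api_key = os.environ.get("OPENAI_API_KEY")  # read the key from the environment
# openai.api_key_path = "/path/to/key.txt"         # or point at a file containing the key
```

That said, if the cached responses were being found, presumably no live API call (and hence no key) would be needed, which is what the reply below points at.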

Maybe you can check the path where the cache file is saved. Make sure it is scalingup/scalingup/responses.
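To make the failure mode concrete: a prompt-hash response cache generally works along the lines of the sketch below, so if the cache directory path is wrong, every lookup misses and the code falls through to a live API call, which then fails without a key. This is purely illustrative; the hash function, file layout, and directory name (taken from the suggestion above) are assumptions, not the repo's actual implementation:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("scalingup/scalingup/responses")  # path suggested above

def cached_response(prompt: str) -> dict:
    # Key the cache on a hash of the prompt (the hash choice here is an assumption).
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())  # cache hit: no API call, no key needed
    # Cache miss: a real implementation would call the OpenAI API here,
    # which is where the AuthenticationError above would be raised.
    raise KeyError(f"no cached response for hash {key}")
```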