tinkoff-ai / CORL

High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC

Home Page: https://arxiv.org/abs/2210.07105


OSError when running any_percent_bc.py

zhushi-math opened this issue

Here is the error message:
/home/bins/anaconda3/lib/python3.10/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
Traceback (most recent call last):
  File "/home/bins/桌面/CORL/algorithms/any_percent_bc.py", line 406, in <module>
    train()
  File "/home/bins/anaconda3/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/bins/桌面/CORL/algorithms/any_percent_bc.py", line 307, in train
    dataset = d4rl.qlearning_dataset(env)
  File "/home/bins/d4rl/d4rl/__init__.py", line 87, in qlearning_dataset
    dataset = env.get_dataset(**kwargs)
  File "/home/bins/d4rl/d4rl/offline_env.py", line 87, in get_dataset
    with h5py.File(h5path, 'r') as dataset_file:
  File "/home/bins/anaconda3/lib/python3.10/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/bins/anaconda3/lib/python3.10/site-packages/h5py/_hl/files.py", line 231, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 181944320, sblock->base_addr = 0, stored_eof = 474567252)

When I searched, I found that similar problems occur when an .h5 file is not loaded correctly, but I don't know which .h5 file the BC algorithm needs to load. The only code that seems related is the following, at lines 362-365:

if config.load_model != "":
    policy_file = Path(config.load_model)
    trainer.load_state_dict(torch.load(policy_file))
    actor = trainer.actor

But in TrainConfig, line 30:

load_model: str = ""  # Model load file name, "" doesn't load

and the same in medium_expert_v2.yaml, line 10:
load_model: ''

So no model is being loaded.

This seems to be a D4RL problem with dataset loading: the download was not completed for some reason, but the partial file is still there, or permissions are broken somehow. In your traceback, the actual file size (eof = 181944320) is smaller than the size recorded in the HDF5 superblock (stored_eof = 474567252), which points to an interrupted download.

Try deleting the D4RL datasets folder and re-running.
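
If it helps, here is a minimal sketch of that cleanup, assuming the default D4RL cache location (~/.d4rl/datasets, overridable via the D4RL_DATASET_DIR environment variable); it removes any cached file that h5py cannot open, so D4RL re-downloads it on the next run:

import os
from pathlib import Path

import h5py

# Default D4RL cache directory; the D4RL_DATASET_DIR env var overrides it.
cache_dir = Path(os.environ.get("D4RL_DATASET_DIR", str(Path.home() / ".d4rl" / "datasets")))

for h5_file in cache_dir.glob("*.hdf5"):
    try:
        # A truncated download fails right here with the same OSError as above.
        with h5py.File(h5_file, "r"):
            pass
    except OSError:
        print(f"removing truncated file: {h5_file} ({h5_file.stat().st_size} bytes)")
        h5_file.unlink()  # forces a fresh download on the next get_dataset() call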

Also, I would strongly suggest using the Docker environment that we provide, if you have a chance.

Will this problem affect the results of the EDAC algorithm I ran before? On the other hand, EDAC does work here, so is it possible that there is nothing wrong with the D4RL environment?
The following is the EDAC result I got:
[W&B chart from 2023-05-20 11:58:34]

Based on the provided trace, the execution did not reach model loading; it failed earlier.

Which exact config do you use? I will check on my side. I can see that it is medium_expert_v2, but the environment name is missing.

Thank you very much for your willingness to help me. I used the default parameters of the CORL EDAC algorithm, as follows:

[screenshot of the default EDAC config parameters]

The following is my environment configuration, uploaded to Google Drive; please take a look:
https://drive.google.com/file/d/1r2oQaX7HMrJ3HrNhtOS8kyEKnXq8t9gq/view?usp=share_link


Hello, I tried to use Docker to configure the environment today and ran any_percent_bc.py in the container, but something went wrong. The following is the output; the returned score is 82.634, significantly higher than the reference value.
[screenshot of the run output]

Also, in my account there is only this one graph, actor_loss:

[screenshots of the wandb dashboard]
The only training metric recorded in wandb is actor_loss.

Got it. For this particular problem, there is a missing condition that should check whether the checkpoint path is specified.
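
Roughly, the missing guard looks like this. This is only a sketch under my assumptions, not the exact fix from the PR: checkpoints_path follows the CORL config convention (Optional[str], None by default), and trainer/step stand in for the objects in the training loop:

import os
from typing import Optional

import torch

def maybe_save_checkpoint(trainer, checkpoints_path: Optional[str], step: int) -> None:
    # The missing condition: only build the path and save when it is configured.
    if checkpoints_path is not None:
        os.makedirs(checkpoints_path, exist_ok=True)
        torch.save(
            trainer.state_dict(),
            os.path.join(checkpoints_path, f"checkpoint_{step}.pt"),
        )

Without the guard, an unspecified path crashes the run at the first checkpoint, which would also explain why only actor_loss made it into wandb.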

This problem will be resolved in PR #52 (we're planning to merge it within a week or so).

If this is critical for you right now, you can use any_percent_bc.py from this branch.