tinkoff-ai / CORL

High-quality single-file implementations of SOTA Offline and Offline-to-Online RL algorithms: AWAC, BC, CQL, DT, EDAC, IQL, SAC-N, TD3+BC, LB-SAC, SPOT, Cal-QL, ReBRAC

Home Page: https://arxiv.org/abs/2210.07105


OSError when running any_percent_bc.py

zhushi-math opened this issue

Here is the error message:
/home/bins/anaconda3/lib/python3.10/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
Traceback (most recent call last):
  File "/home/bins/桌面/CORL/algorithms/any_percent_bc.py", line 406, in <module>
    train()
  File "/home/bins/anaconda3/lib/python3.10/site-packages/pyrallis/argparsing.py", line 158, in wrapper_inner
    response = fn(cfg, *args, **kwargs)
  File "/home/bins/桌面/CORL/algorithms/any_percent_bc.py", line 307, in train
    dataset = d4rl.qlearning_dataset(env)
  File "/home/bins/d4rl/d4rl/__init__.py", line 87, in qlearning_dataset
    dataset = env.get_dataset(**kwargs)
  File "/home/bins/d4rl/d4rl/offline_env.py", line 87, in get_dataset
    with h5py.File(h5path, 'r') as dataset_file:
  File "/home/bins/anaconda3/lib/python3.10/site-packages/h5py/_hl/files.py", line 567, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
  File "/home/bins/anaconda3/lib/python3.10/site-packages/h5py/_hl/files.py", line 231, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 181944320, sblock->base_addr = 0, stored_eof = 474567252)

When I searched, I found that similar problems occur when an .h5 file is not loaded correctly, but I don't know which .h5 file the BC algorithm needs to load. The only code that seems related is the following, at lines 362-365:

if config.load_model != "":
    policy_file = Path(config.load_model)
    trainer.load_state_dict(torch.load(policy_file))
    actor = trainer.actor

But in TrainConfig, line 30:

load_model: str = ""  # Model load file name, "" doesn't load

and the same in medium_expert_v2.yaml, line 10:
load_model: ''

So no model is being loaded.

This seems to be a D4RL problem with dataset loading: the download was not completed for some reason, but the partial file is still there, or permissions are broken somehow. In your traceback, the actual file size (eof = 181944320) is smaller than the size recorded in the HDF5 superblock (stored_eof = 474567252), which points to an interrupted download.

Try deleting the D4RL datasets folder and re-running.
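
If it helps, here is a minimal sketch of that cleanup, assuming the default D4RL cache location (~/.d4rl/datasets, overridable via the D4RL_DATASET_DIR environment variable); it removes any cached file that h5py cannot open, so D4RL re-downloads it on the next run:

import os
from pathlib import Path

import h5py

# Default D4RL cache directory; the D4RL_DATASET_DIR env var overrides it.
cache_dir = Path(os.environ.get("D4RL_DATASET_DIR", str(Path.home() / ".d4rl" / "datasets")))

for h5_file in cache_dir.glob("*.hdf5"):
    try:
        # A truncated download fails right here with the same OSError as above.
        with h5py.File(h5_file, "r"):
            pass
    except OSError:
        print(f"removing truncated file: {h5_file} ({h5_file.stat().st_size} bytes)")
        h5_file.unlink()  # forces a fresh download on the next get_dataset() call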

Also, I would strongly suggest using the Docker environment that we provide, if you have a chance.

Will this problem affect the results of the EDAC algorithm I ran before? On the other hand, EDAC does work here, so is it possible that there is nothing wrong with the D4RL environment?
The following is the EDAC result I got:
[W&B chart from 2023-05-20 11:58:34]

Based on the provided trace, the execution did not reach model loading; it failed earlier.

Which exact config do you use? I will check on my side. I can see that it is medium_expert_v2, but the environment name is missing.

Thank you very much for your willingness to help me. I used the default parameters of the CORL EDAC algorithm, as follows:

[screenshot of the default EDAC config parameters]

The following is my environment configuration, uploaded to Google Drive; please take a look:
https://drive.google.com/file/d/1r2oQaX7HMrJ3HrNhtOS8kyEKnXq8t9gq/view?usp=share_link


Hello, I tried to use Docker to configure the environment today and ran any_percent_bc.py in the container, but something went wrong. The following is the output; the returned score is 82.634, significantly higher than the reference value.
[screenshot of the run output]

Also, in my account there is only this one graph, actor_loss:

[screenshots of the wandb dashboard]
The only training metric recorded in wandb is actor_loss.

Got it. For this particular problem, there is a missing condition that should check whether the checkpoint path is specified.
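
Roughly, the missing guard looks like this. This is only a sketch under my assumptions, not the exact fix from the PR: checkpoints_path follows the CORL config convention (Optional[str], None by default), and trainer/step stand in for the objects in the training loop:

import os
from typing import Optional

import torch

def maybe_save_checkpoint(trainer, checkpoints_path: Optional[str], step: int) -> None:
    # The missing condition: only build the path and save when it is configured.
    if checkpoints_path is not None:
        os.makedirs(checkpoints_path, exist_ok=True)
        torch.save(
            trainer.state_dict(),
            os.path.join(checkpoints_path, f"checkpoint_{step}.pt"),
        )

Without the guard, an unspecified path crashes the run at the first checkpoint, which would also explain why only actor_loss made it into wandb.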

This problem will be resolved in PR #52 (we're planning to merge it within a week or so).

If this is critical for you right now, you can use any_percent_bc.py from this branch.