google-deepmind / open_spiel

OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games.

Cannot resume Alphazero training with torchlib

robinpdev opened this issue

I'm trying to resume training an AlphaZero model as described here

But I receive this error message:
$ ./build/examples/alpha_zero_torch_example $VSC_DATA/shared/robin/os_out/rthex_5x5_2/config.json
Logging directory: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2
Using existing model: /data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2/vpnet.pb
Playing game: rthex
Spiel Fatal Error: /data/gent/465/vscxxxxx/shared/robin/open_spiel/open_spiel/utils/json.cc:67 error == str.substr(0, std::min(30, static_cast<int>(str.size())))
error = Empty string, str.substr(0, std::min(30, static_cast<int>(str.size()))) =

These are the files in the model path:

checkpoint-0-optimizer.pt
checkpoint-0.pt
config.json
learner.jsonl
log-actor-0.txt
log-actor-1.txt
log-actor-2.txt
log-actor-3.txt
log-evaluator-0.txt
log-evaluator-1.txt
log-learner.txt
vpnet.pb

And this is the config.json file:

{
  "actors": 4,
  "checkpoint_freq": 1,
  "cutoff_probability": 0.800000,
  "cutoff_value": 0.950000,
  "devices": "cuda:0,cpu:0",
  "eval_levels": 7,
  "evaluation_window": 100,
  "evaluators": 2,
  "explicit_learning": false,
  "game": "rthex",
  "graph_def": "vpnet.pb",
  "inference_batch_size": 6,
  "inference_cache": 262144,
  "inference_threads": 3,
  "learning_rate": 0.000100,
  "max_simulations": 300,
  "max_steps": 0,
  "nn_depth": 10,
  "nn_model": "resnet",
  "nn_width": 128,
  "path": "/data/gent/465/vscxxxxx/shared/robin/os_out/rthex_5x5_2",
  "policy_alpha": 1.000000,
  "policy_epsilon": 0.250000,
  "replay_buffer_reuse": 3,
  "replay_buffer_size": 65536,
  "temperature": 1.000000,
  "temperature_drop": 10.000000,
  "train_batch_size": 1024,
  "uct_c": 2.000000,
  "weight_decay": 0.000100
}

Any idea what might cause this or how I could resolve it?

Hi @robinpdev ,

Apologies for the lateness on this.

This seems like an error when parsing the JSON. Maybe it's missing a specific entry or has a syntax error.

Have you tried reading that json file separately using utils/json.{h,cc} ?

Did you manually create that .json file or was it printed out from AlphaZero training? Have you modified it?

That error is coming from here:

std::min(30, static_cast<int>(str.size()))));

But that's a general parse error function that is called from multiple places in the file. It would be good to have the full stack trace or at least the token that's causing the issue.

Either way, it would be good to reproduce this in a simpler setting. Can you reproduce the problem in a much simpler main program that only tries to read that .json file, to see if it's the reader itself stumbling?

(To avoid unnecessary dependencies, we built our own simple JSON parser, but we may not have covered a case that is required by your specific .json file. In that case it'd be a quick fix once we isolate the problem.)
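
As a quick first pass before writing a C++ reproduction, here is a minimal sketch that checks the JSON files in the training directory for empty input, since the error above reports "Empty string". It uses Python's standard json module rather than OpenSpiel's parser, so it can only show whether a file is empty or malformed, not whether utils/json.cc accepts it. The function names are hypothetical, and treating learner.jsonl as one JSON value per line is an assumption based on its extension.

```python
import json
import sys
from pathlib import Path


def check_json_text(text):
    """Return a list of problems found in the text of a .json file."""
    if not text.strip():
        return ["file is empty"]
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return [f"parse error: {err}"]
    return []


def check_jsonl_text(text):
    """Return a list of problems in a .jsonl file (assumed: one JSON value per line)."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # An empty line would hand an empty string to a line-by-line parser.
            problems.append(f"line {number}: empty line")
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: parse error: {err}")
    return problems


def check_directory(path):
    """Check every .json and .jsonl file directly inside `path`."""
    report = {}
    for file in sorted(Path(path).iterdir()):
        if file.suffix == ".json":
            report[file.name] = check_json_text(file.read_text())
        elif file.suffix == ".jsonl":
            report[file.name] = check_jsonl_text(file.read_text())
    return report


if __name__ == "__main__":
    # Usage: python check_json.py <path to the training output directory>
    for name, problems in check_directory(sys.argv[1]).items():
        print(name, "OK" if not problems else "; ".join(problems))
```

If every file passes here, the next step would still be the small C++ program suggested above, because only that exercises the OpenSpiel reader itself.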

The JSON file was generated by the alphazero training implementation and was not edited.

However, I got it to work for my current configuration, so this is not an immediate problem anymore. I will look into it more if I encounter this problem again. Thanks for the response.

Hey robin. Did you manage to get it compiled for GPU? I'm really struggling to build the project for GPU.

This seems to be fixed.

Hi, sorry for not responding! I have been away from GitHub for a while and didn't realize there are more people using Libtorch AZ. That's great to hear!

If you have the chance, would you be able to post your solution to this problem? Or maybe even the configuration that caused errors and the configuration you used that resolved the problem?