jonathan-laurent / AlphaZero.jl

A generic, simple and fast implementation of Deepmind's AlphaZero algorithm.

Home Page: https://jonathan-laurent.github.io/AlphaZero.jl/stable/

Help: dummy_run stuck

smart-fr opened this issue

Hello,

When running dummy_run on my game, AlphaZero seems to get stuck during the initial benchmark, which plays AlphaZero against MCTS.
If I debug the execution, I can narrow it down to the location illustrated below, where pressing F11 to step into (or F10 to step over) has no effect: strangely enough, the debugger remains in the "Paused on breakpoint" state instead of executing the TaskLocalState() constructor with no arguments.

[Screenshot: dummy_run_stuck]

If I run dummy_run on tictactoe, for example, I get the expected result, so this doesn't look like an obvious compatibility issue.

However, I suspect the model for my game is huge, which may mean that I can't reach my goal of creating an agent for my game with AlphaZero.jl.

I would very much appreciate some guidance on how to assess precisely what is wrong with my instance, and on where to look to make it work.

Thanks in advance!

One more piece of information: the call stack seems to indicate that execution is in the middle of a GPU memory allocation. However, my GPU memory is almost empty when this happens.

Adding a breakpoint inside TaskLocalState() let me dig further and discover that the culprit is the following instruction:
math_mode = something(default_math_mode[], Base.JLOptions().fast_math==1 ? FAST_MATH : DEFAULT_MATH)
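
For readers unfamiliar with `something`, here is a minimal sketch, using only Base Julia, of how an instruction of this shape picks its value: `something` returns its first argument that is not `nothing`. The names `MathMode`, `default_math_mode`, and `pick_math_mode` below are stand-ins defined for illustration, assumed to mirror the quoted line; they are not CUDA.jl's actual definitions, and the sketch does not load CUDA.jl.

```julia
# Hypothetical stand-ins for the names in the quoted instruction.
@enum MathMode DEFAULT_MATH FAST_MATH

# A global override that stays `nothing` until someone sets it.
const default_math_mode = Ref{Union{Nothing,MathMode}}(nothing)

# `something(a, b)` returns `a` unless it is `nothing`, in which case it returns `b`.
# The fallback depends on the math-mode option Julia was started with.
pick_math_mode() = something(
    default_math_mode[],
    Base.JLOptions().fast_math == 1 ? FAST_MATH : DEFAULT_MATH,
)

mode_unset = pick_math_mode()     # no override: falls back to the startup option
default_math_mode[] = FAST_MATH   # set the override
mode_set = pick_math_mode()       # override wins, whatever the startup option was
```

Nothing in this expression allocates GPU memory or blocks, which is consistent with the freeze coming from somewhere other than the instruction itself.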

If I modify lib\cudadrv\state.jl to replace this instruction with the plain assignment math_mode = DEFAULT_MATH, that assignment does not execute either, and the debugger stays frozen in the "Paused on breakpoint" state. How can that be?

[Screenshot: cuda_stuck]

I understand this issue is beyond the scope of AlphaZero.jl and likely concerns the CUDA drivers, but maybe someone has encountered it and found a workaround?

What puzzles me most is that tictactoe works perfectly, which suggests there might be a fix in the game code or the model.

I believe this issue is a glitch introduced by the debugger.
I'm closing it now and opening a different one for a problem that appears when running dummy_run without the debugger.