Help: dummy_run stuck

Question

smart-fr opened this issue 2 years ago

Hello,

When running dummy_run on my game, AlphaZero seems to get stuck during the initial benchmark, which plays AlphaZero against MCTS.
If I debug the execution, I can narrow it down to the location illustrated below, where pressing F11 (step into) or F10 (step over) has no effect: strangely enough, the debugger remains in the "Paused on breakpoint" state instead of executing the TaskLocalState() constructor with no arguments.

If I run dummy_run on tictactoe, for example, I obtain the expected result. Hence, it doesn't look like an obvious compatibility issue.

But I suspect the model for my game is huge, which may mean that I can't reach my goal of creating an agent for my game with AlphaZero.jl.

I would very much appreciate some guidance as to how I could precisely assess what's wrong with my instance, and in which direction I should look to try and make it work.

Thanks in advance!

Stéphane Martin · Answer 1 · Wed Jan 18 2023 17:03:03 GMT+0800 (China Standard Time)

I'll add a piece of information: the call stack seems to indicate that we are in the middle of a GPU memory allocation. However, my GPU memory is almost empty while this phenomenon occurs.

Stéphane Martin · Answer 2 · Wed Jan 18 2023 20:00:28 GMT+0800 (China Standard Time)

Adding a breakpoint inside TaskLocalState() allowed me to dive further, and to discover that the culprit is the following instruction:
math_mode = something(default_math_mode[], Base.JLOptions().fast_math==1 ? FAST_MATH : DEFAULT_MATH)
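For readers unfamiliar with `something`, here is a minimal sketch of what that instruction evaluates, runnable without a GPU. The enum and the `Ref` are stand-ins assumed to mirror the CUDA.jl definitions, and `pick_math_mode` is a hypothetical helper introduced only for illustration.

```julia
# Stand-ins for the CUDA.jl definitions (assumptions, so the snippet runs without CUDA).
@enum MathMode PEDANTIC_MATH DEFAULT_MATH FAST_MATH
const default_math_mode = Ref{Union{Nothing,MathMode}}(nothing)

# Hypothetical helper wrapping the instruction quoted above.
function pick_math_mode()
    # `something` returns its first argument that is not `nothing`.
    return something(default_math_mode[],
                     Base.JLOptions().fast_math == 1 ? FAST_MATH : DEFAULT_MATH)
end

# No explicit default set: the result follows the --math-mode command-line option.
mode_from_flag = pick_math_mode()

# Explicit default set: it takes precedence over the command-line option.
default_math_mode[] = PEDANTIC_MATH
mode_from_default = pick_math_mode()
```

In this sketch the expression only reads a `Ref` and a process option, so on its own it should not need to allocate GPU memory.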

If I modify lib\cudadrv\state.jl to replace this instruction with the simple assignment math_mode = DEFAULT_MATH, that assignment does not execute either, and the debugger stays frozen at "Paused on breakpoint". How can that be?

I understand this issue is beyond the scope of AlphaZero.jl and likely concerns the CUDA drivers, but maybe someone has encountered it and found a workaround?

And what puzzles me the most is the fact that tictactoe works perfectly, meaning there might be a fix through the game code or model...

Stéphane Martin · Answer 3 · Wed Jan 18 2023 20:32:06 GMT+0800 (China Standard Time)

I believe my issue is a glitch introduced by the debugger.
Closing this issue now and opening a different one for a problem that appears when executing dummy_run without the debugger.