lightvector / KataGo

GTP engine and self-play learning in Go

Home Page: https://katagotraining.org/


OpenCL KataGo is crashing during self-tuning (Intel Integrated Graphics sometimes buggy)

featurecat opened this issue · comments

In the Lizzie repository, a user is having trouble starting KataGo. The crash happens during self-tuning with the official OpenCL release of KataGo. The issue, which includes screenshots of the command line while it is running, is here: featurecat/lizzie#633

I'll leave this open just for visibility, but just to post an update here for anyone seeing this thread - Intel Integrated Graphics has caused some issues in the past, not just for KataGo but for some other projects too. At least one older version's OpenCL implementation is buggy/incomplete.

In other cases, it's quite possible that it's something I'm missing as well. For example, OpenCL exposes various queryable limits in its API; perhaps KataGo is not respecting one of those limits, and one might expect those limits to be lower for integrated graphics than for discrete graphics cards. But without the ability to reproduce it locally myself, or a user who is technically very experienced and capable of doing some code diving and serious debugging, I don't see a good way to make progress here.

So for now - if you're trying to run KataGo using OpenCL on Intel Integrated Graphics - there is some chance it won't work, although for some users I think it actually has worked too. If you are encountering such an error in exactly this case, and you are experienced at debugging and willing to try compiling KataGo yourself and to edit the code or test things out, let me know.
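To make the idea of "respecting queryable limits" concrete, here is a minimal Python sketch of the kind of check involved. It is not KataGo's code. The `DeviceLimits` class and `check_launch` function are hypothetical names, and the numbers in the usage lines are made up for illustration rather than measured on any device. The three fields correspond to real OpenCL device queries (`CL_DEVICE_MAX_WORK_GROUP_SIZE`, `CL_DEVICE_MAX_WORK_ITEM_SIZES`, `CL_DEVICE_LOCAL_MEM_SIZE`) that a program would read via `clGetDeviceInfo`.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DeviceLimits:
    """Hypothetical holder for values read via clGetDeviceInfo."""
    max_work_group_size: int            # CL_DEVICE_MAX_WORK_GROUP_SIZE
    max_work_item_sizes: Sequence[int]  # CL_DEVICE_MAX_WORK_ITEM_SIZES
    local_mem_size: int                 # CL_DEVICE_LOCAL_MEM_SIZE, in bytes


def check_launch(limits: DeviceLimits,
                 local_size: Sequence[int],
                 local_mem_bytes: int) -> List[str]:
    """Return a list of limit violations for a proposed kernel launch."""
    problems = []
    # The work-group may not use more dimensions than the device reports.
    if len(local_size) > len(limits.max_work_item_sizes):
        problems.append("too many work-group dimensions")
    # Each dimension has its own cap.
    for dim, (size, cap) in enumerate(zip(local_size, limits.max_work_item_sizes)):
        if size > cap:
            problems.append(f"dimension {dim}: {size} > {cap}")
    # The product of all dimensions has a separate cap.
    total = math.prod(local_size)
    if total > limits.max_work_group_size:
        problems.append(f"work-group size {total} > {limits.max_work_group_size}")
    # Local (shared) memory requested by the kernel must fit on the device.
    if local_mem_bytes > limits.local_mem_size:
        problems.append(f"local memory {local_mem_bytes} > {limits.local_mem_size}")
    return problems


# Illustrative numbers only, not taken from any real device.
small_device = DeviceLimits(256, (256, 256, 256), 32 * 1024)
print(check_launch(small_device, (16, 16), 16 * 1024))  # no violations
print(check_launch(small_device, (32, 32), 16 * 1024))  # total size too large
```

A tuner that tries many kernel configurations could skip any configuration for which such a check reports a problem, instead of submitting it to the driver.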

I recently ran into a similar problem with integrated graphics on my install of KataGo. I believe I have found a workaround; it will probably hurt performance heavily, but it at least works. Every time I attempted to run KataGo, it would crash while trying to tune. This seems to come from some bug or incompatibility involving the graphics driver; whether the problem lies between your code and the card/driver or on Intel's side, I don't know. But if you run genconfig and make sure to select only the CPU device itself (probably device 1, as it was in my case, if you are on CPU/integrated), the tuning phase will not crash and will record its results appropriately. I can continue to look into this, but I am not good at lower-level code or C++ and know very little about drivers; it is all far outside what I normally do. But I can give it a try if you'd like.

tl;dr Integrated graphics seem to cause a large issue, as you know. Running genconfig and setting the devices to the CPU only (not the integrated graphics) will probably have an effect on performance, but it prevents the crashing during tuning. I can continue to look into this issue if you'd like, but it isn't what I'm good at, so it may take a while and I could quite honestly come up empty-handed.

Thank you for your great work!

Changed title just to be clearer for people browsing issues.

I am using an AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) GPU and it seems to be stuck in tuning; it does not output anything more after the last line displayed below. In fact, I cannot even kill it via Ctrl+C while it is "testing configurations".

I am using the OpenCL build, but it is built from source with the parameters cmake . -DBUILD_MCTS=1 -DUSE_BACKEND=OPENCL, and I am on Arch Linux. Do let me know if you need more information.

2020-05-03 14:30:28+0800: Loading model and initializing benchmark...

Running quick initial benchmark at 16 threads!
2020-05-03 14:30:28+0800: nnRandSeed0 = 8513347526626713128
2020-05-03 14:30:28+0800: After dedups: nnModelFile0 = /usr/share/katago/networks/weights-b30.bin.gz useFP16 auto useNHWC auto
2020-05-03 14:30:29+0800: Found OpenCL Platform 0: Clover (Mesa) (OpenCL 1.1 Mesa 20.0.6)
2020-05-03 14:30:29+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2020-05-03 14:30:29+0800: Found OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) (score 11000101)
2020-05-03 14:30:29+0800: Using OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) OpenCL 1.1 Mesa 20.0.6
2020-05-03 14:30:29+0800: No existing tuning parameters found or parseable or valid at: /home/syx/.katago/opencltuning/tune6_gpuAMDRAVENDRM3360568arch11LLVM1000_x19_y19_c320_mv8.txt
2020-05-03 14:30:29+0800: Performing autotuning
2020-05-03 14:30:29+0800: Found OpenCL Platform 0: Clover (Mesa) (OpenCL 1.1 Mesa 20.0.6)
2020-05-03 14:30:29+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2020-05-03 14:30:29+0800: Found OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) (score 11000101)
2020-05-03 14:30:29+0800: Using OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) OpenCL 1.1 Mesa 20.0.6
Setting winograd3x3TileSize = 4
------------------------------------------------------
Tuning xGemmDirect for 1x1 convolutions and matrix mult
Testing 56 different configs
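When triaging reports like this one, it can help to pull the OpenCL platform and device lines out of a pasted log before reading the rest. The Python sketch below does that for the line formats seen in the log above. The function name `summarize_opencl_log` and the returned dictionary layout are hypothetical, and the regular expressions assume the log wording stays as shown here, which may not hold for other KataGo versions.

```python
import re
from typing import Any, Dict

# Patterns match the wording in the log above; other versions may differ.
PLATFORM_RE = re.compile(r"Found OpenCL Platform (\d+): (.+)$")
DEVICE_RE = re.compile(r"Found OpenCL Device (\d+): (.+?) \(score (\d+)\)$")


def summarize_opencl_log(log_text: str) -> Dict[str, Any]:
    """Collect platform and device entries from a KataGo-style log."""
    platforms: Dict[int, str] = {}
    devices: Dict[int, Dict[str, Any]] = {}
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = PLATFORM_RE.search(line)
        if match:
            platforms[int(match.group(1))] = match.group(2)
            continue
        match = DEVICE_RE.search(line)
        if match:
            devices[int(match.group(1))] = {
                "name": match.group(2),
                "score": int(match.group(3)),
            }
    return {
        "platforms": platforms,
        "devices": devices,
        # Clover is the Mesa OpenCL platform name that appears in the log above.
        "uses_mesa_clover": any("Clover" in name for name in platforms.values()),
    }


SAMPLE_LOG = """\
2020-05-03 14:30:29+0800: Found OpenCL Platform 0: Clover (Mesa) (OpenCL 1.1 Mesa 20.0.6)
2020-05-03 14:30:29+0800: Found 1 device(s) on platform 0 with type CPU or GPU or Accelerator
2020-05-03 14:30:29+0800: Found OpenCL Device 0: AMD RAVEN (DRM 3.36.0, 5.6.8-arch1-1, LLVM 10.0.0) (AMD) (score 11000101)
"""

summary = summarize_opencl_log(SAMPLE_LOG)
print(summary["platforms"])
print(summary["devices"])
print(summary["uses_mesa_clover"])
```

Because the log repeats the platform and device lines during autotuning, the sketch keys entries by index so that repeated lines overwrite rather than duplicate each other.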

@SyxP - I think this is not related to Intel Integrated Graphics so it might have been better in a separate issue. Or I guess this is also fine. Anyways the error you encountered is also a different known issue and in your case is possibly fixable.

I just now pushed a section in the main readme about things like this. Take a look at the entry on OpenCL Mesa there.

Hope that helps! :)

Thanks, this was indeed the issue, and it now works!

I had (and still have) this Intel GPU tuning crash with the old v1.2, but the changes made in the OpenCL code since then seem to have fixed it: both tuning and games are working fine on v1.7.0.
Running on an Atom x5-Z8350 with Intel HD 400 and the latest driver.