xuhuisheng / rocm-build

build scripts for ROCm

3.9 crashes during building on gfx803 with me but 3.10 does not crash.

crypt0miester opened this issue

hey man, firstly, thanks for the work.

it has been two days for me trying to build rocm for tensorflow.

I got to the point of despair and raising this issue.

my setup:
GPU: Sapphire Radeon RX570 4GB
CPU: Intel Celeron
RAM: 8GB

my question is, do you have the 3.10 rocSPARSE version which would work on gfx803?

I tried building your version but it was for 3.9 right?

I still get the hipErrorNoBinaryForGpu issue even after rebuilding your version of the rocSPARSE

anything would be helpful. Thanks

ROCm-3.10 is the same as ROCm-3.9 here. You could clone https://github.com/ROCmSoftwarePlatform/rocSPARSE, check out 3.10.x, and move AMDGPU_TARGETS before the include. Then rebuild rocSPARSE.
ROCm-4.0 is the same, too.
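
The "move AMDGPU_TARGETS before the include" step above can be sketched as a small text edit. This is only an illustration: the real content of rocSPARSE's CMakeLists.txt is not shown in this thread, so the sample lines, the function name move_targets_before_include, and the assumption that the setting sits on a single line are all hypothetical.

```python
def move_targets_before_include(cmake_text: str) -> str:
    """Move the first line mentioning AMDGPU_TARGETS above the first include() line.

    Assumption: the setting fits on one line. Multi-line set(...) calls and
    comment lines that merely mention AMDGPU_TARGETS are not handled.
    """
    lines = cmake_text.splitlines()
    target_idx = next(
        (i for i, line in enumerate(lines) if "AMDGPU_TARGETS" in line), None
    )
    include_idx = next(
        (i for i, line in enumerate(lines)
         if line.lstrip().lower().startswith("include(")),
        None,
    )
    # Nothing to do if either line is missing or the order is already right.
    if target_idx is None or include_idx is None or target_idx < include_idx:
        return cmake_text
    lines.insert(include_idx, lines.pop(target_idx))
    return "\n".join(lines) + ("\n" if cmake_text.endswith("\n") else "")


# Hypothetical stand-in for a CMakeLists.txt, not the real rocSPARSE file.
sample = (
    "cmake_minimum_required(VERSION 3.5)\n"
    "include(cmake/Dependencies.cmake)\n"
    'set(AMDGPU_TARGETS "gfx803" CACHE STRING "GPU targets")\n'
    "project(example)\n"
)
print(move_targets_before_include(sample))
```

Editing the file by hand works just as well; the sketch only makes the intended line order explicit.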

Excellent. I will try that and get back to you. I have tried your check.sh:
the rocBLAS check is "core dumped". Have you encountered this issue before?

btw, should I do a full reinstallation after these errors? or just rebuild rocSPARSE?

so I got rocSPARSE to work, but the rocBLAS issue didn't resolve itself. lol

/rocm-build/check $ sudo bash check.sh 
check.sh: line 9:  2204 Illegal instruction     (core dumped) ./build/hello_rocblas
[rocFFT]    1.0.8.966-rocm-rel-3.10-27-2d35fd6
[rocPRIM]   201005
[rocRAND]   201006
[rocSPARSE] 101800
[rccl]      2708
check.sh: line 33:  2459 Illegal instruction     (core dumped) ./build/hello_miopen
check.sh: line 37:  2500 Illegal instruction     (core dumped) ./build/hello_rocsolver
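
To see at a glance which components crashed and which reported a version, output like the above can be summarized with a short script. This is a sketch that assumes only the two line shapes visible in this log; the function name summarize_check_output is hypothetical and not part of rocm-build.

```python
import re

# Shapes taken from the log above:
#   check.sh: line N:  PID <signal text>  (core dumped) ./build/hello_<name>
#   [<component>]   <version>
CRASH_RE = re.compile(
    r"^check\.sh: line \d+:\s+\d+\s+(.+?)\s+\(core dumped\)\s+\./build/hello_(\w+)$"
)
VERSION_RE = re.compile(r"^\[(\w+)\]\s+(\S+)$")


def summarize_check_output(text):
    """Return (crashed, versions): crashed maps name -> signal text,
    versions maps component -> reported version string."""
    crashed, versions = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        match = CRASH_RE.match(line)
        if match:
            crashed[match.group(2)] = match.group(1)
            continue
        match = VERSION_RE.match(line)
        if match:
            versions[match.group(1)] = match.group(2)
    return crashed, versions


log = """
check.sh: line 9:  2204 Illegal instruction     (core dumped) ./build/hello_rocblas
[rocFFT]    1.0.8.966-rocm-rel-3.10-27-2d35fd6
[rocPRIM]   201005
[rocRAND]   201006
[rocSPARSE] 101800
[rccl]      2708
check.sh: line 33:  2459 Illegal instruction     (core dumped) ./build/hello_miopen
check.sh: line 37:  2500 Illegal instruction     (core dumped) ./build/hello_rocsolver
"""

crashed, versions = summarize_check_output(log)
print("crashed:", sorted(crashed))
print("ok:", sorted(versions))
```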

managed to solve a lot of issues. now tensorflow just fails with "Illegal instruction (core dumped)"

is it because of rocBLAS?

Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.add(3,5)
2021-02-17 18:09:54.024738: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libamdhip64.so
2021-02-17 18:09:54.554061: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: Ellesmere [Radeon RX 470/480/570/570X/580/580X/590]     ROCm AMD GPU ISA: gfx803
coreClock: 1.34GHz coreCount: 32 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.31GiB/s
Illegal instruction (core dumped)

Actually, I haven't met this Illegal instruction error on my RX580 8G.

But I still suggest rebuilding rocBLAS with -DBUILD_WITH_TENSILE_HOST=OFF. I uploaded https://github.com/xuhuisheng/rocm-build/blob/rocm-4.1.x/gfx803/22.rocblas.sh, please try it.

BUILD_WITH_TENSILE_HOST=OFF disables the asm scripts and uses the old C-language GEMM implementation instead. I believe the new asm GEMM has issues on gfx803, but so far I cannot find the exact cause.

I think I bricked my gpu. I will try with another gpu. (atiflash/amdvbflash -i failed to test it)

I'll close this issue. I will reopen it if I find the issue is still unresolved.

Thanks mate.

Actually, before rebuilding rocBLAS, I often crashed my RX580 when running the language model example from pytorch-examples: https://github.com/pytorch/examples/tree/master/word_language_model
My solution is to wait a while and reset the compute; then the GPU will wake up.

how do I reset the compute?

I mean shut down the power and reboot.

alright. will get back to you.

I was able to fix the GPU issue.

you are correct, perhaps it is a rocBLAS issue.

I tried using your fix, but I got this when patching:

error: patch failed: library/src/blas_ex/rocblas_gemm_ext2.hpp:4
error: library/src/blas_ex/rocblas_gemm_ext2.hpp: patch does not apply

I used

repo init -u https://github.com/RadeonOpenCompute/ROCm.git -b roc-4.0.x
repo sync

because 4.1.x gives

manifests:
fatal: couldn't find remote ref refs/heads/roc-4.1.x

any solutions?

OK. It seems this patch, which targets the upcoming ROCm-4.1, does not apply to ROCm-4.0.
Which version do you want? I will make a patch for that version. Or you can just edit library/src/blas_ex/rocblas_gemm_ext2.hpp and move #include "rocblas_gemm_ex.hpp" out of the #ifdef USE_TENSILE_HOST block.

This allows building with USE_TENSILE_HOST=OFF; otherwise the build reports an error that some functions cannot be found.
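
The manual edit described above can be sketched as a line move. The layout of rocblas_gemm_ext2.hpp is not shown in this thread, so the sample header and the function name hoist_include are hypothetical, and the sketch does not check where the matching #endif is.

```python
def hoist_include(
    header_text,
    include_line='#include "rocblas_gemm_ex.hpp"',
    guard="#ifdef USE_TENSILE_HOST",
):
    """Move include_line to just above the guard line if it currently comes after it.

    Simplification: the include is moved whenever it follows the guard, without
    verifying that it really sits before the guard's matching #endif.
    """
    lines = header_text.splitlines()
    stripped = [line.strip() for line in lines]
    if guard not in stripped or include_line not in stripped:
        return header_text
    guard_idx = stripped.index(guard)
    include_idx = stripped.index(include_line)
    if include_idx < guard_idx:
        return header_text  # already above the guard
    lines.insert(guard_idx, lines.pop(include_idx))
    return "\n".join(lines) + ("\n" if header_text.endswith("\n") else "")


# Hypothetical header excerpt, only to show the before/after order.
sample_header = (
    "#pragma once\n"
    "#ifdef USE_TENSILE_HOST\n"
    '#include "rocblas_gemm_ex.hpp"\n'
    "#endif\n"
)
print(hoist_include(sample_header))
```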

I will reopen this issue.

modified and removed rm -rf $ROCM_GIT_DIR/rocBLAS/library/src/blas3/Tensile/Logic/asm_full/r9nano*

let's see, wish me luck. :)

still getting the same thing after building. the build was successful too.

maybe this is a kernel issue?

which kernel version are you using?

I am using 5.4.0-65-generic

this is

It's weird, because hello-rocblas does nothing but load librocblas.so and print a version string. My environment is ubuntu-20.04.1 with linux-5.4.0-64.

You can verify whether it is a kernel problem by running the hip sample: https://github.com/xuhuisheng/rocm-build/blob/rocm-4.1.x/check/run-hip.sh. The hip square sample does not use any rocm-libs component; it just runs a simple kernel function. If the hip sample does not throw errors, we can tell that the kernel and hip levels are working.

After some searching, I found reports that Illegal instruction may be caused by toolchain cross-compiling. I suggest using docker to prepare a clean ubuntu:20.04 and installing ROCm there. check.sh should report the rocblas version correctly, even without a rebuild.

alright. I will try to use linux-5.4.0-64. and will come back to you.

I got this:

$ dmesg | grep amd
[    0.000000] Linux version 5.4.0-64-generic (buildd@lcy01-amd64-021) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #72-Ubuntu SMP Fri Jan 15 10:27:54 UTC 2021 (Ubuntu 5.4.0-64.72-generic 5.4.78)
[    2.859037] amdkcl: loading out-of-tree module taints kernel.
[    2.859062] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[    3.072508] amdgpu: Unknown symbol amd_iommu_bind_pasid (err -2)
[    3.072678] amdgpu: Unknown symbol amd_iommu_set_invalidate_ctx_cb (err -2)
[    3.072796] amdgpu: Unknown symbol amd_iommu_free_device (err -2)
[    3.080007] amdgpu: Unknown symbol amd_iommu_unbind_pasid (err -2)
[    3.080037] amdgpu: Unknown symbol amd_iommu_init_device (err -2)
[    3.080279] amdgpu: Unknown symbol amd_iommu_set_invalid_ppr_cb (err -2)

on linux-5.4.0-64

I guess I should move on. it took me a week doing this.

you can close the issue.

cheers xuhu.

the issue seems to be in MIOpen.

when I do sh run-miopen.sh, I get:

Illegal instruction (core dumped)

which python version are you using?

Using Python-3.8.5, which is the default python version of ubuntu-20.04.1.

Do you have an APU in this computer? Somebody said there is a bug in environments that have both an APU and a GPU.
Please refer to this issue: ROCm/ROCm#1306 (comment)

Try /opt/rocm/bin/rocminfo to check whether there are both an APU and a GPU.
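
The rocminfo check can also be scripted. The sketch below assumes that rocminfo prints one "Device Type:" line per agent with values such as CPU or GPU; the function names are hypothetical. More than one GPU agent may mean an APU's integrated GPU is present next to the discrete card, though it could equally be a second discrete GPU.

```python
import subprocess


def count_agents_by_type(rocminfo_text):
    """Count agents per device type, based on 'Device Type: <kind>' lines."""
    counts = {}
    for line in rocminfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Device Type":
            kind = value.strip()
            counts[kind] = counts.get(kind, 0) + 1
    return counts


def read_rocminfo(path="/opt/rocm/bin/rocminfo"):
    """Run rocminfo and return its standard output as text."""
    result = subprocess.run([path], capture_output=True, text=True, check=True)
    return result.stdout


# Usage on a machine with ROCm installed:
#   counts = count_agents_by_type(read_rocminfo())
#   if counts.get("GPU", 0) > 1:
#       print("more than one GPU agent found, possibly APU + discrete GPU")
```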

no APUs. I will try to build with rocm-4.

if that doesn't work I guess I'll have to find a way for it to work with 3.5 and below.

I suggest installing ROCm-4.0 and running check.sh.
If there is still an Illegal instruction error, there is no need to rebuild rocblas.

Rebuilding only solves the gfx803 issue; the Illegal instruction error could have another cause.

I still got Illegal instruction with ROCm-4.0 😞

trying with 3.3 now. everything is working
but I am trying to figure out which tensorflow to use. tensorflow-rocm==2.2.0 and 2.3.0 did not work.
ImportError: "libamdhip64.so.3": cannot open shared object file: No such file or directory
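
An ImportError like this means the loader could not find a file with that exact name. One way to see which libamdhip64 files an installation provides is to list them. This is a sketch; the directory /opt/rocm/lib is an assumption about the install layout and the function name find_hip_runtime_libs is hypothetical.

```python
from pathlib import Path


def find_hip_runtime_libs(lib_dir="/opt/rocm/lib"):
    """Return the sorted names of libamdhip64.so* files in lib_dir (empty if none)."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("libamdhip64.so*"))


# Assumed default location; pass another directory if ROCm lives elsewhere.
print(find_hip_runtime_libs())
```

If the list is empty or lacks the requested version suffix, the installed ROCm release and the tensorflow-rocm wheel probably do not match.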

I tested tensorflow-rocm==2.2.0rc5 locally and it worked with ROCm-3.3.

And when you have time, could you use docker to install an ubuntu:20.04 image and test ROCm-4.0 with check.sh? Thank you.

I got it working with ROCm-3.3 and tensorflow-rocm==2.2.0

will do that when I have time. Thanks xuhu.