filecoin-project / filecoin-ffi

C and CGO bindings for Filecoin's Rust libraries

GPU FFT failed! Falling back to CPU... Error: Ocl

stuberman opened this issue

Describe the bug
The GPU throws an error when SNARK processing starts after being scheduled during the PC2 phase. It recovers after about 4 minutes.

To Reproduce

2020-08-07T23:33:32.275 INFO storage_proofs_core::compound_proof > vanilla_proof:finish
2020-08-07T23:33:32.308 INFO storage_proofs_core::compound_proof > snark_proof:start
2020-08-07T23:33:32.418 INFO bellperson::groth16::prover > Bellperson 0.9.2 is being used!
2020-08-07T23:34:30.234Z INFO markets loggers/loggers.go:18 storage event {"name": "ProviderEventDealPublished", "proposal CID": "bafyreibtp5t6ffsm5q5uodattf2xc6zs2nawi2nqdhv7knoww4dlqgi7rm", "state": "StorageDealStaged", "message": ""}
2020-08-07T23:34:30.316Z INFO rpc go-jsonrpc@v0.1.1-0.20200602181149-522144ab4e24/client.go:204 rpc output message buffer {"n": 2}
2020-08-07T23:34:30.397Z INFO sectors storage-fsm@v0.0.0-20200730122205-d423ae90d8d4/sealing.go:123 Adding piece for deal 14602
2020-08-07T23:34:30.954Z INFO rpc go-jsonrpc@v0.1.1-0.20200602181149-522144ab4e24/client.go:204 rpc output message buffer {"n": 3}
2020-08-07T23:35:12.446 DEBUG bellperson::gpu::locks > Acquiring priority lock...
2020-08-07T23:35:12.446 DEBUG bellperson::gpu::locks > Priority lock acquired!
2020-08-07T23:35:12.660 INFO bellperson::gpu::locks > GPU is available for FFT!
2020-08-07T23:35:12.660 DEBUG bellperson::gpu::locks > Acquiring GPU lock...
2020-08-07T23:35:12.660 DEBUG bellperson::gpu::locks > GPU lock acquired!
2020-08-07T23:35:12.840 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2020-08-07T23:35:12.840 INFO bellperson::gpu::fft > FFT: Device 0: GeForce GTX 1080 Ti
2020-08-07T23:35:12.840 INFO bellperson::domain > GPU FFT kernel instantiated!
2020-08-07T23:35:13.376 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("radix_fft")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################

[Repeats six times then]

2020-08-07T23:38:39.215 INFO storage_proofs_porep::stacked::vanilla::proof > persisting base tree_c 2/8 of length 153391689
2020-08-07T23:39:04.780 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2020-08-07T23:39:04.780 DEBUG bellperson::gpu::locks > Acquiring GPU lock...
2020-08-07T23:39:04.780 DEBUG bellperson::gpu::locks > GPU lock acquired!
2020-08-07T23:39:05.020 INFO bellperson::gpu::multiexp > Multiexp: 1 working device(s) selected. (CPU utilization: 0)
2020-08-07T23:39:05.020 INFO bellperson::gpu::multiexp > Multiexp: Device 0: GeForce GTX 1080 Ti (Chunk-size: 6167411)
2020-08-07T23:39:05.020 INFO bellperson::multiexp > GPU Multiexp kernel instantiated!

Expected behavior
I would like to see the GPU engage cleanly in the C2 phase.

Screenshots
Error description from reference:

CL_MEM_OBJECT_ALLOCATION_FAILURE if there is a failure to allocate memory for data store associated with image or buffer objects specified as arguments to kernel.

BEFORE
nvidia-smi

Fri Aug 7 23:28:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 58% 80C P2 232W / 250W | 2003MiB / 11170MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3913035 C lotus-miner 2001MiB |
+-----------------------------------------------------------------------------+

AFTER
nvidia-smi

Fri Aug 7 23:55:50 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 55% 76C P2 228W / 250W | 3675MiB / 11170MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3913035 C lotus-miner 3673MiB |
+-----------------------------------------------------------------------------+
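For reference, the arithmetic behind the two snapshots can be sketched as below. The numbers are copied from the nvidia-smi output above; the helper name `headroom` is made up for illustration. Note that nvidia-smi only reports usage at the instant it runs, so this says nothing about peak usage at the moment the OpenCL allocation failed.

```python
# Values copied from the BEFORE and AFTER nvidia-smi snapshots (MiB).
TOTAL_MIB = 11170
USED_BEFORE_MIB = 2003
USED_AFTER_MIB = 3675


def headroom(total_mib: int, used_mib: int) -> int:
    """Return device memory not reported as in use, in MiB (hypothetical helper)."""
    return total_mib - used_mib


growth = USED_AFTER_MIB - USED_BEFORE_MIB
print(f"growth between snapshots: {growth} MiB")
print(f"headroom before: {headroom(TOTAL_MIB, USED_BEFORE_MIB)} MiB")
print(f"headroom after:  {headroom(TOTAL_MIB, USED_AFTER_MIB)} MiB")
```

Both snapshots show several GiB reported as free, so the snapshots alone do not explain a CL_MEM_OBJECT_ALLOCATION_FAILURE.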

Version (run lotus version):
lotus version

Daemon: 0.4.3+git.460723d2+api0.8.1
Local: lotus version 0.4.3+git.460723d2

Additional context
Running Ubuntu 20.04 server
GeForce GTX 1080 Ti running latest NVIDIA driver 450.57

Related ENV variables used when building Lotus:
FIL_PROOFS_MAXIMIZE_CACHING=1
BELLMAN_CUSTOM_GPU=GeForce GTX 1080 Ti:3584
FFI_BUILD_FROM_SOURCE=1
RUSTFLAGS=-C target-cpu=native -g
FIL_PROOFS_USE_GPU_TREE_BUILDER=1
RUST_BACKTRACE=full
RUST_LOG=debug
FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1
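As a quick sanity check on the BELLMAN_CUSTOM_GPU value, here is a minimal sketch that splits it into a device name and a core count. It assumes the value has the form `name:cores`, as the values in this thread suggest. `parse_custom_gpu` is a hypothetical helper and not bellperson's actual parser.

```python
from typing import Tuple


def parse_custom_gpu(value: str) -> Tuple[str, int]:
    """Split an assumed 'name:cores' string on its last colon."""
    name, sep, cores = value.rpartition(":")
    if not sep or not name.strip():
        raise ValueError(f"expected 'name:cores', got {value!r}")
    # int() raises ValueError if the core count is not a whole number.
    return name.strip(), int(cores)


# Values as they appear in this thread.
for raw in (
    "GeForce GTX 1080 Ti:3584",
    "GeForce GTX 1070 Ti:2432",
    "GeForce GTX 1070:1920",
):
    name, cores = parse_custom_gpu(raw)
    print(f"{name!r} -> {cores} cores")
```

Presumably the name has to match the device name shown in the logs (for example "Device 0: GeForce GTX 1080 Ti"), so a stray space or typo there is worth ruling out first.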

I have the same problem

I have the same problem

Additional info: in my experience, whatever is locking the GPU is almost always released 45 minutes into the C2 cycle, and then the GPU lock for C2 succeeds.

Quick edit: I monitored a few more C2 cycles, and the worker/miner is finally able to lock the GPU between exactly the 45- and 46-minute marks. Not sure what's magic about the 45-minute mark, but there's something there. I will note that the two machines that have the issue have a 1070 Ti and a 1070.

#export BELLMAN_CUSTOM_GPU="GeForce GTX 1070 Ti:2432"
#export BELLMAN_CUSTOM_GPU="GeForce GTX 1070:1920"

2020-08-28T16:30:54.490 INFO filecoin_proofs::api::seal > got groth params (34359738368) while sealing
2020-08-28T16:30:54.490 INFO filecoin_proofs::api::seal > snark_proof:start
2020-08-28T16:30:54.506 INFO bellperson::groth16::prover > Bellperson 0.9.2 is being used!
2020-08-28T16:40:49.518 INFO bellperson::gpu::locks > GPU is available for FFT!
2020-08-28T16:40:49.521 DEBUG bellperson::gpu::locks > Acquiring GPU lock...
2020-08-28T16:40:49.521 DEBUG bellperson::gpu::locks > GPU lock acquired!
2020-08-28T16:40:50.321 INFO bellperson::gpu::fft > FFT: 1 working device(s) selected.
2020-08-28T16:40:50.322 INFO bellperson::gpu::fft > FFT: Device 0: GeForce GTX 1070
2020-08-28T16:40:50.323 INFO bellperson::domain > GPU FFT kernel instantiated!
2020-08-28T16:41:07.329 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error: 

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueNDRangeKernel("radix_fft")  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors  

############################################################################# 

2020-08-28T16:41:41.601 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error: 

################################ OPENCL ERROR ############################### 

<LOOPS THIS ERROR FOR 45 MINUTES>

############################################################################# 

2020-08-28T17:15:53.093 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error: 

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueNDRangeKernel("radix_fft")  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors  

############################################################################# 

2020-08-28T17:16:17.902 DEBUG bellperson::gpu::locks > GPU lock released!
2020-08-28T17:16:58.030 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2020-08-28T17:16:58.034 DEBUG bellperson::gpu::locks > Acquiring GPU lock...
2020-08-28T17:16:58.034 DEBUG bellperson::gpu::locks > GPU lock acquired!
2020-08-28T17:16:58.675 INFO bellperson::gpu::utils > Adding "GeForce GTX 1070" to GPU list with 1920 CUDA cores.
2020-08-28T17:16:58.675 INFO bellperson::gpu::multiexp > Multiexp: 1 working device(s) selected. (CPU utilization: 1)
2020-08-28T17:16:58.675 INFO bellperson::gpu::multiexp > Multiexp: Device 0: GeForce GTX 1070 (Chunk-size: 3303970)
2020-08-28T17:16:58.675 INFO bellperson::multiexp > GPU Multiexp kernel instantiated!
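To put numbers on the timing in the log above, here is a rough sketch that measures two intervals from a handful of its lines. It assumes the Rust-side timestamp format seen in this excerpt (no trailing `Z`). The helper names `timestamp` and `find_first` are made up for illustration.

```python
from datetime import datetime
from typing import List

# A few lines copied from the log above; the rest is omitted for brevity.
LOG = """\
2020-08-28T16:30:54.490 INFO filecoin_proofs::api::seal > snark_proof:start
2020-08-28T16:41:07.329 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:
2020-08-28T16:41:41.601 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:
2020-08-28T17:15:53.093 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:
2020-08-28T17:16:17.902 DEBUG bellperson::gpu::locks > GPU lock released!
"""


def timestamp(line: str) -> datetime:
    """Parse the leading timestamp of a log line (assumed format)."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S.%f")


def find_first(lines: List[str], needle: str) -> datetime:
    """Return the timestamp of the first line containing needle."""
    for line in lines:
        if needle in line:
            return timestamp(line)
    raise ValueError(f"no line contains {needle!r}")


lines = LOG.splitlines()
start = find_first(lines, "snark_proof:start")
released = find_first(lines, "GPU lock released!")
failures = [timestamp(l) for l in lines if "GPU FFT failed!" in l]

until_release = released - start
failure_window = failures[-1] - failures[0]
print("snark_proof:start to GPU lock released:", until_release)
print("first to last FFT failure shown:", failure_window)
```

On this excerpt the first interval comes to a little over 45 minutes, which lines up with the 45 to 46 minute observation if that figure is counted from snark_proof:start rather than from the first failure.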


I've had a few machines with the same error. They all had GPUs other than a 2080 Ti (dual 2080 SUPER, for example).
Now I stick to a single 2080 Ti for this very reason.

Hey Protocol Labs, is there any progress? Same issue here with an RTX 2080 SUPER. It's driving me crazy!
2020-08-29T17:51:42.026 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("radix_fft")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

#############################################################################

2020-09-06T09:31:37.304 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueNDRangeKernel("radix_fft")

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueNDRangeKernel.html#errors

@stuberman @hyunmoon did this get resolved eventually?

I saw this last week, but let me keep watching now that I have upgraded to 1.2.1.

@ribasushi Unfortunately I still see these from time to time on my 1080 Ti, running the latest master, lotus version 1.2.2+git.b13226bc2

#############################################################################

2020-12-04T14:48:34.329 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueWriteBuffer

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors

#############################################################################

2020-12-04T14:48:38.608Z INFO miner miner/miner.go:384 Time delta between now and our mining base: 8s (nulls: 0)
2020-12-04T14:48:56.638 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueWriteBuffer

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors

#############################################################################

2020-12-04T14:49:06.129Z INFO miner miner/miner.go:384 Time delta between now and our mining base: 6s (nulls: 0)
2020-12-04T14:49:20.453 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueWriteBuffer

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors

#############################################################################

2020-12-04T14:49:40.473Z INFO miner miner/miner.go:384 Time delta between now and our mining base: 10s (nulls: 0)
2020-12-04T14:49:46.734 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueWriteBuffer

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors

#############################################################################

Well, it's driving me crazy.
My lotus version:
root@KRPA-U16-Series:~# lotus-worker --version
lotus-worker version 1.2.2+git.93d26195f

2020-12-10T22:28:54.507 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueWriteBuffer

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors

#############################################################################

Can anyone help me?

@msconnext What GPU type are you running? I am using an NVIDIA 1080 Ti, which has 11 GB of memory and should not really see this.

Is this resolved?

No, this still occurs on rare occasions.

2021-06-18T05:48:47.208 WARN bellperson::gpu::locks > GPU Multiexp failed! Falling back to CPU... Error: OpenCL Error: Ocl Error:

################################ OPENCL ERROR ###############################

Error executing function: clEnqueueWriteBuffer

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)

Please visit the following url for more information:

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors

#############################################################################

I'm still seeing this error during C2 on every sector sealed on a worker that has a GTX 1070 (8 GB). We're coming up on a year with this error and wondering whether we'll see a fix. Basically, I see the GPU lock error every 30 seconds for about 45-50 minutes before it truly fails over and starts the C2 processing.

2021-08-04T06:12:05.832 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error: 

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueWriteBuffer  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors  

############################################################################# 

2021-08-04T06:12:34.660 WARN bellperson::gpu::locks > GPU FFT failed! Falling back to CPU... Error: OpenCL Error: Ocl Error: 

################################ OPENCL ERROR ############################### 

Error executing function: clEnqueueWriteBuffer  

Status error code: CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)  

Please visit the following url for more information: 

https://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueWriteBuffer.html#errors  

############################################################################# 

.... Repeats over and over until...



2021-08-04T07:04:13.837 INFO bellperson::gpu::locks > GPU is available for Multiexp!
2021-08-04T07:04:13.957 INFO bellperson::gpu::multiexp > Multiexp: 1 working device(s) selected. (CPU utilization: 0.9)
2021-08-04T07:04:13.957 INFO bellperson::gpu::multiexp > Multiexp: Device 0: GeForce GTX 1070 (Chunk-size: 3303970)
2021-08-04T07:04:13.957 INFO bellperson::multiexp > GPU Multiexp kernel instantiated!
2021-08-04T08:27:36.527 INFO bellperson::groth16::prover > prover time: 8312.83745891s
2021-08-04T08:27:38.984 INFO filecoin_proofs::api::seal > snark_proof:finish
2021-08-04T08:27:38.985 INFO filecoin_proofs::api::seal > verify_seal:start: SectorId(1035)
2021-08-04T08:27:38.989 INFO filecoin_proofs::caches > trying parameters memory cache for: STACKED[34359738368]-verifying-key
2021-08-04T08:27:38.989 INFO filecoin_proofs::caches > found params in memory cache for STACKED[34359738368]-verifying-key
2021-08-04T08:27:38.989 INFO filecoin_proofs::api::seal > got verifying key (34359738368) while verifying seal
2021-08-04T08:27:39.037 INFO filecoin_proofs::api::seal > verify_seal:finish: SectorId(1035)
2021-08-04T08:27:39.037 INFO filecoin_proofs::api::seal > seal_commit_phase2:finish: SectorId(1035)
2021-08-04T08:27:39.038 INFO filcrypto::proofs::api > seal_commit_phase2: finish
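For a sense of scale, the arithmetic below estimates how much of the reported prover time falls between the first failure shown and the point where the GPU becomes available for Multiexp. The timestamps and the prover time are copied from the log above. It assumes the elided middle of the log contains only repeats of the same failure, which the excerpt does not show.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%S.%f"

# Copied from the log above.
first_failure = datetime.strptime("2021-08-04T06:12:05.832", FMT)
gpu_available = datetime.strptime("2021-08-04T07:04:13.837", FMT)
prover_seconds = 8312.83745891  # "prover time" as logged

# Length of the window in which the FFT failure keeps repeating (assumed).
retry_seconds = (gpu_available - first_failure).total_seconds()
share = retry_seconds / prover_seconds

print(f"retry window: {retry_seconds / 60:.1f} min")
print(f"share of prover time: {share:.0%}")
```

If the assumption holds, a large part of the C2 run on this worker is spent in that window, and the first failure shown may not even be the first one that occurred.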


@stuberman Just for the record: this issue is solved, right?
I have not seen it for a long time, and I'm unable to reproduce it.

Not really. I continued to see this OCL error even on my 3090 with plenty of memory.
Now with CUDA I get this error on both my 2080 Ti (miner) and 3090 (worker):
2021-12-11T15:16:17.770 WARN bellperson::gpu::locks > GPU Multiexp failed! Falling back to CPU... Error: GPU tools error: Cuda Error: "out of memory"

Ok, thanks! I will keep the ticket open and add labels.

We've switched to another library for OpenCL, hence I'm closing this issue. If it's still happening, please re-open with updated information.