How to use the 'local' function / set local_size.
LudoTheHUN opened this issue
Hello.
I note that the file calx/src/calx/core.clj contains the 'local' function defined as:
(defn local [size]
(CLKernel$LocalSize. size))
I'm designing a kernel for which I want to set the local work size myself, but I do not see how to set this variable. I tried calling 'local' inside the 'let', as in the example below, but this has no effect. Placing the 'local' function call anywhere else fails with 'Unable to resolve symbol: sourceOpenCL2'.
How can I put this option into effect? A side question: is there a public repo that uses calx that I could use as learning material?
I know the local size is not changing (it stays at 256) because of a kernel containing something like this:
int lid = get_local_id(0); ....
outvector[gid] = lid ; //The output values loop round 255
Kind regards.
Ludo
(use 'calx)
(def source
"__kernel void square (
__global const float *a,
__global float *b) {
int gid = get_global_id(0);
b[gid] = a[gid] * a[gid];
}")
(with-cl
(with-program (compile-program source)
(let [a (wrap [1 2 3] :float32)
b (mimic a)]
(local 64) ;; This seems to have no effect
(enqueue-kernel :square 3 a b)
(enqueue-read b))))
The (local ...) value is meant to be passed as an argument to enqueue-kernel. It gives the size of the array behind a '__local float *' parameter in your kernel; it does not set the work-group size. A more detailed description of this can be found at http://stackoverflow.com/questions/2541929/how-do-i-use-local-memory-in-opencl
Thank you, that makes sense and it works (as per code below).
However, other questions arise, namely:
Is there a mechanism in calx to set up the execution with, say, different values of CL_KERNEL_WORK_GROUP_SIZE (and other parameters)? Also, how can I retrieve the device-specific restrictions, e.g. CL_DEVICE_LOCAL_MEM_SIZE?
Basically, within a kernel we have these values available to us:
int gsize = get_global_size(0);
int gid = get_global_id(0);
int lsize = get_local_size(0);
int lid = get_local_id(0);
We can control get_global_size(0), and it seems get_local_size(0) could be controlled too, but (AFAIK) calx does not expose a mechanism to set it.
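To make the relationship between these four values concrete, here is a minimal pure-Clojure sketch of what a work-item sees in a one-dimensional launch. It assumes a zero global offset and a global size that is a multiple of the local size. `work-item-ids` is a hypothetical helper written for illustration; it is not part of calx and does not touch OpenCL.

```clojure
;; Pure-Clojure model of the OpenCL index functions for a 1-D launch.
;; Assumptions: zero global offset, gsize is a multiple of lsize.
;; `work-item-ids` is a hypothetical helper, not part of calx.
(defn work-item-ids
  "Returns the values a work-item with global id `gid` would see."
  [gsize lsize gid]
  {:gsize gsize                 ; get_global_size(0)
   :gid   gid                   ; get_global_id(0)
   :lsize lsize                 ; get_local_size(0)
   :lid   (rem gid lsize)       ; get_local_id(0)
   :group (quot gid lsize)})    ; get_group_id(0)

;; With a local size of 256 the local id wraps around after 255,
;; which matches the behaviour described in the first post.
(map :lid (map #(work-item-ids 1024 256 %) [0 255 256 257 1023]))
;; => (0 255 0 1 255)
```

Writing the local id into the output buffer, as in the earlier kernel fragment, is therefore a reasonable way to infer the local size the runtime picked.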
Thanks again for any clarifications.
(def sourceOpenCL2
"
__kernel void square(
__global float *input,
__global float *output,
const unsigned int count)
{
int i = get_global_id(0);
if (i < count)
output[i] = input[i] * input[i];
}
__kernel void squarelocal(
__global float *input,
__global float *output,
__local float *temp,
const unsigned int count)
{
int gtid = get_global_id(0);
int ltid = get_local_id(0);
if (gtid < count)
{
temp[ltid] = input[gtid];
output[gtid] = temp[ltid] * temp[ltid];
}
}
")
(def OpenCLoutputAtom1 (atom 0))
(def OpenCLoutputAtom2 (atom 0))
(def OpenCLoutputAtom3 (atom 0))
(def OpenCLoutputAtom4 (atom 0))
(def OpenCLoutputAtom5 (atom 0))
(defn testoutputs [x]
(let [global_clj_size x
inputvec (vec (for [i (range global_clj_size)] (rand)))]
(with-cl
(with-program (compile-program sourceOpenCL2)
(let [a (wrap inputvec :float32)
b (mimic a)
c (wrap inputvec :float32)
d (mimic a)
globalsize global_clj_size]
(time (enqueue-kernel :square globalsize a b globalsize))
(time (enqueue-kernel :squarelocal globalsize c d (local global_clj_size) globalsize))
(swap! OpenCLoutputAtom1 (fn [x] (deref (enqueue-read a))))
(swap! OpenCLoutputAtom2 (fn [x] (deref (enqueue-read b))))
(swap! OpenCLoutputAtom3 (fn [x] (deref (enqueue-read c))))
(swap! OpenCLoutputAtom4 (fn [x] (deref (enqueue-read d))))
(release! a)
(release! b)
(release! c)
(release! d)
nil)))
(println
(reduce (fn [coll x]
(and coll (== (@OpenCLoutputAtom2 x) (@OpenCLoutputAtom4 x))))
true (range 0 global_clj_size)) ; tests that the output buffers were the same
(reduce (fn [coll x]
(and coll (== (@OpenCLoutputAtom1 x) (@OpenCLoutputAtom3 x))))
true (range 0 global_clj_size)) ; tests that the input buffers were the same
(reduce (fn [coll x]
(and coll (== (@OpenCLoutputAtom2 x) (* (@OpenCLoutputAtom1 x) (@OpenCLoutputAtom1 x)))))
true (range 0 global_clj_size)) ; tests that the :square computation came back with the correct answer
(reduce (fn [coll x]
(and coll (== (@OpenCLoutputAtom4 x) (* (@OpenCLoutputAtom3 x) (@OpenCLoutputAtom3 x)))))
true (range 0 global_clj_size)) ; tests that the :squarelocal computation came back with the correct answer
(count @OpenCLoutputAtom1)
(count @OpenCLoutputAtom2)
(count @OpenCLoutputAtom3)
(count @OpenCLoutputAtom4)
)))
(testoutputs (* 2))
(testoutputs (* 2 2))
(testoutputs (* 2 2 2))
(testoutputs (* 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2 2 2)) ;; 8192 floats * 4 bytes per float = 32768 bytes of local memory; this is the largest size that works for the local call on my GPU card.
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2 2 2 2)) ;; this and any larger local sizes fail with:
;java.lang.RuntimeException: Exception while waiting for events [Event {commandType: ReadBuffer}] (NO_SOURCE_FILE:0)
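The arithmetic behind that limit can be checked with a small pure-Clojure sketch. It assumes a local-memory budget of 32768 bytes, taken from the comment above, and 4 bytes per float. The real limit is device-specific (CL_DEVICE_LOCAL_MEM_SIZE), and all names below are hypothetical helpers, not calx functions.

```clojure
;; Assumption: 32768 bytes of local memory per work-group, as suggested by
;; the comment above; the real value is device-specific.
(def assumed-local-mem-bytes 32768)

;; Size of one OpenCL float in bytes.
(def float-bytes 4)

(defn local-bytes
  "Bytes needed for a __local float array holding n-floats elements."
  [n-floats]
  (* n-floats float-bytes))

(defn fits-local?
  "True when n-floats floats fit within the assumed local-memory budget."
  [n-floats]
  (<= (local-bytes n-floats) assumed-local-mem-bytes))

(fits-local? 8192)   ; => true  (8192 * 4 = 32768 bytes)
(fits-local? 16384)  ; => false (16384 * 4 = 65536 bytes)
```

Under these assumptions the 8192-element run sits exactly at the budget and the 16384-element run needs twice as much, which is consistent with the failure reported above.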