ztellman / calx

not under active development - idiomatic opencl bindings for clojure

How to use the local function / set local_size.

LudoTheHUN opened this issue

Hello.

I note that the file calx/src/calx/core.clj contains the 'local' function defined as:

(defn local [size]
  (CLKernel$LocalSize. size))

I'm designing a kernel for which I want to set the local work size myself, but I do not see how to set this variable. I tried calling 'local' inside the 'let', as in the example below, but it has no effect. Placing the 'local' function call anywhere else fails with 'Unable to resolve symbol: sourceOpenCL2'.
How can I put this option into effect? A side question: is there a public repo that uses calx that I could use as learning material?

I know the local size is not changing (stays as 256) because of a kernel with something like this:
int lid = get_local_id(0); ....
outvector[gid] = lid; // the output values wrap around after 255

Kind regards.

Ludo

(use 'calx)

(def source
  "__kernel void square (
       __global const float *a,
       __global float *b) {
    int gid = get_global_id(0);
    b[gid] = a[gid] * a[gid];
  }")

(with-cl
  (with-program (compile-program source)
    (let [a (wrap [1 2 3] :float32)
          b (mimic a)]
      (local 64)         ;; this seems to have no effect
      (enqueue-kernel :square 3 a b)
      (enqueue-read b))))

The (local ...) value is meant to be passed as an argument to enqueue-kernel. It gives the size of the array behind a '__local float*' parameter of your kernel; it does not set the local work size. A more detailed description of this can be found at http://stackoverflow.com/questions/2541929/how-do-i-use-local-memory-in-opencl

Thank you, that makes sense and it works (as per code below).

However, other questions arise, namely:
Is there a mechanism in calx to set up the execution with, say, different values of CL_KERNEL_WORK_GROUP_SIZE (and other parameters)? Also, how can I retrieve the device-specific restrictions, e.g. CL_DEVICE_LOCAL_MEM_SIZE?

Basically, within a kernel we have these values available to us:
int gsize = get_global_size(0);
int gid = get_global_id(0);
int lsize = get_local_size(0);
int lid = get_local_id(0);
We can control get_global_size(0). It seems get_local_size(0) could be controlled too, but (AFAIK) calx does not expose a mechanism to set it.
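
For reference, in a one-dimensional range with no global offset these built-ins are related by gid = group id * lsize + lid, so the local id wraps around at lsize - 1. A minimal plain-Clojure model of that relation, with no OpenCL involved (`work-item-ids` is a hypothetical helper for illustration, not part of calx):

```clojure
;; Models OpenCL's 1-D work-item indexing on the host, assuming a zero
;; global offset. `work-item-ids` is a hypothetical helper, not a calx function.
(defn work-item-ids
  "Returns the ids a work-item would see for global id `gid` when the
  work-group size is `lsize`."
  [gid lsize]
  {:gid   gid
   :group (quot gid lsize)   ; what get_group_id(0) would return
   :lid   (rem gid lsize)})  ; what get_local_id(0) would return

;; With a work-group size of 256, local ids run 0..255 and then wrap.
(map #(:lid (work-item-ids % 256)) [0 255 256 257 511 512])
;; => (0 255 0 1 255 0)
```

This matches the observation that a kernel writing lid into its output produces values that cycle through 0..255 when the work-group size is 256.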

Thanks again for any clarifications.

(def sourceOpenCL2
  " 
__kernel void square(
    __global float *input,
    __global float *output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}

__kernel void squarelocal(
    __global float *input,
    __global float *output,
    __local float *temp,
    const unsigned int count)
{
    int gtid = get_global_id(0);
    int ltid = get_local_id(0);
    if (gtid < count)
    {
        temp[ltid] = input[gtid];
        output[gtid] =  temp[ltid] * temp[ltid];
    }
}
  ")

(def OpenCLoutputAtom1 (atom 0))
(def OpenCLoutputAtom2 (atom 0))
(def OpenCLoutputAtom3 (atom 0))
(def OpenCLoutputAtom4 (atom 0))
(def OpenCLoutputAtom5 (atom 0))


(defn testoutputs [x]
(let [global_clj_size x
      inputvec (vec (for [i (range global_clj_size)] (rand)))]

(with-cl
  (with-program (compile-program sourceOpenCL2)
    (let [a (wrap inputvec :float32)
          b (mimic a)
          c (wrap inputvec :float32)
          d (mimic a)
          globalsize global_clj_size]
      (time (enqueue-kernel :square      globalsize a b                         globalsize))
      (time (enqueue-kernel :squarelocal globalsize c d (local global_clj_size) globalsize))
      (swap! OpenCLoutputAtom1 (fn [x] (deref (enqueue-read a))))
      (swap! OpenCLoutputAtom2 (fn [x] (deref (enqueue-read b))))
      (swap! OpenCLoutputAtom3 (fn [x] (deref (enqueue-read c))))
      (swap! OpenCLoutputAtom4 (fn [x] (deref (enqueue-read d))))
      (release! a)
      (release! b)
      (release! c)
      (release! d)
      nil)))

(println 
(reduce (fn [coll x]
           (and coll (== (@OpenCLoutputAtom2 x) (@OpenCLoutputAtom4 x) )))
 true (range 0 global_clj_size))   ; tests that both output buffers are the same

(reduce (fn [coll x]
           (and coll (== (@OpenCLoutputAtom1 x) (@OpenCLoutputAtom3 x) )))
 true (range 0 global_clj_size))   ; tests that both input buffers are the same

(reduce (fn [coll x]
           (and coll (== (@OpenCLoutputAtom2 x) (* (@OpenCLoutputAtom1 x) (@OpenCLoutputAtom1 x)) )))
 true (range 0 global_clj_size))    ; tests that the :square computation came back with the correct answer

 (reduce (fn [coll x]
           (and coll (== (@OpenCLoutputAtom4 x) (* (@OpenCLoutputAtom3 x) (@OpenCLoutputAtom3 x)) )))
 true (range 0 global_clj_size))    ; tests that the :squarelocal computation came back with the correct answer
(count @OpenCLoutputAtom1)
(count @OpenCLoutputAtom2)
(count @OpenCLoutputAtom3)
(count @OpenCLoutputAtom4)
 )))

(testoutputs (* 2))
(testoutputs (* 2 2))
(testoutputs (* 2 2 2))
(testoutputs (* 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2 2))
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2 2 2))  ;; 8192 floats * 4 bytes = 32768 bytes of local memory; this is the largest size that works for the (local ...) argument on my GPU card.
(testoutputs (* 2 2 2 2 2 2 2 2 2 2 2 2 2 2))  ;; this and any larger local sizes fail with:
;java.lang.RuntimeException: Exception while waiting for events [Event {commandType: ReadBuffer}] (NO_SOURCE_FILE:0)
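
The failure at 2^14 elements is consistent with the local-memory arithmetic in the comments above: each `__local float` element takes 4 bytes, so 8192 elements fill 32768 bytes exactly, while 16384 elements would need 65536. A minimal plain-Clojure sketch of that check, assuming the 32768-byte limit inferred from these runs (the real limit is device specific and is what OpenCL reports as CL_DEVICE_LOCAL_MEM_SIZE; `local-bytes` and `fits-in-local?` are hypothetical helpers, not calx functions):

```clojure
;; Size in bytes of one OpenCL float.
(def float-bytes 4)

;; Assumption: local memory limit inferred from the runs above, not queried
;; from the device.
(def assumed-local-mem-bytes 32768)

(defn local-bytes
  "Bytes of local memory needed for a __local float array of `n-floats` elements."
  [n-floats]
  (* n-floats float-bytes))

(defn fits-in-local?
  "True when a __local float array of `n-floats` elements fits in the assumed limit."
  [n-floats]
  (<= (local-bytes n-floats) assumed-local-mem-bytes))

(fits-in-local? 8192)   ;; => true  (32768 bytes)
(fits-in-local? 16384)  ;; => false (65536 bytes)
```

A guard like this could be run before calling testoutputs to skip sizes that cannot fit, though it only accounts for the single `temp` array in the :squarelocal kernel.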