Syncleus / aparapi

The New Official Aparapi: a framework for executing native Java and Scala code on the GPU.

Home Page: http://aparapi.com

[Update] Add complete support for OpenCL atomics on arrays of integers by mapping Java arrays of AtomicIntegers

CoreRasurae opened this issue · comments

This enhancement will add support for OpenCL atomics on integers by leveraging the existing Java AtomicInteger class.
Arrays of AtomicInteger are used on the Java side and will be mapped to arrays of integers on the OpenCL side, while the Aparapi Kernel class will provide methods whose semantics are compatible with those defined for OpenCL atomics.
The mapping on the OpenCL side will have negligible overhead compared to a direct implementation in CL code.
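
To make the intended Java-side semantics concrete, here is a minimal plain-Java sketch of how OpenCL-style atomics could be expressed on top of `AtomicInteger`. The class name `AtomicOpsSketch` and the method bodies are assumptions made for illustration, not Aparapi's actual implementation; returning the previous value follows the convention of OpenCL's `atomic_max` and `atomic_xchg`.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper class: illustrates the semantics only,
// it is not part of the Aparapi API.
public class AtomicOpsSketch {

    // Read the current value.
    static int atomicGet(AtomicInteger p) {
        return p.get();
    }

    // Overwrite the current value.
    static void atomicSet(AtomicInteger p, int value) {
        p.set(value);
    }

    // Store max(current, value) and return the previous value,
    // as OpenCL's atomic_max does.
    static int atomicMax(AtomicInteger p, int value) {
        return p.getAndAccumulate(value, Math::max);
    }

    // Store value and return the previous value,
    // as OpenCL's atomic_xchg does.
    static int atomicXchg(AtomicInteger p, int value) {
        return p.getAndSet(value);
    }

    public static void main(String[] args) {
        AtomicInteger cell = new AtomicInteger(3);
        System.out.println("previous value from atomicMax: " + atomicMax(cell, 7));
        System.out.println("value after atomicMax: " + atomicGet(cell));
        System.out.println("previous value from atomicXchg: " + atomicXchg(cell, 0xff));
        System.out.println("value after atomicXchg: " + atomicGet(cell));
    }
}
```

On the OpenCL side the same calls would operate on a plain `int` array element, which is why the mapping is expected to add little overhead.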

Example:

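//Note: MAX_VAL_IDX, MAX_POS_LEFT_IDX, MAX_POS_RIGHT_IDX, LOCK_IDX and SIZE
//are constants defined outside this snippet (not shown).
public static class AtomicKernel extends Kernel {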
    	private int in[];
    	private int out[];
    	
    	@Local
    	private final AtomicInteger maxs[] = new AtomicInteger[4];
    	    	
    	public AtomicKernel(int[] in, int[] out) {
    		this.in = in;
    		this.out = out;
    		for (int idx = 0; idx < 4; idx++) {
    			maxs[idx] = new AtomicInteger(0);
    		}
    	}
    	
        @Override
        public void run() {
        	final int localId = getLocalId(0);
        	
        	//Ensure that the atomic values are initialized. OpenCL requires this
                //explicitly, because otherwise they may contain random values.
                //Plain Java does not need it, because the values are already
                //initialized in the AtomicInteger constructor.
        	//Since this is Aparapi, the code must initialize them on both platforms.
        	if (localId == 0) {
	        	atomicSet(maxs[MAX_VAL_IDX], 0);
	        	atomicSet(maxs[LOCK_IDX], 0);
        	}
        	//Ensure all threads start with the initialized atomic max value and lock.
        	localBarrier();
        	
        	final int offset = localId * 2;
    		int localMaxVal = 0;
    		int localMaxPosFromLeft = 0;
    		int localMaxPosFromRight = 0;
    		for (int i = 0; i < 2; i++) {
    			localMaxVal = max(in[offset + i], localMaxVal);
    			if (localMaxVal == in[offset + i]) {
    				localMaxPosFromLeft = offset + i;
    				localMaxPosFromRight = SIZE - (offset + i);
    			}
    		}
    		
        	atomicMax(maxs[MAX_VAL_IDX], localMaxVal);
    		//Ensure all threads have updated the atomic maxs[MAX_VAL_IDX]
        	localBarrier();
        	
        	int maxValue = atomicGet(maxs[MAX_VAL_IDX]);
        	if (maxValue == localMaxVal) {
        		//Only the threads that hold the max value will reach this point.
                        //However, the max value may occur at multiple indices of the
                        //input array.
        		if (atomicXchg(maxs[LOCK_IDX], 0xff) == 0) {
        			//Only one of the threads with the max value will get here,
                                //thus ensuring a consistent update of maxPosFromRight and
                                //maxPosFromLeft.
        			atomicSet(maxs[MAX_POS_LEFT_IDX], localMaxPosFromLeft);
        			atomicSet(maxs[MAX_POS_RIGHT_IDX], localMaxPosFromRight);
        			out[MAX_VAL_IDX] = maxValue;
        			out[MAX_POS_LEFT_IDX] = atomicGet(maxs[MAX_POS_LEFT_IDX]);
        			out[MAX_POS_RIGHT_IDX] = localMaxPosFromRight;
        		}
        	}
        }
    }
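
The barrier-and-lock pattern used by the kernel above can also be tried out without a GPU. The following is a minimal sketch that emulates a single work-group with plain Java threads, assuming one thread per work-item, a `CyclicBarrier` standing in for `localBarrier()`, and non-negative input values. The class name `WorkGroupMaxSketch`, the method `findMax`, and the index constants are hypothetical (the original snippet does not show its constants); this is not how Aparapi executes kernels.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkGroupMaxSketch {
    // Assumed index layout; the values are illustrative only.
    static final int MAX_VAL_IDX = 0;
    static final int MAX_POS_LEFT_IDX = 1;
    static final int MAX_POS_RIGHT_IDX = 2;
    static final int LOCK_IDX = 3;

    // Emulates one work-group; each thread handles two input elements.
    // Assumes in.length is even and all values are non-negative.
    static int[] findMax(int[] in) {
        final int size = in.length;
        final int workers = size / 2;
        final int[] out = new int[3];
        final AtomicInteger[] maxs = new AtomicInteger[4];
        for (int i = 0; i < maxs.length; i++) {
            maxs[i] = new AtomicInteger(0);
        }
        final CyclicBarrier barrier = new CyclicBarrier(workers);

        Thread[] threads = new Thread[workers];
        for (int id = 0; id < workers; id++) {
            final int localId = id;
            threads[id] = new Thread(() -> {
                int offset = localId * 2;
                int localMax = 0;
                int localPos = 0;
                for (int i = 0; i < 2; i++) {
                    if (in[offset + i] >= localMax) {
                        localMax = in[offset + i];
                        localPos = offset + i;
                    }
                }
                // Counterpart of atomicMax(maxs[MAX_VAL_IDX], localMaxVal).
                maxs[MAX_VAL_IDX].getAndAccumulate(localMax, Math::max);
                try {
                    // Counterpart of localBarrier(): wait for every thread's update.
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                int maxValue = maxs[MAX_VAL_IDX].get();
                // Counterpart of atomicXchg(...) == 0: only the first thread to
                // swap in a non-zero value sees 0 and wins the lock.
                if (maxValue == localMax && maxs[LOCK_IDX].getAndSet(0xff) == 0) {
                    out[MAX_VAL_IDX] = maxValue;
                    out[MAX_POS_LEFT_IDX] = localPos;
                    out[MAX_POS_RIGHT_IDX] = size - localPos;
                }
            });
            threads[id].start();
        }
        for (Thread t : threads) {
            try {
                t.join(); // join also makes the winner's writes to out visible
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] in = {3, 9, 1, 4, 9, 2, 0, 5};
        // The reported position depends on which thread wins the lock.
        System.out.println(Arrays.toString(findMax(in)));
    }
}
```

When the maximum occurs more than once, as in the sample input, either position may be reported, but the left and right positions always come from the same thread because only the lock winner writes them.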

@CoreRasurae In the future you can add this stuff directly as pull requests rather than as issues. Though feel free to add issues for enhancement requests if you wish. Just saying there is no requirement that a PR must match an issue.

Will you be adding this as a PR?

Yes, I will, in the next few days. I already have it running with Aparapi 1.4.1, but now I have to extract this feature and rebase it onto the master branch.

@CoreRasurae Merged in your PR so closing this issue. Thanks again for your contribution.

@CoreRasurae Btw GitHub isn't giving you credit for your commits because of the way you have your account and git set up. If you'd like, I can show you how to fix that.

@CoreRasurae
I use two AtomicInteger arrays of 2 million elements each. Preparing and extracting these arrays takes about 200 milliseconds.
Is it possible to avoid the preparation and extraction at each execution of the kernel?
I do not need to transfer these arrays to the host, only to use them during execution.

@gpeabody Please avoid posting questions on closed topics; instead, open a new issue with your question.
Anyway, regarding your question, and without having any clue about how you implemented the kernel, I would suggest placing the AtomicInteger arrays in local memory. That way they will only be initialized inside the kernel and there is no transfer overhead.