m4rs-mt / ILGPU.Algorithms

The new standard algorithms library for ILGPU

Custom reduce

CsabaStupak opened this issue · comments

commented

I'm trying to write a custom reducer that computes a sum of squares: y = x1*x1 + x2*x2 + ...

The reducer code looks like:

public readonly struct MyReducer : IScanReduceOperation<int>
{
	public string CLCommand { get; }
	public int Identity { get => 0; }

	public int Apply(int first, int second)
	{
		return first + second * second;
	}
	//
	public void AtomicApply(ref int target, int value)
	{
		Atomic.Add(ref target, value);
	}
}

accl.Reduce<int, MyReducer>(accl.DefaultStream, buffer.View, target.View);

I worked out how to implement this by reading the library code, so it could be incorrect. It works when I test it with the CPU accelerator.

  • CPU: for buffer = [0, 1, 2, 3] it returns 14
  • Cuda: (GeForce card) it returns 6740
  • OpenCL: (Intel card) it crashes - exception message "An internal compiler error has been detected"

Could you help me write the custom reducer correctly?

Many thanks! :-)

PS: Finally a good GPU C# library :-)

commented

@CsabaStupak It might be easier to perform your algorithm in two steps.

The first step would be to multiply the data array:

static void MultiplyKernel(Index1 index, ArrayView<int> data)
{
    data[index] = data[index] * index;
}

// Load and run the kernel
var kernel = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<int>>(MultiplyKernel);
kernel(accelerator.DefaultStream, buffer.Length, buffer.View);

Then, perform a Sum reduction:

accelerator.Reduce<int, AddInt32>(accelerator.DefaultStream, buffer.View, target.View);

I have attached some sample code:
Issue68.zip

commented

@MoFtZ Thanks for the sample code! I like how simple it is :-)

  • Shouldn't the multiply kernel look like this?
    data[index] = data[index] * data[index];

  • Should I call accelerator.Synchronize(); between kernels/algorithms? Or is it enough to do it at the end, when I'm retrieving data from the GPU?

I was trying to come up with my own reducer because I wanted to avoid the extra memory allocation - I can't modify the input data. I expected that approach to be faster than doing it in separate steps with an additional GPU memory allocation.

commented
* Shouldn't the multiply kernel look like this?
  `data[index] = data[index] * data[index];`

Ah, yes, you're right. That is just my poor interpretation of your algorithm =)
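
For reference, the corrected kernel might look like this (a sketch against the ILGPU v0.9.x API used above; `SquareKernel` is just a renamed `MultiplyKernel`):

```csharp
// Squares each element of the buffer in place.
static void SquareKernel(Index1 index, ArrayView<int> data)
{
    data[index] = data[index] * data[index];
}

// Load and run as before:
var kernel = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<int>>(SquareKernel);
kernel(accelerator.DefaultStream, buffer.Length, buffer.View);
```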

* Should I call `accelerator.Synchronize();` between kernels/algorithms? Or it is enough to do it at the end when I'm retrieving data from GPU?

Kernels are added to a stream (the default stream if none is specified). Within a stream, kernels run sequentially, which is why I could queue the MultiplyKernel and then the Reduce kernel on the DefaultStream without explicit synchronization. If you queue two kernels on two different streams, those kernels may run in parallel.

Calling accelerator.Synchronize() will synchronize across all streams. If you only want to wait for a single stream, use stream.Synchronize().
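
As a sketch of the difference (the kernel names and views here are illustrative, not from your code):

```csharp
// Same stream: kernelA completes before kernelB starts.
kernelA(accelerator.DefaultStream, length, view);
kernelB(accelerator.DefaultStream, length, view);
accelerator.DefaultStream.Synchronize(); // wait for this stream only

// Different streams: the two kernels may overlap on the device.
using var stream1 = accelerator.CreateStream();
using var stream2 = accelerator.CreateStream();
kernelA(stream1, length, viewA);
kernelB(stream2, length, viewB);
accelerator.Synchronize(); // wait for all streams
```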

* I was trying to come up with my own reducer because I wanted to avoid the extra memory allocation - I can't modify the input data. I expected that approach to be faster than doing it in separate steps with an additional GPU memory allocation.

Looking at your custom reducer, the reason it crashes on OpenCL is that you have not specified a CLCommand. This should be one of add, min or max (corresponding to the OpenCL function sub_group_reduce_<op>). Using add in your case appears to make OpenCL work correctly.
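
So at a minimum, the reducer would need a CLCommand, e.g. (a sketch; this addresses the OpenCL crash but not the Cuda result issue described below):

```csharp
public readonly struct MyReducer : IScanReduceOperation<int>
{
    // Required by the OpenCL backend; must be one of "add", "min" or "max".
    public string CLCommand => "add";

    public int Identity => 0;

    public int Apply(int first, int second) =>
        first + second * second;

    public void AtomicApply(ref int target, int value) =>
        Atomic.Add(ref target, value);
}
```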

The reason the Cuda accelerator returns an incorrect result with your custom reducer appears to be a potential issue in ILGPU.Algorithms. Internally, it calls the Apply function more times than the other accelerators do, and therefore produces an incorrect result with your custom reducer - FYI @m4rs-mt.

commented

Many thanks for the help and explanations :-)

commented

Internally, the reduction kernel for the current release (ILGPU.Algorithms v0.9.2) can be found here:

internal static void ReductionKernel<T, TReduction>(
    ArrayView<T> input,
    ArrayView<T> output)
    where T : unmanaged
    where TReduction : struct, IScanReduceOperation<T>
{
    var stride = GridExtensions.GridStrideLoopStride;
    TReduction reduction = default;
    var reduced = reduction.Identity;
    for (var idx = Grid.GlobalIndex.X; idx < input.Length; idx += stride)
        reduced = reduction.Apply(reduced, input[idx]);
    reduced = GroupExtensions.Reduce<T, TReduction>(reduced);
    if (Group.IsFirstThread)
        reduction.AtomicApply(ref output[0], reduced);
}

At line 96 of that file (the `reduction.Apply(reduced, input[idx])` call inside the loop), it calls the Apply method of your custom reducer.

The issue is that inside line 98 (the `GroupExtensions.Reduce` call), on the Cuda accelerator, it incorrectly calls the Apply method again.

As a really hacky workaround, you could replace line 96 with your own code, and use the standard AddInt32 as the TReduction generic.
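
A sketch of that workaround, mirroring the kernel above (the squaring moves into the loop, so the reduction itself is plain addition via the standard AddInt32):

```csharp
// Hacky workaround: square each input element inside the grid-stride loop,
// then let the standard AddInt32 operation combine the partial sums.
internal static void SumOfSquaresKernel(
    ArrayView<int> input,
    ArrayView<int> output)
{
    var stride = GridExtensions.GridStrideLoopStride;
    AddInt32 reduction = default;
    var reduced = reduction.Identity;
    for (var idx = Grid.GlobalIndex.X; idx < input.Length; idx += stride)
        reduced = reduction.Apply(reduced, input[idx] * input[idx]);
    reduced = GroupExtensions.Reduce<int, AddInt32>(reduced);
    if (Group.IsFirstThread)
        reduction.AtomicApply(ref output[0], reduced);
}
```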

I think this issue has been resolved. If the issue still persists, please open a new issue in the ILGPU project.