m4rs-mt / ILGPU.Algorithms

The new standard algorithms library for ILGPU

Custom reduce

CsabaStupak opened this issue · comments

commented

I'm trying to write a custom reducer that computes a sum of squares: y = x1*x1 + x2*x2 + ...

The reducer code looks like:

public readonly struct MyReducer : IScanReduceOperation<int>
{
	public string CLCommand { get; }
	public int Identity { get => 0; }

	public int Apply(int first, int second)
	{
		return first + second * second;
	}
	//
	public void AtomicApply(ref int target, int value)
	{
		Atomic.Add(ref target, value);
	}
}

accl.Reduce<int, MyReducer>(accl.DefaultStream, buffer.View, target.View);

I worked out how to implement this by reading the library code, so it could be incorrect. It works when I test it with the CPU accelerator.

  • CPU: for buffer = [0, 1, 2, 3] it returns 14
  • Cuda: (GeForce card) it returns 6740
  • OpenCL: (Intel card) it crashes - exception message "An internal compiler error has been detected"

Could you help me write the custom reducer correctly?

Many thanks! :-)

PS: Finally a good GPU C# library :-)

commented

@CsabaStupak It might be easier to perform your algorithm in two steps.

The first step would be to multiply the data array:

static void MultiplyKernel(Index1 index, ArrayView<int> data)
{
    data[index] = data[index] * index;
}

// Load and run the kernel
var kernel = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<int>>(MultiplyKernel);
kernel(accelerator.DefaultStream, buffer.Length, buffer.View);

Then, perform a Sum reduction:

accelerator.Reduce<int, AddInt32>(accelerator.DefaultStream, buffer.View, target.View);

I have attached some sample code:
Issue68.zip

commented

@MoFtZ Thanks for the sample code! I like how simple it is :-)

  • Shouldn't the multiply kernel look like this?
    data[index] = data[index] * data[index];

  • Should I call accelerator.Synchronize(); between kernels/algorithms? Or is it enough to do it at the end, when I'm retrieving data from the GPU?

I was trying to come up with my own reducer because I wanted to avoid the extra memory allocation - I can't modify the input data. I expected that approach to be faster than doing it in separate steps with an additional GPU memory allocation.

commented
* Shouldn't the multiply kernel look like this?
  `data[index] = data[index] * data[index];`

Ah, yes, you're right. That is just my poor interpretation of your algorithm =)
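
For reference, the corrected kernel might look like this (a sketch against the ILGPU v0.9.x API used above; `SquareKernel` is just a renamed `MultiplyKernel`):

```csharp
// Squares each element of the buffer in place.
static void SquareKernel(Index1 index, ArrayView<int> data)
{
    data[index] = data[index] * data[index];
}

// Load and run as before:
var kernel = accelerator.LoadAutoGroupedKernel<Index1, ArrayView<int>>(SquareKernel);
kernel(accelerator.DefaultStream, buffer.Length, buffer.View);
```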

* Should I call `accelerator.Synchronize();` between kernels/algorithms? Or it is enough to do it at the end when I'm retrieving data from GPU?

Kernels are added to a stream (the default stream if none is specified). Within a stream, kernels run sequentially, which is why I could queue the MultiplyKernel and then the Reduce kernel on the DefaultStream without explicit synchronization. If you queue two kernels on two different streams, those kernels may run in parallel.

Calling accelerator.Synchronize() will synchronize across all streams. If you only want to wait for a single stream, use stream.Synchronize().
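
As a sketch of the difference (the kernel names and views here are illustrative, not from your code):

```csharp
// Same stream: kernelA completes before kernelB starts.
kernelA(accelerator.DefaultStream, length, view);
kernelB(accelerator.DefaultStream, length, view);
accelerator.DefaultStream.Synchronize(); // wait for this stream only

// Different streams: the two kernels may overlap on the device.
using var stream1 = accelerator.CreateStream();
using var stream2 = accelerator.CreateStream();
kernelA(stream1, length, viewA);
kernelB(stream2, length, viewB);
accelerator.Synchronize(); // wait for all streams
```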

* I was trying to come up with my own reducer because I wanted to avoid the extra memory allocation - I can't modify the input data. I expected that approach to be faster than doing it in separate steps with an additional GPU memory allocation.

Looking at your custom reducer, the reason it crashes on OpenCL is that you have not specified a CLCommand. This should be one of add, min or max (corresponding to the OpenCL function sub_group_reduce_<op>). Using add in your case appears to make OpenCL work correctly.
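
So at a minimum, the reducer would need a CLCommand, e.g. (a sketch; this addresses the OpenCL crash but not the Cuda result issue described below):

```csharp
public readonly struct MyReducer : IScanReduceOperation<int>
{
    // Required by the OpenCL backend; must be one of "add", "min" or "max".
    public string CLCommand => "add";

    public int Identity => 0;

    public int Apply(int first, int second) =>
        first + second * second;

    public void AtomicApply(ref int target, int value) =>
        Atomic.Add(ref target, value);
}
```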

The reason the Cuda accelerator returns an incorrect result with your custom reducer appears to be a potential issue in ILGPU.Algorithms. Internally, it calls the Apply function more times than the other accelerators do, and therefore produces an incorrect result with your custom reducer - FYI @m4rs-mt.

commented

Many thanks for the help and explanations :-)

commented

Internally, the reduction kernel for the current release (ILGPU.Algorithms v0.9.2) can be found here:

internal static void ReductionKernel<T, TReduction>(
    ArrayView<T> input,
    ArrayView<T> output)
    where T : unmanaged
    where TReduction : struct, IScanReduceOperation<T>
{
    var stride = GridExtensions.GridStrideLoopStride;
    TReduction reduction = default;
    var reduced = reduction.Identity;
    for (var idx = Grid.GlobalIndex.X; idx < input.Length; idx += stride)
        reduced = reduction.Apply(reduced, input[idx]);
    reduced = GroupExtensions.Reduce<T, TReduction>(reduced);
    if (Group.IsFirstThread)
        reduction.AtomicApply(ref output[0], reduced);
}

At line 96 of that file (the `reduction.Apply(reduced, input[idx])` call inside the loop), it calls the Apply method of your custom reducer.

The issue is that inside line 98 (the `GroupExtensions.Reduce` call), on the Cuda accelerator, it incorrectly calls the Apply method again.

As a really hacky workaround, you could replace line 96 with your own code, and use the standard AddInt32 as the TReduction generic.
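
A sketch of that workaround, mirroring the kernel above (the squaring moves into the loop, so the reduction itself is plain addition via the standard AddInt32):

```csharp
// Hacky workaround: square each input element inside the grid-stride loop,
// then let the standard AddInt32 operation combine the partial sums.
internal static void SumOfSquaresKernel(
    ArrayView<int> input,
    ArrayView<int> output)
{
    var stride = GridExtensions.GridStrideLoopStride;
    AddInt32 reduction = default;
    var reduced = reduction.Identity;
    for (var idx = Grid.GlobalIndex.X; idx < input.Length; idx += stride)
        reduced = reduction.Apply(reduced, input[idx] * input[idx]);
    reduced = GroupExtensions.Reduce<int, AddInt32>(reduced);
    if (Group.IsFirstThread)
        reduction.AtomicApply(ref output[0], reduced);
}
```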

I think this issue has been resolved. If the issue still persists, please open a new issue in the ILGPU project.