wrong tid in cta_launch with non power of two NT
christiankerl opened this issue
Using cta_launch with an NT that is not a power of two results in wrong tid values being passed to the lambda. Simple test case:
static const int NT = 96;
mgpu::cta_launch<NT>([=] MGPU_DEVICE (const int tid, const int block) {
  // Print the hardware thread index, the tid passed in, and the masked index.
  printf("thread %d %d %d\n", threadIdx.x, tid, threadIdx.x & (NT - 1));
}, 1, ctx);
For NT = 32, 64 and 128 this works fine. However, setting NT to any multiple of 32 should be valid, right?
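The third value printed by the test case hints at one possible cause. This is an assumption, not something the thread confirms: if tid were derived by masking threadIdx.x with NT - 1, it would match threadIdx.x only when NT is a power of two. The host-only sketch below uses two hypothetical helpers, tid_by_mask and tid_by_modulo (neither is a moderngpu function), to show where the mask and the modulo diverge for NT = 96.

```cpp
// Hypothetical helpers, not moderngpu code: two ways to fold a hardware
// thread index into the range [0, nt).
constexpr int tid_by_mask(int thread_idx, int nt) {
  return thread_idx & (nt - 1);  // equals the modulo only if nt is a power of two
}

constexpr int tid_by_modulo(int thread_idx, int nt) {
  return thread_idx % nt;        // valid for any positive nt
}

// NT = 64 is a power of two, so both forms agree.
static_assert(tid_by_mask(40, 64) == tid_by_modulo(40, 64),
              "mask and modulo agree for NT = 64");

// NT = 96: NT - 1 = 95 = 0b1011111, so the mask clears bit 5 (value 32).
static_assert(tid_by_mask(32, 96) == 0,
              "under the mask, thread 32 collides with thread 0");
static_assert(tid_by_modulo(32, 96) == 32,
              "the modulo keeps thread 32 distinct");
```

The checks run at compile time, so building the file is enough to confirm them. Under this assumption, threads 32 through 63 of a 96-thread CTA would receive the same tid values as threads 0 through 31.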
That's a good point. I should definitely change that. I started off only supporting powers of two because I wasn't sure whether all the algorithms I wrote would work on those odd CTA sizes. But I should relax that restriction for cta_launch, transform, etc., and static_assert inside unsupported functions if I need to. Will roll that fix out in the next update. Thanks.
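A minimal sketch of the static_assert approach mentioned above, assuming a compile-time CTA size parameter. All names here (is_pow2, generic_cta_size, pow2_only_cta_size) are hypothetical and are not taken from moderngpu. The idea is that general-purpose entry points accept any multiple of the 32-thread warp size, while routines that still depend on power-of-two sizes reject other values at compile time.

```cpp
// Hypothetical names throughout; this is not moderngpu's actual code.
constexpr bool is_pow2(int x) {
  return x > 0 && (x & (x - 1)) == 0;
}

// A general-purpose entry point: accepts any CTA size that is a
// positive multiple of the 32-thread warp.
template<int nt>
constexpr int generic_cta_size() {
  static_assert(nt > 0 && nt % 32 == 0,
                "nt must be a positive multiple of 32");
  return nt;
}

// A routine that still relies on power-of-two CTA sizes: other values
// fail to compile instead of silently producing wrong results.
template<int nt>
constexpr int pow2_only_cta_size() {
  static_assert(is_pow2(nt), "this routine requires a power-of-two nt");
  return nt;
}
```

With these definitions, generic_cta_size<96>() compiles, while pow2_only_cta_size<96>() stops the build with the message above.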
Fixed in the 2.10 push. Let me know if you have problems with it.
Works fine now, thanks for the fix.