Kokkos Support

Question

brian-kelley opened this issue 4 months ago

For those not familiar with it, Kokkos (https://github.com/kokkos/kokkos) is a C++ library and programming model for portable, shared-memory parallelism. A program written once using Kokkos can be compiled for a variety of CPU and GPU backends using the common vendor toolchains. OpenMP, Cuda, HIP, and SYCL are some of the most popular backends.

Kokkos is used heavily by many codes to come out of the Exascale Computing Project, as well as Trilinos and apps built with it. Enzyme has support for OpenMP and Cuda code, but supporting Kokkos could be an efficient route to autodiff on other parallel backends (in particular, AMD and Intel GPUs).

Here is a simple Kokkos program that attempts to use __enzyme_autodiff to compute the gradient of the 2-norm. nrm2 uses Kokkos::parallel_reduce to sum up the squared elements of a vector, and then does Kokkos::sqrt (one of the portable wrappers for basic math functions). The hand-written gradient gradNrm2 computes what the answer should be.

#include <Kokkos_Core.hpp>
#include <iostream>

using Vector = Kokkos::View<double*>;

template<typename RT, typename... Args>
RT __enzyme_autodiff(void*, Args...);

int enzyme_dup;
int enzyme_out;
int enzyme_const;

// nrm2 (function to differentiate)
// Enzyme handles this version fine.
/*
double nrm2(const Vector& v)
{
  double sum = 0;
  for(size_t i = 0; i < v.extent(0); i++)
  {
    sum += v(i) * v(i);
  }
  return Kokkos::sqrt(sum);
}
*/

struct Nrm2_Functor
{
  Nrm2_Functor(Vector v_) : v(v_) {}

  KOKKOS_INLINE_FUNCTION void operator()(int i, double& lsum) const
  {
    lsum += v(i) * v(i);
  }

  Vector v;
};

double nrm2(const Vector& v)
{
  double sum = 0;
  Kokkos::parallel_reduce(v.extent(0), Nrm2_Functor(v), sum);
  return Kokkos::sqrt(sum);
}

// Analytical gradient of nrm2
Vector gradNrm2(Vector v)
{
  double n2 = nrm2(v);
  Vector grad("grad", v.extent(0));
  for(size_t i = 0; i < v.extent(0); i++) // this could be a parallel_for loop also
  {
    grad(i) = v(i) / n2;
  }
  return grad;
}

void printVector(Vector v)
{
  for(size_t i = 0; i < v.extent(0); i++)
  {
    std::cout << v(i) << " ";
  }
  std::cout << '\n';
}

int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> v("v", 3);
    v(0) = 2;
    v(1) = 3;
    v(2) = 4;
    std::cout << "Vector v: ";
    printVector(v);
    Vector g = gradNrm2(v);
    std::cout << "Correct gradient of nrm2 at v: ";
    printVector(g);
    std::cout << "Enzyme gradient of nrm2 at v: ";
    Vector ge("ge", 3);
    __enzyme_autodiff<void>((void*) nrm2, enzyme_dup, v, ge);
    printVector(ge);
  }
  Kokkos::finalize();
  return 0;
}

Uncommenting the first version of nrm2 (where the reduction is a normal for loop) works, so Enzyme is smart enough to understand the View<double*> as a data structure, and accesses using operator().
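
For reference, the values the program above should print follow from d||v||/dv_i = v_i / ||v||. Below is a minimal Kokkos-free sketch that checks this formula against central finite differences. It assumes std::vector as a stand-in for Kokkos::View so that it builds without Kokkos or Enzyme, and the names nrm2_ref, grad_nrm2_ref and fd_grad_nrm2_ref are hypothetical helpers, not part of either library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial 2-norm, standing in for the Kokkos nrm2 above.
double nrm2_ref(const std::vector<double>& v) {
  double sum = 0.0;
  for (double x : v) sum += x * x;
  return std::sqrt(sum);
}

// Analytical gradient: d||v|| / dv_i = v_i / ||v|| (undefined at v = 0).
std::vector<double> grad_nrm2_ref(const std::vector<double>& v) {
  const double n = nrm2_ref(v);
  assert(n > 0.0);
  std::vector<double> g(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) g[i] = v[i] / n;
  return g;
}

// Central finite-difference approximation of the same gradient.
// v is taken by value so each entry can be perturbed in place.
std::vector<double> fd_grad_nrm2_ref(std::vector<double> v, double h = 1e-6) {
  std::vector<double> g(v.size());
  for (std::size_t i = 0; i < v.size(); ++i) {
    const double orig = v[i];
    v[i] = orig + h;
    const double up = nrm2_ref(v);
    v[i] = orig - h;
    const double down = nrm2_ref(v);
    v[i] = orig;  // restore before moving to the next entry
    g[i] = (up - down) / (2.0 * h);
  }
  return g;
}
```

For the input (2, 3, 4) used in main, both functions should agree with v_i / sqrt(29) to within the finite-difference error, which gives an independent check on whatever Enzyme produces.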

While most Kokkos constructs are templated on execution/memory spaces, View type, functor type, etc., there are still functions that are compiled into libkokkos, so their definitions are not available in the translation unit that ClangEnzyme can see. Here is some of the output I get when I try to build this program with Enzyme:

error: Enzyme: No augmented forward pass found for _ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_
 at context:   %call.i = call noundef ptr @_ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_(ptr noundef %1) #23
error: Enzyme: No reverse pass found for _ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_
 at context:   %call.i = call noundef ptr @_ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_(ptr noundef %1) #25
freeing without malloc   %4 = load ptr, ptr %m_control8, align 8
error: Enzyme: No augmented forward pass found for _ZN6Kokkos5Tools8endFenceEm
 at context:   call void @_ZN6Kokkos5Tools8endFenceEm(i64 noundef %0) #27
error: Enzyme: No augmented forward pass found for _ZN6Kokkos5Tools10beginFenceENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEjPm
 at context:   call void @_ZN6Kokkos5Tools10beginFenceENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEjPm(ptr noundef %malloccall1, i32 noundef %add, ptr noundef %malloccall) #27
error: Enzyme: No augmented forward pass found for _ZN6Kokkos6SerialC1Ev
 at context:   call void @_ZN6Kokkos6SerialC1Ev(ptr noundef nonnull align 8 dereferenceable(16) %malloccall2) #27
freeing without malloc   %4 = load ptr, ptr %m_control8, align 8
freeing without malloc   %4 = load ptr, ptr %m_control8, align 8
freeing without malloc   %4 = load ptr, ptr %m_control8, align 8
error: Enzyme: No augmented forward pass found for _ZN6Kokkos5Tools17endParallelReduceEm
 at context:   call void @_ZN6Kokkos5Tools17endParallelReduceEm(i64 noundef %0) #27
error: Enzyme: No augmented forward pass found for _ZN6Kokkos4Impl14SerialInternal23resize_thread_team_dataEmmmm
 at context:   call void @_ZN6Kokkos4Impl14SerialInternal23resize_thread_team_dataEmmmm(ptr noundef nonnull align 8 dereferenceable(169) %call4, i64 noundef %conv, i64 noundef 0, i64 noundef 0, i64 noundef 0) #27
error: Enzyme: No augmented forward pass found for _ZNSt3__15mutex4lockEv
 at context:   call void @_ZNSt3__15mutex4lockEv(ptr noundef nonnull align 8 dereferenceable(40) %0) #27
error: Enzyme: No augmented forward pass found for _ZNSt3__15mutex6unlockEv
 at context:   call void @_ZNSt3__15mutex6unlockEv(ptr noundef nonnull align 8 dereferenceable(40) %0) #27
...

Most of these functions are called internally by the View allocation and the parallel_reduce.

The high-level path to supporting Kokkos probably looks something like:

- Handle the internal Kokkos functions (in CallDerivatives.cpp?)
- Make Enzyme aware of parallel_for and parallel_reduce: for the reverse mode, the gradient code should be a parallel_reduce and parallel_for respectively with differentiated versions of the functor body. Like we discussed, this must also avoid data races using atomics if View elements are updated in parallel.
- Use existing Cuda paths as a basis for handling host-device memory migration in heterogeneous backends?
- Add integration tests?
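
To make the parallel_reduce case concrete, here is a hand-written reverse pass for the nrm2 example, as a minimal serial sketch. It assumes std::vector in place of Kokkos::View, and nrm2_reverse is a hypothetical name. It illustrates the structure of the derivative only and is not what Enzyme generates.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hand-written reverse pass for ret = sqrt(sum_i v[i]^2).
// dv is the shadow of v and is accumulated into; dret is the seed (d ret).
void nrm2_reverse(const std::vector<double>& v, std::vector<double>& dv,
                  double dret) {
  assert(v.size() == dv.size());

  // Forward pass: the reduction (a parallel_reduce in the Kokkos version).
  double sum = 0.0;
  for (std::size_t i = 0; i < v.size(); ++i) sum += v[i] * v[i];
  assert(sum > 0.0);  // the derivative of sqrt is undefined at 0

  // Reverse of sqrt: d sqrt(sum) / d sum = 1 / (2 * sqrt(sum)).
  const double dsum = dret / (2.0 * std::sqrt(sum));

  // Reverse of the reduction: dsum is broadcast to every iteration, so this
  // loop has the shape of a parallel_for. Iteration i writes only dv[i],
  // so this particular functor would not need atomics.
  for (std::size_t i = 0; i < v.size(); ++i) dv[i] += 2.0 * v[i] * dsum;
}
```

With dret = 1 and dv initialized to zero, dv ends up holding v_i / ||v||, matching gradNrm2. Atomics become necessary only when several iterations of the original loop read the same element, because the reverse pass then has several iterations accumulating into the same shadow entry.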

Some other open questions:

- We can already convert MLIR to Kokkos C++ source. But is it feasible to turn differentiated LLVM code back into C++?
- Could it be portable/generic, so that the same code can be compiled on the full set of Kokkos backends?
  - This would greatly increase the value of this work, since gradients could be generated once, and then Clang/LLVM/Enzyme would not be needed to compile the actual application.

Enzyme community members who might be interested: @wsmoses @michel2323 @ftynse @vchuravy @ivanradanov @albertcohen
And others from the Sandia/Kokkos group: @kliegeois @srajama1

Manuel Drehwald · Answer 1 · Thu Feb 01 2024 05:21:21 GMT+0800 (China Standard Time)

Re visibility of symbols for ClangEnzyme: did you try LLDEnzyme or one of the other paths described here? https://enzyme.mit.edu/getting_started/UsingEnzyme/#differentiating-cc

William Moses · Answer 2 · Thu Feb 01 2024 05:29:31 GMT+0800 (China Standard Time)

In the case of several of the above functions, I think the right solution is to mark them as allocation-like (we have an attribute for this).

brian-kelley · Answer 3 · Thu Feb 01 2024 06:14:03 GMT+0800 (China Standard Time)

@ZuseZ4 Thanks for the tip. I started with that example CMakeLists.txt, and made Kokkos a subdirectory. I do get through everything until the final executable link:

(from verbose makefile)
clang++ -O0 -stdlib=libc++ -Wl,-rpath,blah/lib/x86_64-unknown-linux-gnu -fuse-ld=lld -Wl,-mllvm -Wl,-load=/blah/enzyme-install/lib/LLDEnzyme-16.so -Wl,--load-pass-plugin=/blah/enzyme-install/lib/LLDEnzyme-16.so -DKOKKOS_DEPENDENCE CMakeFiles/myProgram.dir/myProgram.cpp.o -o myProgram  kokkos/containers/src/libkokkoscontainers.a kokkos/core/src/libkokkoscore.a -ldl kokkos/simd/src/libkokkossimd.a

ld.lld: error: <unknown>:0:0: in function main i32 (i32, ptr): Enzyme: Cannot cast __enzyme_autodiff primal argument 1, found i32 0, type i32 - to arg 0 ptr

clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

I assume the primal argument 1 is talking about v above? Can I tweak my declaration of __enzyme_autodiff or nrm2 to work around this? If it helps, the same error happens with the non-parallel version of nrm2 uncommented, which worked under ClangEnzyme.

William Moses · Answer 4 · Thu Feb 01 2024 06:19:02 GMT+0800 (China Standard Time)

You probably want to declare enzyme_dup as extern (extern int enzyme_dup;), per that error.

I will forewarn that LTO comes with various compile time implications.

If it works now, that's a great starting point, but we'll probably want to add some attributes directly, for the ease of Kokkos users.

brian-kelley · Answer 5 · Thu Feb 01 2024 06:24:32 GMT+0800 (China Standard Time)

Thanks @wsmoses, that fixed it for the for-loop version but not the parallel_reduce version. The error messages look very similar to those from ClangEnzyme, so maybe I misdiagnosed the original problem:

ld.lld: error: <unknown>:0:0: in function preprocess__ZN6Kokkos4Impl11ViewTrackerINS_4ViewIPdJEEEED2Ev void (ptr): Enzyme: No augmented forward pass found for _ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_
 at context:   %8 = call noundef ptr @_ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_(ptr noundef %7) #28


ld.lld: error: <unknown>:0:0: in function preprocess__ZN6Kokkos4Impl11ViewTrackerINS_4ViewIPdJEEEED2Ev void (ptr): Enzyme: No reverse pass found for _ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_
 at context:   %8 = call noundef ptr @_ZN6Kokkos4Impl22SharedAllocationRecordIvvE9decrementEPS2_(ptr noundef %7) #30

freeing without malloc   %19 = load ptr, ptr %18, align 8

ld.lld: error: <unknown>:0:0: in function preprocess__ZN6Kokkos5Tools12Experimental4Impl19profile_fence_eventINS_6SerialEZNKS4_5fenceERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEEUlvE_EEvSD_NS2_19DirectFenceIDHandleERKT0_ void (ptr, i32, ptr): Enzyme: No augmented forward pass found for _ZN6Kokkos5Tools8endFenceEm
 at context:   call void @_ZN6Kokkos5Tools8endFenceEm(i64 noundef %8) #32


ld.lld: error: <unknown>:0:0: in function preprocess__ZN6Kokkos5Tools12Experimental4Impl19profile_fence_eventINS_6SerialEZNKS4_5fenceERKNSt3__112basic_stringIcNS5_11char_traitsIcEENS5_9allocatorIcEEEEEUlvE_EEvSD_NS2_19DirectFenceIDHandleERKT0_ void (ptr, i32, ptr): Enzyme: No augmented forward pass found for _ZN6Kokkos5Tools10beginFenceENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEjPm
 at context:   call void @_ZN6Kokkos5Tools10beginFenceENSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEjPm(ptr noundef %5, i32 noundef %7, ptr noundef %4) #32


ld.lld: error: <unknown>:0:0: in function preprocess__ZN6Kokkos15parallel_reduceI12Nrm2_FunctordEENSt3__19enable_ifIXntoooosr6Kokkos7is_viewIT0_EE5valuesr6Kokkos10is_reducerIS4_EE5valuesr3std10is_pointerIS4_EE5valueEvE4typeERKmRKT_RS4_ void (ptr, ptr, ptr): Enzyme: No augmented forward pass found for _ZN6Kokkos6SerialC1Ev
 at context:   call void @_ZN6Kokkos6SerialC1Ev(ptr noundef nonnull align 8 dereferenceable(16) %5) #32

William Moses · Answer 6 · Thu Feb 01 2024 06:26:10 GMT+0800 (China Standard Time)

Yeah, my same comment about marking the function as allocation-like (assuming I'm reading this correctly as an allocation function) applies here.

William Moses · Answer 7 · Thu Feb 01 2024 06:27:36 GMT+0800 (China Standard Time)

But this is still odd, because it implies that you didn't do full LTO with wherever these functions were implemented, and thus Enzyme couldn't find the definition to differentiate and complained.

I do think this is probably something we should provide a custom derivative for at a higher level, but it would still be good to confirm that it works when given the definitions in LLVM.

brian-kelley · Answer 8 · Thu Feb 01 2024 08:18:58 GMT+0800 (China Standard Time)

@wsmoses Made some progress - I wasn't very familiar with the usage of LTO before. I still had to add -flto to the compilation of the Kokkos libraries and also install the LLVMgold.so plugin.

Now, it's "cannot deduce type of memset" and "cannot deduce type of copy"

ld.lld: error: <unknown>:0:0: in function preprocess__ZNSt3__15mutexC2B7v160006Ev void (ptr): Enzyme: Cannot deduce type of memset   call void @llvm.memset.p0.i64(ptr align 8 %2, i8 0, i64 40, i1 false) #46
<analysis>
ptr %0: {[-1]:Pointer}, intvals: {}
  %2 = getelementptr inbounds %"class.std::__1::mutex", ptr %0, i32 0, i32 0: {[-1]:Pointer}, intvals: {}
</analysis>

ld.lld: error: <unknown>:0:0: in function preprocess__ZN6Kokkos5Tools12Experimental23invoke_kokkosp_callbackIPFv28Kokkos_Profiling_SpaceHandlePKcPKvmEJRKS3_S5_RS7_RKmEEEvNS1_23MayRequireGlobalFencingERKT_DpOT0_ void (ptr, ptr, ptr, ptr, ptr): Enzyme: Cannot deduce type of copy   call void @llvm.memcpy.p0.p0.i64(ptr align 1 %7, ptr align 1 %1, i64 64, i1 false) #46
<analysis>

Is this the kind of thing that would be fixed by adding the allocation-like attribute to the Kokkos functions in question?
If so, is there an example of this attribute being used? I found this line in customalloc.c but I'm not sure if this is what you're referring to:

void* __enzyme_allocation_like[4] = {(void*)myallocator, (void*)1, (void*)"2,-1", (void*)myfree};

Manuel Drehwald · Answer 9 · Thu Feb 01 2024 08:24:34 GMT+0800 (China Standard Time)

Not recommended for production use cases, but try https://enzyme.mit.edu/getting_started/UsingEnzyme/#loose-type-analysis to get started. I think there might also be an example in the CMake files showing how to add it (but I'm not sure).

Also, LLVMgold.so looks suspicious, I think it should use LLD and not gold(?), but maybe Billy knows more.

William Moses · Answer 10 · Sat Feb 03 2024 12:17:24 GMT+0800 (China Standard Time)

@ZuseZ4 no, gold is fine here.

@brian-kelley can you add -g so we can look at a backtrace of where the error comes from? It probably makes sense to mark some mutex or something as inactive there.

Also, FYI, regarding your earlier comment: Enzyme does differentiate code for AMD GPUs as well.