lean-dojo / LeanCopilot

LLMs as Copilots for Theorem Proving in Lean

Home Page: https://leandojo.org


Premise Selection on GPUs

yangky11 opened this issue · comments

We currently select premises only on CPUs, but we want premise selection to also work on GPUs.

// TODO: We should remove this line when everything can work well on CUDA.

@Peiyang-Song. A recent discussion on Zulip may be relevant: https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/LeanCopilot/near/418208166

I believe these lines should work on CUDA if changed from:

LeanCopilot/cpp/ct2.cpp, lines 335 to 336 at 38a5a48:

const int *p_topk_indices = topk_indices.data<int>();
const float *p_topk_values = topk_values.data<float>();

to:

auto p_topk_indices = topk_indices.to_vector<int>();
auto p_topk_values = topk_values.to_vector<float>();
// The following line may also be required. I would have tested this if I had a
// system the CUDA build runs on; unfortunately something appears to assume a
// newer GPU than mine, even though mine is technically supported according to
// the documentation: encoding and generation fail with the prebuilt binary, and
// I cannot seem to compile with CUDA support either.
ctranslate2::synchronize_stream(device);

(Presumably, on CUDA data<T>() hands the subsequent loop a raw pointer into device memory, whereas to_vector<T>() copies the results over to the host.)
My expectation is that running without the synchronize_stream call, if it is indeed required, would cause stale or even garbage values to be read in the subsequent loop, because the device-to-host memcpy would be issued asynchronously and would race the execution of that loop.
Since p_topk_values is copied later, it would be more prone to containing wrong or stale values. A simple check that runs retrieval with an arbitrary query embedding (the encoder's fine-tuning loss should have spread the premise embeddings fairly evenly over the embedding space's hypersphere) and verifies that the returned values are in the order mandated by the TopK operator should surface the issue if it exists; see the sketch below.
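Something like this would do, I think (check_topk_sorted and the include path are illustrative, not existing code in the repo; it assumes the values are copied to the host via to_vector as proposed above):

#include <cassert>
#include <cstddef>
#include <vector>
#include <ctranslate2/storage_view.h>

// Check that the Top-K scores come back sorted (descending), as the TopK
// operator mandates. If the asynchronous device-to-host copy raced the read,
// stale or garbage values would likely break this invariant.
void check_topk_sorted(const ctranslate2::StorageView &topk_values) {
  const std::vector<float> values = topk_values.to_vector<float>();
  for (std::size_t i = 1; i < values.size(); ++i)
    assert(values[i - 1] >= values[i]);
}

Calling this right after retrieval, once with and once without the synchronize_stream line, should show whether the synchronization is actually needed.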

For better throughput, retrieval should be batched. As-is, the bulk of the work performs a single multiply-add per scalar loaded from memory. A Ryzen 9 5950X, for example, would (if I am not mistaken) remain DRAM-read-bandwidth-bound up to a batch size of roughly 80, although exploiting that would require loop tiling into both the register file and the L1 cache, since with a 1472-dimensional embedding and f32 values, 80 queries already barely fit in L2 cache.
My idea would be to process all queries on every involved core, chunk the premise list, and spread the chunks across the cores (see the sketch after this paragraph).
My understanding is that GPUs suffer even more from this need for a sufficiently large batch size before they stop being memory-bound.
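A rough sketch of that layout (the function name, signature, and OpenMP parallelization are illustrative, not the existing LeanCopilot retrieval code, and a serious implementation would still need the register/L1 tiling mentioned above):

#include <cstddef>
#include <vector>

// queries:  [num_queries x dim], row-major, f32
// premises: [num_premises x dim], row-major, f32
// scores:   [num_queries x num_premises]
// Each core streams its chunk of premise rows from DRAM once and reuses every
// row for all queries in the batch, instead of re-reading the premise matrix
// once per query.
void batched_scores(const std::vector<float> &queries,
                    const std::vector<float> &premises,
                    std::size_t num_queries,
                    std::size_t num_premises,
                    std::size_t dim,
                    std::vector<float> &scores) {
#pragma omp parallel for schedule(static)
  for (std::ptrdiff_t p = 0; p < static_cast<std::ptrdiff_t>(num_premises); ++p) {
    const float *prem = &premises[static_cast<std::size_t>(p) * dim];
    for (std::size_t q = 0; q < num_queries; ++q) {
      const float *qry = &queries[q * dim];
      float acc = 0.0f;
      for (std::size_t d = 0; d < dim; ++d)
        acc += qry[d] * prem[d];
      scores[q * num_premises + static_cast<std::size_t>(p)] = acc;
    }
  }
}

The per-query TopK can then run over each row of scores afterwards.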

And yes, I do mean that retrieving 100 queries with a small enough k in the TopK, given sufficiently optimized code, should be expected to take only about twice as long as retrieving a single query. It is that bad.
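For a back-of-the-envelope check of that crossover, using assumed 5950X figures (roughly 50 GB/s of sustained DRAM read bandwidth and about 2 TFLOP/s of FP32 peak; both numbers are assumptions, not measurements):

#include <cstdio>

int main() {
  const double read_bandwidth = 50e9;  // assumed sustained DRAM read bandwidth, bytes/s
  const double peak_flops = 2e12;      // assumed FP32 peak throughput, FLOP/s
  // With a batch of B queries, every 4-byte premise scalar streamed from DRAM
  // feeds B multiply-adds, i.e. 2*B FLOP per 4 bytes = B/2 FLOP per byte.
  // The kernel stops being memory-bound once (B/2) * read_bandwidth >= peak_flops.
  const double crossover_batch = 2.0 * peak_flops / read_bandwidth;
  std::printf("memory-bound up to a batch of about %.0f queries\n", crossover_batch);
  return 0;
}

This prints a crossover of about 80 queries; past that point, runtime grows roughly with the batch size instead of staying flat.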

We have received recent reports about using Lean Copilot on certain Ubuntu machines with CUDA-enabled GPUs. We currently apply a suboptimal fix via CMake flags. Linking PR #109.