CUDATracePreload is a dynamic tracing tool for CUDA and NCCL API calls. It uses the LD_PRELOAD mechanism to intercept and log API calls, making it easier to capture the low-level calls made by Python frameworks such as PyTorch or TensorFlow.
- Trace CUDA API Calls: Automatically logs calls to CUDA APIs.
- Monitor NCCL Operations: Captures and reports on NCCL function invocations.
- Easy Integration: Simply set the LD_PRELOAD environment variable to use.
- Extensible: Add your own metrics with a lambda or a custom handler.
make # creates tracer.so
Several build options are available through make (e.g., make TRACK_NCCL=1).
LIBTORCH_CUDA_SO ?= /path/to/libtorch_cuda.so
CUDA_INCLUDE_PATH ?= /path/to/cuda/include
NCCL_INCLUDE_PATH ?= /path/to/nccl/include
TRACK_CUDA ?= 1 # Enables tracking of CUDA calls
TRACK_NCCL ?= 0 # Enables tracking of NCCL calls
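How these switches reach the C++ sources is not shown in this README. One common approach, assumed here, is for the Makefile to forward them as -DTRACK_CUDA=... and -DTRACK_NCCL=... so that hook families can be compiled in or out. The sketch below illustrates that pattern only; the helper enabled_trackers is a hypothetical name, not part of the tool.

```cpp
#include <cassert>
#include <string>

// Assumption: the Makefile forwards its variables as -DTRACK_CUDA=<0|1> and
// -DTRACK_NCCL=<0|1>. The fallbacks below mirror the Makefile defaults.
#ifndef TRACK_CUDA
#define TRACK_CUDA 1
#endif
#ifndef TRACK_NCCL
#define TRACK_NCCL 0
#endif

// Hypothetical helper: reports which hook families were compiled in.
inline std::string enabled_trackers() {
    std::string out;
#if TRACK_CUDA
    out += "cuda";
#endif
#if TRACK_NCCL
    out += out.empty() ? "nccl" : "+nccl";
#endif
    return out.empty() ? "none" : out;
}
```

With this layout, code belonging to a disabled family is removed by the preprocessor, so a build with TRACK_NCCL=0 would not need the NCCL headers at all.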
Pass the path to tracer.so through LD_PRELOAD when launching your program. The results are written to the log folder. The number of log files may vary depending on the number of processes per node.
# Example for running on llama2
mkdir log
LD_PRELOAD=./tracer.so python -m torch.distributed.run --nproc_per_node 2 dialog.py --ckpt_dir llama-2-13b-chat/ --tokenizer_path tokenizer.model --max_seq_len 2048 --max_batch_size 6
Sample log
[DEVICE 1 ] cudaMalloc called with arguments: devPtr = size = 165675008
[DEVICE 1 ] cudaMalloc called with arguments: devPtr = size = 329252864
[DEVICE 1 ] cudaMemcpyAsync called with arguments: dst = src = count = 26214400 kind = str =
...
### Report ###
Total cudaMalloc: 18683 MB
Total ncclAllReduce count: 185139200 bytes
Total broadcast count: 0 bytes
Total reduce count: 0 bytes
Total allgather count: 8389120 bytes
Total reducescatter count: 0 bytes
Total ncclmemalloc count: 0 bytes
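The report totals above are consistent with per-call counters that are summed during the run and printed once at the end. Below is a minimal sketch of that pattern, assuming byte counters and decimal megabytes (1 MB = 1,000,000 bytes, the divisor used in the cudaMalloc example in this README). Counters, record_malloc, record_allreduce, and format_report are hypothetical names chosen for illustration, not the tool's actual API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical counters; the real tool may track different quantities.
struct Counters {
    std::atomic<std::uint64_t> malloc_bytes{0};
    std::atomic<std::uint64_t> allreduce_bytes{0};
};

// Called from a hook. Relaxed ordering is enough for a plain running sum.
inline void record_malloc(Counters& c, std::uint64_t bytes) {
    c.malloc_bytes.fetch_add(bytes, std::memory_order_relaxed);
}

inline void record_allreduce(Counters& c, std::uint64_t bytes) {
    c.allreduce_bytes.fetch_add(bytes, std::memory_order_relaxed);
}

// Builds the report text. Bytes are converted to MB once, at print time,
// so allocations smaller than 1 MB are not rounded away one by one.
inline std::string format_report(const Counters& c) {
    std::ostringstream os;
    os << "### Report ###\n"
       << "Total cudaMalloc: " << c.malloc_bytes.load() / 1000000 << " MB\n"
       << "Total ncclAllReduce count: " << c.allreduce_bytes.load() << " bytes\n";
    return os.str();
}
```

In a preloaded library, one plausible place to emit such a report is a function registered with atexit or marked with the GCC/Clang destructor attribute, so that it runs once when the traced process exits.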
You can add metric function calls as lambda functions to the CREATE_HOOKED_NCCL_FUNCTION or CREATE_HOOKED_CUDA_FUNCTION macros as shown below.
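// Running total of allocated memory in MB, updated from the cudaMalloc hook.
std::atomic<long> total_mb_allocated(0);

CREATE_HOOKED_CUDA_FUNCTION(
    cudaError_t,                    // return type
    cudaMalloc,                     // function to hook
    (void** devPtr, size_t size),   // parameter list
    (void**, size_t),               // parameter types
    (devPtr, size),                 // arguments forwarded to the real function
    // Metric lambda. Note: integer division drops allocations smaller than 1 MB.
    [=]() { total_mb_allocated.fetch_add(size / 1000000, std::memory_order_relaxed); }
)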
Or you can define your own handler. Below is an example for cudaMemcpy.
#include <dlfcn.h>         // dlsym, RTLD_NEXT
#include <cuda_runtime.h>  // cudaError_t, cudaMemcpyKind

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)
{
    // Resolve the real cudaMemcpy once; RTLD_NEXT skips this library in the lookup order.
    using cudaMemcpy_t = cudaError_t (*)(void*, const void*, size_t, cudaMemcpyKind);
    static cudaMemcpy_t lcudaMemcpy = (cudaMemcpy_t)dlsym(RTLD_NEXT, "cudaMemcpy");
    /* Do your own stuff */
    return lcudaMemcpy(dst, src, count, kind);
}
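The definitions of the CREATE_HOOKED_* macros are not reproduced in this README. As a rough illustration of how a macro could generate a handler like the hand-written one above, the sketch below defines a simplified stand-in. DEFINE_HOOK and the hooked_ prefix are hypothetical: the prefix lets the example run as a normal program without LD_PRELOAD, and libc's strlen stands in for a CUDA or NCCL entry point. It assumes Linux with glibc; older glibc versions may need -ldl at link time.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // RTLD_NEXT is a GNU extension
#endif
#include <dlfcn.h>

#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for the CREATE_HOOKED_* macros.
// It defines hooked_<name> rather than <name> so it can run without LD_PRELOAD.
#define DEFINE_HOOK(ret, name, params, types, args, metric)                 \
    ret hooked_##name params {                                              \
        /* Resolve the next definition of the symbol once and cache it. */  \
        using fn_t = ret (*) types;                                         \
        static fn_t real = reinterpret_cast<fn_t>(dlsym(RTLD_NEXT, #name)); \
        assert(real != nullptr && "symbol not found: " #name);              \
        /* The lambda is created here, so [=] can see the parameters. */    \
        auto record = metric;                                               \
        record();                                                           \
        return real args;                                                   \
    }

// Number of calls that went through the hook.
std::atomic<long> strlen_calls(0);

// Same argument order as in the cudaMalloc example: return type, name,
// parameter list, parameter types, forwarded arguments, metric lambda.
DEFINE_HOOK(
    std::size_t,
    strlen,
    (const char* s),
    (const char*),
    (s),
    [=]() { strlen_calls.fetch_add(1, std::memory_order_relaxed); }
)
```

Calling hooked_strlen("cuda") increments strlen_calls and then forwards to the real strlen. In an actual preload library the generated function would carry the original symbol name, so that the dynamic linker resolves the application's calls to the hook first.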
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Feel free to fork the repository and submit pull requests.