mitsuba-renderer / drjit

Dr.Jit — A Just-In-Time-Compiler for Differentiable Rendering

Profiling forward and backward execution time in Drjit autograd

Linyou opened this issue

Hello!

I am currently exploring performance comparisons between Dr.Jit's autograd and CUDA. I tried to profile the time taken by backward propagation with the following code:

import time

t0 = time.perf_counter()  # start the timer
dr.backward(loss)
dr.sync_thread()
print(f"backward: {time.perf_counter() - t0:.3f} s")

However, I consistently get much shorter times than what I observe with the equivalent CUDA implementation. This leads me to believe that I am missing something in my profiling approach or misinterpreting the timing results.

Could anyone provide guidance on how to accurately measure the forward and backward propagation times in Dr.Jit's autograd? Any help would be greatly appreciated!

Thank you.

I think you need a subsequent call to dr.grad and dr.eval, e.g. something like this:

dr.backward(loss)
grad = dr.grad(my_parameter)
dr.eval(grad)
dr.sync_thread()
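
For example, wrapped in a simple timer (a sketch; loss and my_parameter are placeholders for your actual computation):

import time

t0 = time.perf_counter()
dr.backward(loss)
grad = dr.grad(my_parameter)  # fetch the gradient (still unevaluated)
dr.eval(grad)                 # force compilation + launch of the backward kernel(s)
dr.sync_thread()              # wait for the GPU to finish before stopping the timer
print(f"backward: {time.perf_counter() - t0:.3f} s")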

Hi @Linyou

@dvicini is correct, you most likely need to explicitly evaluate the gradient before measuring your runtime. As with JIT evaluation, the autodiff machinery produces lazy outputs that are not evaluated until they are explicitly needed.

import drjit as dr
from drjit.cuda.ad import Float, UInt

a = dr.full(Float, 10, 3) # a := [10, 10, 10]
dr.enable_grad(a)
dr.set_grad(a, 1)

b = dr.arange(Float, 6) # b := [0, 1, 2, 3, 4, 5]
dr.scatter(b, a * 2, dr.arange(UInt, 3))
# b := [20, 20, 20, 3, 4, 5]

dr.forward(a) # propagate gradients forward from a (does not evaluate grad(b) yet)
dr.set_log_level(dr.LogLevel.Info) # Will log whenever a kernel is executed from now on
print(f"{dr.grad(b)=}") # grad(b) := [2, 2, 2, 0, 0, 0]

The snippet above is a bit contrived. However, if you run it, the logs will show that a kernel is executed after dr.forward(a), namely when dr.grad(b) is printed. In other words, the gradient of b had not yet been computed by the dr.forward() operation itself.

Thank you for the detailed explanation. I can see now how the lazy evaluation works within the autodiff mechanisms. Your suggestion of explicitly evaluating the gradient before measuring the runtime makes a lot of sense.

One quick comment: if you print the array, you are also paying the cost of a memcpy of the result to the CPU (which synchronizes GPU<->CPU and is done rather inefficiently, using several PCIe transactions). This may skew the results.

I would call dr.eval(..) and then dr.sync_thread(). On the CUDA side, you should add a cudaDeviceSynchronize() (which is what dr.sync_thread() calls internally) so that the two timings can be compared fairly.
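
Putting both comments together, the timed region on the Dr.Jit side might look like the sketch below (loss and my_parameter again stand in for your actual computation; the matching CUDA measurement would bracket its kernel launches with cudaDeviceSynchronize() in the same way):

import time

dr.sync_thread()               # finish pending work *before* starting the timer
t0 = time.perf_counter()
dr.backward(loss)
dr.eval(dr.grad(my_parameter)) # evaluate the gradient without copying it to the CPU
dr.sync_thread()               # calls cudaDeviceSynchronize() internally on the CUDA backend
elapsed = time.perf_counter() - t0
print(f"{elapsed:.3f} s")      # inspect/print results only after the timed region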