clemisch / jaxtomo

Tomographic projector in JAX

FP slow, dominated by MemcpyD2H

clemisch opened this issue

When profiling the forward projection (FP), runtime appears to be dominated by MemcpyD2H (device-to-host copy) operations, as seen in the Perfetto UI. The actual computation takes very little time in between the copies.

[image: Perfetto trace screenshot]

I think this could be related to JAX dispatching each step of jax.lax.scan from the host, which requires some synchronization at every iteration. This is discussed here and here.

I don't know how to resolve this in pure JAX. Alternatives to the current implementation are vmap'ing instead of scan, or using unroll in scan. IIRC both led to longer runtimes.
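To make the three variants concrete, here is a minimal sketch of scan, scan with unroll, and vmap over angles. Note that `project_one` is a hypothetical stand-in for a single-angle projection and not the actual jaxtomo kernel, and that `unroll=4` is an arbitrary choice. It only shows that the three formulations are interchangeable in terms of results; it says nothing about which one is faster.

```python
import jax
import jax.numpy as jnp


def project_one(volume, angle):
    # Hypothetical stand-in for one projection at one angle (NOT the jaxtomo
    # kernel): scale the volume by cos(angle) and sum along the first axis.
    return jnp.sum(volume * jnp.cos(angle), axis=0)


@jax.jit
def fp_scan(volume, angles):
    # Sequential loop over angles; one projection per scan step.
    def step(carry, angle):
        return carry, project_one(volume, angle)

    _, proj = jax.lax.scan(step, None, angles)
    return proj


@jax.jit
def fp_scan_unrolled(volume, angles):
    # Same loop, but each compiled loop iteration handles 4 angles.
    def step(carry, angle):
        return carry, project_one(volume, angle)

    _, proj = jax.lax.scan(step, None, angles, unroll=4)
    return proj


@jax.jit
def fp_vmap(volume, angles):
    # All angles batched at once; no loop, but larger intermediates.
    return jax.vmap(project_one, in_axes=(None, 0))(volume, angles)


volume = jnp.ones((16, 16, 16), dtype=jnp.float32)
angles = jnp.linspace(0.0, jnp.pi, 8)
print(fp_scan(volume, angles).shape)
```

When timing these variants, call `.block_until_ready()` on the result so that asynchronous dispatch does not hide the actual runtime.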

Ultimately we should avoid the Memcpy at each projection -- which is what I think is happening. We could switch the levels of scan and vmap, i.e. vmap over angles, but scan over detector rows. I would expect that to be slower, but we could try it.
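A rough sketch of what switching the levels could look like, assuming the current layout is scan over angles with vmap over detector rows. `project_row` is a hypothetical placeholder for the per-row, per-angle computation, not the real projector, so this only illustrates the restructuring and not the performance.

```python
import jax
import jax.numpy as jnp


def project_row(volume, angle, row):
    # Hypothetical stand-in for one detector row at one angle (NOT the
    # jaxtomo kernel): pick one slice of the volume and sum along an axis.
    return jnp.sum(volume[row] * jnp.cos(angle), axis=0)


@jax.jit
def fp_scan_angles(volume, angles, rows):
    # Assumed current layout: scan over angles, vmap over detector rows.
    per_angle = jax.vmap(project_row, in_axes=(None, None, 0))

    def step(carry, angle):
        return carry, per_angle(volume, angle, rows)

    _, proj = jax.lax.scan(step, None, angles)
    return proj  # shape (n_angles, n_rows, n_cols)


@jax.jit
def fp_scan_rows(volume, angles, rows):
    # Switched layout: scan over detector rows, vmap over angles.
    per_row = jax.vmap(project_row, in_axes=(None, 0, None))

    def step(carry, row):
        return carry, per_row(volume, angles, row)

    _, proj = jax.lax.scan(step, None, rows)  # (n_rows, n_angles, n_cols)
    return jnp.swapaxes(proj, 0, 1)  # back to (n_angles, n_rows, n_cols)


volume = jnp.ones((16, 16, 16), dtype=jnp.float32)
angles = jnp.linspace(0.0, jnp.pi, 8)
rows = jnp.arange(volume.shape[0])
print(fp_scan_rows(volume, angles, rows).shape)
```

Both functions return the same array; the difference is only in which axis becomes the sequential loop and which one is batched.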

Currently, FP is ~6x slower than BP on CPU:

$ python3 timing.py --fp --bp --size=128
gpu      : None
prealloc : False
pmap     : False
fp       : True
bp       : True
size     : 128
dtype    : 'float32'
==== FP ====
(128, 128, 128) -> (128, 128, 128) :  1098 ms ,  0.52 µs per pixel , 0.002 GRays/s
==== BP ====
(128, 128, 128) -> (128, 128, 128) :   181 ms ,  0.09 µs per voxel , 0.012 GRays/s
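For reference, the per-element and throughput figures above can be reproduced from the runtimes with a few lines of arithmetic. This sketch assumes one ray per output element (per detector pixel for FP, per voxel for BP); the helper name `throughput` is made up for illustration and is not part of timing.py.

```python
def throughput(shape, runtime_ms):
    # Number of output elements, assuming one ray per element.
    n = shape[0] * shape[1] * shape[2]
    us_per_element = runtime_ms * 1e3 / n      # ms -> µs, divided by elements
    grays_per_s = n / (runtime_ms * 1e-3) / 1e9  # elements per second, in 1e9
    return us_per_element, grays_per_s


fp_us, fp_grays = throughput((128, 128, 128), 1098)
bp_us, bp_grays = throughput((128, 128, 128), 181)
ratio = 1098 / 181  # FP runtime relative to BP runtime

print(f"FP: {fp_us:.2f} µs per pixel, {fp_grays:.3f} GRays/s")
print(f"BP: {bp_us:.2f} µs per voxel, {bp_grays:.3f} GRays/s")
print(f"FP / BP: {ratio:.1f}x")
```

The ratio of the two runtimes is where the ~6x figure comes from.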

For JAX performance tips, see google/jax#2940