AMReX-Astro / Castro

Castro (Compressible Astrophysics): An adaptive mesh, astrophysical compressible (radiation-, magneto-) hydrodynamics simulation code for massively parallel CPU and GPU architectures.

Home Page: http://amrex-astro.github.io/Castro

gpu running slower than cpu for subchandra problem

brady-ryan opened this issue

My subchandra simulations frequently run much faster on CPUs than on GPUs, when the opposite should be true. The plot below compares the CPU and GPU runs from the beginning of my simulation. It seems as though the GPU struggles almost immediately. All simulations are run on Perlmutter at NERSC.
[Figure: gpu_vs_cpu_subchandra10km]

Could you please post some relevant details, like the inputs file you used and the stdout from the CPU and GPU runs?

Sure, my problem model is sub_chandra.M_WD-0.98.M_He-0.050.delta10.00km.hse.CO.C12.O16.10.00km.isothermal. I have attached my inputs file and linked downloads of the CPU and GPU .out files, as they are too big to upload to GitHub. Please let me know if you need anything else, thanks!

CPU
GPU

inputs_2d.N14.coarse.txt

"My subchandra simulations frequently run much faster when using cpu’s compared to gpu’s, when the opposite should be true."

GPUs are massively parallel processors. They can launch thousands to tens of thousands of threads at a time, compared to the tens to hundreds of threads launched on a CPU. The cost of all that parallelism is that each individual GPU thread is significantly slower than a CPU thread. A GPU will therefore only beat a CPU when there is enough work to keep all of its threads busy at once, so that it performs much more work in parallel than the CPU can in the same amount of time. A rule of thumb is that you should have at least 1 million zones per GPU to ensure that all of the threads are saturated. But for this run you have:

INITIAL GRIDS 
  Level 0   100 grids  819200 cells  100 % of domain
            smallest grid: 128 x 64  biggest grid: 128 x 64
  Level 1   72 grids  73728 cells  2.25 % of domain
            smallest grid: 32 x 32  biggest grid: 32 x 32
  Level 2   100 grids  204800 cells  1.5625 % of domain
            smallest grid: 64 x 32  biggest grid: 64 x 32
  Level 3   65 grids  414720 cells  0.791015625 % of domain
            smallest grid: 64 x 32  biggest grid: 128 x 64

which means that on Level 3, you have fewer than 10,000 zones per GPU.
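
The per-level arithmetic can be checked directly from the grid summary. The following is a minimal sketch that parses the `Level N ... grids ... cells` lines quoted above and divides each cell count by a GPU count. The value `n_gpus = 64` is a placeholder assumption, since the thread does not state how many GPUs the run used, and the helper names are illustrative rather than part of Castro or AMReX.

```python
import re

# Grid summary copied from the run's stdout, as quoted in the thread.
GRID_SUMMARY = """\
INITIAL GRIDS
  Level 0   100 grids  819200 cells  100 % of domain
  Level 1   72 grids  73728 cells  2.25 % of domain
  Level 2   100 grids  204800 cells  1.5625 % of domain
  Level 3   65 grids  414720 cells  0.791015625 % of domain
"""

# Matches "Level <n>   <grids> grids  <cells> cells".
LEVEL_RE = re.compile(r"Level\s+(\d+)\s+(\d+)\s+grids\s+(\d+)\s+cells")


def cells_per_level(summary):
    """Return {level: number of cells} parsed from a grid summary."""
    return {int(m.group(1)): int(m.group(3)) for m in LEVEL_RE.finditer(summary)}


def zones_per_gpu(summary, n_gpus):
    """Return {level: average cells per GPU}, assuming an even distribution."""
    return {lev: cells / n_gpus for lev, cells in cells_per_level(summary).items()}


if __name__ == "__main__":
    # ASSUMPTION: placeholder value; the actual GPU count is not given in the thread.
    n_gpus = 64
    for lev, zones in zones_per_gpu(GRID_SUMMARY, n_gpus).items():
        print(f"Level {lev}: {zones:.0f} zones per GPU")
```

With any GPU count of this order, every level lands far below the 1 million zone rule of thumb, which is the point being made about Level 3.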

The upshot is that you will need to run a significantly bigger problem on the GPU in order to see a performance benefit relative to a CPU. This can be done by increasing the base resolution of the grid (amr.n_cell), by increasing the number of levels of refinement you use, or both.
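
For a rough sense of how far the problem size would need to grow, the sketch below applies the 1 million zones per GPU rule of thumb to the total zone count from the grid summary, then scales that count for a finer base grid. It assumes a 2D problem (as the inputs file name suggests) and that the refined regions keep covering the same fraction of the domain when `amr.n_cell` is increased, so every level grows by the same factor. Both are simplifying assumptions, and the function names are illustrative.

```python
# Cell counts per level, taken from the grid summary in the thread.
CELLS = {0: 819200, 1: 73728, 2: 204800, 3: 414720}

ZONES_PER_GPU_TARGET = 1_000_000  # rule of thumb quoted in the thread


def scaled_total(cells, factor_per_dim, ndim=2):
    """Total zones after multiplying amr.n_cell by factor_per_dim in each dimension.

    ASSUMPTION: refined regions cover the same fraction of the domain,
    so every level grows by factor_per_dim ** ndim.
    """
    return sum(cells.values()) * factor_per_dim ** ndim


def gpus_saturated(total_zones, target=ZONES_PER_GPU_TARGET):
    """Largest GPU count whose average load meets the target (never below 1)."""
    return max(1, total_zones // target)


if __name__ == "__main__":
    for factor in (1, 2, 4):
        total = scaled_total(CELLS, factor)
        print(f"n_cell x{factor}: {total} zones, about {gpus_saturated(total)} GPU(s) at the target")
```

Under these assumptions the hierarchy as listed holds about 1.5 million zones in total, which by the rule of thumb is enough work for roughly one GPU. This estimate averages over all levels, so it is an optimistic bound: the zone count on each individual level is what has to keep the GPUs busy.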

I think that this has been sorted out now.