stencil N=100000 R=3 BLOCK_SIZE=1024
execution time for single-threaded stencil
execution time for multi-threaded stencil with global device memory
execution time for multi-threaded stencil with per-block shared memory
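
The kernels behind these timings are not shown, so the following is only a sketch of the two memory-access patterns being compared, written as plain C that emulates them on the CPU. It assumes the stencil sums the 2R+1 neighbours of each element and treats out-of-range neighbours as zero; that definition, the int element type, and the names stencil_direct and stencil_tiled are hypothetical. RADIUS and BLOCK_SIZE mirror R and BLOCK_SIZE from the header line.

```c
#include <assert.h>
#include <stddef.h>

#define RADIUS 3          /* R from the header line */
#define BLOCK_SIZE 1024   /* BLOCK_SIZE from the header line */

/* Assumed stencil: out[i] = sum of in[i-RADIUS .. i+RADIUS], with
 * out-of-range neighbours counted as zero. Every output element reads
 * its neighbours straight from the input array, which is the access
 * pattern of the global-memory variant. */
void stencil_direct(const int *in, int *out, size_t n)
{
    assert(in != out);  /* in-place use would read overwritten values */
    for (size_t i = 0; i < n; ++i) {
        int sum = 0;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            long j = (long)i + k;
            if (j >= 0 && (size_t)j < n)
                sum += in[j];
        }
        out[i] = sum;
    }
}

/* CPU emulation of the per-block staging used by the shared-memory
 * variant: each block first copies its BLOCK_SIZE elements plus a halo
 * of RADIUS elements on each side into a small tile, then computes all
 * of its outputs from the tile only. */
void stencil_tiled(const int *in, int *out, size_t n)
{
    int tile[BLOCK_SIZE + 2 * RADIUS];

    assert(in != out);
    for (size_t base = 0; base < n; base += BLOCK_SIZE) {
        /* Stage the block and its halo; tile[t] holds in[base + t - RADIUS]. */
        for (int t = 0; t < BLOCK_SIZE + 2 * RADIUS; ++t) {
            long j = (long)base + t - RADIUS;
            tile[t] = (j >= 0 && (size_t)j < n) ? in[j] : 0;
        }
        /* On a GPU, a block-wide barrier would separate the two phases. */
        for (int t = 0; t < BLOCK_SIZE && base + t < n; ++t) {
            int sum = 0;
            for (int k = 0; k <= 2 * RADIUS; ++k)
                sum += tile[t + k];
            out[base + t] = sum;
        }
    }
}
```

In the direct version each input element is fetched from the main array up to 2R+1 times (7 for R=3), while the tiled version fetches it once per block that needs it and reuses it from the tile. On a GPU the tile would typically be a __shared__ array and the outer loop would be the grid of thread blocks, which is the usual motivation for comparing the two variants.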