Yes, that's pi of x. This is a very incomplete implementation of the combinatorial prime counting function of Deleglise and Rivat. See https://github.com/curtisseizert/CUDApix/blob/master/Deconvoluting%20Deleglise-Rivat.pdf for a detailed explanation of the breakdown used in this implementation (and others). At this point, every term of the sum has a corresponding implementation in some state or another. However, only a few of these give the correct answer; the others are off by approximately 1 ppm. The implementations are also poorly organized and scattered across the repository, mostly because I have been bouncing from one to another whenever I hit a wall, hoping that time will make the solutions to my various problems more obvious.
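For orientation, the sum in question starts from the Meissel-type identity pi(x) = phi(x, a) + a - 1 - P2(x, a), where a = pi(y) for a cutoff y near the cube root of x, phi(x, a) counts the integers up to x with no prime factor among the first a primes, and P2 counts the products of exactly two primes that are both larger than y. The following is a minimal CPU sketch of that identity for small x only. It is not the code in this repository: the names `primes_up_to`, `phi`, and `pi_meissel` are made up for illustration, and phi is evaluated by the naive recursion rather than by the leaf decomposition of Deleglise and Rivat.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Primes up to n by a plain sieve of Eratosthenes.
std::vector<uint64_t> primes_up_to(uint64_t n) {
    std::vector<bool> composite(n + 1, false);
    std::vector<uint64_t> primes;
    for (uint64_t i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (uint64_t j = i * i; j <= n; j += i) composite[j] = true;
    }
    return primes;
}

// phi(x, a): integers in [1, x] not divisible by any of the first a primes,
// using phi(x, a) = phi(x, a-1) - phi(x / p_a, a-1).
uint64_t phi(uint64_t x, std::size_t a, const std::vector<uint64_t>& primes) {
    assert(a <= primes.size());
    if (a == 0) return x;
    if (x < primes[a - 1]) return x > 0 ? 1 : 0;  // only n = 1 survives
    return phi(x, a - 1, primes) - phi(x / primes[a - 1], a - 1, primes);
}

// pi(x) = phi(x, a) + a - 1 - P2(x, a), with a = pi(floor(cbrt(x))).
uint64_t pi_meissel(uint64_t x) {
    assert(x <= 100000000ULL);  // the naive recursion is only meant for small x
    if (x < 2) return 0;
    uint64_t y = 0;
    while ((y + 1) * (y + 1) * (y + 1) <= x) ++y;  // y = floor(cbrt(x))

    // Every prime and every quotient x / p needed below is <= x / (y + 1).
    const std::vector<uint64_t> primes = primes_up_to(x / (y + 1));
    auto pi_small = [&](uint64_t v) {
        return static_cast<uint64_t>(
            std::upper_bound(primes.begin(), primes.end(), v) - primes.begin());
    };

    const uint64_t a = pi_small(y);
    uint64_t p2 = 0;
    // P2 = sum over primes y < p <= sqrt(x) of pi(x / p) - pi(p) + 1,
    // and pi(primes[i]) = i + 1.
    for (std::size_t i = a; i < primes.size() && primes[i] * primes[i] <= x; ++i)
        p2 += pi_small(x / primes[i]) - i;

    return phi(x, a, primes) + a - 1 - p2;
}
```

For x = 100 this gives y = 4, a = 2, phi(100, 2) = 33, and P2 = 9, so pi(100) = 33 + 2 - 1 - 9 = 25.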
With the project as it is, it is fair to say that the GPU is doing the right amount of work, and on that assumption the problem appears well suited to GPU computation. As one example, I wrote an OpenMP implementation of one term of the sum, using a nearly identical approach to the CUDA implementation, to compare results and test correctness. The GPU (GTX 1080) performs the same computation as the CPU (i7 6700K overclocked to 4.6 GHz, 8 threads) in one thirtieth of the time.
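To illustrate what such a CPU cross-check can look like, here is a minimal sketch that evaluates one term with an OpenMP reduction and compares it against a brute-force count. Which term the author actually ported is not stated above; P2 is chosen here purely as an assumption because it is the simplest to write down, and the names `p2_parallel` and `p2_bruteforce` are hypothetical. Compiled without `-fopenmp` the pragma is ignored and the loop simply runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Primes up to n by a plain sieve of Eratosthenes.
std::vector<uint64_t> primes_up_to(uint64_t n) {
    std::vector<bool> composite(n + 1, false);
    std::vector<uint64_t> primes;
    for (uint64_t i = 2; i <= n; ++i) {
        if (composite[i]) continue;
        primes.push_back(i);
        for (uint64_t j = i * i; j <= n; j += i) composite[j] = true;
    }
    return primes;
}

// P2(x, y): number of products p * q <= x with primes y < p <= q,
// computed as the sum over p of pi(x / p) - pi(p) + 1.
uint64_t p2_parallel(uint64_t x, uint64_t y) {
    assert(x / (y + 1) <= 100000000ULL);  // the sketch sieves this range in memory
    const std::vector<uint64_t> primes = primes_up_to(x / (y + 1));
    const int64_t n = static_cast<int64_t>(primes.size());
    uint64_t total = 0;

    // Each iteration is independent, so a sum reduction is enough.
    #pragma omp parallel for reduction(+:total) schedule(dynamic, 64)
    for (int64_t i = 0; i < n; ++i) {
        const uint64_t p = primes[i];
        if (p <= y || p * p > x) continue;
        const uint64_t pi_xp = static_cast<uint64_t>(
            std::upper_bound(primes.begin(), primes.end(), x / p) - primes.begin());
        total += pi_xp - static_cast<uint64_t>(i);  // pi(p) = i + 1
    }
    return total;
}

// Reference answer: enumerate the prime pairs directly.
uint64_t p2_bruteforce(uint64_t x, uint64_t y) {
    const std::vector<uint64_t> primes = primes_up_to(x);
    uint64_t count = 0;
    for (std::size_t i = 0; i < primes.size(); ++i) {
        if (primes[i] <= y) continue;
        for (std::size_t j = i; j < primes.size() && primes[i] * primes[j] <= x; ++j)
            ++count;
    }
    return count;
}
```

Comparing `p2_parallel(x, y)` with `p2_bruteforce(x, y)` over a range of small x is the kind of check that catches an off-by-a-few error in a single term before it is buried in the full sum.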
Nonetheless, some significant challenges remain. The first, and most important, is making sure the correct answer is always obtained. This is very difficult when dealing with a sum containing at least five uncertain terms, but the existing implementations (especially Kim Walisch's primecount) have been very helpful here. The second is extending the algorithm to 128-bit territory. I have written a small library of 128-bit functions for CUDA, which keeps all the work on the GPU, but one must also design strategies for segmenting very large data sets to fit within the fixed DRAM capacity of the GPU. It appears that above ~10^24 this will actually require segmentation in multiple dimensions. This is non-trivial.
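As a rough idea of what the 128-bit building blocks involve, the sketch below builds an addition with carry and a full 64 x 64 -> 128 bit multiplication out of 64-bit operations. It is written as portable C++ so it can be run on the host; it is not the library mentioned above, and the `u128` struct and the function names are invented for this example. In CUDA device code the high half of the product is available directly through the `__umul64hi` intrinsic, so the schoolbook multiplication would not be needed there.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit unsigned value stored as two 64-bit halves.
struct u128 {
    uint64_t lo;
    uint64_t hi;
};

// 128-bit addition: the carry out of the low half is detected by wraparound.
u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo ? 1 : 0);
    return r;
}

// Full 64 x 64 -> 128 bit product via four 32 x 32 -> 64 bit partial products.
u128 mul64x64(uint64_t a, uint64_t b) {
    const uint64_t mask = 0xFFFFFFFFu;
    const uint64_t a_lo = a & mask, a_hi = a >> 32;
    const uint64_t b_lo = b & mask, b_hi = b >> 32;

    const uint64_t ll = a_lo * b_lo;
    const uint64_t lh = a_lo * b_hi;
    const uint64_t hl = a_hi * b_lo;
    const uint64_t hh = a_hi * b_hi;

    // Middle column: three values below 2^32 each, so the sum cannot overflow.
    const uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);

    u128 r;
    r.lo = (ll & mask) | (mid << 32);
    r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return r;
}

// Narrow back to 64 bits, checking that nothing is lost.
uint64_t to_u64(u128 v) {
    assert(v.hi == 0);
    return v.lo;
}
```

For example, `mul64x64(1ULL << 32, 1ULL << 32)` yields `hi = 1, lo = 0`, that is 2^64, which no longer fits in a single 64-bit word.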
To sum up, all of the 'infrastructure' for the computation is complete (e.g. 128-bit arithmetic, least prime factor generation, Mobius function generation, and prime number generation via CUDASieve). If one partitions the sum as described by Oliveira e Silva (and implemented by Kim Walisch in primecount), the terms P2, ordinary leaves, and trivial leaves give exactly the right answer in 64 bits. The easy and hard leaves are broken up differently here, and the sum of the two is a bit off. A lot of progress has been made, but a complete, correct implementation is still a long way off. Wish me luck.
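For readers unfamiliar with the least prime factor and Mobius tables mentioned above, both can be generated in a single pass with a linear sieve. The following is a small host-side sketch of that idea, not the GPU generation code in this repository; `SieveTables` and `build_tables` are names made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the two tables, indexed by n.
struct SieveTables {
    std::vector<uint32_t> lpf;  // least prime factor of n (0 for n < 2)
    std::vector<int8_t> mu;     // Mobius function of n
};

// Linear sieve: every composite is written exactly once, by its least prime factor.
SieveTables build_tables(uint32_t n) {
    assert(n <= 100000000u);  // both tables are held in memory
    SieveTables t;
    t.lpf.assign(static_cast<std::size_t>(n) + 1, 0);
    t.mu.assign(static_cast<std::size_t>(n) + 1, 0);
    if (n >= 1) t.mu[1] = 1;

    std::vector<uint32_t> primes;
    for (uint32_t i = 2; i <= n; ++i) {
        if (t.lpf[i] == 0) {  // i was never marked, so it is prime
            t.lpf[i] = i;
            t.mu[i] = -1;
            primes.push_back(i);
        }
        for (uint32_t p : primes) {
            if (p > t.lpf[i] || static_cast<uint64_t>(p) * i > n) break;
            t.lpf[p * i] = p;
            // p dividing i means p^2 divides p * i, so mu is 0;
            // otherwise one more distinct prime flips the sign.
            t.mu[p * i] = (p == t.lpf[i]) ? int8_t{0} : static_cast<int8_t>(-t.mu[i]);
        }
    }
    return t;
}
```

With `build_tables(100)`, `mu[30]` is -1 (three distinct primes), `mu[12]` is 0 (divisible by 4), and `lpf[49]` is 7.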