We now use SumYAxis when executing with CUDA for better memory access patterns.
Instead of using the heavy Pass4/Pass4WithNormals, CUDA now uses a
two-pass approach in which the second pass outputs the normals and
coordinates with significantly less warp divergence.
First, Pass1 now uses WorkletVisitPointsWithCells, which is faster
because it doesn't compute some implicit boundary / neighborhood
info.
Second, we reworked the logic around using `Fill` and a conditional
write of the edge case. The requirements of SMP on a NUMA machine
are the complete opposite of what works well with CUDA.
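
To make the trade-off concrete, here is a minimal sketch of the two write
strategies in plain C++. It does not use the actual worklet code or any
library API: `ClassifyEdge`, `FillThenConditionalWrite`,
`UnconditionalWrite`, and the edge-case encoding are hypothetical names
chosen for illustration. It only shows that the two strategies produce the
same edge-case array while touching memory differently; which one suits
which device is the tuning decision described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical encoding: 0 means both endpoints are below the isovalue,
// which is assumed here to be the most common case.
constexpr std::uint8_t kDefaultCase = 0;

// Classify the edge between samples i and i + 1 against the isovalue.
// Bit 0 is set when the left sample is at or above the isovalue,
// bit 1 when the right sample is.
inline std::uint8_t ClassifyEdge(const std::vector<float>& field,
                                 std::size_t i,
                                 float iso)
{
  const bool left = field[i] >= iso;
  const bool right = field[i + 1] >= iso;
  return static_cast<std::uint8_t>((left ? 1 : 0) | (right ? 2 : 0));
}

// Strategy A: fill the whole array with the default case first, then
// write only the entries whose case differs from the default.
std::vector<std::uint8_t> FillThenConditionalWrite(const std::vector<float>& field,
                                                   float iso)
{
  const std::size_t numEdges = field.empty() ? 0 : field.size() - 1;
  std::vector<std::uint8_t> cases(numEdges);
  std::fill(cases.begin(), cases.end(), kDefaultCase);
  for (std::size_t i = 0; i < numEdges; ++i)
  {
    const std::uint8_t c = ClassifyEdge(field, i, iso);
    if (c != kDefaultCase)
    {
      cases[i] = c; // conditional write
    }
  }
  return cases;
}

// Strategy B: skip the fill and write every entry exactly once,
// with no branch around the store.
std::vector<std::uint8_t> UnconditionalWrite(const std::vector<float>& field,
                                             float iso)
{
  const std::size_t numEdges = field.empty() ? 0 : field.size() - 1;
  std::vector<std::uint8_t> cases(numEdges);
  for (std::size_t i = 0; i < numEdges; ++i)
  {
    cases[i] = ClassifyEdge(field, i, iso); // unconditional write
  }
  return cases;
}
```

Strategy A makes an extra pass over the output but stores less in the main
loop; Strategy B touches each output entry once and keeps the store free of
branches.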