1. The code now works without CUDA_LAUNCH_BLOCKING set, by using explicit synchronization where required.
2. The code has also been modified to use thread-specific memory spaces, which for Kokkos' Cuda backend means per-thread streams.
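The pattern behind both changes can be sketched in plain standard C++. This is an analogy only, not the Kokkos or CUDA code from this change: the `Stream` class and `run_workers` function are hypothetical names, and `std::async` stands in for an asynchronous kernel launch. Each host thread owns its own stream, submits work that returns immediately, and calls an explicit fence before reading the result, which is the step that blocking launches would otherwise have hidden.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-thread stream: launch() returns immediately,
// work on one stream runs in submission order, and fence() blocks until all
// work submitted so far has finished.
class Stream {
public:
    template <class F>
    void launch(F&& work) {
        std::shared_future<void> prev = last_;
        last_ = std::async(std::launch::async,
                           [prev, work = std::forward<F>(work)]() mutable {
                               if (prev.valid()) prev.wait();  // keep in-stream ordering
                               work();
                           }).share();
    }

    // Explicit synchronization point for this stream only.
    void fence() const {
        if (last_.valid()) last_.wait();
    }

private:
    std::shared_future<void> last_;
};

// Each host thread creates its own stream, launches asynchronous work on it,
// and fences that stream before consuming the data.
std::vector<long> run_workers(int num_threads, int n) {
    std::vector<long> sums(num_threads, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([t, n, &sums] {
            std::vector<long> data(n);
            Stream stream;  // thread-specific stream, not shared with other threads
            stream.launch([&data, t] {
                std::iota(data.begin(), data.end(), static_cast<long>(t));
            });
            stream.fence();  // without this, the read below would race with the launch
            sums[t] = std::accumulate(data.begin(), data.end(), 0L);
        });
    }
    for (auto& w : workers) w.join();
    return sums;
}
```

Calling `run_workers(3, 100)` fills one slice per thread and sums it after the fence. Because each thread fences only its own stream, the threads do not stall one another, which is the benefit a per-thread stream is meant to give over a single globally synchronized one.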