Cycles: Improved thread order for better CUDA performance.

This patch puts threads that render the same pixel closer together,
as opposed to threads that render the same sample. Thus threads
within a warp are more coherent in memory access and control flow,
leading to performance improvements.

Example benchmarks on a Quadro RTX4000 (WDDM) on Windows 10:
Koro:                 4:23 ->  3:46
BMW:                  1:18 ->  1:25
Barbershop Interior: 17:52 -> 14:55
Classroom:            4:37 ->  3:45

Performance differences on OpenCL/AMD were hit and miss, some scenes
became faster, others lost significantly. Therefore, this is kept as
CUDA only change for now.
This commit is contained in:
Stefan Werner 2019-03-14 11:45:58 +01:00
parent 4887baf7d6
commit 47da8dcbca

@ -66,9 +66,15 @@ ccl_device_inline void get_work_pixel(ccl_global const WorkTile *tile,
ccl_private uint *y,
ccl_private uint *sample)
{
#ifdef __KERNEL_CUDA__
/* Keeping threads for the same pixel together improves performance on CUDA. */
uint sample_offset = global_work_index % tile->num_samples;
uint pixel_offset = global_work_index / tile->num_samples;
#else /* __KERNEL_CUDA__ */
uint tile_pixels = tile->w * tile->h;
uint sample_offset = global_work_index / tile_pixels;
uint pixel_offset = global_work_index - sample_offset * tile_pixels;
#endif /* __KERNEL_CUDA__ */
uint y_offset = pixel_offset / tile->w;
uint x_offset = pixel_offset - y_offset * tile->w;