forked from bartvdbraak/blender
Cycles: Improved thread order for better CUDA performance.
This patch places threads that render the same pixel closer together, rather than threads that render the same sample. Threads within a warp are therefore more coherent in memory access and control flow, which improves performance.

Example benchmarks on a Quadro RTX 4000 (WDDM) on Windows 10:

  Koro: 4:23 -> 3:46
  BMW: 1:18 -> 1:25
  Barbershop Interior: 17:52 -> 14:55
  Classroom: 4:37 -> 3:45

Performance differences on OpenCL/AMD were hit and miss: some scenes became faster, while others became significantly slower. Therefore, this is kept as a CUDA-only change for now.
parent 4887baf7d6
commit 47da8dcbca
@@ -66,9 +66,15 @@ ccl_device_inline void get_work_pixel(ccl_global const WorkTile *tile,
                                       ccl_private uint *y,
                                       ccl_private uint *sample)
 {
+#ifdef __KERNEL_CUDA__
+  /* Keeping threads for the same pixel together improves performance on CUDA. */
+  uint sample_offset = global_work_index % tile->num_samples;
+  uint pixel_offset = global_work_index / tile->num_samples;
+#else /* __KERNEL_CUDA__ */
   uint tile_pixels = tile->w * tile->h;
   uint sample_offset = global_work_index / tile_pixels;
   uint pixel_offset = global_work_index - sample_offset * tile_pixels;
+#endif /* __KERNEL_CUDA__ */
   uint y_offset = pixel_offset / tile->w;
   uint x_offset = pixel_offset - y_offset * tile->w;
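The effect of the two index mappings in this hunk can be checked on the host. The sketch below is illustrative only and is not part of the commit: `TileSketch`, `WorkItem`, `decode_pixel_major`, `decode_sample_major` and `distinct_pixels_in_warp` are hypothetical names, and it assumes that consecutive values of `global_work_index` land on consecutive threads of a 32-thread warp. Under that assumption it counts how many different pixels one warp touches with each ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the WorkTile fields the hunk reads. */
typedef struct TileSketch {
  uint32_t w, h, num_samples;
} TileSketch;

/* Decoded work item: which pixel of the tile, and which sample of that pixel. */
typedef struct WorkItem {
  uint32_t pixel_offset, sample_offset;
} WorkItem;

/* Mapping from the CUDA branch: the sample index varies fastest, so
 * neighbouring work indices share a pixel. */
static WorkItem decode_pixel_major(const TileSketch *tile, uint32_t global_work_index)
{
  assert(tile->num_samples > 0);
  WorkItem item;
  item.sample_offset = global_work_index % tile->num_samples;
  item.pixel_offset = global_work_index / tile->num_samples;
  return item;
}

/* Mapping from the non-CUDA branch: the pixel index varies fastest, so
 * neighbouring work indices share a sample number but not a pixel. */
static WorkItem decode_sample_major(const TileSketch *tile, uint32_t global_work_index)
{
  uint32_t tile_pixels = tile->w * tile->h;
  assert(tile_pixels > 0);
  WorkItem item;
  item.sample_offset = global_work_index / tile_pixels;
  item.pixel_offset = global_work_index - item.sample_offset * tile_pixels;
  return item;
}

typedef WorkItem (*DecodeFn)(const TileSketch *, uint32_t);

/* Count the distinct pixels touched by `warp_size` consecutive work indices
 * starting at `first_index`. Quadratic, which is fine for warp-sized inputs. */
static uint32_t distinct_pixels_in_warp(DecodeFn decode,
                                        const TileSketch *tile,
                                        uint32_t first_index,
                                        uint32_t warp_size)
{
  uint32_t distinct = 0;
  for (uint32_t i = 0; i < warp_size; i++) {
    uint32_t pixel = decode(tile, first_index + i).pixel_offset;
    int seen = 0;
    for (uint32_t j = 0; j < i && !seen; j++) {
      seen = (decode(tile, first_index + j).pixel_offset == pixel);
    }
    if (!seen) {
      distinct++;
    }
  }
  return distinct;
}
```

For a 16x16 tile with 8 samples, the first warp covers 4 pixels with the CUDA mapping and 32 pixels with the other one. This only models which pixels a warp touches; it says nothing about actual render times, which the benchmarks in the commit message show to be scene dependent.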