blender/intern/cycles/kernel
Lukas Stockner fa3d50af95 Cycles: Improve denoising speed on GPUs with small tile sizes
Previously, the NLM kernels would be launched once per offset with one thread per pixel.
However, with the smaller tile sizes that are now feasible, there wasn't enough work to fully occupy GPUs, which resulted in a significant slowdown.

Therefore, the kernels are now launched in a single call that handles all offsets at once.
This has two downsides: Memory accesses to accumulating buffers are now atomic, and more importantly, the temporary memory now has to be allocated for every shift at once, increasing the required memory.
On the other hand, of course, the smaller tiles significantly reduce the amount of memory that is needed.
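
As a rough illustration of the single-launch idea described above, the following C++ sketch emulates on the CPU how one flat work index could be split into a shift and a pixel. All names (WorkItem, decode_work_index, total_work_items, count_valid_shifts) and the index layout are assumptions made for this example, not taken from the Cycles sources. The accumulation into count[] stands in for the accumulating buffers: because many work items target the same pixel, a GPU version would need an atomic add there.

```cpp
#include <vector>

// Hypothetical names throughout; not taken from the Cycles sources.
struct WorkItem {
    int dx, dy;  // shift (offset) inside the search window
    int x, y;    // pixel coordinates inside the tile
};

// Total number of work items when all (2r+1)^2 shifts are handled by one launch.
int total_work_items(int w, int h, int r)
{
    const int side = 2 * r + 1;
    return side * side * w * h;
}

// Decode a flat work index into a shift and a pixel.
// Assumed layout: all pixels of shift 0 first, then all pixels of shift 1, and so on.
WorkItem decode_work_index(int idx, int w, int h, int r)
{
    const int pixels = w * h;
    const int side = 2 * r + 1;
    const int shift = idx / pixels;
    const int pixel = idx % pixels;

    WorkItem item;
    item.dx = shift % side - r;
    item.dy = shift / side - r;
    item.x = pixel % w;
    item.y = pixel / w;
    return item;
}

// CPU emulation of the combined launch: for each pixel, count how many
// shifts keep the shifted pixel inside the tile.
std::vector<int> count_valid_shifts(int w, int h, int r)
{
    std::vector<int> count(w * h, 0);
    const int n = total_work_items(w, h, r);
    for (int idx = 0; idx < n; idx++) {
        const WorkItem it = decode_work_index(idx, w, h, r);
        const int sx = it.x + it.dx;
        const int sy = it.y + it.dy;
        if (sx >= 0 && sx < w && sy >= 0 && sy < h) {
            // Several work items (one per shift) write to the same pixel.
            // In a parallel launch this update would have to be atomic.
            count[it.y * w + it.x] += 1;
        }
    }
    return count;
}
```

With a 4x4 tile and radius 1, this yields 144 work items instead of 9 launches of 16 threads each, which is the kind of increase in parallel work the change is after.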

The main bottleneck right now is the construction of the transformation: there is nothing to be parallelized there, since one thread per pixel is the maximum.
I tried to parallelize the SVD implementation by storing the matrix in shared memory and launching one block per pixel, but that wasn't really going anywhere.

To make the new code somewhat readable, the handling of rectangular regions was cleaned up a bit and commented, so it should be easier to understand what's going on now.
Also, some variables have been renamed to make the difference between buffer width and stride more apparent, in addition to some general style cleanup.
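
To illustrate the width/stride distinction mentioned above, here is a minimal C++ sketch. The helper sum_region and its parameters are hypothetical and only show the general convention: the width is the number of pixels processed per row, while the stride is the distance in elements between the starts of two consecutive rows in memory.

```cpp
#include <vector>

// Hypothetical helper: sum a w x h rectangle starting at (x0, y0) in a
// buffer whose rows are `stride` elements apart. The stride may be larger
// than w, for example when the rectangle is a sub-region of a bigger buffer.
float sum_region(const std::vector<float>& buf,
                 int x0, int y0, int w, int h, int stride)
{
    float sum = 0.0f;
    for (int y = y0; y < y0 + h; y++) {
        for (int x = x0; x < x0 + w; x++) {
            // Stepping to the next row uses the stride, not the region width.
            sum += buf[y * stride + x];
        }
    }
    return sum;
}
```

Indexing with y * w + x instead would only be correct when the region spans the full row, which is the kind of mix-up that distinct variable names help to avoid.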
2017-11-30 07:37:08 +01:00
bvh Code refactor: rename subsurface to local traversal, for reuse. 2017-11-07 22:35:12 +01:00
closure Cycles: Fix wrong behavior of sharpness in Cubic SSS 2017-11-20 11:40:55 +01:00
filter Cycles: Improve denoising speed on GPUs with small tile sizes 2017-11-30 07:37:08 +01:00
geom Cycles: Make per-object random value output also work for Lamps 2017-11-14 04:17:54 +01:00
kernels Cycles: Improve denoising speed on GPUs with small tile sizes 2017-11-30 07:37:08 +01:00
osl Fix build with OSL 1.9.x, automatically aligns to 16 bytes now. 2017-11-20 23:24:24 +01:00
shaders Cycles: Fix OSL brick node after recent fix 2017-11-21 04:30:12 -05:00
split Cycles: Fix crash with split branched path tracing 2017-11-16 04:59:31 -05:00
svm Fix T53348: Cycles difference between gradient texture on CPU and GPU. 2017-11-23 17:14:04 +01:00
CMakeLists.txt Cycles: Improve denoising speed on GPUs with small tile sizes 2017-11-30 07:37:08 +01:00
kernel_accumulate.h Cycles: Add Volume Direct and Volume Indirect passes for volume-scattered light 2017-11-17 16:39:45 +01:00
kernel_bake.h Cycles: Replace __MAX_CLOSURE__ build option with runtime integrator variable 2017-11-09 01:04:06 -05:00
kernel_camera.h Cycles: Remove ccl_fetch and SOA 2017-03-08 00:52:41 -05:00
kernel_compat_cpu.h Code refactor: make texture code more consistent between devices. 2017-10-07 14:53:14 +02:00
kernel_compat_cuda.h Code refactor: make texture code more consistent between devices. 2017-10-07 14:53:14 +02:00
kernel_compat_opencl.h Code refactor: make texture code more consistent between devices. 2017-10-07 14:53:14 +02:00
kernel_differential.h Cycles: OpenCL kernel split 2015-05-09 19:52:40 +05:00
kernel_emission.h Cycles: reduce closure memory usage for emission/shadow shader data. 2017-11-05 20:48:33 +01:00
kernel_film.h Cycles: Use native saturate function for CUDA 2015-04-28 00:38:32 +05:00
kernel_globals.h Code refactor: make texture code more consistent between devices. 2017-10-07 14:53:14 +02:00
kernel_jitter.h Cycles: Use more stable version of integer square root function 2017-05-09 17:07:17 +02:00
kernel_light.h Fix incorrect MIS weights in Cycles with multiple lights. 2017-11-07 22:35:12 +01:00
kernel_math.h Cycles: Make all #include statements relative to cycles source directory 2017-03-29 13:41:11 +02:00
kernel_montecarlo.h Cycles: Cleanup, indendation 2017-10-06 19:33:59 +05:00
kernel_passes.h Cycles: Add Volume Direct and Volume Indirect passes for volume-scattered light 2017-11-17 16:39:45 +01:00
kernel_path_branched.h Cycles: Replace __MAX_CLOSURE__ build option with runtime integrator variable 2017-11-09 01:04:06 -05:00
kernel_path_common.h Code refactor: remove rng_state buffer and compute hash on the fly. 2017-10-04 21:11:14 +02:00
kernel_path_state.h Code cleanup: remove hack to avoid seeing transparent objects in noise. 2017-09-20 19:38:08 +02:00
kernel_path_subsurface.h Code refactor: rename subsurface to local traversal, for reuse. 2017-11-07 22:35:12 +01:00
kernel_path_surface.h Cycles: reduce subsurface stack memory usage. 2017-09-28 15:18:43 +02:00
kernel_path_volume.h Cycles: reduce subsurface stack memory usage. 2017-09-28 15:18:43 +02:00
kernel_path.h Fix T53349: AO bounces not working correct with OpenCL. 2017-11-26 15:53:00 +01:00
kernel_projection.h Cycles: Implement denoising option for reducing noise in the rendered image 2017-05-07 14:40:58 +02:00
kernel_queues.h Cycles: Add function to dequeue a ray 2017-06-10 03:51:18 -04:00
kernel_random.h Cycles: restore SOBOL_SKIP hack, for some cases where it helps still. 2017-10-29 16:44:20 +01:00
kernel_shader.h Cycles: Make per-object random value output also work for Lamps 2017-11-14 04:17:54 +01:00
kernel_shadow.h Cycles: reduce closure memory usage for emission/shadow shader data. 2017-11-05 20:48:33 +01:00
kernel_subsurface.h Cycles: Replace __MAX_CLOSURE__ build option with runtime integrator variable 2017-11-09 01:04:06 -05:00
kernel_textures.h Code refactor: make texture code more consistent between devices. 2017-10-07 14:53:14 +02:00
kernel_types.h Cycles: Add per-tile render time debug pass 2017-11-17 16:40:24 +01:00
kernel_volume.h Cycles: better distance sampling for chromatic volume extinction. 2017-11-10 01:37:10 +01:00
kernel_work_stealing.h Code refactor: add WorkTile struct for passing work to kernel. 2017-10-04 21:11:14 +02:00
kernel.h Code refactor: device memory cleanups, preparing for mapped host memory. 2017-11-05 15:22:04 +01:00