By calculating the size of the state buffer in the kernel rather than the host
less code is needed and the size actually reflects the requested features.
Will also be a little faster in some cases because of larger global work size.
Because the split kernel can render multiple samples in parallel it is
necessary to have everything initialized before rendering of any samples
begins. The code that normally handles initialization of
`rng_state` (`kernel_path_trace_setup()`) only does so for the first sample,
which was causing artifacts in the split kernel due to uninitialized
`rng_state` for some samples.
Note that because the split kernel can render samples in parallel this
means that the split kernel is incompatible with the LCG.
This was only needed for the previous implementation of parallel samples. As
we don't have that any more it can be removed.
Real reason for removal tho is this: `per_sample_output_buffers` was being
calculated too small and artifacts resulted. The tile buffer is already
the correct size and calculating the size for `per_sample_output_buffers`
is a bit difficult with the current layout of the code. As
`per_sample_output_buffers` was only needed for `sum_all_radiance`,
removing that kernel and writing output to the tile buffer directly
fixes the artifacts.
This is to help debug and track memory usage for generic buffers. We
have similar for textures already since those require a name, but for
buffers the name is only for debugging proposes.
Simple workaround for some issues we've been having with AMD drivers hanging
and rendering systems unresponsive. Unfortunately this makes things a bit
slower, but its better than having to do hard reboots. Will be removed when
drivers have been fixed.
Define CYCLES_DISABLE_DRIVER_WORKAROUNDS to disable for testing purposes.
This does a few things at once:
- Refactors host side split kernel logic into a new device
agnostic class `DeviceSplitKernel`.
- Removes tile splitting, a new work pool implementation takes its place and
allows as many threads as will fit in memory regardless of tile size, which
can give performance gains.
- Refactors split state buffers into one buffer, as well as reduces the
number of arguments passed to kernels. Means there's less code to deal
with overall.
- Moves kernel logic out of OpenCL kernel files so they can later be used by
other device types.
- Replaced OpenCL specific APIs with new generic versions
- Tiles can now be seen updating during rendering
Suspended pools allows to push huge amount of initial tasks
without any threading synchronization and hence overhead.
This gives ~50% speedup of cached rigid body with file from
T50027 and seems to have no negative affect in other scenes
here.
The idea is to allow some amount of tasks to be pushed from working
thread to it's local queue, so we can acquire some work without doing
whole mutex lock.
This should allow us to remove some hacks from depsgraph which was
added there to keep threads alive.
This allows us to avoid TLS stored in pool which gives us advantage of
using pre-allocated tasks pool for the pools created from non-main thread.
Even on systems with slow pthread TLS it should not be a problem because
we access it once at a pool construction time. If we want to use this more
often (for example, to get rid of push_from_thread) we'll have to do much
more accurate benchmark.
Basically move all thread-specific data (currently it's only task
memory pool) from a dedicated array of taskScheduler to TaskThread.
This way we can add more thread-specific data in the future with
less of a hassle.
This feature was adding extra complexity to task scheduling
which required yet extra variables to be worried about to be
modified in atomic manner, which resulted in following issues:
- More complex code to maintain, which increases risks of
something going wrong when we modify the code.
- Extra barriers and/or locks during task scheduling, which
causes extra threading overhead.
- Unable to use some other implementation (such as TBB) even for
the comparison tests.
Notes about other changes.
There are two places where we really had to use that limit.
One of them is the single threaded dependency graph. This will
now construct a single-threaded scheduler at evaluation time.
This shouldn't be a problem because it only happens when using
debugging command line arguments and the code simply don't
run in regular Blender operation.
The code seems a bit duplicated here across old and new
depsgraph, but think it's OK since the old depsgraph is already
gone in 2.8 branch and i don't see where else we might want
to use such a single-threaded scheduler.
When/if we'll want to do so, we can move it to a centralized
single-threaded scheduler in threads.c.
OpenGL render was a bit more tricky to port, but basically we
are using conditional variables to wait background thread to
do all the job.
This slightly changes SDef behavior, by now respecting object transforms
at bind time, thus not requiring the objects to be aligned in their
respective local spaces, but instead using world space.
When rendering multi-view in side-by-side or top-bottom mode, we squash
the UI to half of its size and draw it twice on screen. That means the
cursor coordinates used for UI interaction don't match what's visible on
screen.
This commit is a little event system hack (tm) to fix this. It has some
small glitches with cursor grabbing, but nothing to bad.
We'll also use it for viewport HMD support.
D1350, thanks for the feedback @dfelinto!
It was only possible to separate all geometry from an intersection or none.
Made this into an enum with a 3rd option to 'Cut', (now default)
which keeps each side of the intersection separate
without splitting faces in half.
There was a bug in the intended code behaviour to always seek with a
pitch of 1.0 regardless of pitch/pitch animation/doppler effects.
Check the bug report for a more detailed explanation of problems
concerning pitch and seeking.