While atomics library was trying to use "user-space" defined
LIKELY() and UNLIKELY(), this is not always true that user
code was checking for those macro coming from an unrelated
area.
This commit adds a sample-based profiler that runs during CPU rendering and collects statistics on time spent in different parts of the kernel (ray intersection, shader evaluation etc.) as well as time spent per material and object.
The results are currently not exposed in the user interface or per Python yet, to see the stats on the console pass the "--cycles-print-stats" argument to Cycles (e.g. "./blender -- --cycles-print-stats").
Unfortunately, there is no clear way to extend this functionality to CUDA or OpenCL, so it is CPU-only for now.
Reviewers: brecht, sergey, swerner
Reviewed By: brecht, swerner
Differential Revision: https://developer.blender.org/D3892
The idea is to make main thread and job threads to be scheduled
on CPU dies which has direct access to memory (those are NUMA
nodes 0 and 2).
We also do this for new EPYC CPUs since their NUMA nodes 1 and 3
do have access but only to a higher range DDR slots. By preferring
nodes 0 and 2 on EPYC we make it so users with partially filled
DDR slots has fast memory access.
One thing which is not really solved yet is localization of
memory allocation: we do not guarantee that memory is allocated
on the closest to the NUMA node DDR slot and hope that memory
manager of OS is acting in favor of us.
In some instances, the number of control vertices of a hair could change mid-frame.
Cycles would then be unable to calculate proper motion blur for those hairs. This adds
interpolated CVs to fill in for the missing data. While this will not necessarily result in
a fully accurate reconstruction of the guide hair, it preserves motion blur instead of disabling it.
Reviewers: #cycles, sergey
Reviewed By: #cycles, sergey
Subscribers: sergey, brecht, #cycles
Tags: #cycles
Differential Revision: https://developer.blender.org/D3695
Second part of the fix: do not try at all to compute normals in degenerated
geometry. Just loss of time and potential issues later with weird
invalid computed values.
The goal is to address performance regression when going from
few threads to 10s of threads. On a systems with more than 32
CPU threads the benefit of threaded loop was actually harmful.
There are following tweaks now:
- The chunk size is adaptive for the number of threads, which
minimizes scheduling overhead.
- The number of tasks is adaptive to the list size and chunk
size.
Here comes performance comparison on the production shot:
Number of threads DEG time before DEG time after
44 0.09 0.02
32 0.055 0.025
16 0.025 0.025
8 0.035 0.033
For some reason when building with gcc-7.2 (which is default
in previous Ubuntu LTS) the guarded allocator is not being
properly instantiated.
Doesn't happen with newer version of gcc-7 which is 7.3, and
also doesn't happen with gcc-6 and gcc-8.
Would be nice to know what is wrong, but for the time being
committing workaround which keeps Blender users happy.