Notes:
- There is still some BVH cache code, but that is from the engine's initial
  commit; we might clean this up further or keep it.
- Changes in util_cache.h/.c are kept, as they might be re-used in the future.
Avoid a memmove() happening on every insert of a duplicated node into the
references list. A temporary pre-allocated vector is used for the new
references, which is then inserted into the actual array in one go later.
This gives around a 4x speedup when building the spatially split BVH for the
grass field in the cassette player shot from Gooseberry.
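In pseudo-form the change boils down to something like the following sketch
(the container and type names here are illustrative, not the actual builder
code):

  #include <vector>

  struct BVHReference { int prim_index, prim_object; /* bounds etc. */ };

  /* Before: every duplicated reference was inserted into the middle of the
   * big references list on its own, shifting the tail (memmove) each time. */
  static void add_reference_naive(std::vector<BVHReference> &refs,
                                  size_t split_pos,
                                  const BVHReference &dup)
  {
    refs.insert(refs.begin() + split_pos, dup); /* O(n) shift per insert */
  }

  /* After: new references for the current split are collected into a small
   * pre-allocated scratch vector and spliced in with a single insert, so the
   * tail of the big list is moved only once per split. */
  static void add_references_batched(std::vector<BVHReference> &refs,
                                     size_t split_pos,
                                     std::vector<BVHReference> &scratch)
  {
    refs.insert(refs.begin() + split_pos, scratch.begin(), scratch.end());
    scratch.clear(); /* keeps its capacity for the next split */
  }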
This commit implements spatial splitting of object reference nodes, making it
possible to use spatial splits for the top-level BVH.
The code is not in use yet because enabling spatial splits on the top-level
BVH does not come for free, and it still needs to be investigated whether it's
worth it in terms of improved render times.
This makes it easy to add more reference types allowed for splitting to the
BVH reference split function without making that function too big. It also
makes it possible to experiment with features such as splitting object
instance references.
So far there should not be any functional changes.
The previous idea behind having a vector during building and an array for the
actual storage was to minimize the number of re-allocations happening during
the build, but it led to double memory overhead for those arrays at the vector
to array conversion stage.
The issue with that approach was that for a BVH without spatial splits the
size of the arrays is known in advance and never changes, which made the
vector to array conversion totally redundant.
Also, after testing with several scenes that are rather complex from the
spatial split point of view (such as trees), it seems even a conservative
reallocation approach (re-allocating only when a leaf does not fit into the
allocated memory) doesn't give a measurable difference in time.
This makes it possible to switch to an array, which avoids unneeded memory
re-allocations when spatial splits are disabled without harming other cases.
It's a bit difficult to measure the exact benefit of this change on our
production files here, but depending on the scene it might give quite
reasonable memory savings.
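As a rough illustration of the conservative reallocation mentioned above (the
helper and its names are made up for the sketch, not the actual builder code):

  #include <vector>

  /* Grow the primitive storage only when the leaf being emitted does not fit
   * anymore. Without spatial splits the storage is sized exactly once up
   * front, so this never triggers and no vector-to-array conversion is
   * needed; with spatial splits the reallocation is rare enough that it was
   * not measurable in the tests mentioned above. */
  template<typename T>
  static void ensure_leaf_fits(std::vector<T> &storage,
                               size_t used,
                               size_t leaf_size)
  {
    if (used + leaf_size > storage.size()) {
      storage.resize(used + leaf_size);
    }
  }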
This way we can get rid of the inefficient memory usage caused by the BVH
boundbox part being unused by leaf nodes but still being allocated for them.
Doing such a split saves 6 float4 values per leaf node for QBVH and 3 float4
values per leaf node for the regular BVH.
This translates into the following memory savings using 01.01.01.G rendered
without hair:
                    Device memory size   Device memory peak   Global memory peak
  Before the patch:       4957                 5051                 7668
  With the patch:         4467                 4562                 7332
The measurements are done against current master. Still need to run speed
tests, and it's hard to predict whether it's faster or not: on the one hand
leaf nodes are now much more coherent in cache, on the other hand they're not
as coherent with regular nodes anymore.
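Purely as an illustration of the layout split (the struct and field names
below are assumptions for the sketch, not the actual kernel data layout; the
sizes just match the numbers quoted above):

  struct float4 { float x, y, z, w; };

  /* Inner node of the regular (binary) BVH: bounds of both children take
   * 3 float4 (12 floats), plus one float4 for child indices/flags. A QBVH
   * inner node would carry 6 float4 of bounds for its four children. */
  struct PackedInnerNode {
    float4 children_bounds[3];
    float4 children_and_flags;
  };

  /* Leaf node: only the primitive range, visibility and type information is
   * needed, so a dedicated leaf array can store just one float4 per leaf
   * instead of dragging the unused boundbox part along. */
  struct PackedLeafNode {
    float4 prim_range_and_flags;
  };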
Reviewers: brecht, juicyfruit
Subscribers: venomgfx, eyecandy
Differential Revision: https://developer.blender.org/D1236
It was an issue with which bounds to use for the BVH node during construction.
Also corrected the case when all 4 primitive types are present in the range
and there are also objects in the same range.
This inconsistency drove me totally crazy; it's really confusing when it's
inconsistent, especially when you work on both the Cycles and Blender sides.
Shouldn't cause a merge PITA, it's whitespace changes only, Git should be able
to merge it nicely.
The issue was caused by a mismatch between how the aligned triangle storage
was filled in during BVH construction and how it was used during rendering.
Basically, I was leaving the storage for triangles uninitialized when
deformation motion blur was detected for the mesh. That was likely some sort
of optimization, but in fact it's still possible that regular triangles would
be needed for rendering.
So now we're storing aligned storage for all triangle primitives and only
skipping motion triangles (the deformation motion blur flag from the mesh is
now ignored).
Ideally we should get rid of those temporary vectors anyway, but it's not so
trivial because of the alignment. Until then we'll just have a slightly worse
solution. This part of the code is not the root of the memory spike issue for
now anyway.
But since we're getting rid of the temporary memory earlier, the actual spike
is a bit smaller now. For example, in the franck_sheep file it's now
5489.69MB vs. previously 5599.90MB.
This way we save 3 bytes per BVH node while building the BVH, which overall
gives a 100MB memory saving when preparing Frank for render.
It's not really much compared to the overall memory usage (which is 11GB
during scene preparation here), but it still doesn't hurt to have it solved.
When doing a BVH leaf node split we can't rely on the leaf size limit from
the BVH parameters in case spatial splits are enabled.
This commit basically reverts the previous optimization change here, which
used stack-allocated memory, and uses a heap-allocated vector now.
It's possible to boost this code up again by using our own allocator.
This commit basically makes it so statistics prints from different BVH trees
are not interleaved with each other. Glog ensures this when the debug print is
done as a single put-to-stream operation.
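For reference, the single put-to-stream pattern is along these lines (the
statistics fields here are placeholders):

  #include <glog/logging.h>
  #include <sstream>

  /* Glog writes the whole argument chain of one VLOG statement atomically,
   * so building the complete report first and emitting it with a single <<
   * keeps prints from concurrently built trees from interleaving. */
  static void report_bvh_statistics(int num_inner_nodes, int num_leaf_nodes)
  {
    std::stringstream ss;
    ss << "BVH statistics:\n"
       << "  Number of inner nodes: " << num_inner_nodes << "\n"
       << "  Number of leaf nodes:  " << num_leaf_nodes << "\n";
    VLOG(1) << ss.str();
  }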
Since leaf nodes get split further into per-primitive-type leaves, the old
check for the number of curves became a bit ridiculous -- it could lead to two
leaf nodes each of which contains only one curve primitive (one motion curve
and one regular curve).
This led to quite a dramatic slowdown for the Victor model -- around 40%,
which is totally unacceptable.
This commit is aimed at preventing such a situation, and from a quick render
test it seems Victor is now back to its normal render time. Further testing is
needed though.
There are also other ideas about splitting the node; will need to look into
them next.
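One possible way to express such a check is sketched below; the threshold and
names are illustrative, not necessarily what the commit actually does:

  /* Only split a mixed curve leaf into per-type leaves when there are enough
   * curve primitives to justify the extra nodes; otherwise keep regular and
   * motion curves together so we don't end up with one-curve leaves. */
  static bool curve_leaf_split_worth_it(int num_curves,
                                        int num_motion_curves,
                                        int max_curve_leaf_size)
  {
    return num_curves + num_motion_curves > max_curve_leaf_size;
  }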
This commit enables splitting BVH leaf nodes by primitive type and makes the
BVH traversal code aware of this so it benefits from it.
As was mentioned in the original commit, this change is crucial to be able to
do single ray to multiple triangle intersection. But it also appears to give a
barely visible speedup in some scenes.
In any case there should be no noticeable slowdown, and this change is
something we need anyway.
The idea of this change is to make it possible to split leaf nodes by
primitive type, so that each leaf contains primitives of the same type.
This will come in handy when working on single ray to multiple triangles
intersection code, plus with a careful implementation it might give some extra
benefit in the BVH traversal code by avoiding the primitive type fetch and
check for each primitive in the node. But it's a bit tricky to get benefits
from this change alone because the depth of the BVH increases.
This option is not exposed to the interface at all and is not used even
secretly; the commit is only needed to help further work in this direction
without messing around with local patches and worrying about them running out
of date.
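A simplified sketch of the idea (the types, enum values and the emit helper
are stand-ins for the real builder code):

  #include <algorithm>
  #include <vector>

  enum PrimitiveType {
    PRIMITIVE_TRIANGLE,
    PRIMITIVE_MOTION_TRIANGLE,
    PRIMITIVE_CURVE,
    PRIMITIVE_MOTION_CURVE,
  };

  struct Reference {
    int prim_index;
    PrimitiveType type;
  };

  /* Stub standing in for whatever creates one leaf node from a run of
   * references. */
  static void emit_leaf(const Reference * /*refs*/, size_t /*num*/) {}

  /* Group the leaf's sub-range [start, end) by primitive type and emit one
   * leaf per contiguous run, so every leaf contains a single type. */
  static void split_leaf_by_primitive_type(std::vector<Reference> &refs,
                                           size_t start, size_t end)
  {
    std::sort(refs.begin() + start, refs.begin() + end,
              [](const Reference &a, const Reference &b) {
                return a.type < b.type;
              });

    size_t run_start = start;
    for (size_t i = start + 1; i <= end; i++) {
      if (i == end || refs[i].type != refs[run_start].type) {
        emit_leaf(&refs[run_start], i - run_start);
        run_start = i;
      }
    }
  }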
This is harmless for now because the tail of the node is zero there, but it's
better to fix it early so that in case the BVH nodes get extended this code
doesn't give issues.
This commit implements traversal for the QBVH tree, which is based on the old
loop code for the traversal itself and on Embree for the node intersection.
This commit also makes some changes to the loop inspired by Embree:
- Visibility flags are only checked for primitives.
  Doing the visibility check for every node costs quite a reasonable amount
  of time and in most cases those checks are true-positive.
  Another idea here would be to do visibility checks for leaf nodes only, but
  this would need to be investigated further.
- For the minimum hair width we extend all the nodes' bounding boxes.
  Again, doing the curve visibility check is quite costly for each of the
  nodes and those checks return true for most of the hierarchy anyway.
There are a number of possible optimizations still, but the current state is
good enough in that it makes rendering a little bit faster after the recent
watertight commit.
Currently QBVH is only implemented for CPUs with at least SSE2 support. All
other devices would need to be supported later (if that makes sense from a
performance point of view).
The code is enabled for compilation in the kernel, but Blender wouldn't use it
yet.
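As a rough, self-contained illustration of the kind of 4-wide node/ray box
test such a traversal is built around (the SoA layout and names are
illustrative, not the kernel's actual data structures):

  #include <emmintrin.h> /* SSE2 */

  /* Bounds of the four children of one QBVH node, stored SoA so a single SSE
   * operation handles all four boxes at once. */
  struct QBVHNodeBounds {
    __m128 lo_x, lo_y, lo_z;
    __m128 hi_x, hi_y, hi_z;
  };

  /* Returns a 4-bit mask of children whose slabs overlap [t_near, t_far].
   * Ray origin and inverse direction components are pre-splatted to all four
   * lanes; for the minimum hair width case the bounds would simply be padded
   * before this test. */
  static int qbvh_node_intersect(const QBVHNodeBounds &n,
                                 const __m128 org[3],
                                 const __m128 inv_dir[3],
                                 __m128 t_near, __m128 t_far)
  {
    __m128 tx0 = _mm_mul_ps(_mm_sub_ps(n.lo_x, org[0]), inv_dir[0]);
    __m128 tx1 = _mm_mul_ps(_mm_sub_ps(n.hi_x, org[0]), inv_dir[0]);
    __m128 ty0 = _mm_mul_ps(_mm_sub_ps(n.lo_y, org[1]), inv_dir[1]);
    __m128 ty1 = _mm_mul_ps(_mm_sub_ps(n.hi_y, org[1]), inv_dir[1]);
    __m128 tz0 = _mm_mul_ps(_mm_sub_ps(n.lo_z, org[2]), inv_dir[2]);
    __m128 tz1 = _mm_mul_ps(_mm_sub_ps(n.hi_z, org[2]), inv_dir[2]);

    __m128 tmin = _mm_max_ps(_mm_max_ps(_mm_min_ps(tx0, tx1),
                                        _mm_min_ps(ty0, ty1)),
                             _mm_max_ps(_mm_min_ps(tz0, tz1), t_near));
    __m128 tmax = _mm_min_ps(_mm_min_ps(_mm_max_ps(tx0, tx1),
                                        _mm_max_ps(ty0, ty1)),
                             _mm_min_ps(_mm_max_ps(tz0, tz1), t_far));

    /* Lane i is all ones when child i is hit. */
    return _mm_movemask_ps(_mm_cmple_ps(tmin, tmax));
  }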
Using this paper: Sven Woop, Watertight Ray/Triangle Intersection
http://jcgt.org/published/0002/01/05/paper.pdf
This change is expected to address quite a reasonable number of reports from
the bug tracker, plus it might help reducing the noise in some scenes.
Unfortunately, it's currently about 7% slower than the previous solution with
pre-computed triangle plane equations, but maybe with some smart tweaks to the
code (reshuffling the tests, using SIMD in a nicer way and so on) we can avoid
the speed regression.
But perhaps the smartest thing to do here would be to replace the single
triangle / ray intersection with multiple triangles / ray intersection. That's
how Embree does it, and its watertight single ray intersection is not any
faster than this.
Currently only the triangle intersection is modified according to the paper;
in the future we would also want to modify the node / ray intersection.
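For reference, a condensed sketch of the paper's single-triangle test is
below (it leaves out the double precision edge fallback the paper also
describes, and the names are not the kernel's):

  #include <cmath>

  struct float3 { float x, y, z; };

  static inline float comp(const float3 &v, int i)
  {
    return (i == 0) ? v.x : (i == 1) ? v.y : v.z;
  }

  /* Per-ray precomputation: pick the dominant direction axis as kz and set
   * up the shear that maps the ray onto the +z axis. */
  struct WatertightRay {
    int kx, ky, kz;
    float Sx, Sy, Sz;
    float3 org;
  };

  static WatertightRay watertight_ray_setup(const float3 &org, const float3 &dir)
  {
    WatertightRay r;
    r.org = org;
    r.kz = 0;
    if (fabsf(dir.y) > fabsf(comp(dir, r.kz))) r.kz = 1;
    if (fabsf(dir.z) > fabsf(comp(dir, r.kz))) r.kz = 2;
    r.kx = (r.kz + 1) % 3;
    r.ky = (r.kx + 1) % 3;
    /* Swap kx/ky to preserve winding when the dominant component is negative. */
    if (comp(dir, r.kz) < 0.0f) { int tmp = r.kx; r.kx = r.ky; r.ky = tmp; }
    r.Sx = comp(dir, r.kx) / comp(dir, r.kz);
    r.Sy = comp(dir, r.ky) / comp(dir, r.kz);
    r.Sz = 1.0f / comp(dir, r.kz);
    return r;
  }

  static bool watertight_triangle_intersect(const WatertightRay &r,
                                            const float3 &v0, const float3 &v1,
                                            const float3 &v2, float t_max,
                                            float *t, float *u, float *v)
  {
    /* Translate vertices to the ray origin and shear them into ray space. */
    const float3 A = {v0.x - r.org.x, v0.y - r.org.y, v0.z - r.org.z};
    const float3 B = {v1.x - r.org.x, v1.y - r.org.y, v1.z - r.org.z};
    const float3 C = {v2.x - r.org.x, v2.y - r.org.y, v2.z - r.org.z};
    const float Ax = comp(A, r.kx) - r.Sx * comp(A, r.kz);
    const float Ay = comp(A, r.ky) - r.Sy * comp(A, r.kz);
    const float Bx = comp(B, r.kx) - r.Sx * comp(B, r.kz);
    const float By = comp(B, r.ky) - r.Sy * comp(B, r.kz);
    const float Cx = comp(C, r.kx) - r.Sx * comp(C, r.kz);
    const float Cy = comp(C, r.ky) - r.Sy * comp(C, r.kz);

    /* Scaled barycentric coordinates from 2D edge functions. */
    const float U = Cx * By - Cy * Bx;
    const float V = Ax * Cy - Ay * Cx;
    const float W = Bx * Ay - By * Ax;

    /* Reject when the signs disagree (the paper re-evaluates U/V/W in double
     * precision when any of them is exactly zero; omitted here). */
    if ((U < 0.0f || V < 0.0f || W < 0.0f) &&
        (U > 0.0f || V > 0.0f || W > 0.0f))
      return false;

    const float det = U + V + W;
    if (det == 0.0f)
      return false;

    /* Scaled hit distance along the sheared ray. */
    const float T = U * r.Sz * comp(A, r.kz) +
                    V * r.Sz * comp(B, r.kz) +
                    W * r.Sz * comp(C, r.kz);

    const float inv_det = 1.0f / det;
    const float hit_t = T * inv_det;
    if (!(hit_t > 0.0f && hit_t < t_max))
      return false;

    *u = U * inv_det;
    *v = V * inv_det;
    *t = hit_t;
    return true;
  }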
Reviewers: brecht, juicyfruit
Subscribers: dingto, ton
Differential Revision: https://developer.blender.org/D819
The idea is to store visibility flags for leaf nodes only, since the
visibility check for inner nodes costs too much for QBVH and hence is not
optimal to perform.
Leaf QBVH nodes have plenty of space to store all sorts of flags, so we can
make the nodes one element smaller, saving a noticeable amount of memory.
Previously offsets were calculated based on the BVH node size, which is wrong
and a real PITA in cases when some extra data is to be added to (or removed
from) the node.
Now we use offsets which are not calculated from the node size.
This solves quite an over-allocation in the BVH instances packing code;
unfortunately, it's not a magic bullet that solves the memory bump caused by
the recent QBVH changes.
For that we'll likely need to decouple storage for leaf and inner nodes.
However, it's not really clear for now whether that is important, since it
would still be just a fraction of the memory compared to all the hi-res
textures.
The title says it all; quite a straightforward implementation.
I'd only mention that there's a bit of code duplication around packing nodes
into pack.nodes. Trying to de-duplicate it ends up in quite hairy code (like
functions with loads of arguments, some of which could be NULL in certain
circumstances, etc.). Leaving this duplication to be solved later.
Before, all the nodes were counted and allocated up front, leading to
situations where a bunch of allocated memory was not used because a
reasonable number of nodes were simply ignored.
The issue was noticed with gcc-4.7 (used by the release build environment),
which didn't generate optimal enough code for the BVH references swap. It
seems it looked up the assignment operator for each of the reference
structure's members even though nothing special was required for the
assignment.
Forcing the compiler to use a simple memory copy gives a speedup of around
2-3 times.
The issue doesn't happen with OSX's Clang or the newer gcc-4.9, but since
we're going to stick with gcc-4.7 for official releases for quite some time
still, it's nice to have performance issues resolved for all the compilers.
Didn't put the code into an #ifdef, so if in the future some issue appears
with alignment or with an assignment which needs to happen as an operator, we
notice it earlier.
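The workaround boils down to a swap along the following lines (a sketch; the
real helper may differ in details):

  #include <cstring>

  /* Swap two structures by copying raw bytes through a temporary buffer, so
   * the compiler emits plain memory copies instead of member-wise assignment
   * operators. Only valid for trivially copyable types such as the BVH
   * reference structure. */
  template<typename T>
  static inline void memory_swap(T *a, T *b)
  {
    char tmp[sizeof(T)];
    memcpy(tmp, a, sizeof(T));
    memcpy(a, b, sizeof(T));
    memcpy(b, tmp, sizeof(T));
  }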