The issue was caused by the reshuffle needed to make objects flags have proper
object's bounding box to solve regressions in SSS objects intersecting volumes.
There's actually a feedback loop happening here, which is now solved in quite
naive way -- for the true displacement we consider all objects are capable of
intersecting volumes, synchronize object flags prior to displacement shader
tasks runs and then re-update object flags for proper bounding box.
Not sure what will be the proper solution here, we can't do preliminary check
of intersection for displacement shader, but on the other hand we don't really
need this flag for displacement shader anyway.
This will help figuring out cases when node was not properly handled by the SVM
by aborting execution on CPU, where all the nodes are expected to be supported.
This commits finishes initial selective nodes compilation into kernel, which
helps a lot performance-wise for AMD OpenCL kernels.
Split by node groups is based on statistics from simple scenes like BMW and
more complex scenes like mango and gooseberry production files. Further
tweaks are always possible, but it should be a good starting point.
TODO: Still need to ignore unused nodes when calculating requested shader
features.
For now it's reported to the stdout, matching to the CUDA behavior.
In the future we can hide this into GLog logging once the kernels
are considered all stable and so.
Like Camera Motion, only available in the Experimental kernel.
This should be it for the upcoming release, we now support almost everything, apart from Transparent Shadows, SSS and Volume.
Same as last commit, code is unused and this one actually would have required some fixes,
as these variants output values outside the 0-1 value range, which doesn't fit Cycles shader design.
Let's finally delete this code, after 4 years of being unused,
there really is no excuse anymore.
If we decide to extend the procedural textures in SVM, we can do this anytime in the future.
This commit re-shuffles code in split kernel once again and makes it so common
parts which is in the headers is only responsible to making all the work needed
for specified ray index. Getting ray index, checking for it's validity and
enqueuing tasks are now happening in the device specified part of the kernel.
This actually makes sense because enqueuing is indeed device-specified and i.e.
with CUDA we'll want to enqueue kernels from kernel and avoid CPU roundtrip.
TODO:
- Kernel comments are still placed in the common header files, but since queue
related stuff is not passed to those functions those comments might need to
be split as well.
Just currently read them considering that they're also covering the way how
all devices are invoking the common code path.
- Arguments might need to be wrapped into KernelGlobals, so we don't ened to
pass all them around as function arguments.
It was kept disabled due to render artifacts which weer in fact caused by bad
memory access, which is fixed in the previous commit.
We now also can make it enabled in regular AMD split kernel after someone tests
the updated code.
The code was failing to compile on runtime because of some path differences,
and it seems we don't need to specify full path to the file which originally
seemed to be needed to make include directives expansion working correct.
This was broken after the kernel file restructure.
Variables allocated in the __local address space can only be defined
inside a __kernel function.
We probably need to solve this a bit differently once we do the CUDA
kernel split, but this fix shoud be good enough until then.
Since the kernel split work we're now having quite a few of new files, majority
of which are related on the kernel entry points. Keeping those files in the
root kernel folder will eventually make it really hard to follow which files are
actual implementation of Cycles kernel.
Those files are now moved to kernel/kernels/<device_type>. This way adding extra
entry points will be less noisy. It is also nice to have all device-specific
files grouped together.
Another change is in the way how split kernel invokes logic. Previously all the
logic was implemented directly in the .cl files, which makes it a bit tricky to
re-use the logic across other devices. Since we'll likely be looking into doing
same split work for CUDA devices eventually it makes sense to move logic from
.cl files to header files. Those files are stored in kernel/split. This does not
mean the header files will not give error messages when tried to be included
from other devices and their arguments will likely be changed, but having such
separation is a good start anyway.
There should be no functional changes.
Reviewers: juicyfruit, dingto
Differential Revision: https://developer.blender.org/D1314
They were lost during simplification of kernel loading but might be rather
crucial for the performance.
Also made it so cflags are shared across kernels. Surely it might lead to
some unwanted kernel re-compilation but at the same time they might easily
run out of sync with the changes in kernel and so.
In certain configurations (for example when start resolution is set to small
value for background render and progressive refine enabled) number of tiles
might change in the tile manager. This situation will confuse progressive
refine feature and likely cause crash.
We might also add some settings verification in the session constructor, but
having an assert with brief explanation about what's wrong should already be
much better than nothing.
For animations, you often want an animated render seed (noise pattern).
This could be done by e.g. setting a driver on the seed value.
Now it's a little checkbox, that can be enabled.
The animated seed is based on the current Blender frame and
the seed value itself. Simply enabling it, will already result in an animated
seed (different on each Blender frame), but it can be randomized further
by setting a different seed value.
Disabled per default, so no backward compatibility break.
Differential Revision: https://developer.blender.org/D1285
Experimental feature set id currently unavailable for megakernel, it'll
require some changes to the cache system to distinguish cached regular
kernels from cached experimental kernels.
Currently unused, but some features will be enabled soon.
Driver fails to compile kernel in reasonable time for those devices here,
so for easier testing of the OpenCL split kernel work disabling bake kernel
for now.
This required allocating some memory related on object transform needed
by ShaderData and currently it is done for all the platforms. Since we're
targeting full feature-complete platforms this is rather acceptable at
this point and in the future we'll do selective NO_HAIR/NO_SSS/NO_BLUR
kernels.
This is experimental still and in fact there're some major issues on
NVidia platform and it's not really clear if it's a bug in compiler,
some uninitizlied variable or other kind of issue.
Some stupid fixes like spaces around operator and missing semicolon,
plus fix for wrong detecting of ShaderData SOA size. Thar was harmless
since there's only one closure array, but still better to fix this.
This file was actually checking for features enabled on CPU and surely all
of them were enabled, so removing them does not cause any difference.
ideally we'll need to do runtime feature detection and just pass some stuff
as NULL to the kernel, or maybe also have variadic kernel entry points which
is also possible quite easily.
It's good for testing and seems to work quite reliably here.
This probably not totally cheap in terms of performance, but this we
could solve quite easily by selective kernel compilation once other
things are tested/proved to be reliable.
No need to store them in the class, they're unlikely to be changed
and if they do change we're in big trouble anyway.
More appropriate approach would be then to typedef this things in
kernel_types.h, but still use inlined sizeof(),
This makes OCIO viewport color correction a little bit faster (about -0.5s for 100 samples)
Also set max half float value to 65504.0 to conform with IEEE 754.
Only those ones are priority for now, all the rest are still testable
if CYCLES_OPENCL_TEST or CYCLES_OPENCL_SPLIT_KERNEL_TEST environment
variables are set.
This commit contains all the work related on the AMD megakernel split work
which was mainly done by Varun Sundar, George Kyriazis and Lenny Wang, plus
some help from Sergey Sharybin, Martijn Berger, Thomas Dinges and likely
someone else which we're forgetting to mention.
Currently only AMD cards are enabled for the new split kernel, but it is
possible to force split opencl kernel to be used by setting the following
environment variable: CYCLES_OPENCL_SPLIT_KERNEL_TEST=1.
Not all the features are supported yet, and that being said no motion blur,
camera blur, SSS and volumetrics for now. Also transparent shadows are
disabled on AMD device because of some compiler bug.
This kernel is also only implements regular path tracing and supporting
branched one will take a bit. Branched path tracing is exposed to the
interface still, which is a bit misleading and will be hidden there soon.
More feature will be enabled once they're ported to the split kernel and
tested.
Neither regular CPU nor CUDA has any difference, they're generating the
same exact code, which means no regressions/improvements there.
Based on the research paper:
https://research.nvidia.com/sites/default/files/publications/laine2013hpg_paper.pdf
Here's the documentation:
https://docs.google.com/document/d/1LuXW-CV-sVJkQaEGZlMJ86jZ8FmoPfecaMdR-oiWbUY/edit
Design discussion of the patch:
https://developer.blender.org/T44197
Differential Revision: https://developer.blender.org/D1200
The goal is to be able to compile kernel with nodes which are actually needed
to render current scene, hence improving performance of the kernel,
The idea is:
- Have few node groups, starting with a group which contains nodes are used
really often, and then couple of groups which will be extension of this one.
- Have feature-based nodes disabling, so it's possible to disable nodes related
to features which are not used with the currently used nodes group.
This commit only lays down needed routines for this approach, actual split will
happen later after gathering statistics from bunch of production scenes.
This will be used by split kernel in order to compile most optimal kernel.
Maximum number of closures is actually being cached in the session, so viewport
rendering will not trigger kernel re-loading when number of closures goes down.
This is currently unused but crucial for things like calculating amount of
device memory required to deal with the tasks.
Maybe not really best place to store it, but consider it good enough for now.
Previously we only had experimental flag passed to device's load_kernel() which
was all fine. But since we're gonna to have some extra parameters passed there
it makes sense to wrap them into a single struct, which will make it easier to
pass stuff around.
This header was already included into some of the implementation files already,
and this change is needed for some upcoming changes in the way how kernel_types.h
works.
Previously it was only set at compilation time which is all fine but does
not let us to check which closure the node corresponds to prior to the
compilation.
Now we calculate color in range 800..12000 using an approximation a/x+bx+c for R and G and ((at + b)t + c)t + d) for B.
Max absolute error for RGB for non-lut function is less than 0.0001, which is enough to get the same 8 bit/channel color as for OSL with a noticeable performance difference.
However there is a slight visible difference between previous non-OSL implementation because of lookup table interpolation and offset-by-one mistake.
The previous implementation gave black color outside of soft range (t > 12000), now it gives the same color as for 12000.
Also blackbody node without input connected is being converted to value input at shader compile time.
Reviewers: dingto, sergey
Reviewed By: dingto
Subscribers: nutel, brecht, juicyfruit
Differential Revision: https://developer.blender.org/D1280
This way it is possible to have viewport simplification bumped all the way up,
making viewport really responsive but still have final render to use highest
subdivision possible.
Reviewers: lukastoenne, campbellbarton, dingto
Reviewed By: campbellbarton, dingto
Subscribers: dingto, nutel, eyecandy, venomgfx
Differential Revision: https://developer.blender.org/D1273
This replaces sequential ray moving followed with scene intersection with
single BVH traversal, which gives us all possible intersections.
Only implemented for CPU, due to qsort and a bigger memory usage on GPU
which we rather avoid. GPU still uses the regular bvh volume intersection code, while CPU now uses the new code.
This improves render performance for scenes with:
a) Camera inside volume mesh
b) SSS mesh intersecting a volume mesh/domain
In simple volume files (not much geometry) performance is roughly the same
(slightly faster). In files with a lot of geometry, the performance
increase is larger. bmps.blend with a volume shader and camera inside the
mesh, it renders ~10% faster here.
Patch by Sergey and myself.
Differential Revision: https://developer.blender.org/D1264
Object flags are depending on bounding box which is only available after
mesh synchronization.
This was broken since 7fd4c44 which happened quite close to the release
and oddly enough was not sopped by anyone. Render test is coming for this.
Was spotted by Thomas Dinges while working on another patch.
This patch adds support for light portals: objects that help sampling the
environment light, therefore improving convergence. Using them tor other
lights in a unidirectional pathtracer is virtually useless.
The sampling is done with the area-preserving code already used for area lamps.
MIS is used both for combination of different portals and for combining portal-
and envmap-sampling.
The direction of portals is considered, they aren't used if the sampling point
is behind them.
Reviewers: sergey, dingto, #cycles
Reviewed By: dingto, #cycles
Subscribers: Lapineige, nutel, jtheninja, dsisco11, januz, vitorbalbio, candreacchio, TARDISMaker, lichtwerk, ace_dragon, marcog, mib2berlin, Tunge, lopataasdf, lordodin, sergey, dingto
Differential Revision: https://developer.blender.org/D1133
This more a workaround for CUDA optimizer which can't optimize clamp(x, 0, 1)
into a single instruction and uses 4 instructions instead.
Original patch by @lockal with own modification:
Don't make changes outside of the kernel. They don't make any difference
anyway and term saturate() has a bit different meaning outside of kernel.
This gives around 2% of speedup in Barcelona file, but in more complex shader
setups with lots of math nodes with clamping speedup could be much nicer.
Subscribers: dingto
Projects: #cycles
Differential Revision: https://developer.blender.org/D1224
This way we can get rid of inefficient memory usage caused by BVH boundbox
part being unused by leaf nodes but still being allocated for them. Doing
such split allows to save 6 of float4 values for QBVH per leaf node and 3
of float4 values for regular BVH per leaf node.
This translates into following memory save using 01.01.01.G rendered
without hair:
Device memory size Device memory peak Global memory peak
Before the patch: 4957 5051 7668
With the patch: 4467 4562 7332
The measurements are done against current master. Still need to run speed tests
and it's hard to predict if it's faster or not: on the one hand leaf nodes are
now much more coherent in cache, on the other hand they're not so much coherent
with regular nodes anymore.
Reviewers: brecht, juicyfruit
Subscribers: venomgfx, eyecandy
Differential Revision: https://developer.blender.org/D1236
This way memory overhead caused by the BVH building is not so visible and peak
memory usage will be reduced.
Implementing this idea is not so straightforward actually, because we need to
synchronize images used for true displacement before meshes. Detecting whether
image is used for true displacement is not so striaghtforward, so for now all
all displacement types will synchronize images used for them.
Such change brings memory usage from 4.1G to 4.0G with the 01_01_01_D scene
from gooseberry. With 01_01_01_G scene it's 7.6G vs. 6.8G (before and after
the patch).
Reviewers: campbellbarton, juicyfruit, brecht
Subscribers: eyecandy
Differential Revision: https://developer.blender.org/D1217
Combine all the highpoly pixel arrays into a single array with a lookup
object_id for each of the highpoly objects.
Note: This changes the Bake API, external engines should refer to the
bake_api.c for the latest API.
Many thanks for Sergey Sharybin for the complete review, changes
suggestion and feedback. (you rock!)
Reviewers: sergey
Subscribers: pildanovak, marcclintdion, monio, metalliandy, brecht
Maniphest Tasks: T41092
Differential Revision: https://developer.blender.org/D772
Issue was caused by MSVC not being able to optimize some code out in the same
way as GCC/Clang does, so now that parts of code are explicitly unfolded in
order to help compilers out.
This makes speed loss much less drastic on my laptop. That's probably as good
as we can do with MSVC without investing infinite amount of time looking trying
to workaround the optimizer.
Originally we thought it's needed in order to distinguish builtin file from
filename which starts with '@', but the filepath is actually full path there
and it's unlikely to have file system where '@' is a proper root character.
Surprisingly this does not give visible speed differences, but it's still
nice to get rid of redundant check.
There were two major problems with the interactivity of material previews:
- Beckmann tables were re-generated on every material tweak.
This is because preview scene is not set to be persistent, so re-triggering
the render leads to the full scene re-sync.
- Images could take rather noticeable time to load with OIIO from the disk
on every tweak.
This patch addressed this two issues in the following way:
- Beckmann tables are now static on CPU memory.
They're couple of hundred kilobytes only, so wouldn't expect this to be
an issue. And they're needed for almost every render anyway.
This actually also makes blackbody table to be static, but it's even smaller
than beckmann table.
Not totally happy with this approach, but others seems to complicate things
quite a bit with all this render engine life time and so..
- For preview rendering all images are considered to be built-in. This means
instead of OIIO which re-loads images on every re-render they're coming
from ImBuf cache which is fully manageable from blender side and unused
images gets freed later.
This would make it impossible to have mipmapping with OSL for now, but we'll
be working on that later anyway and don't think mipmaps are really so crucial
for the material preview.
This seems to be a better alternative to making preview scene persistent,
because of much optimal memory control from blender side.
Reviewers: brecht, juicyfruit, campbellbarton, dingto
Subscribers: eyecandy, venomgfx
Differential Revision: https://developer.blender.org/D1132
Official Documentation:
http://www.blender.org/manual/render/workflows/multiview.html
Implemented Features
====================
Builtin Stereo Camera
* Convergence Mode
* Interocular Distance
* Convergence Distance
* Pivot Mode
Viewport
* Cameras
* Plane
* Volume
Compositor
* View Switch Node
* Image Node Multi-View OpenEXR support
Sequencer
* Image/Movie Strips 'Use Multiview'
UV/Image Editor
* Option to see Multi-View images in Stereo-3D or its individual images
* Save/Open Multi-View (OpenEXR, Stereo3D, individual views) images
I/O
* Save/Open Multi-View (OpenEXR, Stereo3D, individual views) images
Scene Render Views
* Ability to have an arbitrary number of views in the scene
Missing Bits
============
First rule of Multi-View bug report: If something is not working as it should *when Views is off* this is a severe bug, do mention this in the report.
Second rule is, if something works *when Views is off* but doesn't (or crashes) when *Views is on*, this is a important bug. Do mention this in the report.
Everything else is likely small todos, and may wait until we are sure none of the above is happening.
Apart from that there are those known issues:
* Compositor Image Node poorly working for Multi-View OpenEXR
(this was working prefectly before the 'Use Multi-View' functionality)
* Selecting camera from Multi-View when looking from camera is problematic
* Animation Playback (ctrl+F11) doesn't support stereo formats
* Wrong filepath when trying to play back animated scene
* Viewport Rendering doesn't support Multi-View
* Overscan Rendering
* Fullscreen display modes need to warn the user
* Object copy should be aware of views suffix
Acknowledgments
===============
* Francesco Siddi for the help with the original feature specs and design
* Brecht Van Lommel for the original review of the code and design early on
* Blender Foundation for the Development Fund to support the project wrap up
Final patch reviewers:
* Antony Riakiotakis (psy-fi)
* Campbell Barton (ideasman42)
* Julian Eisel (Severin)
* Sergey Sharybin (nazgul)
* Thomas Dinged (dingto)
Code contributors of the original branch in github:
* Alexey Akishin
* Gabriel Caraballo
It is still possible to free a bit more memory by detecting buildin images
which are not used by shaders, but that's not going to improve memory usage
that much to bother about this now.
Such change brings peak memory usage from 4.1GB to 3.4GB when rendering
01_01_01_D layout scene from the Gooseberry project. Mainly because of
freeing memory used by rather huge environment map in the viewport.
Reviewers: campbellbarton, juicyfruit
Subscribers: eyecandy
Differential Revision: https://developer.blender.org/D1215
This attribute is not really supported for volumes, so it get's converted to
constant 0 at shader compile time.
TODO: We should consider doing the same for tangent attribute in order to save
some annoying checks at tracing time.
Our own implementation is in fact the same performance as in fast_math from
OpenShadingLanguage, but implementation from fast_math is using explicit madd
function, which increases chance of compiler deciding to use intrinsics.
This patch is based on some work done in D788 and re-formulation from Beckmann
implementation in OpenShadingLanguage.
Skipping texture lookup helps a lot on GPUs where it's more expensive to access
texture memory than to do some extra calculation in threads.
CPU code still uses lookup-table based approach since this seems to be still
faster (at least on computers i've got access to).
This change gives about 2% speedup on BMW scene with GTX560TI.
It did not preserve stratification too well and lookup-table approach was
working much better. There are now also some more interesting forumlation
from Wenzel and OpenShadingLanguage which should work better than old code.
This argument was unused and got nicely optimized out. But once it
starts to be using registers are getting stressed really crazy,
causing slow down of render.
It's not that bad because this typo could only caused not really
efficient BVH traversal, causing higher render times. Not as if
it was causing render artifacts.
It was an issue with what bounds to use for BVH node during construction.
Also corrected case when there are all 4 primitive types in the range and
also there're objects in the same range.
This inconsistency drove me totally crazy, it's really confusing
when it's inconsistent especially when you work on both Cycles and
Blender sides.
Shouldn;t cause merge PITA, it's whitespace changes only, Git should
be able to merge it nicely.
Issue was caused by accident in c8a9a56 which not only disabled glossy
reflection if Glossy visibility is disabled, but also Diffuse reflection.
Quite safe and should go to final release branch.
Issue was caused by cycles in shader graph confusing it's
simplification stage. Now we're ignoring links which are
marked as invalid from blender side so we don't run into
such cycles and keep graph code simple.
Issue was introduced in 01ee21f where i didn't notice *_setup()
function only doing partial initialization, and some of parameters
are expected to be initialized by callee function.
This was hitting only some setups, so tests with benchmark scenes
didn't unleash issues. Now it should all be fine.
This is to go to the 2.74 branch and we actually might re-AHOY.
Fix T44007: Cycles Volumetrics: block artifacts with overlapping volumes
The issue was caused by uninitialized parameters of some closures, which
lead to unpredictable behavior of shader_merge_closures().
A new checkbox "High quality" is provided in camera settings to enable
this. This creates a depth of field that is much closer to the rendered
result and even supports aperture blades in the effect, but it's more
expensive too. There are optimizations to do here since the technique is
very fill rate heavy.
People, be careful, this -can- lock up your screen if depth of field
blurring is too extreme.
Technical details:
This uses geometry shaders + instancing and is an adaptation of
techniques gathered from
http://bartwronski.com/2014/04/07/bokeh-depth-of-field-going-insane-http://advances.realtimerendering.com/s2011/SousaSchulzKazyan%20-
%20in%20Real-Time%20Rendering%20Course).ppt
TODOs:
* Support dithering to minimize banding.
* Optimize fill rate in geometry shader.
Bump result was passed to set_normal node and then set_node was connected
to all unconnected Normal inputs, including the one from original Bump
node, causing cycles.
The purpose of this change is to add extra possibility to render engines and
export scripts to reduce peak memory footprint during their operation.
This new argument should be used with care since it'll leave mesh in not really
compatible with blender format, but it's ok to be used on temp meshes.
Unfortunately, it's hard to get scene where it'll show huge benefit because
in my tests with cycles peak memory is reached in MEM_printmemlist_stats().
However, in the file with sintel dragon it gives around 1gig of memory benefit
after removing the polys which would allow other heavy to compute stuff such as
hair (or even pointiness calculation) to not be a peak memory usage.
In any case, this change is nice to have IMO, and only means more parts of
scene export code should be optimized memory-wise.
Reviewers: campbellbarton
Differential Revision: https://developer.blender.org/D1125
Issue this commit is addressed to is that particle system and particle modifier
will contain caches once derived mesh was requested and this cached data will
never be freed.
This could easily lead to unwanted memory peaks during synchronization stage
of rendering.
The idea is to have RNA function in object which would free caches which can't
be freed otherwise. This function is not intended to deal with derived final
since it might be used by other objects (for example by object with boolean
modifier).
This cache freeing is only happening in the background rendering and locked
interface rendering.
From quick tests with victor file this change reduces peak memory usage by
command line rendering by around 6% (1780MB vs. 1883MB). For rendering from
the interface it's about 12% (1763MB vs. 1998MB).
Reviewers: campbellbarton, lukastoenne
Differential Revision: https://developer.blender.org/D1121
This commit makes some preliminary fixes and tweaks aimed to make blender
compilable with C++11 feature set. This includes:
- Build system attribute to enable C++11 featureset.
It's for sure default OFF, but easy to enable to have a play around with
it and make sure all the stuff is compilable before we go C++11 for real.
- Changes in Compositor to use non-named cl_int structure fields.
This is because __STRICT_ANSI__ is defined by default by GCC and OpenCL
does not use named fields in this case.
- Changes to TYPE_CHECK() related on lack of typeof() in C++11
This uses decltype() instead with some trickery to make sure returned type
is not a reference.
- Changes for auto_ptr in Freestyle
This actually conditionally switches between auto_ptr and unique_ptr since
auto_ptr is deprecated in C++11. Seems to be not strictly needed but still
nice to be ready for such an update anyway/
This all based on changes form depsgraph_refactor branch apart from the weird
changes which were made in order to support MinGW compilation. Those parts of
change would need to be carefully reviewed again after official move to gcc49
in MinGW.
Tested on Linux with GCC-4.7 and Clang-3.5, other platforms are not tested and
likely needs some more tweaks.
Reviewers: campbellbarton, juicyfruit, mont29, lukastoenne, psy-fi, kjym3
Differential Revision: https://developer.blender.org/D1089
The fix was really flacky, in terms during speed benchmarks i had
abort() in the fallback block to be sure it never runs in production
scenes, but that affected on the optimization as well. Without this
abort there's quite bad slowdown of 5-7% on the renders even tho
the Pleucker fallback was never run.
This is all weird and for now reverting the change which affects on
all the production scenes and will look into alternative fixes for
the original issue with precision loss on huge planes.
This reverts commit 9489205c5c0b9b432d02be4a3d0d15fc62ee6cb9.
This merges consecutive empty steps in the decoupled record function,
which can lead to fewer iterations in the scatter functions.
Only helps slightly though (1%), but doesn't hurt to have this.
Differential Revision: https://developer.blender.org/D873
Use multiple threads for building the MIS table, if the
resolution is higher than 512.
Also replace division by cdf_total, with a inverse multiplication by
cdf_total_inv. This gives further speedup.
On my Macbook (8 CPU threads) this improves the time to build the table:
Resolution 4096: From 0.16s to 0.03s
Resolution 8096: From 0.61s to 0.11s
This especially helps to reduce the scene update time, when tweaking world
shader while viewport rendering is running.
Patch by Sergey and myself.
Differential Revision: https://developer.blender.org/D1159
The issue was caused by mismatch in how aligned triangles storage was
filled in during BVH construction and how it was used during rendering.
Basically, i was leaving uninitialized storage for triangles when
there was deformation motion blur detected for the mesh. Was likely
some sort of optimization, but in fact it's still possible that regular
triangles would be needed for rendering.
So now we're storing aligned storage for all triangle primitives and
only skipping motion triangles (the deformation motion blur flag from
mesh is now ignored).
The issue was caused by numerical instability whrn having ray origin close to a huge
triangle, which could have aused bad ray distance check.
Watertight Woop intersection isn't really addressing such cases, it's dealing with
small triangles far away from the ray origin instead, so it's a bit tricky yo make
it working reliably.
While we're quite close to the release it's safer to do check in Pleaucker coordinates
if ray close to a huge triangle. Likely this additional check combined with some other
tweaks to the code doesn't cause measurable slowdown in the scenes tested here.
After the release we can play a bit more with this code in order to make it more
stable without Pleucker fallback.
Seems it's just another issue with the compiler, worked around by explicitly
telling not to inline some function.
In theory we can unify this with CPU, but we're quite close to the release
so better be safe than sorry.