This saves a memory copy, which will be particularly useful for network rendering.
Reviewers: sergey, brecht, dingto, juicyfruit, maiself
Differential Revision: https://developer.blender.org/D2323
In scenes with many lights, some of them might have a very small contribution to some pixels, but the shadow rays are traced anyway.
To avoid that, this patch adds probabilistic termination to light samples: if the contribution before checking for shadowing is below a user-defined threshold, the sample is discarded with probability (1 - (contribution / threshold)) and otherwise kept, but weighted more to remain unbiased.
This is the same approach that's also used in path termination based on length.
Note that the rendering remains unbiased with this option, it just adds a bit of noise - but if the setting is used moderately, the speedup gained easily outweighs the additional noise.
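A rough sketch of the termination rule described above (illustrative only, not the actual kernel code; names and signature are made up):
```
/* Decide whether a light sample can be skipped before tracing its shadow ray.
 * Kept samples are reweighted by the inverse keep-probability so the
 * estimator stays unbiased, like path termination based on length. */
static bool terminate_light_sample(float contribution, float threshold,
                                   float rand, float *weight)
{
  *weight = 1.0f;
  if (threshold > 0.0f && contribution < threshold) {
    const float probability = contribution / threshold;
    if (rand >= probability) {
      return true; /* Discard: no shadow ray is traced. */
    }
    *weight = 1.0f / probability; /* Keep, but weight up. */
  }
  return false;
}
```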
Reviewers: #cycles
Subscribers: sergey, brecht
Differential Revision: https://developer.blender.org/D2217
When using the Normal output of the Texture Coordinate node on Point and Spot lamps, the coordinates now depend on the rotation of the lamp.
On Area lamps, the Parametric output of the Geometry node now returns UV coordinates on the area lamp.
Credit for the Area lamp part goes to Stefan Werner (from D1995).
Oh man, is it a compiler bug? Is it something stupid we do?
For now, more crap to prevent crashes. During the conference I will talk to
Maxym about how we can troubleshoot such weird issues.
Basically, don't use rcp() in areas which seem to be critical after a
second look. Also disabled some multiplication operators, not sure
yet why they might be a problem.
Tomorrow I will be setting up a full test with all the cases which were
buggy in our farm to see if this fix is complete.
There are some precision issues for big magnitude coordinates which started
to give weird behavior in release builds, and some weird memory usage in the BVH
which is tricky to nail down because it only happens in release builds and GDB
reports all variables as optimized out when trying to use RelWithDebInfo.
There are two things in this commit:
- Attempt to make the vectorized code closer to the original one, hoping that it'll
eliminate the precision issue.
This seems to work for transform_point().
- A similar trick did not work for transform_direction(), even though the absolute
error there is much smaller. For now that function is disabled, it needs a more
careful look.
Several ideas here:
- Optimize calculation of near_{x,y,z} in a way that does not require
3 if() statements per update, which avoids the negative effect of wrong
branch prediction (see the sketch after this list).
- Optimization of direction clamping for BVH.
- Optimization of point/direction transform.
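As a minimal sketch of the first idea (illustrative only; the actual traversal code and node layout differ), the near/far plane indices can be derived from the sign bits of the inverse direction instead of three per-axis if() statements:
```
#include <cstdint>
#include <cstring>

static inline uint32_t float_as_uint(float f)
{
  uint32_t u;
  memcpy(&u, &f, sizeof(u));
  return u;
}

/* Assumes node bounds are stored as {lo.x, hi.x, lo.y, hi.y, lo.z, hi.z}.
 * The sign bit of the inverse direction is 1 for negative directions and 0
 * otherwise, which directly selects the near plane per axis. */
static inline void near_far_indices(float idir_x, float idir_y, float idir_z,
                                    int near_idx[3], int far_idx[3])
{
  const int sx = (int)(float_as_uint(idir_x) >> 31);
  const int sy = (int)(float_as_uint(idir_y) >> 31);
  const int sz = (int)(float_as_uint(idir_z) >> 31);

  near_idx[0] = 0 + sx;  far_idx[0] = 1 - sx;
  near_idx[1] = 2 + sy;  far_idx[1] = 3 - sy;
  near_idx[2] = 4 + sz;  far_idx[2] = 5 - sz;
}
```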
Brings ~1.5% speedup again, depending on the scene (unfortunately, this
speedup can't be summed across all previous commits because the speedup of
each of the changes varies from scene to scene, but it still seems to
be a nice solid speedup of a few percent on Linux, and a bigger speedup was
reported on Windows).
Once again, thanks Maxym for the inspiration!
Still TODO: we have multiple places where we need to calculate the near
x,y,z indices in the BVH; for now it's only done for the main BVH traversal.
Will try to move this calculation to a utility function and see if
it can be easily re-used across all the BVH flavors.
The idea here is to avoid if statements which could cause wrong
branch prediction.
Gives a bit of measurable speedup up to ~1%. Still nice :)
Inspired by Maxym Dmytrychenko, thanks!
Initialization order of global stats and node types was not strictly
defined and it was possible to have node types initialized first and
stats after that. That would zero out memory which was already allocated from
the statistics, causing an assert failure when de-initializing node types.
This was giving some speedup, but made intersection tests fail
from a watertightness point of view.
Needs deeper investigation, but we need to quickly get it fixed for
the studio.
This gives about 5% speedup on AVX2 kernels (other kernels still
have SSE disabled for math operations) and solves the slowdown
of the koro scene mentioned in the previous commit.
The title says it all, actually. This commit also contains
changes to pass float3 as a const reference in the affected functions.
This should make MSVC happier without breaking OpenCL because it's
only done in areas which are ifdef-ed for non-OpenCL.
Another patch based on inspiration from Maxym Dmytrychenko, thanks!
Based on the existing ssef data type, and to my knowledge it's also what happens in
Embree nowadays.
Inspired by Maxym Dmytrychenko and required for the upcoming triangle
intersection commit.
Hopefully the copyright message is correct.
Mostly this is making inlining match CUDA 7.5 in a few performance critical
places. The end result is that performance is now better than before, possibly
due to less register spilling or other CUDA 8.0 compiler improvements.
On benchmark scenes, there are 3% to 35% render time reductions. Stack memory
usage is reduced a little too.
Reviewed By: sergey
Differential Revision: https://developer.blender.org/D2269
Previously it was falling back to just the path after the #include
statement was finished. Now we fall back to the proper current
file name after dealing with the preprocessor statement.
Basically just moves cached kernels from ~/.config/blender/BLENDER_VERSION to
~/.cache/cycles/kernels. This has the following benefits:
- Follows the XDG specification more closely;
not that it's totally crucial or measurable by users, but still nice.
- Prevents unexpected sizes of the config folder, making disk space usage more
predictable for users.
- Allows sharing kernels across multiple Blender versions,
which makes debugging easier at times close to release.
- The "Copy Previous Settings" operator will no longer copy possibly
gigabytes of cached kernels, which used to lead to really nasty disk usage
and annoying delays when copying settings.
- In the future we can have some smart logic to clear old unused cached
kernels.
Currently this is only done for Linux and OSX. Windows still follows the old "cache"
folder logic, but that's not really important for now because we don't
support kernel compilation on that platform yet.
Reviewers: dingto, juicyfruit, brecht
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2197
Weirdly enough, this version of XCode seems to have static_assert()
even when NOT using C++11. This is totally weird and counter-intuitive
since static_assert() is supposed to be a C++11-only feature.
Can XCode stop using the future, please? :)
This way OpenCL devices can also benefit from a smaller memory footprint when using e.g. bump maps (greyscale, 1 channel).
Additional target for my GSoC 2016.
Now we have the 4 component ones first (float4, byte4, half4) followed by the 1 component ones (float, byte, half).
Makes code a bit more consistent and also reduces code a bit when enabling half support on GPU in next commit.
This also exposed a typo in half CPU images for 3D textures, which wasn't used yet, but good to have that one fixed anyway.
Enables Catmull-Clark subdivision meshes with support for creases and attribute
subdivision. Still waiting on OpenSubdiv to fully support face-varying
interpolation for subdividing UV coordinates, though. Also there may be some
inconsistencies with Blender's subdivision which will be resolved at a
later time.
Code for reading patch tables and creating patch maps is borrowed
from OpenSubdiv.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2111
Currently cycles cannot correctly render motion blur for objects that appear or
disappear during the shutter window. Until that can be fixed properly, it may be
better to hide such particles rather than let them render as if they were
stationary for half of the frame.
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2125
All the changes are mainly giving explicit hints on inlining functions,
so they match how inlining worked with the previous toolkit.
This makes kernels compiled by CUDA 8 render on average with the same speed
as the previous kernels. Some scenes are somewhat faster, some of them are
somewhat slower, but the slowdown is within 1% so far.
On the positive side, it allows us to enable newer generation cards on
the buildbots (so GTX 10x0 will be officially supported soon).
This adds support for ngons and attributes on subdivision meshes. Ngons are
needed for proper attribute interpolation as well as correct Catmull-Clark
subdivision. Several changes are made to achieve this:
- new primitive `SubdFace` added to `Mesh`
- 3 more textures are used to store info on patches from subd meshes
- Blender export uses loop interface instead of tessface for subd meshes
- `Attribute` class is updated with a simplified way to pass primitive counts
around and to support ngons.
- extra points for ngons are generated for O(1) attribute interpolation
- curves are temporarily disabled on subd meshes to avoid various bugs in the
implementation
- old unneeded code is removed from `subd/`
- various fixes and improvements
Reviewed By: brecht
Differential Revision: https://developer.blender.org/D2108
- In fresnel_dielectric, the differentials calculation sometimes divided by zero.
- When the normal map was (0.5, 0.5, 0.5), the code would try to normalize a zero vector. Now, it just uses the regular normal as a fallback.
- The approximate error function used in Beckmann sampling sometimes overflowed to inf while calculating r^16. The final value is 1 - 1/r^16, however,
so now it just returns 1 if the computation would overflow otherwise.
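A conceptual sketch of the last fix (illustrative only; the real approximation and names differ), assuming the expression ends in 1 - 1/r^16:
```
/* Once r is large enough that r^16 would overflow a float (roughly r > 256,
 * since FLT_MAX^(1/16) is about 256), the result is indistinguishable from
 * 1.0 and can be returned directly instead of producing inf/NaN. */
static float approx_erf_tail(float r)
{
  if (r >= 256.0f) {
    return 1.0f;
  }
  r *= r; /* r^2  */
  r *= r; /* r^4  */
  r *= r; /* r^8  */
  r *= r; /* r^16 */
  return 1.0f - 1.0f / r;
}
```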
This way restrict can be used for CUDA and OpenCL as well.
From quick tests in the areas I've been testing, this might give some
barely measurable percentage of speedup, but it increases register pressure.
So use of this qualifier is still really limited.
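A hypothetical illustration of how such a qualifier can be abstracted per backend (the macro name and exact conditions are made up; the real ones in the kernel headers may differ):
```
#if defined(__CUDACC__)
#  define KERNEL_RESTRICT __restrict__
#elif defined(__OPENCL_VERSION__)
#  define KERNEL_RESTRICT restrict
#elif defined(_MSC_VER)
#  define KERNEL_RESTRICT __restrict
#else
#  define KERNEL_RESTRICT __restrict__
#endif

/* Promising the compiler that dst and src never alias lets it keep values in
 * registers instead of re-loading them, at the cost of register pressure. */
static void scale_copy(float *KERNEL_RESTRICT dst,
                       const float *KERNEL_RESTRICT src,
                       float factor, int n)
{
  for (int i = 0; i < n; i++) {
    dst[i] = src[i] * factor;
  }
}
```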
This is a special builder type which is allowed to orient nodes to
strands direction, hence minimizing their surface area in comparison
with axis-aligned nodes. Such nodes are much more efficient for hair
rendering.
The implementation of the BVH builder is based on Embree, and the general idea
there is to calculate both the axis-aligned SAH and the oriented SAH; if the SAH
of the oriented node is smaller than the axis-aligned SAH, we create an unaligned
node.
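A minimal sketch of that decision (illustrative names and cost factor, not the builder code), assuming the oriented bound is expressed in the strand's local frame so its extents are tighter for long thin hair:
```
struct Extent {
  float x, y, z;
};

/* Half of the surface area of a box with the given extents. */
static float half_area(const Extent &e)
{
  return e.x * e.y + e.y * e.z + e.z * e.x;
}

/* Oriented nodes are a bit more expensive to traverse, so only prefer them
 * when their SAH cost is clearly below the axis-aligned one. */
static bool use_unaligned_node(const Extent &aligned,
                               const Extent &oriented,
                               float unaligned_cost_factor)
{
  return half_area(oriented) * unaligned_cost_factor < half_area(aligned);
}
```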
We store both aligned and unaligned nodes in the same tree (which
seems to be different from what Embree is doing), so we don't need
any extra calculations to set up the hair ray for BVH
traversal, hence avoiding any possible negative effect of this new
BVH node type.
This new builder is currently not in use; the BVH traversal code still
needs to be made aware of unaligned nodes.
This commit adds a new distribution to the Glossy, Anisotropic and Glass BSDFs that implements the
multiple-scattering microfacet model described in the paper "Multiple-Scattering Microfacet BSDFs with the Smith Model".
Essentially, the improvement is that unlike classical GGX, which only models single scattering and assumes
the contribution of multiple bounces to be zero, this new model performs a random walk on the microsurface until
the ray leaves it again, which ensures perfect energy conservation.
In practice, this means that the "darkening problem" - GGX materials becoming darker with increasing
roughness - is solved in a physically correct and efficient way.
The downside of this model is that it has no (known) analytic expression for evaluation. However, it can be
evaluated stochastically, and although the correct PDF isn't known either, the properties of MIS and the
balance heuristic guarantee an unbiased result at the cost of slightly higher noise.
Reviewers: dingto, #cycles, brecht
Reviewed By: dingto, #cycles, brecht
Subscribers: bliblubli, ace_dragon, gregzaal, brecht, harvester, dingto, marcog, swerner, jtheninja, Blendify, nutel
Differential Revision: https://developer.blender.org/D2002
This is an initial commit for half texture support in Cycles.
It adds the basic infrastructure inside of the ImageManager and support for these textures on CPU.
Supported:
* Half-float OpenEXR images (can be used for e.g. HDRs or normal maps) now use 1/2 the memory when loaded from disk (OIIO).
ToDo:
Various things like support for inbuilt half textures, GPU... will come later, step by step.
Part of my GSoC 2016.
The original quad intersection test works by just testing against the two triangles that define the quad.
However, in this case it's actually faster to use the same test that's also used for portals: determining
the distance to the plane in which the quad lies, calculating the hit point and checking whether it's inside the
quad by projecting onto the sides.
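A standalone sketch of that test (illustrative types and names, not the kernel code), with the quad given by its center C, two edge half-vectors e0/e1 and normal N:
```
#include <cmath>

struct Vec3 {
  float x, y, z;
};

static Vec3 sub(const Vec3 &a, const Vec3 &b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float dot(const Vec3 &a, const Vec3 &b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static bool ray_quad_intersect(const Vec3 &ray_P, const Vec3 &ray_D,
                               const Vec3 &C, const Vec3 &e0, const Vec3 &e1,
                               const Vec3 &N, float *t_out)
{
  /* Distance along the ray to the plane containing the quad. */
  const float denom = dot(ray_D, N);
  if (fabsf(denom) < 1e-8f) {
    return false;
  }
  const float t = dot(sub(C, ray_P), N) / denom;
  if (t < 0.0f) {
    return false;
  }

  /* Project the hit point onto the quad's half-edges and check the extents. */
  const Vec3 hit = {ray_P.x + t * ray_D.x, ray_P.y + t * ray_D.y, ray_P.z + t * ray_D.z};
  const Vec3 d = sub(hit, C);
  if (fabsf(dot(d, e0)) > dot(e0, e0) || fabsf(dot(d, e1)) > dot(e1, e1)) {
    return false;
  }

  *t_out = t;
  return true;
}
```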
Reviewers: brecht, sergey, dingto
Reviewed By: dingto
Differential Revision: https://developer.blender.org/D2045
Currently for Windows only, this is an initial commit towards native
support of NUMA.
The current commit makes it so Cycles will use all logical processors on
Windows when running on a system with more than 64 threads.
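On Windows, machines with more than 64 logical processors expose them through processor groups, so one building block such support typically needs is counting CPUs across all groups (illustrative sketch, not the actual code; Windows 7+ API):
```
#include <windows.h>

/* Count logical processors across all processor groups instead of only the
 * group the process happens to start in. */
static int system_logical_processor_count()
{
  int count = 0;
  const WORD group_count = GetActiveProcessorGroupCount();
  for (WORD group = 0; group < group_count; group++) {
    count += (int)GetActiveProcessorCount(group);
  }
  return count;
}
```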
Reviewers: juicyfruit, dingto, lukasstockner97, maiself, brecht
Subscribers: LazyDodo
Differential Revision: https://developer.blender.org/D2049
Some of these values can get quite large and are hard to read; adding this
makes it easy to read them at a glance.
Reviewed By: sergey
Differential Revision: https://developer.blender.org/D2039
This adds support for CUDA Texture objects (also known as Bindless textures) for Kepler GPUs (Geforce 6xx and above).
This is used for all 2D/3D textures, data still uses arrays as before.
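On the host side, creating such a texture object looks roughly like this (illustrative sketch against the CUDA runtime API, not the actual code in Cycles):
```
#include <cuda_runtime.h>

/* Build a bindless texture object for a float4 image stored in a CUDA array. */
static cudaTextureObject_t create_float4_texture(cudaArray_t array)
{
  cudaResourceDesc res_desc = {};
  res_desc.resType = cudaResourceTypeArray;
  res_desc.res.array.array = array;

  cudaTextureDesc tex_desc = {};
  tex_desc.addressMode[0] = cudaAddressModeWrap;
  tex_desc.addressMode[1] = cudaAddressModeWrap;
  tex_desc.filterMode = cudaFilterModeLinear;
  tex_desc.readMode = cudaReadModeElementType;
  tex_desc.normalizedCoords = 1;

  cudaTextureObject_t tex = 0;
  cudaCreateTextureObject(&tex, &res_desc, &tex_desc, NULL);
  return tex; /* Passed to the kernel and sampled with tex2D<float4>(). */
}
```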
User benefits:
* No more limits of image textures on Kepler.
We had 5 float4 and 145 byte4 slots there before, now we have 1024 float4 and 1024 byte4.
This can be extended further if we need to (just change the define).
* Single channel texture slots (byte and float) are now supported on Kepler as well (1024 slots for each type).
ToDo / Issues:
* 3D textures don't work yet, at least they don't show up during render. I have no idea what's wrong yet.
* Dynamically allocate bindless_mapping array?
I hope Fermi still works fine, but that should be tested on a Fermi card before pushing to master.
Part of my GSoC 2016.
Reviewers: sergey, #cycles, brecht
Subscribers: swerner, jtheninja, brecht, sergey
Differential Revision: https://developer.blender.org/D1999
This way, we also save 3/4th of memory for single channel byte textures (e.g. Bump Maps).
Note: In order for this to work, the texture *must* have 1 channel only.
In Gimp you can e.g. do that via the menu: Image -> Mode -> Grayscale
Until now, single channel textures were packed into a float4, wasting 3 floats per pixel. Memory usage of such textures is now reduced by 3/4.
Voxel attributes such as density, flame and heat benefit from this, as do bump maps with one channel.
This commit also includes some cleanup and code deduplication for image loading.
Example Smoke render from Cosmos Laundromat: http://www.pasteall.org/pic/show.php?id=102972
Memory here went down from ~600MB to ~300MB.
Reviewers: #cycles, brecht
Differential Revision: https://developer.blender.org/D1981
Title says it all, this adds OpenCL float4 texture support.
There is still a bug in the code: I get an "Out of resources" error on NVIDIA hardware here, not sure what's wrong yet.
Will investigate further, but maybe someone else has an idea. :)
Reviewers: #cycles, brecht
Subscribers: brecht, candreacchio
Differential Revision: https://developer.blender.org/D1983
If the CUDA Toolkit is installed and the user is on Linux,
adaptive, feature-based CUDA runtime compilation can now be enabled via:
* Environment flag CYCLES_CUDA_ADAPTIVE_COMPILE or
* Debug menu (Debug value 256) in the Cycles UI.
A couple of issues here:
- There was a bug in heap memory allocation when running out
of the allowed stack memory.
- Debug MSVC was failing because it uses a separate
allocator for some sort of internal proxy thing,
which seems to be unable to use stack memory
because the allocator is created in a non-persistent
stack location.
This is an attempt to gracefully handle out-of-memory events
and stop rendering with an error message instead of a crash.
It uses the bad_alloc exception, and usually I'm not really fond
of exceptions, but for such a limited use, for errors from which
we can't recover, it should be fine.
Ideally we'd need to stop the full Cycles Session, so the viewport
render and persistent images free all the memory, but we can
support that later, since it's mainly a matter of telling
Blender what to do.
General rules are:
- Use as few exception handlers as possible, and try to find the
most generic place to handle them.
For example, ccl::Session.
- Threads need their own handling; an exception trap in one thread
will not catch exceptions from other threads.
That's why the BVH build needs its own handling.
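A minimal sketch of the approach (illustrative only, not the actual Session code): trap std::bad_alloc at one generic place and turn it into an error message instead of a crash.
```
#include <cstdio>
#include <new>

static bool render_with_oom_guard()
{
  try {
    /* ... allocate buffers, build the BVH, render ... */
    return true;
  }
  catch (const std::bad_alloc &) {
    /* Report the error and stop rendering gracefully. */
    fprintf(stderr, "Error: out of memory, stopping render.\n");
    return false;
  }
}
```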
Reviewers: brecht, juicyfruit, dingto, lukasstockner97
Differential Revision: https://developer.blender.org/D1898
This mimics the behavior of the default allocators in the STL and allows all the routines
to catch out-of-memory exceptions and hopefully recover from that situation.
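The idea in a nutshell (illustrative sketch, names made up): the allocator throws std::bad_alloc on failure, just like the default STL allocators, so out-of-memory handling such as the guard sketched above can catch it.
```
#include <cstdlib>
#include <new>

template<typename T> struct ThrowingAllocatorSketch {
  typedef T value_type;

  T *allocate(size_t n)
  {
    void *mem = malloc(n * sizeof(T));
    if (mem == NULL) {
      throw std::bad_alloc(); /* Matches what std::allocator does. */
    }
    return static_cast<T *>(mem);
  }

  void deallocate(T *p, size_t /*n*/)
  {
    free(p);
  }
};
```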
Instead of treating Fermi GPU limits as default,
and overriding them for other devices,
we now nicely set them for each platform.
* Due to setting values for all platforms,
we don't have to offset the slot id for OpenCL anymore,
as the image manager won't add float images for OpenCL now.
* Bugfix: TEX_NUM_FLOAT_IMAGES was always 5, even for CPU,
so the code in svm_image.h clamped float textures with alpha on CPU after the 5th slot.
Reviewers: #cycles, brecht
Reviewed By: #cycles, brecht
Subscribers: brecht
Differential Revision: https://developer.blender.org/D1925
It seems particular CUDA implementations have some precision issues,
which made an integer coordinate (which was expected to always be
positive) go negative.
A couple of things:
- No need to use string streams to format the version string,
we can do it at compile time and not bother with anything
at runtime.
- The function declaration was wrong and would have caused linking
conflicts in cases where util_version.h was included from
multiple places.
We should have a utility function to get the Cycles version so
applications which link to Cycles dynamically can query
the version, but that can't be done as an inlined function in a
header and would need to be a function properly exported to a
global symbol table (i.e. implemented in a .cpp file).
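The compile-time formatting boils down to preprocessor stringification, roughly like this (macro names are illustrative, the real ones live in util_version.h):
```
#define CYCLES_VERSION_MAJOR 1
#define CYCLES_VERSION_MINOR 7
#define CYCLES_VERSION_PATCH 0

/* Two-step stringification so the macro values get expanded first. */
#define CYCLES_STRINGIFY_ARG(x) #x
#define CYCLES_STRINGIFY(x) CYCLES_STRINGIFY_ARG(x)

#define CYCLES_VERSION_STRING \
  CYCLES_STRINGIFY(CYCLES_VERSION_MAJOR) "." \
  CYCLES_STRINGIFY(CYCLES_VERSION_MINOR) "." \
  CYCLES_STRINGIFY(CYCLES_VERSION_PATCH)
```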
Now Cycles has its own versioning; this is mainly interesting for external projects which integrate the engine.
We start with version 1.7.0. Reasons for that:
* The engine is too mature for a 1.0 release.
* We assume that Cycles inside of Blender 2.61 was version 0.1. We count upwards in 0.1 steps, therefore Cycles inside of Blender 2.77 would be 1.7.
We use a common versioning scheme here, with three numbers for the major, minor and patch level.
At the moment cycles --version can be used to display the version, which is easy to parse for external projects. The info will be added to the UI later as well.
When the Moon is full, it was possible to have a deadlock in the task
scheduler's exit() method.
A similar problem was fixed in Blender's task scheduler 3 years ago
in bae2a2c.
The policy here is a bit more complicated: if the tree becomes too deep we're
forced to create a leaf node, and the size of that leaf wouldn't be so well
predicted, which means it's quite tricky to use a single stack array for
that.
Made it a more official feature that StackAllocator will fall back to the
heap when running out of stack memory.
It's still much better than always using a heap allocator.
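A simplified sketch of the fallback (illustrative only, not the actual StackAllocator): allocations are served from a fixed stack buffer while it lasts, and from the heap afterwards.
```
#include <cstdlib>
#include <new>

template<int SIZE> class StackOrHeapSketch {
 public:
  StackOrHeapSketch() : offset_(0) {}

  void *allocate(size_t n)
  {
    if (offset_ + n <= SIZE) {
      void *mem = buffer_ + offset_;
      offset_ += n;
      return mem;
    }
    /* Ran out of stack storage, fall back to the heap. */
    void *mem = malloc(n);
    if (mem == NULL) {
      throw std::bad_alloc();
    }
    return mem;
  }

  void deallocate(void *p)
  {
    /* Only heap allocations need freeing; stack storage lives and dies with
     * the allocator itself. */
    if (p < buffer_ || p >= buffer_ + SIZE) {
      free(p);
    }
  }

 private:
  char buffer_[SIZE];
  size_t offset_;
};
```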
We had per-tree statistics already, but it's a bit tricky to see the overall
time because trees could be built in parallel.
In fact, we can now print statistics for any TaskPool.
The main use case of this ID will be to emulate TLS, which otherwise
would require some platform-specific implementations which
are not always really optimal.
See the notes about the argument in util_task.h.
Currently unused, but it will be handy for upcoming changes.
It would also be nice to be able to do scoped_lock() for both
Mutex and Spin, but currently that's not really easy to do;
it needs some changes in typedefs and such, and will happen as a
separate commit.
This seems quite useful for development, so you don't need to wait for
all the kernels to be re-compiled when working on a new feature, which
speeds up iteration.
Marked as an advanced option, so if it doesn't work so well in practice
it's safe to revert anyway.
This is a mix of regression and old unsupported configuration.
The regression was caused by some checks added on the Blender side which
check whether a Python function returned an error or not. This made it
impossible to enable Cycles when running from a file path which can't
be encoded with the MBCS codepage.
The non-regression issue was that it wasn't possible to use pre-compiled
CUDA kernels when running from a path with non-ascii multi-byte
characters.
This commit fixes the regression and the CUDA part, but OSL still can't be
used from a non-ascii location because it uses a non-widechar API to
work with file paths, by the looks of it. Not sure we can solve this
just from our side by using some codepage trick (UTF-16?), since even
oslc fails to compile a shader when there are non-ascii characters in
the path.
It's not really handy to silence something unused in the hope that it'll be
used in the future; we can end up with quite some silencing then.
Also made this flag, which I find rather useless, NOT cause -Werror
in Cycles code.
The issue was caused by static vectors allocating some internal
data using the rebound element allocator, which was causing
access to non-initialized statistics objects and was failing a
lot when switching Blender to fully guarded allocation.
Additionally, we were not able to free that internal memory before
Blender exits, which was causing false-positive memory leak prints.
Now we're not using GuardedAllocator for those proxy containers.
Ideally this should be done as a GuardedAllocator::rebind, but
it didn't work for vector<bool> because it seems some internal
parts are converting bool to char32_t, which either means we
can't use GuardedAllocator for those vectors, or the compiler
gets confused when we try to explicitly allow GuardedAllocator
for rebind<char32_t>.
With the current approach we should be fine for the release.
Was caused by some safety measures making sure we have a NULL
terminator in the buffer when doing mbs<->wcs conversion, but
it turns out this simply confuses std::string and it can no
longer report a proper .size(). Let's assume the behavior of string
allocation is the same all over the std, so we can avoid having
that extra null terminator allocated.