Basically, make re-alloc and memcpy from the same lock, otherwise one
thread might be re-allocating thread while another one is trying to
copy data there.
Reported by Mohamed Sakr in IRC, thanks!
It doesn't seem that useful in practice, was mostly added to match some
other renderers but also seems to be causing user confusing and accidental
long render times. So let's just keep the UI simple and remove this.
Differential Revision: https://developer.blender.org/D2768
We're adding some bias by default, which now I think is the right thing
to do from a usability point of view since you really need to use those
options anyway to get clean renders in a practical time.
Differential Revision: https://developer.blender.org/D2769
This is a bit confusing, especially when one mixes OpenCL code where ulong equals
to uint64_t with CPU side code where ulong is expected to be something else from
the naming.
This commit makes it so we use explicit name, common on all platforms.
Problem was that some code checks to see if device_pointer is null or
not and the new allocator wasn't even setting the pointer to anything
as it tracks memory location separately. Setting the pointer to non
null keeps all users of device_pointer happy.
- Apparently MSVC does not support compound literals
in C++ (at least by the looks of it).
- Not sure how opencl_device_assert was managing to
set protected property of the Device class.
We don't enable global SSE optimizations in regular kernel, and we
keep those disabled on Linux 32bit.
One possible workaround would be to pass arguments by ccl_ref, but
that is quite a few of code which better be done accurately.
Steps to reproduce:
- Create shader Image texture -> Diffuse BSDF -> Output. Do NOT select image yet!
- Start viewport render.
- Select image from the ID browser of Image Texture node.
Thing is: with the memory manager we always need to inform device that memory
was freed.
Common folks, nobody considered master a C++11 only branch. Such decision is to
be done officially and will involve changes in quite a few infrastructure related
areas.
It is defined to & for CPU side compilation, and defined to an empty for any GPU
platform. The idea here is to use this macro instead of #ifdef block with bunch
of duplicated lines just to make it so CPU code is efficient.
Eventually we might switch to references on CUDA as well, but that would require
some intensive testing.
Image textures were being packed into a single buffer for OpenCL, which
limited the amount of memory available for images to the size of one
buffer (usually 4gb on AMD hardware). By packing textures into multiple
buffers that limit is removed, while simultaneously reducing the number
of buffers that need to be passed to each kernel.
Benchmarks were within 2%.
Fixes T51554.
Differential Revision: https://developer.blender.org/D2745
Since all the shadow catchers are already assumed to be in the footage,
the shadows they cast on each other are already in the footage too. So
don't just let shadow catchers skip self, but all shadow catchers.
Another justification is that it should not matter if the shadow catcher
is modeled as one object or multiple separate objects, the resulting
render should be the same.
Differential Revision: https://developer.blender.org/D2763
* Remove some unnecessary SSE emulation defines.
* Use full precision float division so we can enable it.
* Add sqrt(), sqr(), fabs(), shuffle variations, mask().
* Optimize reduce_add(), select().
Differential Revision: https://developer.blender.org/D2764
I need to use some macros defined in util_simd.h for float3/float4, to emulate
SSE4 instructions on SSE2. But due to issues with order of header includes this
was not possible, this does some refactoring to make it work.
Differential Revision: https://developer.blender.org/D2764
We already detect this automatically based on shading nodes and per shader
settings, and performance of this option is ok now all devices.
Differential Revision: https://developer.blender.org/D2767
Two main things here:
1. Replace all unsafe for #line directive characters into a single loop,
avoiding multiple iterations and multiple temporary strings created.
2. Don't merge token char by char but calculate start and end point and
then copy all substring at once.
This gives about 15% speedup of source processing time. At this point
(with all previous commits from today) we've shrinked down compiled
sources size from 108 MB down to ~5.5 MB and lowered processing time
from 4.5 sec down to 0.047 sec on my laptop running Linux (this was a
constant time which Blender will always spent first time loading kernel,
even if we've got compiled clbin).
Add a safe version of normalize since all uses of normalize
did zero length checks, move this into a function.
Also avoid unnecessary conversion.
Gives minor speedup here (approx 3-5%).
Basically gather lines as-is during traversal, avoiding allocating
memory for all the lines in headers.
Brings additional performance improvement abut 20%.
The idea here is that it is possible to mark certain include statements
as "precompiled" which means all subsequent includes of that file will
be replaced with an empty string.
This is a way to deal with tricky include pattern happening in single
program OpenCL split kernel which was including bunch of headers about
10 times.
This brings preprocessing time from ~1sec to ~0.1sec on my laptop.
The idea is to re-use files which were already processed. Gives about 4x speedup
of processing time (~4.5sec vs 1.0sec) on my laptop for the whole OpenCL kernel.
For users it will mean lower delay before OpenCL rendering might start.
The order of evaluation of function arguments is undefined, and the order
was reversed between these compilers. This was causing regressions tests
to give different results between Linux and macOS.