* AVX is available on Intel Sandy Bridge and newer and AMD Bulldozer and newer.
* We don't use dedicated AVX intrinsics yet, but gcc auto vectorization gives a 3% performance improvement for Caminandes. Tested on an i5-3570, Linux x64.
* No change for Windows yet, MSVC 2008 does not support AVX.
Reviewed by: brecht
Differential Revision: https://developer.blender.org/D216
After update to Mac OS X 10.9.1, OpenCL works now on my Intel CPU in the 2013 Macbook Pro (even the entire kernel).
The Intel Iris Pro GPU still segfaults here though, even when all flags are disabled (building "clay like" kernel only).
Maybe we need the -no-missing-prototypes for AMD hardware still, but I couldn't find a way to distuinguish here.
This actually works somewhat now, although viewport rendering is broken and any
kind of network error or connection failure will kill Blender.
* Experimental WITH_CYCLES_NETWORK cmake option
* Networked Device is shown as an option next to CPU and GPU Compute
* Various updates to work with the latest Cycles code
* Locks and thread safety for RPC calls and tiles
* Refactored pointer mapping code
* Fix error in CPU brand string retrieval code
This includes work by Doug Gale, Martijn Berger and Brecht Van Lommel.
Reviewers: brecht
Differential Revision: http://developer.blender.org/D36
This is mostly work towards enabling the __KERNEL_SSE__ option to start using
SIMD operations for vector math operations. This 4.1 kernel performes about 8%
faster with that option but overall is still slower than without the option.
WITH_CYCLES_OPTIMIZED_KERNEL_SSE41 is the cmake flag for testing this kernel.
Alignment of int3, int4, float3, float4 to 16 bytes seems to give a slight 1-2%
speedup on tested systems with the current kernel already, so is enabled now.
* Remove support for CUDA Toolkit 4.x, only Toolkit 5.0 and above are supported now.
* Remove support for sm_1x cards (< Fermi) for good. We didn't officially support those cards for a few releases already, now remove some special code that was still there.
use arrays instead of textures for general storage on this card (image textures
are still stored as texture). Textures were found to be faster on older cards,
but the limits on 1D texture size have not increased along with the memory size,
which meant that the full 6 GB could not be used.
The performance actually seems to be slightly better with arrays in some tests
on Titan. For older cards there seems to be a bit of a mix, some are better and
others not. We may change those to use arrays too, but more testing is needed,
only Titan and Tesla K20 (sm_35) is changed for now.
The fact that arrays are faster is a bit surprising, as others found textures
to be faster on Kepler. However even if they were, the memory limitation is
more important to solve anyway.
https://research.nvidia.com/publication/understanding-efficiency-ray-traversal-gpus-kepler-and-fermi-addendum
except for curves, that's still missing from the OpenColorIO GLSL shader.
The pixels are stored in a half float texture, converterd from full float with
native GPU instructions and SIMD on the CPU, so it should be pretty quick.
Using a GLSL shader is useful for GPU render because it avoids a copy through
CPU memory.
* GPU kernel can now be compiled without __NON_PROGRESSIVE__ again, was broken after my last commit. Also add a check for have_error(), in case the GPU kernel comes without Non-Progressive, to avoid a crash.
* Don't compile progressive kernel twice on CPU, if __NON_PROGRESSIVE__ would be disabled there.
* Non-Progressive integrator is now available on the GPU (CUDA, sm_20 and above).
Implementation details:
* kernel_path_trace() has been split up into two functions:
kernel_path_trace_non_progressive() and kernel_path_trace_progressive().
* We compile two CUDA kernel entry functions (in kernel.cu) for the two integrators, they are still inside one .cubin file but due to the kernel separation there should be no performance problem. I tested with the BMW file on my Geforce 540M and the render times were the same for 100 samples (1.57 min in my case).
This is part of my GSoC project, SVN merge of r59032 + manual merge of UI changes for this from my branch.
* "Auto Detect" now again uses the umber of cores, instead number of cores + 1.
This was added before we had Tile rendering and benchmarks on several systems showed that there is no gain with this now. There might be some slight difference (0.5% or so) slower/faster depending on the scene, but this is negligible.
* Reshuffle SSE #ifdefs to try to avoid compilation errors enabling SSE on 32 bit.
* Remove CUDA kernel launch size exception on Mac, is not needed.
* Make OSL file compilation quiet like c/cpp files.
* Add CUDA compiler version detection to cmake/scons/runtime
* Remove noinline in kernel_shader.h and reenable --use_fast_math if CUDA 5.x
is used, these were workarounds for CUDA 4.2 bugs
* Change max number of registers to 32 for sm 2.x (based on performance tests
from Martijn Berger and confirmed here), and also for NVidia OpenCL.
Overall it seems that with these changes and the latest CUDA 5.0 download, that
performance is as good as or better than the 2.67b release with the scenes and
graphics cards I tested.
the second time, as for example Intel CPU startup time is 9 seconds.
* Adds an cache for contexts and programs for each platform and device pair,
which also ensure now no two threads try to compile and write the binary cache
file at the same time.
* Change clFinish to clFlush so we don't block until the result is done, instead
it will block at the moment we copy back memory.
* Fix error in Cycles time_sleep implementation, does not affect any active code
though.
* Adds some (disabled) debugging code in the task scheduler.
Patch #35559 by Doug Gale.
* Support using devices from all OpenCL platforms, so that you can use e.g. both
Intel and NVidia OpenCL implementations if you have them installed.
* Fix compile error due to missing fmodf after recent math node change.
* Enable advanced shading for Intel OpenCL.
* CYCLES_OPENCL_DEBUG environment variable for generating debug symbols so you
can debug with gdb. This crashes the compiler with Intel OpenCL on Linux though.
To make this work the preprocessed kernel source code is written out, as gdb
needs this.
* Show OpenCL compiler warnings even if the build succeeded.
* Some small fixes to initialize cdDevice to NULL, add missing NULL check when
creating buffer and add missing space at end of build options for Apple OpenCL.
* Fix crash with multi device + opencl, now e.g. CPU + GPU render should work.
I did a few tweaks to the code and also:
* Fix viewport render failing sometimes with Apple CPU OpenCL, was not taking
workgroup size limits into account properly.
* Add compile error when advanced shading in the Blender binary and OpenCL kernel
are not in sync.
for Apple OpenCL on OS X 10.8 and simple AO render.
Also environment variable CYCLES_OPENCL_TEST can now be set to CPU, GPU,
ACCELERATOR, DEFAULT or ALL values to test particuler devices.
* Deprecate computing capability 1.3 (sm_13)
This commit disables auto build of sm_13 CUDA platform, which means that starting with Blender 2.67, we don't support sm_13 devices anymore. It has become difficult to support that and it was already feature incomplete (no render-passes, AO, Multi Closure etc).
It's still possible to manually enable sm_13 for own tests, but building might break in the future.
* CUDA: Make it more clear that sm_12 and below is not supported.
* OpenCL: __KERNEL_SHADING__ was declared twice for nvidia opencl device.
* Some reshuffle of defines in kernel_types.h. No functional changes.
precompiled cubins instead,
Logic here is following now:
- If there're precompiled cubins, assume CUDA compute is available,
otherwise
- If cuda toolkit found, assume CUDA compute is available
- In all other cases CUDA compute is not available
For windows there're still check for only precompiled binaries,
no runtime compilation is allowed.
Ended up with such decision after discussion with Brecht. The thing
is, if we'll support runtime compilation on windows we'll end up
having lots of reports about different aspects of something doesn't
work (you need particular toolkit version, msvc installed, environment
variables set properly and so) and giving feedback on such reports
will waste time.