This makes it easier to have per kernel number of registers. Also, all the
tunable parameters for this are now in kernel.cu, rather than spread over cmake,
scons and device_cuda.cpp.
This fixes the ptaxs "ACCESS_VIOLATION" error and should allow our Linux and Windows build bots to compile again.
Unfortunately this comes with a performance penalty on sm_2x cards, so this is only a workaround for now. Branched Path is still globally disabled on GPU.
* Remove support for CUDA Toolkit 4.x, only Toolkit 5.0 and above are supported now.
* Remove support for sm_1x cards (< Fermi) for good. We didn't officially support those cards for a few releases already, now remove some special code that was still there.
* Add CUDA compiler version detection to cmake/scons/runtime
* Remove noinline in kernel_shader.h and reenable --use_fast_math if CUDA 5.x
is used, these were workarounds for CUDA 4.2 bugs
* Change max number of registers to 32 for sm 2.x (based on performance tests
from Martijn Berger and confirmed here), and also for NVidia OpenCL.
Overall it seems that with these changes and the latest CUDA 5.0 download, that
performance is as good as or better than the 2.67b release with the scenes and
graphics cards I tested.
* Added option "WITH_BF_CYCLES_CUDA_THREADED_COMPILE" for the people who have much RAM (8 or more) and can compile several kernels at the same time. If enabled, it uses the general BF_NUMJOBS flag.
* The option is off per default.
* Compile all of cycles with -ffast-math again
* Add scons compilation of cuda binaries, tested on mac/linux.
* Add UI option for supported/experimental features, to make it
more clear what is supported, opencl/subdivision is experimental.
* Remove cycles xml exporter, was just for testing.