Starting in thrust 1.8, the implementation of inclusive scan inside
thrust became highly optimized by using parallel task groups. This
new implementation has a bug that only manifests when using custom
binary operators, large arrays, release mode, and no
debugger or memory checker attached.
While I have submitted the issue to thrust, we need to work around
it in the meantime. The solution I have chosen is
to mark all vtkm::exec::cuda::internal::WrappedBinaryOperators
as commutative as far as thrust is concerned. To make
sure we don't get any unexpected behavior, I have also
created WrappedBinaryPredicate so that we never mark a
predicate as commutative.
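In outline, the fix looks something like the sketch below. The wrapper
bodies are illustrative rather than the actual vtkm source, and the
is_commutative specialization assumes a trait thrust carried in its detail
namespace around 1.8; treat that trait as an assumption to verify, not
documented API.

    // Illustrative sketch; not the actual vtkm source.
    template <typename ResultType, typename Functor>
    struct WrappedBinaryOperator
    {
      Functor F;
      template <typename A, typename B>
      __host__ __device__ ResultType operator()(const A& a, const B& b) const
      {
        return F(a, b);
      }
    };

    // A deliberately distinct type for predicates, so it is never caught
    // by the commutative marking below.
    template <typename Functor>
    struct WrappedBinaryPredicate
    {
      Functor F;
      template <typename A, typename B>
      __host__ __device__ bool operator()(const A& a, const B& b) const
      {
        return F(a, b);
      }
    };

    // ASSUMPTION: thrust::detail::is_commutative exists (as it did around
    // thrust 1.8). Only the operator wrapper is marked commutative.
    namespace thrust { namespace detail {

    template <typename ResultType, typename Functor>
    struct is_commutative< ::WrappedBinaryOperator<ResultType, Functor> >
      : public true_type
    {
    };

    } } // namespace thrust::detail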
The original implementation tried to run 2^31 kernels and detect a
launch failure to determine this use case. The problem with this approach
is that on a CUDA 3.0+ GPU it would take multiple seconds and cause
the GPU to terminate the kernel when OpenGL was also loaded.
The initial implementation also overlooked the fact that SM 2.x
architectures can only handle 65k blocks per grid dimension. We now
gracefully handle this case when compiling for SM 2.x.
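One common way to stay under that limit (a sketch, not the actual vtkm
scheduler) is to clamp the grid size and let each thread stride over the
remaining indices:

    // Grid-stride kernel: the grid may be smaller than the data, so each
    // thread loops, advancing by the total number of threads in the grid.
    template <typename Functor>
    __global__ void ScheduleKernel(Functor functor, long long size)
    {
      const long long stride = static_cast<long long>(gridDim.x) * blockDim.x;
      long long index =
        static_cast<long long>(blockIdx.x) * blockDim.x + threadIdx.x;
      for (; index < size; index += stride)
      {
        functor(index);
      }
    }

    template <typename Functor>
    void Schedule(Functor functor, long long size)
    {
      const int blockSize = 128;
      long long numBlocks = (size + blockSize - 1) / blockSize;
      if (numBlocks > 65535) { numBlocks = 65535; } // SM 2.x grid.x limit
      ScheduleKernel<<<static_cast<unsigned int>(numBlocks), blockSize>>>(
        functor, size);
    }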
Math
Adds a set of portable math functions that are generally expected to be available and are typically useful. They include trigonometry (sin, cos, tan, etc.), floating-point queries (nan, finite, etc.), and min/max.
See merge request !58
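Typical calls look something like this (a sketch; vtkm/Math.h has the
authoritative list of functions and overloads):

    #include <vtkm/Math.h>
    #include <vtkm/Types.h>

    void MathExample()
    {
      vtkm::Float32 angle = 0.5f;
      vtkm::Float32 s = vtkm::Sin(angle);
      vtkm::Float32 c = vtkm::Cos(angle);
      vtkm::Float32 smallest = vtkm::Min(s, c);
      bool usable = !vtkm::IsNan(smallest) && vtkm::IsFinite(smallest);
      (void)usable;
    }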
Also fix some issues that caused compilation to fail when trying to
run some of the math functions on a CUDA device. In particular, CUDA
is picky about using a global const on a device when the const's type is
not one of the basic C types.
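A common workaround pattern (sketched with hypothetical names; not the
exact vtkm change) is to construct the constant inside a function rather
than referencing a file-scope const of a class type from device code:

    struct RGB { float r, g, b; };

    // Problematic in device code: a file-scope const of a non-basic type.
    // const RGB White = { 1.0f, 1.0f, 1.0f };

    // Portable alternative: build the value where it is needed.
    __host__ __device__ inline RGB White()
    {
      RGB v = { 1.0f, 1.0f, 1.0f };
      return v;
    }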
We need to call PrepareForInput on the input argument before invoking a
function. The order of evaluation of function arguments is undefined in
C++, so we need to make sure input is prepared before output, or else the
in-place use case breaks.
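A self-contained illustration of the hazard (plain C++, not vtkm code):

    #include <iostream>

    int counter = 0;
    int PrepareForInput()  { return ++counter; } // must happen first
    int PrepareForOutput() { return ++counter; }

    void Invoke(int in, int out) { std::cout << in << " " << out << "\n"; }

    int main()
    {
      // Unspecified which Prepare* runs first; may print "1 2" or "2 1".
      Invoke(PrepareForInput(), PrepareForOutput());

      // The fix: force the order with separate statements.
      int in = PrepareForInput();
      int out = PrepareForOutput();
      Invoke(in, out);
      return 0;
    }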
Instead of having a single specialization for sort and zip handles,
we now handle any fancy handle passed to the CUDA device adapter.
This was done by reworking how we represent fancy iterators inside thrust:
instead of a transform iterator plus a counting iterator, we now use
an iterator_facade.
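A sketch of the facade approach (PortalType stands in for the vtkm portal
concept, and its ValueType/Get members are assumptions of this sketch; the
real vtkm iterator has more detail):

    #include <cstddef>
    #include <thrust/iterator/iterator_facade.h>

    template <typename PortalType>
    class PortalIterator
      : public thrust::iterator_facade<PortalIterator<PortalType>,
                                       typename PortalType::ValueType,
                                       thrust::any_system_tag,
                                       thrust::random_access_traversal_tag,
                                       typename PortalType::ValueType,
                                       std::ptrdiff_t>
    {
    public:
      __host__ __device__
      explicit PortalIterator(const PortalType& portal,
                              std::ptrdiff_t index = 0)
        : Portal(portal), Index(index) { }

    private:
      friend class thrust::iterator_core_access;

      // Dereference returns by value, not by proxy, so custom operators
      // handed to thrust see real values.
      __host__ __device__
      typename PortalType::ValueType dereference() const
      {
        return this->Portal.Get(this->Index);
      }

      __host__ __device__
      bool equal(const PortalIterator& other) const
      {
        return this->Index == other.Index;
      }

      __host__ __device__ void increment() { ++this->Index; }
      __host__ __device__ void decrement() { --this->Index; }
      __host__ __device__ void advance(std::ptrdiff_t n) { this->Index += n; }

      __host__ __device__
      std::ptrdiff_t distance_to(const PortalIterator& other) const
      {
        return other.Index - this->Index;
      }

      PortalType Portal;
      std::ptrdiff_t Index;
    };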
This is required so that we can use ArrayHandleZip with Sort/Reduce and
custom comparison operators. ArrayHandleZip and PortalValue don't combine
well when used with a DeviceAlgorithm that takes a custom operator:
the custom operator is passed the PortalValue instead of
the real values, and by that point we can't fix anything because we
don't know what the original operator was.
Fix compile warnings that come up with the flags
-Wconversion -Wno-sign-conversion
This catches several instances (mostly in the testing framework) where
types are implicitly converted. I expect these changes to fix some of
the warnings we are seeing in MSVC.
I was going to add these flags to the list of extra warning flags, but
unfortunately the Thrust library itself triggers several of these
warnings, and I don't know a good way to turn the warnings on for our
code while turning them off for Thrust.
Per a discussion with Robert Maynard, we no longer have plans to bind
textures or use texture objects. Instead, we are now using the __ldg
intrinsic to load from global memory inside the execution portal, which
gives us most of the benefits of textures but doesn't incur any
bookkeeping.
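The load path looks roughly like this (a sketch; __ldg requires compute
capability 3.5 or higher, so older architectures fall back to a plain
load):

    __device__ inline float LoadValue(const float* address)
    {
    #if __CUDA_ARCH__ >= 350
      return __ldg(address); // read through the read-only data cache
    #else
      return *address;       // plain global load on older architectures
    #endif
    }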
Previously, ArrayTransfer and ArrayManagerExecution received a reference
to a Storage class in their constructor and held the reference as a
member variable. In retrospect, this is just asking for trouble. First,
it is far too easy to pass by value when you mean to pass by reference.
Second, if there is ever a bug where the Storage goes out of scope before
the classes holding the reference, it is that much harder to debug.
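A minimal sketch of that second hazard (class bodies hypothetical):

    struct Storage { /* ... */ };

    struct Manager
    {
      explicit Manager(Storage& storage) : StorageRef(storage) { }
      Storage& StorageRef; // only valid while the Storage is alive
    };

    Manager MakeManager()
    {
      Storage local;
      return Manager(local); // StorageRef dangles once `local` dies
    }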
This includes changing methods like LoadDataForInput to PrepareForInput.
It also changes the interface a bit to save a reference to the storage
object. (Maybe it would be better to save a pointer?) These changes also
extend up to the ArrayManagerExecution class, so they can affect device
adapter implementations.
This API change affects both ArrayTransfer and ArrayManagerExecution.
It is in preparation for a future change to make the API more
consistent with ArrayHandle.
Using the underlying allocator inside thrust was a bad approach: for some
reason it fails to allocate uint8s or int8s on the correct boundaries.
I expect the alignment logic lives somewhere else in the code; instead
we should use thrust::system::cuda::vector, which handles this properly.
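Temporary storage would then be obtained along these lines (a sketch):

    #include <cstddef>
    #include <thrust/memory.h>
    #include <thrust/system/cuda/vector.h>

    // Let thrust's CUDA vector own the scratch space, so allocation and
    // alignment are handled by thrust rather than by us.
    unsigned char* PrepareScratch(
      thrust::system::cuda::vector<unsigned char>& buffer,
      std::size_t numBytes)
    {
      buffer.resize(numBytes);
      return thrust::raw_pointer_cast(buffer.data());
    }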
The namespaces need to be different for each test, or else only the first
implementation of the function will be used for all tests that call that
function.
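A minimal illustration of the pitfall (names hypothetical):

    // If two test translation units both defined GetExpectedValue() at
    // global scope, the ODR would be violated and the linker would quietly
    // use one definition for every test. Per-test namespaces keep the
    // definitions distinct.
    namespace UnitTestArrayHandle { int GetExpectedValue() { return 1; } }
    namespace UnitTestCounting    { int GetExpectedValue() { return 2; } }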
Also updated the test to verify that we can count starting from a
non-zero number.
Porting the dax device adapter over to vtkm. Unlike the dax version, it
doesn't use thrust::device_vector, but instead uses thrust::system calls
so that we can support multiple thrust-based backends.
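In practice that means passing an execution policy (a sketch; thrust's
execution policies date from the 1.7/1.8 era):

    #include <thrust/sort.h>
    #include <thrust/system/cuda/execution_policy.h>

    // The policy, not the container type, selects the backend, so the same
    // call pattern works for other thrust systems (TBB, OpenMP, ...).
    template <typename T>
    void SortOnCuda(T* first, T* last)
    {
      thrust::sort(thrust::cuda::par, first, last);
    }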
This also adds texture memory support for input array handles. Some more
work will need to be done on ArrayHandle so that everything works when
using an ArrayHandle in place with texture memory bindings.