Sandia National Laboratories recently changed management from the
Sandia Corporation to the National Technology & Engineering Solutions
of Sandia, LLC (NTESS). The copyright statements need to be updated
accordingly.
Redesigns the TBB and Serial backends and the vtkm::exec::Task concept so that
we can re-use the same launching logic for all Worklets, instead of generating
per worlet code. To keep the performance the same the TilingTask now is past
a range of indices to work on, rather than a single index.
Binary size reduction:
WorkletTests_SERIAL old - 19MB
WorkletTests_SERIAL new - 18MB
WorkletTests_TBB old - 39MB
WorkletTests_TBB new - 18MB
libvtkAcceleratorsVTKm old - 48MB
libvtkAcceleratorsVTKm new - 19MB
The current design for ArrayPortalVirtual makes it a requirement for all
array portals (that it wraps) to have Set defined. Thus, make sure Set is
defined for all ArrayPortal. Where Set is invalid, an assert is thrown if
something calls it at runtime.
Most functions in the CUDA runtime API return an error code that must be
checked to determine whether the operation completed successfully. Most
operations in VTK-m just called the function and assumed it completed
correctly, which could lead to further errors. This change wraps most
CUDA calls in a VTKM_CUDA_CALL macro that checks the error code and
throws an exception if the call fails.
Change the VTKM_CONT_EXPORT to VTKM_CONT. (Likewise for EXEC and
EXEC_CONT.) Remove the inline from these macros so that they can be
applied to everything, including implementations in a library.
Because inline is not declared in these modifies, you have to add the
keyword to functions and methods where the implementation is not inlined
in the class.
This usage of raw_referenc_cast returns a const PortalValue which is than
assigned too. So to work around the problem we need to mark operator= on
the class as const.
Scatter in worklets
Add the functionality to perform a scatter operation from input to output in a worklet invocation. This allows you to, for example, specify a variable amount of outputs generated for each input.
See merge request !221
The original workaround for inclusive_scan bugs in thrust 1.8 only solved the
issue for basic arithmetic types such as int, float, double. Now we go one
step further and fix the problem for all types.
The solution is to provide a proper implementation of destructive_accumulate_n
and make sure it exists before any includes of thrust occur.
Previously, there was a declaration ConstArrayPortalFromThrust<const T>
in ArrayManagerExecutionThrustDevice. This proved problematic because
values read from the array in the worklet were typed as const T rather
than simply T. Any Vec or Matrix built from that type would then fail
because they are not meant to work with a const value (which means they
have to be set on construction and never changed.
Instead, declare ConstArrayPortalFromThrust<T> and internally set all
the Thrust pointers to have type const T. Also declare other thrust
pointers used as method parameters to have const T rather than T. This
should work as conversion from T to const T should be fine, but not the
other way around.
fd685210 Always install all device headers even when device isn't enabled.
b1663b24 Add an example of using multiple backends from a single translation unit.
fc0ff69d Methods with try/catch need to be host only.
4d635d64 DeviceAdapter Tags now always exist, and contain if the device is valid.
cf32b430 Teach Configure.h to store if TBB and CUDA are enabled.
Acked-by: Kitware Robot <kwrobot@kitware.com>
Acked-by: Kenneth Moreland <kmorel@sandia.gov>
Merge-request: !198
By introducing our own custom thrust execution policy we can make sure
to hit the fastest code paths in thrust for the sort operation. This makes
sure that for UInt32,Int32, and Float32 we use the radix sort from thrust
which offers a 2x to 3x speed improvement over the merge sort implementation.
Secondly by telling thrust that our BinaryOperators are commutative we
make sure that we get the fastest code paths when executing Inclusive
and Exclusive Scan
Benchmark 'Radix Sort on 1048576 random values vtkm::Int32' results:
median = 0.0117049s
median abs dev = 0.00324614s
mean = 0.0167615s
std dev = 0.00786269s
min = 0.00845875s
max = 0.0389063s
Benchmark 'Radix Sort on 1048576 random values vtkm::Float32' results:
median = 0.0234463s
median abs dev = 0.000317249s
mean = 0.021452s
std dev = 0.00470307s
min = 0.011255s
max = 0.0250643s
Benchmark 'Merge Sort on 1048576 random values vtkm::Int32' results:
median = 0.0310486s
median abs dev = 0.000182129s
mean = 0.0286914s
std dev = 0.00634102s
min = 0.0116225s
max = 0.0317379s
Benchmark 'Merge Sort on 1048576 random values vtkm::Float32' results:
median = 0.0310617s
median abs dev = 0.000193583s
mean = 0.0295779s
std dev = 0.00491531s
min = 0.0147257s
max = 0.032307s
Starting in thrust 1.8 the implementation of scan inclusive inside
thrust became highly optimized by using parallel task groups. This
new implementation has a bug that only exists when using custom
binary operators, large size arrays, release mode, and no
debugger or mem-checker attached.
While I have submitted the issue to thrust, we need to be able
to work around the existing issue. The solution I have chosen is
to mark all vtkm::exec::cuda::interal::WrappedBinaryOperators
as being commutative as far as thrust is concerened. To make
sure we don't get any unexpected behavior I have also had
to create WrappedBinaryPredicate so that we don't mark any
predicate as commutative.
On one of my compile platforms, GCC was giving conversion warnings from
any boost include that was not wrapped in pragmas to disable conversion
warnings. To make things easier and more robust, I created a pair of
macros, VTKM_BOOST_PRE_INCLUDE and VTKM_BOOST_POST_INCLUDE, that should
be wrapped around any #include of a boost header file.