9bf14b78 Correct warnings inside worklet::Clip when making array handles
1b6d67e0 Always defer to the serial allocator when allocating basic storage
bf2b4169 Refactor vtk-m ArrayHandle to use mutable over const_cast
705528bf vtk-m ArrayHandle + basic storage has an optimized PrepareForDevice method
22f9ae3d vtk-m ArrayHandle + basic holds control data by StorageBasicBase
b1d0060d Make Storage and ArrayHandle export for the same value types.
d0a68d32 Refactor vtk-m storage basic to generate less code
Acked-by: Kitware Robot <kwrobot@kitware.com>
Merge-request: !1084
By hard coding the PrepareForDevice to know about all the different VTK-m
devices, we can have a single base class do the execution allocation, and not
have that logic repeated in each child class.
Added check for long double arrays, use TBB parallel_sort
Added radix sort instantiations for char16_t, char32_t, and
wchar_t. std::is_arithmetic<T> will evaluate to true for these
types.
Removed VTKM_CONT_EXPORT in DeviceAdapterAlgorithmTBB.h to try
and fix dll related error on Windows.
Created split implementation. Parallel radix
sort calls moved to vtkm_cont library.
Added key value radix sorts. SortByKey will invoke
radix sort when the key is a fundamental C++ numeric
or character type.
Added fast path for vtkm::SortLess and vtkm::SortGreater
calls to Sort and SortByKey.
Parallel radix sorting will be invoked in DeviceAdapterAlgorthmTBB.h when
the input is ArrayHandle<T, vtkm::cont::StorageTagBasic> where T is one of
the following basic C++ types:
unsigned int
unsigned short int
unsigned long int
unsigned long long int
unsigned char
char16_t
char32_t
wchar_t
char
short
int
long long
signed char
float
double
If a comparison operator is provided, it must be type std::less<T> or std::greater<T>.
Radix sort implementation is Satish parallel radix sort as documented in the
following citation:
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort.
N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey.
In Proc. SIGMOD, pages 351–362, 2010
Implementation is based on Takuya Akiba's GitHub source code with the following
changes:
- Changed parallel threading from OpenMP to TBB tasks
- Removed pair sorting
- Added minimum threshold for parallel, will instead invoke serial radix sort (kxsort)
- Added std::greater<T> and std::less<T> to interface for descending order sorts
- Added can_use_parallel_radix_sort<T, F>() function to determine if parallel radix sorting
is possible for type T and compare function F (fallback is std::sort() if not possible)
- Added linear scaling of threads used by the algorithm for more stable performance
on machines with lots of available threads (KNL and Haswell)
Added kxsort (serial MSD radix sort by Dinghua Li via GitHub) implementation without modification.
Sandia National Laboratories recently changed management from the
Sandia Corporation to the National Technology & Engineering Solutions
of Sandia, LLC (NTESS). The copyright statements need to be updated
accordingly.
3b03177c Add TBB specialization of Unique.
94d668dd Add serial version of Unique.
Acked-by: Kitware Robot <kwrobot@kitware.com>
Acked-by: Robert Maynard <robert.maynard@kitware.com>
Merge-request: !933
TBB's ReduceByKey was using the generic DeviceAdapterGeneral
implementation and was about 50x slower than the serial implementation,
which is very efficient.
This patch improves TBB's RBK implementation significantly, though it still
does not scale well. On a quad core processor, this implementation performs
comparably or slightly worse than the highly efficient serial algorithm.
More than 4 cores may be needed to see sufficient parallel speedup that
would overcome the TBB overhead, and grain size does not seem to affect the
performance significantly.
The old templated array transfer mechanism generated a lot of code
that ended up doing a simple, type-agnostic memcpy for most devices.
This patch specialized array handles for basic storage and uses a
fast-path array transfer implementation. This reduces the size of the
vtkm_cont library by 27% on gcc (from 6.2MB to 4.5MB).
Redesigns the TBB and Serial backends and the vtkm::exec::Task concept so that
we can re-use the same launching logic for all Worklets, instead of generating
per worlet code. To keep the performance the same the TilingTask now is past
a range of indices to work on, rather than a single index.
Binary size reduction:
WorkletTests_SERIAL old - 19MB
WorkletTests_SERIAL new - 18MB
WorkletTests_TBB old - 39MB
WorkletTests_TBB new - 18MB
libvtkAcceleratorsVTKm old - 48MB
libvtkAcceleratorsVTKm new - 19MB
ec6589d3 Only enable -fPIC on component static libraries when necessary.
cbfe5fdd Fix up various issues with ArrayHandles in vtkm_cont.
355eea88 Get the vtkm cont cuda object to compile properly.
6ecc22bb First pass at compiling ArrayHandle into vtkm_cont.
Acked-by: Kitware Robot <kwrobot@kitware.com>
Merge-request: !715