Now that we have atomic free functions (e.g. `vtkm::AtomicAdd()`), we no
longer need special implementations for control and each execution
device. (Well, technically we do have special implementations for each,
but they are handled with compiler directives in the free functions.)
Convert the old atomic interface classes (`AtomicInterfaceControl` and
`AtomicInterfaceExecution`) to use the new atomic free functions. This
will allow us to test the new atomic functions everywhere that atomics
are used in VTK-m.
Once verified, we can deprecate the old atomic interface classes.
The `ArrayHandleStreaming` class stems from an old research project
experimenting with bringing data from an `ArrayHandle` in parts and
overlapping device transfer and execution. It works, but only in very
limited contexts. Thus, it is not actually used today. Plus, the feature
requires global indexing to be permutated throughout the worklet
dispatching classes of VTK-m for no further reason.
Because it is not really used, there are other more promising approaches
on the horizon, and it makes further scheduling improvements difficult,
we are removing this functionality.
This behaves just like `ScanExclusive`, but rather than returning the
total sum, it is appended to the end of the output array.
This is in preparation for the CellSetExplicit refactoring described in
issue #408.
- Use AtomicInterface to implement device-specific atomic operations.
- Remove DeviceAdapterAtomicArrayImplementations.
- Extend supported atomic types to include unsigned 32/64-bit ints.
- Add a static_assert to check that AtomicArray type is supported.
- Add documentation for AtomicArrayExecutionObject, including a CAS
example.
- Add a `T Get(idx)` method to AtomicArrayExecutionObject that does
an atomic load, and update existing CAS usage to use this instead
of `Add(idx, 0)`.
BitFields are:
- Stored in memory using a contiguous buffer of bits.
- Accessible via portals, a la ArrayHandle.
- Portals operate on individual bits or words.
- Operations may be atomic for safe use from concurrent kernels.
The new BitFieldToUnorderedSet device algorithm produces an ArrayHandle
containing the indices of all set bits, in no particular order.
The new AtomicInterface classes provide an abstraction into bitwise
atomic operations across control and execution environments and are used
to implement the BitPortals.
It should be possible to query a vtkm::cont::Timer without modifying it.
As such, its query functions (such as Stopped and GetElapsedTime) should
be const.
The timer class now is asynchronous and device independent. it's using an
similiar API as vtkOpenGLRenderTimer with Start(), Stop(), Reset(), Ready(),
and GetElapsedTime() function. For convenience and backward compability, Each
Start() function call will call Reset() internally and each GetElapsedTime()
function call will call Stop() function if it hasn't been called yet for keeping
backward compatibility purpose.
Bascially it can be used in two modes:
* Create a Timer without any device info. vtkm::cont::Timer time;
* It would enable timers for all enabled devices on the machine. Users can get a
specific elapsed time by passing a device id into the GetElapsedtime function.
If no device is provided, it would pick the maximum of all timer results - the
logic behind this decision is that if cuda is disabled, openmp, serial and tbb
roughly give the same results; if cuda is enabled it's safe to return the
maximum elapsed time since users are more interested in the device execution
time rather than the kernal launch time. The Ready function can be handy here
to query the status of the timer.
* Create a Timer with a device id. vtkm::cont::Timer time((vtkm::cont::DeviceAdapterTagCuda()));
* It works as the old timer that times for a specific device id.
The previous implementation of DeviceAdapterRuntimeDetector caused
multiple differing definitions of the same class to exist and
was causing the runtime device tracker to report CUDA as disabled
when it actually was enabled.
The ODR was caused by having a default implementation for
DeviceAdapterRuntimeDetector and a specific specialization for
CUDA. If a library had both CUDA and C++ sources it would pick up
both implementations and would have undefined behavior. In general
it would think the CUDA backend was disabled.
To avoid this kind of situation in the future I have reworked VTK-m
so that each device adapter must implement DeviceAdapterRuntimeDetector
for that device.
bdb9c37e update based on issues pointed out by Robert
a713a0d8 Generalize and documentation for DeviceAdapterAlgorithm::Transform
29232c49 Revert un-intended change to examples
7ef956a9 Merge branch 'master' into connected_component
a9ed1ecf add CMakeLists.txt for header files
ba3cba64 update copyright statements
aa96874e Merge branch 'connected_component' of gitlab.kitware.com:ollielo/vtk-m into connected_component
2f07119e Merge branch 'master' into connected_component
...
Acked-by: Kitware Robot <kwrobot@kitware.com>
Acked-by: Robert Maynard <robert.maynard@kitware.com>
Merge-request: !1044
Generalize DeviceAdapterAlgorithm::Transform to accept input array of different value and storage type.
Add doxygen documentation in DeviceAdapterAlgorithm.h
Sandia National Laboratories recently changed management from the
Sandia Corporation to the National Technology & Engineering Solutions
of Sandia, LLC (NTESS). The copyright statements need to be updated
accordingly.
Redesigns the TBB and Serial backends and the vtkm::exec::Task concept so that
we can re-use the same launching logic for all Worklets, instead of generating
per worlet code. To keep the performance the same the TilingTask now is past
a range of indices to work on, rather than a single index.
Binary size reduction:
WorkletTests_SERIAL old - 19MB
WorkletTests_SERIAL new - 18MB
WorkletTests_TBB old - 39MB
WorkletTests_TBB new - 18MB
libvtkAcceleratorsVTKm old - 48MB
libvtkAcceleratorsVTKm new - 19MB
Change the VTKM_CONT_EXPORT to VTKM_CONT. (Likewise for EXEC and
EXEC_CONT.) Remove the inline from these macros so that they can be
applied to everything, including implementations in a library.
Because inline is not declared in these modifies, you have to add the
keyword to functions and methods where the implementation is not inlined
in the class.
This allows callers to copy a subsection of an array into another array,
without clearing the contents of the destination array if a resize
is required.
The RuntimeDeviceInformation class allows developers to check if a given
device is supported on a machine at runtime. This allows developers to properly
check for CUDA support before running any worklets.