There was an issue where if VTK-m was compiled with CUDA support but then
run on a computer where no CUDA device was available, an inappropriate
exception was thrown (instead of just disabling the device). The
initialization code should now properly check for the existance of a CUDA
device.
The `DeviceAdapter` provides an abstract interface to the accelerator
devices worklets and other algorithms run on. As such, the programmer has
less control about how the device launches each worklet. Each device
adapter has its own configuration parameters and other ways to attempt to
optimize how things are run, but these are always a universal set of
options that are applied to everything run on the device. There is no way
to specify launch parameters for a particular worklet.
To provide this information, VTK-m now supports `Hint`s to the device
adapter. The `DeviceAdapterAlgorithm::Schedule` method takes a templated
argument that is of the type `HintList`. This object contains a template
list of `Hint` types that provide suggestions on how to launch the parallel
execution. The device adapter will pick out hints that pertain to it and
adjust its launching accordingly.
These are called hints rather than, say, directives, because they don't
force the device adapter to do anything. The device adapter is free to
ignore any (and all) hints. The point is that the device adapter can take
into account the information to try to optimize for itself.
A provided hint can be tied to specific device adapters. In this way, an
worklet can further optimize itself. If multiple hints match a device
adapter, the last one in the list will be selected.
The `Worklet` base now has an internal type named `Hints` that points to a
`HintList` that is applied when the worklet is scheduled. Derived worklet
classes can provide hints by simply defining their own `Hints` type.
This commit adds the flag VTKm_ENABLE_GPU_MPI which when enable it
will use GPU AWARE MPI.
- This will only work with GPUs and MPI implementation that supports GPU
AWARE MPI calls.
- Enabling VTKm_ENABLE_GPU_MPI without MPI/GPU support might results in
errors when running VTK-m with DIY/MPI.
- Only the following tests can run with this feature if enabled:
- UnitTestSerializationDataSet
- UnitTestSerializationArrayHandle
Several revisions ago, the ability to use virtual methods in the
execution environment was deprecated. Completely remove this
functionality for the VTK-m 2.0 release.
It is the case that arrays might be deallocated from a device after the
device is closed. This can happen, for example, when an `ArrayHandle` is
declared globally. It gets constructed before VTK-m is initialized. This
is OK as long as you do not otherwise use it until VTK-m is initialized.
However, if you use that `ArrayHandle` to move data to a device and that
data is left on the device when the object closes, then the
`ArrayHandle` will be left holding a reference to invalid device memory
once the device is shut down. This can cause problems when the
`ArrayHandle` destructs itself and attempts to release this memory.
The VTK-m devices should gracefully handle deallocations that happen
after device shutdown.
Deprecate `VirtualObjectHandle` and all other classes that are used to
implement objects with virtual methods in the execution environment.
Additionally, the code is updated so that if the
`VTKm_NO_DEPRECATED_VIRTUAL` flag is set none of the code is compiled at
all. This opens us up to opportunities that do not work with virtual
methods such as backends that do not support virtual methods and dynamic
libraries for CUDA.
The `Reduce` algorithm is sometimes used to convert an input type to a
different output type. For example, you can compute the min and max at
the same time by making the output of the binary functor a pair of the
input type. However, for this to work with the CUDA algorithm, you have
to be able to also convert the input type to the output type. This was
previously done by treating the binary operator as also a unary
operator. That's fine for custom operators, but if you are using
something like `thrust::plus`, it has no unary operation. (Why would
it?)
So, detect whether the operator has a unary operation. If it does, use
it to cast from the input portal to the output type. If it does not,
just use `static_cast`. Thus, the operator only has to have the unary
operation if `static_cast` does not work.
This class was used indirectly by the old `ArrayHandle`, through
`ArrayHandleTransfer`, to move data to and from a device. This
functionality has been replaced in the new `ArrayHandle`s through the
`Buffer` class (which can be compiled into libraries rather than make
every translation unit compile their own template).
This commit removes `ArrayManagerExecution` and all the implementations
that the device adapters were required to make. None of this code was in
any use anymore.
When `DeviceAdapterAlgorithm::ScheduleTask` was called directly (i.e.
not through `Schedule`), nothing was added to the log. Adding
`VTKM_LOG_SCOPE` to these methods so that all scheduling is added to the
performance log.
Now that we have the functions in `vtkm/Atomic.h`, we can deprecate (and
eventually remove) the more cumbersome classes `AtomicInterfaceControl`
and `AtomicInterfaceExecution`.
Also reversed the order of the `expected` and `desired` parameters of
`vtkm::AtomicCompareAndSwap`. I think the former order makes more sense
and matches more other implementations (such as `std::atomic` and the
GCC `__atomic` built ins). However, there are still some non-deprecated
classes with similar methods that cannot easily be switched. Thus, it's
better to be inconsistent with most other libraries and consistent with
ourself than to be inconsitent with ourself.
Now that we have atomic free functions (e.g. `vtkm::AtomicAdd()`), we no
longer need special implementations for control and each execution
device. (Well, technically we do have special implementations for each,
but they are handled with compiler directives in the free functions.)
Convert the old atomic interface classes (`AtomicInterfaceControl` and
`AtomicInterfaceExecution`) to use the new atomic free functions. This
will allow us to test the new atomic functions everywhere that atomics
are used in VTK-m.
Once verified, we can deprecate the old atomic interface classes.
When you try to call the `Reduce` operation in the CUDA device adapter
with a sufficently complex interator type, you get a compile error
that says `error: cannot pass an argument with a user-provided
copy-constructor to a device-side kernel launch`.
This appears to be a bug in either nvcc or Thrust. I believe it is
related to the following reported issues:
* https://github.com/thrust/thrust/issues/928
* https://github.com/thrust/thrust/issues/1044
Work around this problem by making a special condition for calling
`Reduce` with an `ArrayHandleMultiplexer` that calls the generic
algorithm in `DeviceAdapterAlgorithmGeneral` instead of the algorithm in
Thrust.
Often when a user gives memory to an `ArrayHandle`, she wants data to be
written into the memory given to be used elsewhere. Previously, the
`Buffer` objects would delete the given buffer as soon as a write buffer
was created elsewhere. That was a problem if a user wants VTK-m to write
results right into a given buffer.
Instead, when a user provides memory, "pin" that memory so that the
`ArrayHandle` never deletes it.
The buffer class encapsulates the movement of raw C arrays between
host and devices.
The `Buffer` class itself is not associated with any device. Instead,
`Buffer` is used in conjunction with a new templated class named
`DeviceAdapterMemoryManager` that can allocate data on a given
device and transfer data as necessary. `DeviceAdapterMemoryManager`
will eventually replace the more complicated device adapter classes
that manage data on a device.
The code in `DeviceAdapterMemoryManager` is actually enclosed in
virtual methods. This allows us to limit the number of classes that
need to be compiled for a device. Rather, the implementation of
`DeviceAdapterMemoryManager` is compiled once with whatever compiler
is necessary, and then the `RuntimeDeviceInformation` is used to
get the correct object instance.
The `ArrayHandleStreaming` class stems from an old research project
experimenting with bringing data from an `ArrayHandle` in parts and
overlapping device transfer and execution. It works, but only in very
limited contexts. Thus, it is not actually used today. Plus, the feature
requires global indexing to be permutated throughout the worklet
dispatching classes of VTK-m for no further reason.
Because it is not really used, there are other more promising approaches
on the horizon, and it makes further scheduling improvements difficult,
we are removing this functionality.
1f1688483 Initial infrastructure to allow WorkletMapField to have 3D scheduling
Acked-by: Kitware Robot <kwrobot@kitware.com>
Acked-by: Kenneth Moreland <kmorel@sandia.gov>
Merge-request: !1938
Marked the old versions of PrepareFor* that do not use tokens as
deprecated and moved all of the code to use the new versions that
require a token. This makes the scope of the execution object more
explicit so that it will be kept while in use and can potentially be
reclaimed afterward.
The old version of ExecutionObject (that only takes a device) is still
supported, but you will get a deprecated warning if that is what is
defined.
Supporing this also included sending vtkm::cont::Token through the
vtkm::cont::arg::Transport mechanism, which was a change that propogated
through a lot of code.
Sometimes the CUDA runtime would not allocate sufficient stack
space for the particle advection code to run. This issue was exposed by
!1737 -- for some reason, once those changes to unrelated filters/worklets
are added to VTK, CUDA allocates less stack and the following tests would
fail:
UnitTestLagrangianFilterCUDA
UnitTestLagrangianStructuresFilterCUDA
UnitTestStreamlineFilterCUDA
UnitTestStreamSurfaceFilterCUDA
These were fixed by increasing the stack size in the particle advection
worklet Run(...) methods.
An RAII helper has been added that will restore the previous stack size
in case an exception is thrown, and the KDTree code has been updated
to use this helper when it adjusts the CUDA stack allocation.
- Use AtomicInterface to implement device-specific atomic operations.
- Remove DeviceAdapterAtomicArrayImplementations.
- Extend supported atomic types to include unsigned 32/64-bit ints.
- Add a static_assert to check that AtomicArray type is supported.
- Add documentation for AtomicArrayExecutionObject, including a CAS
example.
- Add a `T Get(idx)` method to AtomicArrayExecutionObject that does
an atomic load, and update existing CAS usage to use this instead
of `Add(idx, 0)`.