51 KiB
VTK-m 1.4 Release Notes
Table of Contents
- Core
- Remove templates from
ControlSignature
field tags - Worklets can now be specialized for a specific device adapter
- Worklets now support an execution mask
- Redesign VTK-m Runtime Device Tracking
vtkm::cont::Initialize
added to make setting up VTK-m runtime state easier
- Remove templates from
- ArrayHandle
- Add
vtkm::cont::ArrayHandleVirtual
vtkm::cont::ArrayHandleZip
provides a consistent API even with non-writable handlesvtkm::cont::VariantArrayHandle
replacesvtkm::cont::DynamicArrayHandle
vtkm::cont::VariantArrayHandle
CastAndCall supports casting to concrete typesvtkm::cont::VariantArrayHandle::AsVirtual<T>()
performs castingStorageBasic::StealArray()
now provides delete function to new owner
- Add
- Control Environment
vtkm::cont::CellLocatorGeneral
has been addedvtkm::cont::CellLocatorTwoLevelUniformGrid
has been renamed tovtkm::cont::CellLocatorUniformBins
vtkm::cont::Timer
now supports asynchronous and device independent timersvtkm::cont::DeviceAdapterId
construction from strings are now case-insensitivevtkm::cont::Initialize
will only parse known arguments
- Execution Environment
- VTK-m logs details about each CUDA kernel launch
- VTK-m CUDA allocations can have managed memory (cudaMallocManaged) enabled/disabled from C++
- VTK-m CUDA kernel scheduling improved including better defaults, and user customization support
- VTK-m Reduction algorithm now supports differing input and output types
- Added specialized operators for ArrayPortalValueReference
- Worklets and Filters
vtkm::worklet::Invoker
now supports worklets which require a Scatter objectBitFields
are now a support field input/out type for VTK-m worklets- Added a Point Merging worklet
vtkm::filter::CleanGrid
now can do point merging- Added a connected component worklets and filters
- Build
- CMake 3.8+ now required to build VTK-m
- VTK-m now can verify that it installs itself correctly
- VTK-m now requires
CUDA
separable compilation to build - VTK-m provides a
vtkm_filter
CMake target vtkm::cont::CellLocatorBoundingIntervalHierarchy
is compiled intovtkm_cont
- Other
- LodePNG added as a thirdparty package
- Optionparser added as a thirdparty package
- Thirdparty diy now can coexist with external diy
- Merge benchmark executables into a device dependent shared library
- Merge rendering testing executables to a shared library
- Merge worklet testing executables into a device dependent shared library
- VTK-m runtime device detection properly handles busy CUDA devices
Core
Remove templates from ControlSignature
field tags
Previously, several of the ControlSignature
tags had a template to
specify a type list. This was to specify potential valid value types for an
input array. The importance of this typelist was to limit the number of
code paths created when resolving a vtkm::cont::VariantArrayHandle
(formerly a DynamicArrayHandle
). This (potentially) reduced the compile
time, the size of libraries/executables, and errors from unexpected types.
Much has changed since this feature was originally implemented. Since then,
the filter infrastructure has been created, and it is through this that
most dynamic worklet invocations happen. However, since the filter
infrastrcture does its own type resolution (and has its own policies) the
type arguments in ControlSignature
are now of little value.
Script to update code
This update requires changes to just about all code implementing a VTK-m worklet. To facilitate the update of this code to these new changes (not to mention all the code in VTK-m) a script is provided to automatically remove these template parameters from VTK-m code.
This script is at Utilities/Scripts/update-control-signature-tags.sh. It needs to be run in a Unix-compatible shell. It takes a single argument, which is a top level directory to modify files. The script processes all C++ source files recursively from that directory.
Selecting data types for auxiliary filter fields
The main rational for making these changes is that the types of the inputs to worklets is almost always already determined by the calling filter. However, although it is straightforward to specify the type of the "main" (active) scalars in a filter, it is less clear what to do for additional fields if a filter needs a second or third field.
Typically, in the case of a second or third field, it is up to the
DoExecute
method in the filter implementation to apply a policy to that
field. When applying a policy, you give it a policy object (nominally
passed by the user) and a traits of the filter. Generally, the accepted
list of types for a field should be part of the filter's traits. For
example, consider the WarpVector
filter. This filter only works on
Vec
s of size 3, so its traits class looks like this.
template <>
class FilterTraits<WarpVector>
{
public:
// WarpVector can only applies to Float and Double Vec3 arrays
using InputFieldTypeList = vtkm::TypeListTagFieldVec3;
};
However, the WarpVector
filter also requires two fields instead of one.
The first (active) field is handled by its superclass (FilterField
), but
the second (auxiliary) field must be managed in the DoExecute
. Generally,
this can be done by simply applying the policy with the filter traits.
The corner cases
Most of the calls to worklets happen within filter implementations, which have their own way of narrowing down potential types (as previously described). The majority of the remainder either use static types or work with a variety of types.
However, there is a minority of corner cases that require a reduction of
types. Since the type argument of the worklet ControlSignature
arguments
are no longer available, the narrowing of types must be done before the
call to Invoke
.
This narrowing of arguments is not particularly difficult. Such type-unsure
arguments usually come from a VariantArrayHandle
(or something that uses
one). You can select the types from a VariantArrayHandle
simply by using
the ResetTypes
method. For example, say you know that a variant array is
supposed to be a scalar.
dispatcher.Invoke(variantArray.ResetTypes(vtkm::TypeListTagFieldScalar()),
staticArray);
Even more common is to have a vtkm::cont::Field
object. A Field
object
internally holds a VariantArrayHandle
, which is accessible via the
GetData
method.
dispatcher.Invoke(field.GetData().ResetTypes(vtkm::TypeListTagFieldScalar()),
staticArray);
Change in executable size
The whole intention of these template parameters in the first place was to reduce the number of code paths compiled. The hypothesis of this change was that in the current structure the code paths were not being reduced much if at all. If that is true, the size of executables and libraries should not change.
Here is a recording of the library and executable sizes before this change
(using ds -h
).
3.0M libvtkm_cont-1.2.1.dylib
6.2M libvtkm_rendering-1.2.1.dylib
312K Rendering_SERIAL
312K Rendering_TBB
22M Worklets_SERIAL
23M Worklets_TBB
22M UnitTests_vtkm_filter_testing
5.7M UnitTests_vtkm_cont_serial_testing
6.0M UnitTests_vtkm_cont_tbb_testing
7.1M UnitTests_vtkm_cont_testing
After the changes, the executable sizes are as follows.
3.0M libvtkm_cont-1.2.1.dylib
6.0M libvtkm_rendering-1.2.1.dylib
312K Rendering_SERIAL
312K Rendering_TBB
21M Worklets_SERIAL
21M Worklets_TBB
22M UnitTests_vtkm_filter_testing
5.6M UnitTests_vtkm_cont_serial_testing
6.0M UnitTests_vtkm_cont_tbb_testing
7.1M UnitTests_vtkm_cont_testing
As we can see, the built sizes have not changed significantly. (If anything, the build is a little smaller.)
Worklets can now be specialized for a specific device adapter
This change adds an execution signature tag named Device
that passes
a DeviceAdapterTag
to the worklet's parenthesis operator. This allows the
worklet to specialize its operation. This features is available in all
worklets.
The following example shows a worklet that specializes itself for the CUDA device.
struct DeviceSpecificWorklet : vtkm::worklet::WorkletMapField
{
using ControlSignature = void(FieldIn, FieldOut);
using ExecutionSignature = _2(_1, Device);
// Specialization for the Cuda device.
template <typename T>
T operator()(T x, vtkm::cont::DeviceAdapterTagCuda) const
{
// Special cuda implementation
}
// General implementation
template <typename T, typename Device>
T operator()(T x, Device) const
{
// General implementation
}
};
Effect on compile time and binary size
This change necessitated adding a template parameter for the device that
followed at least from the schedule all the way down. This has the
potential for duplicating several of the support methods (like
DoWorkletInvokeFunctor
) that would otherwise have the same type. This is
especially true between the devices that run on the CPU as they should all
be sharing the same portals from ArrayHandle
s. So the question is whether
it causes compile to take longer or cause a significant increase in
binaries.
To informally test, I first ran a clean debug compile on my Windows machine with the serial and tbb devices. The build itself took 3 minutes, 50 seconds. Here is a list of the binary sizes in the bin directory:
kmorel2 0> du -sh *.exe *.dll
200K BenchmarkArrayTransfer_SERIAL.exe
204K BenchmarkArrayTransfer_TBB.exe
424K BenchmarkAtomicArray_SERIAL.exe
424K BenchmarkAtomicArray_TBB.exe
440K BenchmarkCopySpeeds_SERIAL.exe
580K BenchmarkCopySpeeds_TBB.exe
4.1M BenchmarkDeviceAdapter_SERIAL.exe
5.3M BenchmarkDeviceAdapter_TBB.exe
7.9M BenchmarkFieldAlgorithms_SERIAL.exe
7.9M BenchmarkFieldAlgorithms_TBB.exe
22M BenchmarkFilters_SERIAL.exe
22M BenchmarkFilters_TBB.exe
276K BenchmarkRayTracing_SERIAL.exe
276K BenchmarkRayTracing_TBB.exe
4.4M BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M BenchmarkTopologyAlgorithms_TBB.exe
712K Rendering_SERIAL.exe
712K Rendering_TBB.exe
708K UnitTests_vtkm_cont_arg_testing.exe
1.7M UnitTests_vtkm_cont_internal_testing.exe
13M UnitTests_vtkm_cont_serial_testing.exe
14M UnitTests_vtkm_cont_tbb_testing.exe
18M UnitTests_vtkm_cont_testing.exe
13M UnitTests_vtkm_cont_testing_mpi.exe
736K UnitTests_vtkm_exec_arg_testing.exe
136K UnitTests_vtkm_exec_internal_testing.exe
196K UnitTests_vtkm_exec_serial_internal_testing.exe
196K UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M UnitTests_vtkm_exec_testing.exe
83M UnitTests_vtkm_filter_testing.exe
476K UnitTests_vtkm_internal_testing.exe
148K UnitTests_vtkm_interop_internal_testing.exe
1.3M UnitTests_vtkm_interop_testing.exe
2.9M UnitTests_vtkm_io_reader_testing.exe
548K UnitTests_vtkm_io_writer_testing.exe
792K UnitTests_vtkm_rendering_testing.exe
3.7M UnitTests_vtkm_testing.exe
320K UnitTests_vtkm_worklet_internal_testing.exe
65M UnitTests_vtkm_worklet_testing.exe
11M vtkm_cont-1.3.dll
2.1M vtkm_interop-1.3.dll
21M vtkm_rendering-1.3.dll
3.9M vtkm_worklet-1.3.dll
After making the singular change to the Invocation
object to add the
DeviceAdapterTag
as a template parameter (which should cause any extra
compile instances) the compile took 4 minuts and 5 seconds. Here is the
new list of binaries.
kmorel2 0> du -sh *.exe *.dll
200K BenchmarkArrayTransfer_SERIAL.exe
204K BenchmarkArrayTransfer_TBB.exe
424K BenchmarkAtomicArray_SERIAL.exe
424K BenchmarkAtomicArray_TBB.exe
440K BenchmarkCopySpeeds_SERIAL.exe
580K BenchmarkCopySpeeds_TBB.exe
4.1M BenchmarkDeviceAdapter_SERIAL.exe
5.3M BenchmarkDeviceAdapter_TBB.exe
7.9M BenchmarkFieldAlgorithms_SERIAL.exe
7.9M BenchmarkFieldAlgorithms_TBB.exe
22M BenchmarkFilters_SERIAL.exe
22M BenchmarkFilters_TBB.exe
276K BenchmarkRayTracing_SERIAL.exe
276K BenchmarkRayTracing_TBB.exe
4.4M BenchmarkTopologyAlgorithms_SERIAL.exe
4.4M BenchmarkTopologyAlgorithms_TBB.exe
712K Rendering_SERIAL.exe
712K Rendering_TBB.exe
708K UnitTests_vtkm_cont_arg_testing.exe
1.7M UnitTests_vtkm_cont_internal_testing.exe
13M UnitTests_vtkm_cont_serial_testing.exe
14M UnitTests_vtkm_cont_tbb_testing.exe
19M UnitTests_vtkm_cont_testing.exe
13M UnitTests_vtkm_cont_testing_mpi.exe
736K UnitTests_vtkm_exec_arg_testing.exe
136K UnitTests_vtkm_exec_internal_testing.exe
196K UnitTests_vtkm_exec_serial_internal_testing.exe
196K UnitTests_vtkm_exec_tbb_internal_testing.exe
2.0M UnitTests_vtkm_exec_testing.exe
86M UnitTests_vtkm_filter_testing.exe
476K UnitTests_vtkm_internal_testing.exe
148K UnitTests_vtkm_interop_internal_testing.exe
1.3M UnitTests_vtkm_interop_testing.exe
2.9M UnitTests_vtkm_io_reader_testing.exe
548K UnitTests_vtkm_io_writer_testing.exe
792K UnitTests_vtkm_rendering_testing.exe
3.7M UnitTests_vtkm_testing.exe
320K UnitTests_vtkm_worklet_internal_testing.exe
68M UnitTests_vtkm_worklet_testing.exe
11M vtkm_cont-1.3.dll
2.1M vtkm_interop-1.3.dll
21M vtkm_rendering-1.3.dll
3.9M vtkm_worklet-1.3.dll
So far the increase is quite negligible.
Worklets now support an execution mask
There have recently been use cases where it would be helpful to mask out some of the invocations of a worklet. The idea is that when invoking a worklet with a mask array on the input domain, you might implement your worklet more-or-less like the following.
VTKM_EXEC void operator()(bool mask, /* other parameters */)
{
if (mask)
{
// Do interesting stuff
}
}
This works, but what if your mask has mostly false values? In that case, you are spending tons of time loading data to and from memory where fields are stored for no reason.
You could potentially get around this problem by adding a scatter to the worklet. However, that will compress the output arrays to only values that are active in the mask. That is problematic if you want the masked output in the appropriate place in the original arrays. You will have to do some complex (and annoying and possibly expensive) permutations of the output arrays.
Thus, we would like a new feature similar to scatter that instead masks out invocations so that the worklet is simply not run on those outputs.
New Interface
The new "Mask" feature that is similar (and orthogonal) to the existing
"Scatter" feature. Worklet objects now define a MaskType
that provides on
object that manages the selections of which invocations are skipped. The
following Mask objects are defined.
MaskNone
- This removes any mask of the output. All outputs are generated. This is the default if noMaskType
is explicitly defined.MaskSelect
- The user to provides an array that specifies whether each output is created with a 1 to mean that the output should be created an 0 the mean that it should not.MaskIndices
- The user provides an array with a list of indices for all outputs that should be created.
It will be straightforward to implement other versions of masks. (For example, you could make a mask class that selectes every Nth entry.) Those could be made on an as-needed basis.
Implementation
The implementation follows the same basic idea of how scatters are implemented.
Mask Classes
The mask class is required to implement the following items.
ThreadToOutputType
- A type for an array that maps a thread index (an index in the array) to an output index. A reasonable type for this could bevtkm::cont::ArrayHandle<vtkm::Id>
.GetThreadToOutputMap
- Given the range for the output (e.g. the number of items in the output domain), returns an array of typeThreadToOutputType
that is the actual map.GetThreadRange
- Given a range for the output (e.g. the number of items in the output domain), returns the range for the threads (e.g. the number of times the worklet will be invoked).
Dispatching
The vtkm::worklet::internal::DispatcherBase
manages a mask class in
the same way it manages the scatter class. It gets the MaskType
from
the worklet it is templated on. It requires a MaskType
object during
its construction.
Previously the dispatcher (and downstream) had to manage the range and indices of inputs and threads. They now have to also manage a separate output range/index as now all three may be different.
The vtkm::Invocation
is changed to hold the ThreadToOutputMap array from
the mask. It likewises has a templated ChangeThreadToOutputMap
method
added (similar to those already existing for the arrays from a scatter).
This method is used in DispatcherBase::InvokeTransportParameters
to add
the mask's array to the invocation before calling InvokeSchedule
.
Thread Indices
With the addition of masks, the ThreadIndices
classes are changed to
manage the actual output index. Previously, the output index was always the
same as the thread index. However, now these two can be different. The
GetThreadIndices
methods of the worklet base classes have an argument
added that is the portal to the ThreadToOutputMap.
The worklet GetThreadIndices
is called from the Task
classes. These
classes are changed to pass in this additional argument. Since the Task
classes get an Invocation
object from the dispatcher, which contains the
ThreadToOutputMap
, this change is trivial.
Interaction Between Mask and Scatter
Although it seems weird, it should work fine to mix scatters and masks. The scatter will first be applied to the input to generate a (potential) list of output elements. The mask will then be applied to these output elements.
Redesign VTK-m Runtime Device Tracking
The device tracking infrastructure in VTK-m has been redesigned to remove multiple redundant codes paths and to simplify reasoning about around what an instance of RuntimeDeviceTracker will modify.
vtkm::cont::RuntimeDeviceTracker
tracks runtime information on
a per-user thread basis. This is done to allow multiple calling
threads to use different vtk-m backends such as seen in this
example:
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::DeviceAdapterTagOpenMP openmp;
{ // thread 1
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
tracker->ForceDevice(cuda);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
vtkm::cont::Algorithm::Sort(output);
invoke(HeavyTask{}, output);
}
{ // thread 2
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
tracker->ForceDevice(openmp);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
vtkm::cont::Algorithm::Sort(output);
invoke(HeavyTask{}, output);
}
Note: GetGlobalRuntimeDeviceTracker
has ben refactored to be GetRuntimeDeviceTracker
as it always returned a unique instance for each control side thread. This design allows
for different threads to have different runtime device settings. By removing the term Global
from the name it becomes more clear what scope this class has.
While this address the ability for threads to specify what
device they should run on. It doesn't make it easy to toggle
the status of a device in a programmatic way, for example
the following block forces execution to only occur on
cuda
and doesn't restore previous active devices after
{
vtkm::cont::DeviceAdapterTagCuda cuda;
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
tracker->ForceDevice(cuda);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
}
//openmp/tbb/... still inactive
To resolve those issues we have vtkm::cont::ScopedRuntimeDeviceTracker
which
has the same interface as vtkm::cont::RuntimeDeviceTracker
but additionally
resets any per-user thread modifications when it goes out of scope. So by
switching over the previous example to use ScopedRuntimeDeviceTracker
we
correctly restore the threads RuntimeDeviceTracker
state when tracker
goes out of scope.
{
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda);
vtkm::worklet::Invoker invoke;
invoke(LightTask{}, input, output);
}
//openmp/tbb/... are now again active
The vtkm::cont::ScopedRuntimeDeviceTracker
is not limited to forcing
execution to occur on a single device. When constructed it can either force
execution to a device, disable a device or enable a device. These options
also work with the DeviceAdapterTagAny
.
{
//enable all devices
vtkm::cont::DeviceAdapterTagAny any;
vtkm::cont::ScopedRuntimeDeviceTracker tracker(any,
vtkm::cont::RuntimeDeviceTrackerMode::Enable);
...
}
{
//disable only cuda
vtkm::cont::DeviceAdapterTagCuda cuda;
vtkm::cont::ScopedRuntimeDeviceTracker tracker(cuda,
vtkm::cont::RuntimeDeviceTrackerMode::Disable);
...
}
vtkm::cont::Initialize
added to make setting up VTK-m runtime state easier
A new initialization function, vtkm::cont::Initialize
, has been added.
Initialization is not required, but will configure the logging utilities (when
enabled) and allows forcing a device via a -d
or --device
command line
option.
Usage:
#include <vtkm/cont/Initialize.h>
int main(int argc, char *argv[])
{
auto config = vtkm::cont::Initialize(argc, argv);
...
}
ArrayHandle
Add vtkm::cont::ArrayHandleVirtual
Added a new class named vtkm::cont::ArrayHandleVirtual
that allows you to type erase an
ArrayHandle storage type by using virtual calls. This simplification makes
storing Fields
and Coordinates
significantly easier as VTK-m doesn't
need to deduce both the storage and value type when executing worklets.
To construct an vtkm::cont::ArrayHandleVirtual
one can do the following:
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
// constrcut from an array handle
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v(pressure);
// or assign from an array handle
v = constant;
To help maintain performance vtkm::cont::ArrayHandleVirtual
provides a collection of helper
functions/methods to query and cast back to the concrete storage and value type:
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v = constant;
const bool isConstant = vtkm::cont::IsType< decltype(constant) >(v);
if(isConstant)
vtkm::cont::ArrayHandleConstant<vtkm::Float32> t = vtkm::cont::Cast< decltype(constant) >(v);
Lastly, a common operation of calling code using ArrayHandleVirtual
is a desire to construct a new instance
of an existing virtual handle with the same storage type. This can be done by using the NewInstance
method
as seen below
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> v = pressure;
vtkm::cont::ArrayHandleVirtual<vtkm::Float32> newArray = v->NewInstance();
bool isConstant = vtkm::cont::IsType< vtkm::cont::ArrayHandle<vtkm::Float32> >(newArray); //will be true
vtkm::cont::ArrayHandleZip
provides a consistent API even with non-writable handles
Previously vtkm::cont::ArrayHandleZip
could not wrap an implicit handle and provide a consistent experience.
The primary issue was that if you tried to use the PortalType returned by GetPortalControl()
you
would get a compile failure. This would occur as the PortalType returned would try to call Set
on an ImplicitPortal which doesn't have a set method.
Now with this change, the ZipPortal
use SFINAE to determine if Set
and Get
should call the
underlying zipped portals.
vtkm::cont::VariantArrayHandle
replaces vtkm::cont::DynamicArrayHandle
vtkm::cont::ArrayHandleVariant
replaces vtkm::cont::DynamicArrayHandle
as the
primary method for holding onto a type erased vtkm::cont::ArrayHandle
. The major
difference between the two implementations is how they handle the Storage component of
an array handle.
vtkm::contDynamicArrayHandle
approach was to find the fully deduced type of the ArrayHandle
meaning it would check all value and storage types it knew about until it found a match.
This cross product of values and storages would cause significant compilation times when
a DynamicArrayHandle
had multiple storage types.
vtkm::cont::VariantArrayHandle
approach is to only deduce the value type of the ArrayHandle
and return a vtkm::cont::ArrayHandleVirtual
which uses polymorpishm to hide the actual
storage type. This approach allows for better compile times, and for calling code
to always expect an ArrayHandleVirtual
instead of the fully deduced type. This conversion
to ArrayHandleVirtual
is usually done internally within VTK-m when a worklet or filter
is invoked.
In certain cases users of VariantArrayHandle
want to be able to access the concrete
ArrayHandle<T,S>
and not have it wrapped in a ArrayHandleVirtual
. For those occurrences
VariantArrayHandle
provides a collection of helper functions/methods to query and
cast back to the concrete storage and value type:
vtkm::cont::ArrayHandleConstant<vtkm::Float32> constant(42.0f);
vtkm::cont::ArrayHandleVariant v(constant);
const bool isConstant = vtkm::cont::IsType< decltype(constant) >(v);
if(isConstant)
vtkm::cont::ArrayHandleConstant<vtkm::Float32> t = vtkm::cont::Cast< decltype(constant) >(v);
Lastly, a common operation of calling code using VariantArrayHandle
is a desire to construct a new instance
of an existing virtual handle with the same storage type. This can be done by using the NewInstance
method
as seen below:
vtkm::cont::ArrayHandle<vtkm::Float32> pressure;
vtkm::cont::ArrayHandleVariant v(pressure);
vtkm::cont::ArrayHandleVariant newArray = v->NewInstance();
const bool isConstant = vtkm::cont::IsType< decltype(pressure) >(newArray); //will be true
vtkm::cont::VariantArrayHandle
CastAndCall supports casting to concrete types
Previously, the VariantArrayHandle::CastAndCall
(and indirect calls through
vtkm::cont::CastAndCall
) attempted to cast to only
vtkm::cont::ArrayHandleVirtual
with different value types. That worked, but
it meant that whatever was called had to operate through virtual functions.
Under most circumstances, it is worthwhile to also check for some common
storage types that, when encountered, can be accessed much faster. This
change provides the casting to concrete storage types and now uses
vtkm::cont::ArrayHandleVirtual
as a fallback when no concrete storage
type is found.
By default, CastAndCall
checks all the storage types in
VTKM_DEFAULT_STORAGE_LIST_TAG
, which typically contains only the basic
storage. The ArrayHandleVirtual::CastAndCall
method also allows you to
override this behavior by specifying a different type list in the first
argument. If the first argument is a list type, CastAndCall
assumes that
all the types in the list are storage tags. If you pass in
vtkm::ListTagEmpty
, then CastAndCall
will always cast to an
ArrayHandleVirtual
(the previous behavior). Alternately, you can pass in
storage tags that might be likely under the current usage.
As an example, consider the following simple code.
vtkm::cont::VariantArrayHandle array;
// stuff happens
array.CastAndCall(myFunctor);
Previously, myFunctor
would be called with
vtkm::cont::ArrayHandleVirtual<T>
with different type T
s. After this
change, myFunctor
will be called with that and with
vtkm::cont::ArrayHandle<T>
of the same type T
s.
If you want to only call myFunctor
with
vtkm::cont::ArrayHandleVirtual<T>
, then replace the previous line with
array.CastAndCall(vtkm::ListTagEmpty(), myFunctor);
Let's say that additionally using vtkm::cont::ArrayHandleIndex
was also
common. If you want to also specialize for that array, you can do so with
the following line.
array.CastAndCall(vtkm::ListTagBase<vtkm::cont::StorageBasic,
vtkm::cont::ArrayHandleIndex::StorageTag>,
myFunctor);
Note that myFunctor
will be called with
vtkm::cont::ArrayHandle<T,vtkm::cont::ArrayHandleIndex::StorageTag>
, not
vtkm::cont::ArrayHandleIndex
.
vtkm::cont::VariantArrayHandle::AsVirtual<T>()
performs casting
The AsVirtual<T>
method of vtkm::cont::VariantArrayHandle
now works for
any arithmetic type, not just the actual type of the underlying array. This
works by inserting an ArrayHandleCast
between the underlying concrete array
and the new ArrayHandleVirtual
when needed.
StorageBasic::StealArray()
now provides delete function to new owner
Memory that is stolen from VTK-m has to be freed correctly. This is required
as the memory could have been allocated with new
, malloc
or even cudaMallocManaged
.
Previously it was very easy to transfer ownership of memory out of VTK-m and either fail to capture the free function, or ask for it after the transfer operation which would return a nullptr. Now stealing an array also provides the free function reducing one source of memory leaks.
To properly steal memory from VTK-m you do the following:
vtkm::cont::ArrayHandle<T> arrayHandle;
...
auto* stolen = arrayHandle.StealArray();
T* ptr = stolen.first;
auto free_function = stolen.second;
...
free_function(ptr);
Control Environment
vtkm::cont::CellLocatorGeneral
has been added
vtkm::cont::CellLocatorUniformBins
can work with all kinds of datasets, but there are cell
locators that are more efficient for specific data sets. Therefore, a new cell
locator - vtkm::cont::CellLocatorGeneral
has been implemented that can be configured to use
specialized cell locators based on its input data. A "configurator" function object
can be specified using the SetConfigurator()
function. The configurator should
have the following signature:
void (std::unique_ptr<vtkm::cont::CellLocator>&,
const vtkm::cont::DynamicCellSet&,
const vtkm::cont::CoordinateSystem&);
The configurator is invoked whenever the Update
method is called and the input
has changed. The current cell locator is passed in a std::unique_ptr
. Based on
the types of the input cellset and coordinates, and possibly some heuristics on
their values, the current cell locator's parameters can be updated, or a different
cell-locator can be instantiated and transferred to the unique_ptr
. The default
configurator configures a vtkm::cont::CellLocatorUniformGrid
for uniform grid datasets,
a vtkm::cont::CellLocatorRecitlinearGrid
for rectilinear datasets, and
vtkm::cont::CellLocatorUniformBins
for all other dataset types.
The class CellLocatorHelper
that implemented similar functionality to CellLocatorGeneral
has been removed.
vtkm::cont::CellLocatorTwoLevelUniformGrid
has been renamed to vtkm::cont::CellLocatorUniformBins
CellLocatorTwoLevelUniformGrid
has been renamed to CellLocatorUniformBins
for brevity. It has been modified to be a subclass of vtkm::cont::CellLocator
and can be used wherever a CellLocator
is accepted.
vtkm::cont::Timer
now supports asynchronous and device independent timers
vtkm::cont::Timer
can now track execution time on a single device or across all
enabled devices as seen below:
vtkm::cont::Timer tbb_timer{vtkm::cont::DeviceAdaptertagTBB()};
vtkm::cont::Timer all_timer;
all_timer.Start();
tbb_timer.Start();
// Run blocking algorithm on tbb
tbb_timer.Stop();
// Run async-algorithms cuda
all_timer.Stop();
// Do more work
//Now get time for all tbb work, and tbb_cuda work
auto tbb_time = tbb_timer.GetElapsedTime();
auto all_time = tbb_timer.GetElapsedTime();
When Timer
is constructed without an explicit vtkm::cont::DeviceAdapterId
it
will track all device adapters and return the maximum elapsed time over all devices
when GetElapsedTime
is called.
vtkm::cont::DeviceAdapterId
construction from strings are now case-insensitive
You can now construct a vtkm::cont::DeviceAdapterId
from a string no matter
the case of it. The following all will construct the same vtkm::cont::DeviceAdapterId
.
vtkm::cont::DeviceAdapterId id1 = vtkm::cont::make_DeviceAdapterId("cuda");
vtkm::cont::DeviceAdapterId id2 = vtkm::cont::make_DeviceAdapterId("CUDA");
vtkm::cont::DeviceAdapterId id3 = vtkm::cont::make_DeviceAdapterId("Cuda");
auto& tracker = vtkm::cont::GetRuntimeDeviceTracker();
vtkm::cont::DeviceAdapterId id4 = tracker.GetDeviceAdapterId("cuda");
vtkm::cont::DeviceAdapterId id5 = tracker.GetDeviceAdapterId("CUDA");
vtkm::cont::DeviceAdapterId id6 = tracker.GetDeviceAdapterId("Cuda");
vtkm::cont::Initialize
will only parse known arguments
When a library requires reading some command line arguments through a
function like Initialize, it is typical that it will parse through
arguments it supports and then remove those arguments from argc
and
argv
so that the remaining arguments can be parsed by the calling
program. Recent changes to the vtkm::cont::Initialize
function support
that.
Use Case
Say you are creating a simple benchmark where you want to provide a command
line option --size
that allows you to adjust the size of the data that
you are working on. However, you also want to support flags like --device
and -v
that are performed by vtkm::cont::Initialize
. Rather than have
to re-implement all of Initialize
's parsing, you can now first call
Initialize
to handle its arguments and then parse the remaining objects.
The following is a simple (and rather incomplete) example:
int main(int argc, char** argv)
{
vtkm::cont::InitializeResult initResult = vtkm::cont::Initialize(argc, argv);
if ((argc > 1) && (strcmp(argv[1], "--size") == 0))
{
if (argc < 3)
{
std::cerr << "--size option requires a numeric argument" << std::endl;
std::cerr << "USAGE: " << argv[0] << " [options]" << std::endl;
std::cerr << "Options are:" << std::endl;
std::cerr << " --size <number>\tSpecify the size of the data." << std::endl;
std::cerr << initResult.Usage << std::endl;
exit(1);
}
g_size = atoi(argv[2]);
}
std::cout << "Using device: " << initResult.Device.GetName() << std::endl;
Additional Initialize Options
Because vtkm::cont::Initialize
no longer has the assumption that it is responsible
for parsing all arguments, some options have been added to
vtkm::cont::InitializeOptions
to manage these different use cases. The
following options are now supported.
None
A placeholder for having all options off, which is the default. (Same as before this change.)RequireDevice
Issue an error if the device argument is not specified. (Same as before this change.)DefaultAnyDevice
If no device is specified, treat it as if the user gave --device=Any. This means that DeviceAdapterTagUndefined will never be return in the result.AddHelp
Add a help argument. If-h
or--help
is provided, prints a usage statement. Of course, the usage statement will only print out arguments processed by VTK-m.ErrorOnBadOption
If an unknown option is encountered, the program terminates with an error and a usage statement is printed. If this option is not provided, any unknown options are returned inargv
. If this option is used, it is a good idea to useAddHelp
as well.ErrorOnBadArgument
If an extra argument is encountered, the program terminates with an error and a usage statement is printed. If this option is not provided, any unknown arguments are returned inargv
.Strict
If supplied, Initialize treats its own arguments as the only ones supported by the application and provides an error if not followed exactly. This is a convenience option that is a combination ofErrorOnBadOption
,ErrorOnBadArgument
, andAddHelp
.
InitializeResult Changes
The changes in Initialize
have also necessitated the changing of some of
the fields in the InitializeResult
structure. The following fields are
now provided in the InitializeResult
struct.
Device
Returns the device selected in the command line arguments as aDeviceAdapterId
. If no device was selected,DeviceAdapterTagUndefined
is returned. (Same as before this change.)Usage
Returns a string containing the usage for the options recognized byInitialize
. This can be used to build larger usage statements containing options for bothInitialize
and the calling program. See the example above.
Note that the Arguments
field has been removed from InitializeResult
.
This is because the unparsed arguments are now returned in the modified
argc
and argv
, which provides a more complete result than the
Arguments
field did.
Execution Environment
VTK-m logs details about each CUDA kernel launch
The VTK-m logging infrastructure has been extended with a new log level
KernelLaunches
which exists between MemTransfer
and Cast
.
This log level reports the number of blocks, threads per block, and the PTX version of each CUDA kernel launched.
This logging level was primarily introduced to help developers that are
tracking down issues that occur when VTK-m components have been built with
different sm_XX
flags and help people looking to do kernel performance
tuning.
VTK-m CUDA allocations can have managed memory (cudaMallocManaged) enabled/disabled from C++
Previously it was impossible for calling code to explicitly disable cuda managed memory. This can be desirable for projects that know they don't need managed memory and are super performance critical.
const bool usingManagedMemory = vtkm::cont::cuda::internal::CudaAllocator::UsingManagedMemory();
if(usingManagedMemory)
{ //disable managed memory
vtkm::cont::cuda::internal::CudaAllocator::ForceManagedMemoryOff();
}
VTK-m CUDA kernel scheduling improved including better defaults, and user customization support
VTK-m now offers a more GPU aware set of defaults for kernel scheduling. When VTK-m first launches a kernel we do system introspection and determine what GPU's are on the machine and than match this information to a preset table of values. The implementation is designed in a way that allows for VTK-m to offer both specific presets for a given GPU ( V100 ) or for an entire generation of cards ( Pascal ).
Currently VTK-m offers preset tables for the following GPU's:
- Tesla V100
- Tesla P100
If the hardware doesn't match a specific GPU card we than try to find the nearest know hardware generation and use those defaults. Currently we offer defaults for
- Older than Pascal Hardware
- Pascal Hardware
- Volta+ Hardware
Some users have workloads that don't align with the defaults provided by
VTK-m. When that is the cause, it is possible to override the defaults
by binding a custom function to vtkm::cont::cuda::InitScheduleParameters
.
As shown below:
ScheduleParameters CustomScheduleValues(char const* name,
int major,
int minor,
int multiProcessorCount,
int maxThreadsPerMultiProcessor,
int maxThreadsPerBlock)
{
ScheduleParameters params {
64 * multiProcessorCount, //1d blocks
64, //1d threads per block
64 * multiProcessorCount, //2d blocks
{ 8, 8, 1 }, //2d threads per block
64 * multiProcessorCount, //3d blocks
{ 4, 4, 4 } }; //3d threads per block
return params;
}
vtkm::cont::cuda::InitScheduleParameters(&CustomScheduleValues);
VTK-m Reduction algorithm now supports differing input and output types
It is common to want to perform a reduction where the input and output types
are of differing types. A basic example would be when the input is vtkm::UInt8
but the output is vtkm::UInt64
. This has been supported since v1.2, as the input
type can be implicitly convertible to the output type.
What we now support is when the input type is not implicitly convertible to the output type,
such as when the output type is vtkm::Pair< vtkm::UInt64, vtkm::UInt64>
. For this to work
we require that the custom binary operator implements also an operator()
which handles
the unary transformation of input to output.
An example of a custom reduction operator for differing input and output types is:
struct CustomMinAndMax
{
using OutputType = vtkm::Pair<vtkm::Float64, vtkm::Float64>;
VTKM_EXEC_CONT
OutputType operator()(vtkm::Float64 a) const
{
return OutputType(a, a);
}
VTKM_EXEC_CONT
OutputType operator()(vtkm::Float64 a, vtkm::Float64 b) const
{
return OutputType(vtkm::Min(a, b), vtkm::Max(a, b));
}
VTKM_EXEC_CONT
OutputType operator()(const OutputType& a, const OutputType& b) const
{
return OutputType(vtkm::Min(a.first, b.first), vtkm::Max(a.second, b.second));
}
VTKM_EXEC_CONT
OutputType operator()(vtkm::Float64 a, const OutputType& b) const
{
return OutputType(vtkm::Min(a, b.first), vtkm::Max(a, b.second));
}
VTKM_EXEC_CONT
OutputType operator()(const OutputType& a, vtkm::Float64 b) const
{
return OutputType(vtkm::Min(a.first, b), vtkm::Max(a.second, b));
}
};
Added specialized operators for ArrayPortalValueReference
The ArrayPortalValueReference
is supposed to behave just like the value it
encapsulates and does so by automatically converting to the base type when
necessary. However, when it is possible to convert that to something else,
it is possible to get errors about ambiguous overloads. To avoid these, add
specialized versions of the operators to specify which ones should be used.
Also consolidated the CUDA version of an ArrayPortalValueReference
to the
standard one. The two implementations were equivalent and we would like
changes to apply to both.
Worklets and Filters
vtkm::worklet::Invoker
now supports worklets which require a Scatter object
This change allows the Invoker
class to support launching worklets that require
a custom scatter operation. This is done by providing the scatter as the second
argument when launch a worklet with the ()
operator.
The following example shows a scatter being provided with a worklet launch.
struct CheckTopology : vtkm::worklet::WorkletMapPointToCell
{
using ControlSignature = void(CellSetIn cellset, FieldOutCell);
using ExecutionSignature = _2(FromIndices);
using ScatterType = vtkm::worklet::ScatterPermutation<>;
...
};
vtkm::worklet::Ivoker invoke;
invoke( CheckTopology{}, vtkm::worklet::ScatterPermutation{}, cellset, result );
BitFields
are now a support field input/out type for VTK-m worklets
BitFields
are:
- Stored in memory using a contiguous buffer of bits.
- Accessible via portals, a la ArrayHandle.
- Portals operate on individual bits or words.
- Operations may be atomic for safe use from concurrent kernels.
The new BitFieldToUnorderedSet
device algorithm produces an
ArrayHandle containing the indices of all set bits, in no particular
order.
The new AtomicInterface classes provide an abstraction into bitwise atomic operations across control and execution environments and are used to implement the BitPortals.
BitFields may be used as boolean-typed ArrayHandles using the
ArrayHandleBitField adapter. vtkm::cont::ArrayHandleBitField
uses atomic operations to read
and write bits in the BitField, and is safe to use in concurrent code.
For example, a simple worklet that merges two arrays based on a boolean condition:
class ConditionalMergeWorklet : public vtkm::worklet::WorkletMapField
{
public:
using ControlSignature = void(FieldIn cond,
FieldIn trueVals,
FieldIn falseVals,
FieldOut result);
using ExecutionSignature = _4(_1, _2, _3);
template <typename T>
VTKM_EXEC T operator()(bool cond, const T& trueVal, const T& falseVal) const
{
return cond ? trueVal : falseVal;
}
};
BitField bits = ...;
auto condArray = vtkm::cont::make_ArrayHandleBitField(bits);
auto trueArray = vtkm::cont::make_ArrayHandleCounting<vtkm::Id>(20, 2, NUM_BITS);
auto falseArray = vtkm::cont::make_ArrayHandleCounting<vtkm::Id>(13, 2, NUM_BITS);
vtkm::cont::ArrayHandle<vtkm::Id> output;
vtkm::worklet::Invoker invoke( vtkm::cont::DeviceAdaptertagTBB{} );
invoke(ConditionalMergeWorklet{}, condArray, trueArray, falseArray, output);
Added a Point Merging worklet
We have added vtkm::worklet::PointMerge
which uses a virtual grid approach to
identify nearby points. The worklet works by creating a very fine but
sparsely represented locator grid. It then groups points by grid bins and
finds those within a specified radius.
vtkm::filter::CleanGrid
now can do point merging
The CleanGrid
filter has been extended to use vtkm::worklet::PointMerge
to
allow for point merging. The following flags have been added to CleanGrid
to
modify the behavior of point merging.
Set
/GetMergePoints
- a flag to turn on/off the merging of duplicated coincident points. This extra operation will find points spatially located near each other and merge them together.Set
/GetTolerance
- Defines the tolerance used when determining whether two points are considered coincident. If theToleranceIsAbsolute
flag is false (the default), then this tolerance is scaled by the diagonal of the points. This parameter is only used when merge points is on.Set
/GetToleranceIsAbsolute
- When ToleranceIsAbsolute is false (the default) then the tolerance is scaled by the diagonal of the bounds of the dataset. If true, then the tolerance is taken as the actual distance to use. This parameter is only used when merge points is on.Set
/GetFastMerge
- When FastMerge is true (the default), some corners are cut when computing coincident points. The point merge will go faster but the tolerance will not be strictly followed.
Added a connected component worklets and filters
We have added the vtkm::filter::ImageConnectivity
and vtkm::filter::CellSetConnectivity
filters
to identify connected components in DataSets and the corresponding worklets. The ImageConnectivity
identify connected components in vtkm::cont::CellSetStructured
, based on same field value of neighboring
cells. The CellSetConnectivit
y identify connected components based on cell connectivity.
Currently Moore neighborhood (i.e. 8 neighboring pixels for 2D and 27 neighboring pixels
for 3D) is used for ImageConnectivity
. For CellSetConnectivity
, neighborhood is defined
as cells sharing a common edge.
Build
CMake 3.8+ now required to build VTK-m
Historically VTK-m has offered the ability to build a small subset of device adapters with CMake 3.3. As both our primary consumers have moved to CMake 3.8, and HPC machines continue to provide new CMake versions we have decided to simplify our CMake build system by requiring CMake 3.8 everywhere.
VTK-m now can verify that it installs itself correctly
It was a fairly common occurrence of VTK-m to have a broken install tree as it had no easy way to verify that all headers would be installed.
Now VTK-m offers a testing infrastructure that creates a temporary installed
version and compile tests that build against the installed VTK-m version. Currently
we have tests that verify each header listed in VTK-m is installed, users can
compile a custom vtkm::filter
that uses diy, and users can call vtkm::rendering
.
VTK-m now requires CUDA
separable compilation to build
With the introduction of vtkm::cont::ArrayHandleVirtual
and the related infrastructure, vtk-m now
requires that all CUDA code be compiled using separable compilation ( -rdc ).
VTK-m provides a vtkm_filter
CMake target
VTK-m now provides a vtkm_filter
target that contains pre-built components
of filters for consuming projects.
vtkm::cont::CellLocatorBoundingIntervalHierarchy
is compiled into vtkm_cont
All of the methods in CellLocatorBoundingIntervalHierarchy were listed in header files. This is sometimes problematic with virtual methods. Since everything implemented in it can just be embedded in a library, move the code into the vtkm_cont library.
These changes caused some warnings in clang to show up based on virtual methods in other cell locators. Hence, the rest of the cell locators have also had some of their code moved to vtkm_cont.
Other
LodePNG added as a thirdparty package
The lodepng library was brought is an thirdparty library. This has allowed the VTK-m rendering library to have a robust png decode functionality.
Optionparser added as a thirdparty package
Previously we just took the optionparser.h file and stuck it right in our source code. That was problematic for a variety of reasons.
- It incorrectly assigned our license to external code.
- It made lots of unnecessary changes to the original source (like reformatting).
- It made it near impossible to track patches we make and updates to the original software.
Now we use the third-party system to track changes to optionparser.h in the https://gitlab.kitware.com/third-party/optionparser repository.
Thirdparty diy now can coexist with external diy
Previously VTK-m would leak macros that would cause an external diy to be incorrectly mangled breaking consumers of VTK-m that used diy.
Going forward to use diy
from VTK-m all calls must use the vtkmdiy
namespace instead of the diy
namespace. This allows for VTK-m to
properly forward calls to either the external or internal version correctly.
Merge benchmark executables into a device dependent shared library
VTK-m has been updated to replace old per device benchmark executables with a single
multi-device executable. Selection of the device adapter is done at runtime through
the --device=
argument.
Merge rendering testing executables to a shared library
VTK-m has been updated to replace old per device rendering testing executables with a single
multi-device executable. Selection of the device adapter is done at runtime through
the --device=
argument.
Merge worklet testing executables into a device dependent shared library
VTK-m has been updated to replace old per device working testing executables with a single
multi-device executable. Selection of the device adapter is done at runtime through
the --device=
argument.
VTK-m runtime device detection properly handles busy CUDA devices
When an application that uses VTK-m is first launched it will do a check to see if CUDA is supported at runtime. If for some reason that CUDA card is not allowing kernel execution VTK-m would report the hardware doesn't have CUDA support.
This was problematic as was over aggressive in disabling CUDA support for hardware that could support kernel execution in the future. With the fact that every VTK-m worklet is executed through a TryExecute it is no longer necessary to be so aggressive in disabling CUDA support.
Now the behavior is that VTK-m considers a machine to have CUDA runtime support if it has 1+ GPU's of Kepler or higher hardware (SM_30+).