We previously included windows.h in numerous locations, each using different
techniques to guard against bringing in the problematic parts of that file
(the min/max macros, etc.). This change solves the problem by consistently
using vtkm/internal/Windows.h to set everything up.
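For reference, a centralized guard header of this kind typically looks
something like the following sketch; the exact contents of
vtkm/internal/Windows.h may differ.

```cpp
// Typical pattern for a centralized windows.h wrapper (illustrative only).
#ifndef NOMINMAX
#define NOMINMAX             // keep windows.h from defining min()/max() macros
#endif
#ifndef WIN32_LEAN_AND_MEAN
#define WIN32_LEAN_AND_MEAN  // skip rarely used parts of the Windows API
#endif

#include <windows.h>
```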
Change VTKM_CONT_EXPORT to VTKM_CONT. (Likewise for EXEC and EXEC_CONT.)
Remove the inline from these macros so that they can be applied to
everything, including implementations compiled into a library.
Because inline is no longer part of these modifiers, you have to add the
keyword explicitly to functions and methods whose implementations are not
inlined in the class.
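As an illustration of the new convention (the class and methods below are
made up), an out-of-class implementation that should still be header-inlined
now needs an explicit inline:

```cpp
#include <vtkm/Types.h>  // assumed to pull in VTKM_CONT / VTKM_EXEC_CONT

class Widget
{
public:
  VTKM_CONT
  void Update();  // may now be implemented in a .cxx compiled into a library

  VTKM_EXEC_CONT
  vtkm::Float32 Evaluate(vtkm::Float32 x) const;
};

// Implementation kept in the header but outside the class body:
// 'inline' must be written explicitly since the macro no longer provides it.
inline VTKM_EXEC_CONT
vtkm::Float32 Widget::Evaluate(vtkm::Float32 x) const
{
  return 2.0f * x;
}
```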
The benchmarking header files were not listed in the build files. That is
not a huge deal since these files do not need to be installed, but they
should be listed anyway. Changed the vtkm_save_benchmarks CMake macro so
that it can accept a list of headers.
Also moved everything from BenchmarkDeviceAdapter.h to
BenchmarkDeviceAdapter.cxx. Since this code shouldn't need to be
included by anything except this benchmark, there is no need to have it
in a header file. Plus, the build changes would mean that any change in
the header (where most of the source was) could cause all code in this
directory to recompile. I do not want to set that precedent.
These asserts are consolidated into the unified Assert.h. Also made some
minor edits to add asserts where appropriate, along with a little bit of
reorganization where it was found to be needed.
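A minimal usage sketch, assuming the unified macro is VTKM_ASSERT (the
function below is hypothetical):

```cpp
#include <vtkm/Assert.h>
#include <vtkm/Types.h>

void ResizeBuffer(vtkm::Id numberOfValues)
{
  // Like assert, this is expected to compile away when NDEBUG is defined.
  VTKM_ASSERT(numberOfValues >= 0);
  // ...
}
```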
By introducing our own custom Thrust execution policy we can make sure to
hit the fastest code paths in Thrust for the sort operation. This ensures
that for UInt32, Int32, and Float32 we use Thrust's radix sort, which offers
a 2x to 3x speed improvement over the merge sort implementation.
Secondly, by telling Thrust that our BinaryOperators are commutative, we
make sure that we get the fastest code paths when executing Inclusive
and Exclusive Scan.
Benchmark 'Radix Sort on 1048576 random values vtkm::Int32' results:
median = 0.0117049s
median abs dev = 0.00324614s
mean = 0.0167615s
std dev = 0.00786269s
min = 0.00845875s
max = 0.0389063s
Benchmark 'Radix Sort on 1048576 random values vtkm::Float32' results:
median = 0.0234463s
median abs dev = 0.000317249s
mean = 0.021452s
std dev = 0.00470307s
min = 0.011255s
max = 0.0250643s
Benchmark 'Merge Sort on 1048576 random values vtkm::Int32' results:
median = 0.0310486s
median abs dev = 0.000182129s
mean = 0.0286914s
std dev = 0.00634102s
min = 0.0116225s
max = 0.0317379s
Benchmark 'Merge Sort on 1048576 random values vtkm::Float32' results:
median = 0.0310617s
median abs dev = 0.000193583s
mean = 0.0295779s
std dev = 0.00491531s
min = 0.0147257s
max = 0.032307s
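The execution-policy mechanism behind these numbers can be sketched as
follows. None of these names are VTK-m's actual ones; the policy and the
comparator are illustrative stand-ins, and the commutativity hint for scans
is not shown.

```cpp
// Compile with nvcc as a .cu file.
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/iterator/iterator_traits.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Custom policy tag. Deriving from the device policy means any algorithm we
// do not intercept falls through to the normal device back end.
struct my_policy : thrust::device_execution_policy<my_policy> {};

// Stand-in for a portability layer's own less-than functor, which by itself
// would not trigger Thrust's radix-sort fast path.
struct my_less
{
  template <typename T>
  __host__ __device__ bool operator()(const T& a, const T& b) const
  {
    return a < b;
  }
};

// Thrust dispatches thrust::sort(policy, ...) through argument-dependent
// lookup on the policy type, so this overload intercepts sorts issued with
// my_policy and my_less and re-issues them with thrust::less. For primitive
// key types (UInt32, Int32, Float32) Thrust then picks its radix sort rather
// than the slower general merge sort.
template <typename Iterator>
void sort(const my_policy&, Iterator first, Iterator last, my_less)
{
  typedef typename thrust::iterator_traits<Iterator>::value_type T;
  thrust::sort(thrust::device, first, last, thrust::less<T>());
}

int main()
{
  thrust::device_vector<int> keys(1 << 20);
  thrust::sequence(keys.begin(), keys.end());  // fill with test data
  thrust::sort(my_policy(), keys.begin(), keys.end(), my_less());
  return 0;
}
```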
Benchmarker provides its own implementation of is_sorted since this function
was not introduced until C++11 and not all supported compilers provide it.
However, on compilers that do provide it, the standard library's is_sorted
conflicted with the provided one. To get around the problem, specify the
full namespace of the is_sorted being used (which is standard practice in
VTK-m anyway).
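For example, a fully qualified call like the following leaves no room for
the compiler to pick std::is_sorted instead (the namespace shown is
illustrative):

```cpp
#include <vector>

bool CheckSorted(const std::vector<float>& result)
{
  // Hypothetical call site; only the qualification matters here.
  return vtkm::benchmarking::is_sorted(result.begin(), result.end());
}
```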
The storage used will now be aligned to `VTKM_CACHE_LINE_SIZE` bytes,
resulting in slightly better cache usage and load/store performance.
This define is set in `StorageBasic.h`. We also now detect in Configure.h
whether POSIX is available and define VTKM_POSIX to _POSIX_VERSION when it
is.
The AlignedAllocator used by StorageBasic is also STL compatible and can be
used in STL containers, so users can put it in their own std::vector and
pass that aligned user memory to the storage.
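A sketch of the intended usage in user code follows. The allocator's exact
name and namespace are assumptions here; the real declaration lives in
StorageBasic.h.

```cpp
#include <vector>

#include <vtkm/Types.h>
#include <vtkm/cont/StorageBasic.h>

void FillAndHandOff()
{
  // Memory obtained through the allocator is aligned to VTKM_CACHE_LINE_SIZE.
  // (Allocator name/namespace assumed; consult StorageBasic.h.)
  std::vector<vtkm::Float32,
              vtkm::cont::AlignedAllocator<vtkm::Float32> > buffer(1024, 0.0f);

  // buffer.data() can now be handed to the basic storage (or an ArrayHandle
  // built on it) as user-allocated memory without violating its alignment
  // expectations.
}
```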
On one of my compile platforms, GCC was giving conversion warnings from
any boost include that was not wrapped in pragmas to disable conversion
warnings. To make things easier and more robust, I created a pair of
macros, VTKM_BOOST_PRE_INCLUDE and VTKM_BOOST_POST_INCLUDE, that should
be wrapped around any #include of a boost header file.
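The usage pattern is simply to bracket the include; which VTK-m header
defines the macros is assumed here, and the boost header is just an example.

```cpp
#include <vtkm/internal/Configure.h>  // assumed home of the guard macros

VTKM_BOOST_PRE_INCLUDE
#include <boost/static_assert.hpp>
VTKM_BOOST_POST_INCLUDE
```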
- A warm-up run is done but not timed, so any allocation of room for output
  data does not count toward the measured run times. Previously this time
  spent allocating memory was included in the time we measured for the
  benchmark.
- Benchmarks are run multiple times and we then compute some statistics
about the run time of the benchmark to give a better picture of the
expected run time of the function. To this end we run the benchmark
either 500 times or for 1.5s, whichever comes sooner (though these are
easily changeable). We then perform outlier limiting by Winsorising the
data (similar to how Rust's benchmarking library works) and print out
the median, mean, min and max run times along with the median absolute
deviation and standard deviation.
- Because benchmarks are run many times they can now perform some
  initial setup in the constructor, e.g. filling a test input data array
  with values so that the main benchmark loop runs faster.
- To allow benchmarks to have members of the data type being benchmarked,
  the struct must now be templated on this type, which leads to a bit of
  awkwardness. I've worked around this by adding the `VTKM_MAKE_BENCHMARK`
  and `VTKM_RUN_BENCHMARK` macros. The make macro generates a struct with an
  `operator()` templated on the value type that constructs and returns the
  benchmark functor templated on that type. The run macro then uses this
  generated struct to run the benchmark functor on each type in the type
  list passed. You can also pass arguments to the benchmark functor's
  constructor through the make macro; however, this makes things more
  awkward because the name of the MakeBench struct must be different for
  each variation of constructor arguments (for example, see
  `BenchLowerBounds`). A usage sketch is included after this list.
- Added a short comment on how to add benchmarks in
`vtkm/benchmarking/Benchmarker.h` as the new system is a bit different
from how the tests work.
- You can now pass extra arguments when running the benchmark suite to
  benchmark only specific functions, e.g. `Benchmarks_TBB
  BenchmarkDeviceAdapter ScanInclusive Sort` will only benchmark
  ScanInclusive and Sort. Running without any extra arguments will run all
  the benchmarks as before.
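The sketch below illustrates the macro-based registration described above.
The benchmark functor interface shown (a constructor for untimed setup, an
`operator()` returning the measured seconds, and a `Description()` string),
the `ValueTypes` type list, and the exact macro arguments are assumptions
based on this description; the comment in
`vtkm/benchmarking/Benchmarker.h` is the authoritative reference.

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <vector>

#include <vtkm/Types.h>
#include <vtkm/benchmarking/Benchmarker.h>

template <typename Value>
struct BenchCopy
{
  std::vector<Value> Input;
  std::vector<Value> Output;

  // Untimed setup: runs once when the functor is constructed.
  BenchCopy() : Input(1024 * 1024, Value(1)), Output(1024 * 1024) {}

  // The timed body: returns the elapsed run time in seconds.
  // (std::chrono stands in for whatever timer the real benchmarks use.)
  vtkm::Float64 operator()()
  {
    auto start = std::chrono::steady_clock::now();
    std::copy(this->Input.begin(), this->Input.end(), this->Output.begin());
    std::chrono::duration<vtkm::Float64> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
  }

  std::string Description() const { return "Copy 1M values"; }
};

// Generate the MakeBench struct for BenchCopy at namespace scope...
VTKM_MAKE_BENCHMARK(Copy, BenchCopy);

void RunCopyBenchmarks()
{
  // ...and run it over a type list inside a function. ValueTypes is assumed
  // to be a vtkm type list naming the value types to test.
  VTKM_RUN_BENCHMARK(Copy, ValueTypes());
}
```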