The internal array copy has an optimization to use the device the array
exists on to do the copy. However, if that device is disabled the copy
would fail. This problem has been fixed.
There are some common uses of `ArrayCopy` that don't use basic arrays.
Rather than move away from `ArrayCopy` for these use cases, compile a
special fast path for these cases.
This required some restructuring of the code to get the template
resolution to work correctly.
Rather than require `ArrayCopy` to create special versions of copy for
all arrays, use a precompiled versions. This should speed up compiles,
reduce the amount of code being generated, and require the device
compiler on fewer source files.
There are some cases where you still need to copy arrays that are not
well supported by the precompiled versions in `ArrayCopy`. (It will
always work, but the fallback is very slow.) In this case, you will want
to switch over to `ArrayCopyDevice`, which has the old behavior.