
Array.Copy & Buffer.BlockCopy x2 to x3 slower < 1kB #4847

Closed
benaadams opened this issue Dec 20, 2015 · 41 comments
Labels
area-System.Memory enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue

@benaadams (Member)

Array.Copy & Buffer.BlockCopy are up to 3x slower than what can currently be written using managed C# for byte[] copies < 1024 bytes.

As these are the low-level copy APIs, they should be faster than anything that can be written using managed code.

The IllyriadGames/ByteArrayExtension repository adds a VectorizedCopy extension method for byte[] that demonstrates this speed-up, and I assume potentially more gains could be made at the CLR level.
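
For context, the core of such a vectorized copy looks roughly like the following (a sketch of the approach, not the exact code from that repository):

    using System.Numerics;

    public static void VectorizedCopy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        // Argument validation elided. Copy in Vector<byte>.Count-sized chunks
        // (16 bytes with SSE, 32 with AVX2), then finish the tail with a scalar loop.
        while (count >= Vector<byte>.Count)
        {
            new Vector<byte>(src, srcOffset).CopyTo(dst, dstOffset);
            srcOffset += Vector<byte>.Count;
            dstOffset += Vector<byte>.Count;
            count -= Vector<byte>.Count;
        }
        while (count-- > 0)
        {
            dst[dstOffset++] = src[srcOffset++];
        }
    }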

[Performance graph]

As Buffer.BlockCopy is essentially calling memmove, would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?

@benaadams benaadams changed the title Array.Copy & Buffer.BlockCopy x2 to x3 too slow for small copies Array.Copy & Buffer.BlockCopy x2 to x3 too slow < 1kB Dec 20, 2015
@benaadams benaadams changed the title Array.Copy & Buffer.BlockCopy x2 to x3 too slow < 1kB Array.Copy & Buffer.BlockCopy x2 to x3 slower < 1kB Dec 20, 2015
@mikedn (Contributor) commented Dec 20, 2015

Array.Copy & Buffer.BlockCopy are up to 3x slower than what can currently be written using managed C# for byte[] copies < 1024 bytes.

That's a bit of apples and oranges comparison, your version only handles byte[] while the framework versions handle all types of arrays. But yes, there's certainly room for improvement, the JIT could treat Array.Copy and Buffer.BlockCopy as intrinsics and that would allow a number of optimizations to be performed, including generating an inline rep movsb.

Would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?

Those are handled by the JIT, they have nothing to do with Buffer.BlockCopy.

@benaadams (Member, Author)

That's a bit of apples and oranges comparison, your version only handles byte[] while the framework versions handle all types of arrays.

I'm not entirely sure how, unless this is particularly onerous:

    // Size of the Arrays in bytes
    SIZE_T srcLen = src->GetNumComponents() * src->GetComponentSize();
    SIZE_T dstLen = srcLen;

    // We only want to allow arrays of primitives, no Objects.
    const CorElementType srcET = src->GetArrayElementType();
    if (!CorTypeInfo::IsPrimitiveType_NoThrow(srcET))
        FCThrowArgumentVoid(W("src"), W("Arg_MustBePrimArray"));

    if (src != dst) {
        const CorElementType dstET = dst->GetArrayElementType();
        if (!CorTypeInfo::IsPrimitiveType_NoThrow(dstET))
            FCThrowArgumentVoid(W("dest"), W("Arg_MustBePrimArray"));
        dstLen = dst->GetNumComponents() * dst->GetComponentSize();
    }

If that is heavy, could it be moved into a jitted constant (or exception) per type and be essentially eliminated? The CPU vector load and store instructions don't care what the data types are, though there would be some overhead in handling overlapping data, which I don't do.

However that is an identical cost for all calls and doesn't explain the dramatic fall off at 16 bytes; or the slightly smaller fall off at 64 bytes?

Would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?

Those are handled by the JIT, they have nothing to do with Buffer.BlockCopy.

Yes, but my point was that the code that does the copying is a call to memmove

    if ((srcPtr != dstPtr) && (count > 0)) {
        memmove(dstPtr, srcPtr, count);
    }

So anything calling BlockCopy, InternalBlockCopy, memmove and perhaps memcpy would exhibit this behaviour. While it might be a deeper issue with the C Run-Time Library, that's not immediately helpful.

@mikedn (Contributor) commented Dec 20, 2015

I'm not entirely sure how, unless this is particularly onerous

At least for small sizes like 16, yes that piece of code's overhead is high.

If that is heavy it could move into a jitted constant for the type or exception and be essentially eliminated?

I'm not very sure what you are suggesting. One possible solution is to have a generic version and then use code like:

if (typeof(T).IsBlittable) Buffer.__memmove(...); else Array.Copy(...);

where IsBlittable is a hypothetical property that returns true if T is neither a reference type nor a struct that contains reference-type fields. The if could then be eliminated at JIT time, because for T = reference type IsBlittable is always false, and for T = value type the JIT generates specialized code for each T.

However that is an identical cost for all calls and doesn't explain the dramatic fall off at 16 bytes; or the slightly smaller fall off at 64 bytes?

Which calls? Your version has only one managed-managed call. The framework version has one managed-native call and one native-native call.

Yes, but my point was that the code that does the copying is a call to memmove

Are you saying that the problem is in fact caused by memmove? For 16 bytes memmove uses only 2 movdqu instructions. Granted, to get to those 2 instructions it has to execute a few more to check that the count is 16 but still, memmove itself does far less work than BlockCopy.

@mikedn (Contributor) commented Dec 20, 2015

At least for small sizes like 16, yes that piece of code's overhead is high.

Even for larger sizes the BlockCopy code has significant overhead - for 256 bytes 50% of the time is spent in memcpy and 45% of the time is spent in BlockCopy.

@benaadams (Member, Author)

One possible solution is to have a generic version and then use code like

Yeah, essentially the pattern the System.Numerics.Vector uses

/*
* PATTERN:
*    if (typeof(T) == typeof(Int32)) { ... }
*    else if (typeof(T) == typeof(Single)) { ... }
* EXPLANATION:
*    At runtime, each instantiation of Vector<T> will be type-specific, and each of these typeof 
*    blocks will be eliminated, as typeof(T) is a(JIT) compile-time constant for each 
*    instantiation. This design was chosen to eliminate any overhead from delegates and 
*    other patterns.
*/

doesn't explain the dramatic fall off at 16 bytes; or the slightly smaller fall off at 64 bytes?

Which calls?

The graph included in the issue summary (or the graph and full data table on IllyriadGames/ByteArrayExtension) shows the average times and operations per second for lengths 0-96, 508-513, 543, 544, 547, 576, 1024, 4096, 8512 for Array.Copy, BlockCopy and VectorizedCopy (using BenchmarkDotNet to do the benchmarking over 8.5 hours), and shows a heavy drop-off at >16 bytes and a smaller one at >64 bytes; but this isn't exhibited by VectorizedCopy.

I can see there being overhead in BlockCopy but that should be the same overhead for 16 bytes and 17 bytes also for 64 bytes and 65 bytes?

Are you saying that the problem is in fact caused by memmove?

It would suggest to me it was; as the actual BlockCopy cost should always be the same for every array size of the same type? Though I don't think that code is available to look at.

@benaadams (Member, Author)

i.e. 17 bytes takes double the time of 16 bytes, which, if it were alignment alone, should have recovered somewhat by 32 bytes, but doesn't; snip from the table below:

Method Len AvrTime StdDev op/s % Delta vs ArrayCopy
BlockCopy 15 6.0864ns 0.1013ns 164,346,271.10 -2.4%
VectorizedCopy 15 4.6881ns 0.0691ns 213,349,804.77 26.7%
BlockCopy 16 5.9432ns 0.1025ns 168,307,450.63 -2.5%
VectorizedCopy 16 3.8084ns 0.0088ns 262,578,647.13 52.1%
BlockCopy 17 10.7207ns 0.1785ns 93,303,079.33 -0.2%
VectorizedCopy 17 4.1017ns 0.0030ns 243,800,116.81 160.8%
BlockCopy 31 10.8505ns 0.0259ns 92,161,874.17 -2.6%
VectorizedCopy 31 4.8411ns 0.0221ns 206,569,087.60 118.4%
BlockCopy 32 10.4672ns 0.0402ns 95,537,677.55 0.9%
VectorizedCopy 32 2.7518ns 0.0500ns 363,515,306.64 284.1%
BlockCopy 33 10.5894ns 0.1819ns 94,461,058.17 -0.2%
VectorizedCopy 33 4.1079ns 0.0114ns 243,435,478.42 157.1%

At 16 bytes VectorizedCopy is only 52.1% faster than ArrayCopy; but by 32 bytes it's 284.1% faster.

So something is odd; the BlockCopy part of the code that runs should be the same for both byte counts, which would suggest it's something in memmove that goes awry?

At a guess, it's either not taking advantage of wider datatypes or has instruction result interdependence preventing OoOE and pipelining - the data should all be in L1 cache at this point?

@mikedn (Contributor) commented Dec 20, 2015

I see, I was looking at the graph upside down so to speak 😄

VC++'s memcpy (and memmove) implementation has special cases for less than 16 bytes and 16-32 bytes. See the code in "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\crt\src\amd64\memcpy.asm".

@benaadams (Member, Author)

It looks like it's a dependent read causing a pipeline stall for 17-32 and for every higher byte count.

The next step in the XmmCopySmallLoop loop can't be run until the result of add rcx, 16 is known, as it uses rcx; similar to why, in an unrolled loop, you add offsets to the counter and increment it once at the end rather than incrementing it at each step, which is mostly where the speed-up of an unrolled loop comes from.

Whereas the two 16-byte copies in VectorizedCopy are independent, as sketched below.
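
For illustration, a minimal sketch of that independent tail trick (assuming 16 <= count <= 32 with SSE; the method name is made up and the real VectorizedCopy may differ):

    using System.Numerics;

    static void CopyUpTo32(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        // The two loads and stores share no results, so the CPU can issue them
        // back to back instead of waiting on an incremented address register.
        new Vector<byte>(src, srcOffset).CopyTo(dst, dstOffset);
        new Vector<byte>(src, srcOffset + count - Vector<byte>.Count)
            .CopyTo(dst, dstOffset + count - Vector<byte>.Count);
    }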

Is there an open source github repo for memcpy.asm? 😉

@benaadams (Member, Author)

@mikedn looking into this a bit further, it looks like I can extend the performance for the full range (getting 304GBit/s read + 304GBit/s write) for 16kB aligned arrays (assuming a readonly static byte[]'s data is aligned). It's probably all operating in L1 cache at that speed; however it is the same conditions Array.Copy is working with.

I'm just testing with unaligned copies to see what effect that has. memmove/BlockCopy has the same results for unaligned as it does some work upfront to make the copies mostly aligned. So it may need some work there; also overlapping data handling.

I could add an extension for Array.CopyTo for the specific types e.g.

public static void CopyTo(this byte[] src, byte[] array, int index) { }
public static void CopyTo(this char[] src, char[] array, int index) { }
public static void CopyTo(this byte[] sourceArray, int sourceIndex, byte[] destinationArray, int destinationIndex, int length) { }

As they would have a dependency on System.Numerics.Vectors they would need to live further up the stack (e.g. corefx), though I'd imagine that would be the preferred place for new functionality over coreclr anyway?

Higher up the stack I couldn't do Array.Copy itself without Static extension methods https://github.com/dotnet/roslyn/issues/996

Buffer.BlockCopy would be trickier, as it would need static extension methods again, and it would also cause an explosion in method bodies without something like a byte* version of the private unsafe Vector(void* dataPointer, int offset) constructor being made public, plus either a Vector<T>.CopyTo(T* destination, int startIndex) or just Vector<byte>.CopyTo(byte* destination, int startIndex)?

Would this be worth pursuing as a contribution to corefx? Would only be the T[].CopyTo extensions for the reasons mentioned above.

@jkotas (Member) commented Dec 20, 2015

That's a bit of apples and oranges comparison

Agree. To compare apples to apples, it would be more accurate to compare the vectorized implementation with something like:

    public unsafe static void Copy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        ... argument validation ...

        fixed (byte * pSrc = src) fixed (byte* pDst = dst)
            Buffer.MemoryCopy(pSrc + srcOffset, pDst + dstOffset, count, count);
    }

The results will vary in both directions based on a number of dimensions:

  • CPU model.
  • Compiler and C runtime used (VS2013, VS2015, ...)
  • Optimization performed by the C compiler, e.g. you will get much better results for small sizes with .NET Framework vs CoreCLR, even though the C++ code in both is exactly the same. The main factor is that .NET Framework is optimized using PGO, but CoreCLR is not yet.
  • Alignment of memory blocks and state of CPU caches

I am getting pretty different numbers on my config: VS2015 Update 1, Intel Core i7-3520M @ 2.90GHz - the difference is a lot smaller than what you are seeing.

Since Array.Copy and Buffer.MemoryCopy are basically just wrappers around memcpy from C runtime, it is really a perf issue in C runtime, specific to your config.

It is actually a pretty hard problem to build a memcpy with great performance across all dimensions. I have seen an analysis of the problem several years ago, comparing the performance of different strategies - it was a multidimensional cube that was hard to make sense of.

@benaadams (Member, Author)

accurate to compare the vectorized implementation with something like

Interesting, I didn't know Buffer.MemoryCopy was a thing; 4.6 & coreclr? Will retest.

The results will vary in both directions based on number of dimensions:
CPU model.

Yeah :-/ Tis very true

.NET Framework vs CoreCLR

Using .NET Framework 4.6.1 (4.6.01038); RyuJIT, x64, Win 10 (10586.29), VS 14.0.24720 Update 1 - though just using the provided framework, not building it myself.

Also have CPU power saving off and CPU fan locked on full to avoid throttling, and nothing else running on test machine; just letting it do its thing. Using static readonly arrays to minimise GC impact; which will mean it will all quickly be hitting cache.

Agreed that since Buffer.MemoryCopy exists it is now an apples and oranges comparison.

But yes, benchmarks are hard, which is also why I'm using BenchmarkDotNet rather than trying to roll my own. It goes through warm-up cycles for jitting; idles to get a measure of an empty loop; tries to calculate a good number of operations to run to get a good measure; repeats the benchmark 10 times; then starts from scratch 3 times, and from the 30 runs it derives the standard deviation; for the example below that ends up being around 591 billion function executions for the 16kB size - alas it takes a long time though.

// Run, Process: 2 / 3
// IntParam=16384
// Method=AlignedVectorizedCopy 
// BenchmarkDotNet-Dev=v0.8.0.0+
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz, ProcessorCount=8
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit  [RyuJIT]
// Pre-Warmup: 2 op, 0 ms, 3158.1 ns, 8 ticks, 1579.0267 ns/op, 633301.5 op/s
// Pre-Warmup: 2000 op, 1 ms, 1015314.2 ns, 2572 ticks, 507.6571 ns/op, 1969833.6 op/s
// Pre-Warmup: 1970000 op, 771 ms, 770992173.6 ns, 1953082 ticks, 391.3666 ns/op, 2555149.2 op/s
// Pre-Warmup: 3940000 op, 1551.9 ms, 1551853659 ns, 3931165 ticks, 393.8715 ns/op, 2538899.2 op/s
// Warmup (idle): 3940000 op, 4.4 ms, 4357324.3 ns, 11038 ticks, 1.1059 ns/op, 904224645.8 op/s
// Warmup (idle): 3940000 op, 4.2 ms, 4245608.1 ns, 10755 ticks, 1.0776 ns/op, 928017818.7 op/s
// Warmup (idle): 3940000 op, 4.4 ms, 4384167.7 ns, 11106 ticks, 1.1127 ns/op, 898688244.2 op/s
// IterationCount = 1970000
// Target (idle): 3940000 op, 4.5 ms, 4480488.4 ns, 11350 ticks, 1.1372 ns/op, 879368426.4 op/s
// Target (idle): 3940000 op, 4.2 ms, 4232186.4 ns, 10721 ticks, 1.0742 ns/op, 930960884.2 op/s
// Target (idle): 3940000 op, 4.3 ms, 4265740.7 ns, 10806 ticks, 1.0827 ns/op, 923637945.6 op/s
// Target (idle): 3940000 op, 4.4 ms, 4416932.5 ns, 11189 ticks, 1.121 ns/op, 892021775 op/s
// Target (idle): 3940000 op, 4.3 ms, 4269688.3 ns, 10816 ticks, 1.0837 ns/op, 922783990.4 op/s
// Warmup 1: 3940000 op, 1547.1 ms, 1547107499.4 ns, 3919142 ticks, 392.6669 ns/op, 2546687.9 op/s
// Warmup 2: 3940000 op, 1550.6 ms, 1550621228.6 ns, 3928043 ticks, 393.5587 ns/op, 2540917.1 op/s
// Warmup 3: 3940000 op, 1548.1 ms, 1548139393.3 ns, 3921756 ticks, 392.9288 ns/op, 2544990.5 op/s
// Warmup 4: 3940000 op, 1548.8 ms, 1548824296.2 ns, 3923491 ticks, 393.1026 ns/op, 2543865.1 op/s
// Warmup 5: 3940000 op, 1548.4 ms, 1548367562.7 ns, 3922334 ticks, 392.9867 ns/op, 2544615.4 op/s
Target 1: 3940000 op, 1548 ms, 1548044651.7 ns, 3921516 ticks, 392.9047 ns/op, 2545146.2 op/s
Target 2: 3940000 op, 1547.3 ms, 1547276060.5 ns, 3919569 ticks, 392.7097 ns/op, 2546410.5 op/s
Target 3: 3940000 op, 1546.8 ms, 1546808273.8 ns, 3918384 ticks, 392.5909 ns/op, 2547180.6 op/s
Target 4: 3940000 op, 1549.6 ms, 1549584597.5 ns, 3925417 ticks, 393.2956 ns/op, 2542616.9 op/s
Target 5: 3940000 op, 1547.9 ms, 1547870958.8 ns, 3921076 ticks, 392.8606 ns/op, 2545431.8 op/s
Target 6: 3940000 op, 1546.7 ms, 1546704452.8 ns, 3918121 ticks, 392.5646 ns/op, 2547351.6 op/s
Target 7: 3940000 op, 1548.5 ms, 1548512438.4 ns, 3922701 ticks, 393.0235 ns/op, 2544377.4 op/s
Target 8: 3940000 op, 1547.9 ms, 1547853194.7 ns, 3921031 ticks, 392.8561 ns/op, 2545461 op/s
Target 9: 3940000 op, 1546.6 ms, 1546550892.4 ns, 3917732 ticks, 392.5256 ns/op, 2547604.5 op/s
Target 10: 3940000 op, 1546.9 ms, 1546852881.3 ns, 3918497 ticks, 392.6023 ns/op, 2547107.1 op/s

@benaadams (Member, Author)

Also needs to be RyuJIT x64 for Vectors to have meaning beyond convenience I believe

@mikedn (Contributor) commented Dec 20, 2015

@benaadams

The next step in the XmmCopySmallLoop loop can't be run until the result of add rcx, 16 is known, as it uses rcx;

That has nothing to do with the 17-32 case, that code is for the 32-128 case.

for 16kB aligned arrays. (assuming a readonly static byte[] 's data is aligned).

16kB aligned arrays? Maybe you mean 16 bytes? And what does alignment have to do with static readonly?!

Anyway, I don't understand where you are trying to go with all this. Even if memcpy has some kind of issue, the overhead of Buffer.BlockCopy is very high at small sizes, so what memcpy does is very much irrelevant. For the record, on my machine memcpy consistently exceeds 200mil ops/s when tested in a C++ program while your C# vectorized implementation barely reaches 130mil ops/s (for sizes up to 64 bytes).

@jkotas

Since Array.Copy and Buffer.MemoryCopy are basically just wrappers around memcpy from C runtime, it is really a perf issue in C runtime, specific to your config.

MemoryCopy uses memcpy only for size > 512, you're probably talking about BlockCopy.

@benaadams (Member, Author)

That has nothing to do with the 17-32 case, that code is for the 32-128 case.

Ah my mistake, so it's doing 2x 16-byte copies for 17-32 in Move17to32, one or both of which are unaligned.

16kB aligned arrays?...

16kb arrays, which I assumed to be 16 byte aligned; but probably aren't due to the type and length preamble. Static readonly was to provide a fuller picture of the benchmark and the cpu caching state + GC interplay.

Anyway, I don't understand where you are trying to go with all this ... the overhead of Buffer.BlockCopy is very high at small sizes

Sorry... got a bit side tracked, thank you for following up. Basically a faster Copy for primitive arrays at small sizes; particularly byte[]. If there is any way to reduce Buffer.BlockCopy or Array.Copy overheads for byte[] I'd be very happy.

@mikedn (Contributor) commented Dec 21, 2015

Ah my mistake, so it's doing 2x 16-byte copies for 17-32 in Move17to32, one or both of which are unaligned.

Yup. In my C++ tests the 17-32 is actually faster than the rest and given how the code looks I think it's normal. But this case seems to be new in VC++2015 and the desktop version of .NET uses the VC++2013 CRT, that may explain why you're seeing that sharp decrease at 16 bytes.

16kb arrays, which I assumed to be 16 byte aligned

Why would the length of an array imply anything about its alignment?

Static readonly was to provide a fuller picture of the benchmark and the cpu caching state + GC interplay.

I still don't understand what static readonly has to do with all this. Those modifiers only affect the array reference, not the array itself.

Basically a faster Copy for primitive arrays at small sizes; particularly byte[] . If there is any way to reduce Buffer.BlockCopy or Array.Copy overheads for byte[] I'd be very happy.

Adding another API beyond the existing one doesn't seem such a good idea. That said, generic versions of the existing APIs may be useful and I wouldn't exactly consider them new APIs. Some of the stuff mentioned here has already appeared in other issues. Off the top of my head:

  • I've seen an issue about Array.Resize where AFAIR the lack of a generic version also caused problems
  • I've seen something like IsBlittable mentioned in a request for a way to avoid Array.Clear in collection implementations.

Still, I think that the best thing to do would be to treat these methods as JIT intrinsics, the same way native compilers do. Unfortunately that probably requires more work.

@benaadams (Member, Author)

Could you help me understand the costs? (i.e. what costs is it worth trying to mitigate)

Does the extern call of Buffer.BlockCopy -> FCIMPL5(VOID, Buffer::BlockCopy come with a cost for the transition?

Is 2 x GetArrayElementType() + CorTypeInfo::IsPrimitiveType_NoThrow the main cost?

There are two size multiplications, dst->GetNumComponents() * dst->GetComponentSize(), which for byte would be unnecessary extra work since it's multiplying by 1 and GetComponentSize would return 1.

I assume the other parameter validation isn't more expensive in native than in managed?

I also assume the call to memmove has no extra overhead in this scenario compared to regular C/C++?

@Anderman

Cool @benaadams, it looks like I have the same results. Vector copy is faster.

@Anderman

@benaadams
I created a 50-100% faster implementation:

        public static unsafe void VectorizedCopy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
        {
            if (count > 512 + 64)
            {
                // In-built copy faster for large arrays (vs repeated bounds checks on Vector.ctor?)
                Array.Copy(src, srcOffset, dst, dstOffset, count);
                return;
            }
            var orgCount = count;

            while (count >= Vector<byte>.Count)
            {
                new Vector<byte>(src, srcOffset).CopyTo(dst, dstOffset);
                count -= Vector<byte>.Count;
                srcOffset += Vector<byte>.Count;
                dstOffset += Vector<byte>.Count;
            }
            if (orgCount > Vector<byte>.Count)
            {
                new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
                return;
            }
            if (src == null || dst == null) throw new ArgumentNullException(nameof(src));
            if (count < 0 || srcOffset < 0 || dstOffset < 0) throw new ArgumentOutOfRangeException(nameof(count));
            if (srcOffset + count > src.Length) throw new ArgumentException(nameof(src));
            if (dstOffset + count > dst.Length) throw new ArgumentException(nameof(dst));
            fixed (byte* srcOrigin = src)
            fixed (byte* dstOrigin = dst)
            {
                var pSrc = srcOrigin + srcOffset;
                var pDst = dstOrigin + dstOffset;
                switch (count)
                {
                    case 1:
                        pDst[0] = pSrc[0];
                        return;

                    case 2:
                        *((short*)pDst) = *((short*)pSrc);
                        return;

                    case 3:
                        *((short*)pDst) = *((short*)pSrc);
                        pDst[2] = pSrc[2];
                        return;

                    case 4:
                        *((int*)pDst) = *((int*)pSrc);
                        return;

                    case 5:
                        *((int*)pDst) = *((int*)pSrc);
                        pDst[4] = pSrc[4];
                        return;

                    case 6:
                        *((int*)pDst) = *((int*)pSrc);
                        *((short*)(pDst + 4)) = *((short*)(pSrc + 4));
                        return;

                    case 7:
                        *((int*)pDst) = *((int*)pSrc);
                        *((short*)(pDst + 4)) = *((short*)(pSrc + 4));
                        pDst[6] = pSrc[6];
                        return;

                    case 8:
                        *((long*)pDst) = *((long*)pSrc);
                        return;

                    case 9:
                        *((long*)pDst) = *((long*)pSrc);
                        pDst[8] = pSrc[8];
                        return;

                    case 10:
                        *((long*)pDst) = *((long*)pSrc);
                        *((short*)(pDst + 8)) = *((short*)(pSrc + 8));
                        return;

                    case 11:
                        *((long*)pDst) = *((long*)pSrc);
                        *((short*)(pDst + 8)) = *((short*)(pSrc + 8));
                        pDst[10] = pSrc[10];
                        return;

                    case 12:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        return;

                    case 13:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        pDst[12] = pSrc[12];
                        return;

                    case 14:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        *((short*)(pDst + 12)) = *((short*)(pSrc + 12));
                        return;

                    case 15:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        *((short*)(pDst + 12)) = *((short*)(pSrc + 12));
                        pDst[14] = pSrc[14];
                        return;

                }
            }
        }

@Anderman

Results from the improved vector copy compared with @benaadams' vector copy:
vector copy 1 bytes org:0070 imp:0100 mb/s faster:1,4x
vector copy 2 bytes org:0131 imp:0202 mb/s faster:1,5x
vector copy 3 bytes org:0186 imp:0296 mb/s faster:1,6x
vector copy 4 bytes org:0304 imp:0405 mb/s faster:1,3x
vector copy 5 bytes org:0377 imp:0474 mb/s faster:1,3x
vector copy 6 bytes org:0429 imp:0601 mb/s faster:1,4x
vector copy 7 bytes org:0472 imp:0668 mb/s faster:1,4x
vector copy 8 bytes org:0592 imp:0809 mb/s faster:1,4x
vector copy 9 bytes org:0612 imp:0853 mb/s faster:1,4x
vector copy 10 bytes org:0633 imp:1028 mb/s faster:1,6x
vector copy 11 bytes org:0683 imp:1068 mb/s faster:1,6x
vector copy 12 bytes org:0908 imp:1189 mb/s faster:1,3x
vector copy 13 bytes org:0970 imp:1371 mb/s faster:1,4x
vector copy 14 bytes org:1011 imp:1373 mb/s faster:1,4x
vector copy 15 bytes org:0997 imp:1465 mb/s faster:1,5x
vector copy 16 bytes org:1532 imp:1686 mb/s faster:1,1x
vector copy 17 bytes org:1153 imp:2710 mb/s faster:2,4x
vector copy 31 bytes org:1907 imp:4937 mb/s faster:2,6x
vector copy 32 bytes org:2736 imp:3973 mb/s faster:1,5x
vector copy 48 bytes org:3402 imp:4770 mb/s faster:1,4x
vector copy 64 bytes org:4535 imp:5535 mb/s faster:1,2x
vector copy 96 bytes org:5105 imp:6064 mb/s faster:1,2x
vector copy 128 bytes org:6100 imp:6946 mb/s faster:1,1x
vector copy 256 bytes org:7209 imp:8028 mb/s faster:1,1x
vector copy 512 bytes org:8061 imp:8865 mb/s faster:1,1x
vector copy 576 bytes org:8109 imp:8990 mb/s faster:1,1x

@mikedn (Contributor) commented Dec 22, 2015

Could you help me understand the costs? (i.e. what costs is it worth trying to mitigate)

The best thing to do would be to use a profiler to see what happens. I've done that already so here's a quick overview:

  • There doesn't appear to be any transition cost, but Buffer::BlockCopy's function prolog takes 11.3%. Not surprising considering that it stores 7 registers to the stack - that's 56 bytes of stores in a function that's supposed to copy, say, 16 bytes.
  • The multiplication supposedly takes 3%
  • The first CorTypeInfo::IsPrimitiveType_NoThrow takes 16.4%. That's a lot of time but I think the profiler misattributed the time and this includes the previous GetArrayElementType
  • The various argument checks take around 3% each
  • There's a FC_GC_POLL at the end of the function which takes 1.2%
  • The function epilog takes 7.5%, not surprising given that it needs to restore 4 registers from the stack, that's 32 bytes of loads

Now, profiler's claims should be taken with a grain of salt. But it's pretty certain that in my test (with lengths from 0 to 64) 76% of the time is spent in coreclr.dll, 17% is spent in vcruntime140.dll and 5% of the time is spent in the test executable. The influence of memcpy on the final result is pretty small.

Keep in mind that my test is using CoreCLR compiled with VS2015. You may get different results if you use the desktop version of the CLR because that one is compiled using VS2013 which appears to have a slightly different memcpy implementation.

@benaadams (Member, Author)

Wow, ok - so sticking to managed code is probably best. If I introduced some methods like

public static void BlockCopy<T>(T[] src, int srcOffset, T[] dst, int dstOffset, int count)

Then have them call through to Buffer.Memmove; would that be the best approach, and would it still leave the door open for making them JIT intrinsics in the future? Something along the lines of the sketch below.
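
A sketch of that shape (using the public Buffer.MemoryCopy rather than the internal Buffer.Memmove, showing only the byte[] branch, with validation elided):

    public static unsafe void BlockCopy<T>(T[] src, int srcOffset, T[] dst, int dstOffset, int count)
    {
        // For value-type instantiations typeof(T) is a JIT-time constant,
        // so this check folds away and only one branch survives.
        if (typeof(T) == typeof(byte))
        {
            var s = (byte[])(object)src;
            var d = (byte[])(object)dst;
            fixed (byte* pSrc = s)
            fixed (byte* pDst = d)
            {
                Buffer.MemoryCopy(pSrc + srcOffset, pDst + dstOffset, count, count);
            }
        }
        else
        {
            Array.Copy(src, srcOffset, dst, dstOffset, count);
        }
    }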

I had a look at what could be done with native vectors and intrinsics; but will admit what goes on there is a little exotic for me to confidently make those changes at this time.

using CoreCLR

We are exclusively CoreCLR x64 for production use now; so this is the area I'd most like to target.

@Anderman that is similar to what Buffer.Memmove is doing

@Anderman

@benaadams
Yeah, almost; I made 3 other optimizations:

  • use Vector<byte>.Count directly instead of a var (the compiler makes it the constant 0x10)
  • only do the safety checks if we go down the unsafe path
  • copy the last 16 bytes faster if the array is >16 bytes:
    if (orgCount > Vector<byte>.Count)
    {
        new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
        return;
    }

@redknightlois

@benaadams Calling via p/invoke is very costly for small array sizes, while fixing the array in memory and copying it is not. This is the most optimized routine for aligned buffers we could create for JIT64 (haven't tested it yet on CoreCLR and/or unaligned, but the difference for Intel architectures shouldn't be too different): https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Memory.cs

When we tested back in .NET 4.6 against Buffer.BlockCopy and System.Buffer we got these numbers:

BenchmarkDotNet=v0.7.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit  [AttachedDebugger] [RyuJIT]
Type=MemoryPerf  Mode=Throughput  Platform=X64  Jit=RyuJit  .NET=HostFramework  
Method Param AvrTime StdDev op/s
BlockCopy 8 12.7248 ns 0.1382 ns 78,586,659.62
BlockCopy 16 12.7584 ns 0.0580 ns 78,379,941.88
BlockCopy 32 16.0668 ns 0.2440 ns 62,240,025.46
BlockCopy 64 17.1374 ns 0.0775 ns 58,351,865.82
BlockCopy 128 18.8812 ns 0.0938 ns 52,962,775.62
BlockCopy 256 22.9966 ns 0.0746 ns 43,484,656.70
BlockCopy 512 32.4107 ns 0.2387 ns 30,853,998.95
BlockCopy 1024 58.0768 ns 0.4306 ns 17,218,585.03
BlockCopy 2048 96.3630 ns 0.4208 ns 10,377,425.16
BlockCopy 4096 196.2304 ns 1.5321 ns 5,096,049.22
BlockCopy 65536 3,610.3144 ns 13.4609 ns 276,984.20
BlockCopy 524288 38,735.2434 ns 95.4848 ns 25,816.28
BlockCopy 4194304 946,329.9819 ns 49,105.8813 ns 1,056.71
BlockCopy 16777216 4,424,729.8278 ns 65,275.5097 ns 226.00
BlockCopy 268435456 70,876,158.4771 ns 1,095,159.6552 ns 14.11
Buffers 8 6.9664 ns 0.0333 ns 143,546,460.56
Buffers 16 6.6900 ns 0.0539 ns 149,476,401.82
Buffers 32 11.3233 ns 0.0639 ns 88,313,683.68
Buffers 64 12.4547 ns 0.0340 ns 80,290,922.80
Buffers 128 14.7756 ns 0.0730 ns 67,679,251.38
Buffers 256 19.4144 ns 0.1070 ns 51,508,223.83
Buffers 512 33.7256 ns 0.2513 ns 29,651,045.80
Buffers 1024 56.6925 ns 2.1671 ns 17,639,029.38
Buffers 2048 96.3614 ns 1.2635 ns 10,377,603.52
Buffers 4096 198.9255 ns 2.0715 ns 5,027,007.45
Buffers 65536 3,605.6785 ns 17.5528 ns 277,340.31
Buffers 524288 38,803.8312 ns 125.3552 ns 25,770.65
Buffers 4194304 951,981.8515 ns 39,558.8495 ns 1,050.45
Buffers 16777216 4,411,218.8830 ns 78,520.3563 ns 226.69
Buffers 268435456 73,256,631.6686 ns 2,343,338.8837 ns 13.65
Sparrow 8 5.5286 ns 0.0552 ns 180,877,148.01
Sparrow 16 5.8478 ns 0.0481 ns 171,003,095.06
Sparrow 32 8.1731 ns 0.0411 ns 122,352,316.46
Sparrow 64 9.0440 ns 0.0271 ns 110,571,016.33
Sparrow 128 10.9351 ns 0.1389 ns 91,448,246.08
Sparrow 256 15.7591 ns 0.0665 ns 63,455,298.54
Sparrow 512 26.6757 ns 0.1929 ns 37,487,365.81
Sparrow 1024 39.3902 ns 0.1177 ns 25,387,017.99
Sparrow 2048 70.8186 ns 0.7450 ns 14,120,587.33
Sparrow 4096 133.1484 ns 1.6520 ns 7,510,426.68
Sparrow 65536 2,710.1544 ns 33.9204 ns 368,983.08
Sparrow 524288 57,935.3365 ns 2,430.0156 ns 17,260.62
Sparrow 4194304 639,170.8435 ns 39,031.4148 ns 1,564.54
Sparrow 16777216 3,332,685.4896 ns 83,212.1777 ns 300.06
Sparrow 268435456 54,586,531.5946 ns 1,904,690.8827 ns 18.32

Each platform has a different threshold where it makes sense to just resort to memcpy, for our typical machine 512 bytes was a good tradeoff.

I would agree also with @mikedn that MemoryCopy should be a JIT intrinsic.

@benaadams (Member, Author)

@redknightlois interesting; also I agree on the JIT intrinsic, I just don't have the domain knowledge for it, and I assume that with the dotnet cli work and RC2/RTM approaching, the dotnet team wouldn't have a chance to work on it in the near term.

However, I can do the managed generics contribution for Buffer and Array; I just want to make sure it wouldn't potentially cause a breaking change if they were then to become intrinsics in the future.

Would be good to get in coreclr as it certainly looks like there are lots of people rolling their own copy routines for improved performance; whereas they shouldn't have to.

@mikedn (Contributor) commented Dec 22, 2015

also I agree on the JIT intrinsic

Maybe that's not actually needed. The whole point of a "copy" JIT intrinsic would be that certain cases can be better optimized if the JIT understands what's going on. The most important case that can benefit is copy length being a constant. If the JIT would be more willing to consider implementations such as Buffer.MemoryCopy for inlining then it would discover that the whole inlinee reduces to a single switch case made of a couple of instructions.

However, I can do the managed generics contribution for Buffer and Array.

I'm not sure how you could make generic versions work efficiently without something like IsBlittable.

Would be good to get in coreclr as it certainly looks like there are lots of people rolling their own copy routines for improved performance; whereas they shouldn't have to.

Custom implementations will likely always exist because it's impossible to optimize for every case under the sun. For example, why should I pay the cost of checks for small copy length if I know that copy length will never be smaller than a certain threshold in the application I'm writing?

@redknightlois

@mikedn But the question still lingers: could the JIT, as it currently stands (without a dedicated codepath for Copy), constant-propagate such complex code and prune the code-tree to be able to do that?

If that could be necessary, what's the difference with treating it as an intrinsic? That is probably something only a handful of people can actually give input on, like @CarolEidt, @erozenfeld or @cmckinsey.

@mikedn (Contributor) commented Dec 22, 2015

Could the JIT, as it currently stands (without a dedicated codepath for Copy), constant-propagate such complex code and prune the code-tree to be able to do that?

What complex code? Most of Buffer.MemoryCopy is a giant switch. If you know that the switch expression is constant, it's trivial to get rid of the entire switch and keep only the required case. It already happens:

int a = 3;

switch (a) {
case 0:
    Console.WriteLine(0);
    break;
     ...
case 15:
    Console.WriteLine(15);
    break;
}

generates:

    mov         ecx,3 
    call        000000005EB3D4F0 

If that could be necessary, what's the difference with treating it as an intrinsic?

Not much I'd say. Maybe it could emit a rep movsb in certain cases, that's something that you cannot do otherwise in a managed implementation.

@benaadams (Member, Author)

I'm not sure how you could make generic versions work efficiently without something like IsBlittable

It would do for byte through double, with a fallback to the current methods.

If that could be necessary, what's the difference with treating it as an intrinsic?

SSE -> AVX -> AVX-512 instructions; the Vectors lib is further up the stack in corefx so coreclr couldn't take a dependency on it (also would want to skip the bounds checks) and the wide types aren't native in the base runtime beyond long

@redknightlois

@mikedn Actually I was thinking of cases where you need to accommodate things like "always copy 1033 bytes", which will probably have a few 512-byte copy loops and a few more based on the switch; the typical "< n bytes" case is actually pretty simple for propagation (assuming you don't go overboard with the method size and negate optimization altogether).

@benaadams That is basically what I implied: if you have to implement custom paths for copy in the JIT to cover most cases, there is no difference, just treat it as an intrinsic and do whatever optimizations are available for the target architecture. Say SSE, AVX, AVX-512 or whatever.

@mikedn (Contributor) commented Dec 23, 2015

SSE -> AVX -> AVX-512 instructions; the Vectors lib is further up the stack in corefx so coreclr couldn't take a dependency on it (also would want to skip the bounds checks)

I think you're assigning a slightly different meaning to intrinsic; usually intrinsic means that the function call is replaced with an inline version. What you seem to be suggesting is that the JIT compiler should generate the code for functions like BlockCopy out of thin air. I suppose that could work but it's a bit more work.

and the wide types aren't native in the base runtime beyond long

Hmm, put 2 longs in a struct; the JIT usually copies such structs using SSE instructions. Or put 4 and you get a form of unrolling as well.
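
For illustration, a sketch of that struct trick (the type and method names are made up, and a remainder loop would still be needed for the tail):

    // A 16-byte struct the JIT will typically move with a single SSE load/store pair.
    struct Block16 { public long A; public long B; }

    static unsafe void CopyBlocks16(byte* src, byte* dst, int count)
    {
        // Copy whole 16-byte blocks; the tail would be handled separately.
        while (count >= 16)
        {
            *(Block16*)dst = *(Block16*)src;
            src += 16;
            dst += 16;
            count -= 16;
        }
    }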

also would want to skip the bounds checks

fixed (byte *src = sourceArray)...

Actually I was thinking on cases where you need to accommodate cases like always copy 1033 bytes which will probably have a few loops of 512 bytes copy and a few more based on switch;

With a bit more optimization on the JIT side that could probably work as well. And the additional optimizations could handle other situations unrelated to copying whereas an intrinsic is useful only in this limited circumstance.

@Anderman

@mikedn. Interesting.

So, if I'm correct, the VM will mark the Vector assembly as a special class.

The importer scans every CALL and will try to convert the call to impSIMDIntrinsic

The impSIMDIntrinsic will first find out if the call is to a special class here, here and here

So it must be possible to do the same for Array.CopyTo or Buffer.BlockCopy:

  1. Mark them as special.
  2. Generate the code as a copyBlkDst.

Skipping the bounds check did improve the copy performance for 128 bytes by 50-80%.

@sharwell (Member)

What about something like the following? It assumes the argument checks are placed in a separate method VectorizedCopy, which calls this method after the arguments are checked.

.method public hidebysig static 
    void VectorizedCopyUnsafe (
        uint8[] src,
        int32 srcOffset,
        uint8[] dst,
        int32 dstOffset,
        int32 count
    ) cil managed 
{
    .maxstack 2
    .locals init (
        [0] uint8& pinned,
        [1] uint8& pinned
    )

    ldarg.s count
    brtrue.s IL_0005
    ret

    IL_0005: ldarg.0
    ldc.i4.0
    ldelema [mscorlib]System.Byte
    stloc.0
    ldarg.2
    ldc.i4.0
    ldelema [mscorlib]System.Byte
    stloc.1
    ldloc.1
    ldloc.0
    ldarg.s count
    unaligned.cpblk
    ret
}
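
For readers who don't speak IL, a roughly equivalent C# version would be the following (this uses Unsafe.CopyBlockUnaligned from the System.Runtime.CompilerServices.Unsafe package, which did not exist at the time of this thread; note that, as written, it pins element 0 rather than applying the offsets, matching the IL above):

    using System.Runtime.CompilerServices;

    public static unsafe void VectorizedCopyUnsafe(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        if (count == 0)
            return;

        fixed (byte* pSrc = &src[0])
        fixed (byte* pDst = &dst[0])
        {
            // unaligned. cpblk corresponds to CopyBlockUnaligned.
            Unsafe.CopyBlockUnaligned(pDst, pSrc, (uint)count);
        }
    }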

@mikedn (Contributor) commented Dec 27, 2015

@Anderman

So, if I'm correct, the VM will mark the Vector assembly as a special class...

Yes, but a hypothetical Array.Copy/Buffer.BlockCopy intrinsic isn't related to SIMD intrinsics, even if the implementation uses SIMD. You may want to check Compiler::impIntrinsic.

@sharwell

What about something like the following?

Currently cpblk is converted to a helper call so it's unlikely this would help with small copies. Ultimately this doesn't seem to be very different from what Buffer.MemoryCopy offers.

@mikedn (Contributor) commented Dec 27, 2015

This issue is getting long and IMO a bit unfocused. Let's try to sum it up, we'll see there are actually 2 different related issues.

It's clear that Array.Copy & co. are slower than a native memcpy. How much slower depends on the CLR build. For example, the desktop build does 76 mil 8-byte copies per second while the official CoreCLR rc1-final does only ~48 mil (lack of PGO, I assume). Anyway, VC++2015's memcpy does 268 mil copies, so it's 3.5 times faster than the best CLR time. Even at larger sizes the difference is still rather large - 1.7x for 512-byte copies.

Now, the CLR may end up using no less than 3 different versions of memcpy depending on how it is built (VS2013 CRT, VS2015 CRT, OS CRT) and this can complicate measurements a bit. But the outcome is largely the same: the native memcpy is faster, especially for very small copy sizes. We shouldn't be distracted by things like "there's a drop in performance at 16 bytes" because with or without the drop the performance is still lower.

Issue 1
Where does the difference come from? The CLR uses memcpy as well (at least in the case of blittable types) but before that there are a lot of checks and the cost of those checks is significant for small copy sizes. SIMD won't help here but an intrinsic or managed version of Array.Copy might help because some of those checks could be eliminated. IMO a managed generic version would be best but that too requires some intrinsics, that's what IsBlittable is after all.

Are there workarounds? Well, if you don't mind using unsafe then there's Buffer.MemoryCopy. But is its performance as good as memcpy? It turns out that it isn't and that's the second issue.

Issue 2
Even for size < 16 the performance of Buffer.MemoryCopy is ~1.2x lower than memcpy, even though both use the same technique - a jump table (switch). This may be a JIT issue; in any case this isn't related to SIMD.

Issue 3
And for size = 512 the performance is also lower, around 1.4x. This is where SIMD would fit in; it should work better than the current int/long copy loop.

Hmm, looks like I ended up with 3 different issues :)

@benaadams (Member, Author)

Need to remeasure post dotnet/coreclr#3118

@mikedn (Contributor) commented Feb 14, 2016

Yeah, I love how a change was sneaked in without any reference to an existing long discussion about the affected piece of code 😀

@jkotas (Member) commented Feb 14, 2016

dotnet/coreclr#3118 was @geoffkizer's suggestion that I just helped to happen, so it did not immediately occur to me to link it to this one as I was typing the commit description ... sorry about that. Geoff is analyzing performance of the whole .NET Core server stack.

@nietras (Contributor) commented Mar 16, 2016

@Anderman could you explain how this works:

if (orgCount > Vector<byte>.Count)
{
    new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
    return;
}

I don't understand how this works, when it is not taking offsets into account?

I am redoing the different copies benchmarks (on all JITs) and wanted to include your take too.

As far as I understand Kestrel now uses Buffer.MemoryCopy for byte arrays since this is handled by a special case in the JIT. Has anyone done benchmarks on this?

@jkotas (Member) commented Mar 4, 2017

We should look into fixing this using Span.

@benaadams (Member, Author)

Closing as very out of date, though I don't know the current situation.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Jan 3, 2021