
Array.Copy & Buffer.BlockCopy x2 to x3 slower < 1kB #4847

Closed
benaadams opened this issue Dec 20, 2015 · 41 comments
Labels
area-System.Memory enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue

@benaadams (Member)

Array.Copy & Buffer.BlockCopy are up to 3x slower than what can currently be written using managed C# for byte[] copies < 1024 bytes.

As these are the low-level copy APIs, they should be faster than anything that can be written using managed code.

The IllyriadGames/ByteArrayExtension repository adds a VectorizedCopy extension method for byte[] that demonstrates this speed-up, and I assume potentially more gains could be made at the CLR level.
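
For context, the core of such a vectorized copy looks roughly like the following (a sketch of the approach, not the exact code from that repository):

    using System.Numerics;

    public static void VectorizedCopy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        // Argument validation elided. Copy in Vector<byte>.Count-sized chunks
        // (16 bytes with SSE, 32 with AVX2), then finish the tail with a scalar loop.
        while (count >= Vector<byte>.Count)
        {
            new Vector<byte>(src, srcOffset).CopyTo(dst, dstOffset);
            srcOffset += Vector<byte>.Count;
            dstOffset += Vector<byte>.Count;
            count -= Vector<byte>.Count;
        }
        while (count-- > 0)
        {
            dst[dstOffset++] = src[srcOffset++];
        }
    }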

[Performance graph]

As Buffer.BlockCopy is essentially calling memmove, would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?

@benaadams benaadams changed the title Array.Copy & Buffer.BlockCopy x2 to x3 too slow for small copies Array.Copy & Buffer.BlockCopy x2 to x3 too slow < 1kB Dec 20, 2015
@benaadams benaadams changed the title Array.Copy & Buffer.BlockCopy x2 to x3 too slow < 1kB Array.Copy & Buffer.BlockCopy x2 to x3 slower < 1kB Dec 20, 2015
@mikedn (Contributor) commented Dec 20, 2015

Array.Copy & Buffer.BlockCopy are up to 3x slower than what can currently be written using managed C# for byte[] copies < 1024 bytes.

That's a bit of apples and oranges comparison, your version only handles byte[] while the framework versions handle all types of arrays. But yes, there's certainly room for improvement, the JIT could treat Array.Copy and Buffer.BlockCopy as intrinsics and that would allow a number of optimizations to be performed, including generating an inline rep movsb.

Would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?

Those are handled by the JIT, they have nothing to do with Buffer.BlockCopy.

@benaadams (Member, Author)

That's a bit of apples and oranges comparison, your version only handles byte[] while the framework versions handle all types of arrays.

I'm not entirely sure how, unless this is particularly onerous:

    // Size of the Arrays in bytes
    SIZE_T srcLen = src->GetNumComponents() * src->GetComponentSize();
    SIZE_T dstLen = srcLen;

    // We only want to allow arrays of primitives, no Objects.
    const CorElementType srcET = src->GetArrayElementType();
    if (!CorTypeInfo::IsPrimitiveType_NoThrow(srcET))
        FCThrowArgumentVoid(W("src"), W("Arg_MustBePrimArray"));

    if (src != dst) {
        const CorElementType dstET = dst->GetArrayElementType();
        if (!CorTypeInfo::IsPrimitiveType_NoThrow(dstET))
            FCThrowArgumentVoid(W("dest"), W("Arg_MustBePrimArray"));
        dstLen = dst->GetNumComponents() * dst->GetComponentSize();
    }

If that is heavy, could it be moved into a jitted constant (or exception) per type and be essentially eliminated? The CPU vector load and store instructions don't care what the data types are, though there would be some overhead in handling overlapping data, which I don't do.

However that is an identical cost for all calls and doesn't explain the dramatic fall off at 16 bytes; or the slightly smaller fall off at 64 bytes?

Would this affect other copies, e.g. structs passed as parameters, as they are essentially small copy operations?

Those are handled by the JIT, they have nothing to do with Buffer.BlockCopy.

Yes, but my point was that the code that does the copying is a call to memmove

    if ((srcPtr != dstPtr) && (count > 0)) {
        memmove(dstPtr, srcPtr, count);
    }

So anything calling BlockCopy, InternalBlockCopy, memmove and perhaps memcpy would exhibit this behaviour. While it might be a deeper issue with the C Run-Time Library, that's not immediately helpful.

@mikedn (Contributor) commented Dec 20, 2015

I'm not entirely sure how, unless this is particularly onerous

At least for small sizes like 16, yes that piece of code's overhead is high.

If that is heavy it could move into a jitted constant for the type or exception and be essentially eliminated?

I'm not very sure what you are suggesting. One possible solution is to have a generic version and then use code like:

if (typeof(T).IsBlittable) Buffer.__memmove(...); else Array.Copy(...);

where IsBlittable is a hypothetical property that returns true if T is neither a reference type nor a struct that contains reference-type fields. The if could then be eliminated at JIT time, because for T = reference type IsBlittable is always false, and for T = value type the JIT generates specialized code for each T.

However that is an identical cost for all calls and doesn't explain the dramatic fall off at 16 bytes; or the slightly smaller fall off at 64 bytes?

Which calls? Your version has only one managed-managed call. The framework version has one managed-native call and one native-native call.

Yes, but my point was that the code that does the copying is a call to memmove

Are you saying that the problem is in fact caused by memmove? For 16 bytes memmove uses only 2 movdqu instructions. Granted, to get to those 2 instructions it has to execute a few more to check that the count is 16 but still, memmove itself does far less work than BlockCopy.

@mikedn (Contributor) commented Dec 20, 2015

At least for small sizes like 16, yes that piece of code's overhead is high.

Even for larger sizes the BlockCopy code has significant overhead - for 256 bytes 50% of the time is spent in memcpy and 45% of the time is spent in BlockCopy.

@benaadams (Member, Author)

One possible solution is to have a generic version and then use code like

Yeah, essentially the pattern the System.Numerics.Vector uses

/*
* PATTERN:
*    if (typeof(T) == typeof(Int32)) { ... }
*    else if (typeof(T) == typeof(Single)) { ... }
* EXPLANATION:
*    At runtime, each instantiation of Vector<T> will be type-specific, and each of these typeof 
*    blocks will be eliminated, as typeof(T) is a(JIT) compile-time constant for each 
*    instantiation. This design was chosen to eliminate any overhead from delegates and 
*    other patterns.
*/

doesn't explain the dramatic fall off at 16 bytes; or the slightly smaller fall off at 64 bytes?

Which calls?

The graph included in the issue summary (or the graph and full data table on IllyriadGames/ByteArrayExtension) shows the average times and operations per second for lengths 0-96, 508-513, 543, 544, 547, 576, 1024, 4096, 8512 for Array.Copy, BlockCopy and VectorizedCopy (using BenchmarkDotNet to do the benchmarking over 8.5 hours), and shows a heavy drop-off at >16 bytes and a smaller one at >64 bytes; but this isn't exhibited by VectorizedCopy.

I can see there being overhead in BlockCopy but that should be the same overhead for 16 bytes and 17 bytes also for 64 bytes and 65 bytes?

Are you saying that the problem is in fact caused by memmove?

It would suggest to me it was; as the actual BlockCopy cost should always be the same for every array size of the same type? Though I don't think that code is available to look at.

@benaadams (Member, Author)

i.e. 17 bytes takes double the time of 16 bytes, which, if it were alignment alone, should have recovered somewhat by 32 bytes, but doesn't; snip from the table below:

Method Len AvrTime StdDev op/s % Delta vs ArrayCopy
BlockCopy 15 6.0864ns 0.1013ns 164,346,271.10 -2.4%
VectorizedCopy 15 4.6881ns 0.0691ns 213,349,804.77 26.7%
BlockCopy 16 5.9432ns 0.1025ns 168,307,450.63 -2.5%
VectorizedCopy 16 3.8084ns 0.0088ns 262,578,647.13 52.1%
BlockCopy 17 10.7207ns 0.1785ns 93,303,079.33 -0.2%
VectorizedCopy 17 4.1017ns 0.0030ns 243,800,116.81 160.8%
BlockCopy 31 10.8505ns 0.0259ns 92,161,874.17 -2.6%
VectorizedCopy 31 4.8411ns 0.0221ns 206,569,087.60 118.4%
BlockCopy 32 10.4672ns 0.0402ns 95,537,677.55 0.9%
VectorizedCopy 32 2.7518ns 0.0500ns 363,515,306.64 284.1%
BlockCopy 33 10.5894ns 0.1819ns 94,461,058.17 -0.2%
VectorizedCopy 33 4.1079ns 0.0114ns 243,435,478.42 157.1%

At 16 bytes VectorizedCopy is only 52.1% faster than ArrayCopy; but by 32 bytes it's 284.1% faster.

So something is odd; the BlockCopy part of the code that runs should be the same for both byte counts, which would suggest it's something in memmove that goes awry?

At a guess, it's either not taking advantage of wider datatypes or has instruction result interdependence preventing OoOE and pipelining - the data should all be in L1 cache at this point?

@mikedn (Contributor) commented Dec 20, 2015

I see, I was looking at the graph upside down so to speak 😄

VC++'s memcpy (and memmove) implementation has special cases for less than 16 bytes and 16-32 bytes. See the code in "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\crt\src\amd64\memcpy.asm".

@benaadams (Member, Author)

It looks like it's a dependent read causing a pipeline stall for 17-32 and for every higher byte count.

The next step in the XmmCopySmallLoop loop can't be run until the result of add rcx, 16 is known, as it uses rcx; similar to why, in an unrolled loop, you add offsets to the counter and increment it once at the end rather than incrementing it at each step, which is mostly where the speed-up of an unrolled loop comes from.

Whereas the two 16-byte copies in VectorizedCopy are independent, as sketched below.
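
For illustration, a minimal sketch of that independent tail trick (assuming 16 <= count <= 32 with SSE; the method name is made up and the real VectorizedCopy may differ):

    using System.Numerics;

    static void CopyUpTo32(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        // The two loads and stores share no results, so the CPU can issue them
        // back to back instead of waiting on an incremented address register.
        new Vector<byte>(src, srcOffset).CopyTo(dst, dstOffset);
        new Vector<byte>(src, srcOffset + count - Vector<byte>.Count)
            .CopyTo(dst, dstOffset + count - Vector<byte>.Count);
    }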

Is there an open source github repo for memcpy.asm? 😉

@benaadams (Member, Author)

@mikedn looking into this a bit further, it looks like I can extend the performance for the full range (getting 304GBit/s read + 304GBit/s write) for 16kB aligned arrays (assuming a readonly static byte[]'s data is aligned). It's probably all operating in L1 cache at that speed; however it is the same conditions Array.Copy is working with.

I'm just testing with unaligned copies to see what effect that has. memmove/BlockCopy has the same results for unaligned as it does some work upfront to make the copies mostly aligned. So it may need some work there; also overlapping data handling.

I could add an extension for Array.CopyTo for the specific types e.g.

public static void CopyTo(this byte[] src, byte[] array, int index) { }
public static void CopyTo(this char[] src, char[] array, int index) { }
public static void CopyTo(this byte[] sourceArray, int sourceIndex, byte[] destinationArray, int destinationIndex, int length) { }

As they would have a dependency on System.Numerics.Vectors they would need to live further up the stack (e.g. corefx), though I'd imagine that would be the preferred place for new functionality over coreclr anyway?

Higher up the stack I couldn't do Array.Copy itself without Static extension methods https://github.com/dotnet/roslyn/issues/996

Buffer.BlockCopy would be trickier, as it would need static extension methods again, and it would also cause an explosion in method bodies without something like a byte* version of the private unsafe Vector(void* dataPointer, int offset) constructor being made public, plus either a Vector<T>.CopyTo(T* destination, int startIndex) or just Vector<byte>.CopyTo(byte* destination, int startIndex)?

Would this be worth pursuing as a contribution to corefx? Would only be the T[].CopyTo extensions for the reasons mentioned above.

@jkotas (Member) commented Dec 20, 2015

That's a bit of apples and oranges comparison

Agree. To compare apples to apples, it would be more accurate to compare the vectorized implementation with something like:

    public unsafe static void Copy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        ... argument validation ...

        fixed (byte * pSrc = src) fixed (byte* pDst = dst)
            Buffer.MemoryCopy(pSrc + srcOffset, pDst + dstOffset, count, count);
    }

The results will vary in both directions based on a number of dimensions:

  • CPU model.
  • Compiler and C runtime used (VS2013, VS2015, ...)
  • Optimization performed by the C compiler, e.g. you will get much better results for small sizes with .NET Framework vs CoreCLR, even though the C++ code in both is exactly the same. The main factor is that .NET Framework is optimized using PGO, but CoreCLR is not yet.
  • Alignment of memory blocks and state of CPU caches

I am getting pretty different numbers on my config: VS2015 Update 1, Intel Core i7-3520M @ 2.90GHz - the difference is a lot smaller than what you are seeing.

Since Array.Copy and Buffer.MemoryCopy are basically just wrappers around memcpy from C runtime, it is really a perf issue in C runtime, specific to your config.

It is actually a pretty hard problem to build a memcpy with great performance across all dimensions. I have seen an analysis of the problem several years ago, comparing the performance of different strategies - it was a multidimensional cube that was hard to make sense of.

@benaadams (Member, Author)

accurate to compare the vectorized implementation with something like

Interesting, I didn't know Buffer.MemoryCopy was a thing; 4.6 & coreclr? Will retest.

The results will vary in both directions based on number of dimensions:
CPU model.

Yeah :-/ Tis very true

.NET Framework vs CoreCLR

Using .NET Framework 4.6.1 (4.6.01038); RyuJIT, x64, Win 10 (10586.29), VS 14.0.24720 Update 1 - though just using the provided framework, not building it myself.

Also have CPU power saving off and CPU fan locked on full to avoid throttling, and nothing else running on test machine; just letting it do its thing. Using static readonly arrays to minimise GC impact; which will mean it will all quickly be hitting cache.

Agreed that since Buffer.MemoryCopy exists it is now an apples and oranges comparison.

But yes, benchmarks are hard, which is also why I'm using BenchmarkDotNet rather than trying to roll my own. It goes through warm-up cycles for jitting; idles to get a measure of an empty loop; tries to calculate a good number of operations to run to get a good measure; repeats the benchmark 10 times; then starts from scratch 3 times, and from the 30 runs it derives the standard deviation; for the example below that ends up being around 591 billion function executions for the 16kB size - alas it takes a long time though.

// Run, Process: 2 / 3
// IntParam=16384
// Method=AlignedVectorizedCopy 
// BenchmarkDotNet-Dev=v0.8.0.0+
// OS=Microsoft Windows NT 6.2.9200.0
// Processor=Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz, ProcessorCount=8
// CLR=MS.NET 4.0.30319.42000, Arch=64-bit  [RyuJIT]
// Pre-Warmup: 2 op, 0 ms, 3158.1 ns, 8 ticks, 1579.0267 ns/op, 633301.5 op/s
// Pre-Warmup: 2000 op, 1 ms, 1015314.2 ns, 2572 ticks, 507.6571 ns/op, 1969833.6 op/s
// Pre-Warmup: 1970000 op, 771 ms, 770992173.6 ns, 1953082 ticks, 391.3666 ns/op, 2555149.2 op/s
// Pre-Warmup: 3940000 op, 1551.9 ms, 1551853659 ns, 3931165 ticks, 393.8715 ns/op, 2538899.2 op/s
// Warmup (idle): 3940000 op, 4.4 ms, 4357324.3 ns, 11038 ticks, 1.1059 ns/op, 904224645.8 op/s
// Warmup (idle): 3940000 op, 4.2 ms, 4245608.1 ns, 10755 ticks, 1.0776 ns/op, 928017818.7 op/s
// Warmup (idle): 3940000 op, 4.4 ms, 4384167.7 ns, 11106 ticks, 1.1127 ns/op, 898688244.2 op/s
// IterationCount = 1970000
// Target (idle): 3940000 op, 4.5 ms, 4480488.4 ns, 11350 ticks, 1.1372 ns/op, 879368426.4 op/s
// Target (idle): 3940000 op, 4.2 ms, 4232186.4 ns, 10721 ticks, 1.0742 ns/op, 930960884.2 op/s
// Target (idle): 3940000 op, 4.3 ms, 4265740.7 ns, 10806 ticks, 1.0827 ns/op, 923637945.6 op/s
// Target (idle): 3940000 op, 4.4 ms, 4416932.5 ns, 11189 ticks, 1.121 ns/op, 892021775 op/s
// Target (idle): 3940000 op, 4.3 ms, 4269688.3 ns, 10816 ticks, 1.0837 ns/op, 922783990.4 op/s
// Warmup 1: 3940000 op, 1547.1 ms, 1547107499.4 ns, 3919142 ticks, 392.6669 ns/op, 2546687.9 op/s
// Warmup 2: 3940000 op, 1550.6 ms, 1550621228.6 ns, 3928043 ticks, 393.5587 ns/op, 2540917.1 op/s
// Warmup 3: 3940000 op, 1548.1 ms, 1548139393.3 ns, 3921756 ticks, 392.9288 ns/op, 2544990.5 op/s
// Warmup 4: 3940000 op, 1548.8 ms, 1548824296.2 ns, 3923491 ticks, 393.1026 ns/op, 2543865.1 op/s
// Warmup 5: 3940000 op, 1548.4 ms, 1548367562.7 ns, 3922334 ticks, 392.9867 ns/op, 2544615.4 op/s
Target 1: 3940000 op, 1548 ms, 1548044651.7 ns, 3921516 ticks, 392.9047 ns/op, 2545146.2 op/s
Target 2: 3940000 op, 1547.3 ms, 1547276060.5 ns, 3919569 ticks, 392.7097 ns/op, 2546410.5 op/s
Target 3: 3940000 op, 1546.8 ms, 1546808273.8 ns, 3918384 ticks, 392.5909 ns/op, 2547180.6 op/s
Target 4: 3940000 op, 1549.6 ms, 1549584597.5 ns, 3925417 ticks, 393.2956 ns/op, 2542616.9 op/s
Target 5: 3940000 op, 1547.9 ms, 1547870958.8 ns, 3921076 ticks, 392.8606 ns/op, 2545431.8 op/s
Target 6: 3940000 op, 1546.7 ms, 1546704452.8 ns, 3918121 ticks, 392.5646 ns/op, 2547351.6 op/s
Target 7: 3940000 op, 1548.5 ms, 1548512438.4 ns, 3922701 ticks, 393.0235 ns/op, 2544377.4 op/s
Target 8: 3940000 op, 1547.9 ms, 1547853194.7 ns, 3921031 ticks, 392.8561 ns/op, 2545461 op/s
Target 9: 3940000 op, 1546.6 ms, 1546550892.4 ns, 3917732 ticks, 392.5256 ns/op, 2547604.5 op/s
Target 10: 3940000 op, 1546.9 ms, 1546852881.3 ns, 3918497 ticks, 392.6023 ns/op, 2547107.1 op/s

@benaadams (Member, Author)

Also needs to be RyuJIT x64 for Vectors to have meaning beyond convenience I believe

@mikedn (Contributor) commented Dec 20, 2015

@benaadams

The next step in the XmmCopySmallLoop loop can't be run until the result of add rcx, 16 is known, as it uses rcx;

That has nothing to do with the 17-32 case, that code is for the 32-128 case.

for 16kB aligned arrays. (assuming a readonly static byte[] 's data is aligned).

16kB aligned arrays? Maybe you mean 16 bytes? And what does alignment have to do with static readonly?!

Anyway, I don't understand where you are trying to go with all this. Even if memcpy has some kind of issue, the overhead of Buffer.BlockCopy is very high at small sizes, so what memcpy does is very much irrelevant. For the record, on my machine memcpy consistently exceeds 200mil ops/s when tested in a C++ program while your C# vectorized implementation barely reaches 130mil ops/s (for sizes up to 64 bytes).

@jkotas

Since Array.Copy and Buffer.MemoryCopy are basically just wrappers around memcpy from C runtime, it is really a perf issue in C runtime, specific to your config.

MemoryCopy uses memcpy only for size > 512, you're probably talking about BlockCopy.

@benaadams (Member, Author)

That has nothing to do with the 17-32 case, that code is for the 32-128 case.

Ah my mistake, so it's doing 2x 16-byte copies for 17-32 in Move17to32, one or both of which are unaligned.

16kB aligned arrays?...

16kb arrays, which I assumed to be 16 byte aligned; but probably aren't due to the type and length preamble. Static readonly was to provide a fuller picture of the benchmark and the cpu caching state + GC interplay.

Anyway, I don't understand where you are trying to go with all this ... the overhead of Buffer.BlockCopy is very high at small sizes

Sorry... got a bit side tracked, thank you for following up. Basically a faster Copy for primitive arrays at small sizes; particularly byte[]. If there is any way to reduce Buffer.BlockCopy or Array.Copy overheads for byte[] I'd be very happy.

@mikedn (Contributor) commented Dec 21, 2015

Ah my mistake, so it's doing 2x 16-byte copies for 17-32 in Move17to32, one or both of which are unaligned.

Yup. In my C++ tests the 17-32 is actually faster than the rest and given how the code looks I think it's normal. But this case seems to be new in VC++2015 and the desktop version of .NET uses the VC++2013 CRT, that may explain why you're seeing that sharp decrease at 16 bytes.

16kb arrays, which I assumed to be 16 byte aligned

Why would the length of an array imply anything about its alignment?

Static readonly was to provide a fuller picture of the benchmark and the cpu caching state + GC interplay.

I still don't understand what static readonly has to do with all this. Those modifiers only affect the array reference, not the array itself.

Basically a faster Copy for primitive arrays at small sizes; particularly byte[] . If there is any way to reduce Buffer.BlockCopy or Array.Copy overheads for byte[] I'd be very happy.

Adding another API beyond the existing one doesn't seem such a good idea. That said, generic versions of the existing APIs may be useful and I wouldn't exactly consider them new APIs. Some of the stuff mentioned here has already appeared in other issues. Off the top of my head:

  • I've seen an issue about Array.Resize where AFAIR the lack of a generic version also caused problems
  • I've seen something like IsBlittable mentioned in a request for a way to avoid Array.Clear in collection implementations.

Still, I think that the best thing to do would be to treat these methods as JIT intrinsics, the same way native compilers do. Unfortunately that probably requires more work.

@benaadams (Member, Author)

Could you help me understand the costs? (i.e. what costs is it worth trying to mitigate)

Does the extern call of Buffer.BlockCopy -> FCIMPL5(VOID, Buffer::BlockCopy come with a cost for the transition?

Is 2 x GetArrayElementType() + CorTypeInfo::IsPrimitiveType_NoThrow the main cost?

There are two size multiplications, dst->GetNumComponents() * dst->GetComponentSize(), which for byte would be unnecessary extra work since it's multiplying by 1 and GetComponentSize would return 1.

I assume the other parameter validation isn't more expensive in native than in managed?

I also assume the call to memmove has no extra overhead in this scenario compared to regular C/C++?

@Anderman

Cool @benaadams, it looks like I have the same results. Vector copy is faster.

@Anderman

@benaadams
I created a 50-100% faster implementation:

        public static unsafe void VectorizedCopy(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
        {
            if (count > 512 + 64)
            {
                // In-built copy faster for large arrays (vs repeated bounds checks on Vector.ctor?)
                Array.Copy(src, srcOffset, dst, dstOffset, count);
                return;
            }
            var orgCount = count;

            while (count >= Vector<byte>.Count)
            {
                new Vector<byte>(src, srcOffset).CopyTo(dst, dstOffset);
                count -= Vector<byte>.Count;
                srcOffset += Vector<byte>.Count;
                dstOffset += Vector<byte>.Count;
            }
            if (orgCount > Vector<byte>.Count)
            {
                new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
                return;
            }
            if (src == null || dst == null) throw new ArgumentNullException(nameof(src));
            if (count < 0 || srcOffset < 0 || dstOffset < 0) throw new ArgumentOutOfRangeException(nameof(count));
            if (srcOffset + count > src.Length) throw new ArgumentException(nameof(src));
            if (dstOffset + count > dst.Length) throw new ArgumentException(nameof(dst));
            fixed (byte* srcOrigin = src)
            fixed (byte* dstOrigin = dst)
            {
                var pSrc = srcOrigin + srcOffset;
                var pDst = dstOrigin + dstOffset;
                switch (count)
                {
                    case 1:
                        pDst[0] = pSrc[0];
                        return;

                    case 2:
                        *((short*)pDst) = *((short*)pSrc);
                        return;

                    case 3:
                        *((short*)pDst) = *((short*)pSrc);
                        pDst[2] = pSrc[2];
                        return;

                    case 4:
                        *((int*)pDst) = *((int*)pSrc);
                        return;

                    case 5:
                        *((int*)pDst) = *((int*)pSrc);
                        pDst[4] = pSrc[4];
                        return;

                    case 6:
                        *((int*)pDst) = *((int*)pSrc);
                        *((short*)(pDst + 4)) = *((short*)(pSrc + 4));
                        return;

                    case 7:
                        *((int*)pDst) = *((int*)pSrc);
                        *((short*)(pDst + 4)) = *((short*)(pSrc + 4));
                        pDst[6] = pSrc[6];
                        return;

                    case 8:
                        *((long*)pDst) = *((long*)pSrc);
                        return;

                    case 9:
                        *((long*)pDst) = *((long*)pSrc);
                        pDst[8] = pSrc[8];
                        return;

                    case 10:
                        *((long*)pDst) = *((long*)pSrc);
                        *((short*)(pDst + 8)) = *((short*)(pSrc + 8));
                        return;

                    case 11:
                        *((long*)pDst) = *((long*)pSrc);
                        *((short*)(pDst + 8)) = *((short*)(pSrc + 8));
                        pDst[10] = pSrc[10];
                        return;

                    case 12:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        return;

                    case 13:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        pDst[12] = pSrc[12];
                        return;

                    case 14:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        *((short*)(pDst + 12)) = *((short*)(pSrc + 12));
                        return;

                    case 15:
                        *((long*)pDst) = *((long*)pSrc);
                        *((int*)(pDst + 8)) = *((int*)(pSrc + 8));
                        *((short*)(pDst + 12)) = *((short*)(pSrc + 12));
                        pDst[14] = pSrc[14];
                        return;

                }
            }
        }

@Anderman

Results from the improved vector copy compared with @benaadams' vector copy:
vector copy 1 bytes org:0070 imp:0100 mb/s faster:1,4x
vector copy 2 bytes org:0131 imp:0202 mb/s faster:1,5x
vector copy 3 bytes org:0186 imp:0296 mb/s faster:1,6x
vector copy 4 bytes org:0304 imp:0405 mb/s faster:1,3x
vector copy 5 bytes org:0377 imp:0474 mb/s faster:1,3x
vector copy 6 bytes org:0429 imp:0601 mb/s faster:1,4x
vector copy 7 bytes org:0472 imp:0668 mb/s faster:1,4x
vector copy 8 bytes org:0592 imp:0809 mb/s faster:1,4x
vector copy 9 bytes org:0612 imp:0853 mb/s faster:1,4x
vector copy 10 bytes org:0633 imp:1028 mb/s faster:1,6x
vector copy 11 bytes org:0683 imp:1068 mb/s faster:1,6x
vector copy 12 bytes org:0908 imp:1189 mb/s faster:1,3x
vector copy 13 bytes org:0970 imp:1371 mb/s faster:1,4x
vector copy 14 bytes org:1011 imp:1373 mb/s faster:1,4x
vector copy 15 bytes org:0997 imp:1465 mb/s faster:1,5x
vector copy 16 bytes org:1532 imp:1686 mb/s faster:1,1x
vector copy 17 bytes org:1153 imp:2710 mb/s faster:2,4x
vector copy 31 bytes org:1907 imp:4937 mb/s faster:2,6x
vector copy 32 bytes org:2736 imp:3973 mb/s faster:1,5x
vector copy 48 bytes org:3402 imp:4770 mb/s faster:1,4x
vector copy 64 bytes org:4535 imp:5535 mb/s faster:1,2x
vector copy 96 bytes org:5105 imp:6064 mb/s faster:1,2x
vector copy 128 bytes org:6100 imp:6946 mb/s faster:1,1x
vector copy 256 bytes org:7209 imp:8028 mb/s faster:1,1x
vector copy 512 bytes org:8061 imp:8865 mb/s faster:1,1x
vector copy 576 bytes org:8109 imp:8990 mb/s faster:1,1x

@mikedn (Contributor) commented Dec 22, 2015

Could you help me understand the costs? (i.e. what costs is it worth trying to mitigate)

The best thing to do would be to use a profiler to see what happens. I've done that already so here's a quick overview:

  • There doesn't appear to be any transition cost, but Buffer::BlockCopy's function prolog takes 11.3%. Not surprising considering that it stores 7 registers to the stack - that's 56 bytes of stores in a function that's supposed to copy, say, 16 bytes.
  • The multiplication supposedly takes 3%
  • The first CorTypeInfo::IsPrimitiveType_NoThrow takes 16.4%. That's a lot of time but I think the profiler misattributed the time and this includes the previous GetArrayElementType
  • The various argument checks take around 3% each
  • There's a FC_GC_POLL at the end of the function which takes 1.2%
  • The function epilog takes 7.5%, not surprising given that it needs to restore 4 registers from the stack, that's 32 bytes of loads

Now, profiler's claims should be taken with a grain of salt. But it's pretty certain that in my test (with lengths from 0 to 64) 76% of the time is spent in coreclr.dll, 17% is spent in vcruntime140.dll and 5% of the time is spent in the test executable. The influence of memcpy on the final result is pretty small.

Keep in mind that my test is using CoreCLR compiled with VS2015. You may get different results if you use the desktop version of the CLR because that one is compiled using VS2013 which appears to have a slightly different memcpy implementation.

@benaadams (Member, Author)

Wow, ok - so sticking to managed code is probably best. If I introduced some methods like

public static void BlockCopy<T>(T[] src, int srcOffset, T[] dst, int dstOffset, int count)

Then have them call through to Buffer.Memmove; would that be the best approach, and would it still leave the door open for making them JIT intrinsics in the future? Something along the lines of the sketch below.
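
A sketch of that shape (using the public Buffer.MemoryCopy rather than the internal Buffer.Memmove, showing only the byte[] branch, with validation elided):

    public static unsafe void BlockCopy<T>(T[] src, int srcOffset, T[] dst, int dstOffset, int count)
    {
        // For value-type instantiations typeof(T) is a JIT-time constant,
        // so this check folds away and only one branch survives.
        if (typeof(T) == typeof(byte))
        {
            var s = (byte[])(object)src;
            var d = (byte[])(object)dst;
            fixed (byte* pSrc = s)
            fixed (byte* pDst = d)
            {
                Buffer.MemoryCopy(pSrc + srcOffset, pDst + dstOffset, count, count);
            }
        }
        else
        {
            Array.Copy(src, srcOffset, dst, dstOffset, count);
        }
    }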

I had a look at what could be done with native vectors and intrinsics; but will admit what goes on there is a little exotic for me to confidently make those changes at this time.

using CoreCLR

We are exclusively CoreCLR x64 for production use now; so this is the area I'd most like to target.

@Anderman that is similar to what Buffer.Memmove is doing

@Anderman

@benaadams
Yeah, almost; I made 3 other optimizations:

  • use Vector<byte>.Count directly instead of a var (the compiler makes it the constant 0x10)
  • only do the safety checks if we go down the unsafe path
  • copy the last 16 bytes faster if the array is >16 bytes:
    if (orgCount > Vector<byte>.Count)
    {
        new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
        return;
    }

@redknightlois

@benaadams Calling via p/invoke is very costly for small array sizes, while fixing the array in memory and copying it is not. This is the most optimized routine for aligned buffers we could create for JIT64 (haven't tested it yet on CoreCLR and/or unaligned, but the difference for Intel architectures shouldn't be too different): https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Memory.cs

When we tested back in .NET 4.6 against Buffer.BlockCopy and System.Buffer we got these numbers:

BenchmarkDotNet=v0.7.8.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit  [AttachedDebugger] [RyuJIT]
Type=MemoryPerf  Mode=Throughput  Platform=X64  Jit=RyuJit  .NET=HostFramework  
Method Param AvrTime StdDev op/s
BlockCopy 8 12.7248 ns 0.1382 ns 78,586,659.62
BlockCopy 16 12.7584 ns 0.0580 ns 78,379,941.88
BlockCopy 32 16.0668 ns 0.2440 ns 62,240,025.46
BlockCopy 64 17.1374 ns 0.0775 ns 58,351,865.82
BlockCopy 128 18.8812 ns 0.0938 ns 52,962,775.62
BlockCopy 256 22.9966 ns 0.0746 ns 43,484,656.70
BlockCopy 512 32.4107 ns 0.2387 ns 30,853,998.95
BlockCopy 1024 58.0768 ns 0.4306 ns 17,218,585.03
BlockCopy 2048 96.3630 ns 0.4208 ns 10,377,425.16
BlockCopy 4096 196.2304 ns 1.5321 ns 5,096,049.22
BlockCopy 65536 3,610.3144 ns 13.4609 ns 276,984.20
BlockCopy 524288 38,735.2434 ns 95.4848 ns 25,816.28
BlockCopy 4194304 946,329.9819 ns 49,105.8813 ns 1,056.71
BlockCopy 16777216 4,424,729.8278 ns 65,275.5097 ns 226.00
BlockCopy 268435456 70,876,158.4771 ns 1,095,159.6552 ns 14.11
Buffers 8 6.9664 ns 0.0333 ns 143,546,460.56
Buffers 16 6.6900 ns 0.0539 ns 149,476,401.82
Buffers 32 11.3233 ns 0.0639 ns 88,313,683.68
Buffers 64 12.4547 ns 0.0340 ns 80,290,922.80
Buffers 128 14.7756 ns 0.0730 ns 67,679,251.38
Buffers 256 19.4144 ns 0.1070 ns 51,508,223.83
Buffers 512 33.7256 ns 0.2513 ns 29,651,045.80
Buffers 1024 56.6925 ns 2.1671 ns 17,639,029.38
Buffers 2048 96.3614 ns 1.2635 ns 10,377,603.52
Buffers 4096 198.9255 ns 2.0715 ns 5,027,007.45
Buffers 65536 3,605.6785 ns 17.5528 ns 277,340.31
Buffers 524288 38,803.8312 ns 125.3552 ns 25,770.65
Buffers 4194304 951,981.8515 ns 39,558.8495 ns 1,050.45
Buffers 16777216 4,411,218.8830 ns 78,520.3563 ns 226.69
Buffers 268435456 73,256,631.6686 ns 2,343,338.8837 ns 13.65
Sparrow 8 5.5286 ns 0.0552 ns 180,877,148.01
Sparrow 16 5.8478 ns 0.0481 ns 171,003,095.06
Sparrow 32 8.1731 ns 0.0411 ns 122,352,316.46
Sparrow 64 9.0440 ns 0.0271 ns 110,571,016.33
Sparrow 128 10.9351 ns 0.1389 ns 91,448,246.08
Sparrow 256 15.7591 ns 0.0665 ns 63,455,298.54
Sparrow 512 26.6757 ns 0.1929 ns 37,487,365.81
Sparrow 1024 39.3902 ns 0.1177 ns 25,387,017.99
Sparrow 2048 70.8186 ns 0.7450 ns 14,120,587.33
Sparrow 4096 133.1484 ns 1.6520 ns 7,510,426.68
Sparrow 65536 2,710.1544 ns 33.9204 ns 368,983.08
Sparrow 524288 57,935.3365 ns 2,430.0156 ns 17,260.62
Sparrow 4194304 639,170.8435 ns 39,031.4148 ns 1,564.54
Sparrow 16777216 3,332,685.4896 ns 83,212.1777 ns 300.06
Sparrow 268435456 54,586,531.5946 ns 1,904,690.8827 ns 18.32

Each platform has a different threshold where it makes sense to just resort to memcpy, for our typical machine 512 bytes was a good tradeoff.

I would agree also with @mikedn that MemoryCopy should be a JIT intrinsic.

@benaadams (Member, Author)

@redknightlois interesting; also I agree on the JIT intrinsic, I just don't have the domain knowledge for it, and I assume that with the dotnet cli work and RC2/RTM approaching, the dotnet team wouldn't have a chance to work on it in the near term.

However, I can do the managed generics contribution for Buffer and Array; I just want to make sure it wouldn't potentially cause a breaking change if they were then to become intrinsics in the future.

Would be good to get in coreclr as it certainly looks like there are lots of people rolling their own copy routines for improved performance; whereas they shouldn't have to.

@mikedn (Contributor) commented Dec 22, 2015

also I agree on the JIT intrinsic

Maybe that's not actually needed. The whole point of a "copy" JIT intrinsic would be that certain cases can be better optimized if the JIT understands what's going on. The most important case that can benefit is copy length being a constant. If the JIT would be more willing to consider implementations such as Buffer.MemoryCopy for inlining then it would discover that the whole inlinee reduces to a single switch case made of a couple of instructions.

However, I can do the managed generics contribution for Buffer and Array.

I'm not sure how you could make generic versions work efficiently without something like IsBlittable.

Would be good to get in coreclr as it certainly looks like there are lots of people rolling their own copy routines for improved performance; whereas they shouldn't have to.

Custom implementations will likely always exist because it's impossible to optimize for every case under the sun. For example, why should I pay the cost of checks for small copy length if I know that copy length will never be smaller than a certain threshold in the application I'm writing?

@redknightlois

@mikedn But the question still lingers: could the JIT, as it currently stands (without a dedicated codepath for Copy), constant-propagate such complex code and prune the code-tree to be able to do that?

If that could be necessary, what's the difference with treating it as an intrinsic? That is probably something only a handful of people can actually give input on, like @CarolEidt, @erozenfeld or @cmckinsey.

@mikedn (Contributor) commented Dec 22, 2015

Could the JIT, as it currently stands (without a dedicated codepath for Copy), constant-propagate such complex code and prune the code-tree to be able to do that?

What complex code? Most of Buffer.MemoryCopy is a giant switch. If you know that the switch expression is constant, it's trivial to get rid of the entire switch and keep only the required case. It already happens:

int a = 3;

switch (a) {
case 0:
    Console.WriteLine(0);
    break;
     ...
case 15:
    Console.WriteLine(15);
    break;
}

generates:

    mov         ecx,3 
    call        000000005EB3D4F0 

If that could be necessary, what's the difference with treating it as an intrinsic?

Not much I'd say. Maybe it could emit a rep movsb in certain cases, that's something that you cannot do otherwise in a managed implementation.

@benaadams (Member, Author)

I'm not sure how you could make generic versions work efficiently without something like IsBlittable

It would do for byte through double, with a fallback to the current methods.

If that could be necessary, what's the difference with treating it as an intrinsic?

SSE -> AVX -> AVX-512 instructions; the Vectors lib is further up the stack in corefx so coreclr couldn't take a dependency on it (also would want to skip the bounds checks) and the wide types aren't native in the base runtime beyond long

@redknightlois

@mikedn Actually I was thinking of cases where you need to accommodate things like "always copy 1033 bytes", which will probably have a few 512-byte copy loops and a few more based on the switch; the typical "< n bytes" case is actually pretty simple for propagation (assuming you don't go overboard with the method size and negate optimization altogether).

@benaadams That is basically what I implied: if you have to implement custom paths for copy in the JIT to cover most cases, there is no difference, just treat it as an intrinsic and do whatever optimizations are available for the target architecture. Say SSE, AVX, AVX-512 or whatever.

@mikedn (Contributor) commented Dec 23, 2015

SSE -> AVX -> AVX-512 instructions; the Vectors lib is further up the stack in corefx so coreclr couldn't take a dependency on it (also would want to skip the bounds checks)

I think you're assigning a slightly different meaning to intrinsic; usually intrinsic means that the function call is replaced with an inline version. What you seem to be suggesting is that the JIT compiler should generate the code for functions like BlockCopy out of thin air. I suppose that could work but it's a bit more work.

and the wide types aren't native in the base runtime beyond long

Hmm, put 2 longs in a struct; the JIT usually copies such structs using SSE instructions. Or put 4 and you get a form of unrolling as well.
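
For illustration, a sketch of that struct trick (the type and method names are made up, and a remainder loop would still be needed for the tail):

    // A 16-byte struct the JIT will typically move with a single SSE load/store pair.
    struct Block16 { public long A; public long B; }

    static unsafe void CopyBlocks16(byte* src, byte* dst, int count)
    {
        // Copy whole 16-byte blocks; the tail would be handled separately.
        while (count >= 16)
        {
            *(Block16*)dst = *(Block16*)src;
            src += 16;
            dst += 16;
            count -= 16;
        }
    }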

also would want to skip the bounds checks

fixed (byte *src = sourceArray)...

Actually I was thinking on cases where you need to accommodate cases like always copy 1033 bytes which will probably have a few loops of 512 bytes copy and a few more based on switch;

With a bit more optimization on the JIT side that could probably work as well. And the additional optimizations could handle other situations unrelated to copying whereas an intrinsic is useful only in this limited circumstance.

@Anderman

@mikedn. Interesting.

So, if I'm correct, the VM will mark the Vector assembly as a special class.

The importer scans every CALL and will try to convert the call to impSIMDIntrinsic

The impSIMDIntrinsic will first find out if the call is to a special class here, here and here

So it must be possible to do the same for Array.CopyTo or Buffer.BlockCopy:

  1. Mark them as special.
  2. Generate the code as a copyBlkDst.

Skipping the bounds check did improve the copy performance for 128 bytes by 50-80%.

@sharwell (Member)

What about something like the following? It assumes the argument checks are placed in a separate method VectorizedCopy, which calls this method after the arguments are checked.

.method public hidebysig static 
    void VectorizedCopyUnsafe (
        uint8[] src,
        int32 srcOffset,
        uint8[] dst,
        int32 dstOffset,
        int32 count
    ) cil managed 
{
    .maxstack 2
    .locals init (
        [0] uint8& pinned,
        [1] uint8& pinned
    )

    ldarg.s count
    brtrue.s IL_0005
    ret

    IL_0005: ldarg.0
    ldc.i4.0
    ldelema [mscorlib]System.Byte
    stloc.0
    ldarg.2
    ldc.i4.0
    ldelema [mscorlib]System.Byte
    stloc.1
    ldloc.1
    ldloc.0
    ldarg.s count
    unaligned.cpblk
    ret
}
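
For readers who don't speak IL, a roughly equivalent C# version would be the following (this uses Unsafe.CopyBlockUnaligned from the System.Runtime.CompilerServices.Unsafe package, which did not exist at the time of this thread; note that, as written, it pins element 0 rather than applying the offsets, matching the IL above):

    using System.Runtime.CompilerServices;

    public static unsafe void VectorizedCopyUnsafe(byte[] src, int srcOffset, byte[] dst, int dstOffset, int count)
    {
        if (count == 0)
            return;

        fixed (byte* pSrc = &src[0])
        fixed (byte* pDst = &dst[0])
        {
            // unaligned. cpblk corresponds to CopyBlockUnaligned.
            Unsafe.CopyBlockUnaligned(pDst, pSrc, (uint)count);
        }
    }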

@mikedn (Contributor) commented Dec 27, 2015

@Anderman

So, if I'm correct, the VM will mark the Vector assembly as a special class...

Yes, but a hypothetical Array.Copy/Buffer.BlockCopy intrinsic isn't related to SIMD intrinsics, even if the implementation uses SIMD. You may want to check Compiler::impIntrinsic.

@sharwell

What about something like the following?

Currently cpblk is converted to a helper call so it's unlikely this would help with small copies. Ultimately this doesn't seem to be very different from what Buffer.MemoryCopy offers.

@mikedn (Contributor) commented Dec 27, 2015

This issue is getting long and IMO a bit unfocused. Let's try to sum it up, we'll see there are actually 2 different related issues.

It's clear that Array.Copy & co. are slower than a native memcpy. How much slower depends on the CLR build. For example, the desktop build does 76 mil 8-byte copies per second while the official CoreCLR rc1-final does only ~48 mil (lack of PGO, I assume). Anyway, VC++2015's memcpy does 268 mil copies, so it's 3.5 times faster than the best CLR time. Even at larger sizes the difference is still rather large - 1.7x for 512-byte copies.

Now, the CLR may end up using no less than 3 different versions of memcpy depending on how it is built (VS2013 CRT, VS2015 CRT, OS CRT) and this can complicate measurements a bit. But the outcome is largely the same: the native memcpy is faster, especially for very small copy sizes. We shouldn't be distracted by things like "there's a drop in performance at 16 bytes" because with or without the drop the performance is still lower.

Issue 1
Where does the difference come from? The CLR uses memcpy as well (at least in the case of blittable types) but before that there are a lot of checks and the cost of those checks is significant for small copy sizes. SIMD won't help here but an intrinsic or managed version of Array.Copy might help because some of those checks could be eliminated. IMO a managed generic version would be best but that too requires some intrinsics, that's what IsBlittable is after all.

Are there workarounds? Well, if you don't mind using unsafe then there's Buffer.MemoryCopy. But is its performance as good as memcpy? It turns out that it isn't and that's the second issue.

Issue 2
Even for size < 16 the performance of Buffer.MemoryCopy is ~1.2x lower than memcpy, even though both use the same technique - a jump table (switch). This may be a JIT issue; in any case this isn't related to SIMD.

Issue 3
And for size = 512 the performance is also lower, around 1.4x. This is where SIMD would fit in; it should work better than the current int/long copy loop.

Hmm, looks like I ended up with 3 different issues :)

@benaadams (Member, Author)

Need to remeasure post dotnet/coreclr#3118

@mikedn (Contributor) commented Feb 14, 2016

Yeah, I love how a change was sneaked in without any reference to an existing long discussion about the affected piece of code 😀

@jkotas (Member) commented Feb 14, 2016

dotnet/coreclr#3118 was @geoffkizer's suggestion that I just helped to happen, so it did not immediately occur to me to link it to this one as I was typing the commit description ... sorry about that. Geoff is analyzing performance of the whole .NET Core server stack.

@nietras (Contributor) commented Mar 16, 2016

@Anderman could you explain how this works:

if (orgCount > Vector<byte>.Count)
{
    new Vector<byte>(src, orgCount - Vector<byte>.Count).CopyTo(dst, orgCount - Vector<byte>.Count);
    return;
}

I don't understand how this works, when it is not taking offsets into account?

I am redoing the different copies benchmarks (on all JITs) and wanted to include your take too.

As far as I understand Kestrel now uses Buffer.MemoryCopy for byte arrays since this is handled by a special case in the JIT. Has anyone done benchmarks on this?

@jkotas (Member) commented Mar 4, 2017

We should look into fixing this using Span.

@benaadams (Member, Author)

Closing as very out of date, though I don't know the current situation.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 30, 2020
@msftgits msftgits added this to the Future milestone Jan 30, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Jan 3, 2021