F. von Never — Code Vectorization in .NET and Other Technologies

Publication date: 2023-10-21

Recently somebody sent me a link to a benchmark that compares the performance of .NET, JVM and C++ on a simple task: provided an array of numbers, find the triples of numbers such as their sum is equal to a predetermined number (in our example, the number will be zero, i.e. x + y + z = 0).

Here's the whole function in C#:

private static int CountTriples(int[] arr, int sum)
{
    int n = arr.Length;
    int count = 0;

    for (int i = 0; i < n; i++)
    {
        for (int j = i + 1; j < n; j++)
        {
            for (int k = j + 1; k < n; k++)
            {
                if (arr[i] + arr[j] + arr[k] == sum)
                {
                    count++;
                }
            }
        }
    }

    return count;
}

For comparison, here's the same function in Java:

private static int countTriples(int[] arr, int sum) {
    int n = arr.length;
    int count = 0;

    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            for (int k = j + 1; k < n; k++) {
                if (arr[i] + arr[j] + arr[k] == sum) {
                    count++;
                }
            }
        }
    }

    return count;
}

The implementations are very similar, there's no difference in behavior and algorithm, and they were verified to work the same way on same input.

But still, if you run the code, you'll find that Java implementation is 1.3 times faster than C# one!

Here are the results from my computer (Intel Core i9-9900 KF @ 3.6 GHz, JVM Corretto-17.0.7.7.1, .NET 8.0.100-rc.1.23463.5):

$ javac Triples.java && java TriplesJ ints1k.txt 100 10
[…]
47.82 ms +/- 4.90
$ dotnet run -c Release -- ints1k.txt 100
[…]
65.03 ms +/- 5.77

Normally, we are used to a situation when .NET implementations of algorithms are a little bit faster than the same implementations in Java, mostly because of value types, generic speicalizations and standard library optimizations that are done in .NET and aren't possible or aren't viable in Java. But in this case, something strange is happening: JVM is notably faster. Let's dig deeper!

First of all, let's take a look at the assembly that Java has generated. For whatever reason, JDK is unable to generate assemby code out of the box, so you have to download a separate "disassembly plugin" from a third party (why isn't it published as a part of JDK is beyond me). I used a version from this site, and put it into the current directory. After that, it's possible to see the assembly:

$ java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,TriplesJ.countTriples TriplesJ ints1k.txt 100 10

You can see that the method was recompiled several times during the warm-up stage, and the final version is here:

============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------

Compiled method (c2)     889  272       4       TriplesJ::countTriples (79 bytes)
 total in heap  [0x000001801f812890,0x000001801f812d58] = 1224
 relocation     [0x000001801f8129e8,0x000001801f812a00] = 24
 main code      [0x000001801f812a00,0x000001801f812c20] = 544
 stub code      [0x000001801f812c20,0x000001801f812c38] = 24
 oops           [0x000001801f812c38,0x000001801f812c40] = 8
 metadata       [0x000001801f812c40,0x000001801f812c48] = 8
 scopes data    [0x000001801f812c48,0x000001801f812cd0] = 136
 scopes pcs     [0x000001801f812cd0,0x000001801f812d40] = 112
 dependencies   [0x000001801f812d40,0x000001801f812d48] = 8
 nul chk table  [0x000001801f812d48,0x000001801f812d58] = 16

--------------------------------------------------------------------------------
[Constant Pool (empty)]

--------------------------------------------------------------------------------

[Verified Entry Point]
  # {method} {0x0000018051400670} 'countTriples' '([II)I' in 'TriplesJ'
  # parm0:    rdx:rdx   = '[I'
  # parm1:    r8        = int
  #           [sp+0x50]  (sp of caller)
  0x000001801f812a00:   mov    %eax,-0x7000(%rsp)
  0x000001801f812a07:   push   %rbp
  0x000001801f812a08:   sub    $0x40,%rsp
  0x000001801f812a0c:   mov    0xc(%rdx),%ebp               ; implicit exception: dispatches to 0x000001801f812bec
  0x000001801f812a0f:   xor    %r9d,%r9d
  0x000001801f812a12:   test   %ebp,%ebp
  0x000001801f812a14:   jbe    0x000001801f812a39
  0x000001801f812a16:   mov    %ebp,%r11d
  0x000001801f812a19:   add    $0xfffffffd,%r11d
  0x000001801f812a1d:   mov    %ebp,%edi
  0x000001801f812a1f:   dec    %edi
  0x000001801f812a21:   mov    $0xfa0,%ebx
  0x000001801f812a26:   mov    $0x80000000,%ecx
  0x000001801f812a2b:   cmp    %r11d,%edi
  0x000001801f812a2e:   cmovl  %ecx,%r11d
  0x000001801f812a32:   xor    %eax,%eax
  0x000001801f812a34:   xor    %r10d,%r10d
  0x000001801f812a37:   jmp    0x000001801f812a5f
  0x000001801f812a39:   xor    %eax,%eax
  0x000001801f812a3b:   add    $0x40,%rsp
  0x000001801f812a3f:   pop    %rbp
  0x000001801f812a40:   cmp    0x340(%r15),%rsp             ;   {poll_return}
  0x000001801f812a47:   ja     0x000001801f812bf8
  0x000001801f812a4d:   ret
  0x000001801f812a4e:   mov    0x348(%r15),%r10             ; ImmutableOopMap {rdx=Oop }
                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                            ; - (reexecute) TriplesJ::countTriples@74 (line 11)
  0x000001801f812a55:   test   %eax,(%r10)                  ;   {poll}
  0x000001801f812a58:   cmp    %ebp,%esi
  0x000001801f812a5a:   jge    0x000001801f812a3b
  0x000001801f812a5c:   mov    %esi,%r10d
  0x000001801f812a5f:   mov    %r10d,%esi
  0x000001801f812a62:   inc    %esi
  0x000001801f812a64:   cmp    %ebp,%esi
  0x000001801f812a66:   jge    0x000001801f812a4e
  0x000001801f812a68:   mov    %esi,%ecx
  0x000001801f812a6a:   jmp    0x000001801f812aa8
  0x000001801f812a6c:   mov    0x10(%rdx,%rsi,4),%r10d
  0x000001801f812a71:   add    %ecx,%r10d
  0x000001801f812a74:   cmp    %r8d,%r10d
  0x000001801f812a77:   je     0x000001801f812bbc
  0x000001801f812a7d:   inc    %esi
  0x000001801f812a7f:   nop
  0x000001801f812a80:   cmp    %ebp,%esi
  0x000001801f812a82:   jl     0x000001801f812a6c
  0x000001801f812a84:   vmovd  %xmm1,%edi
  0x000001801f812a88:   vmovd  %xmm2,%r10d
  0x000001801f812a8d:   vmovd  %xmm3,%esi
  0x000001801f812a91:   mov    0x348(%r15),%rcx             ; ImmutableOopMap {rdx=Oop }
                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                            ; - (reexecute) TriplesJ::countTriples@68 (line 12)
  0x000001801f812a98:   test   %eax,(%rcx)                  ;   {poll}
  0x000001801f812a9a:   nopw   0x0(%rax,%rax,1)
  0x000001801f812aa0:   cmp    %ebp,%r14d
  0x000001801f812aa3:   jge    0x000001801f812a4e
  0x000001801f812aa5:   mov    %r14d,%ecx
  0x000001801f812aa8:   mov    %ecx,%r14d
  0x000001801f812aab:   inc    %r14d
  0x000001801f812aae:   cmp    %ebp,%r14d
  0x000001801f812ab1:   jge    0x000001801f812a91
  0x000001801f812ab3:   cmp    %ebp,%r10d
  0x000001801f812ab6:   jae    0x000001801f812bc5
  0x000001801f812abc:   nopl   0x0(%rax)
  0x000001801f812ac0:   cmp    %ebp,%ecx
  0x000001801f812ac2:   jae    0x000001801f812bc5
  0x000001801f812ac8:   cmp    %ebp,%r14d
  0x000001801f812acb:   jae    0x000001801f812bc5
  0x000001801f812ad1:   cmp    %ebp,%edi
  0x000001801f812ad3:   jae    0x000001801f812bc5
  0x000001801f812ad9:   vmovd  %ecx,%xmm0
  0x000001801f812add:   vmovd  %esi,%xmm3
  0x000001801f812ae1:   vmovd  %edi,%xmm1
  0x000001801f812ae5:   mov    0x10(%rdx,%rcx,4),%ecx
  0x000001801f812ae9:   add    0x10(%rdx,%r10,4),%ecx
  0x000001801f812aee:   vmovd  %r10d,%xmm2
  0x000001801f812af3:   vmovd  %xmm0,%edi
  0x000001801f812af7:   add    $0x2,%edi
  0x000001801f812afa:   mov    %r14d,%esi
  0x000001801f812afd:   mov    0x10(%rdx,%rsi,4),%r10d
  0x000001801f812b02:   add    %ecx,%r10d
  0x000001801f812b05:   cmp    %r8d,%r10d
  0x000001801f812b08:   je     0x000001801f812bb5
  0x000001801f812b0e:   inc    %esi
  0x000001801f812b10:   cmp    %edi,%esi
  0x000001801f812b12:   jl     0x000001801f812afd
  0x000001801f812b14:   cmp    %r11d,%esi
  0x000001801f812b17:   jge    0x000001801f812b93
  0x000001801f812b1d:   mov    %r11d,%edi
  0x000001801f812b20:   sub    %esi,%edi
  0x000001801f812b22:   cmp    %esi,%r11d
  0x000001801f812b25:   cmovl  %r9d,%edi
  0x000001801f812b29:   cmp    $0xfa0,%edi
  0x000001801f812b2f:   cmova  %ebx,%edi
  0x000001801f812b32:   add    %esi,%edi
  0x000001801f812b34:   mov    0x10(%rdx,%rsi,4),%r13d
  0x000001801f812b39:   add    %ecx,%r13d
  0x000001801f812b3c:   nopl   0x0(%rax)
  0x000001801f812b40:   cmp    %r8d,%r13d
  0x000001801f812b43:   je     0x000001801f812bad
  0x000001801f812b49:   movslq %esi,%r10
  0x000001801f812b4c:   mov    0x14(%rdx,%r10,4),%r13d
  0x000001801f812b51:   add    %ecx,%r13d
  0x000001801f812b54:   cmp    %r8d,%r13d
  0x000001801f812b57:   je     0x000001801f812ba9
  0x000001801f812b5d:   mov    0x18(%rdx,%r10,4),%r13d
  0x000001801f812b62:   add    %ecx,%r13d
  0x000001801f812b65:   cmp    %r8d,%r13d
  0x000001801f812b68:   je     0x000001801f812bb1
  0x000001801f812b6e:   mov    0x1c(%rdx,%r10,4),%r10d
  0x000001801f812b73:   add    %ecx,%r10d
  0x000001801f812b76:   cmp    %r8d,%r10d
  0x000001801f812b79:   je     0x000001801f812ba5
  0x000001801f812b7b:   add    $0x4,%esi
  0x000001801f812b7e:   xchg   %ax,%ax
  0x000001801f812b80:   cmp    %edi,%esi
  0x000001801f812b82:   jl     0x000001801f812b34
  0x000001801f812b84:   mov    0x348(%r15),%r10             ; ImmutableOopMap {rdx=Oop }
                                                            ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                                            ; - (reexecute) TriplesJ::countTriples@62 (line 13)
  0x000001801f812b8b:   test   %eax,(%r10)                  ;   {poll}
  0x000001801f812b8e:   cmp    %r11d,%esi
  0x000001801f812b91:   jl     0x000001801f812b1d
  0x000001801f812b93:   cmp    %ebp,%esi
  0x000001801f812b95:   jl     0x000001801f812a6c
  0x000001801f812b9b:   nopl   0x0(%rax,%rax,1)
  0x000001801f812ba0:   jmp    0x000001801f812a84
  0x000001801f812ba5:   inc    %eax
  0x000001801f812ba7:   jmp    0x000001801f812b7b
  0x000001801f812ba9:   inc    %eax
  0x000001801f812bab:   jmp    0x000001801f812b5d
  0x000001801f812bad:   inc    %eax
  0x000001801f812baf:   jmp    0x000001801f812b49
  0x000001801f812bb1:   inc    %eax
  0x000001801f812bb3:   jmp    0x000001801f812b6e
  0x000001801f812bb5:   inc    %eax
  0x000001801f812bb7:   jmp    0x000001801f812b0e
  0x000001801f812bbc:   inc    %eax
  0x000001801f812bbe:   xchg   %ax,%ax
  0x000001801f812bc0:   jmp    0x000001801f812a7d
  0x000001801f812bc5:   mov    %rdx,(%rsp)
  0x000001801f812bc9:   mov    %r8d,0x8(%rsp)
  0x000001801f812bce:   mov    %eax,0xc(%rsp)
  0x000001801f812bd2:   mov    %r10d,0x10(%rsp)
  0x000001801f812bd7:   mov    %ecx,0x14(%rsp)
  0x000001801f812bdb:   mov    %r14d,0x1c(%rsp)
  0x000001801f812be0:   mov    $0xffffff76,%edx
  0x000001801f812be5:   xchg   %ax,%ax
  0x000001801f812be7:   call   0x000001801f106900           ; ImmutableOopMap {[0]=Oop }
                                                            ;*if_icmpge {reexecute=1 rethrow=0 return_oop=0}
                                                            ; - (reexecute) TriplesJ::countTriples@35 (line 13)
                                                            ;   {runtime_call UncommonTrapBlob}
  0x000001801f812bec:   mov    $0xfffffff6,%edx
  0x000001801f812bf1:   xchg   %ax,%ax
  0x000001801f812bf3:   call   0x000001801f106900           ; ImmutableOopMap {}
                                                            ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - TriplesJ::countTriples@1 (line 8)
                                                            ;   {runtime_call UncommonTrapBlob}
  0x000001801f812bf8:   movabs $0x1801f812a40,%r10          ;   {internal_word}
  0x000001801f812c02:   mov    %r10,0x358(%r15)
  0x000001801f812c09:   jmp    0x000001801f107a00           ;   {runtime_call SafepointBlob}
  0x000001801f812c0e:   hlt
  0x000001801f812c0f:   hlt
  0x000001801f812c10:   hlt
  0x000001801f812c11:   hlt
  0x000001801f812c12:   hlt
  0x000001801f812c13:   hlt
  0x000001801f812c14:   hlt
  0x000001801f812c15:   hlt
  0x000001801f812c16:   hlt
  0x000001801f812c17:   hlt
  0x000001801f812c18:   hlt
  0x000001801f812c19:   hlt
  0x000001801f812c1a:   hlt
  0x000001801f812c1b:   hlt
  0x000001801f812c1c:   hlt
  0x000001801f812c1d:   hlt
  0x000001801f812c1e:   hlt
  0x000001801f812c1f:   hlt
[Exception Handler]
  0x000001801f812c20:   jmp    0x000001801f118800           ;   {no_reloc}
[Deopt Handler Code]
  0x000001801f812c25:   call   0x000001801f812c2a
  0x000001801f812c2a:   subq   $0x5,(%rsp)
  0x000001801f812c2f:   jmp    0x000001801f106ca0           ;   {runtime_call DeoptimizationBlob}
  0x000001801f812c34:   hlt
  0x000001801f812c35:   hlt
  0x000001801f812c36:   hlt
  0x000001801f812c37:   hlt
--------------------------------------------------------------------------------

While I am not an assembly expert and it's a bit hard for me to read this printout, I can see that it uses vmovd instruction and xmm registers, highlighting the use of SIMD or automatic code vectorization that happened on JVM.

(Feel free to leave your comments if my analysis is mistaken in this part.)

Let's compare it with what .NET prints:

$ $env:DOTNET_JitDisasm = 'CountTriples'
$ dotnet run -c Release -- ints1k.txt 100

; Assembly listing for method TriplesCS:CountTriples(int[],int):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Blended PGO
; rsp based frame
; fully interruptible
; with Blended PGO: edge weights are invalid, and fgCalledCount is 100

G_M000_IG01:                ;; offset=0x0000
       push     rdi
       push     rsi
       push     rbp
       push     rbx

G_M000_IG02:                ;; offset=0x0004
       mov      eax, dword ptr [rcx+0x08]
       xor      r8d, r8d
       xor      r10d, r10d
       test     eax, eax
       jle      G_M000_IG23

G_M000_IG03:                ;; offset=0x0015
       lea      r9d, [r10+0x01]
       mov      r11d, r9d
       cmp      r11d, eax
       jge      G_M000_IG20

G_M000_IG04:                ;; offset=0x0025
       test     r11d, r11d
       jl       SHORT G_M000_IG16
       mov      r10d, r10d
       mov      r10d, dword ptr [rcx+4*r10+0x10]

G_M000_IG05:                ;; offset=0x0032
       lea      ebx, [r11+0x01]
       mov      esi, ebx
       cmp      esi, eax
       jge      SHORT G_M000_IG13

G_M000_IG06:                ;; offset=0x003C
       test     esi, esi
       jl       SHORT G_M000_IG11
       mov      r11d, r11d
       mov      r11d, dword ptr [rcx+4*r11+0x10]
       align    [0 bytes for IG07]

G_M000_IG07:                ;; offset=0x0048
       lea      edi, [r10+r11]
       mov      ebp, esi
       add      edi, dword ptr [rcx+4*rbp+0x10]
       cmp      edi, edx
       je       SHORT G_M000_IG10

G_M000_IG08:                ;; offset=0x0056
       inc      esi
       cmp      esi, eax
       jl       SHORT G_M000_IG07

G_M000_IG09:                ;; offset=0x005C
       jmp      SHORT G_M000_IG13

G_M000_IG10:                ;; offset=0x005E
       inc      r8d
       jmp      SHORT G_M000_IG08

G_M000_IG11:                ;; offset=0x0063
       mov      edi, r11d
       mov      edi, dword ptr [rcx+4*rdi+0x10]
       add      edi, r10d
       mov      ebp, esi
       add      edi, dword ptr [rcx+4*rbp+0x10]
       cmp      edi, edx
       je       SHORT G_M000_IG15

G_M000_IG12:                ;; offset=0x0077
       inc      esi
       cmp      esi, eax
       jl       SHORT G_M000_IG11

G_M000_IG13:                ;; offset=0x007D
       mov      r11d, ebx
       cmp      r11d, eax
       jl       SHORT G_M000_IG05

G_M000_IG14:                ;; offset=0x0085
       jmp      SHORT G_M000_IG20

G_M000_IG15:                ;; offset=0x0087
       inc      r8d
       jmp      SHORT G_M000_IG12

G_M000_IG16:                ;; offset=0x008C
       lea      ebx, [r11+0x01]
       mov      esi, ebx
       cmp      esi, eax
       jge      SHORT G_M000_IG19

G_M000_IG17:                ;; offset=0x0096
       mov      edi, r10d
       mov      edi, dword ptr [rcx+4*rdi+0x10]
       mov      ebp, r11d
       add      edi, dword ptr [rcx+4*rbp+0x10]
       mov      ebp, esi
       add      edi, dword ptr [rcx+4*rbp+0x10]
       cmp      edi, edx
       je       SHORT G_M000_IG22

G_M000_IG18:                ;; offset=0x00AE
       inc      esi
       cmp      esi, eax
       jl       SHORT G_M000_IG17

G_M000_IG19:                ;; offset=0x00B4
       mov      r11d, ebx
       cmp      r11d, eax
       jl       SHORT G_M000_IG16

G_M000_IG20:                ;; offset=0x00BC
       mov      r10d, r9d
       cmp      r10d, eax
       jl       G_M000_IG03

G_M000_IG21:                ;; offset=0x00C8
       jmp      SHORT G_M000_IG23

G_M000_IG22:                ;; offset=0x00CA
       inc      r8d
       jmp      SHORT G_M000_IG18

G_M000_IG23:                ;; offset=0x00CF
       mov      eax, r8d

G_M000_IG24:                ;; offset=0x00D2
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret

; Total bytes of code 215

The first important detail is that we were a bit too optimistic about not needed warmup in .NET code: you can see that the final version of the method gets emitted somewhere after 63'th iteration. And the second important detail is that this is extremely simple x86-64 assembly, no SIMD operations or XMM registers are mentioned whatsoever. You don't even have to be an assembly expert to note that.

So, we can see that JVM has vectorized its code, while .NET didn't. Could we help it a little bit? Or course! Let's rewrite the C# part (thanks to @vanbukin, @EgorBo, and @ilchert for their collaboration on this code):

private static int CountTriples(int[] arr, int sum)
{
    int n = arr.Length;
    int count = 0;
    var vCount = Vector<int>.Count;
    ref int p = ref arr[0];
    var sumVector = new Vector<int>(sum);
    for (int i = 0; i < n; i++)
    {
        for (int j = i + 1; j < n; j++)
        {
            var ijSum = arr[i] + arr[j];
            var ijSumVector = new Vector<int>(ijSum);
            var k = j + 1;
            for (; k < n - vCount; k += vCount)
            {
                var kVector = Vector.LoadUnsafe(ref p, (nuint)k);
                var ijkSumVector = kVector + ijSumVector;
                var subResult = Vector.Equals(sumVector, ijkSumVector);
                if (subResult != Vector<int>.Zero)
                {
                    var sumCount = Vector.Sum(subResult);
                    count -= sumCount;
                }
            }

            for (; k < n; k++)
                count += ijSum + arr[k] == sum ? 1 : 0;
        }
    }
    return count;
}

This code uses explicit Vector type that supports vectorization in .NET, and thus is more complex than the straightforward implementation (in my opinion, the most complex part is the requirement to handle a non-vectorized "tail" that didn't fit into the vector size: it is easy to miss and end up with an incorrect result), but it is significantly faster than JVM:

20.58 ms +/- 1.45

On some computers, it is even faster than C++ implementation (that uses auto-vectorization by the compiler as well).

Let's examine the assembly.

; Assembly listing for method TriplesCS:CountTriples(int[],int):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Blended PGO
; rsp based frame
; fully interruptible
; with Blended PGO: edge weights are invalid, and fgCalledCount is 100

G_M000_IG01:                ;; offset=0x0000
       push     r15
       push     r14
       push     r13
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 32
       vzeroupper

G_M000_IG02:                ;; offset=0x0011
       mov      eax, dword ptr [rcx+0x08]
       mov      r8d, eax
       xor      r10d, r10d
       test     eax, eax
       je       G_M000_IG28
       lea      r9, bword ptr [rcx+0x10]
       vmovd    xmm0, edx
       vpbroadcastd ymm0, ymm0
       xor      r11d, r11d
       test     r8d, r8d
       jle      G_M000_IG26

G_M000_IG03:                ;; offset=0x003B
       mov      ebx, r8d
       sub      ebx, 8

G_M000_IG04:                ;; offset=0x0041
       lea      esi, [r11+0x01]
       mov      edi, esi
       cmp      edi, r8d
       jge      G_M000_IG23

G_M000_IG05:                ;; offset=0x0050
       test     edi, edi
       jl       G_M000_IG17
       mov      r11d, r11d
       mov      r11d, dword ptr [rcx+4*r11+0x10]

G_M000_IG06:                ;; offset=0x0060
       mov      ebp, edi
       mov      r14d, r11d
       add      r14d, dword ptr [rcx+4*rbp+0x10]
       vmovd    xmm1, r14d
       vpbroadcastd ymm1, ymm1
       inc      edi
       mov      ebp, edi
       cmp      ebx, ebp
       jle      SHORT G_M000_IG09
       align    [4 bytes for IG07]

G_M000_IG07:                ;; offset=0x0080
       movsxd   r15, ebp
       vpaddd   ymm2, ymm1, ymmword ptr [r9+4*r15]
       vpcmpeqd ymm2, ymm0, ymm2
       vptest   ymm2, ymm2
       jne      SHORT G_M000_IG13

G_M000_IG08:                ;; offset=0x0094
       add      ebp, 8
       cmp      ebx, ebp
       jg       SHORT G_M000_IG07

G_M000_IG09:                ;; offset=0x009B
       cmp      ebp, r8d
       jge      SHORT G_M000_IG15

G_M000_IG10:                ;; offset=0x00A0
       test     ebp, ebp
       jl       SHORT G_M000_IG14
       align    [0 bytes for IG11]

G_M000_IG11:                ;; offset=0x00A4
       mov      r15d, ebp
       mov      r13d, r14d
       add      r13d, dword ptr [rcx+4*r15+0x10]
       xor      r15d, r15d
       cmp      r13d, edx
       sete     r15b
       add      r15d, r10d
       mov      r10d, r15d
       inc      ebp
       cmp      ebp, r8d
       jl       SHORT G_M000_IG11

G_M000_IG12:                ;; offset=0x00C6
       jmp      SHORT G_M000_IG15

G_M000_IG13:                ;; offset=0x00C8
       vphaddd  ymm3, ymm2, ymm2
       vphaddd  ymm4, ymm3, ymm3
       vextracti128 xmm2, ymm4, 1
       vpaddd   xmm3, xmm2, xmm4
       vmovd    r15d, xmm3
       sub      r10d, r15d
       jmp      SHORT G_M000_IG08

G_M000_IG14:                ;; offset=0x00E6
       cmp      ebp, eax
       jae      G_M000_IG28
       mov      r15d, ebp
       mov      r13d, r14d
       add      r13d, dword ptr [rcx+4*r15+0x10]
       xor      r15d, r15d
       cmp      r13d, edx
       sete     r15b
       add      r15d, r10d
       mov      r10d, r15d
       inc      ebp
       cmp      ebp, r8d
       jl       SHORT G_M000_IG14

G_M000_IG15:                ;; offset=0x0110
       cmp      edi, r8d
       jl       G_M000_IG06

G_M000_IG16:                ;; offset=0x0119
       jmp      SHORT G_M000_IG23

G_M000_IG17:                ;; offset=0x011B
       mov      r14d, r11d
       mov      ebp, dword ptr [rcx+4*r14+0x10]
       mov      r14d, edi
       add      ebp, dword ptr [rcx+4*r14+0x10]
       mov      r14d, ebp
       vmovd    xmm1, r14d
       vpbroadcastd ymm1, ymm1
       inc      edi
       mov      ebp, edi
       cmp      ebx, ebp
       jle      SHORT G_M000_IG20

G_M000_IG18:                ;; offset=0x0140
       movsxd   r15, ebp
       vpaddd   ymm2, ymm1, ymmword ptr [r9+4*r15]
       vpcmpeqd ymm2, ymm0, ymm2
       vptest   ymm2, ymm2
       jne      SHORT G_M000_IG25

G_M000_IG19:                ;; offset=0x0154
       add      ebp, 8
       cmp      ebx, ebp
       jg       SHORT G_M000_IG18

G_M000_IG20:                ;; offset=0x015B
       cmp      ebp, r8d
       jge      SHORT G_M000_IG22

G_M000_IG21:                ;; offset=0x0160
       cmp      ebp, eax
       jae      SHORT G_M000_IG28
       mov      r15d, ebp
       mov      r13d, r14d
       add      r13d, dword ptr [rcx+4*r15+0x10]
       xor      r15d, r15d
       cmp      r13d, edx
       sete     r15b
       add      r15d, r10d
       mov      r10d, r15d
       inc      ebp
       cmp      ebp, r8d
       jl       SHORT G_M000_IG21

G_M000_IG22:                ;; offset=0x0186
       cmp      edi, r8d
       jl       SHORT G_M000_IG17

G_M000_IG23:                ;; offset=0x018B
       mov      r11d, esi
       cmp      r11d, r8d
       jl       G_M000_IG04

G_M000_IG24:                ;; offset=0x0197
       jmp      SHORT G_M000_IG26

G_M000_IG25:                ;; offset=0x0199
       vphaddd  ymm3, ymm2, ymm2
       vphaddd  ymm4, ymm3, ymm3
       vextracti128 xmm2, ymm4, 1
       vmovaps  ymm3, ymm4
       vpaddd   xmm2, xmm2, xmm3
       vmovd    r15d, xmm2
       sub      r10d, r15d
       jmp      SHORT G_M000_IG19

G_M000_IG26:                ;; offset=0x01BB
       mov      eax, r10d

G_M000_IG27:                ;; offset=0x01BE
       vzeroupper
       add      rsp, 32
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       pop      r13
       pop      r14
       pop      r15
       ret

G_M000_IG28:                ;; offset=0x01D0
       call     CORINFO_HELP_RNGCHKFAIL
       int3

; Total bytes of code 470

Okay, now we are talking: you can easily see a lot of SIMD instructions and registers here. Once again, I am not a SIMD expert so it's a bit hard for me to describe what exactly is going on here, but hopefully it's possible to figure out, provided you have the original source above.

What could we conclude? JVM has auto-vectorization that just rocks in certain cases, at no cost for the user. Simple code may be very fast after C2 JIT. .NET doesn't provide the same JIT facility, so you sometimes have to optimize the code manually (which is always tricky, so make sure you have a good test suite!). But if you do that then oh boy it is fast: it is on par with the C++ compiler.