Code Vectorization in .NET and Other Technologies
Recently somebody sent me a link to a benchmark that compares the performance of .NET, the JVM, and C++ on a simple task: given an array of numbers, count the triples whose sum equals a predetermined number (in our example, the number is zero, i.e. x + y + z = 0).
Here's the whole function in C#:
private static int CountTriples(int[] arr, int sum)
{
    int n = arr.Length;
    int count = 0;
    for (int i = 0; i < n; i++)
    {
        for (int j = i + 1; j < n; j++)
        {
            for (int k = j + 1; k < n; k++)
            {
                if (arr[i] + arr[j] + arr[k] == sum)
                {
                    count++;
                }
            }
        }
    }
    return count;
}
For comparison, here's the same function in Java:
private static int countTriples(int[] arr, int sum) {
    int n = arr.length;
    int count = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            for (int k = j + 1; k < n; k++) {
                if (arr[i] + arr[j] + arr[k] == sum) {
                    count++;
                }
            }
        }
    }
    return count;
}
The implementations are very similar: there is no difference in behavior or algorithm, and they were verified to produce the same result on the same input.
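To make the task concrete, here is a small self-contained Java sketch that runs the same triple loop on a hand-picked array. The class name `TriplesExample` and the sample data are made up for illustration and are not part of the original benchmark.

```java
// Hypothetical demo class; not part of the original benchmark.
class TriplesExample {
    // Same brute-force O(n^3) loop as in the article.
    static int countTriples(int[] arr, int sum) {
        int n = arr.length;
        int count = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                for (int k = j + 1; k < n; k++) {
                    if (arr[i] + arr[j] + arr[k] == sum) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] sample = {-1, 0, 1, 2, -2};
        // Only two triples sum to zero here: (-1, 0, 1) and (0, 2, -2).
        System.out.println(countTriples(sample, 0)); // prints 2
    }
}
```

Each triple is counted once because the indices are strictly increasing (i < j < k).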
But still, if you run the code, you'll find that the Java implementation is about 1.3 times faster than the C# one!
Here are the results from my computer (Intel Core i9-9900KF @ 3.6 GHz, JVM Corretto-17.0.7.7.1, .NET 8.0.100-rc.1.23463.5):
$ javac Triples.java && java TriplesJ ints1k.txt 100 10
[…]
47.82 ms +/- 4.90
$ dotnet run -c Release -- ints1k.txt 100
[…]
65.03 ms +/- 5.77
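The benchmark harness itself is not shown here. Below is a minimal sketch of how numbers in this "mean +/- spread" shape could be produced. It rests on assumptions the article does not confirm: that the spread is a sample standard deviation, and that a few warm-up runs are discarded before measuring. The class `TimingSketch`, its methods, and the synthetic input are all hypothetical.

```java
import java.util.Locale;
import java.util.function.IntSupplier;

// Hypothetical timing helper; the real harness of the benchmark may work differently.
class TimingSketch {
    // Accumulates results so the JIT cannot discard the measured work as dead code.
    static int sink;

    // Arithmetic mean of the samples (0 for an empty array).
    static double mean(double[] xs) {
        double total = 0;
        for (double x : xs) total += x;
        return xs.length == 0 ? 0 : total / xs.length;
    }

    // Sample standard deviation (n - 1 in the denominator); 0 for fewer than two samples.
    static double stdDev(double[] xs) {
        if (xs.length < 2) return 0;
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) squares += (x - m) * (x - m);
        return Math.sqrt(squares / (xs.length - 1));
    }

    // Runs the work warmupRuns times untimed, then returns measuredRuns timings in milliseconds.
    static double[] measureMillis(IntSupplier work, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) sink += work.getAsInt();
        double[] millis = new double[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            sink += work.getAsInt();
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return millis;
    }

    // Same brute-force loop as in the article.
    static int countTriples(int[] arr, int sum) {
        int n = arr.length;
        int count = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = j + 1; k < n; k++)
                    if (arr[i] + arr[j] + arr[k] == sum) count++;
        return count;
    }

    public static void main(String[] args) {
        // Synthetic input instead of ints1k.txt, whose contents are not shown in the article.
        int[] data = new int[300];
        for (int i = 0; i < data.length; i++) data[i] = (i * 37) % 101 - 50;
        double[] timings = measureMillis(() -> countTriples(data, 0), 10, 100);
        System.out.printf(Locale.ROOT, "%.2f ms +/- %.2f%n", mean(timings), stdDev(timings));
    }
}
```

Discarding warm-up runs matters on both runtimes, since a JIT may replace the method with better code several times before it settles.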
Normally, we are used to .NET implementations of algorithms being a little bit faster than the same implementations in Java, mostly because of value types, generic specializations, and standard library optimizations that are done in .NET and aren't possible or viable in Java. But in this case, something strange is happening: the JVM is notably faster. Let's dig deeper!
First of all, let's take a look at the assembly that Java has generated. For whatever reason, the JDK is unable to print assembly code out of the box, so you have to download a separate "disassembly plugin" (hsdis) from a third party (why it isn't published as a part of the JDK is beyond me). I used a version from this site and put it into the current directory. After that, it's possible to see the assembly:
$ java -XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,TriplesJ.countTriples TriplesJ ints1k.txt 100 10
You can see that the method was recompiled several times during the warm-up stage, and the final version is here:
============================= C2-compiled nmethod ==============================
----------------------------------- Assembly -----------------------------------
Compiled method (c2) 889 272 4 TriplesJ::countTriples (79 bytes)
total in heap [0x000001801f812890,0x000001801f812d58] = 1224
relocation [0x000001801f8129e8,0x000001801f812a00] = 24
main code [0x000001801f812a00,0x000001801f812c20] = 544
stub code [0x000001801f812c20,0x000001801f812c38] = 24
oops [0x000001801f812c38,0x000001801f812c40] = 8
metadata [0x000001801f812c40,0x000001801f812c48] = 8
scopes data [0x000001801f812c48,0x000001801f812cd0] = 136
scopes pcs [0x000001801f812cd0,0x000001801f812d40] = 112
dependencies [0x000001801f812d40,0x000001801f812d48] = 8
nul chk table [0x000001801f812d48,0x000001801f812d58] = 16
--------------------------------------------------------------------------------
[Constant Pool (empty)]
--------------------------------------------------------------------------------
[Verified Entry Point]
# {method} {0x0000018051400670} 'countTriples' '([II)I' in 'TriplesJ'
# parm0: rdx:rdx = '[I'
# parm1: r8 = int
# [sp+0x50] (sp of caller)
0x000001801f812a00: mov %eax,-0x7000(%rsp)
0x000001801f812a07: push %rbp
0x000001801f812a08: sub $0x40,%rsp
0x000001801f812a0c: mov 0xc(%rdx),%ebp ; implicit exception: dispatches to 0x000001801f812bec
0x000001801f812a0f: xor %r9d,%r9d
0x000001801f812a12: test %ebp,%ebp
0x000001801f812a14: jbe 0x000001801f812a39
0x000001801f812a16: mov %ebp,%r11d
0x000001801f812a19: add $0xfffffffd,%r11d
0x000001801f812a1d: mov %ebp,%edi
0x000001801f812a1f: dec %edi
0x000001801f812a21: mov $0xfa0,%ebx
0x000001801f812a26: mov $0x80000000,%ecx
0x000001801f812a2b: cmp %r11d,%edi
0x000001801f812a2e: cmovl %ecx,%r11d
0x000001801f812a32: xor %eax,%eax
0x000001801f812a34: xor %r10d,%r10d
0x000001801f812a37: jmp 0x000001801f812a5f
0x000001801f812a39: xor %eax,%eax
0x000001801f812a3b: add $0x40,%rsp
0x000001801f812a3f: pop %rbp
0x000001801f812a40: cmp 0x340(%r15),%rsp ; {poll_return}
0x000001801f812a47: ja 0x000001801f812bf8
0x000001801f812a4d: ret
0x000001801f812a4e: mov 0x348(%r15),%r10 ; ImmutableOopMap {rdx=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) TriplesJ::countTriples@74 (line 11)
0x000001801f812a55: test %eax,(%r10) ; {poll}
0x000001801f812a58: cmp %ebp,%esi
0x000001801f812a5a: jge 0x000001801f812a3b
0x000001801f812a5c: mov %esi,%r10d
0x000001801f812a5f: mov %r10d,%esi
0x000001801f812a62: inc %esi
0x000001801f812a64: cmp %ebp,%esi
0x000001801f812a66: jge 0x000001801f812a4e
0x000001801f812a68: mov %esi,%ecx
0x000001801f812a6a: jmp 0x000001801f812aa8
0x000001801f812a6c: mov 0x10(%rdx,%rsi,4),%r10d
0x000001801f812a71: add %ecx,%r10d
0x000001801f812a74: cmp %r8d,%r10d
0x000001801f812a77: je 0x000001801f812bbc
0x000001801f812a7d: inc %esi
0x000001801f812a7f: nop
0x000001801f812a80: cmp %ebp,%esi
0x000001801f812a82: jl 0x000001801f812a6c
0x000001801f812a84: vmovd %xmm1,%edi
0x000001801f812a88: vmovd %xmm2,%r10d
0x000001801f812a8d: vmovd %xmm3,%esi
0x000001801f812a91: mov 0x348(%r15),%rcx ; ImmutableOopMap {rdx=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) TriplesJ::countTriples@68 (line 12)
0x000001801f812a98: test %eax,(%rcx) ; {poll}
0x000001801f812a9a: nopw 0x0(%rax,%rax,1)
0x000001801f812aa0: cmp %ebp,%r14d
0x000001801f812aa3: jge 0x000001801f812a4e
0x000001801f812aa5: mov %r14d,%ecx
0x000001801f812aa8: mov %ecx,%r14d
0x000001801f812aab: inc %r14d
0x000001801f812aae: cmp %ebp,%r14d
0x000001801f812ab1: jge 0x000001801f812a91
0x000001801f812ab3: cmp %ebp,%r10d
0x000001801f812ab6: jae 0x000001801f812bc5
0x000001801f812abc: nopl 0x0(%rax)
0x000001801f812ac0: cmp %ebp,%ecx
0x000001801f812ac2: jae 0x000001801f812bc5
0x000001801f812ac8: cmp %ebp,%r14d
0x000001801f812acb: jae 0x000001801f812bc5
0x000001801f812ad1: cmp %ebp,%edi
0x000001801f812ad3: jae 0x000001801f812bc5
0x000001801f812ad9: vmovd %ecx,%xmm0
0x000001801f812add: vmovd %esi,%xmm3
0x000001801f812ae1: vmovd %edi,%xmm1
0x000001801f812ae5: mov 0x10(%rdx,%rcx,4),%ecx
0x000001801f812ae9: add 0x10(%rdx,%r10,4),%ecx
0x000001801f812aee: vmovd %r10d,%xmm2
0x000001801f812af3: vmovd %xmm0,%edi
0x000001801f812af7: add $0x2,%edi
0x000001801f812afa: mov %r14d,%esi
0x000001801f812afd: mov 0x10(%rdx,%rsi,4),%r10d
0x000001801f812b02: add %ecx,%r10d
0x000001801f812b05: cmp %r8d,%r10d
0x000001801f812b08: je 0x000001801f812bb5
0x000001801f812b0e: inc %esi
0x000001801f812b10: cmp %edi,%esi
0x000001801f812b12: jl 0x000001801f812afd
0x000001801f812b14: cmp %r11d,%esi
0x000001801f812b17: jge 0x000001801f812b93
0x000001801f812b1d: mov %r11d,%edi
0x000001801f812b20: sub %esi,%edi
0x000001801f812b22: cmp %esi,%r11d
0x000001801f812b25: cmovl %r9d,%edi
0x000001801f812b29: cmp $0xfa0,%edi
0x000001801f812b2f: cmova %ebx,%edi
0x000001801f812b32: add %esi,%edi
0x000001801f812b34: mov 0x10(%rdx,%rsi,4),%r13d
0x000001801f812b39: add %ecx,%r13d
0x000001801f812b3c: nopl 0x0(%rax)
0x000001801f812b40: cmp %r8d,%r13d
0x000001801f812b43: je 0x000001801f812bad
0x000001801f812b49: movslq %esi,%r10
0x000001801f812b4c: mov 0x14(%rdx,%r10,4),%r13d
0x000001801f812b51: add %ecx,%r13d
0x000001801f812b54: cmp %r8d,%r13d
0x000001801f812b57: je 0x000001801f812ba9
0x000001801f812b5d: mov 0x18(%rdx,%r10,4),%r13d
0x000001801f812b62: add %ecx,%r13d
0x000001801f812b65: cmp %r8d,%r13d
0x000001801f812b68: je 0x000001801f812bb1
0x000001801f812b6e: mov 0x1c(%rdx,%r10,4),%r10d
0x000001801f812b73: add %ecx,%r10d
0x000001801f812b76: cmp %r8d,%r10d
0x000001801f812b79: je 0x000001801f812ba5
0x000001801f812b7b: add $0x4,%esi
0x000001801f812b7e: xchg %ax,%ax
0x000001801f812b80: cmp %edi,%esi
0x000001801f812b82: jl 0x000001801f812b34
0x000001801f812b84: mov 0x348(%r15),%r10 ; ImmutableOopMap {rdx=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) TriplesJ::countTriples@62 (line 13)
0x000001801f812b8b: test %eax,(%r10) ; {poll}
0x000001801f812b8e: cmp %r11d,%esi
0x000001801f812b91: jl 0x000001801f812b1d
0x000001801f812b93: cmp %ebp,%esi
0x000001801f812b95: jl 0x000001801f812a6c
0x000001801f812b9b: nopl 0x0(%rax,%rax,1)
0x000001801f812ba0: jmp 0x000001801f812a84
0x000001801f812ba5: inc %eax
0x000001801f812ba7: jmp 0x000001801f812b7b
0x000001801f812ba9: inc %eax
0x000001801f812bab: jmp 0x000001801f812b5d
0x000001801f812bad: inc %eax
0x000001801f812baf: jmp 0x000001801f812b49
0x000001801f812bb1: inc %eax
0x000001801f812bb3: jmp 0x000001801f812b6e
0x000001801f812bb5: inc %eax
0x000001801f812bb7: jmp 0x000001801f812b0e
0x000001801f812bbc: inc %eax
0x000001801f812bbe: xchg %ax,%ax
0x000001801f812bc0: jmp 0x000001801f812a7d
0x000001801f812bc5: mov %rdx,(%rsp)
0x000001801f812bc9: mov %r8d,0x8(%rsp)
0x000001801f812bce: mov %eax,0xc(%rsp)
0x000001801f812bd2: mov %r10d,0x10(%rsp)
0x000001801f812bd7: mov %ecx,0x14(%rsp)
0x000001801f812bdb: mov %r14d,0x1c(%rsp)
0x000001801f812be0: mov $0xffffff76,%edx
0x000001801f812be5: xchg %ax,%ax
0x000001801f812be7: call 0x000001801f106900 ; ImmutableOopMap {[0]=Oop }
;*if_icmpge {reexecute=1 rethrow=0 return_oop=0}
; - (reexecute) TriplesJ::countTriples@35 (line 13)
; {runtime_call UncommonTrapBlob}
0x000001801f812bec: mov $0xfffffff6,%edx
0x000001801f812bf1: xchg %ax,%ax
0x000001801f812bf3: call 0x000001801f106900 ; ImmutableOopMap {}
;*arraylength {reexecute=0 rethrow=0 return_oop=0}
; - TriplesJ::countTriples@1 (line 8)
; {runtime_call UncommonTrapBlob}
0x000001801f812bf8: movabs $0x1801f812a40,%r10 ; {internal_word}
0x000001801f812c02: mov %r10,0x358(%r15)
0x000001801f812c09: jmp 0x000001801f107a00 ; {runtime_call SafepointBlob}
0x000001801f812c0e: hlt
0x000001801f812c0f: hlt
0x000001801f812c10: hlt
0x000001801f812c11: hlt
0x000001801f812c12: hlt
0x000001801f812c13: hlt
0x000001801f812c14: hlt
0x000001801f812c15: hlt
0x000001801f812c16: hlt
0x000001801f812c17: hlt
0x000001801f812c18: hlt
0x000001801f812c19: hlt
0x000001801f812c1a: hlt
0x000001801f812c1b: hlt
0x000001801f812c1c: hlt
0x000001801f812c1d: hlt
0x000001801f812c1e: hlt
0x000001801f812c1f: hlt
[Exception Handler]
0x000001801f812c20: jmp 0x000001801f118800 ; {no_reloc}
[Deopt Handler Code]
0x000001801f812c25: call 0x000001801f812c2a
0x000001801f812c2a: subq $0x5,(%rsp)
0x000001801f812c2f: jmp 0x000001801f106ca0 ; {runtime_call DeoptimizationBlob}
0x000001801f812c34: hlt
0x000001801f812c35: hlt
0x000001801f812c36: hlt
0x000001801f812c37: hlt
--------------------------------------------------------------------------------
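While I am not an assembly expert and it's a bit hard for me to read this printout, I can see that it uses the vmovd instruction and xmm registers, which suggests the use of SIMD, or automatic code vectorization, on the JVM. One caveat: vmovd by itself only moves a single 32-bit value between a general-purpose register and an XMM register, so its presence alone does not prove that the arithmetic is vectorized.
(Feel free to leave your comments if my analysis is mistaken in this part.)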
Let's compare it with what .NET prints:
$ $env:DOTNET_JitDisasm = 'CountTriples'
$ dotnet run -c Release -- ints1k.txt 100
; Assembly listing for method TriplesCS:CountTriples(int[],int):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Blended PGO
; rsp based frame
; fully interruptible
; with Blended PGO: edge weights are invalid, and fgCalledCount is 100
G_M000_IG01: ;; offset=0x0000
push rdi
push rsi
push rbp
push rbx
G_M000_IG02: ;; offset=0x0004
mov eax, dword ptr [rcx+0x08]
xor r8d, r8d
xor r10d, r10d
test eax, eax
jle G_M000_IG23
G_M000_IG03: ;; offset=0x0015
lea r9d, [r10+0x01]
mov r11d, r9d
cmp r11d, eax
jge G_M000_IG20
G_M000_IG04: ;; offset=0x0025
test r11d, r11d
jl SHORT G_M000_IG16
mov r10d, r10d
mov r10d, dword ptr [rcx+4*r10+0x10]
G_M000_IG05: ;; offset=0x0032
lea ebx, [r11+0x01]
mov esi, ebx
cmp esi, eax
jge SHORT G_M000_IG13
G_M000_IG06: ;; offset=0x003C
test esi, esi
jl SHORT G_M000_IG11
mov r11d, r11d
mov r11d, dword ptr [rcx+4*r11+0x10]
align [0 bytes for IG07]
G_M000_IG07: ;; offset=0x0048
lea edi, [r10+r11]
mov ebp, esi
add edi, dword ptr [rcx+4*rbp+0x10]
cmp edi, edx
je SHORT G_M000_IG10
G_M000_IG08: ;; offset=0x0056
inc esi
cmp esi, eax
jl SHORT G_M000_IG07
G_M000_IG09: ;; offset=0x005C
jmp SHORT G_M000_IG13
G_M000_IG10: ;; offset=0x005E
inc r8d
jmp SHORT G_M000_IG08
G_M000_IG11: ;; offset=0x0063
mov edi, r11d
mov edi, dword ptr [rcx+4*rdi+0x10]
add edi, r10d
mov ebp, esi
add edi, dword ptr [rcx+4*rbp+0x10]
cmp edi, edx
je SHORT G_M000_IG15
G_M000_IG12: ;; offset=0x0077
inc esi
cmp esi, eax
jl SHORT G_M000_IG11
G_M000_IG13: ;; offset=0x007D
mov r11d, ebx
cmp r11d, eax
jl SHORT G_M000_IG05
G_M000_IG14: ;; offset=0x0085
jmp SHORT G_M000_IG20
G_M000_IG15: ;; offset=0x0087
inc r8d
jmp SHORT G_M000_IG12
G_M000_IG16: ;; offset=0x008C
lea ebx, [r11+0x01]
mov esi, ebx
cmp esi, eax
jge SHORT G_M000_IG19
G_M000_IG17: ;; offset=0x0096
mov edi, r10d
mov edi, dword ptr [rcx+4*rdi+0x10]
mov ebp, r11d
add edi, dword ptr [rcx+4*rbp+0x10]
mov ebp, esi
add edi, dword ptr [rcx+4*rbp+0x10]
cmp edi, edx
je SHORT G_M000_IG22
G_M000_IG18: ;; offset=0x00AE
inc esi
cmp esi, eax
jl SHORT G_M000_IG17
G_M000_IG19: ;; offset=0x00B4
mov r11d, ebx
cmp r11d, eax
jl SHORT G_M000_IG16
G_M000_IG20: ;; offset=0x00BC
mov r10d, r9d
cmp r10d, eax
jl G_M000_IG03
G_M000_IG21: ;; offset=0x00C8
jmp SHORT G_M000_IG23
G_M000_IG22: ;; offset=0x00CA
inc r8d
jmp SHORT G_M000_IG18
G_M000_IG23: ;; offset=0x00CF
mov eax, r8d
G_M000_IG24: ;; offset=0x00D2
pop rbx
pop rbp
pop rsi
pop rdi
ret
; Total bytes of code 215
The first important detail is that we were a bit too optimistic about not needing a warm-up in .NET: you can see that the final version of the method gets emitted somewhere after the 63rd iteration. The second important detail is that this is extremely simple x86-64 assembly: no SIMD operations or XMM registers are mentioned whatsoever. You don't even have to be an assembly expert to notice that.
So it looks like the JVM has vectorized its code, while .NET hasn't. Could we help it a little bit? Of course! Let's rewrite the C# part (thanks to @vanbukin, @EgorBo, and @ilchert for their collaboration on this code):
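// Requires: using System.Numerics;
private static int CountTriples(int[] arr, int sum)
{
    int n = arr.Length;
    int count = 0;
    // Number of ints that fit into one hardware vector.
    var vCount = Vector<int>.Count;
    // Reference to the first element; note that arr[0] throws for an empty array.
    ref int p = ref arr[0];
    // The target sum, broadcast to every lane.
    var sumVector = new Vector<int>(sum);
    for (int i = 0; i < n; i++)
    {
        for (int j = i + 1; j < n; j++)
        {
            var ijSum = arr[i] + arr[j];
            // arr[i] + arr[j], broadcast to every lane.
            var ijSumVector = new Vector<int>(ijSum);
            var k = j + 1;
            // Vectorized part: process vCount values of arr[k] at once.
            for (; k < n - vCount; k += vCount)
            {
                var kVector = Vector.LoadUnsafe(ref p, (nuint)k);
                var ijkSumVector = kVector + ijSumVector;
                // Matching lanes become -1 (all bits set), the rest become 0.
                var subResult = Vector.Equals(sumVector, ijkSumVector);
                if (subResult != Vector<int>.Zero)
                {
                    // The lane sum is minus the number of matches, hence the subtraction.
                    var sumCount = Vector.Sum(subResult);
                    count -= sumCount;
                }
            }
            // Scalar tail: the remaining elements that did not fill a whole vector.
            for (; k < n; k++)
                count += ijSum + arr[k] == sum ? 1 : 0;
        }
    }
    return count;
}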
This code uses the explicit Vector<T> type that supports vectorization in .NET, and thus is more complex than the straightforward implementation. In my opinion, the most complex part is the requirement to handle a non-vectorized "tail" that didn't fit into the vector size: it is easy to miss and end up with an incorrect result. But it is significantly faster than the JVM version:
20.58 ms +/- 1.45
On some computers, it is even faster than the C++ implementation (which relies on auto-vectorization by the compiler as well).
Let's examine the assembly.
; Assembly listing for method TriplesCS:CountTriples(int[],int):int (Tier1)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; Tier1 code
; optimized code
; optimized using Blended PGO
; rsp based frame
; fully interruptible
; with Blended PGO: edge weights are invalid, and fgCalledCount is 100
G_M000_IG01: ;; offset=0x0000
push r15
push r14
push r13
push rdi
push rsi
push rbp
push rbx
sub rsp, 32
vzeroupper
G_M000_IG02: ;; offset=0x0011
mov eax, dword ptr [rcx+0x08]
mov r8d, eax
xor r10d, r10d
test eax, eax
je G_M000_IG28
lea r9, bword ptr [rcx+0x10]
vmovd xmm0, edx
vpbroadcastd ymm0, ymm0
xor r11d, r11d
test r8d, r8d
jle G_M000_IG26
G_M000_IG03: ;; offset=0x003B
mov ebx, r8d
sub ebx, 8
G_M000_IG04: ;; offset=0x0041
lea esi, [r11+0x01]
mov edi, esi
cmp edi, r8d
jge G_M000_IG23
G_M000_IG05: ;; offset=0x0050
test edi, edi
jl G_M000_IG17
mov r11d, r11d
mov r11d, dword ptr [rcx+4*r11+0x10]
G_M000_IG06: ;; offset=0x0060
mov ebp, edi
mov r14d, r11d
add r14d, dword ptr [rcx+4*rbp+0x10]
vmovd xmm1, r14d
vpbroadcastd ymm1, ymm1
inc edi
mov ebp, edi
cmp ebx, ebp
jle SHORT G_M000_IG09
align [4 bytes for IG07]
G_M000_IG07: ;; offset=0x0080
movsxd r15, ebp
vpaddd ymm2, ymm1, ymmword ptr [r9+4*r15]
vpcmpeqd ymm2, ymm0, ymm2
vptest ymm2, ymm2
jne SHORT G_M000_IG13
G_M000_IG08: ;; offset=0x0094
add ebp, 8
cmp ebx, ebp
jg SHORT G_M000_IG07
G_M000_IG09: ;; offset=0x009B
cmp ebp, r8d
jge SHORT G_M000_IG15
G_M000_IG10: ;; offset=0x00A0
test ebp, ebp
jl SHORT G_M000_IG14
align [0 bytes for IG11]
G_M000_IG11: ;; offset=0x00A4
mov r15d, ebp
mov r13d, r14d
add r13d, dword ptr [rcx+4*r15+0x10]
xor r15d, r15d
cmp r13d, edx
sete r15b
add r15d, r10d
mov r10d, r15d
inc ebp
cmp ebp, r8d
jl SHORT G_M000_IG11
G_M000_IG12: ;; offset=0x00C6
jmp SHORT G_M000_IG15
G_M000_IG13: ;; offset=0x00C8
vphaddd ymm3, ymm2, ymm2
vphaddd ymm4, ymm3, ymm3
vextracti128 xmm2, ymm4, 1
vpaddd xmm3, xmm2, xmm4
vmovd r15d, xmm3
sub r10d, r15d
jmp SHORT G_M000_IG08
G_M000_IG14: ;; offset=0x00E6
cmp ebp, eax
jae G_M000_IG28
mov r15d, ebp
mov r13d, r14d
add r13d, dword ptr [rcx+4*r15+0x10]
xor r15d, r15d
cmp r13d, edx
sete r15b
add r15d, r10d
mov r10d, r15d
inc ebp
cmp ebp, r8d
jl SHORT G_M000_IG14
G_M000_IG15: ;; offset=0x0110
cmp edi, r8d
jl G_M000_IG06
G_M000_IG16: ;; offset=0x0119
jmp SHORT G_M000_IG23
G_M000_IG17: ;; offset=0x011B
mov r14d, r11d
mov ebp, dword ptr [rcx+4*r14+0x10]
mov r14d, edi
add ebp, dword ptr [rcx+4*r14+0x10]
mov r14d, ebp
vmovd xmm1, r14d
vpbroadcastd ymm1, ymm1
inc edi
mov ebp, edi
cmp ebx, ebp
jle SHORT G_M000_IG20
G_M000_IG18: ;; offset=0x0140
movsxd r15, ebp
vpaddd ymm2, ymm1, ymmword ptr [r9+4*r15]
vpcmpeqd ymm2, ymm0, ymm2
vptest ymm2, ymm2
jne SHORT G_M000_IG25
G_M000_IG19: ;; offset=0x0154
add ebp, 8
cmp ebx, ebp
jg SHORT G_M000_IG18
G_M000_IG20: ;; offset=0x015B
cmp ebp, r8d
jge SHORT G_M000_IG22
G_M000_IG21: ;; offset=0x0160
cmp ebp, eax
jae SHORT G_M000_IG28
mov r15d, ebp
mov r13d, r14d
add r13d, dword ptr [rcx+4*r15+0x10]
xor r15d, r15d
cmp r13d, edx
sete r15b
add r15d, r10d
mov r10d, r15d
inc ebp
cmp ebp, r8d
jl SHORT G_M000_IG21
G_M000_IG22: ;; offset=0x0186
cmp edi, r8d
jl SHORT G_M000_IG17
G_M000_IG23: ;; offset=0x018B
mov r11d, esi
cmp r11d, r8d
jl G_M000_IG04
G_M000_IG24: ;; offset=0x0197
jmp SHORT G_M000_IG26
G_M000_IG25: ;; offset=0x0199
vphaddd ymm3, ymm2, ymm2
vphaddd ymm4, ymm3, ymm3
vextracti128 xmm2, ymm4, 1
vmovaps ymm3, ymm4
vpaddd xmm2, xmm2, xmm3
vmovd r15d, xmm2
sub r10d, r15d
jmp SHORT G_M000_IG19
G_M000_IG26: ;; offset=0x01BB
mov eax, r10d
G_M000_IG27: ;; offset=0x01BE
vzeroupper
add rsp, 32
pop rbx
pop rbp
pop rsi
pop rdi
pop r13
pop r14
pop r15
ret
G_M000_IG28: ;; offset=0x01D0
call CORINFO_HELP_RNGCHKFAIL
int3
; Total bytes of code 470
Okay, now we are talking: you can easily see a lot of SIMD instructions and registers here. Once again, I am not a SIMD expert, so it's a bit hard for me to describe exactly what is going on here, but hopefully it's possible to figure it out with the original source above at hand.
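One way to follow the listing is to mimic the vectorized loop in plain scalar Java. The sketch below is an emulation for illustration only. It performs no SIMD at all. It assumes 8 lanes, matching the 256-bit ymm registers and the `add ebp, 8` step in the listing, and the class and method names are invented. The comments point at the instructions that appear to play each role.

```java
import java.util.Random;

// Hypothetical scalar emulation of the vectorized C# loop; it uses no real SIMD.
class LaneEmulation {
    // Assumption: eight 32-bit lanes, as with a 256-bit ymm register.
    static final int LANES = 8;

    static int countTriplesBlocked(int[] arr, int sum) {
        int n = arr.length;
        int count = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                // In the listing this value is broadcast to all lanes (vpbroadcastd).
                int ijSum = arr[i] + arr[j];
                int k = j + 1;
                // Block loop: one iteration stands for one vector-wide step.
                for (; k < n - LANES; k += LANES) {
                    int maskSum = 0;
                    for (int lane = 0; lane < LANES; lane++) {
                        // Lane-wise add (vpaddd) and compare (vpcmpeqd):
                        // a matching lane yields -1, any other lane yields 0.
                        int mask = (arr[k + lane] + ijSum == sum) ? -1 : 0;
                        // Horizontal sum of the mask (vphaddd, vextracti128, vpaddd).
                        maskSum += mask;
                    }
                    // The mask sum is minus the match count, hence "sub r10d, r15d".
                    count -= maskSum;
                }
                // Scalar tail for the elements that did not fill a whole block.
                for (; k < n; k++) {
                    if (ijSum + arr[k] == sum) count++;
                }
            }
        }
        return count;
    }

    // Reference implementation: the straightforward triple loop.
    static int countTriplesNaive(int[] arr, int sum) {
        int n = arr.length;
        int count = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                for (int k = j + 1; k < n; k++)
                    if (arr[i] + arr[j] + arr[k] == sum) count++;
        return count;
    }

    public static void main(String[] args) {
        // Fixed seed so that repeated runs see the same data.
        Random random = new Random(1);
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) data[i] = random.nextInt(41) - 20;
        System.out.println("blocked: " + countTriplesBlocked(data, 0));
        System.out.println("naive:   " + countTriplesNaive(data, 0));
    }
}
```

Comparing the blocked version against the naive one on arrays of many different lengths is a cheap way to catch a forgotten or off-by-one tail.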
What can we conclude? The JVM has auto-vectorization that just rocks in certain cases, at no cost to the user. Simple code may be very fast after the C2 JIT. .NET doesn't provide the same JIT facility, so you sometimes have to optimize the code manually (which is always tricky, so make sure you have a good test suite!). But if you do that, then oh boy, it is fast: it is on par with what the C++ compiler produces.