V.B. V.B. - 3 months ago 17
C# Question

C# marker structures performance

In F#, we have several very nice solutions for design-time type safety: type aliases and single-case struct union (and no implicit conversions to start with!):

// type aliases are erased at compile time
type Offset = int64<offset>

// no allocations
[<Struct>]
type Offset = Offset of int64


What would be an alternative for C#?

I have never seen a practical usage of marker structures (containing a single element), but it looks like if we add explicit type conversions then we could get design-time behavior very similar to type aliases in F#. That is - IDE will complain about type mismatches and one will have to explicitly cast values.

Below is some POC code:

public struct Offset {
private readonly long _value;
private Offset(long value) {
_value = value;
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static explicit operator Offset(long value) {
return new Offset(value);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static explicit operator long(Offset offset) {
return offset._value;
}
}

public interface IIndex<T> {
Offset OffsetOf(T value);
T AtOffset(Offset offset);
}

public class SmapleUsage
{
public void Test(IIndex<long> idx)
{
// without explicit cast we have nice red squiggles
var valueAt = idx.AtOffset((Offset)123);
long offset = (long)idx.OffsetOf(42L);
}
}


So, the IDE thing is nice! But I was going to ask what are performance implications and other downsides, and to avoid "just measure it" comments have just measured it and stopped writing this question initially... But the results came out counter-intuitive:

[Test]
public void OffsetTests() {
var array = Enumerable.Range(0, 1024).ToArray();
var sw = new Stopwatch();

for (int rounds = 0; rounds < 10; rounds++) {
sw.Restart();
long sum = 0;
for (int rp = 0; rp < 1000000; rp++) {
for (int i = 0; i < array.Length; i++) {
sum += GetAtIndex(array, i);
}
}
sw.Stop();
if (sum < 0) throw new Exception(); // use sum after loop
Console.WriteLine($"Index: {sw.ElapsedMilliseconds}");

sw.Restart();
sum = 0;
for (int rp = 0; rp < 1000000; rp++) {
for (int i = 0; i < array.Length; i++) {
sum += GetAtOffset(array, (Offset)i);
}
}
if (sum < 0) throw new Exception(); // use sum after loop
sw.Stop();
Console.WriteLine($"Offset: {sw.ElapsedMilliseconds}");

sw.Restart();
sum = 0;
for (int rp = 0; rp < 1000000; rp++) {
for (int i = 0; i < array.Length; i++) {
sum += array[i];
}
}
if (sum < 0) throw new Exception(); // use sum after loop
sw.Stop();
Console.WriteLine($"Direct: {sw.ElapsedMilliseconds}");
}
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int GetAtIndex(int[] array, long index) {
return array[index];
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private int GetAtOffset(int[] array, Offset offset) {
return array[(long)offset];
}


Surprisingly, on i7@2.2Hz x64/Release the case with
Offset
is visibly faster on every test round - typical values are:

Int64: 1046
Offset: 932
Direct: 730


I would expect equal or slower result compared to just using
int64
. So what is going one here? Could you reproduce the same difference or spot some deficiency, e.g. if I measure different things?

Answer

1. Once you replace for (int i = 0; with for (long i = 0; in the Int64 test, the performance will be identical to the Direct test.

While using int it generates such x86-64 instructions:

inc         ecx  
cmp         ecx,0F4240h

While using long it generates such x86-64 instructions:

inc         rcx  
cmp         rcx,0F4240h  

So the only difference in using 32-bit register ecx or its 64-bit version rcx, where later is faster due to a CPU design.

2. Use long for iterator in Offset test, and you'll see similar performance.

3. Because the code is optimized in release mode, there is almost no difference between using Int64 or Offset, however at some point the instructions are little bit re-arranged.

While using Offset (one instruction less):

movsxd      rdx,eax  
movsxd      r8,r14d  
cmp         rdx,r8  
jae         <address>  

While using Int64 (one instruction more):

movsxd      rdx,r14d  
movsxd      r8,eax  
cmp         r8,rdx  
jae         <address>  
movsxd      rdx,eax  

4. The direct test is the fastest, because it does not do array boundary checks with instructions shown above at #3. This optimization happens when you write a loop like this:

for (var i=0; i<array.Length; i++) { ... array[i] ... }

Normally, if your index is outside of the bounds of the array it throws the IndexOutOfRangeException, but in this case compiler knows that it cannot happen, so it omits the check.

Then, even having extra instruction in other tests, they have similar performance due to CPU branch predictor, which starts running instructions in advance if needed, and discards the results if condition fails.

Comments