Michal Michal - 3 months ago 11
C# Question

Accessing 5-byte struct is much slower than 8-byte

I have a code which updates an array based on values in another, small array.

for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}


Where
c
is a struct that is a pair of bytes representing cards from a deck.
one
is an array size of 52 (with entries for each of 52 cards from a deck)

I wrote a benchmark to profile this code:

private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}


Setting
testRepetitions
= 25 million, and using array of 256 elements (
result.Length = 256
), it runs in about 8.5 seconds on my machine.

Here is the
Cards
struct:

struct Cards
{
public byte C0;
public byte C1;

public Cards(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
}
}


When I modify that struct to hold 5 cards (5 bytes), the same benchmark now takes ~13s.
Why would that happen? The computation is the same, remaining 3 cards are unused, and all arrays are small enough to fit in L1 cache.

What is even stranger, is that if I further change Cards to now hold 8 bytes, benchmark is now faster, taking ~10 seconds.

My Setup:

VS 2015 Update 3.
.NET 4.6.2
Release Build x64
CPU: Haswell i7-5820K CPU @ 3.30GHz


Here are exact timings I got:

Test With 2 Cards. Time = 8582 ms
Test With 5 Cards. Time = 12910 ms
Test With 8 Cards. Time = 10180 ms


What is going on here?

Benchmark Code:

class TestAdjustment
{
public void Test()
{
using (Process p = Process.GetCurrentProcess())
p.PriorityClass = ProcessPriorityClass.High;

var size = 256;

float[] one = ArrayUtils.CreateRandomFloatArray(size:52);
int[] card0 = ArrayUtils.RandomIntArray(size, minValue:0, maxValueInclusive:51);
int[] card1 = ArrayUtils.RandomIntArray(size, minValue: 0, maxValueInclusive: 51);

Cards[] cards = CreateCardsArray(card0, card1);
Cards5[] cards5 = CreateCards5Array(card0, card1);
Cards8[] cards8 = CreateCards8Array(card0, card1);

float[] result = ArrayUtils.CreateRandomFloatArray(size);
float[] resultClone = result.ToArray();


var testRepetitions = 25*1000*1000;

var sw = Stopwatch.StartNew();


TestCards2(testRepetitions, result, one, cards);
WriteLine($"Test With 2 Cards. Time = {sw.ElapsedMilliseconds} ms");
result = resultClone.ToArray(); //restore original array from the clone, so that next method works on the same data
sw.Restart();


TestCards5(testRepetitions, result, one, cards5);
WriteLine($"Test With 5 Cards. Time = {sw.ElapsedMilliseconds} ms");
result = resultClone.ToArray();
sw.Restart();


TestCards8(testRepetitions, result, one, cards8);
WriteLine($"Test With 8 Cards. Time = {sw.ElapsedMilliseconds} ms");


}



private void TestCards2(int testRepetitions, float[] result, float[] one, Cards[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}

private void TestCards5(int testRepetitions, float[] result, float[] one, Cards5[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}


private void TestCards8(int testRepetitions, float[] result, float[] one, Cards8[] cards)
{
for (var r = 0; r < testRepetitions; r++)
for (var i = 0; i < result.Length; i++)
{
var c = cards[i];
result[i] -= one[c.C0] + one[c.C1];
}
}


private Cards[] CreateCardsArray(int[] c0, int[] c1)
{
var result = new Cards[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards((byte)c0[i], (byte)c1[i]);

return result;
}

private Cards5[] CreateCards5Array(int[] c0, int[] c1)
{
var result = new Cards5[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards5((byte)c0[i], (byte)c1[i]);

return result;
}

private Cards8[] CreateCards8Array(int[] c0, int[] c1)
{
var result = new Cards8[c0.Length];
for (var i = 0; i < result.Length; i++)
result[i] = new Cards8((byte)c0[i], (byte)c1[i]);

return result;
}
}


struct Cards
{
public byte C0;
public byte C1;

public Cards(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
}
}

struct Cards5
{
public byte C0, C1, C2, C3, C4;

public Cards5(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
C2 = C3 = C4 = 0;
}
}

struct Cards8
{
public byte C0, C1, C2, C3, C4, C5, C6, C7;


public Cards8(byte c0, byte c1)
{
C0 = c0;
C1 = c1;
C2 = C3 = C4 = C5 = C6 = C7 = 0;
}
}


Edit
I've rerun the benchmark again, with 100 million iterations. Here are the results:

Test With 5 Cards. Time = 52245 ms
Test With 8 Cards. Time = 40531 ms


And in reverse order:

Test With 8 Cards. Time = 41041 ms
Test With 5 Cards. Time = 52034 ms


Running it on Surface Pro 4 (Skylake i7-6650U Turbo-boosted to ~3.4ghz):

Test With 8 Cards. Time = 47913 ms
Test With 5 Cards. Time = 55182 ms


So the difference persists and doesn't depend on the order.

I also ran profiling using Intel VTune, and it shows CPI of
0.3
for "5 cards" version, and
0.27
for "8 cards".

Edit2 Added ArrayUtils class for creating initial random arrays.

public static class ArrayUtils
{
static Random rand = new Random(137);

public static float[] CreateRandomFloatArray(int size)
{
var result = new float[size];
for (int i = 0; i < size; i++)
result[i] = (float) rand.NextDouble();

return result;
}

public static int[] RandomIntArray(int size, int minValue, int maxValueInclusive)
{
var result = new int[size];
for (int i = 0; i < size; i++)
result[i] = rand.Next(minValue, maxValueInclusive + 1);

return result;

}
}

Answer

It's all about unaligned memory access. The unaligned memory ready latency is greater than the aligned memory read latency. To complete the experiment, let's add structs Cards3, Cards4 and so on. And let's look how the corresponded arrays are represented in memory.

enter image description here

Next, let's improve your benchmark.

  1. We will use BenchmarkDotNet (this tool will do a lot of benchmarking routine like warmup, auto-selection of iteration amount, statistics calculation, and so on).
  2. We will do our benchmark for all the Cards2..Cards8 arrays, not only 3 of them.
  3. Also we will check all the JIT compilers for the Full .NET Framework (LegacyJIT-x86, LegacyJIT-x64, RyuJIT-x64) and Mono.

Here is my environment:

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4810MQ CPU 2.80GHz, ProcessorCount=8
Frequency=2728068 ticks, Resolution=366.5598 ns, Timer=TSC
CLR1=MS.NET 4.0.30319.42000, Arch=64-bit RELEASE [RyuJIT]
CLR2=Mono JIT compiler version 4.4.0, Arch=32-bit
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0

And my results:

 Method | Platform |       Jit | Toolchain | Runtime |    Median |    StdDev |
------- |--------- |---------- |---------- |-------- |---------- |---------- |
     C2 |     Host |      Host |      Mono |    Mono | 3.9230 ns | 0.0532 ns |
     C3 |     Host |      Host |      Mono |    Mono | 4.8223 ns | 0.0920 ns |
     C4 |     Host |      Host |      Mono |    Mono | 5.9149 ns | 0.1207 ns |
     C5 |     Host |      Host |      Mono |    Mono | 6.3981 ns | 0.0913 ns |
     C6 |     Host |      Host |      Mono |    Mono | 7.1179 ns | 0.1222 ns |
     C7 |     Host |      Host |      Mono |    Mono | 7.6318 ns | 0.1269 ns |
     C8 |     Host |      Host |      Mono |    Mono | 8.4650 ns | 0.1497 ns |
     C2 |      X64 | LegacyJit |      Host |    Host | 2.3515 ns | 0.0150 ns |
     C3 |      X64 | LegacyJit |      Host |    Host | 4.2553 ns | 0.0700 ns |
     C4 |      X64 | LegacyJit |      Host |    Host | 1.4366 ns | 0.0385 ns |
     C5 |      X64 | LegacyJit |      Host |    Host | 2.3688 ns | 0.0359 ns |
     C6 |      X64 | LegacyJit |      Host |    Host | 2.3684 ns | 0.0404 ns |
     C7 |      X64 | LegacyJit |      Host |    Host | 3.0404 ns | 0.0664 ns |
     C8 |      X64 | LegacyJit |      Host |    Host | 1.4510 ns | 0.0333 ns |
     C2 |      X64 |    RyuJit |      Host |    Host | 1.9281 ns | 0.0306 ns |
     C3 |      X64 |    RyuJit |      Host |    Host | 2.1183 ns | 0.0348 ns |
     C4 |      X64 |    RyuJit |      Host |    Host | 1.9395 ns | 0.0397 ns |
     C5 |      X64 |    RyuJit |      Host |    Host | 2.7706 ns | 0.0387 ns |
     C6 |      X64 |    RyuJit |      Host |    Host | 2.6471 ns | 0.0513 ns |
     C7 |      X64 |    RyuJit |      Host |    Host | 2.9743 ns | 0.0541 ns |
     C8 |      X64 |    RyuJit |      Host |    Host | 2.6280 ns | 0.1526 ns |
     C2 |      X86 | LegacyJit |      Host |    Host | 3.0854 ns | 0.2172 ns |
     C3 |      X86 | LegacyJit |      Host |    Host | 3.1627 ns | 0.1126 ns |
     C4 |      X86 | LegacyJit |      Host |    Host | 3.0577 ns | 0.0929 ns |
     C5 |      X86 | LegacyJit |      Host |    Host | 5.0957 ns | 0.1601 ns |
     C6 |      X86 | LegacyJit |      Host |    Host | 6.1723 ns | 0.1177 ns |
     C7 |      X86 | LegacyJit |      Host |    Host | 7.1155 ns | 0.0803 ns |
     C8 |      X86 | LegacyJit |      Host |    Host | 3.7703 ns | 0.1276 ns |

enter image description here

Full source code:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Attributes.Exporters;
using BenchmarkDotNet.Attributes.Jobs;
using BenchmarkDotNet.Running;

[LegacyJitX86Job, LegacyJitX64Job, RyuJitX64Job, MonoJob]
[RPlotExporter]
public class CardBenchmarks
{
  public const int Size = 256;

  private float[] result, one;
  private Cards2[] cards2;
  private Cards3[] cards3;
  private Cards4[] cards4;
  private Cards5[] cards5;
  private Cards6[] cards6;
  private Cards7[] cards7;
  private Cards8[] cards8;

  [Setup]
  public void Setup()
  {
    result = ArrayUtils.CreateRandomFloatArray(Size);
    one = ArrayUtils.CreateRandomFloatArray(size: 52);
    var c0 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);
    var c1 = ArrayUtils.RandomByteArray(Size, minValue: 0, maxValueInclusive: 51);

    cards2 = CardUtls.Create2(c0, c1);
    cards3 = CardUtls.Create3(c0, c1);
    cards4 = CardUtls.Create4(c0, c1);
    cards5 = CardUtls.Create5(c0, c1);
    cards6 = CardUtls.Create6(c0, c1);
    cards7 = CardUtls.Create7(c0, c1);
    cards8 = CardUtls.Create8(c0, c1);
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C2()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards2[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C3()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards3[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C4()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards4[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C5()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards5[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C6()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards6[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C7()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards7[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }

  [Benchmark(OperationsPerInvoke = Size)]
  public void C8()
  {
    for (var i = 0; i < result.Length; i++)
    {
      var c = cards8[i];
      result[i] -= one[c.C0] + one[c.C1];
    }
  }
}

public static class ArrayUtils
{
  private static readonly Random Rand = new Random(137);

  public static float[] CreateRandomFloatArray(int size)
  {
    var result = new float[size];
    for (int i = 0; i < size; i++)
      result[i] = (float) Rand.NextDouble();
    return result;
  }

  public static byte[] RandomByteArray(int size, int minValue, int maxValueInclusive)
  {
    var result = new byte[size];
    for (int i = 0; i < size; i++)
      result[i] = (byte) Rand.Next(minValue, maxValueInclusive + 1);
    return result;
  }
}

public static class CardUtls
{
  private static T[] Create<T>(int length, Func<int, T> create) => Enumerable.Range(0, length).Select(create).ToArray();

  public static Cards2[] Create2(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards2 {C0 = c0[i], C1 = c1[i]});
  public static Cards3[] Create3(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards3 {C0 = c0[i], C1 = c1[i]});
  public static Cards4[] Create4(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards4 {C0 = c0[i], C1 = c1[i]});
  public static Cards5[] Create5(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards5 {C0 = c0[i], C1 = c1[i]});
  public static Cards6[] Create6(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards6 {C0 = c0[i], C1 = c1[i]});
  public static Cards7[] Create7(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards7 {C0 = c0[i], C1 = c1[i]});
  public static Cards8[] Create8(byte[] c0, byte[] c1) => Create(c0.Length, i => new Cards8 {C0 = c0[i], C1 = c1[i]});
}

public struct Cards2
{
  public byte C0, C1;
}

public struct Cards3
{
  public byte C0, C1, C2;
}

public struct Cards4
{
  public byte C0, C1, C2, C3;
}

public struct Cards5
{
  public byte C0, C1, C2, C3, C4;
}

public struct Cards6
{
  public byte C0, C1, C2, C3, C4, C5;
}

public struct Cards7
{
  public byte C0, C1, C2, C3, C4, C5, C6;
}

public struct Cards8
{
  public byte C0, C1, C2, C3, C4, C5, C6, C7;
}


class Program
{
  static void Main()
  {
    BenchmarkRunner.Run<CardBenchmarks>();
  }
}

Answer

As you can see, your benchmark is a hard one, there are a lot of different factors which affect your performance. One of the most important things is how your runtime layouts your data. For example, you will not observe the described behavior on Mono, because Mono and the Full Framework have different layout algorithms (on Mono we have Marshal.SizeOf(typeof(Card2)) == 8).

We have Time(Card5) > Time(Card8) on the Full Framework because Card5 produces a lot of unaligned reads unlike Card8 (see the first picture).

Update

Question from the comment:

Any idea why 3-byte performs better than 8-byte on your RyuJIT64 benchmark?

Let's look at the asm code:

; RyuJIT-x64, C3
                var c = cards3[i];
00007FFEDF0CADCE  mov         r10,r9  
00007FFEDF0CADD1  mov         r11d,dword ptr [r10+8]  
00007FFEDF0CADD5  cmp         eax,r11d  
00007FFEDF0CADD8  jae         00007FFEDF0CAE44  
00007FFEDF0CADDA  movsxd      r11,eax  
00007FFEDF0CADDD  imul        r11,r11,3  
00007FFEDF0CADE1  lea         r10,[r10+r11+10h]  
00007FFEDF0CADE6  movzx       r11d,byte ptr [r10]          ; !!!
00007FFEDF0CADEA  movzx       r10d,byte ptr [r10+1]        ; !!!

; RyuJIT-x64, C8
                var c = cards8[i];
00007FFEDF0CAE8C  mov         rdx,qword ptr [rcx+48h]  
00007FFEDF0CAE90  mov         r8d,dword ptr [rdx+8]  
00007FFEDF0CAE94  cmp         eax,r8d  
00007FFEDF0CAE97  jae         00007FFEDF0CAF0C  
00007FFEDF0CAE99  movsxd      r8,eax  
00007FFEDF0CAE9C  mov         rdx,qword ptr [rdx+r8*8+10h] ; !!! 
00007FFEDF0CAEA1  mov         qword ptr [rsp+28h],rdx      ; !!!

In the C3 case, RyuJIT keeps the target bytes in the r11d, r10d registers; in the C8 case, RyuJIT keeps them on stack (qword ptr [rsp+28h]). The explanation: the current version of RyuJIT (v4.6.1080.0 in my case) does scalar replacement for structs with no more than 4 fields (see coreclr#6839). Thus, RyuJIT performance of C5, C6, C7, and C8 is worse than performance of C2, C3, C4. Be careful: this behavior may be changed in the future versions of RyuJIT.