Performance impact of array writes is much larger than expected

I stumbled upon this effect while debugging an application - see the repro code below.

It gives me the following results:

Data init, count: 100,000 x 10,000, 4.6133365 secs
Perf test 0 (False): 5.8289565 secs
Perf test 0 (True): 5.8485172 secs
Perf test 1 (False): 32.3222312 secs
Perf test 1 (True): 217.0089923 secs

As far as I understand, the array store operations shouldn't normally have such a drastic performance effect (32 vs 217 seconds). I wonder if anyone understands what effects are at play here?

UPD: extra test added. Perf test 0 behaves as expected; Perf test 1 shows the performance anomaly.

using System;
using System.Diagnostics;

class Program
{
    static void Main(string[] args)
    {
        var data = InitData();

        TestPerf0(data, false);
        TestPerf0(data, true);

        TestPerf1(data, false);
        TestPerf1(data, true);

        if (Debugger.IsAttached)
            Console.ReadKey();
    }

    private static string[] InitData()
    {
        var watch = Stopwatch.StartNew();

        var data = new string[100_000];
        var maxString = 10_000;

        for (int i = 0; i < data.Length; i++)
        {
            data[i] = new string('-', maxString);
        }

        watch.Stop();
        Console.WriteLine($"Data init, count: {data.Length:n0} x {maxString:n0}, {watch.Elapsed.TotalSeconds} secs");

        return data;
    }

    private static void TestPerf1(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();

        var counters = new int[char.MaxValue]; // note: one slot short for '\uffff', but safe here since the data only contains '-'
        int tmp = 0;

        for (var j = 0; ; j++)
        {
            var allEmpty = true;

            for (var i = 0; i < vals.Length; i++)
            {
                var val = vals[i];

                if (j < val.Length)
                {
                    allEmpty = false;

                    var ch = val[j];
                    var count = counters[ch];
                    tmp ^= count;

                    if (testStore)
                        counters[ch] = count + 1;
                }
            }

            if (allEmpty)
                break;
        }

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();

        watch.Stop();
        Console.WriteLine($"Perf test 1 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }

    private static void TestPerf0(string[] vals, bool testStore)
    {
        var watch = Stopwatch.StartNew();

        var counters = new int[65536];
        int tmp = 0;

        for (var i = 0; i < 1_000_000_000; i++)
        {
            var j = i % counters.Length;
            var count = counters[j];
            tmp ^= count;

            if (testStore)
                counters[j] = count + 1;
        }

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();

        watch.Stop();
        Console.WriteLine($"Perf test 0 ({testStore}): {watch.Elapsed.TotalSeconds} secs");
    }
}

1 Answer

After testing your code for quite some time, my best guess is, as already said in the comments, that you are experiencing a lot of cache misses with your current solution. The lines:

if (testStore)
    counters[ch] = count + 1;

may force the CPU to load a whole new cache line and displace the line it currently holds. There might also be some problems with branch prediction in this scenario. This is highly hardware-dependent, and I'm not aware of a really good way to test this from a managed language (it's also quite hard in compiled languages, where the hardware is fixed and well-known).
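
To make the cache-line cost visible in isolation, here is a minimal standalone sketch (my own illustration, not from the original code, assuming a typical 64-byte cache line) that touches every element of a large array and then only every 16th element:

using System;
using System.Diagnostics;

class CacheLineDemo
{
    static void Main()
    {
        // 256 MB of ints - far larger than any CPU cache, so the
        // cache lines must come from main memory.
        var data = new int[64 * 1024 * 1024];

        // Touch every element: 16 ints (64 bytes) share one cache line.
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < data.Length; i++)
            data[i] *= 3;
        Console.WriteLine($"stride 1:  {watch.Elapsed.TotalSeconds} secs");

        // Touch only every 16th element: one int per cache line.
        watch.Restart();
        for (int i = 0; i < data.Length; i += 16)
            data[i] *= 3;
        Console.WriteLine($"stride 16: {watch.Elapsed.TotalSeconds} secs");
    }
}

On typical hardware the two loops finish in comparable time even though the second one does a sixteenth of the arithmetic, because both have to pull every cache line of the array from memory - the memory traffic, not the multiplication, dominates.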

Going through the disassembly, you can clearly see that the store also introduces a whole bunch of new instructions, which might aggravate the aforementioned problems further.

[disassembly screenshot omitted]

Overall I'd advise you to rewrite the complete algorithm, as there are better places to improve performance than picking at this one little assignment. These are the optimizations I'd suggest (they also improve readability):

  1. Invert your i and j loops. This removes the allEmpty variable completely (see the sketch after this list).
  2. Cast ch to int with var ch = (int) val[j]; - because you ALWAYS use it as an index.
  3. Think about why this might be a problem at all. You introduce a new instruction, and any instruction comes at a cost. If this is really the primary "hot spot" of your code, you can start to think about better solutions (remember: "premature optimization is the root of all evil").
  4. As the name suggests, testStore is a test setting - is it important at all? If not, just remove it.
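
As a sketch, items 1 and 2 inside TestPerf1 would replace the two original loops with something like this (the foreach form in the EDIT below is equivalent):

for (var i = 0; i < vals.Length; i++)
{
    var val = vals[i];

    // The inner loop is bounded by this string's length,
    // so the allEmpty flag is no longer needed.
    for (var j = 0; j < val.Length; j++)
    {
        var ch = (int) val[j]; // item 2: cast once, use as an index
        var count = counters[ch];
        tmp ^= count;

        if (testStore)
            counters[ch] = count + 1;
    }
}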

EDIT: Why did I suggest inverting the loops? With this little rearrangement of code:

foreach (var val in vals)
{
    foreach (int ch in val)
    {
        var count = counters[ch];
        tmp ^= count;
        if (testStore)
        {
            counters[ch] = count + 1;
        }
    }
}

I went from runtimes like this:

[screenshot: runtimes before the change omitted]

to runtimes like this:

[screenshot: runtimes after the change omitted]

Do you still think it's not worth a try? I saved some orders of magnitude here and nearly eliminated the effect of the if (to be clear: all optimizations are disabled in the build settings). If there are special reasons not to do this, you should tell us more about the context in which this code will be used.


EDIT2: Now for the in-depth answer. My best explanation for why this problem occurs is that your access pattern keeps displacing cache lines. In the lines:

for (var i = 0; i < vals.Length; i++)
{
    var val = vals[i];

you load a really massive dataset, far bigger than a cache line itself. So the data will most likely need to be loaded fresh from memory into a new cache line on every iteration (displacing the old content). This is also known as "cache thrashing", if I remember correctly. Thanks to @mjwills for pointing this out in his comment.

In my suggested solution, on the other hand, the content of a cache line stays useful for as long as the inner loop has not run past it: a .NET char is 2 bytes, so a typical 64-byte cache line holds 32 consecutive characters of one string, giving up to 32 reads per line loaded, whereas the original access order jumps to a different 10,000-character string on every single access.
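
If you want to isolate just the traversal-order effect, here is a minimal standalone sketch (my own construction, not the original benchmark; it allocates roughly 2 GB, so shrink the sizes on smaller machines) that reads the same 100,000 x 10,000 dataset column-major and then row-major:

using System;
using System.Diagnostics;

class TraversalDemo
{
    static void Main()
    {
        // Same shape as the original data: 100,000 strings of 10,000 chars.
        var vals = new string[100_000];
        for (int i = 0; i < vals.Length; i++)
            vals[i] = new string('-', 10_000);

        int tmp = 0;

        // Column-major: every access lands in a different string,
        // so nearly every read hits a cold cache line.
        var watch = Stopwatch.StartNew();
        for (int j = 0; j < 10_000; j++)
            for (int i = 0; i < vals.Length; i++)
                tmp ^= vals[i][j];
        Console.WriteLine($"column-major: {watch.Elapsed.TotalSeconds} secs");

        // Row-major: consecutive accesses walk one string, so each
        // loaded 64-byte line serves up to 32 two-byte chars.
        watch.Restart();
        for (int i = 0; i < vals.Length; i++)
            for (int j = 0; j < vals[i].Length; j++)
                tmp ^= vals[i][j];
        Console.WriteLine($"row-major:    {watch.Elapsed.TotalSeconds} secs");

        // prevent the compiler from optimizing away our computations
        tmp.GetHashCode();
    }
}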

This is the closest explanation I have for why my code runs that much faster, and it also supports the assumption that you have serious caching problems in your code.
