Dreamscape May 28, 2023

Cache locality

Let’s examine the C++ code below:


    // assuming a matrix declared as: int matrix[1000][1000];
    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            matrix[j][i] = i + j;
        }
    }

Did you notice anything odd about it? Take a look at how we access the elements of the matrix. I think you’ll agree that we usually do that row first, but in this case it’s column first. Some might think: “so what, the operation performed on the elements is the same, it works either way, it doesn’t matter!” But does it really not matter? Let’s run a quick test to see whether one way of accessing the matrix elements is faster than the other. We’ll do some basic time measurement by adding instrumentation code like this:



    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 1000; ++i)
    {
        for (int j = 0; j < 1000; ++j)
        {
            matrix[j][i] = i + j;
        }
    }
         
    auto end = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << time << " us" << std::endl;

Results:

Rows first: 2398 us
Columns first: 3217 us

What's going on?

Optimizing memory access is a crucial aspect of improving the performance of programs. One key concept that can significantly impact performance is spatial cache locality: the tendency of a program to access data that is close together in memory. By accessing data that is nearby in memory, we can take advantage of the cache hierarchy, which stores recently accessed data in faster levels of memory, closer to the processor. This avoids the need to fetch data from slower main memory, resulting in substantial performance gains.

When you try to load a value that is not present in the cache, the CPU will fetch a whole cache line containing that value. A cache line is the smallest portion of data that can be mapped into the cache, and its size depends on the architecture (64 bytes is a common size). If, however, the value you want to load is already present in the cache, that fetch doesn’t happen and the CPU can load the needed value much more quickly.

This is exactly what bites our column-first loop: consecutive iterations touch elements that are 1000 * sizeof(int) bytes apart, so nearly every access lands on a different cache line, while the row-first loop walks through memory sequentially and reuses each fetched cache line for several consecutive elements.

In conclusion

Accessing portions of memory that are close together will generally be faster, and most programs tend to do this naturally. Nevertheless, it’s good to be aware of it in cases where it might seem like the order of memory accesses doesn’t matter: we should pay attention and access memory sequentially whenever it’s possible.