CPU cache thrashing

When you work as a web developer you rarely think in terms of the CPU, registers, and memory cells, and most modern developers know little about what is going on at the low level. Recently I decided to close this gap and started reading about microprocessor architecture and related topics. It is a genuinely interesting area and I recommend learning more about it. You might want to start with the great book "Inside the Machine".

Today I'll show you an interesting effect of L1 cache thrashing. As you might have heard, modern architectures have several kinds of memory with different speed and cost: hard drives, RAM, CPU caches (L3, L2, L1), and registers. Faster memory is more expensive, so the amount of it is limited. That is why developers employ a technique called caching – keeping only the actively used data in faster memory, closer to the processor.

The Level 1 CPU cache is a piece of fast memory, usually a few KB in size (32 KB in my case), that works at the same speed as the CPU. If the required data is not in this cache (nor in L2 or L3), the CPU has to access RAM. RAM is not fast enough for the CPU: fetching data from it usually takes dozens of CPU cycles. If there were no on-chip cache, a modern CPU would waste most of its time just waiting for responses from RAM. Thankfully we have one, and as Intel reports, ~90% of all memory accesses hit the L1 cache.

Memory access very often has a sequential pattern – if you accessed address X recently, there is a high chance you will soon access a memory cell nearby. That is why, when the CPU fetches a cell from RAM into the cache, it also fetches a few neighbouring bytes. This block is called a "cache line"; in this post I assume 16-byte lines to keep the arithmetic simple (modern x86 CPUs actually use 64-byte lines). Address range 0-15 is line #0, 16-31 is line #1, 32-47 is line #2, and so on. When you fetch a byte at some address you fetch the whole cache line, and when you evict data from the cache you evict the whole line as well. Got it? This cache-line addressing scheme also helps to reduce the complexity and cost of the cache circuitry.

Another question is cache associativity. In the simplest scheme, a given block of RAM can be mapped to only one line in the L1 cache: RAM addresses 0-15 can be stored in line #0 only, 16-31 in line #1 only. I have 32 KB of L1 cache, so in my case RAM addresses 32768-32783 also map to cache line #0. In general, any address in the range from L1_SIZE*N + LINE_SIZE*K to L1_SIZE*N + LINE_SIZE*(K+1) - 1 maps to line #K. This is called a direct-mapped cache, and again it is done to simplify the cache circuitry (real L1 caches are usually set-associative, but the direct-mapped model is enough to show the effect).
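The mapping can be written as a tiny function. Here is a sketch assuming the 32 KB direct-mapped cache with 16-byte lines used throughout this post (the constant and function names are mine):

    #include <stdio.h>
    #include <stdint.h>

    #define L1_SIZE   32768                 /* 32 KB, as in this post          */
    #define LINE_SIZE 16                    /* 16-byte lines, as in this post  */
    #define NUM_LINES (L1_SIZE / LINE_SIZE) /* 2048 lines                      */

    /* Which line of a direct-mapped L1 does this address land in? */
    static unsigned line_index(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
    }

    int main(void)
    {
        printf("%u\n", line_index(0));     /* line 0 */
        printf("%u\n", line_index(16));    /* line 1 */
        printf("%u\n", line_index(32768)); /* line 0 again - collides with address 0 */
        return 0;
    }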
Now imagine a scenario where I copy 16 bytes from RAM address 0 to RAM address 32768. In C I would use naive code like *d++ = *s++ (a sketch of such a loop appears just after this walkthrough). Let's check what this code does:
1) fetch addresses 0-15 from RAM into line #0 of the cache
2) copy the byte at address 0 to a temporary register
3) now we need to copy this byte to address 32768, and for that addresses 32768-32783 must be fetched into the cache first. Their destination is also line #0, so the CPU discards the recently fetched data in line #0 and puts 32768-32783 there.
4) copy the byte from the temporary register into cell 0 of line #0 (which now represents address 32768).

The index variables increment and now my program wants to copy the byte at address 1 to address 32769. But for that it has to discard line #0 again and refetch addresses 0-15 from RAM. This refetching repeats for every single byte. Remember I said that fetching data from RAM is a very slow operation from the CPU's point of view? Copying like this wastes a lot of CPU time and is called "cache thrashing".
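To make the walkthrough concrete, here is a minimal sketch of such a copy loop. The buffer and the 32768-byte offset are my own choices, picked so that source and destination always collide on the same line under the direct-mapped model of this post (a real set-associative L1 would tolerate this particular pattern better):

    #define L1_SIZE 32768   /* 32 KB L1, as assumed in this post */

    /* One buffer twice the cache size: offsets 0..15 and 32768..32783
       map to the same line of a 32 KB direct-mapped cache, so every
       access below evicts the line that the next access needs. */
    static char buf[2 * L1_SIZE];

    int main(void)
    {
        char *s = buf;            /* plays the role of "address 0"     */
        char *d = buf + L1_SIZE;  /* plays the role of "address 32768" */
        for (int i = 0; i < 16; i++)
            *d++ = *s++;          /* each load/store pair fights over line #0 */
        return 0;
    }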

Here is a C program that demonstrates cache thrashing (a sketch of such a program is given below). It fills a matrix of size 32K*32K with numbers. The most interesting part is the assignment in the inner loop:

array[i][j] = i*j;

On my machine it runs in 0.3 seconds, but if I swap i and j like this:

array[j][i] = i*j;

it takes 4 seconds.
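The original source is not included here, but a minimal benchmark along the same lines might look like this; the matrix size, allocation, and timing code are my assumptions (swap the commented line for the slow variant):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 8192   /* assumed size: 8192x8192 ints (~256 MB) is easier to
                        allocate than a full 32K*32K matrix */

    int main(void)
    {
        int (*array)[N] = malloc(sizeof(int[N][N]));
        if (!array)
            return 1;

        clock_t start = clock();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                array[i][j] = i * j;   /* fast: walks memory sequentially */
             /* array[j][i] = i * j;      slow: jumps N*sizeof(int) bytes per store */
        printf("%.2f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

        free(array);
        return 0;
    }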

The CPU executes the same number of instructions; the only difference is the memory access pattern. The first example writes to memory sequentially: it loads addresses 0-15 into cache line #0, changes the values at addresses 0, 1, 2, 3, … and only then flushes the cache line back to RAM. The second example thrashes the L1 cache: it loads 0-15 into line #0, writes a value to cell 0, flushes it, loads addresses 32768-32783 into line #0, writes a value to address 32768, flushes the line, loads the next RAM block into line #0, writes and flushes it, and so on. These two access patterns show roughly a 10x difference in application performance.

This effect is called "cache thrashing", and developers working on high-performance applications try to avoid it, e.g. by laying out frequently accessed data close together (a small layout example is sketched below).
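As one illustration of that idea (the struct and field names here are hypothetical), keeping the fields that a hot loop actually touches next to each other means each iteration pulls in one or two adjacent cache lines instead of several scattered ones:

    /* Position and velocity are updated every step, so they live
       side by side; rarely used data (names, debug info, ...) is
       better kept elsewhere so it does not dilute the cache lines. */
    struct particle {
        float x, y, z;     /* position: read and written every step */
        float vx, vy, vz;  /* velocity: read every step             */
    };

    /* Sequential walk over a packed array: cache friendly. */
    void step(struct particle *p, int n, float dt)
    {
        for (int i = 0; i < n; i++) {
            p[i].x += p[i].vx * dt;
            p[i].y += p[i].vy * dt;
            p[i].z += p[i].vz * dt;
        }
    }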


5 responses to this post.

  1. Posted by Kim on March 10, 2012 at 04:02

    That was helpful :), thanks a lot! Well explained 🙂


  2. Posted by Augustin on April 12, 2012 at 04:10

    Nice explanation dude !!


  3. Thank you for the clear explanation. Did you test with optimizations disabled? Do you happen to know if the compiler optimizes the column by column iteration?


    • Nope, I ran this kind of thing recently on an Intel i7 processor. A matrix product was about twice as fast if one of the matrices was transposed first (which optimizes for row access), even including the transpose operation.


  4. Another, possibly more problematic example arises when multiple cores share an L2 or L3 cache. If each core (running a different thread) starts using the same cache line for different data, they will keep replacing each other's data, resulting in a more or less 100% miss rate. Bad times.

