Understanding CPU cache and cache line

Understanding CPU cache and cache line - c

I am trying to understand how CPU cache is operating. Lets say we have this configuration (as an example).
Cache size 1024 bytes
Cache line 32 bytes
1024/32 = 32 cache lines all together.
Singel cache line can store 32/4 = 8 ints.
1) According to these configuration length of tag should be 32-5=27 bits, and size of index 5 bits (2^5 = 32 addresses for each byte in cache line).
If total cache size is 1024 and there are 32 cache lines, where is tags+indexes are stored? (There is another 4*32 = 128 bytes.) Does it means that actual size of the cache is 1024+128 = 1152?
2) If cache line is 32 bytes in this example, this means that 32 bytes getting copied in cache whenerever CPU need to get new byte from RAM. Am I right to assume that cache line position of the requested byte will be determined by its adress?
This is what I mean: if CPU requested byte at [FF FF 00 08], then available cache line will be filled with bytes from [FF FF 00 00] to [FF FF 00 1F]. And our requseted single byte will be at position [08].
3) If previous statement is correct, does it mean that 5 bits that used for index, are technically not needed since all 32 bytes are in the cache line anyway?
Please let me know if I got something wrong.
Thanks

A cache consists of data and tag RAM, arranged as a compromise of access time vs efficiency and physical layout. You're missing an important stat: number of ways (sets). You rarely have 1-way caches, because they perform pathologically badly with simple patterns. Anyway:
1) Yes, tags take extra space. This is part of the design compromise - you don't want it to be a large fraction of the total area, and why line size isn't just 1 byte or 1 word. Also, all tags for an index are simultaneously accessed, and that can affect efficiency and layout if there's a large number of ways. The size is slightly bigger than your estimate. There's usually also a few bits extra bits to mark validity and sometimes hints. More ways and smaller lines needs a larger fraction taken up by tags, so generally lines are large (32+ bytes) and ways are small (4-16).
2) Yes. Some caches also do a "critical word first" fetch, where they start with the word that caused the line fill, then fetch the rest. This reduces the number of cycles the CPU is waiting for the data it actually asked for. Some caches will "write thru" and not allocate a line if you miss on a write, which avoids having to read the entire cache line first, before writing to it (this isn't always a win).
3) The tags won't store the lower 5 bits as they're not needed to match a cache line. They just index into individual lines.
Wikipedia has a pretty good, if a bit intense, write-up on caches: http://en.wikipedia.org/wiki/CPU_cache - see "Implementation". There's a diagram of how data and tags are split. Me, I think everyone should learn this stuff because you really can improve performance of code when you know what the underlying machine is actually capable of.

The cache metadata is typically not counted as a part of the cache itself. It might not even be stored in the same part of the CPU (it could be in another cache, implemented using special CPU registers, etc).
This depends on whether your CPU will fetch unaligned addresses. If it will only fetch aligned addresses, then the example you gave would be correct. If the CPU fetches unaligned addresses, then it might fetch the range 0xFFFF0008 to 0xFFFF0027.
The index bytes are still useful, even when cache access is aligned. This gives the CPU a shorthand method for referencing a byte within a cache line that it can use in its internal bookkeeping. You could get the same information by knowing the address associated with the cache line and the address associated with the byte, but that's a whole lot more information to carry around.
Different CPUs implement caching very differently. For the best answer to your question, please give some additional details about the particular CPU (type, model, etc) that you are talking about.

This is based on my vague memory, you should read books like "Computer Architecture: A Quantitative Approach" by Hennessey and Patterson. Great book.
Assuming a 32-bit CPU... (otherwise your figures would need to use >4 bytes (maybe <8 bytes since some/most 64-bit CPU don't have all 64 bits of address line used)) for the address.
1) I believe it's at least 4*32 bytes. Depending on the CPU, the chip architects may have decided to keep track of other info besides the full address. But it's usually not considered part of the cache.
2) Yes, but how that mapping is done is different. See Wikipedia - CPU cache - associativity There's the simple direct mapped cache and the more complex associative mapped cache. You want to avoid the case where some code needs two piece of information but the two addresses map to the exact same cache line.

Related

Fastest use of a dataset of just over 64 bytes?

Structure: I have 8 64-bit integers (512 bits = 64 bytes, the assumed cache line width) that I would like to compare to another, single 64-bit integer, in turn, without cache misses. The data set is, unfortunately, absolutely inflexible -- it's already as small as possible.
Access pattern: Each uint64_t is in fact an array of 4x4x4 bits, each bit representing the presence or absence of a voxel. This means sometimes I will be using half of one chunk and half of another, or even corners of 8 different 64-bit chunks.... I guess what this means is there is a high likelihood of a lack of alignment.
How can I do this as fast as possible i.e. without thrashing the cache?
P.S. The idea is that this code will ultimately run on a fairly wide range of architectures of at least a 64B cache line width, so I'd prefer this were absolutely as fast as possible. This also means I can't rely on MOVNTDQA, which anyway may incur a performance hit of it's own inspite of loading the 9th element directly to the CPU.
P.P.S. My knowledge of this area is fairly limited so please take it easy on me. But please spare me the premature optimisation comments; be sure that this is the 3% of this application that really counts.

I wouldn't worry about it. If your dataset is really only 9 integers, most of it will likely be stored in registers anyway. Also, there isn't really any way to optimize cache usage without specifying an architecture, since cache structure is architecture dependent. If you can list several target architectures you may be able to find some commonalities that you can optimize toward, but without knowing those architectures, I don't think there's much we can do for you.
Lastly, this seems like a good example of optimizing too early. I would suggest you take the following steps:
Decide what your maximum acceptable run time is
Finish your program in C
Compile for all of your target architectures
For those platforms that don't meet your speed spec, hand-optimize the intermediate assembly files and recompile until you meet your spec.

Are you sure you get cache-misses?
Even if the comparing value is not in an register, i think your first uint64 array should be on one cache stage (or what ever it is called) and your other data in another.
Your cache surely has some n-way associativity, that prevents your data row from being removed from the cache just by accessing your compare value.
Do not lose your time on Micro Optimizations. Improve your algorithms and data structures.

Last used cache line versus different cache lines

Let's assume cache lines are 64 bytes wide and I have two arrays a and b which fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question is it faster to now read a[1] than to read b[0]? In other words, is it faster to read from the last used cache line?
Does the set matter? Let's now assume that I have a 32 kb L1 data cache which is 4 way. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
Another way to ask my question (which is what I really care about) is in regards to reading a matrix.
In other words which one of these two code options will be more efficient assuming matrix M fits in the L1 cache and is 64 byte aligned and is already in the L1 cache.
float M[16][16]; //64 byte aligned
Version 1:
for(int i=0; i<16; i++) {
for(int j=0; j<16; j++) {
x += M[i][j];
}
}
Version 2:
for(int i=0; i<16; i++) {
for(int j=0; j<16; j++) {
x += M[j][i];
}
}
Edit: To make this clear due to SSE/AVX lets assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cahce miss)?
Edit:: I'm mostly interested in all processors since Intel Core 2 and Nehalem but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.

With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[1] not recently loaded but still in L1 cache, then there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing M[0][0] and so on are evicted from cache.
When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).

Fetching a[0] and then either a[1] or b[0] should amount to 2 cache access that hit the L1 in either case. You didn't say which uArch you're using but i'm not familiar with any mechanism that does further "caching" of the full cacheline above the L1 (anywhere in the memory unit), and I don't think such a mechanism could be feasible (at least not for any reasonable price).
Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line - your HW would have to not only keep the full cache line somewhere in the memory unit in case it's going to be accessed again (not sure how much that's a common case, so this feature is probably not the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for wb memory). In fact, it could even be a store in the same thread context, and you'll have to guard against that (since most common x86 CPUs today are performing loads out of order). If you don't maintain both of these (and probably other safeguards too) - you break coherency, if you do - you've created a monster logic that does that same as your L1 already does, just to save meager 1-2 cycles of access.
However, even though both options would require the same number of cache accesses, there may be other considerations effecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc.. All of which depend on your exact machine implementation.
If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same accesses but will be lighter on execution BW. I think that any decent compiler should be able to unroll your loops at this size, and combine the consecutive accesses into a single vector, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would compilcate the job for the compiler)
Edit
Since you're also asking about fitting the matrix in the L2 - that simplifies the question - in that case using the same line(s) multiple times as in option 1 is better as it allows you to hit the L1, while the alternative is to constantly fetch from the L2, which gives you lower latency and bandwidth. This is the basic principle behind loop tiling / blocking

Spatial locality is king so version #1 is faster. A good compiler can even vectorize the reads using SSE/AVX.
The CPU rearranges reads so it doesn't matter which one is first. In out-of-order CPUs it should matter very little if the both cache lines are on the same way.
For large matrices, it is even more important to keep locality so the L1 cache remains hot (less cache misses).

Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried / is it possible to find out the answer yourself by some form of benchmarking?
You can get a high resolution timer by some function such as QueryPerformanceCounter (assuming you're on Windows) or OS equivalent, then iterate the reads you want to test by x amount of times, then get the high resolution timer again to get the average time a read took.
Perform this process again for different reads and you should be able to compare average read times for different types of read, which should answer your question. That's not to say that the answer will remain the same on different processors though.

word alignment of 4 byte for XOR operations

Is there any advantage in doing bitwise operations on word boundaries? Any CPU or memory optimization in doing so?
Actual problem:
I am trying to create XOR of two structure. Lets say structure-1 and structure-2 both of same size 10000 bytes. I leave first few hundreds bytes as it is and then start XOR of 1 and 2.
Lets say I start with 302 to begin with. This will take 4 byte at a time and do XOR. 302, 303, 304 and 305 of both structure will be XORed. This cycle will be repeated till 10000.
Now, If I start from 304, Is there any performance improvement expected?

Yes, there are at least two advantages for using proper alignment:
Portability. Not all processor support non-aligned numbers. For maximum portability, you should only use fully aligned (i.e. an N-byte integer starts at an address that is a multiple of N) numbers
Speed. AFAIK, even a processor that supports non-aligned numbers is still faster with aligned numbers.

Premature optimization is the root of all evil
Just do it the straightforward way, then optimize it if your profiler tells you it's important.
Yes, you will go faster if you're properly aligned. You'll go even faster if you use the SSE2 vector XOR instructions, where properly aligned you'll do it 16 bytes at a time and not pollute the cache. And it's highly unlikely that optimizing this is where you should be spending your time.

Some processors only allow 4-byte operations on 32-bit word boundaries (some allow them only on halfword boundaries).
On these processors non-aligned access causes a processor exception which - depending on CPU, OS and settings - will cause a process crash or just a lot of work for the OS.
On other processors (e.g. x86) you will just get the performance hit of having to do two reads and writes (plus a bit of shifting) per operation.
See link text to see problems with ARM CPUs

Design code to fit in CPU Cache?

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.

At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference, protect against false sharing (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache is normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input in to progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
This also has implications for how you write code. If you have a loop like:
for i = 0 to whatever
step1(data);
step2(data);
step3(data);
end for
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for i = 0 to whatever
step1(data);
step2(data);
end for
for i = 0 to whatever
step3(data);
end for
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, that you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less--and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory in tiny processors that have been (mostly) obsolete for decades, works out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache at all any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.

Here's a link to a really good paper on caches/memory optimization by Christer Ericsson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.

A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessey covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.

UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click # Azul
.
.
.
.
.
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit here's a particularly successful example of how I changed some code to better use on chip caches. This was done some time ago on a 486 CPU and latter migrated to a 1st Generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline(thus the long tails, which could use as subscripts the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same values before 128 ahead of zero, and 128 after zero, so except for the middle 256 values, they all pointed to either the first or last value in the double vector.
This shrunk the storage requirement to 2k for the doubles, and 1,250 bytes for the 8-bit subscripts. This shrunk 10,000 bytes down to 3,298. Since the program spent 90% or more of it's time in this inner-loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~ 100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache/s. This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. IE: you don't want instructions hogging up the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is by using a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. To mask off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc, etc, requires the CPU use masks to mask off the host integers the bits are using, which shrinks the data, effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.

Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Hazwell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond which added the sum of the most current slice's resource requests and subtracted out the 1,000th time slice's requests, comprising resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this, and studying the code a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring(really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. IE: I only needed a conditional test for 1:3 of the pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage, however, I was able to relax this constraint - satisfying the users request to handle higher resource counts per user per millisecond - to use unsigned shorts and STILL double performance, because even with 3 adjacent vectors of 512 unsigned shorts, the L1 cache's 32K data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped were 1 of the 3 separated by any significant "step" in the 8MB L3cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.

Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.

If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.

Why is that data structures usually have a size of 2^n?

Is there a historical reason or something ? I've seen quite a few times something like char foo[256]; or #define BUF_SIZE 1024. Even I do mostly only use 2n sized buffers, mostly because I think it looks more elegant and that way I don't have to think of a specific number. But I'm not quite sure if that's the reason most people use them, more information would be appreciated.

There may be a number of reasons, although many people will as you say just do it out of habit.
One place where it is very useful is in the efficient implementation of circular buffers, especially on architectures where the % operator is expensive (those without a hardware divide - primarily 8 bit micro-controllers). By using a 2^n buffer in this case, the modulo, is simply a case of bit-masking the upper bits, or in the case of say a 256 byte buffer, simply using an 8-bit index and letting it wraparound.
In other cases alignment with page boundaries, caches etc. may provide opportunities for optimisation on some architectures - but that would be very architecture specific. But it may just be that such buffers provide the compiler with optimisation possibilities, so all other things being equal, why not?

Cache lines are usually some multiple of 2 (often 32 or 64). Data that is an integral multiple of that number would be able to fit into (and fully utilize) the corresponding number of cache lines. The more data you can pack into your cache, the better the performance.. so I think people who design their structures in that way are optimizing for that.

Another reason in addition to what everyone else has mentioned is, SSE instructions take multiple elements, and the number of elements input is always some power of two. Making the buffer a power of two guarantees you won't be reading unallocated memory. This only applies if you're actually using SSE instructions though.
I think in the end though, the overwhelming reason in most cases is that programmers like powers of two.

Hash Tables, Allocation by Pages
This really helps for hash tables, because you compute the index modulo the size, and if that size is a power of two, the modulus can be computed with a simple bitwise-and or & rather than using a much slower divide-class instruction implementing the % operator.
Looking at an old Intel i386 book, and is 2 cycles and div is 40 cycles. A disparity persists today due to the much greater fundamental complexity of division, even though the 1000x faster overall cycle times tend to hide the impact of even the slowest machine ops.
There was also a time when malloc overhead was occasionally avoided at great length. Allocation's available directly from the operating system would be (still are) a specific number of pages, and so a power of two would be likely to make the most use of the allocation granularity.
And, as others have noted, programmers like powers of two.

I can think of a few reasons off the top of my head:
2^n is a very common value in all of computer sizes. This is directly related to the way bits are represented in computers (2 possible values), which means variables tend to have ranges of values whose boundaries are 2^n.
Because of the point above, you'll often find the value 256 as the size of the buffer. This is because it is the largest number that can be stored in a byte. So, if you want to store a string together with a size of the string, then you'll be most efficient if you store it as: SIZE_BYTE+ARRAY, where the size byte tells you the size of the array. This means the array can be any size from 1 to 256.
Many other times, sizes are chosen based on physical things (for example, the size of the memory an operating system can choose from is related to the size of the registers of the CPU etc) and these are also going to be a specific amount of bits. Meaning, the amount of memory you can use will usually be some value of 2^n (for a 32bit system, 2^32).
There might be performance benefits/alignment issues for such values. Most processors can access a certain amount of bytes at a time, so even if you have a variable whose size is let's say) 20 bits, a 32 bit processor will still read 32 bits, no matter what. So it's often times more efficient to just make the variable 32 bits. Also, some processors require variables to be aligned to a certain amount of bytes (because they can't read memory from, for example, addresses in the memory that are odd). Of course, sometimes it's not about odd memory locations, but locations that are multiples of 4, or 6 of 8, etc. So in these cases, it's more efficient to just make buffers that will always be aligned.
Ok, those points came out a bit jumbled. Let me know if you need further explanation, especially point 4 which IMO is the most important.

Because of the simplicity (read also cost) of base 2 arithmetic in electronics: shift left (multiply by 2), shift right (divide by 2).
In the CPU domain, lots of constructs revolve around base 2 arithmetic. Busses (control & data) to access memory structure are often aligned on power 2. The cost of logic implementation in electronics (e.g. CPU) makes for arithmetics in base 2 compelling.
Of course, if we had analog computers, the story would be different.
FYI: the attributes of a system sitting at layer X is a direct consequence of the server layer attributes of the system sitting below i.e. layer < x. The reason I am stating this stems from some comments I received with regards to my posting.
E.g. the properties that can be manipulated at the "compiler" level are inherited & derived from the properties of the system below it i.e. the electronics in the CPU.

I was going to use the shift argument, but could think of a good reason to justify it.
One thing that is nice about a buffer that is a power of two is that circular buffer handling can use simple ands rather than divides:
#define BUFSIZE 1024
++index; // increment the index.
index &= BUFSIZE; // Make sure it stays in the buffer.
If it weren't a power of two, a divide would be necessary. In the olden days (and currently on small chips) that mattered.

It's also common for pagesizes to be powers of 2.
On linux I like to use getpagesize() when doing something like chunking a buffer and writing it to a socket or file descriptor.

It's makes a nice, round number in base 2. Just as 10, 100 or 1000000 are nice, round numbers in base 10.
If it wasn't a power of 2 (or something close such as 96=64+32 or 192=128+64), then you could wonder why there's the added precision. Not base 2 rounded size can come from external constraints or programmer ignorance. You'll want to know which one it is.
Other answers have pointed out a bunch of technical reasons as well that are valid in special cases. I won't repeat any of them here.

In hash tables, 2^n makes it easier to handle key collissions in a certain way. In general, when there is a key collission, you either make a substructure, e.g. a list, of all entries with the same hash value; or you find another free slot. You could just add 1 to the slot index until you find a free slot; but this strategy is not optimal, because it creates clusters of blocked places. A better strategy is to calculate a second hash number h2, so that gcd(n,h2)=1; then add h2 to the slot index until you find a free slot (with wrap around). If n is a power of 2, finding a h2 that fulfills gcd(n,h2)=1 is easy, every odd number will do.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight