prefetching data at L1 and L2

prefetching data at L1 and L2 - c

In Agner Fog's manual Optimizing software in C++ in section 9.10 "Cahce contentions in large data structures" he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his test the cost for for a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is is even larger and only fits in L2 the cost is 600%! This is summed up nicely in Table 9.1 in his text. This is essential the same thing observed at
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Later he writes:
The reason why this effect is so much stronger for
level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot
prefetch more than one line at a time.
So my questions are related to prefetching data.
From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?
From what I understand trying to write code to prefetch the data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples? and it's only a O(10%) improvement (on some machines). Agner later explains this:
The reason is that modern processors prefetch data automatically thanks to
out-of-order execution and advanced prediction mechanisms. Modern microprocessors are
able to automatically prefetch data for regular access patterns containing multiple streams
with different strides. Therefore, you don't have to prefetch data explicitly if data access can
be arranged in regular patterns with fixed strides.
So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?
Edit: Based on a comment by Leeor let me add to my questions and make it more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?
Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
I ran this with MSVC2013 64-bit release mode on a Xeon E5 1620 (Ivy Bridge) which has L1 32KB 8-way, L2 256 KB 8-way, and L3 10MB 20-way. The max matrix size for L1 is about 90x90, 256x256 for L3, and 1619 for L3.
Matrix Size Average Time
64x64 0.004251 0.004472 0.004412 (three times)
65x65 0.004422 0.004442 0.004632 (three times)
128x128 0.0409
129x129 0.0169
256x256 0.219 //max L2 matrix size
257x257 0.0692
512x512 2.701
513x513 0.649
1024x1024 12.8
1025x1025 10.1
I'm not seeing any performance loss in L1 however L2 clearly has the critical stride problem and maybe L3. I'm not sure yet why L1 does not show a problem. It's possible there is some other source of background (overhead) which is dominating the L1 times.

This statement :
the level-2 cache cannot prefetch more than one line at a time.
is incorrect
In fact, the L2 prefetchers are often stronger and more aggressive than L1 prefetchers. It depends on the actual machine you use, but Intels' L2 prefetcher for e.g. can trigger 2 prefetches for each request, while the L1 is usually limited (there are several types of prefetches that can coexist in the L1, but they're likely to be competing on a more limited BW than the L2 has at its disposal, so there will probably be less prefetches coming out of the L1.
The optimization guide, in Section 2.3.5.4 (Data Prefetching) counts the following prefetcher types:
Two hardware prefetchers load data to the L1 DCache:
- Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
- Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.
Data Prefetch to the L2 and Last Level Cache -
- Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
- Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
And a bit further ahead:
... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
Of the above, only the IP-based can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cachelines, meaning up to 64byte stride (or actually up to 128 bytes if you don't mind some extra lines). To use that, make sure that loads/stores at a given address would perform strided accesses - that's usually the case already in loops going over arrays. Compiler loop-unrolling may split that into multiple different stride streams with larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, that depends on the exact implementation.
However, if your access pattern does consist of consecutive lines, the L2 streamer is much more efficient than the L1 since it runs ahead faster.

Related

Last used cache line versus different cache lines

Let's assume cache lines are 64 bytes wide and I have two arrays a and b which fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question is it faster to now read a[1] than to read b[0]? In other words, is it faster to read from the last used cache line?
Does the set matter? Let's now assume that I have a 32 kb L1 data cache which is 4 way. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
Another way to ask my question (which is what I really care about) is in regards to reading a matrix.
In other words which one of these two code options will be more efficient assuming matrix M fits in the L1 cache and is 64 byte aligned and is already in the L1 cache.
float M[16][16]; //64 byte aligned
Version 1:
for(int i=0; i<16; i++) {
for(int j=0; j<16; j++) {
x += M[i][j];
}
}
Version 2:
for(int i=0; i<16; i++) {
for(int j=0; j<16; j++) {
x += M[j][i];
}
}
Edit: To make this clear due to SSE/AVX lets assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cahce miss)?
Edit:: I'm mostly interested in all processors since Intel Core 2 and Nehalem but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.

With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[1] not recently loaded but still in L1 cache, then there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing M[0][0] and so on are evicted from cache.
When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).

Fetching a[0] and then either a[1] or b[0] should amount to 2 cache access that hit the L1 in either case. You didn't say which uArch you're using but i'm not familiar with any mechanism that does further "caching" of the full cacheline above the L1 (anywhere in the memory unit), and I don't think such a mechanism could be feasible (at least not for any reasonable price).
Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line - your HW would have to not only keep the full cache line somewhere in the memory unit in case it's going to be accessed again (not sure how much that's a common case, so this feature is probably not the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for wb memory). In fact, it could even be a store in the same thread context, and you'll have to guard against that (since most common x86 CPUs today are performing loads out of order). If you don't maintain both of these (and probably other safeguards too) - you break coherency, if you do - you've created a monster logic that does that same as your L1 already does, just to save meager 1-2 cycles of access.
However, even though both options would require the same number of cache accesses, there may be other considerations effecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc.. All of which depend on your exact machine implementation.
If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same accesses but will be lighter on execution BW. I think that any decent compiler should be able to unroll your loops at this size, and combine the consecutive accesses into a single vector, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would compilcate the job for the compiler)
Edit
Since you're also asking about fitting the matrix in the L2 - that simplifies the question - in that case using the same line(s) multiple times as in option 1 is better as it allows you to hit the L1, while the alternative is to constantly fetch from the L2, which gives you lower latency and bandwidth. This is the basic principle behind loop tiling / blocking

Spatial locality is king so version #1 is faster. A good compiler can even vectorize the reads using SSE/AVX.
The CPU rearranges reads so it doesn't matter which one is first. In out-of-order CPUs it should matter very little if the both cache lines are on the same way.
For large matrices, it is even more important to keep locality so the L1 cache remains hot (less cache misses).

Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried / is it possible to find out the answer yourself by some form of benchmarking?
You can get a high resolution timer by some function such as QueryPerformanceCounter (assuming you're on Windows) or OS equivalent, then iterate the reads you want to test by x amount of times, then get the high resolution timer again to get the average time a read took.
Perform this process again for different reads and you should be able to compare average read times for different types of read, which should answer your question. That's not to say that the answer will remain the same on different processors though.

Understanding CPU cache and cache line

I am trying to understand how CPU cache is operating. Lets say we have this configuration (as an example).
Cache size 1024 bytes
Cache line 32 bytes
1024/32 = 32 cache lines all together.
Singel cache line can store 32/4 = 8 ints.
1) According to these configuration length of tag should be 32-5=27 bits, and size of index 5 bits (2^5 = 32 addresses for each byte in cache line).
If total cache size is 1024 and there are 32 cache lines, where is tags+indexes are stored? (There is another 4*32 = 128 bytes.) Does it means that actual size of the cache is 1024+128 = 1152?
2) If cache line is 32 bytes in this example, this means that 32 bytes getting copied in cache whenerever CPU need to get new byte from RAM. Am I right to assume that cache line position of the requested byte will be determined by its adress?
This is what I mean: if CPU requested byte at [FF FF 00 08], then available cache line will be filled with bytes from [FF FF 00 00] to [FF FF 00 1F]. And our requseted single byte will be at position [08].
3) If previous statement is correct, does it mean that 5 bits that used for index, are technically not needed since all 32 bytes are in the cache line anyway?
Please let me know if I got something wrong.
Thanks

A cache consists of data and tag RAM, arranged as a compromise of access time vs efficiency and physical layout. You're missing an important stat: number of ways (sets). You rarely have 1-way caches, because they perform pathologically badly with simple patterns. Anyway:
1) Yes, tags take extra space. This is part of the design compromise - you don't want it to be a large fraction of the total area, and why line size isn't just 1 byte or 1 word. Also, all tags for an index are simultaneously accessed, and that can affect efficiency and layout if there's a large number of ways. The size is slightly bigger than your estimate. There's usually also a few bits extra bits to mark validity and sometimes hints. More ways and smaller lines needs a larger fraction taken up by tags, so generally lines are large (32+ bytes) and ways are small (4-16).
2) Yes. Some caches also do a "critical word first" fetch, where they start with the word that caused the line fill, then fetch the rest. This reduces the number of cycles the CPU is waiting for the data it actually asked for. Some caches will "write thru" and not allocate a line if you miss on a write, which avoids having to read the entire cache line first, before writing to it (this isn't always a win).
3) The tags won't store the lower 5 bits as they're not needed to match a cache line. They just index into individual lines.
Wikipedia has a pretty good, if a bit intense, write-up on caches: http://en.wikipedia.org/wiki/CPU_cache - see "Implementation". There's a diagram of how data and tags are split. Me, I think everyone should learn this stuff because you really can improve performance of code when you know what the underlying machine is actually capable of.

The cache metadata is typically not counted as a part of the cache itself. It might not even be stored in the same part of the CPU (it could be in another cache, implemented using special CPU registers, etc).
This depends on whether your CPU will fetch unaligned addresses. If it will only fetch aligned addresses, then the example you gave would be correct. If the CPU fetches unaligned addresses, then it might fetch the range 0xFFFF0008 to 0xFFFF0027.
The index bytes are still useful, even when cache access is aligned. This gives the CPU a shorthand method for referencing a byte within a cache line that it can use in its internal bookkeeping. You could get the same information by knowing the address associated with the cache line and the address associated with the byte, but that's a whole lot more information to carry around.
Different CPUs implement caching very differently. For the best answer to your question, please give some additional details about the particular CPU (type, model, etc) that you are talking about.

This is based on my vague memory, you should read books like "Computer Architecture: A Quantitative Approach" by Hennessey and Patterson. Great book.
Assuming a 32-bit CPU... (otherwise your figures would need to use >4 bytes (maybe <8 bytes since some/most 64-bit CPU don't have all 64 bits of address line used)) for the address.
1) I believe it's at least 4*32 bytes. Depending on the CPU, the chip architects may have decided to keep track of other info besides the full address. But it's usually not considered part of the cache.
2) Yes, but how that mapping is done is different. See Wikipedia - CPU cache - associativity There's the simple direct mapped cache and the more complex associative mapped cache. You want to avoid the case where some code needs two piece of information but the two addresses map to the exact same cache line.

Optimizing ARM cache usage for different arrays

I want to port a small piece of code on ARM Cortex A8 processor. Both L1 cache and L2 cache are very limited. There are 3 arrays in my program. Two of them are sequentially accessed(size> Array A: 6MB and Array B: 3MB) and the access pattern for the third array(size> Array C: 3MB) is unpredictable. Though the calculations are not very rigorous but there are huge cache misses for accessing array C. One solution that I thought would be to allocate more cache (L2) space for array C and less for Array A & B. But I'm not able to find any way to achieve this. I went through preload engine of ARM but could not find anything useful.

It would be a good idea to split the cache and allocate each array in a different part of it.
Unfortunately that is not possible. The caches of the CortexA8 just are not that flexible. The good old StrongArm had a secondary cache for exactly this splitting purpose, but it's not available anymore. We have L1 and L2 caches instead (overall a good change imho.)
However, there is a thing you can do:
The NEON SIMD unit of the CortexA8 lags behind the general purpose processing unit by around 10 processor cycles. With clever programming you can issue cache prefetches from the general purpose unit but do the accesses via NEON. The delay between the two pipelines gives the cache a bit of time to do the prefetches, so your average cache miss time will be lower.
The drawback is that if you must never move the result of a calculation back from NEON to the ARM unit. Since NEON lags behind this will cause a full CPU pipeline flush. Almost if not even more costly as a cache miss.
The difference in performance can be significant. Out of the blue I would expect something between 20% and 30% of speed improvement.

From what I could find via Google, it looks like ARMv7 (which is the version of the ISA that Cortex A8 supports) has cache-flush capability, though I couldn't find a clear reference on how to use it -- perhaps you can do better if you spend more time on it than the minute or two I spent typing "ARM cache flush" into a search box and reading the results.
In any case, you should be able to achieve an approximation of what you want by periodically issuing "flush" instructions to flush out the parts of A and B that you know you no longer need.

Design code to fit in CPU Cache?

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.

At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference, protect against false sharing (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache is normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input in to progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
This also has implications for how you write code. If you have a loop like:
for i = 0 to whatever
step1(data);
step2(data);
step3(data);
end for
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for i = 0 to whatever
step1(data);
step2(data);
end for
for i = 0 to whatever
step3(data);
end for
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, that you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less--and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory in tiny processors that have been (mostly) obsolete for decades, works out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache at all any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.

Here's a link to a really good paper on caches/memory optimization by Christer Ericsson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.

A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessey covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.

UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click # Azul
.
.
.
.
.
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit here's a particularly successful example of how I changed some code to better use on chip caches. This was done some time ago on a 486 CPU and latter migrated to a 1st Generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline(thus the long tails, which could use as subscripts the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same values before 128 ahead of zero, and 128 after zero, so except for the middle 256 values, they all pointed to either the first or last value in the double vector.
This shrunk the storage requirement to 2k for the doubles, and 1,250 bytes for the 8-bit subscripts. This shrunk 10,000 bytes down to 3,298. Since the program spent 90% or more of it's time in this inner-loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~ 100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache/s. This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. IE: you don't want instructions hogging up the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is by using a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. To mask off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc, etc, requires the CPU use masks to mask off the host integers the bits are using, which shrinks the data, effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.

Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Hazwell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond which added the sum of the most current slice's resource requests and subtracted out the 1,000th time slice's requests, comprising resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this, and studying the code a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring(really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. IE: I only needed a conditional test for 1:3 of the pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage, however, I was able to relax this constraint - satisfying the users request to handle higher resource counts per user per millisecond - to use unsigned shorts and STILL double performance, because even with 3 adjacent vectors of 512 unsigned shorts, the L1 cache's 32K data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped were 1 of the 3 separated by any significant "step" in the 8MB L3cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.

Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.

If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

In trying to figure out whether or not my code's inner loop is hitting a hardware design barrier or a lack of understanding on my part barrier. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for(i=0;i<32768-1;i++) {
xref=px[i];
yref=py[i];
zref=pz[i];
for(j=0;j<32768-1;j++ {
deltx=xref-px[j];
delty=yref-py[j];
deltz=zref-pz[j];
} }
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.

It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPU's can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.

I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.

Consider: How wide is a float? How wide is the SSEx instruction? The ratio should should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havok with getting good estimates of speedup.

You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight