Today when I was in computer organization class, teacher talked about something interesting to me. When it comes to talk about Why cache memory works, he said that:
for (i=0; i<M; i++)
for(j=0; j<N; j++)
X[i][j] = X[i][j] + K; //X is double(8 bytes)
it is not good to change the first line with the second. What is your opinions on this? And why it is like that?
There is a very good paper by Ulrich Drepper of Red Hat and glibc fame, What Every Programmer Should Know About Memory. One section discussed caches in great detail. For example, there are cache effects in SMP systems where CPUs can end up thrashing ownership of a modified cache line back and forth, greatly harming performance.
Locality of reference. Because the data is stored by rows, for each row the j columns are in adjacent memory addresses. The OS will typically load an entire page from memory into the cache and adjacent address references will likely refer to that same page. If you increment by the row index in the inner loop it is possible that these rows will be on different pages (since they are separated by j doubles each) and the cache may have to constantly bring in and throw away pages of memory as it references the data. This is called thrashing and is bad for performance.
In practice and with larger, modern caches, the sizes of the rows/columns would need to be reasonably large before this would come into play, but it's still good practice.
[EDIT] The answer above is specific to C and may differ for other languages. The only one that I know is different is FORTRAN. FORTRAN stores things in column major order (the above is row major) and it would be correct to change the order of the statements in FORTRAN. If you want/need efficiency, it's important to know how your language implements data storage.
It is like that becauses caches like locality. The same number of memory accessed, but spaced further apart, will hit different "lines" of cache, or might even miss the cache altogether. It is therefore good, whenever you have the choice, to organize data so that accesses that are likely to happen close to each other in time, also do so in space. This increases the chance of a cache hit, and gives you more performance.
There is of course a wealth of information about this topic available, see for instancethis wikipedia entry on locality of reference. Or, I guess, your own course text book. :)
In C, n-dimensional matrices are row major, meaning the last index into the matrix represents adjacent spaces in memory. This is different than some other languages, FORTRAN for example, which are column major. In FORTRAN, it's more efficient to iterate through a 2D matrix like this:
do jj = 1,N
do ii = 1,M
x(ii,jj) = x(ii,jj) + K;
enddo
enddo
Cache memory is very fast and very expensive memory that sits close to the CPU. Rather than fetch one small piece of data from the RAM each time, the CPU fetches a chunk of data and stores it in the cache. The bet is that if you just read one byte, then the next byte you read is likely to be right after it. If this is the case, then it can come from the cache.
By laying out your loop as you have it, you read the bytes in the order that they are stored in memory. This means that they are in the cache, and can be read very quickly by the CPU. If you swapped around lines 1 and 2, then you'd read every "N" bytes each time around the loop. The bytes you are reading are no longer consecutive in memory, and so they may not be in the cache. The CPU has to fetch them from the (slower) RAM, and so your performance decreases.
Related
I'm reading Computer Systems book from Bryant & O'Hallaron, there is an exercises the solution of which seems to be incorrect. So I'd like to make it sure
given
struct point {
int x;
int y; };
struct array[32][32];
for(i = 31; i >= 0; i--) {
for(j = 31; j >= 0; j--) {
sum_x += array[j][i].x;
sum_y += array[j][i].y; }}
sizeof(int) = 4;
we have 4096 byte cache with block (line) size 32 byte.
The hit rate is asked.
My reasoning was, we have 4096/32 = 128 blocks, each block can store 4 points (2*4*4 = 32), therefore the cache can store 1/2 of the array, i.e. 512 points (total 32*32 = 1024). Since the code accesses array in column major order, access to each point is miss. So we have array[j][i].x is always miss, while array[j][i].y is hit. Finally miss rate = hit rate = 1/2.
Problem: The solution says the hit rate is 3/4 because the cache can store the whole array.
But according to my reasoning the cache can store only half points
Did I miss something?
The array's top four rows occupy a part of the cache:
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|...
Above is a schematic of the array as an applied mathematician would write the array on paper. Each element consists of an (x,y) pair, a point.
The four rows labeled o in the diagram comprise 128 points, enough to fill 1024 bytes, which is only one quarter of the cache, but see: in your code, the variable i is
the major loop counter and also
the array's row index (as written on paper).
So, let's look at the diagram again. How do your nested loops step through the array as diagrammed?
Answer: apparently, your loops step rightward across the top row as diagrammed, with j (column) as the minor loop counter. However, as you have observed, the array is stored by columns. Therefore, when element [j][i] == [0][0] is loaded, an entire cache line is loaded with it. And what comprises that cache line? It's the four elements marked * in the diagram.
Therefore, while your inner loop iterates across the array's top row as diagrammed, the cache misses every time, fetching four elements each time. And then for the next three rows, it's all hits.
This isn't easy to think about. It's a fine problem, nor would I expect you to grasp my answer instantly, but if you carefully consider the sequence of loads as I have explained, it should (after a bit of pondering) begin to make sense.
With the given loop nesting, the hit rate is indeed 3/4.
FURTHER DISCUSSION
In comments, you have asked a good follow-up question:
Can you write an element (e.g. array[3][14].x) that would hit?
I can. The array[j][i] == array[10][5] would hit. (Both .x and .y would hit.)
I will explain. The array[j][i] == array[10][4] would miss, whereas array[10][5], array[10][6] and array[10][7] would eventually hit. Why eventually? This is significant. Although all four of the elements I have named are loaded by cache hardware at once, array[10][5] is not accessed by your code (that is, by the CPU) when array[10][4] is accessed. Rather, after array[10][4] is accessed, array[11][4] is next accessed by the program and CPU.
The program and CPU only get around to accessing array[10][5] rather later.
And, indeed, if you think about it, this makes sense, doesn't it, because that is part of what caches do: they load additional data now, quietly as part of a cache line, so that the CPU can quickly access the additional data later if it needs it.
APPENDIX: FORTRAN/BLAS/LAPACK MATRIX ORDERING
It is standard in numerical computing to store matrices by column rather than by row. This is called column-major storage. Unfortunately, unlike the earlier Fortran programming language, the C programming language was not originally designed for numerical computing, so, in C, to store arrays by column, one must write array[column][row] == array[j][i]—which notation of course reverses the way an applied mathematician with his or her pencil would write it.
This is an artifact of the C programming language. The artifact has no mathematical significance but, when programming in C, you must remember to type [j][i]. [Were you programming in the now mostly obsolete Fortran programming language, you would type (i, j), but this isn't Fortran.]
The reason column-major storage is standard has to do with the sequence in which the CPU performs scalar, floating-point multiplications and additions when, in mathematical/pencil terminology, a matrix [A] left-operates on a column vector x. The standard Basic Linear Algebra Subroutines (BLAS) library, used by LAPACK and others, works this way. You and I should work this way, too, not only because we are likely to need to interface with BLAS and/or LAPACK but because, numerically, it's smoother.
If you've transcribed the program correctly then you're correct, the 3/4 answer is wrong.
The 3/4 answer would be correct if the indexes in the innermost sum += ... statements were arranged so that the rightmost index varied the most quickly, i.e. as:
sum_x += array[i][j].x;
sum_y += array[i][j].y;
In that case the 1st, 5th, 9th ... iterations of the loop would miss, but the line loaded into the cache by each of those misses would cause the next three iterations to hit.
However, with the program as written, every iteration misses. Each cache line that is loaded from memory supplies data for only a single point, and then that line is always replaced before the data for any of the other three points in the line is accessed.
As an example (assuming for simplicity that the address of the first member array[0][0] is aligned with the start of the cache), the reference to array[31][31] in the first pass through the loop is a miss that causes line 127 of the cache to be loaded. Line 127 now contains the data for [31][28], [31][29], [31][30] and [31][31]. However, the fetch of array[15][31] causes line 127 to be overwritten before array[31][30] is referenced, so when [31][30]'s turn eventually arrives it is a miss too. And then a miss at [15][30] replaces the line before [31][29] is referenced.
IMO your 1/2 hit ratio is overgenerous because it counts the access to the .y coordinate as a hit. However, that's not what the original 3/4 answer does. If the fetch of the .y coordinate were counted as a hit then the original answer would have been 7/8. Instead it counts each complete point, or perhaps each loop iteration, as a hit or a miss. By that measure the hit rate for the program as written in your question is a nice round 0.
Let's assume cache lines are 64 bytes wide and I have two arrays a and b which fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question is it faster to now read a[1] than to read b[0]? In other words, is it faster to read from the last used cache line?
Does the set matter? Let's now assume that I have a 32 kb L1 data cache which is 4 way. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
Another way to ask my question (which is what I really care about) is in regards to reading a matrix.
In other words which one of these two code options will be more efficient assuming matrix M fits in the L1 cache and is 64 byte aligned and is already in the L1 cache.
float M[16][16]; //64 byte aligned
Version 1:
for(int i=0; i<16; i++) {
for(int j=0; j<16; j++) {
x += M[i][j];
}
}
Version 2:
for(int i=0; i<16; i++) {
for(int j=0; j<16; j++) {
x += M[j][i];
}
}
Edit: To make this clear due to SSE/AVX lets assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cahce miss)?
Edit:: I'm mostly interested in all processors since Intel Core 2 and Nehalem but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.
With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[1] not recently loaded but still in L1 cache, then there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing M[0][0] and so on are evicted from cache.
When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).
Fetching a[0] and then either a[1] or b[0] should amount to 2 cache access that hit the L1 in either case. You didn't say which uArch you're using but i'm not familiar with any mechanism that does further "caching" of the full cacheline above the L1 (anywhere in the memory unit), and I don't think such a mechanism could be feasible (at least not for any reasonable price).
Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line - your HW would have to not only keep the full cache line somewhere in the memory unit in case it's going to be accessed again (not sure how much that's a common case, so this feature is probably not the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for wb memory). In fact, it could even be a store in the same thread context, and you'll have to guard against that (since most common x86 CPUs today are performing loads out of order). If you don't maintain both of these (and probably other safeguards too) - you break coherency, if you do - you've created a monster logic that does that same as your L1 already does, just to save meager 1-2 cycles of access.
However, even though both options would require the same number of cache accesses, there may be other considerations effecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc.. All of which depend on your exact machine implementation.
If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same accesses but will be lighter on execution BW. I think that any decent compiler should be able to unroll your loops at this size, and combine the consecutive accesses into a single vector, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would compilcate the job for the compiler)
Edit
Since you're also asking about fitting the matrix in the L2 - that simplifies the question - in that case using the same line(s) multiple times as in option 1 is better as it allows you to hit the L1, while the alternative is to constantly fetch from the L2, which gives you lower latency and bandwidth. This is the basic principle behind loop tiling / blocking
Spatial locality is king so version #1 is faster. A good compiler can even vectorize the reads using SSE/AVX.
The CPU rearranges reads so it doesn't matter which one is first. In out-of-order CPUs it should matter very little if the both cache lines are on the same way.
For large matrices, it is even more important to keep locality so the L1 cache remains hot (less cache misses).
Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried / is it possible to find out the answer yourself by some form of benchmarking?
You can get a high resolution timer by some function such as QueryPerformanceCounter (assuming you're on Windows) or OS equivalent, then iterate the reads you want to test by x amount of times, then get the high resolution timer again to get the average time a read took.
Perform this process again for different reads and you should be able to compare average read times for different types of read, which should answer your question. That's not to say that the answer will remain the same on different processors though.
I need to allocate memory of order of 10^15 to store integers which can be of long long type.
If i use an array and declare something like
long long a[1000000000000000];
that's never going to work. So how can i allocate such a huge amount of memory.
Really large arrays generally aren't a job for memory, more one for disk. 1015 array elements at 64 bits apiece is (I think) 8 petabytes. You can pick up 8G memory slices for about $15 at the moment so, even if your machine could handle that much memory or address space, you'd be outlaying about $15 million dollars.
In addition, with upcoming DDR4 being clocked up to about 4GT/s (giga-transfers), even if each transfer was a 64-bit value, it would still take about one million seconds just to initialise that array to zero. Do you really want to be waiting around for eleven and a half days before your code even starts doing anything useful?
And, even if you go the disk route, that's quite a bit. At (roughly) $50 per TB, you're still looking at $400,000 and you'll possibly have to provide your own software for managing those 8,000 disks somehow. And I'm not even going to contemplate figuring out how long it would take to initialise the array on disk.
You may want to think about rephrasing your question to indicate the actual problem rather than what you currently have, a proposed solution. It may be that you don't need that much storage at all.
For example, if you're talking about an array where many of the values are left at zero, a sparse array is one way to go.
You can't. You don't have all this memory, and you'll don't have it for a while. Simple.
EDIT: If you really want to work with data that does not fit into your RAM, you can use some library that work with mass storage data, like stxxl, but it will work a lot slower, and you have always disk size limits.
MPI is what you need, that's actually a small size for parallel computing problems the blue gene Q monster at Lawerence Livermore National Labs holds around 1.5 PB of ram. you need to use block decomposition to divide up your problem and viola!
the basic approach is dividing up the array into equal blocks or chunks among many processors
You need to uppgrade to a 64-bit system. Then get 64-bit-capable compiler then put a l at the end of 100000000000000000.
Have you heard of sparse matrix implementation? In one of the sparse matrices, you just use very little part of the matrix despite of the matrix being huge.
Here are some libraries for you.
Here is a basic info about sparse-matrices You dont actually use all of it. Just the needed few points.
A matrix is created in processor 0 and scattered to other processors. A matrix is a symmetric dense matrix. That's why it is initialized in processor 0.
A matrix is created in this way:
A=malloc(sizeof(double)*N*N);
for (i=0; i<N; i++)
for(j=0; j<N; j++)
A(i,j)=rand()%10; // The code will be changed.
A(i,j) is defined as:
#define A(i,j) A[i*N+j]
and N has to be 100,000 to test the algorithm.
The problem here is: if N=100,000 then the memory needed is approximately 76GB. What do you suggest to store the A matrix?
PS: Algorithm works very well when N<20.000 and the cluster is a distrubed memory system(2GB RAM per processor)
If you are doing this, as stated in comments, to do a scaling test, then Oli Charlesworth is completely right; anything you do is going to make this an apples-to-oranges comparison, because your node doesn't have 76GB to use. Which is fine; one of the big reasons to use MPI is to tackle problems that couldn't fit on one node. But by trying to shoehorn 76GB of data onto one processor, the comparison you're doing isn't going to make any sense. As mentioned by both Oli Charlesworth and caf, through various methods you can use disk instead of RAM, but then your 1 processor answer is going not going to be directly comparable to the fits-in-RAM numbers you get from larger number of nodes, so you're going to be going to a lot of work to get a number which won't actually mean anything.
If you want scaling results on this sort of problem, you either start with the lowest number of nodes that the problem does fit on, and take data at increasing numbers of processors, or you do weak scaling, rather than strong scaling tests -- you keep the work-per-processor constant while scaling up the number of processors, rather than the total work being constant.
Incidentally, however you do the measurements, you'll end up with better results if, as Oli Charlesworth suggests, you have each procesor generate its own data rather than have a serial bottleneck by having rank 0 do the generation of the matrix and then have all the processors receive their parts.
If you are programming on a POSIX system with sufficient virtual address space (which in practice will mean a 64 bit system), you can use mmap().
Either create an anonymous mapping of the required size (this will be swap-backed, which will mean you'll need at least 76GB of swap), or create a real file of the required size and map that.
The file-backed solution has the advantage that if your cluster has a shared file system, you don't need to explicitly transfer the matrix to each processor - you can simply msync() it after creating it, and then map the right region on each processor.
If you can switch to C++, you might look into STXXL, which is an STL implementation specifically designed for huge datasets, with transparent disk-backed support, etc.
When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.
At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference, protect against false sharing (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache is normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input in to progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
This also has implications for how you write code. If you have a loop like:
for i = 0 to whatever
step1(data);
step2(data);
step3(data);
end for
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for i = 0 to whatever
step1(data);
step2(data);
end for
for i = 0 to whatever
step3(data);
end for
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, that you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less--and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory in tiny processors that have been (mostly) obsolete for decades, works out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache at all any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.
Here's a link to a really good paper on caches/memory optimization by Christer Ericsson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.
A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessey covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.
UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click # Azul
.
.
.
.
.
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit here's a particularly successful example of how I changed some code to better use on chip caches. This was done some time ago on a 486 CPU and latter migrated to a 1st Generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline(thus the long tails, which could use as subscripts the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same values before 128 ahead of zero, and 128 after zero, so except for the middle 256 values, they all pointed to either the first or last value in the double vector.
This shrunk the storage requirement to 2k for the doubles, and 1,250 bytes for the 8-bit subscripts. This shrunk 10,000 bytes down to 3,298. Since the program spent 90% or more of it's time in this inner-loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~ 100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache/s. This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. IE: you don't want instructions hogging up the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is by using a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. To mask off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc, etc, requires the CPU use masks to mask off the host integers the bits are using, which shrinks the data, effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.
Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Hazwell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond which added the sum of the most current slice's resource requests and subtracted out the 1,000th time slice's requests, comprising resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this, and studying the code a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring(really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. IE: I only needed a conditional test for 1:3 of the pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage, however, I was able to relax this constraint - satisfying the users request to handle higher resource counts per user per millisecond - to use unsigned shorts and STILL double performance, because even with 3 adjacent vectors of 512 unsigned shorts, the L1 cache's 32K data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped were 1 of the 3 separated by any significant "step" in the 8MB L3cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.
Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.
If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.