Switch with tons of cases vs single if - c

Let's say we need to go through a 128-element array of uint8, compare neighbouring elements, and put the result into another array. The code below is the simplest and most readable way to solve this problem.
for (i = 1; i < 128; i++)
if (arr[i] < arr[i-1] + 64) //don't care about overflow
arr2[i] = 1;
It looks like this code 1) will not use a branch table.
And as far as I know, a CPU doesn't read just 1 byte; it actually reads 8 bytes (assuming a 64-bit machine), and that 2) makes the CPU do some extra work.
So here comes another approach. Read 2 (or 4 or 8) bytes at a time and create an extremely huge switch (2^16, 2^32 or 2^64 cases respectively), which has every possible combination of bytes in our array. Does this make any sense?
For this discussion let's assume the following:
1) Our main priority is speed
2) Next is RAM consumption.
We don't care about the size of the executable (unless it somehow affects speed or RAM).

You should know that switches are actually very slow, as the branch is likely to be mispredicted. What makes a switch fast is a jump table:
switch (i) {
case 0: ...
case 1: ...
}
gets translated into this:
labels = {&case0, &case1}
goto labels[i]
However, you do not need this either, as you're only writing a memory cell. You can write the "jump table", or more specifically a pre-computed matrix of answers, yourself:
for (i = 1; i < 128; i++)
arr2[i] = answers[arr[i]][arr[i-1]];
A uint8 has only 256 possible values, which gives us 64 KB of RAM required for that matrix.
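For illustration, here is a minimal sketch of how such a matrix could be filled in once and then used. The names build_answers and compare_neighbours are made up for this example. Two assumptions are worth noting: the comparison is evaluated in int arithmetic (as C's integer promotion does for the original expression), and the table version writes 0 as well as 1, whereas the loop in the question only ever writes 1.

```c
#include <stddef.h>
#include <stdint.h>

/* answers[cur][prev] caches the result of (cur < prev + 64).
   256 * 256 one-byte entries = 64 KB. */
uint8_t answers[256][256];

/* Fill the table once, before the hot loop runs. */
void build_answers(void)
{
    for (int cur = 0; cur < 256; cur++)
        for (int prev = 0; prev < 256; prev++)
            answers[cur][prev] = (uint8_t)(cur < prev + 64);
}

/* Branch-free replacement for the if-based loop from the question. */
void compare_neighbours(const uint8_t *arr, uint8_t *arr2, size_t n)
{
    for (size_t i = 1; i < n; i++)
        arr2[i] = answers[arr[i]][arr[i - 1]];
}
```

Whether this actually beats the plain comparison depends on the machine: the table is larger than a typical L1 data cache, so it is worth measuring both versions.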

Related

Understanding how to write cache-friendly code

I have been trying to understand how to write cache-friendly code. So as a first step, I was trying to understand the performance difference between row-major and column-major array access.
So I created an int array of size 512×512 so that total size is 1MB. My L1 cache is 32KB, L2 cache is 256KB, and L3 cache is 3MB. So my array fits in L3 cache.
I simply calculated the sum of the array elements in row-major order and in column-major order and compared their speed. Every time, column-major order is slightly faster. I expected row-major order to be faster than the other (maybe several times faster).
I thought problem may be due to small size of array, so I made another array of size 8192×8192 (256 MB). Still the same result.
Below is the code snippet I used:
#include <time.h>
#include <stdio.h>
#define S 512
#define M S
#define N S
int main() {
// Summing in the row major order
int x = 0;
int iter = 25000;
int i, j;
int k[M][N];
int sum = 0;
clock_t start, end;
start = clock();
while(x < iter) {
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%ld\n", (long)(end - start));
// Summing in the column major order
x = 0;
sum = 0;
int h[M][N];
start = clock();
while(x < iter) {
for (j = 0; j < N; j++) {
for(i = 0; i < M; i++){
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%ld\n", (long)(end - start));
}
Question: can someone tell me what my mistake is and why I am getting this result?
I don't really know why you get this behaviour, but let me clarify some things.
There are at least two things to consider when thinking about cache: cache size and cache line size. For instance, my Intel i7 920 processor has a 256KB L2 cache with a 64-byte line size. If your data fits inside the cache, then it really doesn't matter in which order you access it.
Optimizing code to be cache friendly targets two things. First, if possible, split the memory accesses into blocks such that a block fits in cache. Do all the computations possible with that block, then bring in the next block, do the computations with it, and so on. The other thing (the one you are trying to do) is to access the memory consecutively. When you request a piece of data from memory (let's say an int, 4 bytes), a whole cache line is brought into the cache (in my case 64 bytes: 16 adjacent integers, including the one you requested).
This is where row order vs column order comes into play. With row order you have 1 cache miss for every 16 memory requests; with column order you get a cache miss for every request, but only if your data doesn't fit in cache. If your data fits in cache, you get the same ratio as with row order, because the lines are still in cache from when you requested the first element of each line. Of course, associativity can come into play, and a cache line can be evicted even if the cache is not entirely filled with your data.
Regarding your problem: when the data fits in cache, as I said, the access order doesn't matter that much, but when you do the second sum, the data is already in the cache from the first sum, and that's why it is faster. If you do the column-order sum first, you should see that the row-order sum becomes faster simply because it is done afterwards. However, when the data is large enough, you shouldn't get the same behaviour. Try the following: between the two sums, do something with another large block of data in order to invalidate the whole cache.
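To make the comparison easier to repeat in either order, here is a rough sketch that puts each traversal in its own function and times one call. The function names and the time_sum helper are made up for this example, and the array is passed as a flat pointer so that it can live on the heap for large sizes.

```c
#include <stddef.h>
#include <time.h>

/* Row-major walk: consecutive addresses, one new cache line per 16 ints
   (assuming 4-byte ints and 64-byte lines). */
long long sum_row_major(const int *a, size_t rows, size_t cols)
{
    long long sum = 0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            sum += a[i * cols + j];
    return sum;
}

/* Column-major walk: each access jumps cols * sizeof(int) bytes. */
long long sum_col_major(const int *a, size_t rows, size_t cols)
{
    long long sum = 0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            sum += a[i * cols + j];
    return sum;
}

/* Times one call of fn and returns the elapsed CPU time in seconds. */
double time_sum(long long (*fn)(const int *, size_t, size_t),
                const int *a, size_t rows, size_t cols, long long *out)
{
    clock_t start = clock();
    *out = fn(a, rows, cols);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_sum with the two functions in both orders (row first, then column first) on an initialized 8192×8192 array should show whether the difference comes from the access pattern or merely from which sum ran second.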
Edit
I see a 3-4x speedup for row major (although I expected >8x speedup. any idea why?). [..] it would be great if you could tell me why speedup is only 3x
It's not that accessing the matrix the "right way" doesn't improve much; it's more that accessing the matrix the "wrong way" doesn't hurt that much, if that makes any sense.
Although I can't provide you with a specific and exact answer, what I can tell you is that modern processors have very complicated and extremely efficient cache models. They are so powerful that, in many common cases, they can mask the cache levels, making it seem as if you had one big single-level cache instead of a 3-level cache (you don't see a penalty when increasing your data size from a size that fits in L2 to a size that fits only in L3). If you run your code on an older processor (let's say 10 years old), you will probably see the speedup you expect. Modern processors, however, have mechanisms that help a lot with cache misses. Desktop processors are designed with the philosophy of running "bad code" fast, so a lot of investment is made in improving "bad code" performance, because the vast majority of desktop applications aren't written by people who understand branching issues or cache models. This is in contrast to the high-performance market, where specialized processors make bad code hurt very much because they implement only weak mechanisms for dealing with "bad code" (or none at all). These mechanisms take up a lot of transistors, and so they increase power consumption and the heat generated, but they are worth implementing in a desktop processor, where most of the code is "bad code".

Fastest way to traverse columns in a multidimensional array in C

I'm currently working on a program to solve the red/blue computation; program is written in C.
Description of the problem is here : http://www.cs.utah.edu/~mhall/cs4961f10/CS4961-L9.pdf
tl;dr you have a grid of colors (red/blue/white), first red cells move to the right according to certain rules, then blue cells move down according to other rules.
I've got my program working and giving correct output, and I'm now trying to see if I can't speed it up at all.
Using Intel's VTune Amplifier (this is for a parallel programming course, and we're doing pthreads in visual studio with parallel studio integrated), I've identified that the biggest hotspot in my code is when moving blue cells.
Implementation details: grid is stored as a dynamically allocated int **, set up this way
globalBoard = malloc(sizeof(int *) * size);
for (i = 0; i < size; i++)
{
globalBoard[i] = malloc(sizeof(int) * size);
for (j = 0; j < size; j++)
globalBoard[i][j] = rand() % 3;
}
After some research, I believe the cause of the hotspot (almost 4 times as much CPU time as moving red cells) is cache misses when traversing column by column.
I understand that under the hood, this grid will be stored as a 1d array, so when I move red cells to the right and go row by row, I'm most often checking contiguous values, so the CPU doesn't need to load new values into the cache as often, whereas going column by column results in jumping around through the array by amounts that only increase as the size of the board does.
All that being said, I want this particular section to go faster. Here's the code as it stands now :
void blueStep(int col)
{
int i;
int local[size];
for (i = 0; i < size; i++) /* copy the column; reading and incrementing i in one expression is undefined behaviour */
local[i] = globalBoard[i][col];
for (i = 0; i < size; i++)
{
if (i < size - 1)
{
if (globalBoard[i][col] == 2 && globalBoard[i + 1][col] == 0)
{
local[i++] = 0;
local[i] = 2;
}
}
else
{
if (globalBoard[i][col] == 2 && globalBoard[0][col] == 0)
{
local[i++] = 0;
local[0] = 2;
}
}
}
for (i = 0; i < size; i++)
globalBoard[i][col] = local[i];
}
Here, col is which column to work on and size is how big the grid is (it's always square).
I was thinking that I might be able to do some kind of fancy pointer arithmetic to speed this up, and was reading this : http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/pointer.html.
Looking at that, I feel like I might need to change how I declare the grid in order to take advantage of 2d array pointer arithmetic, but I'm still not sure how I would go about traversing columns using that method.
Any help with that, or any other suggestions of fast ways to go through a column are welcome.
UPDATE: After a bit more research and discussion, it would seem my assumptions were incorrect. Turns out it's actually taking almost twice as long to write the results back to the global array than it is to loop over columns, due to false sharing. That said, I'm still somewhat curious to see if there are any better ways of doing column traversal.
I think the answer is to process the grid in tiles. You can do a very quick tile move, either down or right, in a 16x16 or 32x32 tile. The two moves will be effectively the same, and run at the same speed: read all values into XMM registers, process, write. You may want to investigate the MASKMOVDQU instruction here. If I understand the nature of the problem, you can overlap tiles by one row/column and this will work okay if you process them in the usual (scan) order. If not, you have to handle stitching the tiles separately.
There is no truly fast way to do this in C code. However, you can try (1) changing your board type to uint8_t, (2) replacing all if .. statements with arithmetic, like this: value = (mask & value) | (~mask & newvalue), and (3) turning on maximum loop unrolling and auto-vectorization in the compiler options. This will give you a nice speedup, especially from avoiding conditionals.
EDIT In addition to tiles that can fit in registers, you can also do a second level of tiles sized to fit in your cache. I think the combination will run at roughly your memory bandwidth.
EDIT Or, make your board type be two bits: pack four cells to a byte. Goes nicely with the replacing if statements with arithmetic idea :)
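As a rough illustration of the "replace if statements with arithmetic" idea, here is a sketch of one blue step on a single column that has been copied into a contiguous uint8_t buffer. The names select_u8 and blue_step_column are hypothetical; the cell encoding (0 = white, 2 = blue) and the wrap-around at the bottom follow the question's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns a where mask is 0xFF and b where mask is 0x00. */
static inline uint8_t select_u8(uint8_t mask, uint8_t a, uint8_t b)
{
    return (uint8_t)((mask & a) | (~mask & b));
}

/* One blue step on a column: every blue cell (2) with a white cell (0)
   below it moves down one place, wrapping from the last row to row 0.
   All decisions are based on the old state in `in`. */
void blue_step_column(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t cur  = in[i];
        uint8_t next = in[i + 1 == n ? 0 : i + 1];
        uint8_t prev = in[i == 0 ? n - 1 : i - 1];

        /* 0xFF when the condition holds, 0x00 otherwise. */
        uint8_t leaves = (uint8_t)-((cur == 2) & (next == 0));
        uint8_t enters = (uint8_t)-((cur == 0) & (prev == 2));

        uint8_t v = select_u8(leaves, 0, cur); /* blue cell moves away   */
        out[i] = select_u8(enters, 2, v);      /* blue cell arrives here */
    }
}
```

The two wrap-around index expressions may still compile to conditional moves or branches; handling the first and last rows outside the loop would remove them, at the cost of a little more code.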

C cache optimization for direct mapped cache

Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024 Byte direct-mapped cache with block sizes of 16 bytes. So that makes 64 lines (sets in this case) then. Assume the cache starts empty. Consider the following code:
struct pos {
int x;
int y;
};
struct pos grid[16][16];
int total_x = 0; int total_y = 0;
void function1() {
int i, j;
for (i = 0; i < 16; i++) {
for (j = 0; j < 16; j++) {
total_x += grid[j][i].x;
total_y += grid[j][i].y;
}
}
}
void function2() {
int i, j;
for (i = 0; i < 16; i++) {
for (j = 0; j < 16; j++) {
total_x += grid[i][j].x;
total_y += grid[i][j].y;
}
}
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every structure pos has a size of 8 Bytes, thus the total size of pos[16][16] is 2048 Bytes. The elements of the array are laid out in the following order:
pos[0][0] pos[0][1] pos[0][2] ...... pos[0][15] pos[1][0] ...... pos[1][15].......pos[15][0] ......pos[15][15]
The cache organization compared to the data
For the cache, each block is 16 Bytes, which is the same size as two elements of the array. The Entire cache is 1024 Bytes, which is half the size of the entire array. Since cache is direct-mapped, that means if we label the cache block from 0 to 63, we can safely assume that the mapping should look like this
------------ memory----------------------------cache
pos[0][0] pos[0][1] -----------> block 0
pos[0][2] pos[0][3] -----------> block 1
pos[0][4] pos[0][5] -----------> block 2
pos[0][14] pos[0][15] --------> block 7
.......
pos[1][0] pos[1][1] -----------> block 8
pos[1][2] pos[1][3] -----------> block 9
.......
pos[7][14] pos[7][15] --------> block 63
pos[8][0] pos[8][1] -----------> block 0
.......
pos[15][14] pos[15][15] -----> block 63
How function1 manipulates memory
The inner loop walks down a column. That means the first iteration loads pos[0][0] and pos[0][1] into cache block 0, and the second iteration loads pos[1][0] and pos[1][1] into cache block 8. The cache is cold, so in the first column every x access is a miss, while every y access is a hit. The second column's data was supposedly all loaded into the cache during the first column access, but this is NOT the case, since the pos[8][0] access has already evicted the block holding pos[0][0] (they both map to block 0!). And so on: the miss rate is 50%.
How function2 manipulates memory
The second function has a nice stride-1 access pattern. That means when accessing pos[0][0].x pos[0][0].y pos[0][1].x pos[0][1].y, only the first one is a miss due to the cold cache. The following accesses all repeat the same pattern. So the miss rate is only 25%.
A k-way associative cache follows the same analysis, although that may be more tedious. To get the most out of the cache system, try to establish a nice access pattern, say stride-1, and use the data as much as possible each time it is loaded from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to enhance efficiency. The best method is always to measure the time in the real world, dump the core code, and do a thorough analysis.
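The counting above can be double-checked with a tiny simulation. This is only a sketch under some assumptions that are not in the question: grid starts at address 0 (so it is block aligned), int is 4 bytes, and total_x, total_y and the loop counters live in registers and never touch the cache. The function name count_misses is made up for this example.

```c
#define CACHE_LINES 64   /* 1024-byte cache / 16-byte blocks */
#define BLOCK_SIZE  16
#define GRID_DIM    16
#define POS_SIZE    8    /* struct pos: two 4-byte ints */

/* Returns the number of misses over the 512 accesses of one function.
   column_wise != 0 mimics function1 (grid[j][i]),
   column_wise == 0 mimics function2 (grid[i][j]). */
int count_misses(int column_wise)
{
    long cached_block[CACHE_LINES];
    int misses = 0;

    for (int l = 0; l < CACHE_LINES; l++)
        cached_block[l] = -1;                    /* cache starts empty */

    for (int outer = 0; outer < GRID_DIM; outer++) {
        for (int inner = 0; inner < GRID_DIM; inner++) {
            int row = column_wise ? inner : outer;
            int col = column_wise ? outer : inner;
            for (int field = 0; field < 2; field++) {   /* .x then .y */
                long addr  = (long)(row * GRID_DIM + col) * POS_SIZE + field * 4;
                long block = addr / BLOCK_SIZE;
                int  line  = (int)(block % CACHE_LINES); /* direct mapped */
                if (cached_block[line] != block) {
                    cached_block[line] = block;          /* evict and fill */
                    misses++;
                }
            }
        }
    }
    return misses;
}
```

Under these assumptions, count_misses(1) gives 256 misses out of 512 accesses (50%) and count_misses(0) gives 128 (25%), matching the figures in the question.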
Ok, my computer science lectures are a bit far off but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that has been fetched; that depends on the number of fetches in the inner loop vs. the overall cache size, though.
Now we have to think about the cache size as well: You say that you have 64 lines that are directly mapped. In function 1, an inner loop is 16 fetches. That means, the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last inner loop walk. Every second inner loop should therefore only consist of hits.
Well, if my reasoning is correct, the answer that has been given to you should be wrong. Both functions should perform with 25% misses. Maybe someone will find a better answer, but if you understand my reasoning I'd ask a TA about that.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."

Manually optimize a nested loop

I'm working on a homework assignment where I must manually optimize a nested loop (my program will be compiled with optimizations disabled). The goal of the assignment is to run the entire program in less than 6 seconds (extra credit for less than 4.5 seconds).
I'm only allowed to change a small block of code, and the starting point is such:
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
Where ARRAY_SIZE is 9973. This loop is contained within another loop that is run 200,000 times. This particular version runs in 16 seconds.
What I've done so far is change the implementation to unroll the loop and use pointers as my iterator:
(These declarations are not looped over 200,000 times)
register int unroll_length = 16;
register int *unroll_end = array + (ARRAY_SIZE - (ARRAY_SIZE % unroll_length));
register int *end = array + (ARRAY_SIZE -1);
register int *curr_end;
curr_end = end;
while (unroll_end != curr_end) {
sum += *curr_end;
curr_end--;
}
do {
sum += *curr_end + *(curr_end-1) + *(curr_end-2) + *(curr_end-3) +
*(curr_end-4) + *(curr_end-5) + *(curr_end-6) + *(curr_end-7) +
*(curr_end-8) + *(curr_end-9) + *(curr_end-10) + *(curr_end-11) +
*(curr_end-12) + *(curr_end-13) + *(curr_end-14) + *(curr_end-15);
}
while ((curr_end -= unroll_length) != array);
sum += *curr_end;
Using these techniques, I was able to get the execution time down to 5.5 seconds, which will give me full credit. However, I would like to earn the extra credit, and I'm also curious what additional optimizations I might be overlooking.
Edit #1 (Adding outer loop)
srand(time(NULL));
for(j = 0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
for (i = 0; i < N_TIMES; i++) {
// inner loop goes here
if (sum != checksum)
printf("Checksum error!\n");
sum = 0;
}
You could try to store your variables in CPU registers with:
register int *unroll_limit = array + (ARRAY_SIZE - (ARRAY_SIZE % 10));
register int *end = array + ARRAY_SIZE;
register int *curr;
and try different sizes of manually unrolled loops to check when you maximize cache usage.
I'm going to assume you're on x86, if you're not most of this will still apply but the details differ.
1. Use SIMD/SSE. This will get you a 4x speed increase without much effort. It needs 16-byte aligned data, which you can get with _aligned_malloc or regular malloc plus manual alignment. Besides that, all you'll need in this case is _mm_add_epi32 to do four additions at the same time. (Different architectures have different SIMD units, so check yours.)
2. Use multi-threading/multiple cores. In this case it'd be easiest to have each thread sum half the array into a temporary variable and add those two results when done. This will scale linearly with the number of cores available.
3. Prefetch to L1 cache. This only works when you've got a huge array and are sure to be able to keep the CPU busy for at least ~200 cycles (e.g. a round trip to main RAM).
4. Completely go out of your way to optimize the hell out of it and use a GPU-based approach. This will require you to set up a CUDA or OpenCL environment and upload the array to the GPU. This is about ~400 LoC excluding the compute kernel. But it might not be worth it if you have a small dataset (e.g. too much overhead in setting up/tearing down) or if you have a huge changing dataset (e.g. too much time spent streaming to the GPU).
5. Align to page boundaries to prevent page faults (expensive). On Windows, pages are usually 4K in size.
6. Manually unroll the loop while taking into account dual-issue instructions and instruction latencies. This information is available from your CPU manufacturer (Intel provides it too). But on x86 this isn't really useful because of its CPUs' out-of-order execution.
Depending on your platform actually getting the data to the CPU for processing is the slowest part (this is mainly true for recent consoles & PS, I've never developed for small embedded devices) so you'll want to optimize for that. Tricks like iterating backwards are nice on a 6502 when cycles were the bottleneck but these days you'll want to access RAM linearly.
If you do happen to be on a machine with fast RAM (eg. NOT PC/Consoles), converting from the plain array to a more fancy data-structure (eg. one that does more pointer chasing) might totally be worth it.
All in all, I guess that 1 & 2 are easiest and most feasible and will gain you more than enough performance (e.g. 8x on a Core 2 Duo). However, it all comes down to knowing your hardware, and programming a PIC will require completely different optimizations (e.g. instruction-level manual pipelining) than a general PC will.
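For the SIMD suggestion, a minimal sketch could look like the following. It deviates from the description above in one respect: it uses unaligned loads (_mm_loadu_si128) so that it works on any int array, rather than requiring 16-byte aligned data. The function name sum_ints is made up, and the scalar fallback is only there so the code also builds on machines without SSE2.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#endif

/* Sums n ints, four at a time where SSE2 is available. */
int sum_ints(const int *a, size_t n)
{
    size_t i = 0;
    int sum = 0;

#ifdef HAVE_SSE2
    __m128i acc = _mm_setzero_si128();           /* four partial sums */
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        acc = _mm_add_epi32(acc, v);             /* four additions at once */
    }
    int lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);     /* combine the partial sums */
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    for (; i < n; i++)                           /* leftover elements */
        sum += a[i];
    return sum;
}
```

With ARRAY_SIZE equal to 9973, the vector loop handles 9972 elements and the scalar tail handles the last one. Whether intrinsics are allowed in the assignment is a separate question.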
Try to align the array on a page boundary ( i.e. 4K )
Try to compute with a wider data type, i.e. 64-bit instead of 32-bit integers. This way you can add 2 numbers at once. As the final step, add up both halves.
Convert part of the array or the computation to floating point, so you can use FPU and CPU in parallel
I don't expect the following suggestions to be allowed but I mention them anyway
Multithreading
Specialized CPU-Instructions, i.e. SSE
If the array values don't change, you could memoize the sum (i.e. calculate it on first run, and use the calculated sum on subsequent runs).
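The "wider data type" suggestion above can be sketched as follows. This only works under assumptions that happen to hold for the question's data but are not general: the values are non-negative, and the running sum of each 32-bit half stays below 2^32, so no carry crosses from the low half into the high half. The function name sum_pairs and the use of uint32_t are choices made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adds two 32-bit values per 64-bit addition. */
uint32_t sum_pairs(const uint32_t *a, size_t n)
{
    uint64_t acc = 0;
    size_t i = 0;

    for (; i + 2 <= n; i += 2) {
        uint64_t pair;
        memcpy(&pair, a + i, sizeof pair);   /* two neighbours in one word */
        acc += pair;                         /* both halves added at once  */
    }

    /* Final step: add up both halves. */
    uint32_t sum = (uint32_t)acc + (uint32_t)(acc >> 32);

    if (i < n)                               /* odd element count */
        sum += a[i];
    return sum;
}
```

Because both halves are added together at the end, the result does not depend on byte order. With optimizations disabled, the memcpy call may cost more than it saves, so this needs measuring.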
Some nice optimization tricks:
Make your loop count backwards from ARRAY_SIZE to 0 so that you can remove the comparisons from your code. Fewer comparisons speed up the program.
Furthermore, x86 CPUs nowadays are optimized for short loops, which they can "preload" to run faster than normal.
Try to use registers wherever possible
Use pointers instead of array indices
So if you would use arrays, try to use:
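register int idx = ARRAY_SIZE - 1;
register int sum = 0;
// peel off elements until idx is a multiple of 10
while (idx % 10 != 0) {
    sum += array[idx--];
}
// sums array[idx] down to array[idx - 9]; stops when idx reaches 0
// (assumes ARRAY_SIZE > 10, as in the question)
do {
    sum += array[idx] + array[idx - 1] + array[idx - 2] + array[idx - 3] + array[idx - 4] + array[idx - 5] + array[idx - 6] + array[idx - 7] + array[idx - 8] + array[idx - 9];
} while (idx -= 10);
sum += array[0]; // the unrolled loop stops before index 0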
// now we don't use a comparison and the ZERO flag will be set in FLAG
// register on which we can conditional jump. With a comparison you do VALUE - VALUE
// and then check if the ZERO flag is set or the NEGATIVE flag or whatever you are testing on

algorithm comparison in C, what's the difference?

#define IMGX 8192
#define IMGY 8192
int red_freq[256];
unsigned char img[IMGY][IMGX][3]; /* unsigned, so a pixel value is never a negative index into red_freq */
main(){
int i, j;
long long total;
long long redness;
for (i = 0; i < 256; i++)
red_freq[i] = 0;
for (i = 0; i < IMGY; i++)
for (j = 0; j < IMGX; j++)
red_freq[img[i][j][0]] += 1;
total = 0;
for (i = 0; i < 256; i++)
total += (long long)i * (long long)red_freq[i];
redness = (total + (IMGX*IMGY/2))/(IMGX*IMGY);
What's the difference when you replace the second for loop with
for (j = 0; j < IMGX; j++)
for (i = 0; i < IMGY; i++)
red_freq[img[i][j][0]] += 1;
Everything else stays the same. Why is the first algorithm faster than the second?
Does it have something to do with the memory allocation?
The first version alters memory in sequence, so uses the processor cache optimally.
The second version uses one value from each cache line it loads, so it is pessimal for cache use.
The point to understand is that the cache is divided into lines, each of which will contain many values in the overall structure.
The first version might also be optimized by the compiler to use more clever instructions (SIMD instructions) which would be even faster.
It is because the first version is iterating through the memory in the order that it is physically laid out, while the second one is jumping around in memory from one column in the array to the next. This will cause cache thrashing and interfere with the optimal performance of the CPU, which then has to spend lots of time waiting for the cache to be refreshed over and over again.
It's because big modern processor architectures (like the one in a PC) are massively optimised to work on memory which is 'near' (in address-related terms) memory which they've recently accessed. Actual physical memory access is much, much slower than the CPU can theoretically run, so everything which helps the process do its access in the most efficient fashion helps with performance.
It's pretty much impossible to generalise more than that, but 'locality of reference' is a good thing to aim for.
Due to how the memory is laid out, the first version maintains data locality and therefore causes fewer cache misses.
Memory allocation happens only once, at the beginning, so it cannot be the reason. The reason is how the address is calculated at runtime. In both cases, the offset of img[i][j][0] from the start of the array is calculated as
(i * (IMGX * 3)) + (j * 3) + 0
In the first algorithm
(i * (IMGX * 3)) gets calculated 8192 times
(j * 3) gets calculated 8192 * 8192 times
In the second algorithm
(i * (IMGX * 3)) gets calculated 8192 * 8192 times
(j * 3) gets calculated 8192 times
Since
(i * (IMGX * 3))
involves two multiplications, doing it more often takes more time. That is the reason.
Yes, it has something to do with memory allocation. The first loop indexes the inner dimension of img, which happens to span only 3 bytes each time. That's easily within one memory page (I believe a common size here is 4kB for one page). But with your second version, the outer dimension's index changes fast. That will cause memory reads spread over a much larger range of memory, namely sizeof (char[IMGX][3]) bytes apart, which is 24kB. And with each change of the inner index, those jumps start all over again. That will hit different pages and is probably somewhat slower. Also, I have heard that the CPU reads memory ahead. The first version will benefit from that, because by the time it reads, the data is probably already in the cache. I can imagine the second version doesn't benefit from that, because it makes those large jumps back and forth around the memory.
I would suspect the difference is not that large, but if the algorithm runs many times, it eventually becomes noticeable. You probably want to read the article Row-major Order on Wikipedia. That is the scheme used to store multi-dimensional arrays in C.
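To put numbers on the jumps described above, here is a small sketch that computes the byte offset of img[i][j][c] for the array declared in the question. The helper name img_offset is made up for this example.

```c
#include <stddef.h>

#define IMGX 8192
#define IMGY 8192

/* Byte offset of img[i][j][c] from the start of char img[IMGY][IMGX][3].
   C stores the array in row-major order, so the last index varies fastest. */
size_t img_offset(size_t i, size_t j, size_t c)
{
    return (i * IMGX + j) * 3 + c;
}
```

Stepping j by one moves 3 bytes (img_offset(0, 1, 0) - img_offset(0, 0, 0)), while stepping i by one moves IMGX * 3 = 24576 bytes (img_offset(1, 0, 0) - img_offset(0, 0, 0)). Assuming 64-byte cache lines, the first loop order touches a new line roughly every 21 pixels, whereas the second touches a new line on every access.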

Resources