C pointers: how to increase every 2nd byte by X

I have a pointer that's holding 100 bytes of data.
I would like to add 5 to every 2nd byte.
Example:
1 2 3 4 5 6
will become:
1 7 3 9 5 11
Now I know I can do a for loop; is there a quicker way? Something like memset that will increase the value of every 2nd byte?
Thanks

A loop would be the best way. memset() is efficient for setting a contiguous block of memory to a single byte value, so it wouldn't be much help to you here.
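For reference, a minimal sketch of the plain byte loop (my own illustration; it assumes the data is addressed as an unsigned char buffer of the stated length):

#include <stddef.h>

/* Add 5 to every 2nd byte (indices 1, 3, 5, ...). Unsigned bytes simply
 * wrap around on overflow. */
void add_to_every_second_byte(unsigned char *buf, size_t len)
{
    for (size_t i = 1; i < len; i += 2)
        buf[i] += 5;
}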

In which format do you have your bytes? As an array of chars, or are the bytes part of e.g. a uint32?
In general, a loop is the best way to do this - even if you were able to apply a pattern-like mask with memset, you would still need to create it, and that would take roughly the same number of CPU cycles.
If you have 4 bytes per element (e.g. uint32), you could roughly halve the number of loop iterations by adding a pre-defined mask to one 32-bit word at a time. But beware: such a solution does not check for overflows (pseudocode):
uint32* ptr = new uint32[16]; // 16 * 4 = 64 bytes of data
// (...) fill data
for (int k = 0; k < 16; ++k)
{
    // Hardcoded add mask for little-endian systems:
    // adds 5 to byte offsets 1 and 3 of each 32-bit word
    ptr[k] += 0x05000500;
}
Edit: Please note that this assumes a little endian system and is C++ Pseudocode.
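A runnable C sketch of the same word-at-a-time idea (my own adaptation, not the answerer's code; it uses the little-endian mask above and, like the pseudocode, does nothing about a carry spilling from one byte into the next):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add 5 to every 2nd byte, processing 4 bytes per iteration.
 * memcpy avoids alignment and strict-aliasing problems. */
void add_mask_wordwise(unsigned char *buf, size_t len)
{
    const uint32_t mask = 0x05000500u; /* little-endian: hits byte offsets 1 and 3 */
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        uint32_t w;
        memcpy(&w, buf + i, sizeof w);
        w += mask;                     /* no per-byte carry handling */
        memcpy(buf + i, &w, sizeof w);
    }
    for (i += 1; i < len; i += 2)      /* leftover tail bytes, if any */
        buf[i] += 5;
}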

If memset supported increasing the value of every nth byte, how do you think it would accomplish it?

You can in fact make it faster by using loop unrolling, but you'll need to know that your array is a fixed size. Then you would skip the loop overhead by just repeatedly assigning the value:
array[ 1 ] += 5;
array[ 3 ] += 5;
array[ 5 ] += 5;
...
By doing this you don't have the overhead incurred by jump and test instructions that would be present in a loop, but you pay for it in code bloat.
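In practice you would usually unroll by a fixed factor rather than writing out all the increments; a sketch of such a partial unroll (my own illustration, four increments per iteration):

#include <stddef.h>

/* Add 5 to every 2nd byte, unrolled four increments per iteration.
 * Modern compilers will often unroll or vectorize the plain loop anyway. */
void add_unrolled(unsigned char *buf, size_t len)
{
    size_t i = 1;
    for (; i + 6 < len; i += 8) {  /* covers indices i, i+2, i+4, i+6 */
        buf[i]     += 5;
        buf[i + 2] += 5;
        buf[i + 4] += 5;
        buf[i + 6] += 5;
    }
    for (; i < len; i += 2)        /* remaining odd indices */
        buf[i] += 5;
}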

Related

Switch with tons of cases vs single if

Let's say we need to go through a 128-element array of uint8, compare neighbouring elements, and put the result into another array. The code below is the simplest and most readable way to solve this problem.
for (i = 1; i < 128; i++)
    if (arr[i] < arr[i-1] + 64) // don't care about overflow
        arr2[i] = 1;
It looks like this code (1) will not use a branch table.
And as far as I know, a CPU doesn't read just 1 byte; it actually reads 8 bytes (assuming a 64-bit machine), which (2) makes the CPU do some extra work.
So here comes another approach: read 2 (or 4, or 8) bytes at a time and create an extremely huge switch (2^16, 2^32 or 2^64 cases respectively) which has every possible combination of bytes in our array. Does this make any sense?
For this discussion let's assume the following:
1) Our main priority is speed
2) Next is RAM consumption.
We don't care about the size of the executable (unless it somehow affects speed or RAM).
You should know that a switch here would actually be very slow, as the branch would likely be mispredicted. What makes a switch fast is the jump table:
switch (i) {
case 0: ...
case 1: ...
}
gets translated into this:
labels = {&case0, &case1}
goto labels[i]
However, you do not need a switch either, since you're only writing to a memory cell; you can build the "jump table" - or more precisely, a pre-computed matrix of answers - yourself:
for (i = 1; i < 128; i++)
    arr2[i] = answers[arr[i]][arr[i-1]];
uint8 has only 256 possible values, so the matrix needs 256 * 256 = 64 KB of RAM.
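A minimal sketch of how such a table could be built and used (my own illustration; the names answers, arr and arr2 follow the snippets above, and the table entry mirrors the original arr[i] < arr[i-1] + 64 test):

#include <stdint.h>

static uint8_t answers[256][256];  /* answers[a][b], a = arr[i], b = arr[i-1] */

static void build_answers(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            answers[a][b] = (a < b + 64) ? 1 : 0;  /* computed in int, no wrap */
}

static void compare_neighbours(const uint8_t arr[128], uint8_t arr2[128])
{
    for (int i = 1; i < 128; i++)
        arr2[i] = answers[arr[i]][arr[i - 1]];
}

Whether this actually beats the plain branchy loop is something to measure; the branchless arr2[i] = (arr[i] < arr[i-1] + 64); is also worth benchmarking.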

"Blocking" method to make code cache friendly

Hey, so I'm looking at some matrix shift code and need to make it cache friendly (fewest cache misses possible). The code looks like this:
int i, j, temp;
for (i = 1; i < M; i++) {
    for (j = 0; j < N; j++) {
        temp = A[i][j];
        A[i][j] = A[i-1][j];
        A[i-1][j] = temp;
    }
}
Assume M and N are parameters of the function, with M the number of rows and N the number of columns. Now, to make this more cache friendly, the book gives two problems for optimization: when the matrix is 4x4, s=1, E=2, b=3, and when the matrix is 128x128, s=5, E=2, b=3.
(s = number of set index bits (so S = 2^s is the number of sets), E = number of lines per set, and b = number of block bits (so B = 2^b is the block size in bytes).)
So using the blocking method, I should access the matrix by block size, to avoid getting a miss and having the cache fetch the information from the next level up. So here is what I assume:
Block size is 9 bytes for each
With the 4x4 matrix, the number of elements that fit evenly on a block is:
blocksize*(number of columns/blocksize) = 9*(4/9) = 4
So if each row will fit on one block, why is it not cache friendly?
With the 128x128 matrix, with the same logic as above, each block will hold (9*(128/9)) = 128.
So obviously after calculating that, this equation is wrong. I'm looking at the code from this page http://csapp.cs.cmu.edu/public/waside/waside-blocking.pdf
Once I reached this point, I knew I was lost, which is where you guys come in! Is it as simple as saying each block holds 9 bytes, and 8 bytes (two integers) are what fits evenly into it? Sorry this stuff really confuses me, I know I'm all over the place. Just to be clear, these are my concerns:
How do you know how many elements will fit in a block?
Do the number of lines or sets affect this number? If so, how?
Any in depth explanation of the code posted on the linked page.
Really just trying to get a grasp of this.
UPDATE:
Okay so here is where I'm at for the 4x4 matrix.
I can read 8 bytes at a time, which is 2 integers. The original function will have cache misses because C stores arrays in row-major order, so every time it wants A[i-1][j] it will miss and load the block that holds A[i-1][j], which would be either A[i-1][0] and A[i-1][1], or A[i-1][2] and A[i-1][3].
So, would the best way to go about this be to create another temp variable, and do A[i][0] = temp, A[i][1] = temp2, then load A[i-1][0] A[i-1][1] and set them to temp, and temp2 and just set the loop to j<2? For this question, it is specifically for the matrices described; I understand this wouldn't work on all sizes.
The solution to this problem was to think of the matrix in column major order rather than row major order.
Hopefully this helps someone in the future. Thanks to @Michael Dorgan for getting me thinking.
End results for 128x128 matrix:
Original: 16218 misses
Optimized: 8196 misses
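For reference, a sketch of what the blocked version of the shift loop can look like (my own illustration, not the book's solution; BLK is assumed to match the number of ints per cache block, e.g. 2 for the 8-byte blocks above):

#define BLK 2   /* ints per cache block: 8-byte block / 4-byte int */

/* Swap each row with the row above it, one column block at a time, so the
 * blocks for row i and row i-1 are reused before they can be evicted. */
void shift_blocked(int M, int N, int A[M][N])
{
    for (int jj = 0; jj < N; jj += BLK)
        for (int i = 1; i < M; i++)
            for (int j = jj; j < jj + BLK && j < N; j++) {
                int temp = A[i][j];
                A[i][j] = A[i-1][j];
                A[i-1][j] = temp;
            }
}

Since each swap only touches elements in the same column, reordering the column traversal this way leaves the result unchanged.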

C cache optimization for direct mapped cache

Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024-byte direct-mapped cache with a block size of 16 bytes. So that makes 64 lines (sets, in this case). Assume the cache starts empty. Consider the following code:
struct pos {
    int x;
    int y;
};
struct pos grid[16][16];
int total_x = 0; int total_y = 0;
void function1() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[j][i].x;
            total_y += grid[j][i].y;
        }
    }
}
void function2() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[i][j].x;
            total_y += grid[i][j].y;
        }
    }
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How the data is stored in memory
Every struct pos has a size of 8 bytes, so the total size of grid[16][16] is 2048 bytes, and the array is laid out in this order:
grid[0][0] grid[0][1] grid[0][2] ...... grid[0][15] grid[1][0] ...... grid[1][15] ...... grid[15][0] ...... grid[15][15]
The cache organization compared to the data
For the cache, each block is 16 bytes, which is the same size as two elements of the array. The entire cache is 1024 bytes, half the size of the array. Since the cache is direct-mapped, if we label the cache blocks from 0 to 63, the mapping looks like this:
------------ memory ---------------------------- cache
grid[0][0]   grid[0][1]   -----------> block 0
grid[0][2]   grid[0][3]   -----------> block 1
grid[0][4]   grid[0][5]   -----------> block 2
.......
grid[0][14]  grid[0][15]  -----------> block 7
grid[1][0]   grid[1][1]   -----------> block 8
grid[1][2]   grid[1][3]   -----------> block 9
.......
grid[7][14]  grid[7][15]  -----------> block 63
grid[8][0]   grid[8][1]   -----------> block 0
.......
grid[15][14] grid[15][15] -----------> block 63
How function1 manipulates memory
The inner loop walks down a column: the first iteration loads grid[0][0] and grid[0][1] into cache block 0, the second iteration loads grid[1][0] and grid[1][1] into cache block 8, and so on. The cache is cold, so in the first column the x access always misses while the y access always hits. You might think the second column's data was all loaded into the cache during the first column's accesses, but this is NOT the case: the access to grid[8][0] has already evicted the old grid[0][0] block (they both map to block 0!). And so on; the miss rate is 50%.
How function2 manipulates memory
The second function has a nice stride-1 access pattern. That means when accessing grid[0][0].x, grid[0][0].y, grid[0][1].x, grid[0][1].y, only the first access is a miss, due to the cold cache, and the same pattern repeats for every block. So the miss rate is only 25%.
A k-way associative cache follows the same analysis, just more tediously. To get the most out of the cache system, aim for a nice access pattern (say stride-1) and use the data as much as possible during each load from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to improve efficiency. The best method is always to measure the time in the real world, inspect the generated code, and do a thorough analysis.
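To make the 50% vs 25% figures concrete, here is a toy simulation of that direct-mapped cache (my own sketch; it assumes grid starts at a block-aligned address and only tracks which 16-byte block each of the 64 lines currently holds):

#include <stdio.h>
#include <string.h>

enum { LINES = 64, BLOCK = 16 };
static long line_tag[LINES];      /* block number currently held by each line */
static long misses, accesses;

struct pos { int x; int y; };
static struct pos grid[16][16];

static void touch(const void *p)  /* simulate one memory access */
{
    long block = ((const char *)p - (const char *)grid) / BLOCK;
    long line = block % LINES;
    accesses++;
    if (line_tag[line] != block) {  /* not cached: count a miss and fill */
        line_tag[line] = block;
        misses++;
    }
}

static void run(int column_major)
{
    memset(line_tag, -1, sizeof line_tag);  /* cold cache */
    misses = accesses = 0;
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < 16; j++) {
            const struct pos *p = column_major ? &grid[j][i] : &grid[i][j];
            touch(&p->x);
            touch(&p->y);
        }
    printf("%s: %ld misses / %ld accesses\n",
           column_major ? "function1" : "function2", misses, accesses);
}

int main(void)
{
    run(1);  /* grid[j][i] order: expect 256/512 = 50% */
    run(0);  /* grid[i][j] order: expect 128/512 = 25% */
    return 0;
}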
Ok, my computer science lectures are a bit far off but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access to grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that was fetched; whether it is still there later depends on the number of fetches in the inner loop vs. the overall cache size.
Now we have to think about the cache size as well: you say you have 64 lines that are direct-mapped. In function1, one pass of the inner loop is 16 fetches. That means on the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last pass of the inner loop. Every second pass of the inner loop should therefore consist only of hits.
Well, if my reasoning is correct, the answer that has been given to you would be wrong: both functions should perform with 25% misses. Maybe someone can find a better answer, but if you understand my reasoning I'd ask a TA about it.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these counted as two memory accesses or one? A decent compiler with optimization enabled should optimize this to
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."

Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate]

This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array?
Which of the following orderings of nested loops to iterate over a 2D array is more efficient in terms of time (cache performance)? Why?
int a[100][100];
for (i = 0; i < 100; i++)
{
    for (j = 0; j < 100; j++)
    {
        a[i][j] = 10;
    }
}
or
for (i = 0; i < 100; i++)
{
    for (j = 0; j < 100; j++)
    {
        a[j][i] = 10;
    }
}
The first method is slightly better, as the cells being assigned to lie next to each other.
First method:
[ ][ ][ ][ ][ ] ....
^1st assignment
^2nd assignment
[ ][ ][ ][ ][ ] ....
^101st assignment
Second method:
[ ][ ][ ][ ][ ] ....
^1st assignment
^101st assignment
[ ][ ][ ][ ][ ] ....
^2nd assignment
For array[100][100] they are both the same, if the L1 cache is larger than 100*100*sizeof(int) == 10000*sizeof(int) == [usually] 40000 bytes. Note that on Sandy Bridge, 100*100 integers should be enough elements to see a difference, since the L1 cache is only 32k.
Compilers will probably optimize this code to the same thing anyway.
Assuming no compiler optimizations, and that the matrix does not fit in the L1 cache, the first code is better due to cache performance [usually]. Every time an element is not found in the cache, you get a cache miss and need to go to RAM or the L2 cache [which are much slower]. Bringing elements from RAM into the cache [a cache fill] is done in blocks [usually 8/16 bytes], so in the first code you get at most a miss rate of 1/4 [assuming 16-byte cache blocks and 4-byte ints], while in the second code it is unbounded and can even be 1: elements that were already in the cache [brought in by the cache fill for adjacent elements] get evicted, and you pay a redundant cache miss.
This is closely related to the principle of locality, which is the general assumption used when implementing cache systems. The first code follows this principle while the second doesn't, so the cache performance of the first will be better than that of the second.
Conclusion:
For all cache implementations I am aware of, the first will be no worse than the second. They might be the same - if there is no cache at all, if the whole array fits in the cache completely, or due to compiler optimization.
This sort of micro-optimization is platform-dependent so you'll need to profile the code in order to be able to draw a reasonable conclusion.
In your second snippet, the change in j in each iteration produces a pattern with low spatial locality. Remember that behind the scenes, an array reference computes:
( ((y) * (row->width)) + (x) )
Consider a simplified L1 cache that has enough space for only 50 rows of our array. For the first 50 iterations, you will pay the unavoidable cost for 50 cache misses, but then what happens? For each iteration from 50 to 99, you will still cache miss and have to fetch from L2 (and/or RAM, etc). Then, x changes to 1 and y starts over, leading to another cache miss because the first row of your array has been evicted from the cache, and so forth.
The first snippet does not have this problem. It accesses the array in row-major order, which achieves better locality - you only have to pay for cache misses at most once (if a row of your array is not present in the cache at the time the loop starts) per row.
That being said, this is a very architecture-dependent question, so you would have to take into consideration the specifics (L1 cache size, cache line size, etc.) to draw a conclusion. You should also measure both ways and keep track of hardware events to have concrete data to draw conclusions from.
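A minimal sketch of such a measurement (my own illustration using POSIX clock_gettime; in a real benchmark you would repeat the runs, use a larger array, and make sure the compiler cannot optimize the loops away):

#include <stdio.h>
#include <time.h>

#define N 1000
static int a[N][N];

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = seconds();
    for (int i = 0; i < N; i++)       /* row-major: a[i][j] */
        for (int j = 0; j < N; j++)
            a[i][j] = 10;
    double t1 = seconds();
    for (int i = 0; i < N; i++)       /* column-major: a[j][i] */
        for (int j = 0; j < N; j++)
            a[j][i] = 10;
    double t2 = seconds();

    printf("row-major:    %f s\n", t1 - t0);
    printf("column-major: %f s\n", t2 - t1);
    printf("%d\n", a[N-1][N-1]);      /* keep the stores observable */
    return 0;
}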
Considering that C++ stores arrays in row-major order, I believe the first method is going to be a bit faster. In memory, a 2D array is laid out as a single one-dimensional array, and performance depends on whether you access it in row-major or column-major order.
This is a classic problem of cache locality.
Most of the time the first one is better, but I think the exact answer is: IT DEPENDS; a different architecture may give a different result.
In the second method there are more cache misses, because the cache stores contiguous data.
Hence the first method is more efficient than the second.
In your case (filling the whole array with one value), this will be faster:
int *p = &a[0][0];
for (j = 0; j < 100 * 100; j++) {
    p[j] = 10;
}
and you could still treat a as a two-dimensional array.
EDIT:
As Binyamin Sharet mentioned, this only works because a is declared as one contiguous block (int a[100][100]); it would not work if a were built from separately allocated rows, like this:
int **a = new int*[100];
for (int i = 0; i < 100; i++) {
    a[i] = new int[100];
}
In general, better locality (noted by most responders) is only the first advantage of loop #1.
The second (related) advantage is that for loops like #1, the compiler is normally able to auto-vectorize the code efficiently, thanks to the stride-1 memory access pattern (stride-1 means contiguous access to array elements, one by one, on every iteration).
On the contrary, for loops like #2, auto-vectorization will not normally work well, because there is no consecutive stride-1 access to contiguous blocks of memory.
Well, my answer is general. For very simple loops exactly like #1 or #2, even simpler aggressive compiler optimizations could be applied (erasing any difference), and the compiler will normally also be able to auto-vectorize #2 with stride-1 over the outer loop (especially with #pragma simd or similar).
The first option is also better because we can store a[i] in a temporary variable inside the outer loop and then index it with j; in that sense it acts as a cached row.
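A minimal sketch of that cached-row idea (my own illustration; most compilers hoist the row address out of the inner loop automatically, so treat this as a readability aid rather than a required optimization):

int a[100][100];

void fill_row_major(void)
{
    for (int i = 0; i < 100; i++) {
        int *row = a[i];          /* row address computed once per row */
        for (int j = 0; j < 100; j++)
            row[j] = 10;
    }
}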

Computationally efficient three dimensional arrays in C

I am trying to solve numerically a set of partial differential equations in three dimensions. In each of the equations the next value of the unknown in a point depends on the current value of each unknown in the closest points.
To write an efficient code I need to keep the points close in the three dimensions close in the (one-dimensional) memory space, so that each value is called from memory just once.
I was thinking of using octrees, but I was wondering if someone knows a better method.
Octrees are the way to go. You subdivide the array into 8 octants:
1 2
3 4
---
5 6
7 8
And then lay them out in memory in the order 1, 2, 3, 4, 5, 6, 7, 8 as above. You repeat this recursively within each octant until you get down to some base size, probably around 128 bytes or so (this is just a guess -- make sure to profile to determine the optimal cutoff point). This has much, much better cache coherency and locality of reference than the naive layout.
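A rough sketch of the index mapping such a layout implies (my own illustration; the octant numbering convention here is arbitrary, and this recursive ordering is in fact the same Z-order that the Morton-code answer below produces):

#include <stddef.h>

/* Map (x, y, z) inside a cube of side 2^levels to its position in the
 * recursive octant layout: pick an octant at each level, then recurse.
 * Each octant of side h contains h*h*h cells. */
static size_t octant_index(unsigned x, unsigned y, unsigned z, unsigned levels)
{
    if (levels == 0)
        return 0;
    unsigned half = 1u << (levels - 1);
    unsigned oct = ((z >= half) << 2) | ((y >= half) << 1) | (x >= half);
    size_t per_octant = (size_t)half * half * half;
    return oct * per_octant +
           octant_index(x & (half - 1), y & (half - 1), z & (half - 1), levels - 1);
}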
One alternative to the tree method: use Morton order to encode your data.
In three dimensions it goes like this: take each coordinate component and interleave each of its bits with two zero bits. Shown in binary: 11111b becomes 1001001001001b.
A C function to do this looks like this (shown for clarity, and only for 11 bits):
int morton3 (int a)
{
    int result = 0;
    int i;
    for (i = 0; i < 11; i++)
    {
        // check if the i'th bit is set.
        int bit = a & (1 << i);
        if (bit)
        {
            // if so, set the 3*i'th bit in the result:
            result |= 1 << (i * 3);
        }
    }
    return result;
}
You can use this function to combine your positions like this:
index = morton3(position.x) +
        morton3(position.y) * 2 +
        morton3(position.z) * 4;
This turns your three-dimensional index into a one-dimensional one. The best part: values that are close in 3D space are close in 1D space as well, so if you frequently access values that are close to each other you will also get a very nice speed-up, because Morton-order encoding has very good cache locality.
For morton3 you'd better not use the code above. Use a small table to look up 4 or 8 bits at a time and combine the results.
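A hedged sketch of that table-driven variant (my own illustration; spread_lut holds the 3-way bit spread of every byte value, and a 16-bit coordinate is handled 8 bits at a time):

#include <stdint.h>

static uint32_t spread_lut[256];  /* spread_lut[v] = bits of v spaced 3 apart */

static void init_spread_lut(void)
{
    for (int v = 0; v < 256; v++) {
        uint32_t r = 0;
        for (int i = 0; i < 8; i++)
            if (v & (1 << i))
                r |= 1u << (3 * i);
        spread_lut[v] = r;
    }
}

/* Spread a 16-bit coordinate: low byte via the table, high byte shifted
 * by 24 bit positions (8 source bits * 3). */
static uint64_t morton3_lut(uint16_t a)
{
    return spread_lut[a & 0xFF] | ((uint64_t)spread_lut[a >> 8] << 24);
}

Combining the three coordinates then works as above, e.g. index = morton3_lut(x) | (morton3_lut(y) << 1) | (morton3_lut(z) << 2); (the shift form is equivalent to the *2 / *4 form, since the spread bit sets do not overlap).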
Hope it helps,
Nils
The book Foundations of Multidimensional and Metric Data Structures can help you decide which data structure is fastest for range queries: octrees, kd-trees, R-trees, ...
It also describes data layouts for keeping points together in memory.
