How does the loop become more efficient?

Hi, I came across the following advice in a response to a Stack Overflow question. It has no direct connection to the question itself, but it seems to improve the efficiency of the code:
for (int i = 0; i < nodeList.getLength(); i++)
change to
for (int i = 0, len = nodeList.getLength(); i < len; i++)
to be more efficient. The second way may be the best as it tends to
use a flatter, predictable memory model.
I read about the flat memory model, but I couldn't get the point here, i.e., in what way does it make the code more efficient? Can somebody explain?
Ref: https://stackoverflow.com/a/12736268/3320657

Flat memory model or linear memory model refers to a memory addressing paradigm in which "memory appears to the program as a single contiguous address space." The CPU can directly (and linearly) address all of the available memory locations without having to resort to any sort of memory segmentation or paging schemes.
Keeping this in mind, the computer declares memory one line at a time. Declaring the variable within the line itself causes less of a strain when seeking the value.
for (int i = 0, len = nodeList.getLength(); i < len; i++)
is more efficient than:
int len = nodeList.getLength();
for (int i = 0; i < len; i++)

Either way, nodeList.getLength() isn't called every time the loop runs; it is called once, the result is stored in the integer len, and each iteration then compares i against len instead of calling nodeList.getLength() again.
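The same hoisting idea can be shown in C (a minimal sketch, not from the original answer; strlen() stands in for getLength(), and the string and loop bodies are purely illustrative):

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "some reasonably long string to iterate over";

    /* Re-evaluates strlen(s) on every iteration: O(n) work just for the test. */
    for (size_t i = 0; i < strlen(s); i++)
        putchar(s[i]);
    putchar('\n');

    /* Evaluates the bound once; each iteration is then a cheap compare against len. */
    for (size_t i = 0, len = strlen(s); i < len; i++)
        putchar(s[i]);
    putchar('\n');

    return 0;
}

(A good optimizer may hoist the call itself when it can prove the bound doesn't change, but writing it explicitly does not rely on that.)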

Related

Is there an approach to traverse array randomly?

I am trying to compare linear memory access to random memory access. I am traversing an array in the order of its indices to log the performance of linear memory access. However, to log the memory's performance with random access I want to traverse my array randomly, i.e. arr[8], arr[17], arr[34], arr[2]...
Can I use pointer chasing to achieve this while ensuring that no index is accessed twice? Is pointer chasing the most optimal approach in this case?
If your goal is to show that sequential access is faster than non-sequential access, simply pointer chasing the latter is not a good way to demonstrate that. You would be comparing access via a single pointer plus a simple offset against dereferencing one or more pointers before offsetting.
To use pointer chasing, you'd have to apply it to both cases. Here's an example:
int arr[n], i;          /* n is assumed to be defined elsewhere */
int *unshuffled[n];
int *shuffled[n];

for (i = 0; i < n; i++) {
    unshuffled[i] = arr + i;
}

/* I'll let you figure out how to randomize your indices */
shuffle(unshuffled, shuffled);

/* Do timing on these two loops */
for (i = 0; i < n; i++) {
    do_stuff(*unshuffled[i]);
}
for (i = 0; i < n; i++) {
    do_stuff(*shuffled[i]);
}
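The shuffle() above is deliberately left unimplemented. One possible sketch is a Fisher-Yates shuffle over a copy of the pointer array; the three-argument signature and the use of rand() are assumptions of this sketch (so the call above would become shuffle(unshuffled, shuffled, n)):

#include <stdlib.h>
#include <string.h>

/* Hypothetical shuffle(): copy the unshuffled pointers, then apply a
 * Fisher-Yates shuffle so every index appears exactly once. */
static void shuffle(int *unshuffled[], int *shuffled[], size_t n) {
    memcpy(shuffled, unshuffled, n * sizeof *shuffled);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;     /* crude, but fine for a benchmark */
        int *tmp = shuffled[i - 1];
        shuffled[i - 1] = shuffled[j];
        shuffled[j] = tmp;
    }
}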
If you want to time the direct access better though, you could construct some simple formula for advancing the index instead of randomizing the access completely:
for(i = 0; i < n; i++) {
do_stuff(arr[i]);
}
for(i = 0; i < n; i++) {
do_stuff(arr[i / 2 + (i % 2) * (n / 2)]);
}
This will only work properly for even n as shown (for n = 8 it visits the indices in the order 0, 4, 1, 5, 2, 6, 3, 7), but it illustrates the idea. You could go so far as to compensate for the extra flops in computing the index within do_stuff.
Probably the most apples-to-apples test would be to literally access the indices you want, without loops or additional computations:
do_stuff(arr[0]);
do_stuff(arr[1]);
do_stuff(arr[2]);
...
do_stuff(arr[123]);
do_stuff(arr[17]);
do_stuff(arr[566]);
...
Since I'd imagine you'd want to test with large arrays, you can write a program to generate the actual test code for you, and possibly compile and run the result.
I can tell you that for arrays in C the access time is constant regardless of the index being accessed. There will be no difference between accessing them randomly or sequentially other than the fact that randomizing will in itself introduce additional computations.
But, to really answer your question, you would probably be best off building some kind of lookup array, shuffling it a few times, and using that array to get the next index. Obviously, by doing so you would be accessing two arrays, one sequentially and another randomly, thus making the exercise pretty much useless.

cannot allocate memory in C [closed]

I want to define a 3D array/pointer to store my computing results:
#include <stdio.h>

int main(void) {
    float ***ww;
    int i, j, k;
    int n1 = 3000, n2 = 6000, n3 = 9000;

    /* floatalloc3 is a self-defined function used to allocate memory space for a 3D array/pointer */
    ww = floatalloc3(n1, n2, n3);

    for (i = 0; i < n1; i++) {
        for (j = 0; j < n2; j++) {
            for (k = 0; k < n3; k++) {
                ww[i][j][k] = 0.0;
            }
        }
    }

    free(**ww);
    free(*ww);
    free(ww);
    return 0;
}
But an error pops up saying cannot allocate 648000000000 bytes: Cannot allocate memory. How can I solve this problem, any ideas?
Is it possible to split the data and store it on different cores?
You want to allocate about 3000 * 6000 * 9000 * 4 bytes of data, which is ~648GB, a little too much.
If you truly intend to compute 3000×6000×9000 floating-point values (162,000,000,000 values), you need to reconsider your approach.
There are several approaches, but the most typical is to split the task into smaller sections, and compute them one by one.
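A rough sketch of that approach, computing and handling one n2 x n3 slab at a time (the compute() function here is a hypothetical stand-in for the real per-cell work; it is not from the question):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-cell computation, standing in for the real work. */
static float compute(int i, int j, int k) { return (float)(i + j + k); }

int main(void) {
    int n1 = 3000, n2 = 6000, n3 = 9000;

    /* One n2 x n3 slab at a time: 6000 * 9000 * 4 bytes is roughly 216 MB. */
    float *slab = malloc((size_t)n2 * n3 * sizeof *slab);
    if (!slab) { perror("malloc"); return 1; }

    for (int i = 0; i < n1; i++) {
        for (int j = 0; j < n2; j++)
            for (int k = 0; k < n3; k++)
                slab[(size_t)j * n3 + k] = compute(i, j, k);
        /* ... write the slab to disk, or reduce it, before reusing the buffer ... */
    }

    free(slab);
    return 0;
}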
If the grid is mostly empty, but you need to access it in an unpredictable fashion (making splitting the task undesirable), you can use OS-specific methods to memory-map the dataset. (You do need a 64-bit OS to do this with this particular dataset, though; and you do also need sufficient storage on some filesystem to store the data in.) I've shown how to do this in Linux in 2011 in another forum here; this example program manipulates a terabyte-sized memory-mapped dataset, almost twice the size of the dataset OP is considering.
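As a minimal illustration of the memory-mapping approach (a sketch assuming Linux, a 64-bit build, and a filesystem that supports sparse files; this is not the linked example program):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    size_t n1 = 3000, n2 = 6000, n3 = 9000;
    size_t bytes = n1 * n2 * n3 * sizeof(float);    /* ~648 GB of backing store */

    int fd = open("dataset.bin", O_RDWR | O_CREAT, 0644);
    if (fd == -1) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)bytes) == -1) { perror("ftruncate"); return 1; }

    /* The kernel pages data in and out on demand; only the pages you touch need RAM. */
    float *cells = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (cells == MAP_FAILED) { perror("mmap"); return 1; }

    cells[0] = 1.0f;        /* touch one cell; untouched pages stay on disk */

    munmap(cells, bytes);
    close(fd);
    return 0;
}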
In any case, you definitely do not wish to use two-level indirection to access the data. This wreaks havoc with a current CPU's ability to predict and cache accesses, and will lead to poor performance. Instead, you should use a linear data structure. For example:
size_t xsize;
size_t ysize;
size_t zsize;
float *cells;
#define CELL(x,y,z) cells[(x) + xsize*( (y) + ysize * (z) )]
In other words, the index of each cell in the grid is (x) + (y)*xsize + (z)*xsize*ysize. Not only is the data then consecutive in memory (which is important for caching), but the CPU (and your compiler) can also better predict future accesses, based on access patterns.
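A small usage sketch of that flat layout (sizes scaled down here purely so the allocation can succeed; the CELL macro is the one defined above):

#include <stdio.h>
#include <stdlib.h>

size_t xsize, ysize, zsize;
float *cells;
#define CELL(x,y,z) cells[(x) + xsize*( (y) + ysize * (z) )]

int main(void) {
    xsize = 300; ysize = 600; zsize = 900;      /* ~648 MB instead of ~648 GB */
    cells = malloc(xsize * ysize * zsize * sizeof *cells);
    if (!cells) { fprintf(stderr, "cannot allocate memory\n"); return 1; }

    /* Innermost loop walks x, the contiguous dimension, for cache-friendly access. */
    for (size_t z = 0; z < zsize; z++)
        for (size_t y = 0; y < ysize; y++)
            for (size_t x = 0; x < xsize; x++)
                CELL(x, y, z) = 0.0f;

    free(cells);
    return 0;
}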
The amount of memory you are trying to allocate seems humongous: 648 billion bytes plus the overhead of 2 levels of indirection! Your system probably does not let you allocate that much memory.
You should test the return value of floatalloc3 to detect allocation failure. As a matter of fact, it would be useful to post the source code for this self-defined function to ascertain its correctness.
Note also that the 3 free calls might not be enough to free the allocated blocks, but without the source code to floatalloc3, one can only speculate.
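For what it's worth, a common way such a helper is written allocates exactly three blocks, in which case the three free calls in the question would match. This is only a guess at what floatalloc3 might look like, not its actual source:

#include <stdlib.h>

/* Hypothetical floatalloc3: one block of plane pointers, one block of row
 * pointers, and one contiguous block of data. Returns NULL on failure. */
float ***floatalloc3(size_t n1, size_t n2, size_t n3) {
    float ***p = malloc(n1 * sizeof *p);
    if (!p) return NULL;
    p[0] = malloc(n1 * n2 * sizeof *p[0]);
    if (!p[0]) { free(p); return NULL; }
    p[0][0] = malloc(n1 * n2 * n3 * sizeof *p[0][0]);   /* ~648 GB for the question's sizes */
    if (!p[0][0]) { free(p[0]); free(p); return NULL; }

    for (size_t i = 0; i < n1; i++) {
        p[i] = p[0] + i * n2;
        for (size_t j = 0; j < n2; j++)
            p[i][j] = p[0][0] + (i * n2 + j) * n3;
    }
    return p;
}

With that layout, free(**ww); free(*ww); free(ww); does release everything, and it is the data block (the third malloc) that fails with the 648 GB request.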

Cache Performance (concerning loops) in C

I was wondering, why does one set of loops allow for better cache performance than another in spite of logically doing the same thing?
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++) {
            accum += b[j][k] * a[k][i];
        }
        c[j][i] = accum;
    }
}

for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++) {
            c[j][i] += val * a[k][i];
        }
    }
}
I believe the first one above delivers better cache performance, but why?
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
Just generally speaking, the most efficient loops through a matrix are going to cycle through the last dimension, not the first ("last" being c in m[a][b][c]).
For example, given a 2D matrix like an image which has its pixels represented in memory from top-left to bottom-right, the quickest way to sequentially iterate through it is going to be horizontally across each scanline, like so:
for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x)
        // access pixel[y][x]
}

... not like this:

for (int x = 0; x < w; ++x) {
    for (int y = 0; y < h; ++y)
        // access pixel[y][x]
}
... due to spatial locality. It's because the computer grabs memory from slower, bigger regions of the hierarchy and moves it to faster, smaller regions in large, aligned chunks (e.g., 64-byte cache lines, 4-kilobyte pages, and down to a little teeny 64-bit general-purpose register). The first example accesses all the data from such a contiguous chunk immediately and prior to eviction.
harold on this site gave me a nice view on how to look at and explain this subject by suggesting not to focus so much on cache misses, but instead focusing on striving to use all the data in a cache prior to eviction. The second example fails to do that for all but the most trivially-small images by iterating through the image vertically with a large, scanline-sized stride rather than horizontally with a small, pixel-sized one.
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
The answer here would be "yes", as an increase in block size would naturally equate to more compulsory misses (that would be more simply "misses" though rather than "miss rate") but also just more data to process which won't all necessarily fit into the fastest L1 cache. If we're accessing a large amount of data with a large stride, we end up getting a higher non-compulsory miss rate as a result of more data being evicted from the cache before we utilize it, only to then redundantly load it back into a faster cache.
There is also a case where, if the block size is small enough and aligned properly, all the data will just fit into a single cache line and it wouldn't matter so much how we sequentially access it.
Matrix Multiplication
Now your example is quite a bit more complex than this straightforward image example above, but the same concepts tend to apply.
Let's look at the first one:
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        accum = 0.0;
        for (k = 0; k < n; k++)
            accum += b[j][k] * a[k][i];
        c[j][i] = accum;
    }
}
If we look at the innermost k loop, we access b[j][k]. That's a fairly optimal access pattern: "horizontal" if we imagine a row-order memory layout. However, we also access a[k][i]. That's not so optimal, especially for a very large matrix, as it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.
If we look at the middle j loop, that's accessing c[j][i], again in a vertical fashion, which is not so optimal.
Now let's have a glance at the second example:
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        val = b[j][k];
        for (i = 0; i < n; i++)
            c[j][i] += val * a[k][i];
    }
}
If we look at the middle k loop in this case, it starts off accessing b[j][k], which is optimal (horizontal). Furthermore, it explicitly memoizes the value in val, which might improve the odds of the compiler moving that to a register and keeping it there for the following loop (this relates to compiler concepts around aliasing, however, rather than the CPU cache).
In the innermost i loop, we're accessing c[j][i] which is also optimal (horizontal) along with a[k][i] which is also optimal (horizontal).
So this second version is likely to be more efficient in practice. Note that we can't absolutely say that, as aggressive optimizing compilers can do all sorts of magical things like rearranging and unrolling loops for you. Yet short of that, we should be able to say the second one has higher odds of being more efficient.
"What's a profiler?"
I just noticed this question in the comments. A profiler is a measuring tool that can give you a precise breakdown of where time is spent in your code, along with possibly further statistics like cache misses and branch mispredictions.
It's not only good for optimizing real-world production code and helping you more effectively prioritize your efforts to places that really matter, but it can also accelerate the learning process of understanding why inefficiencies exist through the process of chasing one hotspot after another.
Loop Tiling/Blocking
It's worth mentioning an advanced optimization technique which can be useful for large matrices -- loop tiling/blocking. It's beyond the scope of this subject but that one plays to temporal locality.
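A minimal sketch of the idea, applied to the same j/k/i ordering but over flat row-major arrays (the tile size BS is an arbitrary tuning parameter, n is assumed to be a multiple of BS for brevity, and none of this is from the original question):

#define BS 64   /* tile size; tune so a few BS x BS tiles fit in cache together */

/* Tiled version of c[j][i] += b[j][k] * a[k][i], assuming n % BS == 0. */
void matmul_tiled(int n, const double *a, const double *b, double *c) {
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int ii = 0; ii < n; ii += BS)
                /* Work on one tile at a time so its data is reused while still cached. */
                for (int j = jj; j < jj + BS; j++)
                    for (int k = kk; k < kk + BS; k++) {
                        double val = b[j * n + k];
                        for (int i = ii; i < ii + BS; i++)
                            c[j * n + i] += val * a[k * n + i];
                    }
}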
Deep C Optimization
Hopefully later you will be able to C these things clearly as a deep C explorer. While most optimization is best saved for hindsight with a profiler in hand, it's useful to know the basics of how the memory hierarchy works as you go deeper and deeper exploring the C.

Copying specific data from a source buffer to several target buffers

I have a source buffer which I declared using malloc, and I have used fread to read some data from a big file into the buffer. Now I want to separate out alternate chunks of data (say 2 bytes each) from this source buffer into two target buffers. This problem can be extrapolated to copying every nth chunk to n target buffers. I need help in the form of sample code for the simplest case of two target buffers. This is what I thought of, although I am quite sure it isn't the right thing:
int totsamples = 256*2*2;
int *sbuff = malloc(totsamples);
int *tbuff1 = malloc(totsamples/2);
int *tbuff2 = malloc(totsamples/2);
elements = fread(sbuff, 2, 256*2, fs);
for (i = 0; i < 256; i++)
{
    tbuff1[i] = sbuff[i*2];
    tbuff2[i] = sbuff[(i*2) + 1];
}
Maybe this will give you an idea:
for (i = 0; i < 256; i++)
{
    tbuff1[2*i+0] = sbuff[i*4+0];
    tbuff1[2*i+1] = sbuff[i*4+1];
    tbuff2[2*i+0] = sbuff[i*4+2];
    tbuff2[2*i+1] = sbuff[i*4+3];
}
Note: the above code is wrong with respect to your malloc() parameters, as it is unclear what your totsamples means, so fix that before using it...
Another note: if you want chunks longer than 2 items, it starts to make sense to use memcpy to do the copying.
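A hedged sketch of that memcpy variant, de-interleaving CHUNK-element chunks into two target buffers (the names, the CHUNK value, and the function itself are illustrative, not taken from the question):

#include <string.h>

#define CHUNK 4   /* elements per chunk; illustrative value */

/* Copy alternating CHUNK-element chunks of src into t1 and t2.
 * nchunks is how many chunks end up in EACH target buffer. */
void deinterleave2(const int *src, int *t1, int *t2, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        memcpy(t1 + i * CHUNK, src + (2*i) * CHUNK,     CHUNK * sizeof *src);
        memcpy(t2 + i * CHUNK, src + (2*i + 1) * CHUNK, CHUNK * sizeof *src);
    }
}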
Suggestion: use named constants instead of magic numbers, such as const int SAMPLES = 256;. Also, I'm not sure, but it appears you think the size of int is 2? Don't; use sizeof(int) etc. instead (and the size of int is rarely 2, by the way).
Hmm... Are you actually trying to optimize things by copying bytes using integers, 4 bytes at a time? Don't! "Premature optimization is the root of all evil." You may consider that later, after your code works otherwise; first create a working, non-hacky version, and doubly so if you need to ask how to do even that, like here...

algorithm comparison in C, what's the difference?

#define IMGX 8192
#define IMGY 8192
int red_freq[256];
char img[IMGY][IMGX][3];
main() {
    int i, j;
    long long total;
    long long redness;

    for (i = 0; i < 256; i++)
        red_freq[i] = 0;

    for (i = 0; i < IMGY; i++)
        for (j = 0; j < IMGX; j++)
            red_freq[img[i][j][0]] += 1;

    total = 0;
    for (i = 0; i < 256; i++)
        total += (long long)i * (long long)red_freq[i];

    redness = (total + (IMGX*IMGY/2))/(IMGX*IMGY);
What's the difference when you replace the second for loop with
for (j = 0; j < IMGX; j++)
    for (i = 0; i < IMGY; i++)
        red_freq[img[i][j][0]] += 1;
Everything else stays the same. Why is the first algorithm faster than the second algorithm?
Does it have something to do with the memory allocation?
The first version accesses memory in sequence, so it uses the processor cache optimally.
The second version uses only one value from each cache line it loads, so it is pessimal for cache use.
The point to understand is that the cache is divided into lines, each of which will contain many values in the overall structure.
The first version might also be optimized by the compiler to use more clever instructions (SIMD instructions) which would be even faster.
It is because the first version is iterating through the memory in the order that it is physically laid out, while the second one is jumping around in memory from one column in the array to the next. This will cause cache thrashing and interfere with the optimal performance of the CPU, which then has to spend lots of time waiting for the cache to be refreshed over and over again.
It's because big modern processor architectures (like the one in a PC) are massively optimised to work on memory which is 'near' (in address-related terms) memory which they've recently accessed. Actual physical memory access is much, much slower than the CPU can theoretically run, so everything which helps the process do its access in the most efficient fashion helps with performance.
It's pretty much impossible to generalise more than that, but 'locality of reference' is a good thing to aim for.
Due to how the memory is laid out the first version maintains data locality and therefore causes less cache misses.
Memory allocation happens only once, and it is at the beginning, so it cannot be the reason. The reason is how the runtime calculates the address. In both cases the address of img[i][j][0] is calculated as
(i * (IMGX * 3)) + (j * 3) + 0
In the first algorithm
(i * (IMGX * 3)) gets calculated 8192 times
(j * 3) gets calculated 8192 * 8192 times
In the second algorithm
(i * (IMGX * 3)) gets calculated 8192 * 8192 times
(j * 3) gets calculated 8192 times
Since
(i * (IMGX * 3))
involves two multiplications, doing it more often takes more time. That is the reason.
Yes, it has something to do with memory allocation. The first loop indexes the inner dimension of img, which spans only 3 bytes each time. That easily fits within one memory page (I believe a common size here is 4 kB for one page). But with your second version, the outer dimension's index changes fast. That will cause memory reads spread over a much larger range of memory, namely sizeof(char[IMGX][3]) bytes, which is 24 kB. And with each change of the inner index, those jumps start to happen again. That will hit different pages and is probably somewhat slower. Also, I have heard that the CPU reads ahead in memory. That will make the first version benefit, because at the time it reads, the data is probably already in the cache. I can imagine the second version doesn't benefit from that, because it makes those large jumps around memory back and forth.
I would suspect the difference is not that much, but if the algorithm runs many times, it eventually becomes noticeable. You probably want to read the article on row-major order on Wikipedia; that is the scheme used to store multi-dimensional arrays in C.
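As a quick row-major illustration (a small sketch with scaled-down dimensions, not part of the original answers): the element img[i][j][c] lives at byte offset i*IMGX*3 + j*3 + c from the start of the array, so incrementing j moves 3 bytes while incrementing i moves IMGX*3 bytes.

#include <stdio.h>

#define IMGX 8
#define IMGY 4

char img[IMGY][IMGX][3];

int main(void) {
    int i = 2, j = 5, c = 0;
    /* Row-major: &img[i][j][c] == (char *)img + i*IMGX*3 + j*3 + c */
    printf("%td\n", &img[i][j][c] - &img[0][0][0]);   /* prints 63 */
    printf("%d\n", i * IMGX * 3 + j * 3 + c);         /* also 63 */
    return 0;
}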

Resources