I'm wondering whether compilers (gcc with -O3, more specifically) can/will optimize out nested struct member dereferences (or non-nested ones, for that matter).
For example, is there any point in writing the following code
register int i = 0;
register double multiple = struct1->struct2->element1;
for (i = 0; i < 10000; i++)
result[i] = multiple * -struct1->struct3->element3[i];
instead of
register int i = 0;
for (i = 0; i < 10000; i++)
result[i] = struct1->struct2->element1 * -struct1->struct3->element3[i];
I'm looking for the most optimized code, but I'm not going to go through and hoist struct dereferences out of loops by hand if the compiler will do it for me. If it does, I think my best option is the following
register int i = 0;
register double* R = &result[0];
register double* amount = &struct1->struct3->element3[0];
for (i = 0; i < 10000; i++, R++, amount++)
*R = struct1->struct2->element1 * -*amount;
which eliminates all unnecessary dereferences etc. (I think). Would the two dereferences needed to reach element3 be optimized out?
Any thoughts?
Thanks
This optimization is known as Loop-invariant code motion. Loop invariants (things that never change inside the loop) are moved outside of the loop, to avoid re-calculating the same thing over and over.
GCC supports it; it is enabled by the -fmove-loop-invariants flag:
-fmove-loop-invariants
Enables the loop invariant motion pass in the new loop optimizer. Enabled at level -O1
Today, compilers are almost always smart enough to do the "right thing" no matter how you formulate your code. Focus on writing the simplest, cleanest, easiest to read (for a human!) code you can. Let the compiler take care of the rest by enabling optimizations. -O2 is commonly used.
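To illustrate the effect on the loop from the question, this is roughly what the pass does; a hand-written sketch of the transformation, not actual GCC output, reusing the names from the question:
/* loop invariants hoisted out of the loop, roughly the effect of -fmove-loop-invariants */
double multiple = struct1->struct2->element1;   /* invariant value */
double *amount = struct1->struct3->element3;    /* invariant base address */
for (int i = 0; i < 10000; i++)
    result[i] = multiple * -amount[i];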
There is some abstract loop that iterates through an array and processes every element, with no dependency on its position in the array. Each element's value is checked against conditions many times.
Adding int value = arr[i] will help reading the code, and theoretically, will take sizeof(int) memory away.
Will it boost performance? Or will the compiler optimize it?
I know premature optimization is very, very bad; this is a theoretical question.
int arr[10];
...
for (int i = 0; i < 10; i++)
{
if (arr[i] > 2)
{
if (arr[i] < 100)
{
do_smth1(arr[i]);
}
else if (arr[i] > 500 && arr[i] < 1000)
{
do_smth2(arr[i]);
}
else
{
do_smth3(arr[i]);
}
}
else
{
do_smth4(arr[i]);
}
}
Will it boost performance?
No.
Or will the compiler optimize it?
Yes. If not, get a better compiler.
... and theoretically, will take sizeof(int) memory away.
Not necessarily.
With such candidate linear optimizations, coding for clarity is overwhelmingly the best option.
Use your valuable coding time to improve bigger picture items and let the compiler handle the small stuff.
See also Is premature optimization really the root of all evil?.
Warning: Compilers today recognize many common idioms and emit efficient code. Should you code in an uncommon fashion, in order to "improve" things, the compiler may miss optimizations and then your "improved" code is worse than clear common code.
Instead of trying to outthink the compiler, use it to leverage your productivity.
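For the loop in the question, the readable form with a named local is perfectly reasonable; with optimizations enabled the compiler will typically keep the value in a register either way. A sketch, reusing the question's do_smthN placeholders:
for (int i = 0; i < 10; i++)
{
    const int value = arr[i];   /* purely for readability */
    if (value > 2)
    {
        if (value < 100)
            do_smth1(value);
        else if (value > 500 && value < 1000)
            do_smth2(value);
        else
            do_smth3(value);
    }
    else
        do_smth4(value);
}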
I ran into a problem while playing around with the auto-parallelization feature of the Oracle Solaris compiler. Let's say I have the following code:
int var = -1;
int i;
for (i = 0; i < 3; i++){
bool flag = false;
// do operations to set the flag
if (flag == true)
var = i;
}
// do other operations with var
when I compile this code, the compiler complains that the loop cannot be parallelized because of unsafe dependences.
Does anyone know what might be wrong here? Is there any way to avoid this but maintain the original functionality of the code?
Any help would be appreciated, thank you!
What the compiler sees is a bunch of loop iterations, all of which can potentially assign to var. If writing to var happens to be atomic, then var will end up holding i from one of those iterations, and one might argue that is OK. Under the assumption that writing to a variable is not atomic, most compilers see this as a data race (what exactly ends up in var?). Hence the complaint.
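One way around it is to remove the loop-carried write entirely, for example by expressing var as a max over the matching indices; since the loop counts upward, the largest matching index is also the last one. A sketch (compute_flag() is a hypothetical stand-in for the flag-setting operations, and the reduction form shown needs OpenMP 3.1 or later if you go that route):
extern int compute_flag(int i);   /* hypothetical: "do operations to set the flag" */
int var = -1;
int i;
#pragma omp parallel for reduction(max : var)
for (i = 0; i < 3; i++) {
    if (compute_flag(i) && i > var)
        var = i;   /* each iteration only proposes its own index */
}
/* do other operations with var */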
Is this type of expression in C valid (on all compilers)?
If it is, is this good C?
char cloneString (char clone[], char string[], int Length)
{
if(!Length)
Length=64
;
int i = 0
;
while((clone[i++] = string[i]) != '\0', --Length)
;
clone[i] = '\0';
printf("cloneString = %s\n", clone);
};
Would this be better, worse, indifferent?
char *cloneString (char clone[], char string[], int Length)
{
if(!Length)
Length=STRING_LENGTH
;
char *r = clone
;
while
( //(clone[i++] = string[i]) != '\0'
*clone++ = *string++
, --Length
);
*clone = '\0';
return clone = r
;
printf("cloneString = %s\n", clone);
};
Stackoverflow wants me to add more text to this question!
Okay! I'm concerned about
a.) expressions such as c==(a=b)
b.) performance between indexing vs pointer
Any comments?
Thanks so much.
Yes, it's syntactically valid on all compilers (though semantically valid on none, as we'll see), and no, it isn't considered good C. Most developers will agree that the comma operator is a bad thing here, and that a single line of code should do only one specific thing. The while loop does four things at once and has undefined behavior:
it increments i;
it assigns string[i] to clone[i++]; (undefined behavior: i is both read and modified with no sequence point in between)
it checks that string[i] isn't 0 (but discards the result of the comparison);
it decrements Length, and terminates the loop if Length == 0 after being decremented.
Not to mention that assuming that Length is 64 if it wasn't provided is a terrible idea and leaves plenty of room for more undefined behavior that can easily be exploited to crash or hack the program.
I see that you wrote it yourself and that you're concerned about performance, and this is apparently the reason you're sticking everything together. Don't. Code made short by squeezing statements together isn't any faster than code that is longer because its statements haven't been squeezed together. It still does the same number of things. In your case, you're introducing bugs by squeezing things together.
The code has Undefined Behavior:
The expression
(clone[i++] = string[i])
both modifies and accesses the object i from two different subexpressions in an unsequenced way, which is not allowed. A compiler might use the old value of i in string[i], or might use the new value of i, or might do something entirely different and unexpected.
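If you want to keep the copy loop compact, one sketch that avoids the unsequenced access is to increment i in its own statement:
/* i is read in the condition but incremented only in a separate statement,
   so there is no unsequenced modification and read of i */
while (Length-- && (clone[i] = string[i]) != '\0')
    i++;
clone[i] = '\0';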
Simple answer: no.
Why does the function return char when it has no return statement?
Why 64?
I assume that the two arrays are of length Length - Add documentation to say this.
Why the ; on a new line and not after the statement?
...
Ok, so I decided to evolve my comments into an actual answer. Although this doesn't address the specific piece of code in your question, it answers the underlying issue, and I think you will find it illuminating, since you can apply this guide to your programming in general.
What I advocate, especially if you are just learning to program, is to focus on readability instead of small gimmicks that you think, or were told, improve speed or performance.
Let’s take a simple example. The idiomatic way to iterate through a vector in C (not in C++) is using indexing:
int i;
for (i = 0; i < size; ++i) {
v[i] = /* code */;
}
I was told when I started programming that v[i] is actually computed as *(v + i) so in generated assembler this is broken down (please note that this discussion is simplified):
multiply i with sizeof(int)
add that result to the address of v
access the element at this computed address
So basically you have 3 operations.
Let’s compare this with accessing via pointers:
int *p;
for (p = v; p != v + size; ++p) {
*p = /*..*/;
}
This has the advantage that *p actually expands to just one instruction:
access the element at the address p.
Two extra instructions don't seem like much, but if your program spends most of its time in this loop (either an extremely large size or many calls to the function containing the loop), you might conclude that the second version makes your program almost 3 times faster. That is a lot. So if you are like me when I started, you will choose the second variant. Don't!
So the first version has readability (you explicitly describe that you access the i-th element of vector v), while the second uses a gimmick to the detriment of readability (you say that you access a memory location). Now this might not be the best example of unreadable code, but the principle is valid.
So why do I tell you to use the first version: until you have a firm grasp on concepts like caches, branching, induction variables (and a lot more) and how they apply to real-world compilers and program performance, you should steer clear of these gimmicks and rely on the compiler to do the optimizations. Compilers are very smart and will generate the same code for both variants (with optimization enabled, of course). So the second variant actually differs only in readability and is identical, performance-wise, to the first.
Another example:
const char * str = "Some string";
int i;
// variant 1:
for (i = 0; i < strlen(str); ++i) {
// code
}
// variant 2:
int l = strlen(str);
for (i = 0; i < l; ++i) {
// code
}
The natural way would be to write the first variant. You might think that the second improves performance because the first calls the function strlen on each iteration of the loop. And you know that getting the length of a string means iterating through the whole string until you reach the end. So basically a call to strlen means adding an inner loop. Ouch, that has to slow the program down. Not necessarily: the compiler can optimize the call out because it always produces the same result. Actually, you can even do harm, as you introduce a new variable which will have to be assigned a register from a very limited register pool (a slightly extreme example, but nevertheless a point is to be made here).
Don’t spend your energy on things like this until much later.
Let me show you something else that will further illustrate that any assumptions you make about performance will most likely be false and misleading (I am not trying to tell you that you are a bad programmer, far from it, just that as you learn, you should invest your energy in something other than performance):
Let’s multiply two matrices:
for (k = 0; k < n; ++k) {
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
r[i][j] += a[i][k] * b[k][j];
}
}
}
versus
for (k = 0; k < n; ++k) {
for (j = 0; j < n; ++j) {
for (i = 0; i < n; ++i) {
r[i][j] += a[i][k] * b[k][j];
}
}
}
The only difference between the two is the order in which the operations get executed. They are the exact same operations (number, kind and operands), just in a different order. The result is equivalent (addition is commutative), so on paper they should take the EXACT same amount of time to execute. In practice, even with optimizations enabled (some very smart compilers can, however, reorder the loops), the second example can be up to 2-3 times slower than the first, because its innermost loop walks r and a column-wise and so misses the cache far more often. And even the first variant is still a long, long way from being optimal (in regards to speed).
So basic point: worry about UB as the other answers show you, don’t worry about performance at this stage.
The second block of code is better.
The line
printf("cloneString = %s\n", clone);
will never be executed, since there is a return statement before it.
To make your code a bit more readable, change
while
(
*clone++ = *string++
, --Length
);
to
while ( Length > 0 )
{
*clone++ = *string++;
--Length;
}
This is probably a better approach to your problem:
#include <stdio.h>
#include <string.h>
void cloneString(char *clone, char *string)
{
for (int i = 0; i != strlen(string); i++)
clone[i] = string[i];
clone[strlen(string)] = '\0';   /* terminate the copy before printing it */
printf("Clone string: %s\n", clone);
}
That being said, there's already a standard function to do that:
char *strncpy(char *dest, const char *source, size_t n)
dest is the destination string, and source is the string that must be copied. This function will copy a maximum of n characters.
So, your code will be:
#include <stdio.h>
#include <string.h>
void cloneString(char *clone, char *string)
{
strncpy(clone, string, strlen(string) + 1);   /* +1 so the terminating '\0' is copied too */
printf("Clone string: %s\n", clone);
}
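Note that strncpy does not append a terminating '\0' when the source is n characters or longer, which is why the call above copies strlen(string) + 1 bytes. When the length is already known, a plain memcpy does the same job; a sketch, assuming clone points to a buffer large enough for the string and its terminator:
#include <string.h>

void cloneString(char *clone, const char *string)
{
    /* copy the characters plus the terminating '\0' in one call */
    memcpy(clone, string, strlen(string) + 1);
}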
I have been working on making my code auto-vectorisable by GCC; however, when I include the -fopenmp flag it seems to stop all attempts at auto-vectorisation. I am using -ftree-vectorize -ftree-vectorizer-verbose=5 to vectorise and monitor it.
If I do not include the flag, it gives me a lot of information about each loop, whether it is vectorised, and why not. The build stops when I try to use the omp_get_wtime() function, since it can't be linked without OpenMP. Once the flag is included, it simply lists every function and tells me it vectorised 0 loops in it.
I've read a few other places where the issue has been mentioned, but they don't really come to any solutions: http://software.intel.com/en-us/forums/topic/295858 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032. Does OpenMP have its own way of handling vectorisation? Do I need to explicitly tell it to?
There is a shortcoming in the GCC vectoriser which appears to have been resolved in recent GCC versions. In my test case GCC 4.7.2 vectorises successfully the following simple loop:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
a[i] = b[i] + c[i] * d;
At the same time, GCC 4.6.1 does not, and it complains that the loop contains function calls or data references that cannot be analysed. The bug in the vectoriser is triggered by the way parallel for loops are implemented by GCC. When the OpenMP constructs are processed and expanded, the simple loop code is transformed into something akin to this:
struct omp_fn_0_s
{
int N;
double *a;
double *b;
double *c;
double d;
};
void omp_fn_0(struct omp_fn_0_s *data)
{
int start, end;
int nthreads = omp_get_num_threads();
int threadid = omp_get_thread_num();
// This is just to illustrate the case - GCC uses a bit different formulas
start = (data->N * threadid) / nthreads;
end = (data->N * (threadid+1)) / nthreads;
for (int i = start; i < end; i++)
data->a[i] = data->b[i] + data->c[i] * data->d;
}
...
struct omp_fn_0_s omp_data_o;
omp_data_o.N = N;
omp_data_o.a = a;
omp_data_o.b = b;
omp_data_o.c = c;
omp_data_o.d = d;
GOMP_parallel_start(omp_fn_0, &omp_data_o, 0);
omp_fn_0(&omp_data_o);
GOMP_parallel_end();
N = omp_data_o.N;
a = omp_data_o.a;
b = omp_data_o.b;
c = omp_data_o.c;
d = omp_data_o.d;
The vectoriser in GCC before 4.7 fails to vectorise that loop. This is NOT an OpenMP-specific problem. One can easily reproduce it with no OpenMP code at all. To confirm this I wrote the following simple test:
struct fun_s
{
double *restrict a;
double *restrict b;
double *restrict c;
double d;
int n;
};
void fun1(double *restrict a,
double *restrict b,
double *restrict c,
double d,
int n)
{
int i;
for (i = 0; i < n; i++)
a[i] = b[i] + c[i] * d;
}
void fun2(struct fun_s *par)
{
int i;
for (i = 0; i < par->n; i++)
par->a[i] = par->b[i] + par->c[i] * par->d;
}
One would expect that both codes (notice - no OpenMP here!) should vectorise equally well because of the restrict keywords used to specify that no aliasing can happen. Unfortunately this is not the case with GCC < 4.7 - it successfully vectorises the loop in fun1 but fails to vectorise that in fun2 citing the same reason as when it compiles the OpenMP code.
The reason for this is that the vectoriser is unable to prove that par->d does not lie within the memory that par->a, par->b, and par->c point to. With fun1 the situation is different, since two cases are possible:
d is passed as a value argument in a register;
d is passed as a value argument on the stack.
On x64 systems the System V ABI mandates that the first several floating-point arguments are passed in XMM registers (YMM on AVX-enabled CPUs). That's how d gets passed in this case, and hence no pointer can ever point to it, so the loop gets vectorised. On x86 systems the ABI mandates that arguments are passed on the stack, hence d might be aliased by any of the three pointers. Indeed, GCC refuses to vectorise the loop in fun1 if instructed to generate 32-bit x86 code with the -m32 option.
GCC 4.7 gets around this by inserting run-time checks which ensure that neither d nor par->d get aliased.
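Another way to give the compiler the same guarantee without run-time checks is to copy the scalar into a local variable whose address is never taken, so stores through the pointers provably cannot modify it inside the loop. A sketch (whether a particular GCC 4.6 build actually takes advantage of this is not guaranteed):
void fun2_local(struct fun_s *par)
{
    const double d = par->d;   /* local copy: cannot be aliased by the stores */
    const int n = par->n;
    int i;
    for (i = 0; i < n; i++)
        par->a[i] = par->b[i] + par->c[i] * d;
}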
Getting rid of d removes the unprovable non-aliasing and the following OpenMP code gets vectorised by GCC 4.6.1:
#pragma omp parallel for schedule(static)
for (int i = 0; i < N; i++)
a[i] = b[i] + c[i];
I'll try to briefly answer your question.
Does OpenMP have its own way of handling vectorisation?
Yes... but starting from the upcoming OpenMP 4.0. The link posted above provides good insight into this construct. The current OpenMP 3.1, on the other hand, is not "aware" of the SIMD concept. What happens in practice (or, at least, in my experience) is that auto-vectorization mechanisms are inhibited whenever an OpenMP worksharing construct is used on a loop. Anyhow, the two concepts are orthogonal and you can still benefit from both (see this other answer).
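For reference, the OpenMP 4.0 construct referred to above looks roughly like this, applied to the loop from the answer above; a sketch only, since it requires a compiler with OpenMP 4.0 support:
/* ask the compiler explicitly to distribute iterations across threads
   and to vectorise each thread's chunk */
#pragma omp parallel for simd schedule(static)
for (int i = 0; i < N; i++)
    a[i] = b[i] + c[i] * d;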
Do I need to explicitly tell it to?
I am afraid so, at least at present. I would start by rewriting the loops under consideration in a way that makes vectorization explicit (i.e. using intrinsics on Intel platforms, AltiVec on IBM, and so on).
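As an illustration of what making vectorization explicit means, here is a sketch of the same a[i] = b[i] + c[i] * d loop written with SSE2 intrinsics; it assumes an x86 target, and the function name is only for the example:
#include <emmintrin.h>   /* SSE2 */

void axpy_sse2(double *a, const double *b, const double *c, double d, int n)
{
    __m128d vd = _mm_set1_pd(d);          /* broadcast d into both lanes */
    int i = 0;
    for (; i + 1 < n; i += 2) {           /* two doubles per iteration */
        __m128d vb = _mm_loadu_pd(&b[i]);
        __m128d vc = _mm_loadu_pd(&c[i]);
        _mm_storeu_pd(&a[i], _mm_add_pd(vb, _mm_mul_pd(vc, vd)));
    }
    for (; i < n; i++)                    /* scalar remainder */
        a[i] = b[i] + c[i] * d;
}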
You are asking "why GCC can't do vectorization when OpenMP is enabled?".
It seems that this may be a bug of GCC :)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46032
Otherwise, the OpenMP API may introduce dependences (either control or data) that prevent automatic vectorization. To auto-vectorize, a given piece of code must be free of data and control dependences. It's possible that using OpenMP introduces some spurious dependences.
Note: OpenMP (prior to 4.0) is about thread-level parallelism, which is orthogonal to SIMD/vectorization. A program can use both OpenMP and SIMD parallelism at the same time.
I ran across this post while searching for comments about the gcc 4.9 option openmp-simd, which should activate OpenMP 4 #pragma omp simd without activating omp parallel (threading). gcc bugzilla pr60117 (confirmed) shows a case where the pragma omp prevents auto-vectorization which occurred without the pragma.
gcc doesn't vectorize omp parallel for loops even with the simd clause (parallel regions can auto-vectorize only an inner loop nested under a parallel for). I don't know of any compiler other than icc 14.0.2 that could be recommended for implementing #pragma omp parallel for simd; with other compilers, SSE intrinsics coding would be required to get this effect.
The Microsoft compiler doesn't perform any auto-vectorization inside parallel regions in my tests, which show clear superiority of gcc for such cases.
Combined parallelization and vectorization of a single loop has several difficulties, even with the best implementation. I seldom see more than 2x or 3x speedup by adding vectorization to a parallel loop. Vectorization with AVX double data type, for example, effectively cuts the chunk size by a factor of 4. Typical implementation can achieve aligned data chunks only for the case where the entire array is aligned, and the chunks also are exact multiples of the vector width. When the chunks are not all aligned, there is inherent work imbalance due to the varying alignments.
gprof is not working properly on my system (MinGW) so I'd like to know which one of the following snippets is more efficient, on average.
I'm aware that internally C compilers convert everything into pointers arithmetic, but nevertheless I'd like to know if any of the following snippets has any significant advantage over the others.
The array has been allocated dynamically in contiguous memory as a 1D array and may be re-allocated at run time (it's for a simple board game, in which the player is allowed to re-define the board's size as often as he wants).
Please note that i & j must get calculated and passed into the function set_cell() in every loop iteration (gridType is a simple struct with a few ints and a pointer to another cell struct).
Thanks in advance!
Allocate memory
grid = calloc( (nrows * ncols), sizeof(gridType) );
Snippet #1 (parse sequentially as 1D)
gridType *gp = grid;
register int i=0 ,j=0; // we need to pass those in set_cell()
if ( !grid )
return;
for (gp=grid; gp < grid+(nrows*ncols); gp++)
{
set_cell( gp, i, j, !G_OPENED, !G_FOUND, value, NULL );
if (j == ncols-1) { // last col of current row has been reached
j=0;
i++;
}
else // last col of current row has NOT been reached
j++;
}
Snippet #2 (parse as 2D array, using pointers only)
gridType *gp1, *gp2;
if ( !grid )
return;
for (gp1=grid; gp1 < grid+nrows; gp1+=ncols)
for (gp2=gp1; gp2 < gp1+ncols; gp2++)
set_cell( gp2, (gp1-grid), (gp2-gp1), !G_OPENED, !G_FOUND, value, NULL );
Snippet #3 (parse as 2D, using counters only)
register int i,j; // we need to pass those in set_cell()
for (i=0; i<nrows; i++)
for (j=0; j<ncols; j++)
set_cell( &grid[i * ncols + j], i, j, !G_OPENED, !G_FOUND, value, NULL);
Free memory
free( grid );
EDIT:
I fixed #2 from gp1++) to gp1+=ncols) in the 1st loop, after Paul's correction (thx!)
For anything like this, the answer is going to depend on the compiler and the machine you're running it on. You could try each of your code snippets and measure how long each one takes.
However, this is a prime example of premature optimization. The best thing to do is to pick the snippet which looks the clearest and most maintainable. You'll get much more benefit from doing that in the long run than from any savings you'd make from choosing the one that's fastest on your machine (which might not be fastest on someone else's anyway!)
Well, snippet 2 doesn't exactly work. You need different incrementing behavior; the outer loop should read for (gp1 = grid; gp1 < grid + (nrows * ncols); gp1 += ncols).
Of the other two, any compiler that's paying attention will almost certainly convert snippet 3 into something equivalent to snippet 1. But really, there's no way to know without profiling them.
Also, remember the words of Knuth: "Premature optimization is the ROOT OF ALL EVIL. I have seen more damage done in the name of 'optimization' than for all other causes combined, including sheer, wrongheaded stupidity." People who write compilers are smarter than you (unless you're secretly Knuth or Hofstadter), so let the compiler do its job and you can get on with yours. Trying to write "clever" optimized code will usually just confuse the compiler, preventing it from writing even better, more optimized code.
This is the way I'd write it. IMHO it's shorter, clearer and simpler than any of your ways.
int i, j;
gridType *gp = grid;
for (i = 0; i < nrows; i++)
for (j = 0; j < ncols; j++)
set_cell( gp++, i, j, !G_OPENED, !G_FOUND, value, NULL );
gprof not working isn't a real excuse. You can still set up a benchmark and measure execution time.
You might not be able to measure any difference on modern CPUs until nrows*ncols is getting very large or the reallocation happens very often, so you might optimize the wrong part of your code.
This certainly is micro-optimization, as most of the runtime will most probably be spent in set_cell, and everything else could be optimized to the same or very similar code by the compiler.
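Setting up such a benchmark can be as simple as the sketch below, which assumes a POSIX clock_gettime is available (plain MinGW may need a different timer); fill_grid_snippet1 is a hypothetical wrapper around whichever snippet is being measured:
#include <stdio.h>
#include <time.h>

/* time one snippet with a monotonic-clock stopwatch; repeat enough times
   that the measured interval is well above the clock resolution */
double time_snippet(void (*snippet)(void), int repeats)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < repeats; r++)
        snippet();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* usage (hypothetical wrapper around snippet #1):
   printf("snippet 1: %.3f s\n", time_snippet(fill_grid_snippet1, 1000)); */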
You don't know until you measure it.
Any decent compiler may produce the same code for all of them; even if it doesn't, the effects of caching, pipelining, branch prediction and other clever stuff mean that simply guessing the number of instructions isn't enough.