int a[5] = {0};
VS
typedef struct
{
int a[5];
} ArrStruct;
ArrStruct arrStruct;
sizeA = sizeof(arrStruct.a)/sizeof(int);
for (it = 0 ; it < sizeA ; ++it)
arrStruct.a[it] = 0;
Does initializing with a for loop take more execution time? If so, why?
It depends upon the compiler and the optimization flags.
On recent GCC (e.g. 4.8 or 4.9) with gcc -O3 (or probably even -O1 or -O2) it should not matter, since the same code would be emitted (GCC even has an optimization that transforms your loop into a __builtin_memset call, which is then optimized further).
On some compilers, int a[5] = {0}; might be faster, because the compiler might emit e.g. vector instructions (or on x86 a rep stosw) to clear the array.
The best thing is to examine the generated (gimple representation and) assembler code (e.g. with gcc -fdump-tree-gimple -O3 -fverbose-asm -mtune=native -S) and to benchmark. In most cases it does not matter. Be sure to enable optimizations when compiling.
Generally, don't worry about such micro-optimizations; a good optimizing compiler does a better job than you have time to code by hand.
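To compare the variants yourself, a minimal sketch along these lines could be compiled with gcc -O2 -S and the assembly for each function inspected. The function names and the helper all_zero are made up for illustration; they are not from the question.

```c
#include <stddef.h>
#include <string.h>

enum { N = 5 };

typedef struct {
    int a[N];
} ArrStruct;

/* Variant 1: zero the array with an explicit loop, as in the question. */
void zero_with_loop(ArrStruct *s)
{
    size_t n = sizeof s->a / sizeof s->a[0];
    for (size_t i = 0; i < n; ++i)
        s->a[i] = 0;
}

/* Variant 2: zero the array with memset, which an optimizer may also
   produce on its own from the loop above. */
void zero_with_memset(ArrStruct *s)
{
    memset(s->a, 0, sizeof s->a);
}

/* Variant 3: let an initializer do the work. */
ArrStruct zero_with_initializer(void)
{
    ArrStruct s = { {0} };
    return s;
}

/* Hypothetical helper: returns 1 if every element is zero, else 0. */
int all_zero(const ArrStruct *s)
{
    for (size_t i = 0; i < N; ++i)
        if (s->a[i] != 0)
            return 0;
    return 1;
}
```

All three variants leave the array in the same state; whether they also compile to the same instructions depends on the compiler and flags, which is why looking at the generated assembly is the reliable check.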
It depends on the scope of the variables. For a static or global variable, the first initialization
int a[5]={0};
may be done at compile time, while the loop is run at, well, run time. Thus there is no "execution" associated with the former.
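A small sketch of that distinction, with invented names: the static array is zeroed as part of program startup, while the local array is zeroed by code that runs on every call (unless the optimizer removes it).

```c
#include <stddef.h>

/* Static storage duration: the array is zero before main starts,
   typically without any store instructions executed in the program. */
static int static_arr[5] = {0};

/* Sums the static array; no initialization code runs here. */
int sum_static(void)
{
    int total = 0;
    for (size_t i = 0; i < 5; ++i)
        total += static_arr[i];
    return total;
}

/* Automatic storage duration: the loop below is run-time work that is
   repeated on each call, at least in an unoptimized build. */
int sum_local(void)
{
    int local_arr[5];
    int total = 0;
    for (size_t i = 0; i < 5; ++i)
        local_arr[i] = 0;
    for (size_t i = 0; i < 5; ++i)
        total += local_arr[i];
    return total;
}
```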
You may find the discussion of this question (and in particular this answer) interesting.
I wanted to create a formal comparison between C and Julia performance. For this purpose I wanted to compare different sorting algorithms, starting with the bubble. In Julia I wrote it like:
using BenchmarkTools
function bubble_sort(v::AbstractArray{T}) where T<:Real
for _ in 1:length(v)-1
for i in 1:length(v)-1
if v[i] > v[i+1]
v[i], v[i+1] = v[i+1], v[i]
end
end
end
return v
end
v = rand(Int32, 100_000)
@timed bubble_sort(v)
In the case of the C code (I don't know how to program in C, so I apologize for the code):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
static void swap(int *xp, int *yp){
int temp = *xp;
*xp = *yp;
*yp = temp;
}
void bubble_sort(int arr[], int n){
int i, j;
for (j = 0; j < n - 1; j++){
for (i = 0; i < n - 1; i++){
if (arr[i] > arr[i+1]){
swap(&arr[i], &arr[i+1]);
}
}
}
}
int main(){
int arr_sz = 100000;
int arr[arr_sz], i;
for (i = 0; i < arr_sz; i++){
arr[i] = rand();
}
double cpu_time_used;
clock_t begin = clock();
bubble_sort(arr, arr_sz);
clock_t end = clock();
cpu_time_used = ((double) (end - begin)) / CLOCKS_PER_SEC;
printf("time %f\n", cpu_time_used);
return 0;
}
The performance difference is (on my computer):
Julia    C
20s      ~50s
I suppose that I have a big mistake in the C code, but I am not able to find it out, or is just Julia faster in loops?
Update: performance optimization
Changed the type to int32 in Julia so it is the same as C
swap method as static (+1s improvement on average)
compiler optimization (detailed below)
Instead of gcc main.c, I've used different optimization flags, as well as the clang compiler. Results:
Build                    Time (s)
Julia                    19.13
gcc -O0 main.c           47.58
gcc -O1 main.c           15.98
gcc -O2 main.c           19.52
gcc -O3 main.c           19.20
gcc -Os main.c           17.72
clang -O0 main.c         51.59
clang -O1 main.c         16.78
clang -O2 main.c         13.53
clang -O3 main.c         13.57
clang -Ofast main.c      12.39
clang -Os main.c         18.85
clang -Oz main.c         15.64
clang -Og main.c         16.37
It seems like this question may have been reopened after you discovered that your initial measurements were taken on code compiled for debugging, rather than fully optimised, with different sets of data, different compiler platforms and different integer representations.
I suppose that I have a big mistake in the C code, but I am not able to find it out, or is just Julia faster in loops?
I can answer this somewhat cringy (in my opinion) double-barreled question with a quote from the C standard: "The semantic descriptions in this International Standard describe the behavior of an abstract machine in which issues of optimization are irrelevant." In short, there's no speed in C; that's an attribute that occurs in implementations of C. We can't reproduce your speed without having your implementation (including your hardware, for example).
It's very possible that Julia has similar clauses in her spec. The gist is: some nifty optimisations may determine that your sorted arrays don't have any necessary side-effects and so those may theoretically be optimised away. I'd expect both programs to output somewhere near 0.0 in that case. This is your perfectly optimal compiler; one that spots code that has no actual impact upon the logic of the program, and optimises away the dead code.
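One common way to keep an optimizer from treating the sort as dead code is to make the result observable after the timed region. The sketch below is only an illustration of that idea; time_sort and is_sorted are hypothetical helpers, not part of the question's program.

```c
#include <stddef.h>
#include <time.h>

/* Same algorithm as in the question: n - 1 full passes over the array. */
void bubble_sort(int *arr, size_t n)
{
    if (n < 2)
        return;
    for (size_t pass = 0; pass < n - 1; ++pass) {
        for (size_t i = 0; i + 1 < n; ++i) {
            if (arr[i] > arr[i + 1]) {
                int tmp = arr[i];
                arr[i] = arr[i + 1];
                arr[i + 1] = tmp;
            }
        }
    }
}

/* Returns 1 if arr is in non-decreasing order, else 0. */
int is_sorted(const int *arr, size_t n)
{
    for (size_t i = 0; i + 1 < n; ++i)
        if (arr[i] > arr[i + 1])
            return 0;
    return 1;
}

/* Times one sort and returns the CPU seconds used. The caller should
   then use the sorted data (check it, print part of it) so that the
   sort has a visible effect and cannot simply be discarded. */
double time_sort(int *arr, size_t n)
{
    clock_t begin = clock();
    bubble_sort(arr, n);
    clock_t end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC;
}
```

In a benchmark, calling is_sorted on the array after time_sort and printing the answer is usually enough to keep the measured work in the program.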
We haven't always had loop-invariant code motion, and so it stands to reason that there may be a fifth element here: your compiler's version. You'll probably get different statistics if the underlying LLVM is different, for example:
LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old.
-- source
Perhaps one day you'll update this question with output that reads 0.0s for both programs. Then this question has truly lost its point.
It's hard to tell what further is being asked here, @cbk. The comments section managed to reduce the runtime for the C program significantly with those four improvements. The question kinda doesn't even make sense here anymore, because it largely cancels itself out by answering itself at the end.
Perhaps this is just one of those cases where a newcomer ought to have answered their own question with an answer (you can do that), rather than rotting the question with edits that answer it... Nonetheless, it's a question that now shows up unanswered in the list of questions. I'd vote to close for the reason "The question should include more details", but I suspect then you might include an example of compilation halted after assembly generation, when OP seems to have glossed over the solution, the details we need are more along the lines of "What didn't you understand about these comments that answer your original question?" and yet the question has varied so much in apparent meaning... Are we gonna have a close/reopen war?
I am trying to understand pure functions, and have been reading through the Wikipedia article on that topic. I wrote the minimal sample program as follows:
#include <stdio.h>
static int a = 1;
static __attribute__((pure)) int pure_function(int x, int y)
{
return x + y;
}
static __attribute__((pure)) int impure_function(int x, int y)
{
a++;
return x + y;
}
int main(void)
{
printf("pure_function(0, 0) = %d\n", pure_function(0, 0));
printf("impure_function(0, 0) = %d\n", impure_function(0, 0));
return 0;
}
I compiled this program with gcc -O2 -Wall -Wextra, expecting that an error, or at least a warning, should have been issued for decorating impure_function() with __attribute__((pure)). However, I received no warnings or errors, and the program also ran without issues.
Isn't marking impure_function() with __attribute__((pure)) incorrect? If so, why does it compile without any errors or warnings, even with the -Wextra and -Wall flags?
Thanks in advance!
Doing this is incorrect and you are responsible for using the attribute correctly.
Look at this example:
static __attribute__((pure)) int impure_function(int x, int y)
{
extern int a;
a++;
return x + y;
}
void caller()
{
impure_function(1, 1);
}
Code generated by GCC (with -O1) for the function caller is:
caller():
ret
As you can see, the impure_function call was completely removed because the compiler treats it as "pure".
GCC can mark the function as "pure" internally automatically if it sees its definition:
static __attribute__((noinline)) int pure_function(int x, int y)
{
return x + y;
}
void caller()
{
pure_function(1, 1);
}
Generated code:
caller():
ret
So there is no point in using this attribute on functions whose definitions are visible to the compiler. It is supposed to be used when the definition is not available, for example when the function is defined in another DLL. That means that when it is used in its proper place, the compiler won't be able to perform a sanity check anyway. Implementing a warning is thus not very useful (although not meaningless).
I don't think there is anything stopping GCC developers from implementing such a warning, except the time that must be spent.
A pure function is a hint for the optimizing compiler. Probably, gcc doesn't care about pure functions when you pass just -O0 to it (the default optimization level). So if f is pure (and defined outside of your translation unit, e.g. in some outside library), the GCC compiler might optimize y = f(x) + f(x); into something like
{
int tmp = f(x); /// tmp is a fresh variable, not appearing elsewhere
y = tmp + tmp;
}
but if f is not pure (which is the usual case: think of f calling printf or malloc), such an optimization is forbidden.
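As a sketch of a legitimate use, assuming GCC or Clang: the function below only reads its arguments and the memory they point to, so the attribute is accurate, and the compiler is allowed (not required) to evaluate the repeated call once. The PURE macro and the function names are invented for this example.

```c
#include <stddef.h>

/* Assumption: __attribute__((pure)) is a GCC/Clang extension, so it is
   hidden behind a macro that expands to nothing elsewhere. */
#if defined(__GNUC__)
#define PURE __attribute__((pure))
#else
#define PURE
#endif

/* Declaration only, as if the definition lived in another library. */
PURE int sum_array(const int *v, size_t n);

/* With the attribute, the compiler may call sum_array once and reuse
   the result; without it, it must assume each call can have side
   effects. The returned value is the same either way. */
int twice_sum(const int *v, size_t n)
{
    return sum_array(v, n) + sum_array(v, n);
}

/* Reads memory but writes nothing outside its locals: genuinely pure. */
int sum_array(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```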
Standard math functions like sin or sqrt are pure (except for IEEE rounding mode craziness, see http://floating-point-gui.de/ and Fluctuat for more), and they are complex enough to compute to make such optimizations worthwhile.
You might compile your code with gcc -O2 -Wall -fdump-tree-all to guess what is happening inside the compiler. You could add the -fverbose-asm -S flags to get a generated *.s assembler file.
You could also read the Bismon draft report (notably its section §1.4). It might give some intuitions related to your question.
In your particular case, I am guessing that gcc is inlining your calls; and then purity matters less.
If you have time to spend, you might consider writing your own GCC plugin to make such a warning. You'll spend months in writing it! These old slides might still be useful to you, even if the details are obsolete.
At the theoretical level, be aware of Rice's theorem. A consequence of it is that perfect optimization of pure functions is probably impossible.
Be aware of the GCC Resource Center, located in Bombay.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
void sort();
int main() {
int i;
for (i = 0; i < 100000; i++) {
sort();
}
}
void sort() {
int i, j, k, array[100], l = 99, m;
for (i = 0; i < 100; i++) {
array[i] = rand() % 1000 + 1;
}
for (k = 0; k < 99; k++) {
for (j = 0; j < l; j++) {
if (array[j + 1] > array[j]) {
int temp = array[j];
array[j] = array[j + 1];
array[j + 1] = temp;
}
}
l--;
}
for (m = 0; m < 100; m++) {
printf("%d ", array[m]);
}
}
On the Linux shell, I run gcc sort.c -o sort and then time ./sort >> out.
If I compile with gcc -O2 sort.c -o sort, and similarly with -O3 and -O4, the time keeps decreasing. How do the optimization options work? Please explain in terms of real time, user time and system time.
PS: The code might be a little inefficient. Kindly ignore that.
Optimization options work between the reading of the source code and the writing of the binary instructions to the CPU.
GCC is a multi-phase compiler, where the phases roughly consist of:
1) Creating "tokens" from the input text.
2) Arranging those tokens into an abstract syntax tree structure.
3) Pruning the abstract syntax tree.
4) Creating register-based instructions, assuming an infinite number of CPU registers.
5) Mapping those registers onto the actual registers available.
6) Writing the binary information out, in the loader's expected format.
Optimizations can take effect at several points; typically they become active in steps 3 through 5 above. There are many optimizations, including:
Constant folding – Evaluate constant subexpressions in advance.
Strength reduction – Replace slow operations with faster equivalents.
Null sequences – Delete useless operations.
Combine operations – Replace several operations with one equivalent.
Algebraic laws – Use algebraic laws to simplify or reorder instructions.
Special case instructions – Use instructions designed for special operand cases.
Address mode operations – Use address modes to simplify code.
Loop unrolling – Replace a loop with equivalent straight-line instructions.
Partial loop unrolling – Reduce the number of times a loop is evaluated while preserving its overall function.
Note that these are not all the optimizations that might be performed, but it starts to give you an idea.
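Two entries from the list above, constant folding and strength reduction, can be sketched by hand. Each pair below shows what the source might say and what the compiler may effectively compile instead; the function names are made up, and a real compiler does this internally rather than by rewriting your source.

```c
/* Constant folding: the whole expression is known at compile time... */
int seconds_per_day_source(void)
{
    return 60 * 60 * 24;
}

/* ...so the compiler can emit the precomputed value. */
int seconds_per_day_folded(void)
{
    return 86400;
}

/* Strength reduction: a multiplication by a power of two... */
unsigned times_eight_source(unsigned x)
{
    return x * 8u;
}

/* ...can be replaced by a cheaper shift with the same result. */
unsigned times_eight_reduced(unsigned x)
{
    return x << 3;
}
```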
For example, if the compiler sees
int s = 3;
while (s < 6) {
printf("%d\n", s);
s++;
}
and the flags are set to unroll loops, then it might write CPU instructions equivalent to
printf("%d\n", 3);
printf("%d\n", 4);
printf("%d\n", 5);
Those instructions might seem more wordy to us humans, but the CPU may have less work to do, because there is no need to look up the value of s, compare it against 6, add one to it, or store the updated value back into RAM.
GCC groups its optimizations into levels, ranging from conservative to aggressive. -O2 is a good general-purpose compromise. Higher levels such as -O3 enable more aggressive transformations, which can increase code size and compile time and do not always make the program faster.
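When the trip count is not a small constant, a compiler may unroll only partially. A hand-written sketch of what that transformation amounts to (the factor of 4 and the function names are arbitrary choices for illustration):

```c
#include <stddef.h>

/* Straightforward loop: one addition and one loop test per element. */
long sum_simple(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Partially unrolled by 4: one loop test per four elements, followed
   by a clean-up loop for the 0 to 3 elements left over. */
long sum_unrolled4(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        total += (long)v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; ++i)
        total += v[i];
    return total;
}
```

Both functions return the same sum; the unrolled one simply spends fewer instructions on loop bookkeeping, at the cost of more code.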
The -O compiler flag controls the amount of compiler optimization that you wish the compiler to perform. In short, building the project will take longer but the resulting executable should be faster. For more information, type man gcc into the command prompt or gcc -c -Q -O3 --help=optimizers for specific information regarding the optimizations performed for a particular flag.
-O stands for optimize, in which gcc will automatically take the steps necessary to optimize your program. You can read more about the specific steps that GCC takes to optimize your program here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
But essentially, -O2 is more optimized than -O1, and -O3 more than -O2. This might come with drawbacks in regard to compiled binary size, where the resulting binary could use more space, but run faster, and vice versa. You can actually paste your code into https://godbolt.org/, and write in -O1 or any of the optimization options beside the dropdown to choose a compiler, and godbolt will show you what the resulting code looks like. You will be able to see a difference between O1 and O2, namely, the O2 generated code is probably shorter and will use a lot of shortcuts to do your algorithm.
gcc offers a number of optimization flags. You can see what each one does specifically here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
There's always a tradeoff with optimizations, either by increased compile time, increased use of memory, etc...
There are dozens of optimizations enabled by the -O2 flag, so it might not be immediately clear which specific ones affect the sorting. Instead of -O2, you can try each optimization individually, for example using the -falign-loops flag, to see whether that is the one providing the performance increase.
I have a programming problem for which I would like to declare a 256x256 array in C. Unfortunately, each time I try to even declare an array of that size (integers) and run my program, it terminates unexpectedly. Any suggestions? I haven't tried memory allocation since I cannot seem to understand how it works with multi-dimensional arrays (feel free to guide me through it, though I am new to C). Another interesting thing to note is that I can declare a 248x248 array in C without any problems, but no larger.
dims = 256;
int majormatrix[dims][dims];
Compiled with:
gcc -msse2 -O3 -march=pentium4 -malign-double -funroll-loops -pipe -fomit-frame-pointer -W -Wall -o "SkyFall.exe" "SkyFall.c"
I am using SciTE 323 (not sure how to check GCC version).
There are three places where you can allocate an array in C:
In the automatic memory (commonly referred to as "on the stack")
In the dynamic memory (malloc/free), or
In the static memory (static keyword / global space).
Only the automatic memory has somewhat severe constraints on the amount of allocation (that is, in addition to the limits set by the operating system); dynamic and static allocations could potentially grab nearly as much space as is made available to your process by the operating system.
The simplest way to see if this is the case is to move the declaration outside your function. This would move your array to static memory. If crashes continue, they have nothing to do with the size of your array.
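A minimal sketch of the two alternatives to automatic storage, assuming the 256x256 int matrix from the question. The names Row, alloc_matrix and fill_products are invented here. Allocating an array of rows keeps the familiar m[i][j] indexing.

```c
#include <stdlib.h>

enum { DIMS = 256 };

/* One row of the matrix; a pointer to Row can be indexed as m[i][j]. */
typedef int Row[DIMS];

/* Static memory: declared outside any function, so it does not count
   against the stack limit. */
static int static_matrix[DIMS][DIMS];

/* Dynamic memory: one zero-initialized block holding DIMS rows.
   Returns NULL on failure; the caller releases it with free(). */
Row *alloc_matrix(void)
{
    return calloc(DIMS, sizeof(Row));
}

/* Example workload: store i * j in every cell. */
void fill_products(Row *m)
{
    for (int i = 0; i < DIMS; ++i)
        for (int j = 0; j < DIMS; ++j)
            m[i][j] = i * j;
}
```

A caller would write Row *m = alloc_matrix();, check the result against NULL, use m[i][j] as with an ordinary two-dimensional array, and finish with free(m). Passing static_matrix to fill_products works the same way.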
Unless you're running a very old machine/compiler, there's no reason that should be too large. It seems to me the problem is elsewhere. Try the following code and tell me if it works:
#include <stdio.h>
int main()
{
int ints[256][256], i, j;
i = j = 0;
while (i<256) {
while (j<256) {
ints[i][j] = i*j;
j++;
}
i++;
j = 0;
}
printf("Made it :) \n");
return 0;
}
You can't assume that "terminates unexpectedly" is directly caused by "declaring a 256x256 array".
SUGGESTION:
1) Boil your code down to a simple, standalone example
2) Run it in the debugger
3) When it "terminates unexpectedly", use the debugger to get a "stack traceback" - you must identify the specific line that's failing
4) You should also look for a specific error message (if possible)
5) Post your code, the error message and your traceback
6) Be sure to tell us what platform (e.g. Centos Linux 5.5) and compiler (e.g. gcc 4.2.1) you're using, too.
It seems to me that it would work perfectly well to do tail-recursion optimization in both C and C++, yet while debugging I never seem to see a frame stack that indicates this optimization. That is kind of good, because the stack tells me how deep the recursion is. However, the optimization would be kind of nice as well.
Do any C++ compilers do this optimization? Why? Why not?
How do I go about telling the compiler to do it?
For MSVC: /O2 or /Ox
For GCC: -O2 or -O3
How about checking if the compiler has done this in a certain case?
For MSVC, enable PDB output to be able to trace the code, then inspect the code
For GCC..?
I'd still take suggestions for how to determine if a certain function is optimized like this by the compiler (even though I find it reassuring that Konrad tells me to assume it)
It is always possible to check if the compiler does this at all by making an infinite recursion and checking if it results in an infinite loop or a stack overflow (I did this with GCC and found out that -O2 is sufficient), but I want to be able to check a certain function that I know will terminate anyway. I'd love to have an easy way of checking this :)
After some testing, I discovered that destructors ruin the possibility of making this optimization. It can sometimes be worth it to change the scoping of certain variables and temporaries to make sure they go out of scope before the return-statement starts.
If any destructor needs to be run after the tail-call, the tail-call optimization cannot be done.
All current mainstream compilers perform tail call optimisation fairly well (and have done for more than a decade), even for mutually recursive calls such as:
int bar(int, int);
int foo(int n, int acc) {
return (n == 0) ? acc : bar(n - 1, acc + 2);
}
int bar(int n, int acc) {
return (n == 0) ? acc : foo(n - 1, acc + 1);
}
Letting the compiler do the optimisation is straightforward: Just switch on optimisation for speed:
For MSVC, use /O2 or /Ox.
For GCC, Clang and ICC, use -O3
An easy way to check if the compiler did the optimisation is to perform a call that would otherwise result in a stack overflow, or to look at the assembly output.
As an interesting historical note, tail call optimisation for C was added to GCC in the course of a diploma thesis by Mark Probst. The thesis describes some interesting caveats in the implementation. It's worth reading.
As well as the obvious (compilers don't do this sort of optimization unless you ask for it), there is a complexity about tail-call optimization in C++: destructors.
Given something like:
int fn(int j, int i)
{
if (i <= 0) return j;
Funky cls(j,i);
return fn(j, i-1);
}
The compiler can't (in general) tail-call optimize this because it needs to call the destructor of cls after the recursive call returns.
Sometimes the compiler can see that the destructor has no externally visible side effects (so it can be done early), but often it can't.
A particularly common form of this is where Funky is actually a std::vector or similar.
gcc 4.3.2 completely inlines this function (a crappy/trivial atoi() implementation) into main(). Optimization level is -O1. I notice that if I play around with it (even just changing it from static to extern), the tail recursion goes away pretty fast, so I wouldn't depend on it for program correctness.
#include <stdio.h>
static int atoi(const char *str, int n)
{
if (str == 0 || *str == 0)
return n;
return atoi(str+1, n*10 + *str-'0');
}
int main(int argc, char **argv)
{
for (int i = 1; i != argc; ++i)
printf("%s -> %d\n", argv[i], atoi(argv[i], 0));
return 0;
}
Most compilers don't do any kind of optimisation in a debug build.
If using VC, try a release build with PDB info turned on - this will let you trace through the optimised app and you should hopefully see what you want then. Note, however, that debugging and tracing an optimised build will jump you around all over the place, and often you cannot inspect variables directly as they only ever end up in registers or get optimised away entirely. It's an "interesting" experience...
As Greg mentions, compilers won't do it in debug mode. It's ok for debug builds to be slower than a prod build, but they shouldn't crash more often: and if you depend on a tail call optimization, they may do exactly that. Because of this it is often best to rewrite the tail call as a normal loop. :-(
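As a sketch of that rewrite, here is the trivial atoi-style function from the earlier answer next to a loop version. The functions are renamed (parse_digits_rec, parse_digits_loop) for this illustration; the loop uses constant stack space whatever the optimization level.

```c
/* Tail-recursive form: the stack depth grows with the string length
   unless the compiler performs tail-call optimization. */
int parse_digits_rec(const char *str, int n)
{
    if (str == 0 || *str == 0)
        return n;
    return parse_digits_rec(str + 1, n * 10 + (*str - '0'));
}

/* Loop form: the tail call becomes "update the arguments and repeat". */
int parse_digits_loop(const char *str, int n)
{
    if (str == 0)
        return n;
    while (*str != 0) {
        n = n * 10 + (*str - '0');
        ++str;
    }
    return n;
}
```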