I tried compiling the following C code using MSVC into assembly, both with optimizations (CL TestFile.c /Fa /Ot) and without (CL TestFile.c /Fa), and the result is that they produce the same stack depth.
Why does the compiler use 8 bytes for each of the 3 variables x, y, and z when it knows it will use a maximum of 16 bytes? Instead of y$1 = 4 and z$2 = 8, could it not use y$1 = 4 and z$2 = 4, so y and z use the same memory on the stack without any problems?
int main() {
    int x = 123;
    if (x == 123) {
        int y = 321;
    }
    else {
        int z = 234;
    }
}
; Parts of the assembly code
x$ = 0
y$1 = 4
z$2 = 8
main PROC
$LN5:
sub rsp, 24
; And so on...
Nested scopes do not affect stack depth. Per the C standard, nested scopes affect the visibility of identifiers; they do not impose any requirements on how a C implementation uses the stack, if it has one. A C compiler is permitted by the C standard to generate any code that produces the same observable behavior.
For the program shown in the question, the only observable behavior is to exit with a success status, so a good compiler should, when optimizing, generate a minimal program. For example, GCC 10.2 for x86-64 generates just an xor and a ret:
main:
        xor     eax, eax
        ret
So does Clang 11.0.1. If MSVC does not, that is a deficiency in it. (However, it may be that the switches /Os and /Ot do not by themselves request optimization, or do not request much; they may just express a preference for space or time when used in conjunction with other optimization switches.)
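For reference, MSVC requests optimization with /O1 (minimize size) or /O2 (maximize speed), and /O2 implies /Ot; so a compilation that actually optimizes would look like:
CL TestFile.c /Fa /O2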
Further, a good compiler should perform lifetime analysis of the use of objects, constructing a graph in which nodes are places in the code, labeled with creations or uses of values, and directed edges are potential program control flows (or some equivalent representation of the source code). Assembly (or intermediate code) should then be generated to implement the semantics required by the graph. If two pieces of source code have equivalent graphs, the compiler should generate equivalent assembly (or intermediate code) for them (up to some reasonable ability to process complicated graphs), regardless of whether definitions in nested scopes were used.
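As an illustration (a sketch of mine, not from the question), these two functions have equivalent lifetime graphs, so a good optimizing compiler should emit essentially the same code for both:

int nested(int x) {
    if (x == 123) {
        int y = 321;    /* y lives only in this branch */
        return y;
    } else {
        int z = 234;    /* z lives only in this branch */
        return z;
    }
}

int flat(int x) {
    int w;              /* one slot serves both branches */
    if (x == 123)
        w = 321;
    else
        w = 234;
    return w;
}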
Related
I'm writing a real-time DSP processing library.
My intention is to give it the flexibility to define the input sample block size, while also having the best possible performance in the case of sample-by-sample processing, that is, a block size of a single sample.
I think I have to use the volatile keyword when defining the loop variable, since the data processing will be using pointers to inputs/outputs.
This leads me to a question:
Will gcc compiler optimize this code
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
    foo()
}
or
//.h
#define BLOCKSIZE 1
//.c
for (volatile int i=0; i<BLOCKSIZE; i++)
{
    foo()
}
to be the same as simply calling the body of the loop:
foo()
?
Thx
I think I have to use volatile keyword defining loop variable since data processing will be using pointers to Inputs/Outputs.
No, that doesn't make any sense. Only the input/output hardware registers themselves should be volatile. Pointers to them should be declared as pointer-to-volatile data, i.e. volatile uint8_t*. There is no need to make the pointer itself volatile, i.e. uint8_t* volatile // wrong.
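A minimal sketch of what that looks like (the register address and names are hypothetical):

#include <stdint.h>

/* The hardware register is volatile; the pointer to it need not be. */
volatile uint8_t *in_reg = (volatile uint8_t *)0x40001000u; /* hypothetical address */

uint8_t read_sample(void)
{
    return *in_reg; /* every call performs a real hardware read */
}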
As things stand now, you force the compiler to create a variable i and increase it, which will likely block loop unrolling optimizations.
Trying your code on gcc x86 with -O3, this is exactly what happens. No matter the size of BLOCKSIZE, it still generates the loop, because of volatile. If I drop volatile, it completely unrolls the loop up to BLOCKSIZE == 7 and replaces it with that number of function calls. From 8 onward it creates a loop (but keeps the iterator in a register instead of RAM).
x86 example:
for (int i=0; i<5; i++)
{
    foo();
}
gives
        call    foo
        call    foo
        call    foo
        call    foo
        call    foo
But
for (volatile int i=0; i<5; i++)
{
    foo();
}
gives this far less efficient code:
        mov     DWORD PTR [rsp+12], 0
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 4
        jg      .L2
.L3:
        call    foo
        mov     eax, DWORD PTR [rsp+12]
        add     eax, 1
        mov     DWORD PTR [rsp+12], eax
        mov     eax, DWORD PTR [rsp+12]
        cmp     eax, 4
        jle     .L3
.L2:
For further study of the correct use of volatile in embedded systems, please see:
How to access a hardware register from firmware?
Using volatile in embedded C development
Since the loop variable is volatile, the compiler shouldn't optimize the loop away. It cannot know whether i will be 1 when the condition is evaluated, so it has to keep the loop.
From the compiler's point of view, the loop can run an indeterminate number of times until the condition is satisfied.
If you access hardware registers somewhere, then those are what should be declared volatile; that would make more sense to the reader, and it also allows the compiler to apply appropriate optimizations where possible.
The volatile keyword tells the compiler that the variable is prone to side effects, i.e. it can be changed by something that is not visible to the compiler.
Because of that, volatile variables have to be read from their permanent storage location before every use and written back to it after every modification.
In your example the loop cannot be optimized away, as the variable i can be changed during the loop (for example, some interrupt routine could set it back to zero, so the loop would have to be executed again).
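As a sketch of the scenario just described (the interrupt handler name is hypothetical):

volatile int i; /* shared between the main code and an interrupt */

void timer_isr(void) /* hypothetical interrupt routine */
{
    i = 0; /* rewinds the loop behind the compiler's back */
}

Because i may change between any two accesses, the compiler has to reload it for every evaluation of the loop condition.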
The answer to your question is: if the compiler can determine that every time you enter the loop it will execute only once, then it can eliminate the loop.
Normally, the optimization phase unrolls loops based on how the iterations relate to one another. This makes the loop body several times bigger, in exchange for avoiding the backward jumps (which normally produce a bubble in the pipeline, depending on the CPU type), but not so much bigger as to start losing cache hits, so it is a bit complicated, but the gains are huge. And if your loop executes exactly once, always, it is normally because the test you wrote is always true (a tautology) or always false (an impossibility) and can be eliminated; this makes the jump back unnecessary, and so there is no loop anymore.
int blockSize = 1;
for (volatile int i=0; i<blockSize; i++)
{
    foo(); // you missed a semicolon here.
}
In your case, the variable blockSize is assigned a value that is never touched again, so the first thing the compiler will do is replace every use of the variable with the literal you assigned to it (lacking context, I assume blockSize is a local automatic variable that is not changed anywhere else). Your code becomes:
for (volatile int i=0; i<1; i++)
{
    foo();
}
The next step is that the volatile is not necessary: the scope of i is limited to the loop, it is not used in the loop body, and nothing outside the compiler's view can access it, so the loop can be replaced by a sequence of code like the following:
do {
    foo();
} while (0);
Hmmm... and this code can, in turn, be replaced by:
foo();
The compiler analyzes each data set by building a graph of the dependencies between data and variables. When a variable is not needed anymore, assigning a value to it is unnecessary (if it is not used later in the program, or its lifetime ends), so that code is eliminated. If you make your compiler compile a for loop from 1 to 2^64 and optimize it, you will see the loop being thrown away, and you may get the false idea that your processor is capable of counting from 1 to 2^64 in less than a second. That is not true: 2^64 is still far too big a number to count up to in less than a second, and that loop is not a fixed one-pass loop like yours. But the calculations done in the loop are of no use to the rest of the program, so the compiler eliminates them.
Just test the following program (in this case it is not a test of a one-pass loop, but of 2^64 - 1 iterations):
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
    uint64_t low = 0UL;
    uint64_t high = ~0UL;
    uint64_t data = 0; // this data is updated in the loop body.

    printf("counting from %lu to %lu\n", low, high);
    alarm(10); /* security break after 10 seconds */
    for (uint64_t i = low; i < high; i++) {
#if 0
        printf("data = %lu\n", data = i ); // either here...
#else
        data = i; // or here...
#endif
    }
    return 0;
}
(You can change the #if 0 to #if 1 to see how the optimizer doesn't eliminate the loop when you need to print the results, but you see that the program is essentially the same, except for the call to printf with the result of the assignment)
Just compile it with/without optimization:
$ cc -O0 pru.c -o pru_noopt
$ cc -O2 pru.c -o pru_optim
and then run it under time:
$ time pru_noopt
counting from 0 to 18446744073709551615
Alarm clock
real 0m10,005s
user 0m9,848s
sys 0m0,000s
while running the optimized version gives:
$ time pru_optim
counting from 0 to 18446744073709551615
real 0m0,002s
user 0m0,002s
sys 0m0,002s
(impossible: not even the best computer can count, one by one, up to that number in less than 2 milliseconds) so the loop must have been eliminated. You can check this in the assembler code. As the updated value of data is not used after the assignment, the loop body can be eliminated, so the 2^64 - 1 executions of it can be eliminated as well.
Now add the following line after the loop:
printf("data = %lu\n", data);
You will see that then, even with the -O3 option, the loop is left untouched, because the value left by all the assignments is used after the loop.
(I preferred not to show the assembler code, and remain in high level, but you can have a look at the assembler code and see the actual generated code)
I have a piece of code in C as shown below.
In a .c file:
1   custom_data_type2 myFunction1(custom_data_type1 a, custom_data_type2 b)
2   {
3       int c=foo();
4       custom_data_type3 t;
5       check_for_ir_path();
6       ...
7       ...
8   }
9
10  custom_data_type4 myFunction2(custom_data_type3 c, const void* d)
11  {
12      custom_data_type4 e;
13      struct custom_data_type5 f;
14      check_for_ir_path();
15      ...
16      temp = myFunction1(...);
17      return temp;
18  }
In a header file:
1   void CRASH_DUMP(int *i)
2       __attribute__((noinline));
3
4   #define INTRPT_FORCE_DUMMY_STACK 3
5
6   #define check_for_ir_path() { \
7       if (checkfunc1() && !checkfunc2()) { \
8           int sv = INTRPT_FORCE_DUMMY_STACK; \
9           ...
10          CRASH_DUMP(&sv);\
11      }\
12  }
In an unknown scenario, there is a crash.
After processing the core dump using GDB, we get a call stack like this:
#0 0x00007ffa589d9619 in myFunction1 [...]
(custom_data_type1=0x8080808080808080, custom_data_type2=0x7ff9d77f76b8) at ../xxx/yyy/zzz.c:5
sv = 32761
t = <optimized out>
#1 0x00007ffa589d8f91 in myFunction2 [...]
(custom_data_type3=<optimized out>, d=0x7ff9d77f7748) at ../xxx/yyy/zzz.c:16
sv = 167937677
f = {
...
}
If you look at the function myFunction1, there are three local variables: c, t, and sv (defined as part of the macro expansion). However, in frame 0 of the backtrace we see only two locals, t and sv; the variable c is not listed.
The same is the case in myFunction2: there are three local variables, e, f, and sv (defined as part of the macro expansion), yet in frame 1 of the backtrace we see only f and sv; the variable e is not listed.
Why is the behavior like this?
Any non-static variable declared inside a function should be put on the call stack during execution, and so should have been listed by backtrace full, shouldn't it? However, some of the local variables are missing in the backtrace. Could someone provide an explanation?
Objects local to a C function often do not appear on the stack because optimization during compilation often makes it unnecessary to store them there. In general, while an implementation of the C abstract machine may be viewed as storing a function's local objects on the stack, the actual implementation on a real processor after compilation and optimization may be very different. In particular:
An object local to a function may be created and used only inside a processor register. When there are enough processor registers to hold a function’s local objects, or some of them, there is no point in writing them to memory, so optimized code will not do so.
Optimization may eliminate a local object completely or fold it into other values. For example, given void foo(int x) { int t = 10; bar(x+2*t); … }, the compiler may merely generate code that adds an immediate value of 20 to x, with the result that neither 10 nor any other instantiation of t ever appears on stack, in a register, or even in the immediate operand of an instruction. It simply does not exist in the generated code because there was no need for it.
An object local to a function may appear on the stack at one point during a function’s code but not at others. And the places it appears may differ from place to place in the code. For example, with { int t = x*x; … bar(t); … t = x/3; … bar(t); … }, the compiler may decide to stash the first value of t in one place on the stack. But the second value assigned to t is effectively a separate lifetime, and the compiler may stash it in another place on the stack (or not at all, per the above). In a good implementation, the debugger may be aware of these different places and display the stored value of t while the program counter is in a matching section of code. And, while the program counter is not in a matching section of code, t may effectively not exist, and the debugger could report it is optimized out at that point.
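As a sketch of how to observe this (assuming GCC and GDB; the file and function names are mine):

/* opt.c */
int bar(int);

int foo(int x)
{
    int t = 10;            /* likely folded into the constant 20 below */
    return bar(x + 2 * t);
}

Compiling with gcc -O2 -g -c opt.c and asking GDB to print t while stopped inside foo will typically report <optimized out>, because t never exists in the generated code.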
I have two files:
int PolyMod(int s);
void CreateChecksum(int isTestNet, int *mod) {
*mod = PolyMod(isTestNet == 0 ? 5 : 9);
}
and
int PolyMod(int s);
void CreateChecksum(int isTestNet, int *mod) {
if (isTestNet == 0) {
*mod = PolyMod(5);
} else {
*mod = PolyMod(9);
}
}
Somehow their assembly result is different. Why? You can see the assembly created from the first file here and from the second file here.
Doesn't the compiler know that they're equivalent, and that one might be faster? Was the reason they produced different assembly that they're exactly equally fast, and the only difference between them is the order of operations?
I've wondered if the difference was caused by static branch prediction. After experimenting with __builtin_expect, I believe that the answer is no.
It seems that the problem is a missed-optimization bug in GCC's GIMPLE pass. Clang doesn't have this bug, so it generates the same assembly for both.
I've reported this to GCC; the bug can be tracked here: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85971
C does not impose any restrictions on which hardware instructions are generated.
The compiler is allowed to generate any possible instructions as long as the semantics of the generated code remain the same as the abstract semantics of C (defined in ISO 9899).
The compiler transforms the C code through many intermediate languages (GENERIC, GIMPLE, SSA form, RTL, etc.); in particular, hardware-dependent code is generated from RTL.
You should study the intermediate languages in order to understand why the generated assembler is different.
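For example, GCC can dump those intermediate forms for inspection (a sketch; the exact dump file names vary between GCC versions):

$ gcc -O2 -fdump-tree-gimple -fdump-rtl-expand -c file.c
$ ls file.c.*    # per-pass dump files, including the GIMPLE and RTL forms

Comparing the GIMPLE dumps of the two versions of CreateChecksum would show where their representations first diverge.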
I'm experimenting with the foreign-function interface in Haskell. I wanted to implement a simple test to see if I could do mutual recursion. So, I created the following Haskell code:
module MutualRecursion where
import Data.Int
foreign import ccall countdownC::Int32->IO ()
foreign export ccall countdownHaskell::Int32->IO()
countdownHaskell::Int32->IO()
countdownHaskell n = print n >> if n > 0 then countdownC (pred n) else return ()
Note that the recursive case is a call to countdownC, so this should be tail-recursive.
In my C code, I have
#include <stdio.h>
#include "MutualRecursionHaskell_stub.h"

void countdownC(int count)
{
    printf("%d\n", count);
    if(count > 0)
        return countdownHaskell(count-1);
}

int main(int argc, char* argv[])
{
    hs_init(&argc, &argv);
    countdownHaskell(10000);
    hs_exit();
    return 0;
}
This is likewise tail recursive. So then I wrote a makefile:
MutualRecursion: MutualRecursionHaskell_stub
	ghc -O2 -no-hs-main MutualRecursionC.c MutualRecursionHaskell.o -o MutualRecursion

MutualRecursionHaskell_stub:
	ghc -O2 -c MutualRecursionHaskell.hs
and compile with make MutualRecursion.
And... upon running, it segfaults after printing 8991.
Just as a test to make sure gcc itself can handle tco in mutual recursion, I did
void countdownC2(int);

void countdownC(int count)
{
    printf("%d\n", count);
    if(count > 0)
        return countdownC2(count-1);
}

void countdownC2(int count)
{
    printf("%d\n", count);
    if(count > 0)
        return countdownC(count-1);
}
and that worked quite fine. It also works in the single-recursion case, both purely in C and purely in Haskell.
So my question is: is there a way to indicate to GHC that the call to the external C function is tail recursive? I'm assuming that the stack frame comes from the call from Haskell to C and not the other way around, since the C code is very clearly a return of a function call.
I believe cross-language C-Haskell tail calls are very, very hard to achieve.
I do not know the exact details, but the C runtime and the Haskell runtime are vastly different. The main factors for this difference, as far as I can see, are:
different paradigm: purely functional vs imperative
garbage collection vs manual memory management
lazy semantics vs strict one
The kinds of optimizations which are likely to survive across language boundaries given such differences are next to zero. Perhaps, in theory, one could invent an ad hoc C runtime together with a Haskell runtime so that some optimizations are feasible, but GHC and GCC were not designed in this way.
Just to show an example of the potential differences, assume we have the following Haskell code
p :: Int -> Bool
p x = x==42
main = if p 42
then putStrLn "A" -- A
else putStrLn "B" -- B
A possible implementation of main could be the following:
push the address of A on the stack
push the address of B on the stack
push 42 on the stack
jump to p
A: print "A", jump to end
B: print "B", jump to end
while p is implemented as follows:
p: pop x from the stack
pop b from stack
pop a from stack
test x against 42
if equal, jump to a
jump to b
Note how p is invoked with two return addresses, one for each possible result. This is different from C, whose standard implementations use only one return address. When crossing boundaries the compiler must account for this difference and compensate.
Above I also did not account for the case when the argument of p is a thunk, to keep it simple. The GHC allocator can also trigger garbage collection.
Note that the above fictional implementation was actually used in the past by GHC (the so-called "push/enter" STG machine). Even though it is no longer in use, the current "eval/apply" STG machine is only marginally closer to the C runtime. I'm not even sure GHC uses the regular C stack: I think it does not, and uses its own instead.
You can check the GHC developer wiki to see the gory details.
While I am no expert in Haskell-C interop, I do not imagine a call from C to Haskell can be a straight function invocation; it most likely has to go through an intermediary to set up the environment. As a result, your call into Haskell actually consists of a call to this intermediary. This call was likely optimized by gcc. But the call from the intermediary to the actual Haskell routine was not necessarily optimized, so I assume this is what you are dealing with. You can check the assembly output to make sure.
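For example (assuming the file names from the question and that the generated stub header is on the include path):

$ ghc -O2 -S MutualRecursionHaskell.hs    # emits MutualRecursionHaskell.s
$ gcc -O2 -S MutualRecursionC.c           # emits MutualRecursionC.s

Looking for call versus jmp instructions at the cross-language call sites in the .s files shows whether the calls were compiled as tail calls.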
Knowing the number of iterations a loop will go through allows the compiler to do some optimization. Consider for instance the two loops below.
Unknown iteration count:
static void bitreverse(vbuf_desc * vbuf)
{
    unsigned int idx = 0;
    unsigned char * img = vbuf->usrptr;

    while(idx < vbuf->bytesused) {
        img[idx] = bitrev[img[idx]];
        idx++;
    }
}
Known iteration count
static void bitreverse(vbuf_desc * vbuf)
{
    unsigned int idx = 0;
    unsigned char * img = vbuf->usrptr;

    while(idx < 1280*400) {
        img[idx] = bitrev[img[idx]];
        idx++;
    }
}
The second version will compile to faster code, because it will be unrolled twice (on ARM with gcc 4.6.3 and -O2, at least). Is there a way to make an assertion about the loop count that gcc will take into account when optimizing?
There is a hot attribute on functions to give the compiler a hint about hot spots: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html. Just add before your function:
static void bitreverse(vbuf_desc * vbuf) __attribute__ ((hot));
Here are the docs about 'hot' from gcc:
hot The hot attribute on a function is used to inform the compiler
that the function is a hot spot of the compiled program. The function
is optimized more aggressively and on many target it is placed into
special subsection of the text section so all hot functions appears
close together improving locality. When profile feedback is available,
via -fprofile-use, hot functions are automatically detected and this
attribute is ignored.
The hot attribute on functions is not implemented in GCC versions
earlier than 4.3.
The hot attribute on a label is used to inform the compiler that path
following the label are more likely than paths that are not so
annotated. This attribute is used in cases where __builtin_expect
cannot be used, for instance with computed goto or asm goto.
The hot attribute on labels is not implemented in GCC versions earlier
than 4.8.
Also you can try to add __builtin_expect around your idx < vbuf->bytesused condition; it will hint that in most cases the expression is true.
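That could look like this (a sketch; the second argument 1 marks the value the expression is expected to have):

while (__builtin_expect(idx < vbuf->bytesused, 1)) {
    img[idx] = bitrev[img[idx]];
    idx++;
}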
In both cases I'm not sure that your loop will be optimized.
Alternatively you can try profile-guided optimization. Build a profile-generating version of the program with -fprofile-generate; run it on the target, copy the profile data to the build host, and rebuild with -fprofile-use. This gives the compiler a lot of information.
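A typical sequence might look like this (file and program names are illustrative):

$ gcc -O2 -fprofile-generate bitreverse.c -o bitreverse   # instrumented build
$ ./bitreverse < representative_input                     # run on representative data
$ gcc -O2 -fprofile-use bitreverse.c -o bitreverse        # rebuild using the profile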
Some compilers (not GCC) have loop pragmas, including "#pragma loop count (N)" and "#pragma unroll (M)": for example in Intel's compiler, unroll pragmas in IBM's, and vectorizing pragmas in MSVC.
The ARM compiler (armcc) also has some loop pragmas, such as unroll(n):
Loop Unrolling: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJACACFE.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJAHJDAB.html
and __promise intrinsic:
Using __promise to improve vectorization
The __promise(expr) intrinsic is a promise to the compiler that a given expression is nonzero. This enables the compiler to improve vectorization by optimizing away code that, based on the promise you have made, is redundant.
The disassembled output of Example 3.21 shows the difference that __promise makes, reducing the disassembly to a simple vectorized loop by the removal of a scalar fix-up loop.
Example 3.21. Using __promise(expr) to improve vectorization code
void f(int *x, int n)
{
    int i;

    __promise((n > 0) && ((n&7)==0));
    for (i=0; i<n; i++) x[i]++;
}
You can actually specify the exact count with __builtin_expect, like this:
while (idx < __builtin_expect(vbuf->bytesused, 1280*400)) {
This tells gcc that vbuf->bytesused is expected to be 1280*400 at runtime.
Alas, this does nothing for optimization with current gcc version. Haven't tried with 4.8, though.
Edit: I just realized that every standard C compiler has a way to specify the exact loop count, via assert. Since the assert
#include <assert.h>
...
assert(loop_count == 4096);
for (i = 0; i < loop_count; i++) ...
will call abort() if the condition is not true, any compiler with value propagation will know the exact value of loop_count. I always thought that this would be the most elegant and standards-conforming way to give such optimization hints. Now I want a C compiler that actually uses this information.
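A related trick (my addition, not part of the assert idea above) is GCC's __builtin_unreachable(), which conveys the same value information even in builds where NDEBUG disables the assert:

if (loop_count != 4096)
    __builtin_unreachable(); /* promises the compiler that loop_count == 4096 */
for (i = 0; i < loop_count; i++) ...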
Note that if you want to make this faster, bytewise unrolling might be less effective than using a wider lookup table. A 16-bit table would occupy 128K, and thus often fit into the CPU cache. If the data is not completely random, an even wider table (3 bytes) might be effective.
2-byte example:
unsigned short *bitrev2;
...
for (idx = 0; idx < vbuf->bytesused; idx += 2) {
    *(unsigned short *)(&img[idx]) = bitrev2[*(unsigned short *)(&img[idx])];
}
This is an optimization the compiler can't perform, regardless of the information you give it.