C: Comparison between value initialization during definition & assignment after it

Just a question out of curiosity. In C we can initialize a variable directly, or assign to it after defining it, like
char* pStr = NULL;
or
char* pStr;
pStr = NULL;
Functionality-wise they are similar, but is there any difference after compilation? Is an extra instruction cycle required for the latter, or are modern compilers intelligent enough to optimize it away?
N.B.: I am reviewing an old codebase where the second form is used extensively. That's why I am curious whether I would see any real change by changing the code in all those places.

The first snippet initializes the variable with a value. The second leaves it uninitialized, which for a pointer with automatic storage duration means it holds an indeterminate value, and then assigns a value to it.
For a non-const pointer with automatic storage duration there should be no difference, except that with the second form you may unintentionally use the variable before it is assigned, which would be undefined behavior.
Other things, such as const-qualified objects (or references in C++), require the first style.

It depends on whether the variable is local or global:
int hello;
int world=6;
void fun ( void )
{
int foo;
int bar=5;
foo=4;
hello=2;
}
For globals, hello would land in .bss (which may require bootstrap code to zero it), and code would be generated that runs at runtime to set that variable to 2. world would land in .data and have its initial value of 6 set at compile time: the allocated data space contains that value, though bootstrap code may be needed to place that data before use.
foo and bar are ideally on the stack, which is a runtime "allocation", so in either case code is required to make room for them as well as to set them to a value at runtime. If you made them static, basically "local globals", they would fall into the same category as globals, landing in .bss or .data: bar would be initialized to 5 the one time, but foo would still be set at runtime by code generated in .text.
These are simple examples; compiling and disassembling will show how all of this works, granted it is not trivial, as the optimizer may eliminate some of what you are looking for depending on the rest of the code. (The code above sets hello to 2; foo and bar are dead code and would be optimized out.)
00000000 <fun>:
0: e3a02002 mov r2, #2
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000 andeq r0, r0, r0
Disassembly of section .data:
00000000 <world>:
0: 00000006 andeq r0, r0, r6
If I do a very crude link, without startup code etc., to see the rest of the picture:
00001000 <fun>:
1000: e3a02002 mov r2, #2
1004: e59f3004 ldr r3, [pc, #4] ; 1010 <fun+0x10>
1008: e5832000 str r2, [r3]
100c: e12fff1e bx lr
1010: 00011018 andeq r1, r1, r8, lsl r0
Disassembly of section .data:
00011014 <__data_start>:
11014: 00000006 andeq r0, r0, r6
Disassembly of section .bss:
00011018 <__bss_start>:
11018: 00000000 andeq r0, r0, r0
We see both hello and world, but foo and bar are optimized out.


static and volatile keywords from assembly point of view

I know that there are many questions like this, but this question is not about what static and volatile mean from the C standard's point of view. I'm interested in what happens a bit lower, at the assembly level.
The static keyword for variables gives those variables static storage duration, making them visible like global variables. To make this real, should a compiler write those variables to the .bss section or somewhere else? Also, static prevents the variable/function from being used outside the file; does this happen only during compilation, or are there some runtime checks?
The volatile keyword for variables makes those variables be read from memory, to make sure that if something else (like a peripheral device) modifies that variable, the code sees exactly the value in that memory. What exactly does "to be read from memory" mean here? Which memory location is used: .bss, .data, or something else?
The static keyword has two meanings: (a) it conveys static storage class and (b) it conveys internal linkage. These two meanings have to be strictly distinguished.
An object having static storage class means that it is allocated at the start of the program and lives until the end of the program. This is usually achieved by placing the object into the data segment (for initialised objects) or into the bss segment (for uninitialised objects). Details may vary depending on the toolchain in question.
An identifier having internal linkage means that each identifier in the same translation unit with the same name and some linkage (i.e. the linkage is not “none”) refers to the same object as that identifier. This is usually realised by not making the symbol corresponding to the identifier a global symbol. The linker will then not recognise references of the same symbol from different translation units as referring to the same symbol.
The volatile keyword indicates that all operations performed on the volatile-qualified object in the abstract machine must be performed in the code generated. The compiler is not permitted to perform any optimisations that would discard any such operations performed on the volatile-qualified object as it usually would for non-volatile-qualified objects.
This keyword is purely an instruction to the compiler to suppress certain optimisations. It does not affect the storage class of the objects qualified such. See also my previous answer on this topic.
You can also try it and see.
C code:
static unsigned int x;
unsigned int y;
unsigned int z = 1;
static volatile unsigned int j;
static volatile const unsigned int k = 11;
void fun ( void )
{
x = 5;
y = 7;
z ++ ;
j+=2;
}
Assembler:
mov ip, #7
ldr r3, .L3
ldr r0, .L3+4
ldr r2, [r3, #4]
ldr r1, [r0]
add r2, r2, #2
add r1, r1, #1
str r1, [r0]
str r2, [r3, #4]
str ip, [r3]
bx lr
.global z
.global y
.data
.align 2
.set .LANCHOR1,. + 0
.type z, %object
.size z, 4
z:
.word 1
.type k, %object
.size k, 4
k:
.word 11
.bss
.align 2
.set .LANCHOR0,. + 0
.type y, %object
.size y, 4
y:
.space 4
.type j, %object
.size j, 4
j:
.space 4
x was not expected to survive in an example like this; had it survived, it would land in .bss since I did not give it an initial value.
y is .bss as expected
z is .data as expected
volatile prevents j from being optimized out despite it being dead code/variable.
k could have ended up in .rodata but looks like .data here.
You guys are using fancy words, but static in C just means the name is limited in scope, to that function or file. Global or local, initialized or not, const or not, all affect whether it lands in .data, .bss, or .rodata (it could even land in .text instead of .rodata if you play the alphabet game with the (rwx) flags in the linker script; suggestion: never use those).
volatile is implied to mean some flavor of: do not optimize out this variable or its operations, do them in this order, do not move them outside the loop, and so on. You can find discussions about how it is not what you think it is, and we have seen on this site that llvm/clang and gnu/gcc have different opinions on what volatile actually means when used on a pointer intended to access a control or status register in a peripheral, based on arguments about what volatile was invented for (not for sharing variables between interrupts and foreground code).
Likewise, static volatile does not imply which segment the object is in. volatile can even be used as asm volatile (stuff); to tell the compiler: do not move this code around, I want it to happen right here in this order (which is an aspect of using it on a variable too, or so we believe).
static unsigned int x;
void fun ( void )
{
x = 5;
}
Disassembly of section .text:
00000000 <fun>:
0: e12fff1e bx lr
No .rodata, .data, nor .bss: the variable was optimized away.
but
static unsigned int x;
void fun ( void )
{
x += 5;
}
Disassembly of section .text:
00000000 <fun>:
0: e59f200c ldr r2, [pc, #12] ; 14 <fun+0x14>
4: e5923000 ldr r3, [r2]
8: e2833005 add r3, r3, #5
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
Disassembly of section .bss:
00000000 <x>:
0: 00000000 andeq r0, r0, r0
How fun is that, ewww... let's not optimize out the dead code, let's put it in there. It is not global, nobody else can see it...
fun.c
static unsigned int x;
void fun ( void )
{
x += 5;
}
so.c
static unsigned int x;
void more_fun ( void )
{
x += 3;
}
linked
Disassembly of section .text:
00008000 <more_fun>:
8000: e59f200c ldr r2, [pc, #12] ; 8014 <more_fun+0x14>
8004: e5923000 ldr r3, [r2]
8008: e2833003 add r3, r3, #3
800c: e5823000 str r3, [r2]
8010: e12fff1e bx lr
8014: 00018030 andeq r8, r1, r0, lsr r0
00008018 <fun>:
8018: e59f200c ldr r2, [pc, #12] ; 802c <fun+0x14>
801c: e5923000 ldr r3, [r2]
8020: e2833005 add r3, r3, #5
8024: e5823000 str r3, [r2]
8028: e12fff1e bx lr
802c: 00018034 andeq r8, r1, r4, lsr r0
Disassembly of section .bss:
00018030 <x>:
18030: 00000000 andeq r0, r0, r0
00018034 <x>:
18034: 00000000 andeq r0, r0, r0
Each x is static, so as expected there are two of them... well, the expectation was that they would be optimized out, but...
And they are in .bss, as expected, since I did not initialize them.
and on that note
static unsigned int x=3;
void fun ( void )
{
x += 5;
}
Disassembly of section .text:
00000000 <fun>:
0: e59f200c ldr r2, [pc, #12] ; 14 <fun+0x14>
4: e5923000 ldr r3, [r2]
8: e2833005 add r3, r3, #5
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
Disassembly of section .data:
00000000 <x>:
0: 00000003 andeq r0, r0, r3
static const unsigned int x=3;
unsigned int fun ( void )
{
return(x);
}
Disassembly of section .text:
00000000 <fun>:
0: e3a00003 mov r0, #3
4: e12fff1e bx lr
static const unsigned int x=3;
const unsigned int y=5;
unsigned int fun ( void )
{
return(x+y);
}
Disassembly of section .text:
00000000 <fun>:
0: e3a00008 mov r0, #8
4: e12fff1e bx lr
Disassembly of section .rodata:
00000000 <y>:
0: 00000005 andeq r0, r0, r5
Okay I finally got a .rodata.
static const unsigned int x=3;
volatile const unsigned int y=5;
unsigned int fun ( void )
{
return(x+y);
}
Disassembly of section .text:
00000000 <fun>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <fun+0x10>
4: e5930000 ldr r0, [r3]
8: e2800003 add r0, r0, #3
c: e12fff1e bx lr
10: 00000000 andeq r0, r0, r0
Disassembly of section .data:
00000000 <y>:
0: 00000005 andeq r0, r0, r5
There is only so much you can do with words and their (perceived) definitions; the topic as I understand it is C vs. (generated) asm. At some point you should actually try it, and you can see how trivial it is; you do not need to write elaborate code: gcc, objdump, and sometimes ld. Hmm, I just noticed y moved from .rodata to .data in that case... That is interesting.
And this just-try-it approach tests the compiler and other tool authors' interpretation: things like what register means, what volatile means, etc. (and you find they are subject to different interpretations, like so much of the C language that is implementation defined). It is sometimes important to know your favorite/specific compiler's interpretation of the language, but be very mindful of actually implementation-defined things (bitfields, unions, how structs are constructed (packing them causes as many problems as it solves), and so on)...
Go to the spec and read whatever definition, then go to your compiler and see how they interpreted it, then go back to the spec and see if you can figure it out.
As far as static goes, it essentially means scope: the name stays within the function or file (well, the compile domain for a single compile operation). And volatile means: please do this in this order, and please do not optimize out this item and/or its operations. In both cases it is what you use them with that determines where things land: .text, .data, .bss, .rodata, etc.

ARM Thumb GCC Disassembled C. Caller-saved registers not saved and loading and storing same register immediately

Context: STM32F469 Cortex-M4 (ARMv7-M Thumb-2), Win 10, GCC, STM32CubeIDE; Learning/Trying out inline assembly & reading disassembly, stack managements etc., writing to core registers, observing contents of registers, examining RAM around stack pointer to understand how things work.
I've noticed that at some point, when I call a function which received an argument, the instructions generated at the beginning of the called C function do "store R3 at RAM address X" followed immediately by "read RAM address X back into R3". So it is writing a value and reading the same value back; R3 is not changed. If it only wanted to save the value of R3 onto the stack, why load it back?
C code, caller function (main), my code:
asm volatile(" LDR R0,=#0x00000000\n"
" LDR R1,=#0x11111111\n"
" LDR R2,=#0x22222222\n"
" LDR R3,=#0x33333333\n"
" LDR R4,=#0x44444444\n"
" LDR R5,=#0x55555555\n"
" LDR R6,=#0x66666666\n"
" MOV R7,R7\n" //Stack pointer value is here, used for stack data access
" LDR R8,=#0x88888888\n"
" LDR R9,=#0x99999999\n"
" LDR R10,=#0xAAAAAAAA\n"
" LDR R11,=#0xBBBBBBBB\n"
" LDR R12,=#0xCCCCCCCC\n"
);
testInt = addFifteen(testInt); //testInt=0x03; returns uint8_t, argument uint8_t
Function call generates instructions to load function argument into R3, then move it to R0, then branch with link to addFifteen. So by the time I enter addFifteen, R0 and R3 have value 0x03 (testInt). So far so good. Here is what function call looks like:
testInt = addFifteen(testInt);
08000272: ldrb r3, [r7, #11]
08000274: mov r0, r3
08000276: bl 0x80001f0 <addFifteen>
So I go into addFifteen, my C code for addFifteen:
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
and its disassembly:
addFifteen:
080001f0: push {r7}
080001f2: sub sp, #12
080001f4: add r7, sp, #0
080001f6: mov r3, r0
080001f8: strb r3, [r7, #7]
080001fa: ldrb r3, [r7, #7]
080001fc: adds r3, #15
080001fe: uxtb r3, r3
08000200: mov r0, r3
08000202: adds r7, #12
08000204: mov sp, r7
08000206: ldr.w r7, [sp], #4
0800020a: bx lr
My primary interest is in lines 1f8 and 1fa. It stores R3 on the stack and then loads the freshly written value back into the register that still holds the value anyway.
Questions are:
What is the purpose of this "store register A into RAM X, then read the value from RAM X back into register A"? The read instruction doesn't seem to serve any purpose. Is it to make sure the RAM write is complete?
The push {r7} instruction makes the stack 4-byte aligned instead of 8-byte aligned, but immediately after that instruction SP is decremented by 12 bytes, so it becomes 8-byte aligned again; therefore this behavior is OK. Is this statement correct? What if an interrupt happens between these two instructions? Will alignment be fixed during ISR stacking for the duration of the ISR?
From what I have read about caller/callee-saved registers (it is very hard to find well-organized information on this; if you have good material, please share a link), at least R0-R3 must be placed on the stack when I call a function. However, it is easy to see that in this case NONE of the registers were pushed on the stack, and I verified this by checking memory around the stack pointer: it would have been easy to spot 0x11111111 and 0x22222222, but they are not there, and nothing pushes them there. The values in R0 and R3 that I had before the call are simply gone forever. Why weren't any registers pushed on the stack before the function call? I would expect R3 to hold 0x33333333 when addFifteen returns, because that is what it held before the call, but that value is casually overwritten even before the branch to addFifteen. Why didn't GCC generate instructions to push R0-R3 onto the stack and only then branch with link to addFifteen?
If you need some compiler settings, please, let me know where to find them in Eclipse (STM32CubeIDE) and what exactly you need there, I will happily provide them and add them to the question here.
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
What you are looking at here is unoptimized code; at least with gnu, the input and local variables get a memory location on the stack.
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
What you see with r3 is the input variable, input, which comes in via r0. Because the code is not optimized, it goes into r3 and is then saved in its memory location on the stack.
Setup the stack
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
save input to the stack
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
so now we can start implementing the code in the function which wants to do math on the input function, so do that math
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
Convert the result to an unsigned char.
e: b2db uxtb r3, r3
Now prepare the return value
10: 4618 mov r0, r3
and clean up and return
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
Now if I tell it not to use a frame pointer (just a waste of a register):
00000000 <addFifteen>:
0: b082 sub sp, #8
2: 4603 mov r3, r0
4: f88d 3007 strb.w r3, [sp, #7]
8: f89d 3007 ldrb.w r3, [sp, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: b002 add sp, #8
14: 4770 bx lr
And you can still see each of the fundamental steps in implementing the function. Unoptimized.
Now if you optimize
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
It removes all the excess.
Number two:
Yes, I agree this looks wrong, but gnu certainly does not keep the stack aligned at all times, so this looks wrong. I have not read the details of the arm calling convention, though, nor what gcc's interpretation of it is. Granted, they may claim to follow a spec, but at the end of the day the compiler authors choose the calling convention for their compiler; they are under no obligation to arm or intel or anyone else to conform to any spec. It is their choice, and like the C language itself there are lots of places where things are implementation defined, and gnu implements them one way and others another. Perhaps this is the same. The same goes for this saving of the incoming variable to the stack; we will see what llvm/clang does.
Number three:
r0-r3 (and another register or two) may be called caller saved, but the better way to think of them is as volatile: the callee is free to modify them without saving them. It is not so much a case of saving the r0 register; rather, r0 represents a variable, and you are managing that variable while functionally implementing the high-level code.
For example
unsigned int fun1 ( void );
unsigned int fun0 ( unsigned int x )
{
return(fun1()+x);
}
00000000 <fun0>:
0: b510 push {r4, lr}
2: 4604 mov r4, r0
4: f7ff fffe bl 0 <fun1>
8: 4420 add r0, r4
a: bd10 pop {r4, pc}
x comes in via r0, and we need to preserve that value until after fun1() is called; r0 can be destroyed/modified by fun1(). So in this case they save r4, not r0, and keep x in r4.
clang does this as well
00000000 <fun0>:
0: b5d0 push {r4, r6, r7, lr}
2: af02 add r7, sp, #8
4: 4604 mov r4, r0
6: f7ff fffe bl 0 <fun1>
a: 1900 adds r0, r0, r4
c: bdd0 pop {r4, r6, r7, pc}
Back to your function.
clang, unoptimized also keeps the input variable in memory (stack).
00000000 <addFifteen>:
0: b081 sub sp, #4
2: f88d 0003 strb.w r0, [sp, #3]
6: f89d 0003 ldrb.w r0, [sp, #3]
a: 300f adds r0, #15
c: b2c0 uxtb r0, r0
e: b001 add sp, #4
10: 4770 bx lr
And you can see the same steps: prep the stack, store the input variable, take the input variable and do the math, prepare the return value, clean up, return.
Clang/llvm optimized:
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
It happens to be the same as gnu. There is no expectation that any two different compilers generate the same code, nor that any two versions of the same compiler generate the same code.
Unoptimized, the input and local variables (none in this case) get a home on the stack. So what you are seeing is the input variable being put into its home on the stack as part of the setup of the function. Then the function itself wants to operate on that variable, so, unoptimized, it needs to fetch that value from memory to create an intermediate variable (which in this case did not get a home on the stack), and so on. You see this with volatile variables as well: they get written to memory, then read back, then modified, then written to memory and read back, etc...
Yes, I agree, but I have not read the specs. At the end of the day it is gcc's calling convention, or their interpretation of whatever spec they chose to use. They have been doing this (not being aligned 100% of the time) for a long time and it does not fail: for all called functions, the stack is aligned when the functions are called. Interrupt entry in arm code generated by gcc is not aligned all the time; it has been this way since they adopted that spec.
By definition r0-r3, etc. are volatile: the callee can modify them at will and only needs to save/preserve them if it itself needs them. In both the unoptimized and optimized cases, only r0 matters for your function; it holds the input variable and is used for the return value. You saw in the function I created that the input variable was preserved for later, even when optimized. But, by definition, the caller assumes these registers are destroyed by called functions, and called functions can destroy their contents with no need to save them.
As far as inline assembly goes, it is a different assembly language than "real" assembly language. I think you have a ways to go before being ready for it, but maybe not. After decades of constant bare-metal work I have found zero real use cases for inline assembly; the cases I see are laziness, avoiding letting real assembly into the make system, or ways to avoid writing real assembly language. I see it as a gee-whiz feature that folks use like unions and bitfields.
Within gnu, for arm, you have at least four incompatible assembly languages: the non-unified-syntax real assembly, the unified-syntax real assembly, the assembly language you see when you use gcc to assemble instead of as, and then inline assembly for gcc. Despite claims of compatibility, clang arm assembly language is not 100% compatible with gnu assembly language, and llvm/clang does not have a separate assembler; you feed it to the compiler. Arm's various toolchains over the years have assembly languages completely incompatible with gnu's for arm. This is all expected and normal: assembly language is specific to the tool, not the target.
Before you get into inline assembly, learn some of the real assembly language. And to be fair, perhaps you do, and perhaps quite well, and this question is about discovering how compilers generate code and how strange it looks as you find out that it is not some one-to-one thing (all tools in all cases generating the same output from the same input).
For inline asm, while you can specify registers, depending on what you are doing you generally want to let the compiler choose the register. Most of the work for inline assembly is not the assembly but the language that specific compiler uses to interface with it, which is compiler specific; move to another compiler and expect a whole new language to learn. While moving between assemblers is also a whole new language, at least the syntax of the instructions themselves tends to be the same, and the differences are in everything else: labels, directives, and such. And if you are lucky and it is a toolchain, not just an assembler, you can look at the output of the compiler to start to understand the language and compare it to any documentation you can find. Gnu's documentation is pretty bad in this case, so a lot of reverse engineering is needed. At the same time you are more likely to be successful with gnu tools than any other; not because they are better, in many cases they are not, but because of the sheer user base and the common features across targets and over decades of history.
I would get really good at interfacing asm with C by creating mock C functions to see which registers are used, etc. And/or even better, implement it in C, compile it, then hand modify/improve the output of the compiler (you do not need to be a guru to beat the compiler; to be as consistent, perhaps, but fairly often you can easily see improvements to be made in the output of gcc, and gcc has been getting worse over the last several versions, not better, as you can see from time to time on this site). Get strong in the asm for this toolchain and target and in how the compiler works, and then perhaps learn the gnu inline assembly language.
I'm not sure there is a specific purpose; it is just one solution the compiler has found.
For example the code:
unsigned int f(unsigned int a)
{
return sqrt(a + 1);
}
compiles with arm-none-eabi GCC 9 at optimisation level -O0 to:
push {r7, lr}
sub sp, sp, #8
add r7, sp, #0
str r0, [r7, #4]
ldr r3, [r7, #4]
adds r3, r3, #1
mov r0, r3
bl __aeabi_ui2d
mov r2, r0
mov r3, r1
mov r0, r2
mov r1, r3
bl sqrt
...
and in level -O1 to:
push {r3, lr}
adds r0, r0, #1
bl __aeabi_ui2d
bl sqrt
...
As you can see, the asm is much easier to understand at -O1: the parameter arrives in R0, add 1, call the functions.
The hardware supports a non-aligned stack during exception entry; see here.
The "caller saved" registers do not necessarily need to be stored on the stack; it is up to the caller to know whether it needs to store them or not.
Here you are mixing (if I understood correctly) C and assembly, so you have to do the compiler's job before switching back to C: either you keep values in callee-saved registers (which, by convention, a called function will preserve) or you store them yourself on the stack.

What is 'veneer' that arm linker uses in function call?

I just read https://www.keil.com/support/man/docs/armlink/armlink_pge1406301797482.htm, but I can't understand what a veneer is that the arm linker inserts between function calls.
In "Procedure Call Standard for the ARM Architecture" document, it says,
5.3.1.1 Use of IP by the linker
Both the ARM- and Thumb-state BL instructions are unable to address the full 32-bit address space, so
it may be necessary for the linker to insert a veneer between the
calling routine and the called subroutine. Veneers may also be needed
to support ARM-Thumb inter-working or dynamic linking. Any veneer
inserted must preserve the contents of all registers except IP (r12)
and the condition code flags; a conforming program must assume that a
veneer that alters IP may be inserted at any branch instruction that
is exposed to a relocation that supports inter-working or long
branches. Note R_ARM_CALL, R_ARM_JUMP24, R_ARM_PC24, R_ARM_THM_CALL,
R_ARM_THM_JUMP24 and R_ARM_THM_JUMP19 are examples of the ELF
relocation types with this property. See [AAELF] for full details
Here is my guess; is it something like this? When function A calls function B, and the two functions are too far apart for the bl instruction to encode the distance, the linker inserts a function C between A and B in such a way that C is close to B. Now function A uses a b instruction to go to function C (keeping all the registers across the call), and function C uses a bl instruction (keeping all the registers too). Of course the r12 register is used to hold the remaining long-jump address bits. Is this what a veneer means? (I don't know why arm doesn't explain what a veneer is, only what a veneer provides.)
It is just a trampoline. Interworking is the easier one to demonstrate; I am using gnu here, but the implication is that Keil has a solution as well.
.globl even_more
.type even_more,%function
even_more:
bx lr
.thumb
.globl more_fun
.thumb_func
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
extern unsigned int even_more ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+even_more(a));
}
Unlinked object:
Disassembly of section .text:
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: ebfffffe bl 0 <more_fun>
c: e1a04000 mov r4, r0
10: e1a00005 mov r0, r5
14: ebfffffe bl 0 <even_more>
18: e0840000 add r0, r4, r0
1c: e8bd4070 pop {r4, r5, r6, lr}
20: e12fff1e bx lr
Linked binary (yes, completely unusable, but it demonstrates what the tool does):
Disassembly of section .text:
00001000 <fun>:
1000: e92d4070 push {r4, r5, r6, lr}
1004: e1a05000 mov r5, r0
1008: eb000008 bl 1030 <__more_fun_from_arm>
100c: e1a04000 mov r4, r0
1010: e1a00005 mov r0, r5
1014: eb000002 bl 1024 <even_more>
1018: e0840000 add r0, r4, r0
101c: e8bd4070 pop {r4, r5, r6, lr}
1020: e12fff1e bx lr
00001024 <even_more>:
1024: e12fff1e bx lr
00001028 <more_fun>:
1028: 4770 bx lr
102a: 46c0 nop ; (mov r8, r8)
102c: 0000 movs r0, r0
...
00001030 <__more_fun_from_arm>:
1030: e59fc000 ldr r12, [pc] ; 1038 <__more_fun_from_arm+0x8>
1034: e12fff1c bx r12
1038: 00001029 .word 0x00001029
103c: 00000000 .word 0x00000000
You cannot use bl to switch modes between arm and thumb, so the linker has added a trampoline, as I call it (or have heard it called), that you hop on and off to get to the destination. In this case it essentially converts the branch part of bl into a bx; for the link part they take advantage of just using the bl. You can see this done for thumb to arm or arm to thumb.
The even_more function is in the same mode (ARM) so no need for the trampoline/veneer.
For the distance limit of bl lemme see. Wow, that was easy, and gnu called it a veneer as well:
.globl more_fun
.type more_fun,%function
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+1);
}
MEMORY
{
bob : ORIGIN = 0x00000000, LENGTH = 0x1000
ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.some : { so.o(.text*) } > bob
.more : { more.o(.text*) } > ted
}
Disassembly of section .some:
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: eb000003 bl 18 <__more_fun_veneer>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <__more_fun_veneer>:
18: e51ff004 ldr pc, [pc, #-4] ; 1c <__more_fun_veneer+0x4>
1c: 20000000 .word 0x20000000
Disassembly of section .more:
20000000 <more_fun>:
20000000: e12fff1e bx lr
Staying in the same mode it did not need the bx.
The alternative is to replace every bl instruction at compile time with a more complicated sequence just in case you need to do a far call. Or, since the bl offset/immediate is computed at link time, you can, at link time, insert the trampoline/veneer to change modes or cover the distance.
You should be able to repeat this yourself with the Keil tools; all you need to do is either switch modes on an external function call or exceed the reach of the bl instruction.
Edit
Understand that toolchains vary and even within a toolchain, gcc 3.x.x was the first to support thumb and I do not know that I saw this back then. Note the linker is part of binutils which is as separate development from gcc. You mention "arm linker", well arm has its own toolchain, then they bought Kiel and perhaps replaced Kiel's with their own or not. Then there is gnu and clang/llvm and others. So it is not a case of "arm linker" doing this or that, it is a case of the toolchains linker doing this or that and each toolchain is first free to use whatever calling convention they want there is no mandate that they have to use ARM's recommendations, second they can choose to implement this or not or simply give you a warning and you have to deal with it (likely in assembly language or through function pointers).
ARM does not need to explain it, or let us say, it is clearly explained in the Architectural Reference Manual (look at the bl instruction, the bx instruction look for the words interworking, etc. All quite clearly explained) for a particular architecture. So there is no reason to explain it again. Especially for a generic statement where the reach of bl varies and each architecture has different interworking features, it would be a long set of paragraphs or a short chapter to explain something that is already clearly documented.
Anyone implementing a compiler and linker would be well versed in the instruction set beforehand and understand the bl, conditional branch, and other limitations of the instruction set. Some instruction sets offer near and far jumps, and in some of those the assembly language for near and far uses the same mnemonic, so if the assembler does not see the label in the same file it will often implement a far jump/call rather than a near one so that the objects can be linked.
In any case, before linking you have to compile and assemble, and the toolchain folks will have fully understood the rules of the architecture. ARM is not special here.
This is Raymond Chen's comment:
The veneer has to be close to A because B is too far away. A does a bl
to the veneer, and the veneer sets r12 to the final destination(B) and
does a bx r12. bx can reach the entire address space.
This answers my question well enough, but he didn't want to write a full answer (maybe for lack of time), so I am putting it here as an answer and accepting it. If someone posts a better, more detailed answer, I'll switch to it.

Process sections: does a declaration also add something to .text? If so, what does it add?

I have a C code like this one, that will be possibly compiled in an ELF file for ARM:
int a;
int b=1;
int foo(int x) {
int c=2;
static float d=1.5;
// .......
}
I know that all the executable code goes into the .text section, while .data , .bss and .rodata will contain the various variables/constants.
My question is: does a line like int b=1; here also add something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped into RAM when deployed on the final hardware)?
Moreover, trying to disassemble similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also adding some lines of .text where the value '2' was actually stored.
So, in general, does a declaration always imply something added to .text at the assembly level? If yes, does it depend on the context (i.e. whether the variable is inside a function, whether it is a "local global" variable, ...), and what is actually added?
Thanks a lot in advance.
does a line like int b=1; here also add something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped into RAM when deployed on the final hardware)?
You understand that this is likely to be implementation specific, but the likelihood is that you will just get initialised data in the data section. Were it a constant, it might instead go into the text section.
Moreover, trying to disassemble similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also adding some lines of .text where the value '2' was actually stored.
Automatic variables that are initialised have to be initialised each time the function's scope is entered. The space for c is reserved on the stack (or in a register, depending on the ABI), but the program has to remember the constant from which it is initialised, and this is best placed somewhere in the text segment, either as a constant value or as a "move immediate" instruction.
So, in general, does a declaration always imply something added to .text at the assembly level?
No. If a static variable is initialised to zero or null, or not initialised at all, it is often enough to just reserve space in .bss. If a static non-constant variable is initialised to a non-zero value, it will just be put in the data segment.
As @goodvibration correctly stated, only global or static variables go into the segments. This is because their lifetime is the whole execution time of the program.
Local variables have a different lifetime. They exist only during the execution of the block (e.g. function) they are defined within. If a function is called, all parameters that do not fit into registers are pushed to the stack and the return address is written to the link register.* The function possibly saves the link register and other registers on the stack and reserves some space on the stack for local variables (this is the code you have observed). At the end of the function, the saved registers are popped and the stack pointer is readjusted. In this way, you get automatic garbage collection for local variables.
*: Please note, that this is true for (some calling conventions of) ARM only. It's different e.g. for Intel processors.
this is one of those just try it things.
int a;
int b=1;
int foo(int x) {
int c=2;
static float d=1.5;
int e;
e=x+2;
return(e);
}
first thing without optimization.
arm-none-eabi-gcc -c so.c -o so.o
arm-none-eabi-objdump -D so.o
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf > so.list
Don't worry about the warning; we only needed to link to see that everything found a home.
Disassembly of section .text:
00001000 <foo>:
1000: e52db004 push {r11} ; (str r11, [sp, #-4]!)
1004: e28db000 add r11, sp, #0
1008: e24dd014 sub sp, sp, #20
100c: e50b0010 str r0, [r11, #-16]
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
102c: e28bd000 add sp, r11, #0
1030: e49db004 pop {r11} ; (ldr r11, [sp], #4)
1034: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
00002004 <d.4102>:
2004: 3fc00000 svccc 0x00c00000
Disassembly of section .bss:
00002008 <a>:
2008: 00000000 andeq r0, r0, r0
Since this is a disassembly, it tries to disassemble data as well, so ignore those decodings (the andeq next to 0x2008, for example).
The a variable is global and uninitialized so it lands in .bss (typically... a compiler can choose to do whatever it wants so long as it implements the language correctly, and doesn't have to have something called .bss, for example, but gnu and many others do).
b is global and initialized so it lands in .data, had it been declared as const it might land in .rodata depending on the compiler and what it offers.
c is a local non-static variable that is initialized; because C offers recursion, this needs to be on the stack (or managed with registers or other volatile resources) and initialized on each run. We needed to compile without optimization to see this:
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
d is what I call a local global: it is a static local, so it lives outside the function, not on the stack, alongside the globals, but with local access only.
I added e to your example; this is a local that is not initialized, but then used. Had I not used it, and not optimized, there would probably have been space allocated for it but no initialization.
save x on the stack (per this calling convention x enters in r0)
100c: e50b0010 str r0, [r11, #-16]
Then load x from the stack, add two, and save the result as e on the stack. Read e from the stack and place it in the return location for this calling convention, which is r0:
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
For all architectures this is somewhat typical when unoptimized: variables are always read from the stack and written back quickly. Other architectures have different calling conventions with respect to where the incoming parameters and outgoing return value live.
If I optimize (-O2 on the gcc line):
Disassembly of section .text:
00001000 <foo>:
1000: e2800002 add r0, r0, #2
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
Disassembly of section .bss:
00002004 <a>:
2004: 00000000 andeq r0, r0, r0
b is a global, so at the object level a global space has to be reserved for it; it is .data, and optimization doesn't change that.
a is also global and still .bss, because at the object level it was declared such, so it is allocated in case another object needs it. The linker doesn't remove these.
Now c and d are dead code: they don't do anything and need no storage, so c is no longer allocated space on the stack, nor is d allocated any .data space.
We have plenty of registers for this architecture, this calling convention, and this code, so e does not need any memory allocated on the stack: x comes in in r0, the math can be done in r0, and the result is returned in r0.
I know I didn't tell the linker where to put .bss; having only been told where .data goes, it put .bss in the same space without complaint. I could have added -Tbss=0x3000, for example, to give it its own space, or just used a linker script. Linker scripts can play havoc with the typical results, so beware.
Typical, but there might be a compiler with exceptions:
non-constant globals go in .data or .bss depending on whether they are initialized during the declaration or not.
If const then perhaps .rodata or .text depending (or .data or .bss would technically work)
non-static locals go in general purpose registers or on the stack as needed (if not completely optimized away).
static locals (if not optimized away) live with globals but are not globally accessible they just get allocated space in .data or .bss like the globals do.
parameters are governed completely by the calling convention used by that compiler for that target. Just because arm or mips or others may have written down a convention doesn't mean a compiler has to use it; only if they claim to support some convention or standard should they then attempt to comply. For a compiler to be useful it needs a convention, whatever it is, and it must stick to it, so that both the caller and the callee of a function know where to get parameters and where to return a value. Architectures with enough registers will often have a convention where some small number of registers are used for the first so many parameters (not necessarily one to one), and the stack is used for all other parameters; likewise a register may be used, if possible, for the return value. Some architectures, due to a lack of gprs or other reasons, use the stack in both directions, or the stack in one direction and a register in the other. You are welcome to seek out the conventions and try to read them, but at the end of the day the compiler you are using, if not broken, follows a convention, and by setting up experiments like the one above you can see the convention in action.
Plus in this case optimizations.
void more_fun ( unsigned long long );
unsigned fun ( unsigned int x, unsigned long long y )
{
more_fun(y);
return(x+1);
}
If I told you that arm conventions typically use r0-r3 for the first few parameters, you might assume that x is in r0, that r1 and r2 are used for y, and that we could fit another small parameter before needing the stack. Perhaps on older arm, but now the convention wants the 64-bit variable to use an even register then an odd one.
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e1a01003 mov r1, r3
c: e1a00002 mov r0, r2
10: ebfffffe bl 0 <more_fun>
14: e2840001 add r0, r4, #1
18: e8bd4010 pop {r4, lr}
1c: e12fff1e bx lr
So r0 contains x, r2/r3 contain y, and r1 was passed over.
The test was crafted so that y is not dead code: by passing it to another function we can see where y is stored on the way into fun and on the way out to more_fun. It arrives in r2/r3 and needs to be in r0/r1 to call more_fun.
We need to preserve x for the return from fun. One might expect that x would land on the stack, which unoptimized it would, but instead the compiler saves a register that the convention states will be preserved by functions (r4) and uses r4 throughout this function to hold x. This is a performance optimization: if x needs to be touched more than once, the memory cycles of going to the stack cost more than register accesses.
Then it computes the return value and cleans up the stack and registers.
IMO it is important to see this: the calling convention comes into play for some variables, and others vary based on optimization. With no optimization they are what most folks would state off hand (.bss, .data, .text/.rodata); with optimization it depends on whether the variable survives at all.

Does using structs directly in functions use more resources than passing them as parameters in C?

Here is my question.
Is there a good way to use global context structures in an embedded C program?
I mean, is it better to pass them as parameters of a function or to use the global reference directly inside the function? Or is there no difference?
Example:
Context_t myContext; // is a structure with a lot of members
void function1(Context_t *ctx)
{
ctx->x = 1;
}
or
void function2(void)
{
myContext.x = 1;
}
Thanks.
Where to allocate variables is a program design decision, not a performance decision.
On modern systems there is not going to be much of a performance difference between your two versions.
When passing a lot of different parameters, rather than just one single pointer as in this case, there could be a performance difference. Older systems, most notably 8 bit MCUs with crappy compilers, could benefit quite a lot from using file scope variables when it comes to performance. Mostly since old legacy architectures like PIC, AVR, HC08, 8051 etc had very limited stack and register resources. If you have to maintain such old stuff, then file scope variables will improve performance.
That being said, you should allocate variables where it makes sense. If the purpose of your code unit is to process a Context_t allocated elsewhere, it should get passed as a pointer. If Context_t is private data that the caller does not need to know about, you could allocate it at file scope.
Please note that there is never a reason to declare "global" variables at file scope. All your file scope variables should have internal linkage. That is, they should be declared as static. This is perfectly fine practice in most embedded systems, particularly single core, bare metal MCU applications.
However, note that file scope variables are not thread-safe and cause complications on multi-process systems. If you are, for example, using an RTOS, you should minimize the number of such variables.
Strictly to your question: if you are going to have the global, then use it as a global directly. Having one function use it as a global and then pass it down requires setup in the caller and the consumption of a resource (register or stack slot) for the parameter, with only slight savings in the function itself:
typedef struct
{
unsigned int a;
unsigned int b;
unsigned int c;
unsigned int d;
unsigned int e;
unsigned int f;
unsigned int g;
unsigned int h;
unsigned int i;
unsigned int j;
} SO_STRUCT;
SO_STRUCT so;
unsigned int fun1 ( SO_STRUCT s )
{
return(s.a+s.g);
}
unsigned int fun2 ( SO_STRUCT *s )
{
return(s->a+s->g);
}
unsigned int fun3 ( void )
{
return(so.a+so.g);
}
Disassembly of section .text:
00000000 <fun1>:
0: e24dd010 sub sp, sp, #16
4: e24dc004 sub r12, sp, #4
8: e98c000f stmib r12, {r0, r1, r2, r3}
c: e59d3018 ldr r3, [sp, #24]
10: e59d0000 ldr r0, [sp]
14: e28dd010 add sp, sp, #16
18: e0800003 add r0, r0, r3
1c: e12fff1e bx lr
00000020 <fun2>:
20: e5902000 ldr r2, [r0]
24: e5900018 ldr r0, [r0, #24]
28: e0820000 add r0, r2, r0
2c: e12fff1e bx lr
00000030 <fun3>:
30: e59f300c ldr r3, [pc, #12] ; 44 <fun3+0x14>
34: e5930000 ldr r0, [r3]
38: e5933018 ldr r3, [r3, #24]
3c: e0800003 add r0, r0, r3
40: e12fff1e bx lr
44: 00000000 andeq r0, r0, r0
The caller of fun2 would have to load the address of the struct to pass it in, so in this case the extra consumption is that we lost a register to the parameter. Since there were so few parameters, it was a wash for a single call from a single higher function. If you continued to nest this, the best you could do is keep handing down the register:
unsigned int funx ( SO_STRUCT *s );
unsigned int fun2 ( SO_STRUCT *s )
{
return(funx(s)+3);
}
Disassembly of section .text:
00000000 <fun2>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <funx>
8: e8bd4010 pop {r4, lr}
c: e2800003 add r0, r0, #3
10: e12fff1e bx lr
So no matter whether the struct was originally global or local to some function, in this case if I call the next function and pass by reference, the first caller has to set up the parameter; with arm that is a register, r0, so either stack pointer math or a load of an address into r0. r0 goes to fun2() and can be used directly, by reference, to get at items, assuming the function is simple enough that it doesn't have to evict out to the stack. Then, calling funx() with the same pointer, fun2 does NOT have to reload r0 (in this simplified case it doesn't get much better than this), and funx() can reference items from r0 directly. Had fun2 and funx used the global directly, they both would resemble fun3 above, where each function would have a load to get the address and a word to store the address.
One would hope multiple functions in a file would share that address word, but don't make that assumption:
unsigned int fun3 ( void )
{
return(so.a+so.g);
}
unsigned int funz ( void )
{
return(so.a+so.h);
}
00000000 <fun3>:
0: e59f300c ldr r3, [pc, #12] ; 14 <fun3+0x14>
4: e5930000 ldr r0, [r3]
8: e5933018 ldr r3, [r3, #24]
c: e0800003 add r0, r0, r3
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <funz>:
18: e59f300c ldr r3, [pc, #12] ; 2c <funz+0x14>
1c: e5930000 ldr r0, [r3]
20: e593301c ldr r3, [r3, #28]
24: e0800003 add r0, r0, r3
28: e12fff1e bx lr
2c: 00000000 andeq r0, r0, r0
As your function gets more complicated, though, this optimization (simply passing r0 down as the first parameter) goes away. You end up storing and then retrieving the address of the struct, which costs a stack location, a store, and some loads, where going directly to the global would cost a flash/.text location and a load, so the global is slightly cheaper.
On a system where the parameters are passed on the stack, continuing to pass the pointer has no chance at that optimization: you have to keep copying the pointer to the stack for each nested call...
So as far as your direct question there is no correct answer other than it depends. And you would need to be really really tight on a performance or resource budget to worry about a premature optimization like that.
As far as consumption goes, globals have the benefit, on a very tightly constrained system, that their consumption is fixed and known at compile time. Having local variables as a habit, particularly structures, is going to create a lot of stack use, which is dynamic and much harder to measure (it can change with each line of code you add or remove; spend a week trying to determine the usage, then add a line and it could move anywhere from nothing to a few percent to tens of percent). At the same time, a one-time or few-time use variable or structure MIGHT be better served locally; it depends on how deep it is in the nested functions. If it is at the end of the chain it doesn't cost much; if declared locally at the top function it costs the same as being global, but is now on the stack and not measured at compile time. One struct, ehhh, no biggie; as a habit, that is when it matters.
So, to your specific question: it cannot be determined ahead of time, and one cannot make a general rule that it is "faster" to pass by reference or to use the global directly, as one can easily create use cases that demonstrate each being true. The wee bitty improvement would come from knowing your memory consumption at compile time (global) vs runtime (local). But your question was not about local vs global; it was about access to the global.
It is much better to pass a reference to the structure than to modify the structure as a global. Passing a reference makes it visible that the function is (potentially) changing the structure.
From a performance standpoint there won't be a measurable difference.
If the number of structure accesses is significant, passing the reference can also result in significantly smaller code.
Global variables are generally best avoided; there are plenty of reasons for it. While some find it easy to share a single resource between many functions with global variables, there are flip sides to it, be it ease of code understanding, or dependencies and tight coupling of variables. Many times we end up using libraries, and with modules that are linked dynamically it is troublesome if different libraries have their own instances of global variables.
So, with direct reference to your question, I would prefer
void function1(Context_t *ctx)
against anything that involves changing a global variable.
But again, if the necessary precautions are taken in terms of the tight coupling of global variables and functions, it is okay to go with the existing implementation that has global variables, rather than scrapping the whole tested thing and starting over.
