A course on ARM assembly recently started at my university, and our assignment is to create an NxM * MxP matrix multiplication program that is called from C code.
Now I have fairly limited knowledge of assembler, but I'm more than willing to learn. What I would like to know is this:
How to read/pass 2D arrays from C to ASM?
How to output a 2D array back to C?
I'm thinking that I can figure the rest of this out by myself, but these two points are what I find difficult.
I am using ARM assembly on QEMU, on Ubuntu; this code isn't targeting any particular device.
C arrays passed as arguments decay to pointers, so when you pass a C array to an assembly function, you will get a pointer to the area of memory that holds the contents of the array.
For retrieving the argument, it depends on what calling convention you use. The ARM EABI stipulates that:
The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result
value from a function. They may also be used to hold intermediate values within a routine (but, in general, only
between subroutine calls).
For simple functions, then, you should find the pointer to your array in r0 to r3 depending on your function signature. Otherwise, you will find it on the stack. A good technique to find out exactly what the ABI does is to disassemble the object file of the C code that calls your assembly function and check what it does prior to the call.
For instance, on Linux, you can compile the following C code in a file called testasm.c:
extern int myasmfunc(int *);
static int array[] = { 0, 1, 2 };
int mycfunc()
{
return myasmfunc(array);
}
Then compile it with:
arm-linux-gnueabi-gcc -c testasm.c
And finally get a disassembly with:
arm-linux-gnueabi-objdump -S testasm.o
The result is:
testasm.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <mycfunc>:
0: e92d4800 push {fp, lr}
4: e28db004 add fp, sp, #4
8: e59f000c ldr r0, [pc, #12] ; 1c <mycfunc+0x1c>
c: ebfffffe bl 0 <myasmfunc>
10: e1a03000 mov r3, r0
14: e1a00003 mov r0, r3
18: e8bd8800 pop {fp, pc}
1c: 00000000 andeq r0, r0, r0
You can see that the single-parametered function myasmfunc is called by putting the parameter into register r0. The meaning of ldr r0, [pc, #12] is "load into r0 the content of the memory address that is at pc+12". That is where the pointer to the array is stored.
Even though Guillaume's answer helped me a lot, I thought I'd answer my own question with a bit of code.
What I ended up doing was creating a 1D array and passing it to asm along with its dimensions.
int h1, w1;          /* matrix dimensions read from the user */
int *p;
scanf("%d", &h1);
scanf("%d", &w1);
int *A = (int *) malloc(sizeof(int) * (w1 * h1));   /* flat h1 x w1 matrix */
p = A;
int i;
for (i = 0; i < (w1 * h1); i++)
{
    scanf("%d", p++);   /* fill the matrix row by row */
}
That being said, I allocated another array the same (malloc) way and passed it along as well. I then just stored the int values I needed at the appropriate addresses in the assembly code, and since the addresses of the array elements don't change, I just used the same array to output the result.
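For anyone following the same route, here is a minimal C sketch (my own illustration, not part of the original code) of the row-major index math that the assembly routine has to reproduce: element [i][j] of an R x C matrix stored in one flat block lives at index i*C + j. The name matmul and the argument order are just placeholders.
/* Reference version of the flattened multiply: A is h1 x w1, B is w1 x w2,
   and the result is h1 x w2, all stored row-major in single malloc'd blocks. */
void matmul(const int *A, const int *B, int *result, int h1, int w1, int w2)
{
    int i, j, k;
    for (i = 0; i < h1; i++) {
        for (j = 0; j < w2; j++) {
            int sum = 0;
            for (k = 0; k < w1; k++)
                sum += A[i * w1 + k] * B[k * w2 + j];   /* [i][k] * [k][j] */
            result[i * w2 + j] = sum;                   /* same address the asm writes */
        }
    }
}
With the ARM EABI, the first four of these six arguments arrive in r0-r3 and the remaining two go on the stack, so the assembly version has to fetch the last dimensions from there.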
I am wondering where the link register is used in an ARM CPU. As I understand it, it stores the return address of functions. But does every return address go into this register after a function call, or is it only related to leaf subroutine implementation? How is it handled in functions that have to use the stack (for storing data or additional return addresses): is LR still used there in any way?
BL instruction
Operation
if ConditionPassed(cond) then
LR = address of the instruction after the branch instruction
PC = PC + (SignExtend(signed_immed_24) << 2)
Usage
The BL instruction is used to perform a subroutine call. The return
from subroutine is achieved by copying the LR to the PC. Typically,
this is done by one of the following methods:
- Executing a BX R14 instruction.
- Executing a MOV PC,R14 instruction.
And newer ARMs go on to allow returning with pop {pc} and other forms...
Seems quite clear to me what the usage of LR is.
You can easily try it yourself as well:
unsigned int more_fun ( unsigned int );
unsigned int fun0 ( unsigned int x )
{
return(x+1);
}
unsigned int fun1 ( unsigned int x )
{
return(more_fun(x)+1);
}
unsigned int fun2 ( unsigned int x )
{
return(more_fun(x));
}
unsigned int fun3 ( unsigned int x )
{
return(3);
}
00000000 <fun0>:
0: e2800001 add r0, r0, #1
4: e12fff1e bx lr
00000008 <fun1>:
8: e92d4010 push {r4, lr}
c: ebfffffe bl 0 <more_fun>
10: e8bd4010 pop {r4, lr}
14: e2800001 add r0, r0, #1
18: e12fff1e bx lr
0000001c <fun2>:
1c: e92d4010 push {r4, lr}
20: ebfffffe bl 0 <more_fun>
24: e8bd4010 pop {r4, lr}
28: e12fff1e bx lr
0000002c <fun3>:
2c: e3a00003 mov r0, #3
30: e12fff1e bx lr
Because, as documented, bl modifies the link register. In order to return from a non-leaf function you need to preserve the link register across that call, i.e. the return address, so you push it on the stack. The convention for this compiler wants the stack 64-bit aligned, so the addition of the r4 register is simply to facilitate that alignment; r4 is otherwise not involved here.
You can see that the leaf function does not use the stack because it has no reason to: the link register does not get modified during the function, and in this case the function is too simple to need the stack for other reasons. If you need the stack but are still a leaf function, the optimizer does not need to put lr on the stack, although if it needs another register for alignment reasons it is free to use r14 as well as any of the other registers.
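As a small illustration of that point (my own sketch, not from the original answer): a leaf function can still need stack space for locals even though nothing in it overwrites the link register, so lr itself never has to be saved. Exact code generation is of course compiler- and option-dependent.
/* A leaf: may need stack room for buf (if it isn't optimized into registers),
   but contains no call, so the compiler has no reason to save lr and can
   simply return with bx lr. */
unsigned int leaf_sum ( const unsigned int *src, unsigned int n )
{
    unsigned int buf[16];
    unsigned int i, sum;
    if(n>16) n=16;
    for(i=0;i<n;i++) buf[i]=src[i]+i;
    sum=0;
    for(i=0;i<n;i++) sum+=buf[i];
    return(sum);
}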
Now if we force something on the stack (non-leaf)
unsigned int new_fun ( unsigned int, unsigned int );
unsigned int fun4 ( unsigned int x, unsigned int y)
{
return(new_fun(x,y)+y);
}
00000034 <fun4>:
34: e92d4010 push {r4, lr}
38: e1a04001 mov r4, r1
3c: ebfffffe bl 0 <new_fun>
40: e0800004 add r0, r0, r4
44: e8bd4010 pop {r4, lr}
48: e12fff1e bx lr
lr has to be on the stack because a bl is used to call the next function. In this case per the convention they chose to use r4 to save the y variable (in r1 coming in) so that it can be used after the return of the nested call. Since only two registers need to be preserved, and that fits with the stack alignment rule then r4 and lr are saved and in this case both are used (r4 is not just to align the stack).
Not sure what you mean by additional return addresses. Perhaps you are thinking that as each function makes a call there is a return address on the stack to preserve, and that is true, but you really only need to look at it one function at a time; that is the beauty of calling conventions. For this architecture, ideally using bl to make function calls (as pointed out in another answer you don't have to, but it would be silly not to), lr is modified by every call to a subroutine, so the calling function loses its return address to its caller and needs to preserve it locally somehow. As we saw with fun4, technically they could for example:
fun2:
push {r4, r5}
mov r5,lr
bl more_fun
mov r1,r5
pop {r4, r5}
bx r1
and not actually save lr on the stack. Newer ARMs than the one I am building for you will see this
00000008 <fun1>:
8: e92d4010 push {r4, lr}
c: ebfffffe bl 0 <more_fun>
10: e2800001 add r0, r0, #1
14: e8bd8010 pop {r4, pc}
00000018 <fun2>:
18: eafffffe b 0 <more_fun>
The contents of lr are on the stack (lr itself is of course a register, so it can't literally be "on the stack"), and after ARMv4T you can pop directly into the pc and even change modes between ARM and Thumb (where before only bx could be used for Thumb interwork).
Also note the tail optimization for fun2. This means that fun2 did not even push the return address on the stack.
Seems pretty obvious if you look at the ARM docs how lr is used. Then think about how a compiler would implement a standard function, and what optimizations it might do. And of course you can just try it and see what certain compilers actually generate.
The return address is a parameter, which is hidden from C and other high level languages, but visible in assembly & machine code.
On ARM, called functions (callees) rely on the return address being passed in the lr register; this by specification of the calling convention — so callers using the standard ARM calling convention must put the return address there, in lr, to satisfy this parameter passing requirement and expectation.
Standard calling conventions are designed so that a caller can properly invoke a callee — knowing nothing more about the callee than its function signature. Thus, a caller is abstracted from knowing the implementation details of the callee, beyond the parameters and return value(s). This means that a function can evolve (for bug fixes, or other) without having to revisit (or recompile) callers, as long as the signature is unmodified.
Whether a function is a leaf function (or not) is an aspect of its internal implementation and not visible in function signatures. So, a caller does not know (and should not have to know) if the callee is leaf or not, or if a function changes from leaf to non-leaf during some bug fix versioning.
A function's use of the stack is also an internal implementation detail not captured in function signature, so will not affect how functions are called, and where return address value is passed and expected.
So, there really is only one way to pass the return address.
Callees that need to use lr (because they call other functions, or maybe just want to use that register) will need to preserve the return address value provided to them as a parameter by their callers, for later use in returning to them (assuming they want to return).
Function implementations that use lr (and so preserve the value for later) don't have to restore that preserved return address back into the lr register; the calling convention does not require the return address to be passed back to callers. So sometimes the lr register is restored and then used, but at other times on ARM the return address is popped directly off the stack into the program counter, bypassing lr, i.e. without restoring the lr register.
You could create your own calling convention that passed the return address in a different location, i.e. in a different register, or, by pushing it onto the stack!
Some languages do diverge from the standard calling conventions (in minor ways) and then still support the standard calling conventions for their interoperability with C-style functions.
The hardware is designed to support collecting the return address into the lr register while making the call that transfers control to the callee, all in one instruction, so it would be silly to avoid that. The hardware also offers no other particularly efficient way to capture the return address, and there is no real reason for it to.
I have C code like this, which will possibly be compiled into an ELF file for ARM:
int a;
int b=1;
int foo(int x) {
int c=2;
static float d=1.5;
// .......
}
I know that all the executable code goes into the .text section, while .data, .bss and .rodata will contain the various variables/constants.
My question is: does a line like int b=1; here also add something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped to RAM when deployed on the final hardware)?
Moreover, trying to disassemble similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also some lines of .text where the value '2' was actually stored.
So, in general, does a declaration always also imply something added to .text at an assembly level? If yes, does it depend on the context (i.e. whether the variable is inside a function, whether it is local or global, ...), and what is actually added?
Thanks a lot in advance.
does a line like int b=1; here also add something to the .text section, or does it only tell the compiler to place a new variable initialized to 1 in .data (then probably mapped to RAM when deployed on the final hardware)?
You understand that this is likely to be implementation-specific, but the likelihood is that you will just get initialised data in the data section. Were it a constant, it might instead go into the text section.
Moreover, trying to disassemble similar code, I noticed that a line such as int c=2;, inside the function foo(), was adding something to the stack, but also some lines of .text where the value '2' was actually stored.
Automatic variables that are initialised have to be initialised each time the function's scope is entered. The space for c is reserved on the stack (or in a register, depending on the ABI), but the program has to remember the constant from which it is initialised, and this is best placed somewhere in the text segment, either as a constant value or as a "move immediate" instruction.
So, in general, does a declaration always also imply something added to .text at an assembly level?
No. If a static variable is initialised to zero or null, or not initialised at all, it is often enough to just reserve space in .bss. If a static, non-constant variable is initialised to a non-zero value, it will just be put in the data segment.
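A compact illustration of those typical placements (toolchain-dependent, and my own example rather than the asker's code):
int zeroed;               /* no initialiser: typically reserved in .bss            */
static int flag;          /* zero-initialised static: also .bss                    */
int counted = 42;         /* non-zero initialiser: typically .data                 */
const int limit = 10;     /* constant: often .rodata (or .text) rather than .data  */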
As #goodvibration correctly stated, only global or static variables go into those sections. This is because their lifetime is the whole execution time of the program.
Local variables have a different lifetime. They exist only during the execution of the block (e.g. function) they are defined in. If a function is called, all parameters that do not fit into registers are pushed to the stack and the return address is written to the link register.* The function possibly saves the link register and other registers on the stack and reserves some space on the stack for local variables (this is the code you observed). At the end of the function, the saved registers are popped and the stack pointer is readjusted. In this way, you get an automatic "garbage collection" for local variables.
*: Please note that this is true for (some calling conventions of) ARM only. It's different, e.g., for Intel processors.
This is one of those "just try it" things.
int a;
int b=1;
int foo(int x) {
int c=2;
static float d=1.5;
int e;
e=x+2;
return(e);
}
First, without optimization:
arm-none-eabi-gcc -c so.c -o so.o
arm-none-eabi-objdump -D so.o
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf > so.list
Don't worry about the warning; we needed to link to see that everything found a home.
Disassembly of section .text:
00001000 <foo>:
1000: e52db004 push {r11} ; (str r11, [sp, #-4]!)
1004: e28db000 add r11, sp, #0
1008: e24dd014 sub sp, sp, #20
100c: e50b0010 str r0, [r11, #-16]
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
102c: e28bd000 add sp, r11, #0
1030: e49db004 pop {r11} ; (ldr r11, [sp], #4)
1034: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
00002004 <d.4102>:
2004: 3fc00000 svccc 0x00c00000
Disassembly of section .bss:
00002008 <a>:
2008: 00000000 andeq r0, r0, r0
As a disassembly it tries to disassemble data too, so ignore those instructions (the andeq next to 0x2008, for example).
The a variable is global and uninitialized, so it lands in .bss (typically... a compiler can choose to do whatever it wants so long as it implements the language correctly, and doesn't have to have something called .bss, for example, but gnu and many others do).
b is global and initialized, so it lands in .data; had it been declared const it might land in .rodata, depending on the compiler and what it offers.
c is a local, non-static variable that is initialized; because C allows recursion, this needs to be on the stack (or managed with registers or other volatile resources) and initialized on each run. We needed to compile without optimization to see this:
1010: e3a03002 mov r3, #2
1014: e50b3008 str r3, [r11, #-8]
d is what I call a local global: it is a static local, so it lives outside the function, not on the stack, alongside the globals, but with local access only.
I added e to your example; this is a local, not initialized, but then used. Had I not used it and not optimized, there probably would have been space allocated for it but no initialization.
save x on the stack (per this calling convention x enters in r0)
100c: e50b0010 str r0, [r11, #-16]
Then load x from the stack, add two, and save it as e on the stack. Then read e from the stack and place it in the return location for this calling convention, which is r0:
1018: e51b3010 ldr r3, [r11, #-16]
101c: e2833002 add r3, r3, #2
1020: e50b300c str r3, [r11, #-12]
1024: e51b300c ldr r3, [r11, #-12]
1028: e1a00003 mov r0, r3
For all architectures, unoptimized code like this is somewhat typical: always read variables from the stack and put them back quickly. Other architectures have different calling conventions with respect to where the incoming parameters and the outgoing return value live.
If I optimize (-O2 on the gcc line):
Disassembly of section .text:
00001000 <foo>:
1000: e2800002 add r0, r0, #2
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <b>:
2000: 00000001 andeq r0, r0, r1
Disassembly of section .bss:
00002004 <a>:
2004: 00000000 andeq r0, r0, r0
b is a global, so at the object level space has to be reserved for it; it is .data, and optimization doesn't change that.
a is also global and still .bss, because at the object level it was declared such, so it is allocated in case another object needs it. The linker doesn't remove these.
Now c and d are dead code: they don't do anything and need no storage, so c is no longer allocated space on the stack, nor is d allocated any .data space.
We have plenty of registers for this architecture, for this calling convention, for this code, so e does not need any memory allocated on the stack: x comes in in r0, the math can be done with r0, and the result is returned in r0.
I know I didn't tell the linker where to put .bss; by only telling it .data, it put .bss in the same space without complaint. I could have added -Tbss=0x3000, for example, to give it its own space, or just used a linker script. Linker scripts can play havoc with the typical results, so beware.
Typical, but there might be a compiler with exceptions:
- Non-constant globals go in .data or .bss depending on whether they are initialized in the declaration or not.
- If const, then perhaps .rodata or .text depending on the compiler (.data or .bss would technically work too).
- Non-static locals go in general-purpose registers or on the stack as needed (if not completely optimized away).
- Static locals (if not optimized away) live with the globals but are not globally accessible; they just get allocated space in .data or .bss like the globals do.
Parameters are governed completely by the calling convention used by that compiler for that target. Just because ARM or MIPS or others may have written down a convention doesn't mean a compiler has to use it; only if it claims to support some convention or standard should it then attempt to comply. For a compiler to be useful it needs a convention and has to stick to it, whatever it is, so that both caller and callee of a function know where to get parameters and where to return a value. Architectures with enough registers will often have a convention where some small number of registers are used for the first so many parameters (not necessarily one to one) and the stack is used for all other parameters; likewise a register may be used, if possible, for the return value. Some architectures, due to a lack of GPRs or other reasons, use the stack in both directions, or the stack in one direction and a register in the other. You are welcome to seek out the conventions and try to read them, but at the end of the day the compiler you are using, if not broken, follows a convention, and by setting up experiments like the one above you can see the convention in action.
Plus, in this case, optimizations come into play.
void more_fun ( unsigned long long );
unsigned fun ( unsigned int x, unsigned long long y )
{
more_fun(y);
return(x+1);
}
If I told you that ARM conventions typically use r0-r3 for the first few parameters, you might assume that x is in r0, r1 and r2 are used for y, and we could fit another small parameter before needing the stack. Well, perhaps on older ARM conventions, but now the convention wants the 64-bit variable to start in an even-numbered register (an even/odd pair).
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: e1a01003 mov r1, r3
c: e1a00002 mov r0, r2
10: ebfffffe bl 0 <more_fun>
14: e2840001 add r0, r4, #1
18: e8bd4010 pop {r4, lr}
1c: e12fff1e bx lr
So r0 contains x, r2/r3 contain y, and r1 was passed over.
The test was crafted so that y is not dead code and is passed to another function; that way we can see where y is stored on the way into fun and on the way out to more_fun: r2/r3 on the way in, and it needs to be in r0/r1 to call more_fun.
We need to preserve x across the call for the return from fun. One might expect that x would land on the stack, which, unoptimized, it would; instead the compiler saves a register that the convention says will be preserved by functions (r4) and uses r4 in this function to hold x. A performance optimization: if x needs to be touched more than once, memory cycles going to the stack cost more than register accesses.
Then it computes the return value and cleans up the stack and registers.
IMO it is important to see this: the calling convention comes into play for some variables, and others can vary based on optimization. With no optimization they end up where most folks are going to state off-hand, .bss, .data (.text/.rodata); with optimization, it depends on whether the variable survives at all.
Here is my question.
Is there a good way to use global context structures in an embedded C program?
I mean, is it better to pass them as parameters of a function or to use the global reference directly inside the function? Or is there no difference?
Example:
Context_t myContext; // is a structure with a lot of members
void function1(Context_t *ctx)
{
ctx->x = 1;
}
or
void function2(void)
{
myContext.x = 1;
}
Thanks.
Where to allocate variables is a program design decision, not a performance decision.
On modern systems there is not going to be much of a performance difference between your two versions.
When passing a lot of different parameters, rather than just one single pointer as in this case, there could be a performance difference. Older systems, most notably 8-bit MCUs with crappy compilers, could benefit quite a lot from using file-scope variables when it comes to performance, mostly since old legacy architectures like PIC, AVR, HC08, 8051 etc. had very limited stack and register resources. If you have to maintain such old stuff, then file-scope variables will improve performance.
That being said, you should allocate variables where it makes sense. If the purpose of your code unit is to process a Context_t allocated elsewhere, it should get passed as a pointer. If Context_t is private data that the caller does not need to know about, you could allocate it at file scope.
Please note that there is never a reason to declare "global" variables (with external linkage) at file scope. All your file-scope variables should have internal linkage; that is, they should be declared static. This is perfectly fine practice in most embedded systems, particularly single-core, bare-metal MCU applications.
However, note that file-scope variables are not thread-safe and cause complications on multi-process systems. If you are, for example, using an RTOS, you should minimize the number of such variables.
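For instance, a minimal sketch of the file-scope variant with internal linkage (reusing the names from the question):
/* the context is private to this translation unit; nothing outside this
   file can touch it, which at least keeps the "global" contained */
static Context_t myContext;

void function2(void)
{
    myContext.x = 1;
}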
Strictly to your question: if you are going to have the global, then use it as a global directly. Having one function use it as a global and then pass it down requires setup in the caller and the consumption of a resource (register or stack slot) for the parameter, with only a slight saving in the called function itself:
typedef struct
{
unsigned int a;
unsigned int b;
unsigned int c;
unsigned int d;
unsigned int e;
unsigned int f;
unsigned int g;
unsigned int h;
unsigned int i;
unsigned int j;
} SO_STRUCT;
SO_STRUCT so;
unsigned int fun1 ( SO_STRUCT s )
{
return(s.a+s.g);
}
unsigned int fun2 ( SO_STRUCT *s )
{
return(s->a+s->g);
}
unsigned int fun3 ( void )
{
return(so.a+so.g);
}
Disassembly of section .text:
00000000 <fun1>:
0: e24dd010 sub sp, sp, #16
4: e24dc004 sub r12, sp, #4
8: e98c000f stmib r12, {r0, r1, r2, r3}
c: e59d3018 ldr r3, [sp, #24]
10: e59d0000 ldr r0, [sp]
14: e28dd010 add sp, sp, #16
18: e0800003 add r0, r0, r3
1c: e12fff1e bx lr
00000020 <fun2>:
20: e5902000 ldr r2, [r0]
24: e5900018 ldr r0, [r0, #24]
28: e0820000 add r0, r2, r0
2c: e12fff1e bx lr
00000030 <fun3>:
30: e59f300c ldr r3, [pc, #12] ; 44 <fun3+0x14>
34: e5930000 ldr r0, [r3]
38: e5933018 ldr r3, [r3, #24]
3c: e0800003 add r0, r0, r3
40: e12fff1e bx lr
44: 00000000 andeq r0, r0, r0
The caller of fun2 would have to load the address of the struct to pass it in, so in this case the extra consumption is that we lost a register to the parameter; since there were so few parameters it was a wash for a single call from a single higher function. If you continued to nest this, the best you could do is keep handing down the register:
unsigned int funx ( SO_STRUCT *s );
unsigned int fun2 ( SO_STRUCT *s )
{
return(funx(s)+3);
}
Disassembly of section .text:
00000000 <fun2>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <funx>
8: e8bd4010 pop {r4, lr}
c: e2800003 add r0, r0, #3
10: e12fff1e bx lr
So no matter whether the struct was originally global or local to some function: if I call the next function and pass by reference, the first caller has to set up the parameter, which with ARM is a register, r0, so stack-pointer math or a load of an address into r0. r0 goes to fun2() and can be used directly by reference to get at items, assuming the function is simple enough that it doesn't have to evict out to the stack. Then, calling funx() with the same pointer, fun2 does NOT have to reload r0 (in this simplified case it doesn't get much better than this) and funx() can reference items from r0 directly. Had fun2 and funx used the global directly, they would both resemble fun3 above, where each function needs a load to get the address plus a word to hold the address.
One would hope multiple functions in a file would share that address word, but don't make that assumption:
unsigned int fun3 ( void )
{
return(so.a+so.g);
}
unsigned int funz ( void )
{
return(so.a+so.h);
}
00000000 <fun3>:
0: e59f300c ldr r3, [pc, #12] ; 14 <fun3+0x14>
4: e5930000 ldr r0, [r3]
8: e5933018 ldr r3, [r3, #24]
c: e0800003 add r0, r0, r3
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <funz>:
18: e59f300c ldr r3, [pc, #12] ; 2c <funz+0x14>
1c: e5930000 ldr r0, [r3]
20: e593301c ldr r3, [r3, #28]
24: e0800003 add r0, r0, r3
28: e12fff1e bx lr
2c: 00000000 andeq r0, r0, r0
As your function gets more complicated, though, this optimization (simply passing r0 down as the first parameter) goes away. So you end up storing and then retrieving the address of the struct, which costs a stack location, a store, and some loads, where going direct to the global would be a flash/.text location and a load, so slightly cheaper.
On a system where the parameters are passed on the stack, continuing to pass the pointer does not even have a chance at that optimization: you have to keep copying the pointer to the stack for each nested call...
So as far as your direct question, there is no correct answer other than it depends. And you would need to be really, really tight on a performance or resource budget to worry about a premature optimization like that.
As far as consumption goes, globals have the benefit, on a very tightly constrained system, of being fixed: it is known at compile time what their consumption is. Having local variables, particularly structures, as a habit is going to create a lot of stack use, which is dynamic and much harder to measure (it can change with each line of code you add or remove; spend a week trying to determine the usage, then add a line and you could gain anything from nothing to a few percent to tens of percent). At the same time, a variable or structure used only once or a few times MIGHT be better served locally; it depends on how deep in the nested functions it is. If it is at the end, it doesn't cost much; if declared locally in the top function, it costs the same as being global but is now on the stack and not measured at compile time. One struct, ehhh, no biggie; as a habit, that is when it matters.
So, to your specific question: it cannot be determined ahead of time, and you cannot make a general rule that it is "faster" to pass by reference or to use the global directly, as one can easily create use cases that demonstrate each being true. The wee bitty improvement would come from knowing your memory consumption at compile time (global) vs runtime (local). But your question was not about local vs global; it was about access to the global.
Much better to pass a reference to the structure than to modify the structure as a global. Passing a reference makes it visible that the function is (potentially) changing the structure.
From a performance standpoint there won't be a measurable difference.
If the number of structure accesses is significant, passing the reference can also result in significantly smaller code.
Global variables are generally best avoided; there are plenty of reasons for this. With global variables some find it easy to share a single resource between many functions, but there are flipsides to it, be it ease of understanding the code, be it dependencies and tight coupling of variables. Many times we end up using libraries, and with modules that are linked dynamically it is troublesome if different libraries have their own instances of global variables.
So, with direct reference to your question, I would prefer
void function1(Context_t *ctx)
against anything that involves changing a global variable.
But again, if the necessary precautions are taken with regard to tight coupling of global variables and functions, it is okay to go with an existing implementation that has global variables, rather than scrapping the whole tested thing and starting over.
Here is the case:
I've tried to investigate a bit the advantages/disadvantages of implementing functions as follows:
void foo(const int a, const int b)
{
...
}
with the common function prototype that is used as the API and included in the header file, as shown below:
void foo(int a, int b)
I've found quite a big discussion about this topic in the following question:
Similar Q
I agree with the answer there from rlerallut, who talks about self-documenting code and about being a little bit paranoid on the security angle of your code.
However, and this is the question, someone wrote there that using const for ordinary by-value parameters passed to a function can bring some optimization benefits. My question is: does anybody have a real-life example which proves this claim?
"it might help the compiler optimize things a bit (though it's a long shot)."
I can't see how it would make a difference. It is most useful in this case for generating compiler warnings/errors when you try to modify the const variable...
If you were to invent an experiment to compare a pass-by-value function parameter declared const or not, for the purpose of optimization, that experiment could not modify the variable(s), because when const is used you would expect a warning/error. An optimizer that might be able to care would already know that the variable is not modified in the code, with or without the declaration, and can act accordingly. So how would the declaration matter? If I found such a difference I would file it as a bug on the compiler's bug board.
For example, here is a missed opportunity I found when playing with const vs not.
Note that const or not doesn't matter...
void fun0x ( int a, int b);
void fun1x ( const int a, const int b);
int fun0 ( int a, int b )
{
fun0x(a,b);
return(a+b);
}
int fun1 ( const int a, const int b )
{
fun1x(a,b);
return(a+b);
}
gcc produced with a -O2 and -O3 (and -O1?)
00000000 <fun0>:
0: e92d4038 push {r3, r4, r5, lr}
4: e1a05000 mov r5, r0
8: e1a04001 mov r4, r1
c: ebfffffe bl 0 <fun0x>
10: e0850004 add r0, r5, r4
14: e8bd4038 pop {r3, r4, r5, lr}
18: e12fff1e bx lr
0000001c <fun1>:
1c: e92d4038 push {r3, r4, r5, lr}
20: e1a05000 mov r5, r0
24: e1a04001 mov r4, r1
28: ebfffffe bl 0 <fun1x>
2c: e0850004 add r0, r5, r4
30: e8bd4038 pop {r3, r4, r5, lr}
34: e12fff1e bx lr
Where this would have worked with fewer cycles...
push {r4,lr}
add r4,r0,r1
bl fun1x
mov r0,r4
pop {r4,lr}
bx lr
clang/llvm did the same thing: the add after the function call, burning extra registers and stack locations.
Googling showed mostly discussions about const by reference rather than const by value, and then the nuances of C and C++ as to what will or won't, can or can't, change with the const declaration, etc.
If you use const on a global variable, then it can leave that item in .text and not have it in .data (and, for your STM32 microcontroller, not have to copy it from flash to RAM). But that doesn't fit your rules. The optimizer may not care and may not actually reach out to that variable's home; it might know to encode it directly into the instruction as an immediate, based on the instruction set, etc... All things being equal, though, a non-const global would have that same benefit if not declared volatile...
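A quick sketch of that difference (exact section names depend on the toolchain; the variable names are made up):
const int coeffs[4] = { 1, 2, 3, 4 };  /* const global: can stay in .rodata/.text, i.e. in flash */
int       gains[4]  = { 1, 2, 3, 4 };  /* non-const global: .data, copied from flash to RAM at startup */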
Following your rules, the const saves on some human error: if you try to put that const variable on the left side of an equals sign, the compiler will let you know.
I would consider it a violation of your rules, but if inside the function where it was passed by value you then did some pass-by-reference things, you could play the pass-by-reference const-vs-not optimization game....
I have read a lot of topics on this forum and found a lot of answers on this subject. I managed to pass 5 arguments to a C function from my assembly code. To do this, I used the instructions below:
mov r0, #0
mov r1, #1
mov r2, #2
mov r3, #3
mov r4, #4
STR r4, [sp, #-4]!
BL displayRegistersValue
But today I'm trying to pass all the registers to a C function to save them in a C structure. I tried with this instruction:
STMDB sp!, {registers that i want to save}
My C function:
displayRegistersValue(int registers[number_of_registers])
{
    char printable = registers[0] + (int)'0'; // convert to a printable character
    print_uart0(&printable);
}
But my display is not good. So, how can I access the registers in C code?
Pretty sure the ARM standard only uses R0-R3 for passing argument values in registers, so 4 max. If you need more values, then push them onto the stack and access them that way, like the compiler does. Or make a struct and pass its address.
OK, double checked and I was right; here is a link to the ARM calling conventions - down the page a bit.
To do what you want, pass the address of some memory location (an array) into your assembly routine. Once you have that address, probably in r0, you can stm into that location all the register values you want, and that memory will be viewable at the C level.
Beware, this probably isn't going to do what you think it will. Those values are allowed to change quite a bit as per the calling convention link above. If this is for debugging, you are better off using a debugger and watching the registers that way.
Ok, you are still not understanding here:
{
    int registerValues[14];
    myAsmRoutine(registerValues);
    print_uart0(registerValues);
}
myAsmRoutine:
    stmia r0!, {r1-r14}
    bx lr
I skipped R0 and PC, but you get the idea. Also, you will need to do something a bit more complex to change the values into a printable format - sprintf or itoa or something like that, as sketched below.
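For the printable-format part, something along these lines would do (a sketch only; it assumes print_uart0() takes a nul-terminated string, which the original snippets do not show):
#include <stdio.h>

void print_uart0(const char *s);   /* assumed: prints a nul-terminated string */

void displayRegistersValue(unsigned int registers[], unsigned int count)
{
    char line[24];
    unsigned int i;
    for (i = 0; i < count; i++)
    {
        /* r0 was skipped by the stmia above, so the array starts with r1 */
        sprintf(line, "r%u = 0x%08X\r\n", i + 1, registers[i]);
        print_uart0(line);
    }
}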
displayRegistersValue(int registers[number_of_registers])
This is an array, not a structure, and it is passed as a pointer to something, not as a long list of items. The same goes for structures, by the way.
It is usually easiest to construct a C function that does what you want to do in asm, see what the compiler produces, and then go from there (use the ABI document to confirm, etc.).
#define NUMREGS 13
void displayRegistersValue(unsigned int registers[NUMREGS]);
void outer ( void )
{
unsigned int regs[NUMREGS];
displayRegistersValue(regs);
}
> arm-none-linux-gnueabi-gcc -O2 -c fun.c -o fun.o
> arm-none-linux-gnueabi-objdump -D fun.o
fun.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <outer>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e24dd03c sub sp, sp, #60 ; 0x3c
8: e28d0004 add r0, sp, #4
c: ebfffffe bl 0 <displayRegistersValue>
10: e28dd03c add sp, sp, #60 ; 0x3c
14: e49df004 pop {pc} ; (ldr pc, [sp], #4)
You will need to do something similar: make room on the stack by subtracting from the stack pointer, save lr so you don't trash it with the branch-link, copy your registers to that memory (the stack), point r0 at the beginning of the memory/array you want to pass, then call the function (r0 being the first and only parameter you are passing to the function).
    push {lr}
    stmdb sp!,{r0-r12}
    mov r0,sp          @ the saved r0..r12 start at the new sp (r0 at the lowest address)
    bl displayRegistersValue
    add sp,sp,#52      @ 13 registers * 4 bytes
    pop {lr}
    bx lr
An array is passed as a pointer in a single register. If you want to pass 5 register values individually, then you need 5 parameters (int i1, int i2, etc.).
To quote from the ARM APCS document:
"The first four registers r0-r3 (a1-a4) are used to pass argument values into a subroutine and to return a result value from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls)."
So if you want to pass more than 4 values to a C function, you need to pass the rest of the values on the stack. A better idea would be to put the register values in a memory region that has been statically allocated and pass the address of the memory (pointer) to the C function. The pointer can be de-referenced by the function to get to the register values.
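A minimal sketch of that statically allocated approach (saveRegisters and dump are placeholder names, not from the question; the asm side would simply stm the registers you want through the pointer it receives in r0):
#define NUMREGS 13

static unsigned int regbuf[NUMREGS];            /* lives in .bss, not on the stack   */

extern void saveRegisters(unsigned int *dst);   /* asm routine that fills the buffer */
extern void displayRegistersValue(unsigned int *registers);

void dump(void)
{
    saveRegisters(regbuf);            /* address passed in r0 per the EABI   */
    displayRegistersValue(regbuf);    /* C side just dereferences the pointer */
}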