I created a C99 VLA function as such :
void create_polygon(int n, int faces[][n]);
I want to call this function in another function where I would allocate my two-dimensional array :
void parse_faces()
{
int faces[3][6];
create_polygon(6, faces);
}
When I pass a two-dimensional array as an argument, it passes a pointer to a 6 integer array, referencing the stack memory in the calling function.
The VLA argument here only acts as a type declaration (not allocating any actual memory), telling the compiler to access the data in row-major order with ((int*)faces)[i * 6 + j] instead of faces[i][j].
What is the difference between declaring functions with a VLA argument or with a fixed size ?
faces[i][j] always is equivalent to *(*(faces + i) + j), no matter if VLA or not.
Now let's compare two variants (not considering that you actually need the outer dimension as well to prevent exceeding array bounds on iterating):
void create_polygon1(int faces[][6]);
void create_polygon2(int n, int faces[][n]);
It doesn't matter whether the array passed in was originally created as a classic array or as a VLA: the first function accepts only arrays whose rows have exactly 6 elements, while the second accepts rows of arbitrary length n (assuming this is clear so far...).
faces[i][j] will now be translated to:
*((int*)faces + (i * 6 + j)) // (1)
*((int*)faces + (i * n + j)) // (2)
The difference looks marginal so far, but it becomes more obvious at assembler level (assuming all variables are still stored on the stack, and sizeof(int) == 4):
// (1) fixed row length
LD R1, i;
LD R2, j;
MUL R1, R1, 24; // using a constant! 24: 6 * sizeof(int)!
MUL R2, R2, 4; // sizeof(int)
ADD R1, R1, R2; // byte offset stored in R1 register
// (2) VLA row length
LD R1, i;
LD R2, j;
LD R3, n; // need to load from stack
MUL R3, R3, 4; // still need to multiply by sizeof(int)
MUL R1, R1, R3; // can now use n * sizeof(int) from register R3
MUL R2, R2, 4; // ...
ADD R1, R1, R2; // ...
True assembler code might vary, of course, especially if you use a calling convention that passes some parameters in registers (then loading n into R3 might be unnecessary).
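The two translations (1) and (2) above can be checked with a small C sketch. The function names get_fixed, get_vla and get_flat are made up for this illustration; get_flat mirrors the cast used in the answer and relies on the rows being laid out contiguously.

```c
/* Fixed row length: the row stride (6 ints) is a compile-time constant. */
int get_fixed(int faces[][6], int i, int j)
{
    return faces[i][j];
}

/* VLA row length: the row stride (n ints) is only known at run time. */
int get_vla(int n, int faces[][n], int i, int j)
{
    return faces[i][j];
}

/* Manual row-major indexing over the same storage viewed as a flat
   sequence of ints (hypothetical helper, for illustration only). */
int get_flat(int n, const int *flat, int i, int j)
{
    return flat[i * n + j];
}
```

Called on the same int faces[3][6], all three should return the same element; only the way the row stride reaches the address computation differs.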
For completeness (added due to comments, unrelated to the original question): there is also the int* array[] case, i.e. representation by an array of pointers to arrays.
*((int*)faces + (i * ??? + j))
doesn't work any more, as faces in this case does not refer to contiguous memory (well, the pointers themselves are in contiguous memory, of course, but not all the faces[i][j]). We are forced to do:
*(*(faces + i) + j)
as we need to dereference the true pointer in the array before we can apply the next index. Assembler code for (for comparison, need a more complete variant of the pointer to 2D-array case first):
LD R1, faces;
LD R2, i;
LD R3, j;
LD R4, m; // or skip, if no VLA
MUL R4, R4, 4; // or skip, if no VLA
MUL R2, R2, R3; // constant instead of R3, if no VLA
MUL R3, R3, 4;
ADD R2, R2, R3; // index stored in R1 register
ADD R1, R1, R2; // offset from base pointer
LD R1, [R1]; // loading value of faces[i][j] into register
LD R1, faces;
LD R2, i;
LD R3, j;
MUL R2, R2, 8; // sizeof(void*) (any pointer)
MUL R3, R3, 4; // sizeof(int)
ADD R1, R1, R2; // address of faces[i]
LD R1, [R1]; // now need to load address - i. e. de-referencing faces[i]
ADD R1, R1, R3; // offset within array
LD R1, [R1]; // loading value of faces[i][j] into register
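The difference between the two layouts can also be sketched in C. The names get_2d and get_rows are hypothetical and exist only for this illustration.

```c
/* Contiguous 2D array: one address computation, then one load. */
int get_2d(int n, int faces[][n], int i, int j)
{
    return faces[i][j];
}

/* Array of row pointers: faces[i] must be loaded from memory first,
   and only then can the int it points at be loaded. */
int get_rows(int *faces[], int i, int j)
{
    return *(*(faces + i) + j);
}
```

Both functions are written with the same indexing idea, but get_rows needs two dependent loads, which corresponds to the extra LD R1, [R1] in the second listing above.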
I disassembled this code :
void create_polygon(int n, int faces[][6])
{
int a = sizeof(faces[0]);
(void)a;
}
With VLA argument :
movl %edi, -4(%rbp) # 6
movq %rsi, -16(%rbp) # faces
movl %edi, %esi
shlq $2, %rsi # 6 << 2 = 24
movl %esi, %edi
With fixed size :
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $24, %edi # 24
As Aconcagua pointed out, in the first example, using a VLA, the size is computed at run time: the second dimension n arrives in edi, is copied to rsi, multiplied by the size of an int (a shift left by 2), and the result is moved back into edi.
In the second example, the size is computed at compile time and placed directly into edi. The main advantage of the fixed size is that the compiler can diagnose an argument of an incompatible pointer type when an array with a different row length is passed, which helps avoid a crash.
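A minimal sketch of the same observation in C, with hypothetical function names: sizeof applied to a row gives the same number in both variants, but only the fixed-size one is a compile-time constant.

```c
#include <stddef.h>

/* Row size with a fixed second dimension: a constant, 6 * sizeof(int). */
size_t row_bytes_fixed(int faces[][6])
{
    return sizeof(faces[0]);
}

/* Row size with a VLA second dimension: computed at run time as
   n * sizeof(int), because faces[0] has the variably modified type int[n]. */
size_t row_bytes_vla(int n, int faces[][n])
{
    return sizeof(faces[0]);
}
```

With sizeof(int) == 4, both return 24 for a row of 6 ints, matching the $24 constant and the shift by 2 in the disassembly.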
I don't know if this question is vague or lacks enough information, but I was just wondering: if I want to convert the line a = a * b * c, written in C, to LC-3, how can I do it? Assume that a, b and c are local variables and that the offset of a is 0, b is -1, and c is -2.
I know that I can start off like this:
LDR R0, R5, #0; to load A
LDR R1, R5, #-1; to load B
Is there a limitation to the registers I can use? Can I use R2 to load C?
Edit:
LDR R0, R5, #0; LOAD A
LDR R1, R5, #-1; LOAD B
LDR R2, R5, #-2; LOAD C
AND R3, R3, #0; Sum = 0
LOOP ADD R3, R3, R1; Sum = sum + B
ADD R0, R0, #-1; A = A-1
STR R0, R5, #0; SAVE TO A (a = a*b)
BRp LOOP
ADD R4, R4, R2; Sum1 = sum1 + C
ADD R2, R2, #-1; C = C-1
BRp LOOP
STR R0, R5, #0; SAVE TO A (a = a*c = a*b*c)
If you're writing a whole program, which is often the case with LC-3, the only physical limit is the instruction set, so modulo that, you can use the registers as you like.
Your coursework/assignment may impose some environmental requirements, such as local variables and parameters being accessible from a frame pointer, e.g. R5, and having the stack pointer in R6. If so, then those should probably be left alone, but in a pinch you could save them, and later restore them.
If you're writing just a function that is going to be called, then you'll need to follow the calling convention. Decode the signature of the function you're implementing according to the parameter passing approach. If you want to use R7 (e.g. as a scratch register, or if you want to call another function) be aware that on entry to your function, it holds the return address, whose value is needed to return to the caller, but you can save it on the stack or in a global storage for later retrieval.
The calling convention in use should also tell you which registers are call-clobbered and which are call-preserved. Within a function, call-clobbered registers can be used without fuss, but call-preserved registers must be saved before use and restored to their original values before returning to the function's caller.
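Since the LC-3 instruction set has no multiply instruction, a = a * b * c from the question has to be built from additions. The following C sketch models two repeated-addition loops, each with its own counter and accumulator, which is roughly what the registers would have to hold. The helper names are made up, and the sketch assumes the loop counts (a and c) are non-negative.

```c
/* Multiply by repeated addition: adds value to an accumulator count times.
   Assumption for this sketch: count >= 0. */
int mul_by_add(int value, int count)
{
    int sum = 0;
    while (count > 0) {
        sum += value;   /* corresponds to ADD sum, sum, value */
        --count;        /* corresponds to ADD count, count, #-1 and BRp */
    }
    return sum;
}

/* a * b * c as two separate loops, each with a fresh accumulator. */
int a_times_b_times_c(int a, int b, int c)
{
    int ab = mul_by_add(b, a);   /* add b to itself a times */
    return mul_by_add(ab, c);    /* add (a * b) to itself c times */
}
```

Each loop needs a counter, an addend and an accumulator, so three or four registers out of R0-R4 are enough if R5, R6 and R7 are left alone.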
Assume I am working on ARM Cortex M7. Now take a look at:
int a[4][4];
a[i][j]=5;
In assembly language, will the function calculate the address of a[i][j], or does it use a lookup table (a pointer array with the same size), or some magical way to place 5 in the correct place?
This is disassembler output:
136 array1[i][i+1]=i;
08000da6: ldr r3, [r7, #36] ; 0x24
08000da8: adds r3, #1
08000daa: ldr r2, [r7, #36] ; 0x24
08000dac: uxtb r1, r2
08000dae: ldr r2, [r7, #36] ; 0x24
08000db0: lsls r2, r2, #2
08000db2: add.w r0, r7, #40 ; 0x28
08000db6: add r2, r0
08000db8: add r3, r2
08000dba: subs r3, #36 ; 0x24
08000dbc: mov r2, r1
08000dbe: strb r2, [r3, #0]
If you write the indices as in your example, the compiler will calculate the exact memory address required at compile time.
If the indices were variables, then address would be calculated at run time.
Here is a comparison of assembly output for both cases.
Without considering optimization, the standard way a compiler implements a reference to an array element, say x[k]:
Start with the memory address of x.
Multiply the number of bytes in an element of x by k.
Add that to the memory address of x.
Let’s suppose int has four bytes.
With a[i][j] of int a[4][4], there are two memory references, so the address is calculated:
Start with the memory address of a.
The elements of a are arrays of 4 int, so the size of an element of a is 4 times 4 bytes, which is 16 bytes.
Multiply 16 bytes by i.
Add that to the address of a. Now we have calculated the address of a[i].
We use the address of a[i] to start a new calculation.
a[i] is an array of int, so the size of an element of a[i] is 4 bytes.
Multiply 4 bytes by j.
Add that to the address of a[i]. This gives the address of a[i][j].
Summarizing, if s is the start address of a, and we use units of bytes, then the address of a[i][j] is s + 16•i + 4•j.
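The formula above can be checked against what the compiler does with a short C sketch. The function names are hypothetical, and sizeof(int) is used instead of a hard-coded 4 so the check does not depend on the int size.

```c
#include <stddef.h>

/* Byte offset of a[i][j] inside an int[4][4], following the steps above:
   (size of one row) * i + (size of one int) * j. */
size_t offset_by_formula(size_t i, size_t j)
{
    return (4 * sizeof(int)) * i + sizeof(int) * j;
}

/* The same offset, taken from the address the compiler computes. */
size_t offset_by_compiler(int a[4][4], size_t i, size_t j)
{
    return (size_t)((char *)&a[i][j] - (char *)a);
}
```

With a 4-byte int, offset_by_formula(i, j) reduces to 16•i + 4•j, as in the summary.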
I am facing a weird issue. I am passing a uint64_t offset to a function on a 32-bit architecture (Cortex-R52). The registers are all 32-bit.
This is the function:
Sorry, I messed up the function declaration.
void read_func(uint8_t* read_buffer_ptr, uint64_t offset, size_t size);
main(){
// read_buffer : memory where something is read.
// read_buffer_ptr : points to read_buffer structure where value should be stored after reading value.
read_func(read_buffer_ptr, 0, sizeof(read_buffer));
}
In this function, the value stored in offset is not zero but some random values which I also see in the registers(r5, r6). Also, when I use offset as a 32-bit value, it works perfectly fine. The value is copied from r2,r3 into r5,r6.
Can you please let me know why this could be happening? Are registers not enough?
The prototype posted is invalid, it should be:
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);
Similarly, the definition of main() is obsolete: the implicit int return type is no longer supported as of C99, and the function call has another syntax error with a missing )...
What happens when you pass a 64-bit argument on a 32-bit architecture is implementation defined:
either 8 bytes of stack space are used to pass the value
or 2 32-bit registers are used to pass the least significant part and the most significant part
or a combination of both depending on the number of arguments
or some other scheme appropriate for the target CPU
In your code you pass 0, which has type int and presumably has only 32 bits. This is not a problem if the prototype for read_func is correct and has been parsed before the function call; otherwise the behavior is undefined. A C99 compiler should not even compile the code in that case, but many compilers will just issue a warning and generate bogus code.
In your case (Cortex R52), the 64-bit argument is passed to read_func in registers r2 and r3.
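How a 64-bit value occupies two 32-bit registers can be modeled in portable C. This is only a sketch: the helper names are made up, and which register of the pair holds which half is decided by the ABI and the endianness in use, not by this code.

```c
#include <stdint.h>

/* Least significant 32 bits of a 64-bit value. */
uint32_t low_word(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

/* Most significant 32 bits of a 64-bit value. */
uint32_t high_word(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Rebuild the 64-bit value from its two halves, as the callee would
   after receiving them in a register pair. */
uint64_t join_words(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

If the caller only sets up one of the two halves (for example because it did not see a correct prototype), the other half is whatever happened to be in the second register, which would match the random values described in the question.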
The Cortex-R52 has a 32-bit address bus, so an offset cannot meaningfully be 64 bits wide. In address calculations only the lower 32 bits are used, as the upper ones have no effect.
Example:
uint64_t foo(void *buff, uint64_t offset, uint64_t size)
{
unsigned char *cbuff = buff;
while(size--)
{
*(cbuff++ + offset) = size & 0xff;
}
return offset + (uint32_t)cbuff;
}
void *z1(void);
uint64_t z2(void);
uint64_t z3(void);
uint64_t bar(void)
{
return foo(z1(), z2(), z3());
}
foo:
push {r4, lr}
ldr lr, [sp, #8] //size (low word)
ldr r1, [sp, #12] //size (high word)
mov ip, lr
add r4, r0, r2 // cbuff + offset calculation r3 is ignored as not needed - processor has only 32bits address space.
.L2:
subs ip, ip, #1 //size--
sbc r1, r1, #0 //size--
cmn r1, #1
cmneq ip, #1
bne .L3
add r0, r0, lr
adds r0, r0, r2
adc r1, r3, #0
pop {r4, pc}
.L3:
strb ip, [r4], #1
b .L2
bar:
push {r0, r1, r4, r5, r6, lr}
bl z1
mov r4, r0 // buff
bl z2
mov r6, r0 // offset
mov r5, r1 // offset
bl z3
mov r2, r6 // offset
strd r0, [sp] // size passed on the stack
mov r3, r5 // offset
mov r0, r4 // buff
bl foo
add sp, sp, #8
pop {r4, r5, r6, pc}
As you can see, registers r2 and r3 contain the offset, r0 holds buff, and size is passed on the stack.
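The claim that the upper half of the offset does not affect a 32-bit address can be sketched in C by modeling addresses as uint32_t. The function addr32 is hypothetical and only illustrates the wrap-around arithmetic.

```c
#include <stdint.h>

/* Model of an address computation in a 32-bit address space: the sum is
   truncated to 32 bits, so bits 32..63 of offset cannot change the result. */
uint32_t addr32(uint32_t base, uint64_t offset)
{
    return (uint32_t)(base + offset);
}
```

This corresponds to add r4, r0, r2 in the listing above, where r3 (the high word of offset) is not used for the address at all.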
Here is a C source code example:
register int a asm("r8");
register int b asm("r9");
int main() {
int c;
a=2;
b=3;
c=a+b;
return c;
}
And this is the assembly code generated using an ARM gcc cross compiler:
$ arm-linux-gnueabi-gcc -c global_reg_var_test.c -Wa,-a,-ad
...
mov r8, #2
mov r9, #3
mov r2, r8
mov r3, r9
add r3, r2, r3
...
When using -frename-registers, the behaviour was the same. (updated. Before I had said with -O3.)
So the question is: why does gcc add the 3rd and 4th MOVs instead of 'ADD R3, R8, R9'?
Context: I need to optimize code for a simulated in-order CPU (gem5 ARM MinorCPU) that doesn't rename registers.
I took the real example (posted in comments) and put it on the Godbolt compiler explorer. The main inefficiency in calc() is that src1 and src2 are globals it has to load from memory, instead of args passed in registers.
I didn't look at main, just calc.
register int sum asm ("r4");
register int r asm ("r5");
register int c asm ("r6");
register int k asm ("r7");
register int temp1 asm ("r8"); // really? you're using two global register vars for scratch temporaries? Just let the compiler do its job.
register int temp2 asm ("r9");
register long n asm ("r10");
int *src1, *src2, *dst;
void calc() {
temp1 = r*n;
temp2 = k*n;
temp1 = temp1+k;
temp2 = temp2+c;
// you get bad code for this because src1 and src2 are globals, not args passed in regs
sum = sum + src1[temp1] * src2[temp2];
}
# gcc 4.8.2 -O3 -Wall -Wextra -Wa,-a,-ad -fverbose-asm
mla r0, r10, r7, r6 # temp2.9, n, k, c ## tmp = k*n + c
movw r3, #:lower16:.LANCHOR0 # tmp136,
mla r8, r10, r5, r7 # temp1, n, r, k ## temp1 = r*n + k
movt r3, #:upper16:.LANCHOR0 # tmp136,
ldmia r3, {r1, r2} # tmp136,, ## load both pointers, since they're stored adjacently in memory
mov r9, r0 # temp2, temp2.9 ## This insn is wasted: the first MLA should have had this as the dest
ldr r3, [r1, r8, lsl #2] # *_22, *_22
ldr r2, [r2, r9, lsl #2] # *_28, *_28
mla r4, r2, r3, r4 # sum, *_28, *_22, sum
bx lr #
For some reason, one of the integer multiply-accumulate (mla) instructions uses r8 (temp1) as the destination, but the other one writes to r0 (a scratch reg), and only later moves the result to r9 (temp2).
The sum += src1[temp1] * src2[temp2] is done with an mla that reads and writes r4 (sum).
Why do you need temp1 and temp2 to be globals? That's just going to stop the optimizer from doing aggressive optimizations that don't calculate exactly the same temporaries that the C source does. Fortunately the C memory model is weak enough that it should be able to reorder assignments to them, although this might actually be why it didn't MLA into temp2 directly, since it decided to do that calculation first. (Hmm, does the memory model even apply? Other threads can't see our registers at all, so those globals are all effectively thread-local. It should allow relaxed ordering for assignments to globals. Signal handlers can see these globals, and could run at any point. gcc isn't following strict source order, since in the source both multiplies happen before either add.)
Godbolt doesn't have a newer ARM gcc version, so I can't easily test a newer gcc. A newer gcc might do a better job with this.
BTW, I tried a version of the function using local variables for temporaries, and didn't actually get better results. Probably because there are still so many register globals that gcc couldn't pick convenient regs for the temporaries.
// same register globals, except for temp1 and temp2.
void calc_local_tmp() {
int t1 = r*n + k;
sum += src1[t1] * src2[k*n + c];
}
push {lr} # gcc decides to push to get a tmp reg
movw r3, #:lower16:.LANCHOR0 # tmp131,
mla lr, r10, r5, r7 # tmp133, n.1, r, k.2
movt r3, #:upper16:.LANCHOR0 # tmp131,
mla ip, r7, r10, r6 # tmp137, k.2, n.1, c
ldr r2, [r3] # src1, src1
ldr r0, [r3, #4] # src2, src2
ldr r1, [r2, lr, lsl #2] # *_10, *_10
ldr r3, [r0, ip, lsl #2] # *_20, *_20
mla r4, r3, r1, r4 # sum, *_20, *_10, sum
ldr pc, [sp], #4 #
Compiling with -fcall-used-r8 -fcall-used-r9 didn't help; gcc makes the same code that pushes lr to get an extra temporary. It fails to use ldmia (load-multiple) because it makes a sub-optimal choice of which temporary to put in which reg. (&src1 in r0 would let it load src1 and src2 into r2 and r3.)
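For comparison, here is a sketch of what calc() could look like if the pointers and scalars were passed as ordinary arguments instead of being globals and register globals. The name calc_args and its parameter list are assumptions made for this illustration, not code from the question, and no claim is made here about the asm a particular gcc version produces for it.

```c
/* Hypothetical argument-passing variant of calc(): the pointers arrive as
   arguments, so they do not have to be reloaded from global memory, and the
   temporaries are plain locals the compiler is free to allocate. */
int calc_args(const int *src1, const int *src2,
              int sum, int r, int c, int k, long n)
{
    long i1 = r * n + k;   /* was temp1 */
    long i2 = k * n + c;   /* was temp2 */
    return sum + src1[i1] * src2[i2];
}
```

The caller would then keep sum in a local and write sum = calc_args(src1, src2, sum, r, c, k, n); whether this beats the register-global version on a given core would have to be measured.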
Suppose I am given as input to a function foo some pointer *pL that points to a pointer to a struct that has a pointer field next in it. I know this is weird, but all I want to implement in assembly is the line of code with the ** around it:
typedef struct CELL *LIST;
struct CELL {
int element;
LIST next;
};
void foo(LIST *pL){
**(*pL)->next = NULL;**
}
How do I implement this in ARM assembly? The issue comes from having nested statements when I want to store, such as:
.irrelevant header junk
foo:
MOV R1, #0
STR R1, [[R0,#0],#4] #This is gibberish, but [R0,#0] is to dereference and the #4 is to offset that.
The sequence would be similar to:
... ; r0 = LIST *pL = CELL **ppC (ptr2ptr2cell)
ldr r0,[r0] ; r0 = CELL *pC (ptr2cell)
mov r1,#0 ; r1 = NULL
str r1,[r0,#4] ; (*pL)->next = pC->next = (*pC).next = NULL
The correct sequence would be (assuming ARM ABI and LIST *pL is in R0),
.global foo
foo:
ldr r0, [r0] @ get *pL to R0
mov r1, #0 @ set R1 to zero.
str r1, [r0, #4] @ set (*pL)->next = NULL;
bx lr @ return
You can swap the first two assembler statements, but it is generally better to interleave ALU with load/store for performance. With the ARM ABI, you can use R0-R3 without saving. foo() above should be callable from 'C' with most ARM compilers.
The 'C' might be simplified to,
void foo(struct CELL **pL)
{
(*pL)->next = NULL;
}
if I understand correctly.
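The same two steps (load the pointer, then store through it) can be written out in C as a self-contained sketch. The comments map each statement to the instructions above; the #4 offset assumes a 4-byte int and 4-byte pointers, as on 32-bit ARM.

```c
#include <stddef.h>

typedef struct CELL *LIST;
struct CELL {
    int element;
    LIST next;
};

void foo(LIST *pL)
{
    LIST cell = *pL;      /* ldr r0, [r0]      : dereference pL once */
    cell->next = NULL;    /* mov r1, #0
                             str r1, [r0, #4]  : store at the offset of next */
}
```

On a given target, offsetof(struct CELL, next) from <stddef.h> shows the offset the compiler actually uses; on a 64-bit host it would typically be 8 rather than 4.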