I am trying to understand how a variable sized static array work internally:
Following is a fixed size static array in C and its Assembly equivalent;
int main()
{
int arr[2] = {3};
}
================
main:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], 0
mov DWORD PTR [rbp-8], 2
mov eax, 0
pop rbp
ret
However a variable sized array is shown below
int main()
{
int varSize ;
int Arr[varSize];
}
=================
main:
push rbp
mov rbp, rsp
sub rsp, 32
mov rax, rsp
mov rcx, rax
mov eax, DWORD PTR [rbp-4]
movsx rdx, eax
sub rdx, 1
mov QWORD PTR [rbp-16], rdx
movsx rdx, eax
mov r8, rdx
mov r9d, 0
movsx rdx, eax
mov rsi, rdx
mov edi, 0
cdqe
lea rdx, [0+rax*4]
mov eax, 16
sub rax, 1
add rax, rdx
mov edi, 16
mov edx, 0
div rdi
imul rax, rax, 16
sub rsp, rax
mov rax, rsp
add rax, 3
shr rax, 2
sal rax, 2
mov QWORD PTR [rbp-24], rax
mov rsp, rcx
mov eax, 0
leave
ret
I am seeing a whole lot of assembly instructions if I declare a variable sized array. Can some one explain how is this flexibility of variable size achieved?
Same mechanism as alloca() - allocate memory by decreasing the stack pointer, with the assumption that the stack is big enough and/or the OS will grow it as needed.
There might be a bit of an issue when the requested size is over a memory page and the stack is near its end. Normally, the OS grows the stack by setting up a guard page at the stack top and watching for faults in that area, but that assumes that the stack grows more or less sequentially (by pushes and function calls). If the decreased stack pointer overshoots the guard page, it might end up pointing at a bogus location. I'm not sure what does the compiler do about that possibility.
Related
Why is it convention in C and Go to pass a pointer to a variable and change it rather return a new variable with the value?
In C:
#include <stdio.h>
int getValueUsingReturn() {
int value = 42;
return value;
}
void getValueUsingPointer(int* value ) {
*value = 42;
}
int main(void) {
int valueUsingReturn = getValueUsingReturn();
printf("%d\n", valueUsingReturn);
int valueUsingPointer;
getValueUsingPointer(&valueUsingPointer);
printf("%d\n", valueUsingPointer);
return 0;
}
In Go:
package main
import "fmt"
func getValueUsingReturn() int {
value := 42
return value
}
func getValueUsingPointer(value *int) {
*value = 42
}
func main() {
valueUsingReturn := getValueUsingReturn()
fmt.Printf("%d\n", valueUsingReturn)
var valueUsingPointer int
getValueUsingPointer(&valueUsingPointer)
fmt.Printf("%d\n", valueUsingPointer)
}
It there any performance benefits or restrictions in doing one or the other?
First off, I don't know enough about Go to give a judgement on it, but the answer will apply in the case of C.
If you're just working on primitive types like ints, then I'd say there is no performance difference between the two techniques.
When structs come into play, there is a very slight advantage of modifying a variable via pointer (based purely on what you're doing in your code)
#include <stdio.h>
struct Person {
int age;
const char *name;
const char *address;
const char *occupation;
};
struct Person getReturnedPerson() {
struct Person thePerson = {26, "Chad", "123 Someplace St.", "Software Engineer"};
return thePerson;
}
void changeExistingPerson(struct Person *thePerson) {
thePerson->age = 26;
thePerson->name = "Chad";
thePerson->address = "123 Someplace St.";
thePerson->occupation = "Software Engineer";
}
int main(void) {
struct Person someGuy = getReturnedPerson();
struct Person theSameDude;
changeExistingPerson(&theSameDude);
return 0;
}
GCC x86-64 11.2
With No Optimizations
Returning a struct variable through the function's return is slower because the variable has to be "built" by assigning the desired values, after which, the variable is copied to the return value.
When you're modifying a variable by pointer indirection, there is nothing to do except write the desired values to the memory addresses (based off the pointer you passed in)
.LC0:
.string "Chad"
.LC1:
.string "123 Someplace St."
.LC2:
.string "Software Engineer"
getReturnedPerson:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-40], rdi
mov DWORD PTR [rbp-32], 26
mov QWORD PTR [rbp-24], OFFSET FLAT:.LC0
mov QWORD PTR [rbp-16], OFFSET FLAT:.LC1
mov QWORD PTR [rbp-8], OFFSET FLAT:.LC2
mov rcx, QWORD PTR [rbp-40]
mov rax, QWORD PTR [rbp-32]
mov rdx, QWORD PTR [rbp-24]
mov QWORD PTR [rcx], rax
mov QWORD PTR [rcx+8], rdx
mov rax, QWORD PTR [rbp-16]
mov rdx, QWORD PTR [rbp-8]
mov QWORD PTR [rcx+16], rax
mov QWORD PTR [rcx+24], rdx
mov rax, QWORD PTR [rbp-40]
pop rbp
ret
changeExistingPerson:
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], 26
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rax+8], OFFSET FLAT:.LC0
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rax+16], OFFSET FLAT:.LC1
mov rax, QWORD PTR [rbp-8]
mov QWORD PTR [rax+24], OFFSET FLAT:.LC2
nop
pop rbp
ret
main:
push rbp
mov rbp, rsp
sub rsp, 64
lea rax, [rbp-32]
mov rdi, rax
mov eax, 0
call getReturnedPerson
lea rax, [rbp-64]
mov rdi, rax
call changeExistingPerson
mov eax, 0
leave
ret
With Slight Optimization
However, most compilers today can figure out what you're trying to do here, and will equalize the performance between the two techniques.
If you want to be absolutely stingy, passing pointers is still slightly faster by a few clock cycles at best.
In returning a variable from the function, you still have to at least set the address of the return value.
mov rax, rdi
But in passing the pointer, not even this is done.
But other than that, the two techniques have no performance difference.
.LC0:
.string "Chad"
.LC1:
.string "123 Someplace St."
.LC2:
.string "Software Engineer"
getReturnedPerson:
mov rax, rdi
mov DWORD PTR [rdi], 26
mov QWORD PTR [rdi+8], OFFSET FLAT:.LC0
mov QWORD PTR [rdi+16], OFFSET FLAT:.LC1
mov QWORD PTR [rdi+24], OFFSET FLAT:.LC2
ret
changeExistingPerson:
mov DWORD PTR [rdi], 26
mov QWORD PTR [rdi+8], OFFSET FLAT:.LC0
mov QWORD PTR [rdi+16], OFFSET FLAT:.LC1
mov QWORD PTR [rdi+24], OFFSET FLAT:.LC2
ret
main:
mov eax, 0
ret
I think the short answer to you question (At least for C, I am not familiar with GO internals) is that C functions are pass by value and generally also return by value so data objects must be copied and people worried about the performance of all the copying. For large objects or objects that are complex in their depth (containing pointers to other stuff) it is often just more efficient or logical for the value being copied to be a pointer so the function can "operate" on the data without needing to copy it.
That being said, modern compilers are pretty smart about figuring out stuff like whether the parameter data will fit in registers or efficiently copying returned structures.
Bottom line is for modern C code do what seems best for your application or what is clearest to you. Avoid premature optimization if it detracts from readability at least in the beginning.
Also Compiler Explorer (https://godbolt.org/) is your friend if you want to examine the effect of different styles, especially in light of optimization.
Here is a basic program I written on the godbolt compiler, and it's as simple as:
#include<stdio.h>
void main()
{
int a = 10;
int *p = &a;
printf("%d", *p);
}
The results after compilation I get:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-12], 10
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
Question: Pushing the rbp, making the stack frame by making a 16 byte block, how from a register, a value is moved to a stack location and vice versa, how the job of LEA is to figure out the address, I got this part.
Problem:
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
Lea -> getting address of rbp-12 into rax,
then moving the value which is the address of rbp-12 into rax,
but next line again says, move to rax, the value of rbp-8. This seems ambiguous. Then again moving the value of rax to eax. I don't understand the amount of work here. Why couldn't I have done
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov eax, QWORD PTR [rbp-8]
and be done with it? coz on the original line, rbp-12's address is stored onto rax, then rax stored to rbp-8. then rbp-8 stored again into rax, and then again rax is stored into eax? couldn't we have just copied the rbp-8 directly to eax? i guess not. But my question is why?
I know there is de-referencing in pointers, so How LEA helps grabbing the address of rbp-12, I understand, but on the next parts, when did it went from grabbing values from addresses I completely lost. And also, after that, I didn't understand any of the asm lines.
You're seeing very un-optimized code. Here's my line-by-line interpretation:
.LC0:
.string "%d" ; Format string for printf
main:
push rbp ; Save original base pointer
mov rbp, rsp ; Set base pointer to beginning of stack frame
sub rsp, 16 ; Allocate space for stack frame
mov DWORD PTR [rbp-12], 10 ; Initialize variable 'a'
lea rax, [rbp-12] ; Load effective address of 'a'
mov QWORD PTR [rbp-8], rax ; Store address of 'a' in 'p'
mov rax, QWORD PTR [rbp-8] ; Load 'p' into rax (even though it's already there - heh!)
mov eax, DWORD PTR [rax] ; Load 32-bit value of '*p' into eax
mov esi, eax ; Load value to print into esi
mov edi, OFFSET FLAT:.LC0 ; Load format string address into edi
mov eax, 0 ; Zero out eax (not sure why -- likely printf call protocol)
call printf ; Make the printf call
nop ; No-op (not sure why)
leave ; Remove the stack frame
ret ; Return
Compilers, when not optimizing, generate code like this as they parse the code you gave them. It's doing a lot of unnecessary stuff, but it is quicker to generate and makes using a debugger easier.
Compare this with the optimized code (-O2):
.LC0:
.string "%d" ; Format string for printf
main:
mov esi, 10 ; Don't need those variables -- just a 10 to pass to printf!
mov edi, OFFSET FLAT:.LC0 ; Load format string address into edi
xor eax, eax ; It's a few cycles faster to xor a register with itself than to load an immediate 0
jmp printf ; Just jmp to printf -- it will handle the return
The optimizer found that the variables weren't necessary, so no stack frame is created. Nothing is left but the printf call! And that's done as a jmp since nothing else need be done here when the printf is complete.
I never learn C language so it makes me confuse. I just like to know if I did it correctly or where I need to improve. For this code I used assembly x86 32 bit. Thanks
This is what I supposed to do:
Write a procedure with the signature
char *strchar(char *s1, char c1)
that returns a pointer to the first occurrence of the character c1 within the string s1 or, if not found, returns a null.
This is what I came out with:
strchar (char*, char):
push ebp
mov ebp, esp
mov dword ptr [ebp-24], edi
mov EAX , esi
mov BYTE PTR [ebp-28], al
.L5:
mov EAX , dword ptr [ebp-24]
movzx EAX , byte ptr [ EAX ]
test AL, AL
je .L2
mov EAX , dword PTR [ebp-24]
movzx EAX , BYTE PTR [ EAX ]
cmp BYTE PTR [ebp-28], al
jne .L3
mov eax, dword PTR [ebp-24]
jmp .L6
.L3:
add dword PTR [ebp-24], 1
jmp .L5
.L2:
LEA eax, [ebp-9]
MOV DWORD PTR [EBP-8], eax
MOV EAX, DWORD PTR [ebp-8]
.L6:
POP EBP
RET
The lines:
mov dword ptr [ebp-24], edi
mov EAX , esi
mov BYTE PTR [ebp-28], al
assume that a stack frame has been allocated for this function which doesn’t appear true; I think you should have something like:
sub esp, 32
after the
mov ebp,esp
Also, the three lines after L2 seem confused. The only way to get to L2 is if the nil (0) byte is discovered in the string, at which point, the code should return a NULL pointer.
The exit path in the code (L6) leaves eax alone, so all that should be needed is:
L2:
mov eax, 0
It might make debugging easier if you kept the alias up to date; so:
L2:
mov eax, 0
mov [ebp-24], eax
Also, the calling convention used here is a bit odd: the string is passed in edi and the character in esi. Normally, in x86-32, these would both be passed on the stack. This looks like it might have been x86-64 code, converted to x86-32....
A final note; this assembly code looks like the output of a compiler with optimisations disabled. Often, generating the assembly with the optimisations enabled generates easier to understand code. This code, for example, could be much more concisely written as below, without even devolving into weird intel ops:
strchar:
mov edx, esi
mov eax, edi
L:
mov dh, [eax]
test dh, dh
jz null
cmp dh, dl
je done
inc eax
jmp L
null:
mov eax, 0
done:
ret
Here is one with stack overhead
[global strchar]
strchar:
push ebp
mov ebp, esp
mov dl, byte [ebp + 12]
mov ecx, dword [ebp + 8]
xor eax, eax
.loop: mov al, [ecx]
or al, al
jz .exit
cmp al, dl
jz .found
add ecx, 1
jmp .loop
.found: mov eax, ecx
.exit:
leave
ret
Here is one without stack overhead
[global strchar]
strchar:
mov dl, byte [esp + 8]
mov ecx, dword [esp + 4]
xor eax, eax
.loop: mov al, [ecx]
or al, al
jz .exit
cmp al, dl
jz .found
add ecx, 1
jmp .loop
.found: mov eax, ecx
.exit:
ret
These are using the 'cdecl' calling convention. For 'stdcall' change the last 'ret' to 'ret 8'.
I'm trying to understand this simple C program:
int square(int num) {
return num * num;
}
When it's in Assembly code:
square(int):
push rbp ;push rbp register onto stack
mov rbp, rsp ;move contents of rbp register into rsp register
mov DWORD PTR [rbp-4], edi ;not sure what happens here
mov eax, DWORD PTR [rbp-4] ;not sure what happens here
imul eax, DWORD PTR [rbp-4] ;multiply eax and DWORD PTR [rbp-4] (?)
pop rbp ;pop original register out of stack
ret ;return
What is happening in the 3rd and 4th line?
Why did two more register (edi and eax) have to be used instead of rsp?
What is actually happening with DWORD PTR [rbp-4]?
mov DWORD PTR [rbp-4], edi ;not sure what happens here
The x86_64 System V ABI passes function arguments via registers - the first integer argument is passed in the rdi/edi register. So this line copies the argument num to a local (offset -4 bytes from the frame pointer value stored in rbp).
mov eax, DWORD PTR [rbp-4] ;not sure what happens here
This copies the value in the local to the eax register.
imul eax, DWORD PTR [rbp-4] ;multiply eax and DWORD PTR [rbp-4] (?)
And this multiplies the value in eax by the local, and stores the result to eax (which also happens to be the register in which the function return value is stored).
As others pointed out in the comments, compiling with optimization would likely eliminate the local, and write directly from edi to eax.
Say I have a struct defined as follows
struct my_struct
{
int num;
};
....
Here I have a pointer to my_struct and I want to do an increment on num
void foo(struct my_struct* my_ptr)
{
// increment num
// method #1
my_ptr->num++;
// method #2
++(my_ptr->num);
// method #3
my_ptr->++num;
}
Do these 3 ways of incrementing num do the same thing?
While we're at it, is it true that pre-increment is more efficient than post-increment?
Thanks!
First two will have the same effect (when on a line on their own like that), but the third method isn't valid C code (you can't put the ++ there).
As for efficiency, there is no difference. The difference you may have heard people talking about is when, in C++, you increment a non-pointer data type, such as an iterator. In some cases, pre-increment can be faster there.
You can see the generated code using GCC Explorer.
void foo(struct my_struct* my_ptr)
{
my_ptr->num++;
}
void bar(struct my_struct* my_ptr)
{
++(my_ptr->num);
}
Output:
foo(my_struct*): # #foo(my_struct*)
incl (%rdi)
ret
bar(my_struct*): # #bar(my_struct*)
incl (%rdi)
ret
As you can see, there's no difference whatsoever.
The only possible difference between the first two is when you use them in expressions:
my_ptr->num = 0;
int x = my_ptr->num++; // x = 0
my_ptr->num = 0;
int y = ++my_ptr->num; // y = 1
If your only intention is to increment the value of num then the 1st and 2nd method will yield same intented result to the callee method.
However, if you change your code to the following, you can see the difference between the code generated by gcc (assembly level code):
struct my_struct
{
int num;
};
void foo(struct my_struct* my_ptr)
{
printf("\nPost Increment: %d", my_ptr->num++);
}
int main()
{
struct my_struct a;
a.num = 10;
foo(&a);
}
Now compile it using: gcc -masm=intel -S structTest.c -o structTest.s
This asks gcc to generate the assembly code:
Open structTest.s in a text editor.
foo:
.LFB0:
push rbp
mov rbp, rsp
sub rsp, 16
**mov QWORD PTR [rbp-8], rdi**
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov edx, eax
**lea ecx, [rax+1]**
mov rax, QWORD PTR [rbp-8]
mov DWORD PTR [rax], ecx
mov eax, OFFSET FLAT:.LC0
mov esi, edx
mov rdi, rax
mov eax, 0
call printf
leave
ret
.cfi_endproc
main:
.LFB1:
push rbp
mov rbp, rsp
sub rsp, 16
**mov DWORD PTR [rbp-16], 10
lea rax, [rbp-16]
mov rdi, rax
call foo**
leave
ret
.cfi_endproc
And when you change the operation to pre-increment, the follwoing code is generated:
foo:
.LFB0:
.cfi_startproc
push rbp
mov rbp, rsp
sub rsp, 16
**mov QWORD PTR [rbp-8], rdi**
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
**lea edx, [rax+1]**
mov rax, QWORD PTR [rbp-8]
**mov DWORD PTR [rax], edx**
mov rax, QWORD PTR [rbp-8]
**mov edx, DWORD PTR [rax]**
mov eax, OFFSET FLAT:.LC0
mov esi, edx
mov rdi, rax
mov eax, 0
call printf
leave
ret
.cfi_endproc
So, you would see that in the second case, the compiler increments the num value and passes on this num value to printf().
In terms of performance, I would expect the post-increment to be more efficient since the memory locations are touched a fewer number of times.
The important lines have been marked between ** in the above code.