I was given a function in assembly that basically converts uppercase letters to lowercase. Here is some of the assembly:
Q1:
pushq %rbp
movq %rsp, %rbp
subq $24, %rsp
movq %rdi, -24(%rbp)
movl $0, -4(%rbp)
movl $0, -8(%rbp)
jmp .L2
.L2:
movl -4(%rbp), %edx
movq -24(%rbp), %rax
addq %rdx, %rax
movzbl (%rax), %eax
testb %al, %al
jne .L4
...
Much of the rest is repetitive, but .L2 is what is really confusing me. This is my logic so far:
We store param1 into -24(%rbp). We create local1 and local2, set them both to 0, and then jump to .L2. I move local1 into %edx and param1 into %rax. Now this is where things get confusing for me.
I was told that the following line, the addq, ends up with local1 being a pointer into param1. I just reasoned: add local1 + param1 and store the result in %rax. How is that possible?
Next is movzbl. From my understanding, we dereference %rax, so we get something like eax = (int) *rax.
I was also told to think of it as converting a char to an int. Which one is true, and how do I know that I'm typecasting? What if %rax didn't have parentheses around it? Is it an int because it's 4 bytes and %eax is a 32-bit register? Thank you in advance for your help; I'm kind of lost here.
local1 is not a pointer, it's an index (a counter).
That code is doing something like:
void toupper(char* text)
{
    int i = 0;  /* at rbp-4 */
    int j = 0;  /* unused, at rbp-8 */
    int ch;     /* in eax */

    while ((ch = *(text + i)) != 0)
    {
        ...
    }
}
Note that in C pointer arithmetic *(text + i) is of course equivalent to text[i].
Yes, the movzbl is converting an unsigned char to an int; you can see that from the instruction name itself: MOVe Zero-extended Byte to Long.
The parentheses denote pointer dereferencing.
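To make that concrete, here is a tiny self-contained C sketch (my own illustration, not taken from your code) of what the addq/movzbl pair does: form the address text + i, then load one byte from it, zero-extended to 32 bits.

#include <stdio.h>

int main(void)
{
    const char *text = "Hello";   /* stands in for param1 */
    int i = 1;                    /* stands in for local1, the index */

    /* addq: compute the address text + i.
       movzbl: load one byte from that address and zero-extend it to 32 bits. */
    int ch = (unsigned char) *(text + i);   /* same as (unsigned char) text[i] */

    printf("%c (%d)\n", ch, ch);            /* prints: e (101) */
    return 0;
}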
I'm reading "Computer Systems: A Programmer's Perspective, 3/E" (CS:APP3e) and the following code is an example from the book:
long call_proc() {
    long x1 = 1;
    int x2 = 2;
    short x3 = 3;
    char x4 = 4;
    proc(x1, &x1, x2, &x2, x3, &x3, x4, &x4);
    return (x1+x2)*(x3-x4);
}
The book gives the assembly code generated by GCC:
long call_proc()
call_proc:
; Set up arguments to proc
subq $32, %rsp ; Allocate 32-byte stack frame
movq $1, 24(%rsp) ; Store 1 in &x1
movl $2, 20(%rsp) ; Store 2 in &x2
movw $3, 18(%rsp) ; Store 3 in &x3
movb $4, 17(%rsp) ; Store 4 in &x4
leaq 17(%rsp), %rax ; Create &x4
movq %rax, 8(%rsp) ; Store &x4 as argument 8
movl $4, (%rsp) ; Store 4 as argument 7
leaq 18(%rsp), %r9 ; Pass &x3 as argument 6
movl $3, %r8d ; Pass 3 as argument 5
leaq 20(%rsp), %rcx ; Pass &x2 as argument 4
movl $2, %edx ; Pass 2 as argument 3
leaq 24(%rsp), %rsi ; Pass &x1 as argument 2
movl $1, %edi ; Pass 1 as argument 1
; Call proc
call proc
; Retrieve changes to memory
movslq 20(%rsp), %rdx ; Get x2 and convert to long
addq 24(%rsp), %rdx ; Compute x1+x2
movswl 18(%rsp), %eax ; Get x3 and convert to int
movsbl 17(%rsp), %ecx ; Get x4 and convert to int
subl %ecx, %eax ; Compute x3-x4
cltq ; Convert to long
imulq %rdx, %rax ; Compute (x1+x2) * (x3-x4)
addq $32, %rsp ; Deallocate stack frame
ret ; Return
I can understand this code: the compiler allocates 32 bytes of space on the stack, of which the first 16 bytes hold the arguments passed to proc and the last 16 bytes hold 4 local variables.
Then I tested this code on GCC 11.2, using the optimization flag -Og, and got this assembly code:
call_proc():
subq $24, %rsp
movq $1, 8(%rsp)
movl $2, 4(%rsp)
movw $3, 2(%rsp)
movb $4, 1(%rsp)
leaq 1(%rsp), %rax
pushq %rax
pushq $4
leaq 18(%rsp), %r9
movl $3, %r8d
leaq 20(%rsp), %rcx
movl $2, %edx
leaq 24(%rsp), %rsi
movl $1, %edi
call proc(long, long*, int, int*, short, short*, char, char*)
movslq 20(%rsp), %rax
addq 24(%rsp), %rax
movswl 18(%rsp), %edx
movsbl 17(%rsp), %ecx
subl %ecx, %edx
movslq %edx, %rdx
imulq %rdx, %rax
addq $40, %rsp
ret
I noticed that GCC first allocates 24 bytes for the 4 local variables, then uses pushq to put the 2 stack arguments on the stack, so the final code uses addq $40, %rsp to free the stack space.
Compared to the code in the book, GCC allocates 8 more bytes of space here, and it doesn't seem to use the extra space. Why does it need the extra space?
(This answer is a summary of comments posted above by Antti Haapala, klutt and Peter Cordes.)
GCC allocates more space than "necessary" in order to ensure that the stack is properly aligned for the call to proc: the stack pointer must be adjusted by a multiple of 16, plus 8 (i.e. by an odd multiple of 8). See: Why does the x86-64 / AMD64 System V ABI mandate a 16-byte stack alignment?
What's strange is that the code in the book doesn't do that; the code as shown would violate the ABI and, if proc actually relies on proper stack alignment (e.g. using aligned SSE2 instructions), it may crash.
So it appears that either the code in the book was incorrectly copied from compiler output, or else the authors of the book are using some unusual compiler flags which alter the ABI.
Modern GCC 11.2 emits nearly identical asm (Godbolt) using -Og -mpreferred-stack-boundary=3 -maccumulate-outgoing-args, the former of which changes the ABI to maintain only 2^3 byte stack alignment, down from the default 2^4. (Code compiled this way can't safely call anything compiled normally, even standard library functions.) -maccumulate-outgoing-args used to be the default in older GCC, but modern CPUs have a "stack engine" that makes push/pop single-uop so that option isn't the default anymore; push for stack args saves a bit of code size.
One difference from the book's asm is a movl $0, %eax before the call, because there's no prototype so the caller has to assume it might be variadic and pass AL = the number of FP args in XMM registers. (A prototype that matches the args passed would prevent that.) The other instructions are all the same, and in the same order as whatever older GCC version the book used, except for choice of registers after call proc returns: it ends up using movslq %edx, %rdx instead of cltq (sign-extend with RAX).
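For what it's worth, a prototype along these lines would give the caller that information; this is only a sketch based on the demangled call shown above, and I'm assuming proc returns void, which the book doesn't show:

/* Hypothetical prototype matching the call in call_proc(). With this visible
   at the call site, the compiler knows proc is not variadic, so it does not
   need to set AL to the number of XMM register arguments before the call. */
void proc(long a1, long *p1, int a2, int *p2,
          short a3, short *p3, char a4, char *p4);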
CS:APP 3e global edition is notorious for errors in practice problems introduced by the publisher (not the authors), but apparently this code is present in the North American edition, too. So this may be the author's mistake / choice to use actual compiler output with weird options. Unlike some of the bad global edition practice problems, this code could have come unmodified from some GCC version, but only with non-standard options.
Related: Why does GCC allocate more space than necessary on the stack, beyond what's needed for alignment? - GCC has a missed-optimization bug where it sometimes reserves an additional 16 bytes that it truly didn't need to. That's not what's happening here, though.
How does gcc decide how much memory to allocate for the stack, and why does it not decrement %rsp anymore when I remove printf() (or any function call) from my main?
1. I noticed, when I played around with a code sample (https://godbolt.org/z/fQqkNE), that the 6th line in the gcc assembly viewer, subq $48, %rsp, gets removed if I remove printf() from my C code on line 22. It looks like when I don't make any function calls from within my main, %rsp does not get decremented, but data still gets allocated based on %rbp and offsets. I thought %rsp changed only when the stack grows. My theory is that since main won't make any other function calls, the compiler knows it won't need to keep stack space for other, nonexistent functions. But shouldn't %rsp still grow as data is getting saved?
2. When adding variables to my rect struct, I also noticed that it sometimes allocates memory in steps greater than the size of the added data type. What convention does it follow when deciding how much stack memory to allocate?
3. Is there an online tool that takes assembly code as input and then draws an image of the stack and shows the state of every register at any point of execution? Godbolt.org is a very good tool; I just wish it had these two extra features.
I'll paste the code below in case the link to godbolt stops working in the future:
#include <stdio.h>
#include <stdint.h>

struct rect {
    int a;
    int b;
    int* c;
    int d[2];
    uint8_t f;
};

int main() {
    int arr[2] = {2, 3};
    struct rect Rect;
    Rect.a = 10;
    Rect.b = 20;
    Rect.c = arr;
    Rect.d[0] = Rect.a;
    Rect.d[1] = Rect.b;
    Rect.f = 255;
    printf("%d and %d", Rect.a, Rect.b);
    return 0;
}
.LC0:
.string "%d and %d"
main:
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl $2, -8(%rbp)
movl $3, -4(%rbp)
movl $10, -48(%rbp)
movl $20, -44(%rbp)
leaq -8(%rbp), %rax
movq %rax, -40(%rbp)
movl -48(%rbp), %eax
movl %eax, -32(%rbp)
movl -44(%rbp), %eax
movl %eax, -28(%rbp)
movb $-1, -24(%rbp)
movl -44(%rbp), %edx
movl -48(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
P.S.: The book I follow uses AT&T syntax for teaching x86, which is weird because it makes finding online tutorials much harder.
I am getting a segfault at the movq (%rsi), %rcx line.
I know you can't do a mem-to-mem mov, so I did it through a temporary register: (%rsi), %rcx, then in the loop %rcx, (%rdi). Here is my code:
experimentMemset: #memset(void *ptr, int value, size_t num)
#%rdi #%rsi #%rdx
movq %rdi, %rax #sets rax to the first pointer, to return later
.loop:
cmp $0, (%rdx) #see if num has reached 0
je .end
cmpb $0, (%rdi) #see if string has ended also
je .end
movq %rsi, %rdi #copies value into rdi
inc %rdi #increments pointer to traverse string
dec %rdx #decrements the count, aka num
jmp .loop
.end:
ret
As you discovered, RDX holds a size (an integer count), not a pointer. It's passed by value, not by reference.
cmp $0, (%rdx)
compares not the register but the location it points to. Since %rdx is used as a counter, you should compare the register itself:
test %rdx,%rdx ; je count_was_zero
There are other bugs, like checking the contents of the write-only destination for zeros, and not storing %sil into (%rdi). But this was the cause of the segfault in the current version of the question.
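To make the intended behaviour concrete, here is a plain C sketch of what the routine should do once those bugs are fixed (this is just my sketch of ordinary memset semantics, not your code):

#include <stddef.h>

/* What experimentMemset(void *ptr, int value, size_t num) should do. */
void *experimentMemset(void *ptr, int value, size_t num)
{
    unsigned char *p = ptr;
    while (num != 0) {                  /* test the count itself, not memory at (%rdx) */
        *p++ = (unsigned char)value;    /* the missing store of the value byte (%sil) */
        num--;
    }
    return ptr;                         /* return the original pointer, as %rax was set up for */
}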
So I've been working on a problem (and before you ask, yes, it is homework, but I've been putting in faithful effort!) where I have some assembly code and want to be able to convert it (as faithfully as possible) to C.
Here is the assembly code:
A1:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $0, -4(%ebp)
jmp .L2
.L4:
movl -4(%ebp), %eax
sall $2, %eax
addl 8(%ebp), %eax
movl (%eax), %eax
cmpl 12(%ebp), %eax
jg .L6
.L2:
movl -4(%ebp), %eax
cmpl 16(%ebp), %eax
jl .L4
jmp .L3
.L6:
nop
.L3:
movl -4(%ebp), %eax
leave
ret
And here's some of the C code I wrote to mimic it:
int A1(int a, int b, int c) {
    int local = 0;
    while (local < c) {
        if (b > (int*)((local << 2) + a)) {
            return local;
        }
    }
    return local;
}
I have a few questions about how assembly works.
First, I notice that in L4, the body of the while loop, nothing is ever assigned to local. It's initialized to be 0 at the start of the function, and then never modified again. Looking at the C code I made for it, though, that seems odd, considering that the loop will go on indefinitely if the if-condition fails. Am I missing something there? I was under the impression that you'd need a snippet of code like:
movl %eax, -4(%ebp)
in order to actually assign anything to the local variable, and I don't see anything like that in the body of the while loop.
Secondly, you'll see that in the assembly code, the only local variable that's declared is "local". Hence, I have to use a snippet of code like:
if(b > (int*)((local << 2) + a))
The output of this line doesn't look much like the assembly code, though, and I think I might have made a mistake. What did I do wrong here?
And finally (thanks for your patience!), on a related note, I understand that the purpose of this if statement in the while loop is to break out if the condition is fulfilled, and then to return local. Hence .L6 and the "nop" (which basically does nothing). However, I don't know how to replicate this in my program. I've tried "break", and I've tried returning local as you see here. I understand the functionality - I just don't know how to replicate it in C (short of using goto, but that kind of defeats the purpose of the exercise...).
Thank you for your time!
This is my guess:
int A1 (int *a, int value, int size)
{
    int i = 0;
    while (i < size)
    {
        if (a[i] <= value)
            break;
    }
    return i;
}
Which, compiled back to assembly, gives me this code:
A1:
.LFB0:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $0, -4(%ebp)
jmp .L2
.L4:
movl -4(%ebp), %eax
leal 0(,%eax,4), %edx
movl 8(%ebp), %eax
addl %edx, %eax
movl (%eax), %eax
cmpl 12(%ebp), %eax
jg .L2
jmp .L3
.L2:
movl -4(%ebp), %eax
cmpl 16(%ebp), %eax
jl .L4
.L3:
movl -4(%ebp), %eax
leave
ret
Now this seems to be identical to your original ASM code, except that the code starting at .L4 is not the same; but if we annotate both versions:
ORIGINAL
movl -4(%ebp), %eax ;EAX = local
sall $2, %eax ;EAX = EAX*4
addl 8(%ebp), %eax ;EAX = EAX+a, hence EAX=a+local*4
ASM-C-ASM
movl -4(%ebp), %eax ;EAX = i
leal 0(,%eax,4), %edx ;EDX = EAX*4
movl 8(%ebp), %eax ;EAX = a
addl %edx, %eax ;EAX = EAX+EDX, hence EAX=a+i*4
Both codes continue with
movl (%eax), %eax
Because of this, I guess a is actually a pointer to some type that uses 4 bytes. From the comparison between the second argument and the value read from memory, I guess that type must be either int or long. I chose int purely for convenience.
Of course, this also means that this code (and the original one) does not make much sense: it lacks the i++ part somewhere. If that is the case, then a is an array and the third argument is the size of the array. I've named my local variable i to follow the tradition of naming index variables that way.
This code would scan the array, looking for an element that is less than or equal to value. If it finds one, the index of that element is returned; if not, the size of the array is returned.
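For completeness, here is the same guess with the missing i++ put back at the end of the loop body (just my assumption about where it belongs):

int A1 (int *a, int value, int size)
{
    int i = 0;
    while (i < size)
    {
        if (a[i] <= value)
            break;
        i++;    /* the increment the original assembly appears to be missing */
    }
    return i;
}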
I am looking at the performance of memchr-like functions and made an interesting observation.
This is check.c with 4 implementations to find the offset of a \n character in a string:
#include <stdlib.h>

size_t mem1(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x == '\n') return (p - s);
        p++;
    }
}

size_t mem2(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x <= '$' && (x == '\n' || x == '\0')) return (p - s);
        p++;
    }
}

size_t mem3(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x == '\n' || x == '\0') return (p - s);
        p++;
    }
}

size_t mem4(const char *s)
{
    const char *p = s;
    while (1)
    {
        const char x = *p;
        if (x <= '$' && (x == '\n')) return (p - s);
        p++;
    }
}
I run these functions on a string of bytes that can be described by the Haskell expression (concat $ replicate 10000 "abcd") ++ "\n" ++ "hello" - that is, 10000 times "abcd", then the newline to find, and then "hello". Of course all 4 implementations return the same offset, 40000, as expected.
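(In case anyone wants to reproduce the input in C rather than Haskell, a helper along these lines builds the same buffer; make_input is just a name I made up here.)

#include <stdlib.h>
#include <string.h>

/* Builds the test input described above: 10000 copies of "abcd",
   then the '\n' to find, then "hello" and a terminating NUL. */
char *make_input(void)
{
    const size_t reps = 10000;
    char *buf = malloc(reps * 4 + 1 + 6);   /* 4*reps + '\n' + "hello" + NUL */
    if (!buf)
        return NULL;
    for (size_t i = 0; i < reps; i++)
        memcpy(buf + i * 4, "abcd", 4);
    buf[reps * 4] = '\n';
    memcpy(buf + reps * 4 + 1, "hello", 6); /* includes the trailing NUL */
    return buf;
}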
Interestingly, when compiled with gcc -O2, the run times on that string are:
mem1: 16 us
mem2: 12 us
mem3: 25 us
mem4: 16 us
(I'm using the criterion library to measure these times with statistical accuracy.)
I cannot explain this to myself. Why is mem2 so much faster than the others?
--
The assembly as generated by gcc -S -O2 -o check.asm check.c:
mem1:
.LFB14:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L9
.L6:
addq $1, %rax
cmpb $10, (%rax)
jne .L6
subq %rdi, %rax
ret
.L9:
xorl %eax, %eax
ret
mem2:
.LFB15:
movq %rdi, %rax
jmp .L13
.L19:
cmpb $10, %dl
je .L14
.L11:
addq $1, %rax
.L13:
movzbl (%rax), %edx
cmpb $36, %dl
jg .L11
testb %dl, %dl
jne .L19
.L14:
subq %rdi, %rax
ret
mem3:
.LFB16:
movzbl (%rdi), %edx
testb %dl, %dl
je .L26
cmpb $10, %dl
movq %rdi, %rax
jne .L27
jmp .L26
.L30:
cmpb $10, %dl
je .L23
.L27:
addq $1, %rax
movzbl (%rax), %edx
testb %dl, %dl
jne .L30
.L23:
subq %rdi, %rax
ret
.L26:
xorl %eax, %eax
ret
mem4:
.LFB17:
cmpb $10, (%rdi)
movq %rdi, %rax
je .L38
.L36:
addq $1, %rax
cmpb $10, (%rax)
jne .L36
subq %rdi, %rax
ret
.L38:
xorl %eax, %eax
ret
Any explanation is greatly appreciated!
My best guess is it's to do with register dependency - if you look at the 3-instruction main loop in mem1, you have a circular dependency on rax. Naïvely, this means each instruction has to wait for the last one to finish - in practice it means if the instructions aren't retired quickly enough the microarchitecture may run out of registers to rename and just give up and stall for a bit.
In mem2 the fact that there are 4 instructions in the loop - and possibly also the fact that there's more of an explicit pipeline in the use of both rax and edx/dl - is probably giving the out-of-order execution hardware an easier time thus it ends up pipelining more efficiently.
I don't claim to be an expert so this may be complete nonsense, but based on what I've studied of Agner Fog's absolute goldmine of Intel optimisation details it doesn't seem an entirely unreasonable hypothesis.
Edit: Out of interest, I've tested mem1 and mem2 on my machine (Core 2 Duo E7500), compiled with -O2 -falign-functions=64 to the exact same assembly code. Calling either function with the given string 1,000,000 times in a loop and using Linux's time, I get ~19s for mem1 and ~18.8s for mem2 - much less than the 25% difference on the newer microarchitecture. Guess it's time to buy an i5...
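(For reference, a driver roughly like the one below is what such a 1,000,000-iteration test can look like; it's only a sketch, not the exact harness used above, and make_input is the hypothetical helper sketched earlier in the question.)

#include <stdlib.h>

size_t mem1(const char *s);   /* from check.c */
char *make_input(void);       /* hypothetical helper building the test string */

int main(void)
{
    char *s = make_input();
    if (!s)
        return 1;

    volatile size_t sink = 0;           /* keeps the calls from being optimised away */
    for (int i = 0; i < 1000000; i++)
        sink += mem1(s);                /* swap in mem2 to time the other version */

    free(s);
    return 0;
}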
Your input is such that it makes mem2 faster. Every letter in the input apart from '\n' has a value larger than '$', so the if condition is false after just the first part of the expression (x <= '$'), and the second part (x == '\n' || x == '\0') is never evaluated. If you used "####" instead of "abcd", I suspect the execution would become slower.
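(A quick check of the character values involved makes this easy to see; the snippet below is just for illustration.)

#include <stdio.h>

int main(void)
{
    /* The values involved in mem2's first test, x <= '$'. */
    printf("'$'=%d  '\\n'=%d  '\\0'=%d  'a'=%d  'd'=%d\n",
           '$', '\n', '\0', 'a', 'd');
    /* Prints: '$'=36  '\n'=10  '\0'=0  'a'=97  'd'=100
       Every byte of "abcd" is greater than '$', so in mem2 the cheap
       x <= '$' comparison fails right away and the tests against
       '\n' and '\0' are skipped on almost every iteration. */
    return 0;
}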
With a cache, the mem1() test takes the brunt of filling the cache.
Run the mem1() test first and then again last, and use the second timing, as it reflects a primed cache like the other tests. I'm confident it will be faster and give a fairer time comparison.