I have a school project about compilers and how the generated assembly differs between Intel x86 and ARMv7, but I'm stuck trying to understand the assembly for the Intel x86 architecture.
The source code is:
int main()
{
int a=5,b=2;
int result;
result = a % b;
printf("Result of 5 modulo 2 is %i\n", result);
}
Assembly output (gcc -masm=intel):
main:
/*
Intel32-x86 Architecture
Little endian
ebp register -- base pointer
esp register -- stack pointer
*/
push ebp ; ebp register put on stack
mov ebp, esp ; Copy esp into ebp (Intel syntax: destination first)
and esp, -16 ; Logical AND ??
sub esp, 32 ; Subtraction ??
mov DWORD PTR [esp+20], 5
;5 as 32 bits
;00000101-00000000-00000000-00000000
mov DWORD PTR [esp+24], 2
;2 as 32 bits
;00000010-00000000-00000000-00000000
mov eax, DWORD PTR [esp+20]
mov edx, eax
sar edx, 31
;Shift arithmetic right - shift edx right by 31 bits.
;00000101-00000000-00000000-00000000 BEFORE
;00000000-00000000-00000000-00000000 AFTER
idiv DWORD PTR [esp+24]
;Signed divide - IDIV r/m32 - EDX:EAX register
;Divides EDX:EAX by the value at [esp+24]; the quotient goes in eax and the remainder in edx.
;EDX:EAX 00000000-00000000-00000000-00000000-00000101-00000000-00000000-00000000
mov DWORD PTR [esp+28], edx
mov eax, OFFSET FLAT:.LC0
mov edx, DWORD PTR [esp+28]
mov DWORD PTR [esp+4], edx
mov DWORD PTR [esp], eax
call printf
leave
ret
and esp, -16 ; Logical AND
sub esp, 32 ; Subtraction
What is the purpose of those two instructions?
The purpose is mentioned in the comments:
and esp,-16 ;round esp down to 16 byte boundary
sub esp,32 ;allocate 32 bytes of space for local variables
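The effect of those two instructions on the stack pointer can be modelled in C. This is only a sketch: the function names are made up for illustration, and it assumes a 32-bit stack pointer held in a plain integer.

```c
#include <assert.h>
#include <stdint.h>

/* Round an address down to a multiple of `align` (a power of two).
   -16 in two's complement is 0xFFFFFFF0, so `and esp, -16` clears the low
   four bits; ~(align - 1) builds the same mask. */
static uint32_t align_down(uint32_t addr, uint32_t align)
{
    assert(align != 0 && (align & (align - 1u)) == 0); /* power of two only */
    return addr & ~(align - 1u);
}

/* Hypothetical model of the two prologue instructions. */
static uint32_t prologue_esp(uint32_t esp)
{
    esp = align_down(esp, 16); /* and esp, -16 */
    esp -= 32;                 /* sub esp, 32: room for locals and call arguments */
    return esp;
}
```

Because the stack grows towards lower addresses, rounding down and then subtracting both move esp in the "allocate" direction.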
In case you didn't catch this part about sign extending the dividend:
mov eax, DWORD PTR [esp+20] ; eax = dividend
mov edx, eax ; edx = dividend
sar edx, 31 ; edx = 0 or -1 (the sign extension)
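As a rough C model of the sign extension and the division that follows (the helper names are invented for this sketch, and it ignores the fault idiv raises when the quotient does not fit in 32 bits):

```c
#include <assert.h>
#include <stdint.h>

/* Equivalent of `mov edx, eax` / `sar edx, 31`: every bit of the result is a
   copy of the sign bit, so it is 0 for non-negative input and -1 otherwise. */
static int32_t sign_fill(int32_t eax)
{
    return eax < 0 ? -1 : 0;
}

/* Model of `idiv r/m32`: divide the 64-bit value EDX:EAX by a 32-bit divisor
   and return the remainder (what idiv leaves in edx). */
static int32_t idiv_remainder(int32_t eax, int32_t divisor)
{
    assert(divisor != 0); /* the real instruction raises a divide error */
    int32_t edx = sign_fill(eax);
    /* Rebuild the 64-bit dividend from its high and low halves. */
    int64_t dividend = (int64_t)edx * 4294967296LL + (uint32_t)eax;
    return (int32_t)(dividend % divisor);
}
```

With eax = 5 the high half is 0, so the dividend is simply 5 and the remainder of dividing by 2 is 1, which is the value the program prints.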
In short: try switching foos pointer from 0 to 1 here:
godbolt - compiler explorer link - what is happening?
I was surprised at how many instructions came out of clang when I compiled the following C code, and I noticed that it only happens when the pointer foos is zero (x86-64 clang 12.0.1 with -O2 or -O3).
#include <stdint.h>
typedef uint8_t u8;
typedef uint32_t u32;
typedef struct {
u32 x;
u32 y;
}Foo;
u32 count = 500;
int main()
{
u8 *foos = (u8 *)0;
u32 element_size = 8;
u32 offset = 0;
for(u32 i=0;i<count;i++)
{
u32 *p = (u32 *)(foos + element_size*i);
*p = i;
}
return 0;
}
This is the output when the pointer is zero.
main: # #main
mov r8d, dword ptr [rip + count]
test r8, r8
je .LBB0_6
lea rcx, [r8 - 1]
mov eax, r8d
and eax, 3
cmp rcx, 3
jae .LBB0_7
xor ecx, ecx
jmp .LBB0_3
.LBB0_7:
and r8d, -4
mov esi, 16
xor ecx, ecx
.LBB0_8: # =>This Inner Loop Header: Depth=1
lea edi, [rsi - 16]
and edi, -32
mov dword ptr [rdi], ecx
lea edi, [rsi - 8]
and edi, -24
lea edx, [rcx + 1]
mov dword ptr [rdi], edx
mov edx, esi
and edx, -16
lea edi, [rcx + 2]
mov dword ptr [rdx], edi
lea edx, [rsi + 8]
and edx, -8
lea edi, [rcx + 3]
mov dword ptr [rdx], edi
add rcx, 4
add rsi, 32
cmp r8, rcx
jne .LBB0_8
.LBB0_3:
test rax, rax
je .LBB0_6
lea rdx, [8*rcx]
.LBB0_5: # =>This Inner Loop Header: Depth=1
mov esi, edx
and esi, -8
mov dword ptr [rsi], ecx
add rdx, 8
add ecx, 1
add rax, -1
jne .LBB0_5
.LBB0_6:
xor eax, eax
ret
count:
.long 500 # 0x1f4
Can you please help me understand what is happening here? I don't know assembly very well. The AND with 3 suggests to me that there's some alignment branching. The top part of LBB0_8 looks very strange to me...
This is loop unrolling.
The code first checks if count is greater than 3, and if so, branches to LBB0_7, which sets up loop variables and drops into the loop at LBB0_8. This loop does 4 steps per iteration, as long as there are still 4 or more to do. Afterwards it falls through to the "slow path" at LBB0_3/LBB0_5 that just does one step per iteration.
That slow path is also very similar to what you get when you compile the code with a non-zero value for that pointer.
As for why this happens, I don't know. Initially I was thinking that the compiler proves that a NULL deref will happen inside the loop and optimises on that, but usually that's akin to replacing the loop contents with __builtin_unreachable();, which causes it to throw out the loop entirely. Still can't rule it out, but I've seen the compiler throw out a lot of code many times, so it seems at least unlikely that UB causes this.
Then I thought it might be because a base of 0 requires no additional calculation, but all the compiler would have to change is mov esi, 16 to mov esi, 17, so it would have the same number of instructions.
What's also interesting is that on x86_64, it generates a loop with 4 steps per iteration, whereas on arm64 it generates one with 2 steps per iteration.
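Written by hand in C, the shape of the unrolled code looks roughly like the sketch below. It is an illustration only: fill_unrolled is a made-up name, and it writes into a real buffer instead of address 0. A stride of 2 corresponds to element_size = 8 bytes in the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Store i into every `stride`-th u32 slot. The first loop does four stores
   per iteration; the second finishes the remaining 0..3 stores one at a time.
   `buf` must hold at least count * stride elements. */
static void fill_unrolled(uint32_t *buf, uint32_t count, size_t stride)
{
    uint32_t i = 0;
    uint32_t main_count = count & ~3u;   /* like `and r8d, -4` */

    for (; i < main_count; i += 4) {     /* corresponds to .LBB0_8 */
        buf[(size_t)(i + 0) * stride] = i + 0;
        buf[(size_t)(i + 1) * stride] = i + 1;
        buf[(size_t)(i + 2) * stride] = i + 2;
        buf[(size_t)(i + 3) * stride] = i + 3;
    }
    for (; i < count; i++) {             /* corresponds to .LBB0_5 */
        buf[(size_t)i * stride] = i;
    }
}
```

The remainder loop runs count & 3 times, which is the value the compiled code keeps in eax after and eax, 3.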
Sorry if I am asking a stupid question, but I can't find the answer, probably because of clumsy search terms.
If I declare three variables as follows
volatile uint16_t a, b, c;
Will all three variables be declared volatile?
Or should I really not declare multiple variables in a row but instead do:
volatile uint16_t a;
volatile uint16_t b;
volatile uint16_t c;
If I declare three variables as follows
volatile uint16_t a, b, c;
Will all three variables be declared volatile?
Yes, all 3 variables will be volatile.
Or should I really not declare multiple variables in a row but instead do:
That is a matter of code style and personal preference. Declaring one variable per line is usually preferred: it is easier to read, easier to refactor, and produces cleaner diffs.
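If you would rather have the compiler confirm that the qualifier applies to every declarator, a C11 _Generic check is one way to do it. This is a sketch: the IS_VOLATILE_U16 macro and the extra variables d, e and f are made up for the example, and it needs a C11 compiler.

```c
#include <stdint.h>

/* Evaluates to 1 if `x` is an lvalue of type volatile uint16_t and to 0 if it
   is a plain uint16_t. Taking the address keeps the qualifier visible, since
   _Generic would otherwise drop it from the controlling expression. */
#define IS_VOLATILE_U16(x) _Generic(&(x), \
    volatile uint16_t *: 1,               \
    uint16_t *: 0)

volatile uint16_t a, b, c;   /* the qualifier applies to all three */
volatile uint16_t d;
uint16_t e, f;               /* separate declaration: not volatile */

_Static_assert(IS_VOLATILE_U16(a) && IS_VOLATILE_U16(b) && IS_VOLATILE_U16(c),
               "a, b and c are all volatile");
_Static_assert(IS_VOLATILE_U16(d) && !IS_VOLATILE_U16(e) && !IS_VOLATILE_U16(f),
               "only d is volatile in the split declaration");
```

One thing to watch for: in volatile uint16_t *p, q; the qualifier still applies to both, but the * belongs to p only, so q is a volatile uint16_t rather than a pointer.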
We can check the assembly generated by the compiler to see if it optimizes the variables out or not.
When I check this simple program:
#include <stdio.h>
#include <stdint.h>
int main(void)
{
uint16_t a = 1, b = 1, c = 1;
printf("%hu", a);
printf("%hu", b);
printf("%hu", c);
}
The generated assembly at -O3 (link) is:
.LC0:
.string "%hu"
main:
sub rsp, 8
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
It's obvious here that the variables have been optimized out and 1 is being used as a parameter instead of the variables.
When I replace the uint16_t a = 1, b = 1, c = 1; with volatile uint16_t a = 1, b = 1, c = 1;, the generated assembly (link) is:
main:
sub rsp, 24
mov edx, 1
mov ecx, 1
mov eax, 1
mov WORD PTR [rsp+10], ax
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov WORD PTR [rsp+12], dx
mov WORD PTR [rsp+14], cx
movzx esi, WORD PTR [rsp+10]
call printf
movzx esi, WORD PTR [rsp+12]
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
movzx esi, WORD PTR [rsp+14]
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 24
ret
Here, volatile is working like it should for all variables. The variables are created and are not optimized out.
In comparison, if we replace volatile uint16_t a = 1, b = 1, c = 1; with volatile uint16_t a = 1; uint16_t b = 1, c = 1; we see that only a is not optimized out (link):
main:
sub rsp, 24
mov eax, 1
mov edi, OFFSET FLAT:.LC0
mov WORD PTR [rsp+14], ax
movzx esi, WORD PTR [rsp+14]
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 24
ret
I never learned C, so this confuses me. I would just like to know whether I did it correctly or where I need to improve. For this code I used 32-bit x86 assembly. Thanks.
This is what I am supposed to do:
Write a procedure with the signature
char *strchar(char *s1, char c1)
that returns a pointer to the first occurrence of the character c1 within the string s1 or, if not found, returns a null.
This is what I came up with:
strchar (char*, char):
push ebp
mov ebp, esp
mov dword ptr [ebp-24], edi
mov EAX , esi
mov BYTE PTR [ebp-28], al
.L5:
mov EAX , dword ptr [ebp-24]
movzx EAX , byte ptr [ EAX ]
test AL, AL
je .L2
mov EAX , dword PTR [ebp-24]
movzx EAX , BYTE PTR [ EAX ]
cmp BYTE PTR [ebp-28], al
jne .L3
mov eax, dword PTR [ebp-24]
jmp .L6
.L3:
add dword PTR [ebp-24], 1
jmp .L5
.L2:
LEA eax, [ebp-9]
MOV DWORD PTR [EBP-8], eax
MOV EAX, DWORD PTR [ebp-8]
.L6:
POP EBP
RET
The lines:
mov dword ptr [ebp-24], edi
mov EAX , esi
mov BYTE PTR [ebp-28], al
assume that a stack frame has been allocated for this function, which doesn't appear to be true; I think you should have something like:
sub esp, 32
after the
mov ebp,esp
Also, the three lines after L2 seem confused. The only way to get to L2 is if the NUL (0) byte is found in the string, at which point the code should return a NULL pointer.
The exit path in the code (L6) leaves eax alone, so all that should be needed is:
L2:
mov eax, 0
It might make debugging easier if you kept the alias up to date; so:
L2:
mov eax, 0
mov [ebp-24], eax
Also, the calling convention used here is a bit odd: the string is passed in edi and the character in esi. Normally, in x86-32, these would both be passed on the stack. This looks like it might have been x86-64 code, converted to x86-32....
A final note; this assembly code looks like the output of a compiler with optimisations disabled. Often, generating the assembly with the optimisations enabled generates easier to understand code. This code, for example, could be much more concisely written as below, without even devolving into weird intel ops:
strchar:
mov edx, esi
mov eax, edi
L:
mov dh, [eax]
test dh, dh
jz null
cmp dh, dl
je done
inc eax
jmp L
null:
mov eax, 0
done:
ret
Here is one with stack overhead
[global strchar]
strchar:
push ebp
mov ebp, esp
mov dl, byte [ebp + 12]
mov ecx, dword [ebp + 8]
xor eax, eax
.loop: mov al, [ecx]
or al, al
jz .exit
cmp al, dl
jz .found
add ecx, 1
jmp .loop
.found: mov eax, ecx
.exit:
leave
ret
Here is one without stack overhead
[global strchar]
strchar:
mov dl, byte [esp + 8]
mov ecx, dword [esp + 4]
xor eax, eax
.loop: mov al, [ecx]
or al, al
jz .exit
cmp al, dl
jz .found
add ecx, 1
jmp .loop
.found: mov eax, ecx
.exit:
ret
These are using the 'cdecl' calling convention. For 'stdcall' change the last 'ret' to 'ret 8'.
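For comparison, the behaviour that the assembly versions above are meant to implement can be sketched in C. This is a reference sketch for testing against, not the assignment's official solution.

```c
#include <stddef.h>

/* Returns a pointer to the first occurrence of c1 in s1, or NULL if the
   terminator is reached first. Unlike the standard strchr, searching for
   '\0' returns NULL here, because the assembly versions above test for the
   terminator before comparing against c1. */
char *strchar(char *s1, char c1)
{
    for (; *s1 != '\0'; s1++) {
        if (*s1 == c1)
            return s1;
    }
    return NULL;
}
```

Calling the assembly routine and this function with the same inputs, including an empty string and a character that is absent, is a quick way to check the NULL path that the original attempt got wrong.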
I got a task to write an assembly routine that can be called from C and is declared as follows:
extern int solve_equation(long int a, long int b,long int c, long int *x, long int *y);
that finds a solution to the equation
a * x + b * y = c
in the range -2147483648 < x, y < 2147483647 by checking all options.
The value returned from the routine will be 1 if a solution is found and 0 otherwise.
You must take into consideration that the results of the calculations: a * x, b * y, a * x + b * y can exceed 32 bits.
.MODEL SMALL
.DATA
C DQ ?
SUM DQ 0
MUL1 DQ ?
MUL2 DQ ?
X DD ?
Y DD ?
.CODE
.386
PUBLIC _solve_equation
_solve_equation PROC NEAR
PUSH BP
MOV BP,SP
PUSH SI
MOV X,-2147483648
MOV Y,-2147483648
MOV ECX,4294967295
FOR1:
CMP ECX,0
JE FALSE
PUSH ECX
MOV Y,-2147483648
MOV ECX,4294967295
FOR2:
MOV SUM,0
CMP ECX,0
JE SET_FOR1
MOV EAX,DWORD PTR [BP+4]
IMUL X
MOV DWORD PTR MUL1,EAX
MOV DWORD PTR MUL1+4,EDX
MOV EAX,DWORD PTR [BP+8]
IMUL Y
MOV DWORD PTR MUL2,EAX
MOV DWORD PTR MUL2+4,EDX
MOV EAX, DWORD PTR MUL1
ADD DWORD PTR SUM,EAX
MOV EAX, DWORD PTR MUL2
ADD DWORD PTR SUM,EAX
MOV EAX, DWORD PTR MUL1+4
ADD DWORD PTR SUM+4,EAX
MOV EAX, DWORD PTR MUL2+4
ADD DWORD PTR SUM+4,EAX
CMP SUM,-2147483648
JL SET_FOR2
CMP SUM,2147483647
JG SET_FOR2
MOV EAX,DWORD PTR [BP+12]
CMP DWORD PTR SUM,EAX
JE TRUE
SET_FOR2:
DEC ECX
INC Y
JMP FOR2
SET_FOR1:
POP ECX
DEC ECX
JMP FOR1
FALSE:
MOV AX,0
JMP SOF
TRUE:
MOV SI,WORD PTR [BP+16]
MOV EAX,X
MOV DWORD PTR [SI],EAX
MOV SI,WORD PTR [BP+18]
MOV EAX,Y
MOV DWORD PTR [SI],EAX
MOV AX,1
SOF:
POP SI
POP BP
RET
_solve_equation ENDP
END
Is this the right way to work with large numbers?
I get "argument to operation or instruction has illegal size" when I try to do:
MOV SUM,0
CMP SUM,-2147483648
CMP SUM,2147483647
main code:
int main()
{
long int x, y, flag;
flag = solve_equation(-5,4,2147483647,&x, &y);
if (flag == 1)
printf("%ld*%ld + %ld*%ld = %ld\n", -5L,x,4L,y,2147483647);
return 0;
}
output
-5*-2147483647 + 4*-2147483647 = 2147483647
I'm using DOSBox 0.74 and tcc.
You're using 16-bit code, so 64-bit operand-size isn't available. Your assembler magically associates a size with sum, because you defined it with sum dq 0.
So mov sum, 0 is equivalent to mov qword ptr [sum], 0, which of course won't assemble in 16 or 32-bit mode; you can only operate on up to 32 bits at once with integer operations.
(32-bit operand-size is available in 16-bit mode on 386-compatible CPUs, using the same machine encodings that allow 16-bit operand size in 32-bit mode. But 64-bit operand size is only available in 64-bit mode. Unlike 386, AMD64 didn't add any new prefixes or anything to previously-existing modes, for various reasons.)
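Working 32 bits at a time means a 64-bit addition has to carry from the low half into the high half; on x86 that is an add on the low dwords followed by adc on the high dwords. Here is a sketch of the same idea in C, where the u64_halves type and add64 name are invented for the example:

```c
#include <stdint.h>

/* A 64-bit value stored as two 32-bit halves, the way it sits in memory
   at SUM and SUM+4. Hypothetical helper type for this sketch. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

static u64_halves add64(u64_halves x, u64_halves y)
{
    u64_halves r;
    r.lo = x.lo + y.lo;             /* add: low halves, wraps modulo 2^32 */
    uint32_t carry = r.lo < x.lo;   /* 1 if the low addition wrapped around */
    r.hi = x.hi + y.hi + carry;     /* adc: high halves plus the carry */
    return r;
}
```

Adding the high halves with a plain add instead of adc drops that carry and gives a wrong result whenever the low halves overflow.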
You could zero the whole 64-bit sum with an SSE store, or even compare with SSE4.2 pcmpgtq, but that's probably not what you want.
It looks like you want to check if 64-bit sum fits in 32 bits. (i.e. if it is a sign-extended 32-bit integer).
So really you just need to check that all 32 high bits are the same and match bit 31 of the low half.
mov eax, dword ptr [sum]
cdq ; sign extend eax into edx:eax
; i.e. copy bit 31 of EAX to all bits of EDX
cmp edx, dword ptr [sum+4]
je small_sum
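The same check can be written out in C to make the logic easier to follow. This is a sketch with a made-up function name; lo and hi stand for the dwords at sum and sum+4.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the 64-bit value hi:lo is the sign extension of its low 32 bits,
   i.e. it is representable as a signed 32-bit integer. */
static bool fits_in_i32(uint32_t lo, uint32_t hi)
{
    /* cdq: copy bit 31 of the low half into every bit of the high half */
    uint32_t sign_ext = (lo & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    /* cmp edx, [sum+4] / je small_sum */
    return hi == sign_ext;
}
```

For example, hi:lo = 0:0x80000000 is +2147483648, which does not fit, while 0xFFFFFFFF:0x80000000 is -2147483648, which does.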
I'm trying to convert a 32-bit float to a 64-bit double in asm on the x86 architecture. The conversion is done by a function written in asm, which I then want to call from C. I have no idea what I'm doing wrong, but the memory pointed to by dst seems to stay untouched, and the program crashes after printf. I want to do it without any floating-point instructions. Here's the code:
.686
.model flat
public _conv
.data
mantissa_mask dd 00000000011111111111111111111111b
exponent_mask dd 01111111100000000000000000000000b
.code
_conv PROC
pusha
mov ebp, esp
mov esi, dword ptr [ebp+8] ; src
mov edi, dword ptr [ebp+12]; dst
mov dword ptr [edi], 0
mov dword ptr [edi+4], 0
mov eax, dword ptr [esi]
and eax, dword ptr mantissa_mask
mov dword ptr [edi], eax
xor edx, edx ; zero edx
mov ecx, 1
shl ecx, 29 ;ecx == 2^29
mul ecx ;so it's like `shl edx:eax, 29`
mov dword ptr [edi], eax
mov dword ptr [edi+4], edx
mov eax, dword ptr [esi]
and eax, dword ptr exponent_mask
shr eax, 23 ;put exponent on lowest bits
sub eax, 127 ;exponent in float is coded enlarged by 127
add eax, 1023 ;in double it's enlarged by 1023
shl eax, 20 ;exponent in double starts at bit 20 of the 2nd dword
or dword ptr [edi], eax
;sign bit:
bt dword ptr [esi], 31
jc set_sign_bit
btr dword ptr [edi+4], 31
jmp endthis
set_sign_bit:
bts dword ptr [edi+4], 31
endthis:
popa
ret
_conv ENDP
END
And the C code:
void conv(float * src, double * dst);
int main()
{
float src = 4.5f;
double dst = 0.;
conv(&src, &dst);
printf("%f\n", dst);
return 0;
}
Your primary problem is accessing the arguments. Since you did pusha the arguments are not at [ebp+8] and [ebp+12], rather at [ebp+36] and [ebp+40]. A debugger would have shown you this right away. Even with those changes your code is still broken though.
OK, it finally works. Jester's advice about argument access was very helpful. A stupid mistake, but hard to notice. Here's the final code:
.686
.model flat
public _conv
.data
mantissa_mask dd 00000000011111111111111111111111b
exponent_mask dd 01111111100000000000000000000000b
.code
_conv PROC
pusha
mov ebp, esp
;+36 and +40 since pusha
mov esi, dword ptr [ebp+36]; src
mov edi, dword ptr [ebp+40]; dst
mov dword ptr [edi], 0
mov dword ptr [edi+4], 0
;mantissa:
mov eax, dword ptr [esi]
and eax, dword ptr mantissa_mask
mov dword ptr [edi], eax
xor edx, edx ; zero edx
mov ecx, 1
shl ecx, 29 ;ecx == 2^29
mul ecx ;so it's like `shl edx:eax, 29`
mov dword ptr [edi], eax
mov dword ptr [edi+4], edx
;exponent:
mov eax, dword ptr [esi]
and eax, dword ptr exponent_mask
shr eax, 23 ;put exponent on lowest bits
sub eax, 127 ;exponent in float is coded enlarged by 127
add eax, 1023 ;in double it's enlarged by 1023
shl eax, 20 ;exponent in double starts at bit 20 of the 2nd dword
or dword ptr [edi+4], eax
;sign bit:
bt dword ptr [esi], 31
jc set_sign_bit
btr dword ptr [edi+4], 31
jmp endthis
set_sign_bit:
bts dword ptr [edi+4], 31
endthis:
popa
ret
_conv ENDP
END
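The same bit manipulation can be sketched in C, which is handy for checking the assembly's output against known values. The function name is made up, and the sketch assumes IEEE-754 binary32 float and binary64 double. Like the assembly above, it only handles normal numbers: zero, denormals, infinities and NaN would need special-casing because their exponent field must not simply be re-biased.

```c
#include <stdint.h>
#include <string.h>

/* Widen a float to a double using integer operations only.
   Normal numbers only (see the note above). */
static double float_to_double_bits(float src)
{
    uint32_t f;
    memcpy(&f, &src, sizeof f);                 /* read the raw float bits */

    uint64_t sign     = (uint64_t)(f >> 31) << 63;
    /* Re-bias the 8-bit exponent from 127 to 1023 and move it to bits 52..62
       (bit 20 of the high dword, as in the assembly). */
    uint64_t exponent = (uint64_t)(((f >> 23) & 0xFFu) - 127u + 1023u) << 52;
    /* 23 mantissa bits become the top of the 52-bit field: shift left by 29. */
    uint64_t mantissa = (uint64_t)(f & 0x007FFFFFu) << 29;

    uint64_t d = sign | exponent | mantissa;
    double dst;
    memcpy(&dst, &d, sizeof dst);               /* reinterpret as a double */
    return dst;
}
```

For 4.5f (0x40900000) this produces 0x4012000000000000, which is 4.5 as a double.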