I'm trying to store four 32-bit float variables into a 128-bit float variable, here is the asm code:
; function signature is:
; int findMin()
section .data
fa: dd 10.0
fb: dd 20.0
fc: dd 12.0
fd: dd 9.0
section .bss
fl_var: reso 1
section .text
global findMin
findMin:
; -------------------------------------------
; Entrace sequence
; -------------------------------------------
push ebp ; save base pointer
mov ebp, esp ; point to current stack frame
push ebx ; save general registers
push ecx
push edx
push esi
push edi
fld dword [fa]
fstp dword [fl_var]
fld dword [fb]
fstp dword [fl_var + 4]
fld dword [fc]
fstp dword [fl_var + 8]
fld dword [fd]
fstp dword [fl_var + 12]
mov eax, 0
; ------------------------------------------
; Exit sequence
; ------------------------------------------
pop edi
pop esi
pop edx
pop ecx
pop ebx
mov esp, ebp
pop ebp
ret
For each fn float variable, I use the st0 register to move it into the fl_var, usign a proper offset.
The code doesn't work, and the problem is in the declaration of the fn float variables. If I try to print them using the gdb I get this:
f/fh &fa //print a 32-bit float
0x804a020 <fa>: 0
The same happen for the others fn variables.
If write:
f/fw &fa //print a 64-bit float
0x804a020 <fa>: 10
It print the right value! The same happen for the others fn variables.
So, even if I declare fa to be a dd (4 bytes), it creates a dq (8 bytes). There is a way to declare a 32-bit float in asm?
I've found the type real4, but the compiler complains.
EDIT
Another question regarding the gdb and the way it prints variables. Look at this code:
;int findMin(float a, float b, float c, float d)
section .bss
int_var: reso 2
section .text
global findMin
findMin:
fld qword [ebp + 8]
fstp qword [int_var]
fld qword [ebp + 16]
fstp qword [int_var + 8]
fld qword [ebp + 24]
fstp qword [int_var + 16]
fld qword [ebp + 32]
fstp qword [int_var + 24]
I call it from a C piece of code, because of the C calling conventions, float are passed as 64-bit values. So I take each argument moving 8 bytes each time from the stack.
When I enter the gdb and print the value of int_var, I get the correct result using:
x/4fw &int_var
If I write
x/4fg &int_var
I don't get the numbers passed to the function, but inconsistent values.
Why it is printing correctly using the w letter (32-bit)?
Related
Function 1.
It is a pointer function.
char *abc(unsigned int a, unsigned int b)
{
//do something here ...
}
Function 2
Leveraged the function 1 into function 2.
I am trying to store the abc function into an array, however I am getting the error as : error: assignment to expression with array type.
fun2()
{
unsigned int x, y;
x= 5, y=6;
char *array1;
char array2;
for(i=0; i<3; i++)
{
array2[i] = abc(x, y);
}
}
You can't store the invocation of a function in C since it would defeat many existing popular optimizations involving register parameters passing - see because normally parameters are assigned their argument values immediately before the execution flow is transferred to the calling site - compilers may choose to use the registers to store those values but as it stands those registers are volatile and so if we were to delay the actual call they would be overwritten at said later time - possibly even by another call to some function which also have its arguments passed as registers. A solution - which I've personally implemented - is to have a function simulate the call for you by re-assigning to the proper registers and any further arguments - to the stack. In this case you store the argument values in a flat memory. But this must be done in assembly exclusively for this purpose and specific to your target architecture. On the other hand if your architecture is not using any such optimizations - it could be quite easier but still hand written assembly would be required.
In any case this is not a feature the standard (or even pre standard as far as I know) C has implemented anytime.
For example this is an implementation for x86-64 I've wrote some time ago (for MSVC masm assembler):
PUBLIC makeuniquecall
.data
makeuniquecall_jmp_table dq zero_zero, one_zero, two_zero, three_zero ; ordinary
makeuniquecall_jmp_table_one dq zero_one, one_one, two_one, three_one ; single precision
makeuniquecall_jmp_table_two dq zero_two, one_two, two_two, three_two ; double precision
.code
makeuniquecall PROC
;rcx - function pointer
;rdx - raw argument data
;r8 - a byte array specifying each register parameter if it's float and the last qword is the size of the rest
push r12
push r13
push r14
mov r12, rcx
mov r13, rdx
mov r14, r8
; first store the stack vars
mov rax, [r14 + 4] ; retrieve size of stack
sub rsp, rax
mov rdi, rsp
xor rdx, rdx
mov r8, 8
div r8
mov rcx, rax
mov rsi, r13
;add rsi, 32
rep movs qword ptr [rdi], qword ptr [rsi]
xor r10,r10
cycle:
mov rax, r14
add rax, r10
movzx rax, byte ptr [rax]
test rax, rax
jnz jmp_one
lea rax, makeuniquecall_jmp_table
jmp qword ptr[rax + r10 * 8]
jmp_one:
cmp rax, 1
jnz jmp_two
lea rax, makeuniquecall_jmp_table_one
jmp qword ptr[rax + r10 * 8]
jmp_two:
lea rax, makeuniquecall_jmp_table_two
jmp qword ptr[rax + r10 * 8]
zero_zero::
mov rcx, qword ptr[r13+r10*8]
jmp continue
one_zero::
mov rdx, qword ptr[r13+r10*8]
jmp continue
two_zero::
mov r8, qword ptr[r13+r10*8]
jmp continue
three_zero::
mov r9, qword ptr[r13+r10*8]
jmp continue
zero_one::
movss xmm0, dword ptr[r13+r10*8]
jmp continue
one_one::
movss xmm1, dword ptr[r13+r10*8]
jmp continue
two_one::
movss xmm2, dword ptr[r13+r10*8]
jmp continue
three_one::
movss xmm3, dword ptr[r13+r10*8]
jmp continue
zero_two::
movsd xmm0, qword ptr[r13+r10*8]
jmp continue
one_two::
movsd xmm1, qword ptr[r13+r10*8]
jmp continue
two_two::
movsd xmm2, qword ptr[r13+r10*8]
jmp continue
three_two::
movsd xmm3, qword ptr[r13+r10*8]
continue:
inc r10
cmp r10, 4
jb cycle
mov r14, [r14 + 4] ; retrieve size of stack
call r12
add rsp, r14
pop r14
pop r13
pop r12
ret
makeuniquecall ENDP
END
And your code will look something like this:
#include <stdio.h>
char* abc(unsigned int a, unsigned int b)
{
printf("a - %d, b - %d\n", a, b);
return "return abc str\n";
}
extern makeuniquecall();
main()
{
unsigned int x, y;
x = 5, y = 6;
#pragma pack(4)
struct {
struct { char maskargs[4]; unsigned long long szargs; } invok;
char *(*pfunc)();
unsigned long long args[2], shadow[2];
} array2[3];
#pragma pack(pop)
for (int i = 0; i < 3; i++)
{
memset(array2[i].invok.maskargs, 0, sizeof array2[i].invok.maskargs); // standard - no floats passed
array2[i].invok.szargs = 8 * 4; //consider shadow space
array2[i].pfunc = abc;
array2[i].args[0] = x;
array2[i].args[1] = y;
}
//now do the calls
for (int i = 0; i < 3; i++)
printf("%s\n", ((char *(*)())makeuniquecall)(array2[i].pfunc, array2[i].args, &array2[i].invok));
}
You'll probably not need that for your specific case you will get away with simply storing each argument and calling the function directly - i.e. (plus this method won't be x86-64 specific):
//now do the calls
for (int i = 0; i < 3; i++)
printf("%s\n", array2[i].pfunc(array2[i].args[0], array2[i].args[1]));
But mine implementation gives you the flexibility to store different amount of arguments for each call.
Note consider this guide for running above examples on msvc (since it requires to add asm file for the assembly code).
I love such noob questions since they make you think about why x-y feature doesn't actually exist in the language.
Here is a basic program I written on the godbolt compiler, and it's as simple as:
#include<stdio.h>
void main()
{
int a = 10;
int *p = &a;
printf("%d", *p);
}
The results after compilation I get:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-12], 10
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
Question: Pushing the rbp, making the stack frame by making a 16 byte block, how from a register, a value is moved to a stack location and vice versa, how the job of LEA is to figure out the address, I got this part.
Problem:
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
Lea -> getting address of rbp-12 into rax,
then moving the value which is the address of rbp-12 into rax,
but next line again says, move to rax, the value of rbp-8. This seems ambiguous. Then again moving the value of rax to eax. I don't understand the amount of work here. Why couldn't I have done
lea rax, [rbp-12]
mov QWORD PTR [rbp-8], rax
mov eax, QWORD PTR [rbp-8]
and be done with it? coz on the original line, rbp-12's address is stored onto rax, then rax stored to rbp-8. then rbp-8 stored again into rax, and then again rax is stored into eax? couldn't we have just copied the rbp-8 directly to eax? i guess not. But my question is why?
I know there is de-referencing in pointers, so How LEA helps grabbing the address of rbp-12, I understand, but on the next parts, when did it went from grabbing values from addresses I completely lost. And also, after that, I didn't understand any of the asm lines.
You're seeing very un-optimized code. Here's my line-by-line interpretation:
.LC0:
.string "%d" ; Format string for printf
main:
push rbp ; Save original base pointer
mov rbp, rsp ; Set base pointer to beginning of stack frame
sub rsp, 16 ; Allocate space for stack frame
mov DWORD PTR [rbp-12], 10 ; Initialize variable 'a'
lea rax, [rbp-12] ; Load effective address of 'a'
mov QWORD PTR [rbp-8], rax ; Store address of 'a' in 'p'
mov rax, QWORD PTR [rbp-8] ; Load 'p' into rax (even though it's already there - heh!)
mov eax, DWORD PTR [rax] ; Load 32-bit value of '*p' into eax
mov esi, eax ; Load value to print into esi
mov edi, OFFSET FLAT:.LC0 ; Load format string address into edi
mov eax, 0 ; Zero out eax (not sure why -- likely printf call protocol)
call printf ; Make the printf call
nop ; No-op (not sure why)
leave ; Remove the stack frame
ret ; Return
Compilers, when not optimizing, generate code like this as they parse the code you gave them. It's doing a lot of unnecessary stuff, but it is quicker to generate and makes using a debugger easier.
Compare this with the optimized code (-O2):
.LC0:
.string "%d" ; Format string for printf
main:
mov esi, 10 ; Don't need those variables -- just a 10 to pass to printf!
mov edi, OFFSET FLAT:.LC0 ; Load format string address into edi
xor eax, eax ; It's a few cycles faster to xor a register with itself than to load an immediate 0
jmp printf ; Just jmp to printf -- it will handle the return
The optimizer found that the variables weren't necessary, so no stack frame is created. Nothing is left but the printf call! And that's done as a jmp since nothing else need be done here when the printf is complete.
This is probably my final hurdle in learning x86 assembly language.
The following subroutine is giving me a segmentation fault:
;=================================================================
; RemCharCodeFromAToB - removes all chars between a and e from str
; arguments:
; str - string to be processed
; a - start
; e - end
; return value:
; n/a
;-------------------------------------------------------------------
RemCharCodeFromAToB:
; standard entry sequence
push ebp ; save the previous value of ebp for the benefi$
mov ebp, esp ; copy esp -> ebp so that ebp can be used as a $
; accessing arguments
; [ebp + 0] = old ebp stack frame
; [ebp + 4] = return address
mov edx, [ebp + 8] ; string address
while_loop_rcc:
mov cl, [edx] ; obtain the address of the 1st character of the string
cmp cl, 0 ; check the null value
je while_loop_exit_rcc ; exit if the null-character is reached
mov al, cl ; save cl
mov cl, [ebp + 16] ; end-char
push cx ; push end-char
mov cl, [ebp + 12] ; start-char
push cx ; push start-char
push ax; ; push ch
call IsBetweenAandB
add esp, 12
cmp eax, 0 ; if(ch is not between 'a' and 'e')
je inner_loop_exit_rcc
mov eax, edx ; copy the current address
inner_loop_rcc:
mov cl, [eax+1]
cmp cl, 0
je inner_loop_exit_rcc
mov [eax], cl
inc eax
jmp inner_loop_rcc
inner_loop_exit_rcc:
inc edx ; increment the address
jmp while_loop_rcc ; start the loop again
while_loop_exit_rcc:
; standard exit sequence
mov esp, ebp ; restore esp with ebp
pop ebp ; remove ebp from stack
ret ; return the value of temporary variable
;===================================================================
I am suspecting that there is something wrong with data conversions from 32-bit to 8-bit registers and vice-versa. My concept regarding this is not clear yet.
Or, is there something wrong in the following part
mov al, cl ; save cl
mov cl, [ebp + 16] ; end-char
push cx ; push end-char
mov cl, [ebp + 12] ; start-char
push cx ; push start-char
push ax; ; push ch
call IsBetweenAandB
add esp, 12
?
Full asm code is here.
C++ code is here.
Makefile is here.
cx and ax are 16-bit registers, so your push cx ; push cx; push ax are pushing 16-bit values on the stack, a total of 6 bytes. But IsBetweenAandB is apparently expecting 32-bit values, and you add 12 to esp at the end (instead of 6). So you probably wanted push ecx etc.
Also, you probably want to zero out eax and ecx before using them. As it stands, they probably contain garbage initially, and you only load useful data into the low 8 bits al and cl. Thus when IsBetweenAandB tries to compare the full 32-bit values, you are going to get false results. Or else you want to rewrite IsBetweenAandB to only compare the low bytes that you care about.
Summary: We have an int variable and 4 double arrays in C, 2 of which hold input data and 2 of which we want to write output data to. We pass the variable and arrays to a function in an external .asm file, where the input data is used to determine output data and the output data is written to the output arrays.
Our problem is, that the output arrays remain seemingly untouched after the assembly routine finishes its work. We don't even know, if the routine reads the correct input data. Where did we go wrong?
We compile with the following commands
nasm -f elf32 -o calculation.o calculation.asm
gcc -m32 -o programm main.c calculation.o
If you need any more information, feel free to ask.
C code:
// before int main()
extern void calculate(int32_t counter, double radius[], double distance[], double paper[], double china[]) asm("calculate");
// in int main()
double radius[counter];
double distance[counter];
// [..] Write Input Data to radius & distance [...]
double paper[counter];
double china[counter];
for(int i = 0; i < counter; i++) {
paper[i] = 0;
china[i] = 0;
}
calculate(counter, radius, distance, paper, china);
// here we expect paper[] and china[] to contain output data
Our Assembly code currently only takes in the values, puts them into the FPU, then places them into the output array.
x86 Assembly (Intel Syntax) (I know, this code looks horrible, but we're beginners, so bear with us, please; Also I can't get syntax highlighting to work correctly for this one):
BITS 32
GLOBAL calculate
calculate:
SECTION .BSS
; declare all variables
pRadius: DD 0
pDistance: DD 0
pPaper: DD 0
pChina: DD 0
numItems: DD 0
counter: DD 0
; populate them
POP DWORD [numItems]
POP DWORD [pRadius]
POP DWORD [pDistance]
POP DWORD [pPaper]
POP DWORD [pChina]
SECTION .TEXT
PUSH EBX ; because of cdecl calling convention
JMP calcLoopCond
calcLoop:
; get input array element
MOV EBX, [counter]
MOV EAX, [pDistance]
; move it into fpu, then move it to output
FLD QWORD [EAX + EBX * 8]
MOV EAX, [pPaper]
FSTP QWORD [EAX + EBX * 8]
; same for the second one
MOV EAX, [pRadius]
FLD QWORD [EAX + EBX * 8]
MOV EAX, [pChina]
FSTP QWORD [EAX + EBX * 8]
INC EBX
MOV [counter], EBX
calcLoopCond:
MOV EAX, [counter]
MOV EBX, [numItems]
CMP EAX, EBX
JNZ calcLoop
POP EBX
RET
There are a couple of problems in the assembler routine. The POP instructions are emitted into the .bss section, so they are never executed. In the sequence of POPs, the return address (pushed by the caller) is not accounted for. Depending on the ABI, you must leave the arguments on the stack anyway. Because the POPs are never executed, the loop exit condition always happens to be true.
And you really should not use global variables this way. Instead, create a stack frame and use that.
Thanks to all your answers and comments and some heavy research, we managed to finally produce functioning code, which now properly uses stack frames and fulfills the cdecl calling convention:
BITS 32
GLOBAL calculate
SECTION .DATA
electricFieldConstant DQ 8.85e-12
permittivityPaper DQ 3.7
permittivityChina DQ 7.0
SECTION .TEXT
calculate:
PUSH EBP
MOV EBP, ESP
PUSH EBX
PUSH ESI
PUSH EDI
MOV ECX, 0 ; counter for loop
JMP calcLoopCond
calcLoop:
MOV EAX, [EBP + 12]
FLD QWORD [EAX + ECX * 8]
MOV EAX, [EBP + 20]
FSTP QWORD [EAX + ECX * 8]
MOV EAX, [EBP + 16]
FLD QWORD [EAX + ECX * 8]
MOV EAX, [EBP + 24]
FSTP QWORD [EAX + ECX * 8]
ADD ECX, 1 ; increase loop counter
calcLoopCond:
MOV EDX, [EBP + 8]
CMP ECX, EDX
JNZ calcLoop
POP EDI
POP ESI
POP EBX
MOV ESP, EBP
POP EBP
RET
I'm trying to learn how to work with SSE and I decided to realize a simple code that computes n^d, using a function that gets called by a C program.
Here's my NASM code:
section .data
resmsg: db '%d^%d = %d', 0
section .bss
section .text
extern printf
; ------------------------------------------------------------
; Function called from a c program, I only use n and d parameters but I left the others
; ------------------------------------------------------------
global main
T equ 8
n equ 12
d equ 16
m equ 20
Sid equ 24
Sn equ 28
main:
; ------------------------------------------------------------
; Function enter sequence
; ------------------------------------------------------------
push ebp ; save Base Pointer
mov ebp, esp ; Move Base Point to current frame
sub esp, 8 ; reserve space for two local vars
push ebx ; save some registries (don't know if needed)
push esi
push edi
; ------------------------------------------------------------
; copy function's parameters to registries from stack
; ------------------------------------------------------------
mov eax, [ebp+T] ; T
mov ebx, [ebp+n] ; n
mov ecx, [ebp+d] ; d
mov edx, [ebp+m] ; m
mov esi, [ebp+Sid] ; Sid
mov edi, [ebp+Sn] ; Sn
mov [ebp-8], ecx ; copy ecx into one of the local vars
;
; pow is computed by doing n*n d times
;
movss xmm0, [ebp+n] ; base
movss xmm1, [ebp+n] ; another copy of the base because xmm0 will be overwritten by the result
loop: mulss xmm0, xmm1 ; scalar mult from sse
dec ecx ; counter--
cmp ecx,0 ; check if counter is 0 to end loop
jnz loop ;
;
; let's store the result in eax by moving it to the stack and then copying to the registry (we use the other local var as support)
;
movss [ebp-4], xmm0
mov eax, [ebp-4]
;
; Print using C's printf
;
push eax ; result
mov ecx, [ebp-8] ; copy the original d back since we used it as loop's counter
push ecx ; exponent
push ebx ; base
push resmsg ; string format
call printf ; printf call
add esp, 24 ; clean the stack from both our local and printf's vars
; ------------------------------------------------------------
; Function exit sequence
; ------------------------------------------------------------
pop edi ; restore the registries
pop esi
pop ebx
mov esp, ebp ; restore the Stack Pointer
pop ebp ; restore the Base Pointer
ret ; get back to C program
Now, what I'd expect is it to print
4^2 = 16
but, instead, I got
4^2 = 0
I've spent my whole afternoon on this and I couldn't find a solution, do you have any hints?
EDIT:
Since it seems a format problem, I tried converting the data using
movss [ebp-4], xmm0
fld dword [ebp-4]
mov eax, dword [ebp-4]
instead of
movss [ebp-4], xmm0
mov eax, [ebp-4]
but I got the same result.
MOVSS moves single precision floats (32-bit). I assume that n is an integer so you can't load it into a XMM register with MOVSS. Use CVTSI2SS instead. printf cannot process single precision floats, which would converted to doubles by the compiler. It's convenient to use CVTSS2SI at this point. So the code should look like:
...
;
; pow is computed by doing n*n d times
;
cvtsi2ss xmm0, [ebp+n] ; load integer
sub ecx, 1 ; first step (n^1) is done
cvtsi2ss xmm1, [ebp+n] ; load integer
loop:
mulss xmm0, xmm1 ; scalar mult from sse
sub ecx, 1
jnz loop
cvtss2si eax, xmm0 ; result as integer
;
; Print using C's printf
;
push eax ; result
mov ecx, [ebp-8] ; copy the original d back since we used it as loop's counter
push ecx ; exponent
push ebx ; base
push resmsg ; string format
call printf ; printf call
add esp, 16 ; clean the stack only from printf's vars
...