Arithmetic operations with large numbers in assembly

Arithmetic operations with large numbers in assembly - c

I got a task to write an assembly routine that can be read from C and declared as follows:
extern int solve_equation(long int a, long int b,long int c, long int *x, long int *y);
that finds a solution to the equation
a * x + b * y = c
In -2147483648 <x, y <2147483647 by checking all options.
The value returned from the routine will be 1 if a solution is found and another 0.
You must take into consideration that the results of the calculations: a * x, b * y, a * x + b * y can exceed 32 bits.
.MODEL SMALL
.DATA
C DQ ?
SUM DQ 0
MUL1 DQ ?
MUL2 DQ ?
X DD ?
Y DD ?
.CODE
.386
PUBLIC _solve_equation
_solve_equation PROC NEAR
PUSH BP
MOV BP,SP
PUSH SI
MOV X,-2147483648
MOV Y,-2147483648
MOV ECX,4294967295
FOR1:
CMP ECX,0
JE FALSE
PUSH ECX
MOV Y,-2147483648
MOV ECX,4294967295
FOR2:
MOV SUM,0
CMP ECX,0
JE SET_FOR1
MOV EAX,DWORD PTR [BP+4]
IMUL X
MOV DWORD PTR MUL1,EAX
MOV DWORD PTR MUL1+4,EDX
MOV EAX,DWORD PTR [BP+8]
IMUL Y
MOV DWORD PTR MUL2,EAX
MOV DWORD PTR MUL2+4,EDX
MOV EAX, DWORD PTR MUL1
ADD DWORD PTR SUM,EAX
MOV EAX, DWORD PTR MUL2
ADD DWORD PTR SUM,EAX
MOV EAX, DWORD PTR MUL1+4
ADD DWORD PTR SUM+4,EAX
MOV EAX, DWORD PTR MUL2+4
ADD DWORD PTR SUM+4,EAX
CMP SUM,-2147483648
JL SET_FOR2
CMP SUM,2147483647
JG SET_FOR2
MOV EAX,DWORD PTR [BP+12]
CMP DWORD PTR SUM,EAX
JE TRUE
SET_FOR2:
DEC ECX
INC Y
JMP FOR2
SET_FOR1:
POP ECX
DEC ECX
JMP FOR1
FALSE:
MOV AX,0
JMP SOF
TRUE:
MOV SI,WORD PTR [BP+16]
MOV EAX,X
MOV DWORD PTR [SI],EAX
MOV SI,WORD PTR [BP+18]
MOV EAX,Y
MOV DWORD PTR [SI],EAX
MOV AX,1
SOF:
POP SI
POP BP
RET
_solve_equation ENDP
END
Is this the right way to work with large numbers?
I get argument to operation or instruction has illegal size when I try to do:
MOV SUM,0
CMP SUM,-2147483648
CMP SUM,2147483647
main code:
int main()
{
long int x, y, flag;
flag = solve_equation(-5,4,2147483647,&x, &y);
if (flag == 1)
printf("%ld*%ld + %ld*%ld = %ld\n", -5L,x,4L,y,2147483647);
return 0;
}
output
-5*-2147483647 + 4*-2147483647 = 2147483647
I`m using dosbox 0.74 and tcc

You're using 16-bit code, so 64-bit operand-size isn't available. Your assembler magically associates a size with sum, because you defined it with sum dq 0.
So mov sum, 0 is equivalent to mov qword ptr [sum], 0, which of course won't assemble in 16 or 32-bit mode; you can only operate on up to 32 bits at once with integer operations.
(32-bit operand-size is available in 16-bit mode on 386-compatible CPUs, using the same machine encodings that allows 16-bit operand size in 32-bit mode. But 64-bit operand size is only available in 64-bit mode. Unlike 386, AMD64 didn't add any new prefixes or anything to previously-existing modes, for various reasons.)
You could zero the whole 64-bit sum with an SSE store, or even compare with SSE4.2 pcmpgtq, but that's probably not what you want.
It looks like you want to check if 64-bit sum fits in 32 bits. (i.e. if it is a sign-extended 32-bit integer).
So really you just need to check that all 32 high bits are the same and match bit 31 of the low half.
mov eax, dword ptr [sum]
cdq ; sign extend eax into edx:eax
; i.e. copy bit 31 of EAX to all bits of EDX
cmp edx, dword ptr [sum+4]
je small_sum

Related

Understanding Clang's optimization when pointer is zero

In short: try switching foos pointer from 0 to 1 here:
godbolt - compiler explorer link - what is happening?
I was surprised at how many instruction came out of clang when I compiled the following C code. - And I noticed that it only happens when the pointer foos is zero. (x86-64 clang 12.0.1 with -O2 or -O3).
#include <stdint.h>
typedef uint8_t u8;
typedef uint32_t u32;
typedef struct {
u32 x;
u32 y;
}Foo;
u32 count = 500;
int main()
{
u8 *foos = (u8 *)0;
u32 element_size = 8;
u32 offset = 0;
for(u32 i=0;i<count;i++)
{
u32 *p = (u32 *)(foos + element_size*i);
*p = i;
}
return 0;
}
This is the output when the pointer is zero.
main: # #main
mov r8d, dword ptr [rip + count]
test r8, r8
je .LBB0_6
lea rcx, [r8 - 1]
mov eax, r8d
and eax, 3
cmp rcx, 3
jae .LBB0_7
xor ecx, ecx
jmp .LBB0_3
.LBB0_7:
and r8d, -4
mov esi, 16
xor ecx, ecx
.LBB0_8: # =>This Inner Loop Header: Depth=1
lea edi, [rsi - 16]
and edi, -32
mov dword ptr [rdi], ecx
lea edi, [rsi - 8]
and edi, -24
lea edx, [rcx + 1]
mov dword ptr [rdi], edx
mov edx, esi
and edx, -16
lea edi, [rcx + 2]
mov dword ptr [rdx], edi
lea edx, [rsi + 8]
and edx, -8
lea edi, [rcx + 3]
mov dword ptr [rdx], edi
add rcx, 4
add rsi, 32
cmp r8, rcx
jne .LBB0_8
.LBB0_3:
test rax, rax
je .LBB0_6
lea rdx, [8*rcx]
.LBB0_5: # =>This Inner Loop Header: Depth=1
mov esi, edx
and esi, -8
mov dword ptr [rsi], ecx
add rdx, 8
add ecx, 1
add rax, -1
jne .LBB0_5
.LBB0_6:
xor eax, eax
ret
count:
.long 500 # 0x1f4
Can you please help me understand what is happening here? I don't know assembly very well. The AND with 3 suggest to me that there's some alignment branching. The top part of LBB0_8 looks very strange to me...

This is loop unrolling.
The code first checks if count is greater than 3, and if so, branches to LBB0_7, which sets up loop variables and drops into the loop at LBB0_8. This loop does 4 steps per iteration, as long as there are still 4 or more to do. Afterwards it falls through to the "slow path" at LBB0_3/LBB0_5 that just does one step per iteration.
That slow path is also very similar to what you get when you compile the code with a non-zero value for that pointer.
As for why this happens, I don't know. Initially I was thinking that the compiler proves that a NULL deref will happen inside the loop and optimises on that, but usually that's akin to replacing the loop contents with __builtin_unreachable();, which causes it to throw out the loop entirely. Still can't rule it out, but I've seen the compiler throw out a lot of code many times, so it seems at least unlikely that UB causes this.
Then I was thinking maybe the fact that 0 requires no additional calculation, but all it'd have to change was mov esi, 16 to mov esi, 17, so it'd have the same amount of instructions.
What's also interesting is that on x86_64, it generates a loop with 4 steps per iteration, whereas on arm64 it generates one with 2 steps per iteration.

How can I store abc(x, y) which is a pointer function into an array as per the following code sample?

Function 1.
It is a pointer function.
char *abc(unsigned int a, unsigned int b)
{
//do something here ...
}
Function 2
Leveraged the function 1 into function 2.
I am trying to store the abc function into an array, however I am getting the error as : error: assignment to expression with array type.
fun2()
{
unsigned int x, y;
x= 5, y=6;
char *array1;
char array2;
for(i=0; i<3; i++)
{
array2[i] = abc(x, y);
}
}

You can't store the invocation of a function in C since it would defeat many existing popular optimizations involving register parameters passing - see because normally parameters are assigned their argument values immediately before the execution flow is transferred to the calling site - compilers may choose to use the registers to store those values but as it stands those registers are volatile and so if we were to delay the actual call they would be overwritten at said later time - possibly even by another call to some function which also have its arguments passed as registers. A solution - which I've personally implemented - is to have a function simulate the call for you by re-assigning to the proper registers and any further arguments - to the stack. In this case you store the argument values in a flat memory. But this must be done in assembly exclusively for this purpose and specific to your target architecture. On the other hand if your architecture is not using any such optimizations - it could be quite easier but still hand written assembly would be required.
In any case this is not a feature the standard (or even pre standard as far as I know) C has implemented anytime.
For example this is an implementation for x86-64 I've wrote some time ago (for MSVC masm assembler):
PUBLIC makeuniquecall
.data
makeuniquecall_jmp_table dq zero_zero, one_zero, two_zero, three_zero ; ordinary
makeuniquecall_jmp_table_one dq zero_one, one_one, two_one, three_one ; single precision
makeuniquecall_jmp_table_two dq zero_two, one_two, two_two, three_two ; double precision
.code
makeuniquecall PROC
;rcx - function pointer
;rdx - raw argument data
;r8 - a byte array specifying each register parameter if it's float and the last qword is the size of the rest
push r12
push r13
push r14
mov r12, rcx
mov r13, rdx
mov r14, r8
; first store the stack vars
mov rax, [r14 + 4] ; retrieve size of stack
sub rsp, rax
mov rdi, rsp
xor rdx, rdx
mov r8, 8
div r8
mov rcx, rax
mov rsi, r13
;add rsi, 32
rep movs qword ptr [rdi], qword ptr [rsi]
xor r10,r10
cycle:
mov rax, r14
add rax, r10
movzx rax, byte ptr [rax]
test rax, rax
jnz jmp_one
lea rax, makeuniquecall_jmp_table
jmp qword ptr[rax + r10 * 8]
jmp_one:
cmp rax, 1
jnz jmp_two
lea rax, makeuniquecall_jmp_table_one
jmp qword ptr[rax + r10 * 8]
jmp_two:
lea rax, makeuniquecall_jmp_table_two
jmp qword ptr[rax + r10 * 8]
zero_zero::
mov rcx, qword ptr[r13+r10*8]
jmp continue
one_zero::
mov rdx, qword ptr[r13+r10*8]
jmp continue
two_zero::
mov r8, qword ptr[r13+r10*8]
jmp continue
three_zero::
mov r9, qword ptr[r13+r10*8]
jmp continue
zero_one::
movss xmm0, dword ptr[r13+r10*8]
jmp continue
one_one::
movss xmm1, dword ptr[r13+r10*8]
jmp continue
two_one::
movss xmm2, dword ptr[r13+r10*8]
jmp continue
three_one::
movss xmm3, dword ptr[r13+r10*8]
jmp continue
zero_two::
movsd xmm0, qword ptr[r13+r10*8]
jmp continue
one_two::
movsd xmm1, qword ptr[r13+r10*8]
jmp continue
two_two::
movsd xmm2, qword ptr[r13+r10*8]
jmp continue
three_two::
movsd xmm3, qword ptr[r13+r10*8]
continue:
inc r10
cmp r10, 4
jb cycle
mov r14, [r14 + 4] ; retrieve size of stack
call r12
add rsp, r14
pop r14
pop r13
pop r12
ret
makeuniquecall ENDP
END
And your code will look something like this:
#include <stdio.h>
char* abc(unsigned int a, unsigned int b)
{
printf("a - %d, b - %d\n", a, b);
return "return abc str\n";
}
extern makeuniquecall();
main()
{
unsigned int x, y;
x = 5, y = 6;
#pragma pack(4)
struct {
struct { char maskargs[4]; unsigned long long szargs; } invok;
char *(*pfunc)();
unsigned long long args[2], shadow[2];
} array2[3];
#pragma pack(pop)
for (int i = 0; i < 3; i++)
{
memset(array2[i].invok.maskargs, 0, sizeof array2[i].invok.maskargs); // standard - no floats passed
array2[i].invok.szargs = 8 * 4; //consider shadow space
array2[i].pfunc = abc;
array2[i].args[0] = x;
array2[i].args[1] = y;
}
//now do the calls
for (int i = 0; i < 3; i++)
printf("%s\n", ((char *(*)())makeuniquecall)(array2[i].pfunc, array2[i].args, &array2[i].invok));
}
You'll probably not need that for your specific case you will get away with simply storing each argument and calling the function directly - i.e. (plus this method won't be x86-64 specific):
//now do the calls
for (int i = 0; i < 3; i++)
printf("%s\n", array2[i].pfunc(array2[i].args[0], array2[i].args[1]));
But mine implementation gives you the flexibility to store different amount of arguments for each call.
Note consider this guide for running above examples on msvc (since it requires to add asm file for the assembly code).
I love such noob questions since they make you think about why x-y feature doesn't actually exist in the language.

volatile keyword in C, are all variables marked as volatile?

Sorry if I am asking a stupid question, but I can't find the answer due to clumsy search terms I guess
If I declare three variables as follows
volatile uint16_t a, b, c;
Will all three variables be declared volatile?
Or should I really not declare multiple variables in a row but instead do:
volatile uint16_t a;
volatile uint16_t b;
volatile uint16_t c;

If I declare three variables as follows
volatile uint16_t a, b, c;
Will all three variables be declared volatile?
Yes, all 3 variables will be volatile.
Or should I really not declare multiple variables in a row but instead do:
That is related to code style and personal preference. Usually declaring variables one per line is preferred, is more readable, easier to read, easier to refactor and results in more readable changes when browsing diff output of files.

We can check the assembly generated by the compiler to see if it optimizes the variables out or not.
When I check this simple program:
#include <stdio.h>
#include <stdint.h>
int main(void)
{
uint16_t a = 1, b = 1, c = 1;
printf("%hu", a);
printf("%hu", b);
printf("%hu", c);
}
The generated assembly at -O3 (link) is:
.LC0:
.string "%hu"
main:
sub rsp, 8
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
It's obvious here that the variables have been optimized out and 1 is being used as a parameter instead of the variables.
When I replace the uint16_t a = 1, b = 1, c = 1; with volatile uint16_t a = 1, b = 1, c = 1;, The assembly generated (link) is:
main:
sub rsp, 24
mov edx, 1
mov ecx, 1
mov eax, 1
mov WORD PTR [rsp+10], ax
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov WORD PTR [rsp+12], dx
mov WORD PTR [rsp+14], cx
movzx esi, WORD PTR [rsp+10]
call printf
movzx esi, WORD PTR [rsp+12]
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
movzx esi, WORD PTR [rsp+14]
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 24
ret
Here, volatile is working like it should for all variables. The variables are created and are not optimized out.
In comparison, if we replace volatile uint16_t a = 1, b = 1, c = 1; with volatile uint16_t a = 1; uint16_t b = 1, c = 1; we see that only a is not optimized out (link):
main:
sub rsp, 24
mov eax, 1
mov edi, OFFSET FLAT:.LC0
mov WORD PTR [rsp+14], ax
movzx esi, WORD PTR [rsp+14]
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
mov esi, 1
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 24
ret

float to double (IEEE754) conversion

I'm trying to convert 32bit float to 64bit double in asm on x86 architecture. The conversion is done by function written in asm and then I want to call it from C. I have no idea what I'm doing wrong, but memory pointed by dst seem to stay untouched and after printf program crashes. I want to do it without any floating-point intructions. Here's the code:
.686
.model flat
public _conv
.data
mantissa_mask dd 00000000011111111111111111111111b
exponent_mask dd 01111111100000000000000000000000b
.code
_conv PROC
pusha
mov ebp, esp
mov esi, dword ptr [ebp+8] ; src
mov edi, dword ptr [ebp+12]; dst
mov dword ptr [edi], 0
mov dword ptr [edi+4], 0
mov eax, dword ptr [esi]
and eax, dword ptr mantissa_mask
mov dword ptr [edi], eax
xor edx, edx ; zero edx
mov ecx, 1
shl ecx, 29 ;ecx == 2^29
mul ecx ;so it's like `shl edx:eax, 29`
mov dword ptr [edi], eax
mov dword ptr [edi+4], edx
mov eax, dword ptr [esi]
and eax, dword ptr exponent_mask
shr eax, 23 ;put exponent on lowest bits
sub eax, 127 ;exponent in float is coded enlarged by 127
add eax, 1023 ;in double it's enlarged by 1023
shl eax, 20 ;exponent in double starts on 20bit of 2nd byte
or dword ptr [edi], eax
;sign bit:
bt dword ptr [esi], 31
jc set_sign_bit
btr dword ptr [edi+4], 31
jmp endthis
set_sign_bit:
bts dword ptr [edi+4], 31
endthis:
popa
ret
_conv ENDP
END
And the C code:
void conv(float * src, double * dst);
int main()
{
float src = 4.5f;
double dst = 0.;
conv(&src, &dst);
printf("%f\n", dst);
return 0;
}

Your primary problem is accessing the arguments. Since you did pusha the arguments are not at [ebp+8] and [ebp+12], rather at [ebp+36] and [ebp+40]. A debugger would have shown you this right away. Even with those changes your code is still broken though.

Ok, finally it works. Very helpful was Jester's advice about args access. Stupid thing, but hard to notice. Here's final code:
.686
.model flat
public _conv
.data
mantissa_mask dd 00000000011111111111111111111111b
exponent_mask dd 01111111100000000000000000000000b
.code
_conv PROC
pusha
mov ebp, esp
;+36 and +40 since pusha
mov esi, dword ptr [ebp+36]; src
mov edi, dword ptr [ebp+40]; dst
mov dword ptr [edi], 0
mov dword ptr [edi+4], 0
;mentissa:
mov eax, dword ptr [esi]
and eax, dword ptr mantissa_mask
mov dword ptr [edi], eax
xor edx, edx ; zero edx
mov ecx, 1
shl ecx, 29 ;ecx == 2^29
mul ecx ;so it's like `shl edx:eax, 29`
mov dword ptr [edi], eax
mov dword ptr [edi+4], edx
;exponent:
mov eax, dword ptr [esi]
and eax, dword ptr exponent_mask
shr eax, 23 ;put exponent on lowest bits
sub eax, 127 ;exponent in float is coded enlarged by 127
add eax, 1023 ;in double it's enlarged by 1023
shl eax, 20 ;exponent in double starts on 20bit of 2nd byte
or dword ptr [edi+4], eax
;sign bit:
bt dword ptr [esi], 31
jc set_sign_bit
btr dword ptr [edi+4], 31
jmp endthis
set_sign_bit:
bts dword ptr [edi+4], 31
endthis:
popa
ret
_conv ENDP
END

Understanding Intel x86 assembly output for a modulo calculation

I have a school project about compilers and how it differs in assembly code between Intel x86 and ARMv7, but i'm stuck trying to comprehend the assembly for the Intel x86 architecture.
The source code is:
int main()
{
int a=5,b=2;
int result;
result = a % b;
printf("Result of 5 modulo 2 is %i\n", result);
}
Assembly output (gcc masm=Intel)
main:
/*
Intel32-x86 Arhchitecture
Little endian
ebp register -- base pointer
esp register -- stack pointer
*/
push ebp ; ebp register put on stack
mov ebp, esp ; Move data from ebp to esp
and esp, -16 ; Logical AND ??
sub esp, 32 ; Subtraction ??
mov DWORD PTR [esp+20], 5
;5 as 32 bits
;00000101-00000000-00000000-00000000
mov DWORD PTR [esp+24], 2
;2 as 32 bits
;00000010-00000000-00000000-00000000
mov eax, DWORD PTR [esp+20]
mov edx, eax
sar edx, 31
;Shift Arithmetically right - edx med 31.
;00000101-00000000-00000000-00000000 BEFORE
;00000000-00000000-00000000-00000000 AFTER
idiv DWORD PTR [esp+24]
;Signed divide - IDIV r/m32 - EDX:EAX register
;Dividing EDX:EAX on value of esp+24, and save the remainder in edx.
;EDX:EAX 00000000-00000000-00000000-00000000-00000101-00000000-00000000-00000000
mov DWORD PTR [esp+28], edx
mov eax, OFFSET FLAT:.LC0
mov edx, DWORD PTR [esp+28]
mov DWORD PTR [esp+4], edx
mov DWORD PTR [esp], eax
call printf
leave
ret
and esp, -16 ; Logical AND
sub esp, 32 ; Subtraction
What is the purpose of those two instructions?

The purpose is mentioned in the comments:
and esp,-16 ;round esp down to 16 byte boundary
sub esp,32 ;allocate 32 bytes of space for local variables
In case you didn't catch this part about sign extending the dividend:
mov eax, DWORD PTR [esp+20] ; eax = dividend
mov edx, eax ; edx = dividend
sar edx, 31 ; edx = 0 or -1 (the sign extension)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Arithmetic operations with large numbers in assembly - c

Related

Understanding Clang's optimization when pointer is zero

How can I store abc(x, y) which is a pointer function into an array as per the following code sample?

volatile keyword in C, are all variables marked as volatile?

float to double (IEEE754) conversion

Understanding Intel x86 assembly output for a modulo calculation

Categories

Resources