Dividing in Assembly


I am trying to write a calculator in C based on the Linux command dc. The structure of the program is not important; all you need to know is that I get two numbers and want to divide them when the user types /. To do so, I pass the two numbers to an assembly function that performs the division (see code below). But this works for positive numbers only.
When typing 999 3 / it returns 333, which is correct, but when typing -999 3 / I get the strange number 1431655432. When both numbers are negative, as in -999 -3 /, I get 0 every time, for any two negative numbers.
The code in assembly is:
section .text
global _div
_div:
push rbp ; Save caller state
mov rbp, rsp
mov rax, rdi ; Copy function args to registers: leftmost...
mov rbx, rsi ; Next argument...
cqo
idiv rbx ; divide 2 arguments
mov [rbp-8], rax
pop rbp ; Restore caller state

Your comments say you are passing integers to _div. If you are using int, those are 32-bit values:
extern int _div (int a, int b);
When passed to the function, a will be in the bottom 32 bits of RDI and b will be in the bottom 32 bits of RSI. The upper 32 bits of the argument registers can be garbage; they are often zero, but that doesn't have to be the case.
If you use a 64-bit register as a divisor with IDIV, the division is RDX:RAX / 64-bit divisor (in your case RBX). The problem here is that you are using the full 64-bit registers to do 32-bit division. If we assume for argument's sake that the upper bits of RDI and RSI were originally 0, then RDI would be 0x00000000FFFFFC19 (copied to RAX) and RSI would be 0x0000000000000003 (copied to RBX). CQO sign-extends RAX into RDX. The uppermost bit of RAX is zero, so RDX would be zero. The division would look like:
0x000000000000000000000000FFFFFC19 / 0x0000000000000003 = 0x55555408
0x55555408 happens to be 1431655432 (decimal), which is the result you were seeing. One fix for this is to use 32-bit registers for the division. To sign extend EAX (the lower 32 bits of RAX) into EDX you can use CDQ instead of CQO. You can then divide EDX:EAX by EBX. This should get you the 32-bit signed division you are looking for. The code would look like:
cdq
idiv ebx ; divide 2 arguments EDX:EAX by EBX
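The arithmetic above can be reproduced in C. This is only a sketch: the function names are made up for illustration, and div_zero_extended assumes the upper halves of RDI and RSI happened to be zero on entry, which the ABI does not guarantee.

```c
#include <stdint.h>

/* Models what the original CQO + 64-bit IDIV saw, ASSUMING the upper
   32 bits of both argument registers were zero (not guaranteed). */
int64_t div_zero_extended(int32_t a, int32_t b)
{
    int64_t wide_a = (int64_t)(uint32_t)a;  /* -999 becomes 0x00000000FFFFFC19 */
    int64_t wide_b = (int64_t)(uint32_t)b;  /* 3 stays 3 */
    return wide_a / wide_b;                 /* caller must not pass b == 0 */
}

/* Models the CDQ + 32-bit IDIV version: a plain 32-bit signed division. */
int32_t div_32bit(int32_t a, int32_t b)
{
    return a / b;                           /* caller must not pass b == 0 */
}
```

Under that assumption div_zero_extended(-999, 3) gives 1431655432 and div_zero_extended(-999, -3) gives 0, matching the results in the question, while div_32bit(-999, 3) gives -333.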
Be aware that RBX, RBP, and R12 to R15 all need to be preserved by your function if you modify them (they are non-volatile, callee-saved registers in the System V AMD64 ABI). If you modify RBX you need to make sure you save and restore it like you do with RBP. A better alternative is to use one of the volatile registers like RCX instead of RBX.
You don't need the intermediate register to place the divisor into. You could have used RSI (or ESI in the fixed version) directly instead of moving it to a register like RBX.

Your issue has to do with how arguments are passed to _div.
Assuming your _div's prototype is:
int64_t _div(int32_t, int32_t);
Then, the arguments are passed in edi and esi (i.e., 32-bit signed integers), the upper halves of the registers rdi and rsi are undefined.
Sign extension is needed when assigning edi and esi to rax and rbx for performing a 64-bit signed division (for performing a 64-bit unsigned division zero extension would be needed instead).
That is, instead of:
mov rax, rdi
mov rbx, rsi
use the instruction movsxd, which sign extends a 32-bit source into a 64-bit destination, on edi and esi:
movsxd rax, edi
movsxd rbx, esi
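In C terms, sign extension followed by a 64-bit division looks like the following. This is a sketch with a hypothetical function name, not the asker's actual code.

```c
#include <stdint.h>

/* Widen both 32-bit arguments with sign extension (what MOVSXD does in
   x86-64), then perform a 64-bit signed division. */
int64_t div_sign_extended(int32_t a, int32_t b)
{
    int64_t wide_a = (int64_t)a;  /* -999 becomes 0xFFFFFFFFFFFFFC19 */
    int64_t wide_b = (int64_t)b;
    return wide_a / wide_b;       /* caller must not pass b == 0 */
}
```

One side effect of widening first: div_sign_extended(INT32_MIN, -1) yields 2147483648 instead of overflowing, because the quotient fits comfortably in 64 bits.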
Using true 64-bit operands for the 64-bit division
The previous approach consists of performing a 64-bit division on "fake" 64-bit operands (i.e., sign-extended 32-bit operands). Mixing 64-bit instructions with "32-bit operands" is usually not a very good idea because it may result in worse performance and larger code size.
A better approach would be to simply change the C prototype of your _div function to accept actual 64-bit arguments, i.e.:
int64_t _div(int64_t, int64_t);
This way, the arguments will be passed in rdi and rsi (i.e., already 64-bit signed integers) and a 64-bit division will be performed on true 64-bit integers.
Using a 32-bit division instead
You may also want to consider using the 32-bit idiv if it suits your needs, since it performs faster than a 64-bit division and the resulting code size is smaller (no REX prefix):
...
mov eax, edi
mov ebx, esi
cdq
idiv ebx
...
_div's prototype would be:
int32_t _div(int32_t, int32_t);

Related

Emit DIV instruction, instead of __udivti3

Consider the following code:
unsigned long long div(unsigned long long a, unsigned long long b, unsigned long long c) {
unsigned __int128 d = (unsigned __int128)a*(unsigned __int128)b;
return d/c;
}
When compiled with x86-64 gcc 10 or clang 10, both with -O3, it emits __udivti3, instead of DIVQ instruction:
div:
mov rax, rdi
mov r8, rdx
sub rsp, 8
xor ecx, ecx
mul rsi
mov r9, rax
mov rsi, rdx
mov rdx, r8
mov rdi, r9
call __udivti3
add rsp, 8
ret
At least in my testing, the former is much slower than the (already) slow latter, hence the question: is there a way to make a modern compiler emit DIVQ for the above code?
Edit: Let's assume the quotient fits into 64-bits register.

div will fault if the quotient doesn't fit in 64 bits. Doing (a*b) / c with mul + a single div isn't safe in the general case (doesn't implement the abstract-machine semantics for every possible input), therefore a compiler can't generate asm that way for x86-64.
Even if you do give the compiler enough info to figure out that the division can't overflow (i.e. that high_half < divisor), unfortunately gcc/clang still won't ever optimize it to a single div with a non-zero high-half dividend (RDX).
You need an intrinsic or inline asm to explicitly do 128 / 64-bit => 64-bit division. e.g. Intrinsics for 128 multiplication and division has GNU C inline asm that looks right for low/high halves separately.
Unfortunately GNU C doesn't have an intrinsic for this. MSVC does, though: Unsigned 128-bit division on 64-bit machine has links.
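A minimal sketch of the inline-asm route, assuming GNU C (GCC or Clang) targeting x86-64. The helper name muldiv64 is hypothetical. The caller must guarantee that the quotient fits in 64 bits (the high half of the product is less than c); otherwise DIV faults, as described above.

```c
#include <stdint.h>

/* Computes (a * b) / c using a widening multiply and one DIV instruction.
   PRECONDITION (not checked): (a * b) >> 64 < c, so the quotient fits in
   64 bits. Violating it raises a divide error (#DE) at run time. */
static inline uint64_t muldiv64(uint64_t a, uint64_t b, uint64_t c)
{
    unsigned __int128 prod = (unsigned __int128)a * b;  /* compiles to MUL */
    uint64_t lo = (uint64_t)prod;
    uint64_t hi = (uint64_t)(prod >> 64);
    uint64_t quot, rem;

    /* DIV r/m64 divides RDX:RAX by the operand; quotient in RAX,
       remainder in RDX. */
    __asm__("divq %4"
            : "=a"(quot), "=d"(rem)
            : "0"(lo), "1"(hi), "rm"(c)
            : "cc");
    (void)rem;  /* remainder is available but unused here */
    return quot;
}
```

Keeping the multiply in plain C lets the compiler schedule it freely; only the division needs the asm statement.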

Multiplying values in an array using IMUL instruction produces incorrect values

I'm picking up ASM language and trying out the IMUL instruction on Ubuntu with Eclipse C++, but for some reason I just can't seem to get the desired output from my code.
Required:
Multiply the negative elements of an integer array int_array by a specified integer inum
Here's my code for the above:
C code:
#include <stdio.h>
extern void multiply_function();
// Variables
int iaver, inum;
int int_ar[10] = {1,2,3,4,-9,6,7,8,9,10};
int main()
{
inum = 2;
multiply_function();
for(int i=0; i<10; i++){
printf("%d ",int_ar[i]);
}
}
ASM code:
extern int_ar
extern inum
global multiply_function
multiply_function:
enter 0,0
mov ecx, 10
mov eax, inum
multiply_loop:
cmp [int_ar +ecx*4-4], dword 0
jg .ifpositive
mov ebx, [int_ar +ecx*4-4]
imul ebx
cdq
mov [int_ar +ecx*4-4], eax
loop multiply_loop
leave
ret
.ifpositive:
loop multiply_loop
leave
ret
The Problem
For an array of {1,2,3,4,-9,6,7,8,9,10} and inum = 2, I get the output {1,2,3,4,-1210688460,6,7,8,9,10}, which hints at some sort of overflow occurring.
Is there something I'm missing or understood wrong about how the IMUL function in assembly language for x86 works?
Expected Output
The output I expected is {1,2,3,4,-18,6,7,8,9,10}
My Thought Process
My thought process for the above task:
1) Find which array elements in array are negative, for each positive element found, do nothing and continue loop to next element
cmp [int_ar +ecx*4-4], dword 0
jg .ifpositive
.ifpositive:
loop multiply_loop
leave
ret
2) Upon finding the negative element, move its value into register EBX which will serve as SRC in the IMUL SRC function. Then extend register EAX to EAX-EDX where the result is stored in:
mov ebx, [int_ar +ecx*4-4]
imul ebx
cdq
3) Move the result into the negative element of the array by using MOV:
mov [int_ar +ecx*4-4], eax
4) Loop through to the next array element and repeat the above 1)-3)
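For reference, the intended behaviour of steps 1) to 4) can be written in plain C. This is only a sketch to compare the assembly output against; the function name multiply_negatives is made up for illustration.

```c
#include <stddef.h>

/* Multiplies every negative element of values[0..count-1] by factor,
   leaving non-negative elements untouched. */
void multiply_negatives(int *values, size_t count, int factor)
{
    for (size_t i = 0; i < count; i++) {
        if (values[i] < 0) {
            values[i] *= factor;
        }
    }
}
```

Calling multiply_negatives(int_ar, 10, inum) with inum = 2 turns the -9 into -18 and leaves the other elements unchanged.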

Reason for Incorrect Values
If we look past the inefficiencies and unneeded code and deal with the real issue it comes down to this instruction:
mov eax, inum
What is inum? You created and initialized a global variable in C called inum with:
int iaver, inum;
[snip]
inum = 2;
inum as a variable is essentially a label to a memory location containing an int (32-bit value). In your assembly code you need to treat inum as a pointer to a value, not the value itself. In your assembly code you need to change:
mov eax, inum
to:
mov eax, [inum]
What your version does is move the address of inum into EAX. Your code ended up multiplying the address of the variable by the negative numbers in your array. That caused the erroneous values you see. The square brackets around inum tell the assembler you want to treat inum as a memory operand, and that you want to move the 32-bit value at inum into EAX.
Calling Convention
You appear to be creating a 32-bit program and running it on 32-bit Ubuntu. I can infer the possibility of a 32-bit Linux from the erroneous value of -1210688460 being returned. -1210688460 = 0xB7D65C34; divide it by -9 and you get 0x804A06C. Programs on 32-bit Linux are usually loaded starting at 0x8048000.
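That inference can be checked with a few lines of C. The sketch below models only the low 32 bits of the IMUL result; the function name is hypothetical, and 0x804A06C is the inferred address of inum, not a value taken from the asker's binary.

```c
#include <stdint.h>

/* Low 32 bits of address * element, i.e. what ends up in EAX after IMUL
   when EAX held the ADDRESS of inum instead of its value. */
int32_t product_with_address(uint32_t address, int32_t element)
{
    uint32_t low = address * (uint32_t)element;  /* wraps modulo 2^32 */
    return (int32_t)low;                         /* reinterpret as signed */
}
```

product_with_address(0x804A06C, -9) gives -1210688460, the value the asker saw, whereas multiplying the intended value 2 by -9 gives -18.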
Whether running on 32-bit Linux or 64-bit Linux, assembly code linked with 32-bit C/C++ programs need to abide by the CDECL calling convention:
cdecl
The cdecl (which stands for C declaration) is a calling convention that originates from the C programming language and is used by many C compilers for the x86 architecture. In cdecl, subroutine arguments are passed on the stack. Integer values and memory addresses are returned in the EAX register, floating point values in the ST0 x87 register. Registers EAX, ECX, and EDX are caller-saved, and the rest are callee-saved. The x87 floating point registers ST0 to ST7 must be empty (popped or freed) when calling a new function, and ST1 to ST7 must be empty on exiting a function. ST0 must also be empty when not used for returning a value.
Your code clobbers EAX, EBX, ECX, and EDX. You are free to destroy the contents of EAX, ECX, and EDX but you must preserve EBX. If you don't, you can cause problems for the C code calling the function. After the enter 0,0 instruction you can push ebx, and just before each leave instruction you can pop ebx.
If you were to use -O1, -O2, or -O3 GCC compiler options to enable optimizations your program may not work as expected or crash altogether.

Move variable to cl and perform shr using inline assembly

So I am trying to translate the following assignment from C to inline assembly
resp = (0x1F)&(letter >> (3 - numB));
Assuming that the declaration of the variables are the following
unsigned char resp;
unsigned char letter;
int numB;
So I have tried the following:
_asm {
mov ebx, 01fh
movzx edx, letter
mov cl,3
sub cl, numB // Line 5
shr edx, cl
and ebx, edx
mov resp, ebx
}
or the following
_asm {
mov ebx, 01fh
movzx edx, letter
mov ecx,3
sub ecx, numB
mov cl, ecx // Line 5
shr edx, cl
and ebx, edx
mov resp, ebx
}
In both cases I get size operand error in Line 5.
How can I achieve the right shift?

The E*X registers are 32 bits, while the *L registers are 8 bits. Similarly, on Windows, the int type is 32 bits wide, while the char type is 8 bits wide. You cannot arbitrarily mix these sizes within a single instruction.
So, in your first piece of code:
sub cl, numB // Line 5
this is wrong because the cl register stores an 8-bit value, whereas the numB variable is of type int, which stores a 32-bit value. You cannot subtract a 32-bit value from an 8-bit value; both operands to the SUB instruction must be the same size.
Similarly, in your second piece of code:
mov cl, ecx // Line 5
you are trying to move the 32-bit value in ECX into the 8-bit CL register. That can't happen without some kind of truncation, so you have to indicate it explicitly. The MOV instruction requires that both of its operands have the same size.
(MOVZX and MOVSX are obvious exceptions to this rule that the operand types must match for a single instruction. These instructions zero-extend or sign-extend, respectively, a smaller value so that it can be stored into a larger-sized register.)
However, in this case, you don't even need the MOV instruction. Remember that CL is just the lower 8 bits of the full 32-bit ECX register. Therefore, setting ECX also implicitly sets CL. If you only need the lower 8 bits, you can just use CL in a subsequent instruction. Thus, your code becomes:
mov ebx, 01fh ; move constant into 32-bit EBX
movzx edx, BYTE PTR letter ; zero-extended move of 8-bit variable into 32-bit EDX
mov ecx, 3 ; move constant into ECX
sub ecx, DWORD PTR numB ; subtract 32-bit variable from ECX
shr edx, cl ; shift EDX right by the lower 8 bits of ECX
and ebx, edx ; bitwise AND of EDX and EBX, leaving result in EBX
mov BYTE PTR resp, bl ; move lower 8 bits of EBX into 8-bit variable
For the same operand-size matching issue discussed above, I've also had to change the final MOV instruction. You cannot move the value stored in a 32-bit register directly into an 8-bit variable. You will have to move either the lower 8 bits (BL) or bits 8 through 15 (BH) of EBX; both are 8-bit registers and therefore match the size of resp. In the above code, I assumed that you want only the lower 8 bits, so I've used BL.
Also note that I've used the BYTE PTR and DWORD PTR specifications. These are not strictly necessary in MASM (or Visual Studio's inline assembler), since it can deduce the sizes of the types from the types of the variables. However, I think it increases readability, and is generally a recommended practice. DWORD means 32 bit; it is the same size as int and a 32-bit register (E*X). WORD means 16 bit; it is the same size as short and a 16-bit register (*X). BYTE means 8 bits; it is the same size as char and an 8-bit register (*L or *H).
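To make the register sizes concrete, here is a C sketch that mirrors the corrected instruction sequence step by step. The function name is hypothetical, and the variables are named after the registers only for illustration.

```c
#include <stdint.h>

/* Mirrors: resp = 0x1F & (letter >> (3 - numB)) as done by the asm above. */
uint8_t extract_bits(uint8_t letter, int32_t numB)
{
    uint32_t ebx = 0x1F;                   /* mov ebx, 01fh                  */
    uint32_t edx = letter;                 /* movzx edx, BYTE PTR letter     */
    uint32_t ecx = 3u - (uint32_t)numB;    /* mov ecx, 3 / sub ecx, numB     */
    uint8_t  cl  = (uint8_t)ecx;           /* CL is the low 8 bits of ECX    */
    edx >>= (cl & 31);                     /* shr edx, cl (count masked to 5 bits) */
    ebx &= edx;                            /* and ebx, edx                   */
    return (uint8_t)ebx;                   /* mov BYTE PTR resp, bl          */
}
```

For example, extract_bits(0xA5, 1) shifts 0xA5 right by 2 to get 0x29 and masks it to 0x09.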

C Pointer to EFLAGS using NASM

For a task at my school, I need to write a C program which does a 16-bit addition using an assembly program. Besides the result, the EFLAGS shall also be returned.
Here is my C Program:
int add(int a, int b, unsigned short* flags);//function must be used this way
int main(void){
unsigned short* flags = NULL;
printf("%d\n",add(30000, 36000, flags));// printing just the result
return 0;
}
For the moment the Program just prints out the result and not the flags because I am unable to get them.
Here is the assembly program for which I use NASM:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
mov esp,ebp
pop ebp
ret
Now this all works smoothly. But I have no idea how to get the pointer which must be at [ebp+16] pointing to the flag register. The professor said we will have to use the pushfd command.
My problem just lies in the assembly code. I will modify the C Program to give out the flags after I get the solution for the flags.

Normally you'd just use a debugger to look at flags, instead of writing all the code to get them into a C variable for a debug-print. Especially since decent debuggers decode the condition flags symbolically for you, instead of or as well as showing a hex value.
You don't have to know or care which bit in FLAGS is CF and which is ZF. (This information isn't relevant for writing real programs, either. I don't have it memorized, I just know which flags are tested by different conditions like jae or jl. Of course, it's good to understand that FLAGS are just data that you can copy around, save/restore, or even modify if you want)
Your function args and return value are int, which is 32-bit in the System V 32-bit x86 ABI you're using. (links to ABI docs in the x86 tag wiki). Writing a function that only looks at the low 16 bits of its input, and leaves high garbage in the high 16 bits of the output is a bug. The int return value in the prototype tells the compiler that all 32 bits of EAX are part of the return value.
As Michael points out, you seem to be saying that your assignment requires using a 16-bit ADD. That will produce carry, overflow, and other conditions with different inputs than if you looked at the full 32 bits. (BTW, this article explains carry vs. overflow very well.)
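To see how a 16-bit add differs from a 32-bit one, the carry and overflow conditions can be modelled in C. This is a sketch with made-up names (add16, struct add16_result); it imitates the flag rules rather than reading the real FLAGS register.

```c
#include <stdbool.h>
#include <stdint.h>

struct add16_result {
    int32_t value;     /* 16-bit sum, sign-extended to 32 bits (like CWDE) */
    bool    carry;     /* CF: unsigned result did not fit in 16 bits       */
    bool    overflow;  /* OF: signed result did not fit in 16 bits         */
};

struct add16_result add16(int32_t a, int32_t b)
{
    uint16_t x = (uint16_t)a;              /* only the low 16 bits matter */
    uint16_t y = (uint16_t)b;
    uint32_t wide = (uint32_t)x + y;       /* exact unsigned sum          */
    uint16_t sum = (uint16_t)wide;

    struct add16_result r;
    r.value = (int16_t)sum;
    r.carry = wide > 0xFFFF;
    /* Signed overflow: both inputs have the same sign bit and the sum's
       sign bit differs from it. */
    r.overflow = ((x ^ sum) & (y ^ sum) & 0x8000) != 0;
    return r;
}
```

With the inputs from the question, add16(30000, 36000) yields value 464 with carry set and overflow clear, while add16(30000, 30000) yields -5536 with carry clear and overflow set. A 32-bit add of the same inputs would set neither flag.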
Here's what I'd do. Note the 16-bit operand size for the ADD, followed by sign extension of the result into EAX.
global add
section .text
add:
push ebp
mov ebp,esp ; stack frames are optional, you can address things relative to ESP
mov eax, [ebp+8] ; first arg: No need to avoid loading the full 32 bits; the next insn doesn't care about the high garbage.
add ax, [ebp+12] ; low 16 bits of second arg. Operand-size implied by AX
cwde ; sign-extend AX into EAX
mov ecx, [ebp+16] ; the pointer arg
pushf ; the simple straightforward way
pop edx
mov [ecx], dx ; Store the low 16 of what we popped. Writing word [ecx] is optional, because dx implies 16-bit operand-size
; be careful not to do a 32-bit store here, because that would write outside the caller's object.
; mov esp,ebp ; redundant: ESP is still pointing at the place we pushed EBP, since the push is balanced by an equal-size pop
pop ebp
ret
CWDE (the 16->32 form of the 8086 CBW instruction) is not to be confused with CWD (the AX -> DX:AX 8086 instruction). If you're not using AX, then MOVSX / MOVZX are a good way to do this.
The fun way: instead of using the default operand size and doing 32-bit push and pop, we can do a 16-bit pop directly into the destination memory address. That would leave the stack unbalanced, so we could either uncomment the mov esp, ebp again, or use a 16-bit pushf (with an operand-size prefix, which according to the docs makes it only push the low 16 FLAGS, not the 32-bit EFLAGS.)
; What I'd *really* do: maximum efficiency if I had to use the 32-bit ABI with args on the stack, instead of args in registers
global add
section .text
add:
mov eax, [esp+4] ; first arg, first thing above the return address
add ax, [esp+8] ; second arg
cwde ; sign-extend AX into EAX
mov ecx, [esp+12] ; the pointer
pushfw ; push the low 16 of FLAGS
pop word [ecx] ; pop into memory pointed to by unsigned short* flags
ret
Both PUSHFW and POP WORD will assemble with an operand-size prefix. Output from objdump -Mintel, which uses slightly different syntax from NASM:
4000c0: 66 9c pushfw
4000c2: 66 8f 01 pop WORD PTR [ecx]
PUSHFW is the same as o16 PUSHF. In NASM, o16 applies the operand-size prefix.
If you only needed the low 8 flags (not including OF), you could use LAHF to load FLAGS into AH and store that.
PUSHFing directly into the destination is not something I'd recommend. Temporarily pointing the stack at some random address is not safe in general. Programs with signal handlers will use the space below the stack asynchronously. This is why you have to reserve stack space before using it with sub esp, 32 or whatever, even if you're not going to make a function call that would overwrite it by pushing more stuff on the stack. The only exception is when you have a red-zone.
Your C caller:
You're passing a NULL pointer, so of course your asm function segfaults. Pass the address of a local to give the function somewhere to store to.
#include <stdio.h>
int add(int a, int b, unsigned short* flags);
int main(void) {
unsigned short flags;
int result = add(30000, 36000, &flags);
printf("%d %#hx\n", result, flags);
return 0;
}
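If you do want to print individual condition flags from the stored word rather than a raw hex value, a small decoder is enough. The bit positions below are the architectural ones for the low 16 bits of FLAGS; the names FLAG_* and flag_is_set are hypothetical helpers, not part of any library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks for the status flags in the low 16 bits of (E)FLAGS. */
enum {
    FLAG_CF = 1 << 0,   /* carry     */
    FLAG_PF = 1 << 2,   /* parity    */
    FLAG_AF = 1 << 4,   /* auxiliary */
    FLAG_ZF = 1 << 6,   /* zero      */
    FLAG_SF = 1 << 7,   /* sign      */
    FLAG_OF = 1 << 11   /* overflow  */
};

/* Returns true if the flag selected by mask is set in the saved word. */
bool flag_is_set(uint16_t flags, unsigned mask)
{
    return (flags & mask) != 0;
}
```

In the caller above this could be used as printf("CF=%d OF=%d\n", flag_is_set(flags, FLAG_CF), flag_is_set(flags, FLAG_OF));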

This is just a simple approach. I didn't test it, but you should get the idea...
Just set ESP to the pointer value, increment it by 2 (even for 32-bit arch) and PUSHF like this:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
; --- here comes the mod
mov esp, [ebp+16] ; this will set ESP to the pointers address "unsigned short* flags"
lea esp, [esp+2] ; adjust ESP to address above target
db 66h ; operand size prefix for 16-bit PUSHF (alternatively 'db 0x66', depending on your assembler)
pushf ; this will save the lower 16-bits of EFLAGS to WORD PTR [EBP+16] = [ESP+2-2]
; --- here ends the mod
mov esp,ebp
pop ebp
ret
This should work, because PUSHF decrements ESP by 2 and then saves the value to WORD PTR [ESP]. Therefore it had to be increased before using the pointer address. Setting ESP to the appropriate value enables you to denominate the direct target of PUSHF.

What is happening in this disassembled code, and what would it look like in C?

I've disassembled this c code (using ida), and ran across this bit of code. I believe the second line is an array, as well as the 5th line, but I'm not sure why it uses a sign extend or a zero extend.
I need to convert the code to C, and I'm not sure why the sign/zero extend is used, or what C code would cause that.
mov ecx, [ebp+var_58]
mov dl, byte ptr [ebp+ecx*2+var_28]
mov [ebp+var_59], dl
mov eax, [ebp+var_58]
movsx ecx, [ebp+eax*2+var_20]
movzx edx, [ebp+var_59]
or edx, ecx
mov [ebp+var_59], dl

Unsigned integer types will be zero-extended, while signed types will be sign-extended.
I kinda want to downvote this as too trivial. It's not like there's anything going on that the instruction reference manual doesn't cover. I guess it's different from asking for an explanation of a really simple C program because the trick here is understanding why one might string this sequence of instructions together, rather than just what each one does individually. Being familiar with the idioms used by non-optimizing compilers (store and reload from RAM after every statement) helps.
I'm guessing this is a snippet from inside a function that makes a stack frame, so offsets from ebp are where local variables are spilled when they're not live in registers (IDA's var_XX names stand for negative offsets from ebp).
mov ecx, [ebp+var_58] ; load var58 into ecx
mov dl, byte ptr [ebp+ecx*2+var_28] ; load a byte from var28[2*var58]
mov [ebp+var_59], dl ; store it to var59
mov eax, [ebp+var_58] ; load var58 again for some reason? can var59 alias var58?
; otherwise we still have the value in ecx, right?
; Or is this non-optimizing compiler output that's really annoying to read?
movsx ecx, [ebp+eax*2+var_20] ; load var20[var58*2]
movzx edx, [ebp+var_59] ; load var59 again
or edx, ecx ; edx = var59|var20[var58*2]
mov [ebp+var_59], dl ; spill var59 back to memory
I guess the default operand size for movsx/movzx is byte-to-dword. word-to-dword also exists, and I'm surprised your disassembler didn't disambiguate with a byte ptr on the memory operand. I'm inferring that it's a byte load because the preceding store to that address was byte-wide.
movsx is used when loading signed data that's smaller than 32 bits. C's integer-promotion rules dictate that operands of integer types smaller than int are automatically promoted to int (or unsigned int if int can't represent all values, e.g. if unsigned short and unsigned int are the same size).
8-bit or 32-bit operand sizes are available without operand-size prefix bytes. Since only Intel P6/SnB-family CPUs track partial-register dependencies, sign-extending to a full register width on loads can make for faster code (avoiding false dependencies on the previous contents of the register on AMD and Silvermont). So sign or zero extending (as appropriate for the data type) on loads is often the best way to handle narrow memory locations.
Looking at the output of non-optimizing compilers is not usually worth bothering with.
If the code had been generated by a proper optimizing compiler, it would probably be more like
mov ecx, [ebp+var_58] ; var58 is live in ecx
mov al, byte ptr [ebp+ecx*2+var_28] ; var59 = var28[2*var58]
or al, [ebp+ecx*2+var_20] ; var59 |= var20[var58*2]
mov [ebp+var_59], al ; spill var59 to memory
Much easier to read, IMO, without the noise of constantly storing/reloading. You can see when a value is used multiple times without having to notice that a load was from an address that was just stored to.
If a false dependency on the upper 24 bits of eax was causing a problem, we could use movzx or movsx loads into two registers, and do an or r32, r32 like the original, but then still store the low 8. (Using a 32bit or with a memory operand would do a 4B load, not a 1B load, which could cross a cache line or even a page and segfault.)
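As for what the C source might have looked like: the following is one plausible reconstruction, not the original code. It assumes the two arrays hold 2-byte elements whose first byte is accessed (hence the *2 scaling with byte loads), that var_20's elements are signed char (hence movsx) and that var_59 is an unsigned char (hence movzx). The struct and function names are invented.

```c
/* Hypothetical 2-byte element; only the first byte is read by the snippet. */
struct pair {
    signed char first;
    signed char second;
};

unsigned char combine(const struct pair *var28, const struct pair *var20, int var58)
{
    unsigned char var59 = (unsigned char)var28[var58].first;  /* plain byte load and store */
    /* Both operands are promoted to int before the OR: var59 by zero
       extension (movzx), the signed char by sign extension (movsx).
       Only the low 8 bits are stored back. */
    var59 = (unsigned char)(var59 | var20[var58].first);
    return var59;
}
```

Because only the low byte of the OR result is kept, the sign extension has no visible effect on the stored value; a non-optimizing compiler emits it anyway to follow the promotion rules literally.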
