Convert number byte array to string and print it - Assembly x86 [duplicate]

I was tasked to write a program that displays the linear address of my
program's PSP. I wrote the following:
ORG 256
mov dx,Msg
mov ah,09h ;DOS.WriteStringToStandardOutput
int 21h
mov ax,ds
mov dx,16
mul dx ; -> Linear address is now in DX:AX
???
mov ax,4C00h ;DOS.TerminateWithExitCode
int 21h
; ------------------------------
Msg: db 'PSP is at linear address $'
I searched the DOS API (using Ralf Brown's Interrupt List)
and didn't find a single function to output a number!
Did I miss it, and what can I do?
I want to display the number in DX:AX in decimal.

It's true that DOS doesn't offer us a function to output a number directly.
You'll have to first convert the number yourself and then have DOS display it
using one of the text output functions.
Displaying the unsigned 16-bit number held in AX
When tackling the problem of converting a number, it helps to see how the
digits that make up a number relate to each other.
Let's consider the number 65535 and its decomposition:
(6 * 10000) + (5 * 1000) + (5 * 100) + (3 * 10) + (5 * 1)
Method 1 : division by decreasing powers of 10
Processing the number going from the left to the right is convenient because it
allows us to display an individual digit as soon as we've extracted it.
By dividing the number (65535) by 10000, we obtain a single digit quotient
(6) that we can output as a character straight away. We also get a remainder
(5535) that will become the dividend in the next step.
By dividing the remainder from the previous step (5535) by 1000, we obtain
a single digit quotient (5) that we can output as a character straight away.
We also get a remainder (535) that will become the dividend in the next step.
By dividing the remainder from the previous step (535) by 100, we obtain
a single digit quotient (5) that we can output as a character straight away.
We also get a remainder (35) that will become the dividend in the next step.
By dividing the remainder from the previous step (35) by 10, we obtain
a single digit quotient (3) that we can output as a character straight away.
We also get a remainder (5) that will become the dividend in the next step.
By dividing the remainder from the previous step (5) by 1, we obtain
a single digit quotient (5) that we can output as a character straight away.
Here the remainder will always be 0. (Avoiding this silly division by 1
requires some extra code)
mov bx,.List
.a: xor dx,dx
div word ptr [bx] ; -> AX=[0,9] is Quotient, Remainder DX
xchg ax,dx
add dl,"0" ;Turn into character [0,9] -> ["0","9"]
push ax ;(1)
mov ah,02h ;DOS.DisplayCharacter
int 21h ; -> AL
pop ax ;(1) AX is next dividend
add bx,2
cmp bx,.List+10
jb .a
...
.List:
dw 10000,1000,100,10,1
Although this method will of course produce the correct result, it has a few
drawbacks:
Consider the smaller number 255 and its decomposition:
(0 * 10000) + (0 * 1000) + (2 * 100) + (5 * 10) + (5 * 1)
If we were to use the same 5 step process we'd get "00255". Those 2 leading
zeroes are undesirable and we would have to include extra instructions to get
rid of them.
The divider changes with each step. We had to store a list of dividers in
memory. Dynamically calculating these dividers is possible but introduces a
lot of extra divisions.
If we wanted to apply this method to displaying even larger numbers say
32-bit, and we will want to eventually, the divisions involved would get
really problematic.
So method 1 is impractical and therefore it is seldom used.
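To make the leading-zero drawback concrete, here is a small C model of method 1. It is only a sketch for illustration, not part of the DOS program: the helper name `u16_to_dec_powers` and the output buffer are assumptions of mine, and it writes into memory instead of calling DOS.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the answer): always writes five digits
   plus a terminating NUL, dividing by 10000, 1000, 100, 10 and 1 in turn. */
void u16_to_dec_powers(uint16_t n, char out[6])
{
    static const uint16_t powers[5] = { 10000, 1000, 100, 10, 1 };
    for (size_t i = 0; i < 5; i++) {
        out[i] = (char)('0' + n / powers[i]); /* quotient is the digit     */
        n = (uint16_t)(n % powers[i]);        /* remainder: next dividend  */
    }
    out[5] = '\0';
}
```

Feeding it 255 fills the buffer with the five characters 00255, which is exactly the padding the extra instructions would have to strip.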
Method 2 : division by const 10
Processing the number going from the right to the left seems counter-intuitive
since our goal is to display the leftmost digit first. But as you're about to
find out, it works beautifully.
By dividing the number (65535) by 10, we obtain a quotient (6553) that will
become the dividend in the next step. We also get a remainder (5) that we
can't output just yet and so we'll have to save it somewhere. The stack is a
convenient place to do so.
By dividing the quotient from the previous step (6553) by 10, we obtain
a quotient (655) that will become the dividend in the next step. We also get
a remainder (3) that we can't just yet output and so we'll have to save it
somewhere. The stack is a convenient place to do so.
By dividing the quotient from the previous step (655) by 10, we obtain
a quotient (65) that will become the dividend in the next step. We also get
a remainder (5) that we can't just yet output and so we'll have to save it
somewhere. The stack is a convenient place to do so.
By dividing the quotient from the previous step (65) by 10, we obtain
a quotient (6) that will become the dividend in the next step. We also get
a remainder (5) that we can't just yet output and so we'll have to save it
somewhere. The stack is a convenient place to do so.
By dividing the quotient from the previous step (6) by 10, we obtain
a quotient (0) that signals that this was the last division. We also get
a remainder (6) that we could output as a character straight away, but
refraining from doing so turns out to be most effective and so as before we'll
save it on the stack.
At this point the stack holds our 5 remainders, each being a single digit
number in the range [0,9]. Since the stack is LIFO (Last In First Out), the
value that we'll POP first is the first digit we want displayed. We use a
separate loop with 5 POP's to display the complete number. But in practice,
since we want this routine to be able to also deal with numbers that have
fewer than 5 digits, we'll count the digits as they arrive and later do that
many POP's.
mov bx,10 ;CONST
xor cx,cx ;Reset counter
.a: xor dx,dx ;Setup for division DX:AX / BX
div bx ; -> AX is Quotient, Remainder DX=[0,9]
push dx ;(1) Save remainder for now
inc cx ;One more digit
test ax,ax ;Is quotient zero?
jnz .a ;No, use as next dividend
.b: pop dx ;(1)
add dl,"0" ;Turn into character [0,9] -> ["0","9"]
mov ah,02h ;DOS.DisplayCharacter
int 21h ; -> AL
loop .b
This second method has none of the drawbacks of the first method:
Because we stop when a quotient becomes zero, there's never any problem
with ugly leading zeroes.
The divider is fixed. That's easy enough.
It's really simple to apply this method to displaying larger numbers and
that's precisely what comes next.
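For readers who think more easily in a high-level language, the same idea can be sketched in C. This is a model under my own assumptions, not the author's code: a small local array stands in for the CPU stack, `count` stands in for CX, and the hypothetical helper `u16_to_dec` writes to a buffer rather than to DOS.

```c
#include <stdint.h>

/* Hypothetical helper: converts n to decimal text in out (6 bytes needed)
   and returns the number of digits written. */
int u16_to_dec(uint16_t n, char out[6])
{
    char stack[5];  /* plays the role of the CPU stack */
    int count = 0;  /* plays the role of CX            */
    do {
        stack[count++] = (char)('0' + n % 10); /* "push" the remainder     */
        n /= 10;                               /* quotient: next dividend  */
    } while (n != 0);
    for (int i = 0; i < count; i++)
        out[i] = stack[count - 1 - i];         /* "pop" in reverse order   */
    out[count] = '\0';
    return count;
}
```

Because the loop body runs at least once, the value 0 still produces a single "0" digit, just like the assembly version.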
Displaying the unsigned 32-bit number held in DX:AX
On 8086 a cascade of 2 divisions is needed to divide the 32-bit value in
DX:AX by 10.
The 1st division divides the high dividend (extended with 0) yielding a high
quotient. The 2nd division divides the low dividend (extended with the
remainder from the 1st division) yielding the low quotient. It's the remainder
from the 2nd division that we save on the stack.
To check if the dword in DX:AX is zero, I've OR-ed both halves in a scratch
register.
Instead of counting the digits, requiring a register, I chose to put a sentinel
on the stack. Because this sentinel gets a value (10) that no digit can ever
have ([0,9]), it nicely allows us to determine when the display loop has to stop.
Other than that this snippet is similar to method 2 above.
mov bx,10 ;CONST
push bx ;Sentinel
.a: mov cx,ax ;Temporarily store LowDividend in CX
mov ax,dx ;First divide the HighDividend
xor dx,dx ;Setup for division DX:AX / BX
div bx ; -> AX is HighQuotient, Remainder is re-used
xchg ax,cx ;Temporarily move it to CX restoring LowDividend
div bx ; -> AX is LowQuotient, Remainder DX=[0,9]
push dx ;(1) Save remainder for now
mov dx,cx ;Build true 32-bit quotient in DX:AX
or cx,ax ;Is the true 32-bit quotient zero?
jnz .a ;No, use as next dividend
pop dx ;(1a) First pop (Is digit for sure)
.b: add dl,"0" ;Turn into character [0,9] -> ["0","9"]
mov ah,02h ;DOS.DisplayCharacter
int 21h ; -> AL
pop dx ;(1b) All remaining pops
cmp dx,bx ;Was it the sentinel?
jb .b ;Not yet
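The cascade of two divisions may be easier to check in C. The following is a sketch of a single loop iteration only. The function name `div10_cascade` is hypothetical, and each C division is restricted to what one 8086 DIV by a 16-bit operand can do (a 32-bit dividend whose high half is smaller than the divisor).

```c
#include <stdint.h>

/* Hypothetical model of one iteration: divides the 32-bit value hi:lo by 10
   in place and returns the remainder (the digit that gets pushed). */
unsigned div10_cascade(uint16_t *hi, uint16_t *lo)
{
    uint32_t step1 = *hi;                          /* 0:hi, high extended with 0 */
    uint16_t hi_quot = (uint16_t)(step1 / 10);
    uint32_t step2 = ((step1 % 10) << 16) | *lo;   /* remainder:lo               */
    *hi = hi_quot;
    *lo = (uint16_t)(step2 / 10);                  /* fits: remainder was < 10   */
    return (unsigned)(step2 % 10);
}
```

The second quotient always fits in 16 bits because the high half of its dividend is a remainder smaller than 10, which is why the assembly version can never raise a divide error here.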
Displaying the signed 32-bit number held in DX:AX
The procedure is as follows:
First find out if the signed number is negative by testing the sign bit.
If it is, then negate the number and output a "-" character, but be careful not
to destroy the number in DX:AX in the process.
The rest of the snippet is the same as for an unsigned number.
test dx,dx ;Sign bit is bit 15 of high word
jns .a ;It's a positive number
neg dx ;\
neg ax ; | Negate DX:AX
sbb dx,0 ;/
push ax dx ;(1)
mov dl,"-"
mov ah,02h ;DOS.DisplayCharacter
int 21h ; -> AL
pop dx ax ;(1)
.a: mov bx,10 ;CONST
push bx ;Sentinel
.b: mov cx,ax ;Temporarily store LowDividend in CX
mov ax,dx ;First divide the HighDividend
xor dx,dx ;Setup for division DX:AX / BX
div bx ; -> AX is HighQuotient, Remainder is re-used
xchg ax,cx ;Temporarily move it to CX restoring LowDividend
div bx ; -> AX is LowQuotient, Remainder DX=[0,9]
push dx ;(2) Save remainder for now
mov dx,cx ;Build true 32-bit quotient in DX:AX
or cx,ax ;Is the true 32-bit quotient zero?
jnz .b ;No, use as next dividend
pop dx ;(2a) First pop (Is digit for sure)
.c: add dl,"0" ;Turn into character [0,9] -> ["0","9"]
mov ah,02h ;DOS.DisplayCharacter
int 21h ; -> AL
pop dx ;(2b) All remaining pops
cmp dx,bx ;Was it the sentinel?
jb .c ;Not yet
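As a cross-check of the sign handling, here is a C sketch of the same procedure. It is an illustration under my own assumptions (the helper name `i32_to_dec` and the 12-byte buffer are not from the answer). The negation is done on the unsigned bit pattern, which mirrors NEG/NEG/SBB and also copes with the most negative value.

```c
#include <stdint.h>

/* Hypothetical helper: formats a signed 32-bit value into out (12 bytes). */
void i32_to_dec(int32_t value, char out[12])
{
    uint32_t mag = (uint32_t)value;  /* same bits, viewed as unsigned */
    char digits[10];
    int count = 0, pos = 0;
    if (value < 0) {
        out[pos++] = '-';
        mag = (uint32_t)(0u - mag);  /* two's complement negate */
    }
    do {                             /* from here on: the unsigned routine */
        digits[count++] = (char)('0' + mag % 10);
        mag /= 10;
    } while (mag != 0);
    while (count > 0)
        out[pos++] = digits[--count];
    out[pos] = '\0';
}
```

Negating 80000000h leaves the same bit pattern, but because the remaining steps treat it as unsigned it still prints as 2147483648 after the minus sign.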
Will I need separate routines for different number sizes?
In a program where you need to display on occasion AL, AX, or DX:AX, you could
just include the 32-bit version and use the following little wrappers for the smaller
sizes:
; IN (al) OUT ()
DisplaySignedNumber8:
push ax
cbw ;Promote AL to AX
call DisplaySignedNumber16
pop ax
ret
; -------------------------
; IN (ax) OUT ()
DisplaySignedNumber16:
push dx
cwd ;Promote AX to DX:AX
call DisplaySignedNumber32
pop dx
ret
; -------------------------
; IN (dx:ax) OUT ()
DisplaySignedNumber32:
push ax bx cx dx
...
Alternatively, if you don't mind the clobbering of the AX and DX registers use
this fall-through solution:
; IN (al) OUT () MOD (ax,dx)
DisplaySignedNumber8:
cbw
; --- --- --- --- -
; IN (ax) OUT () MOD (ax,dx)
DisplaySignedNumber16:
cwd
; --- --- --- --- -
; IN (dx:ax) OUT () MOD (ax,dx)
DisplaySignedNumber32:
push bx cx
...

Related

How to calculate the average of an array in assembly [duplicate]

I noticed when EDX contains some random default value like 00401000, and I then use a DIV instruction like this:
mov eax,10
mov ebx,5
div ebx
it causes an INTEGER OVERFLOW ERROR. However, if I set edx to 0 and do the same thing it works. I believed that using div would result in the quotient overwriting eax and the remainder overwriting edx.
Getting this INTEGER OVERFLOW ERROR really confuses me.
What to do
For 32-bit / 32-bit => 32-bit division: zero- or sign-extend the 32-bit dividend from EAX into 64-bit EDX:EAX.
For 16-bit, AX into DX:AX with cwd or xor-zeroing.
unsigned: XOR EDX,EDX then DIV divisor
signed: CDQ then IDIV divisor
See also When and why do we sign extend and use cdq with mul/div?
Why (TL;DR)
For DIV, the registers EDX and EAX form one single 64 bit value (often shown as EDX:EAX), which is then divided, in this case, by EBX.
So if EAX = 10 or hex A and EDX is, say 20 or hex 14, then together they form the 64 bit value hex 14 0000 000A or decimal 85899345930. If this is divided by 5, the result is 17179869186 or hex
4 0000 0002, which is a value that does not fit in 32 bits.
That is why you get an integer overflow.
If, however, EDX were only 1, you would divide hex 1 0000 000A by 5, which results in hex 3333 3335. That is not the value you wanted, but it does not cause an integer overflow.
To really divide 32 bit register EAX by another 32 bit register, take care that the top of the 64 bit value formed by EDX:EAX is 0.
So, before a single division, you should generally set EDX to 0.
(Or for signed division, cdq to sign extend EAX into EDX:EAX before idiv)
But EDX does not always have to be 0. It just must not be so large that the quotient overflows 32 bits.
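The exact condition can be stated simply: the quotient fits in 32 bits precisely when EDX is smaller than the divisor. A small C model of the unsigned case, offered as a sketch (the function name `div32_model` and its return convention are my own assumptions, not anything the CPU or an assembler provides):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of unsigned 32-bit DIV: returns false where the CPU
   would raise a divide error, otherwise stores quotient and remainder. */
bool div32_model(uint32_t edx, uint32_t eax, uint32_t divisor,
                 uint32_t *quotient, uint32_t *remainder)
{
    if (divisor == 0 || edx >= divisor)
        return false;                              /* quotient needs > 32 bits */
    uint64_t dividend = ((uint64_t)edx << 32) | eax;   /* EDX:EAX */
    *quotient = (uint32_t)(dividend / divisor);
    *remainder = (uint32_t)(dividend % divisor);
    return true;
}
```

With the numbers used above, EDX = 20 and a divisor of 5 is rejected, while EDX = 1 is accepted and yields the quotient hex 3333 3335.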
One example from my BigInteger code:
After a division with DIV, the quotient is in EAX and the remainder is in EDX. To divide something like a BigInteger, which consists of an array of many DWORDS, by 10 (for instance to convert the value to a decimal string), you do something like the following:
; ECX contains number of "limbs" (DWORDs) to divide by 10
XOR EDX,EDX ; before start of loop, set EDX to 0
MOV EBX,10
LEA ESI,[EDI + 4*ECX - 4] ; now points to top element of array
@DivLoop:
MOV EAX,[ESI]
DIV EBX ; divide EDX:EAX by EBX. After that,
; quotient in EAX, remainder in EDX
MOV [ESI],EAX
SUB ESI,4 ; remainder in EDX is re-used as top DWORD...
DEC ECX ; ... for the next iteration, and is NOT set to 0.
JNE @DivLoop
After that loop, the value represented by the entire array (i.e. by the BigInteger) is divided by 10, and EDX contains the remainder of that division.
FWIW, in the assembler I use (Delphi's built-in assembler), labels starting with @ are local to the function, i.e. they don't interfere with equally named labels in other functions.
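The same limb loop can be modelled in C, which may help when porting it. This is a sketch under the assumption that the array is little-endian (lowest DWORD first, as the LEA above implies). The function name `limbs_div_small` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: divides the limb array in place by a small non-zero
   divisor and returns the remainder of the whole division. */
uint32_t limbs_div_small(uint32_t *limbs, size_t count, uint32_t divisor)
{
    uint32_t rem = 0;                      /* EDX before the loop          */
    for (size_t i = count; i-- > 0; ) {    /* top limb first               */
        uint64_t cur = ((uint64_t)rem << 32) | limbs[i];   /* EDX:EAX      */
        limbs[i] = (uint32_t)(cur / divisor);
        rem = (uint32_t)(cur % divisor);   /* carried into the next limb   */
    }
    return rem;
}
```

Each partial quotient fits in 32 bits because the carried remainder is always smaller than the divisor, which is the reason the loop never overflows even though EDX is non-zero.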
The DIV instruction divides EDX:EAX by the r/m32 that follows the DIV instruction. So, if you fail to set EDX to zero, the value you are using becomes extremely large.
Trust that helps

Assembly Language 8086 Display 1-10 [duplicate]


Assembly 8086 loops issue [duplicate]

This question already has an answer here:
Problems with IDIV Assembly Language
(1 answer)
Closed 1 year ago.
The pseudocode is the following:
read c //a double digit number
for(i=1,n,i++)
{ if (n%i==0)
print i;}
In assembly I have written it as:
mov bx,ax ; ax was the number ex.0020, storing a copy in bx.
mov cx,1 ; the start of the for loop
.forloop:
mov ax,bx ; resetting ax to be the number(needed for the next iterations)
div cx
cmp ah,0 ; checking if the remainder is 0
jne .ifinstr
add cl 48 ;adding so my number would be displayed later as decimal
mov dl,cl ;printing the remainder
mov ah,2
int 21h
sub cl,48 ;converting it back to hexa
.ifinstr:
inc cx ;the loop goes on
cmp cx,bx
jle .forloop
I've checked by tracing its steps. The first iteration goes well, then, at the second one, it makes ax=the initial number and cx=2 as it should, but at 'div cx' it jumps somewhere unknown to me and it doesn't stop anywhere. It does:
push ax
mov al,12
nop
push 9
.
.
Any idea why it does that?
Try doing mov dx,0 just before the div instruction.
Every time execution comes back around the loop, the DX register may hold leftover data, so clear it with mov dx,0 or xor dx,dx.
This is necessary because div with a word operand divides the 32-bit value in DX:AX, and it returns the remainder in DX, not in AH.
See this:
Unsigned divide.
Algorithm:
when operand is a byte:
AL = AX / operand
AH = remainder (modulus)
when operand is a word:
AX = (DX AX) / operand
DX = remainder (modulus)
Example:
MOV AX, 203 ; AX = 00CBh
MOV BL, 4
DIV BL ; AL = 50 (32h), AH = 3
RET

Executing the DIV instruction in 8086 assembly

My program prints the msg3 statement (PutStr msg3), but does not proceed to the
DIV CX
instruction in my program.
Is there something I'm doing incorrectly with that register?
Or should the instruction be
DIV [CX]
instead or do I not have the compare and jump conditions set correctly?
prime_loop:
sub AX,AX ;clears the reg to allow the next index of the array
sub CX,CX ;clears counter to decrement starting from number of the value for array
mov AX, [test_marks+ESI*4] ;copy value of array at index ESI into reg
mov CX, [test_marks+ESI*4] ;copy value of array at index ESI into reg for purposes of counting down
check_prime:
dec CX
nwln
PutStr msg3
div WORD CX ;divide value of EAX by ECX
cmp DX,0 ;IF the remainder is zero
je chck_divisor ;check to see divisor 'ECX'
sub AX,AX ;else clear quotient register EAX
sub DX,DX ;clear remainder register
mov AX,[test_marks+ESI*4] ;move the number of the current iteration back into EAX
jmp check_prime ;start again from loop
chck_divisor:
cmp CX,1
jne prime_loop ;if the divisor is not 1 then it is not a prime number so continue with iterations
PutInt AX ;else print the prime_num
PutStr
inc ESI
jmp prime_loop
done:
.EXIT
These are some points about your code:
If this is indeed 8086 assembly then instructions like mov AX, [test_marks+ESI*4] that use scaled indexed addressing simply don't exist!
The scale by 4 suggests that your array is filled with doublewords, yet you use just a word. This could be what you want, but it looks suspicious.
Let's hope no array element is 1, because then CX becomes 0 after the dec and the div cx instruction will trigger a divide exception (#DE). You never test the CX register for becoming 0.
In the check_prime loop, only the 1st iteration lacks the zeroing of DX that div needs in order to give a correct quotient.
The solution will depend on the targeted architecture: 8086 or x86. Right now your program is a mix of both!
It's possible that since DX is not zeroed before div, that you're getting overflow. I don't know how your environment handles overflow.

Fastest way to calculate a 128-bit integer modulo a 64-bit integer

I have a 128-bit unsigned integer A and a 64-bit unsigned integer B. What's the fastest way to calculate A % B - that is the (64-bit) remainder from dividing A by B?
I'm looking to do this in either C or assembly language, but I need to target the 32-bit x86 platform. This unfortunately means that I cannot take advantage of compiler support for 128-bit integers, nor of the x64 architecture's ability to perform the required operation in a single instruction.
Edit:
Thank you for the answers so far. However, it appears to me that the suggested algorithms would be quite slow - wouldn't the fastest way to perform a 128-bit by 64-bit division be to leverage the processor's native support for 64-bit by 32-bit division? Does anyone know if there is a way to perform the larger division in terms of a few smaller divisions?
Re: How often does B change?
Primarily I'm interested in a general solution - what calculation would you perform if A and B are likely to be different every time?
However, a second possible situation is that B does not vary as often as A - there may be as many as 200 As to divide by each B. How would your answer differ in this case?
You can use the division version of Russian Peasant Multiplication.
To find the remainder, execute (in pseudo-code):
X = B;
while (X <= A/2)
{
X <<= 1;
}
while (A >= B)
{
if (A >= X)
A -= X;
X >>= 1;
}
The modulus is left in A.
You'll need to implement the shifts, comparisons and subtractions to operate on values made up of a pair of 64 bit numbers, but that's fairly trivial (likely you should implement the left-shift-by-1 as X + X).
This will loop at most 255 times (with a 128 bit A). Of course you need to do a pre-check for a zero divisor.
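Here is how the pseudo-code might look in portable C on a target without a native 128-bit type. Treat it as a sketch: the `u128` pair type and the helper names are assumptions made for illustration, and returning 0 for a zero divisor is an arbitrary choice standing in for the pre-check.

```c
#include <stdint.h>

typedef struct { uint64_t hi, lo; } u128;   /* hypothetical 128-bit pair */

static int u128_ge(u128 a, u128 b)          /* a >= b ? */
{
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

static u128 u128_sub(u128 a, u128 b)        /* a - b, assumes a >= b */
{
    u128 r = { a.hi - b.hi - (a.lo < b.lo), a.lo - b.lo };
    return r;
}

static u128 u128_shr1(u128 a)               /* a >> 1 */
{
    u128 r = { a.hi >> 1, (a.lo >> 1) | (a.hi << 63) };
    return r;
}

static u128 u128_shl1(u128 a)               /* a << 1 */
{
    u128 r = { (a.hi << 1) | (a.lo >> 63), a.lo << 1 };
    return r;
}

/* Returns a % b by shift-and-subtract. */
uint64_t mod128by64(u128 a, uint64_t b)
{
    if (b == 0)
        return 0;                    /* assumption: caller handles this case */
    u128 bb = { 0, b }, x = bb, half = u128_shr1(a);
    while (u128_ge(half, x))         /* while (X <= A/2) X <<= 1 */
        x = u128_shl1(x);
    while (u128_ge(a, bb)) {         /* while (A >= B)           */
        if (u128_ge(a, x))
            a = u128_sub(a, x);
        x = u128_shr1(x);
    }
    return a.lo;                     /* A < B now, so the high half is 0 */
}
```

The left shift can never lose a bit because X is only doubled while it is at most A/2.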
Perhaps you're looking for a finished program, but the basic algorithms for multi-precision arithmetic can be found in Knuth's Art of Computer Programming, Volume 2. You can find the division algorithm described online here. The algorithms deal with arbitrary multi-precision arithmetic, and so are more general than you need, but you should be able to simplify them for 128 bit arithmetic done on 64- or 32-bit digits. Be prepared for a reasonable amount of work (a) understanding the algorithm, and (b) converting it to C or assembler.
You might also want to check out Hacker's Delight, which is full of very clever assembler and other low-level hackery, including some multi-precision arithmetic.
If your B is small enough for the uint64_t + operation to not wrap:
Given A = AH*2^64 + AL:
A % B == (((AH % B) * (2^64 % B)) + (AL % B)) % B
== (((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
If your compiler supports 64-bit integers, then this is probably the easiest way to go.
MSVC's implementation of a 64-bit modulo on 32-bit x86 is some hairy loop filled assembly (VC\crt\src\intel\llrem.asm for the brave), so I'd personally go with that.
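To show what "small enough" can mean for the formula above, here is a C sketch restricted to divisors of at most 2^32. That bound is my own conservative assumption: it guarantees that neither the product nor the sum wraps in 64-bit arithmetic. The function name `mod128_small_b` is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: (ah * 2^64 + al) % b, valid only for 0 < b <= 2^32
   so that the intermediate product and sum cannot wrap. */
uint64_t mod128_small_b(uint64_t ah, uint64_t al, uint64_t b)
{
    /* (2^64 - b) % b equals 2^64 % b, and 2^64 - b is just 0 - b in
       unsigned 64-bit arithmetic. */
    uint64_t two64_mod_b = ((uint64_t)0 - b) % b;
    return ((ah % b) * two64_mod_b + al % b) % b;
}
```

For larger divisors the multiplication itself would need a wider intermediate, at which point one of the general algorithms is the safer route.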
This is an almost untested, partly speed-optimized Mod128by64 'Russian peasant' algorithm function. Unfortunately I'm a Delphi user so this function works under Delphi. :) But the assembler is almost the same so...
function Mod128by64(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = @Dividend
// : edx = @Divisor
//Out: eax:edx as Remainder
asm
//Registers inside routine
//Divisor = edx:ebp
//Dividend = bh:ebx:edx //We need 64 bits + 1 bit in bh
//Result = esi:edi
//ecx = Loop counter and Dividend index
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Divisor = edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz @DivByZero
xor edi, edi //Clear result
xor esi, esi
//Start of 64 bit division Loop
mov ecx, 15 //Load byte loop shift counter and Dividend index
@SkipShift8Bits: //Small Dividend numbers shift optimisation
cmp [eax + ecx], ch //Zero test
jnz @EndSkipShiftDividend
loop @SkipShift8Bits //Skip 8 bit loop
@EndSkipShiftDividend:
test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
jz @Shift8Bits //This Divisor is > $00FFFFFF:FFFFFFFF
mov ecx, 8 //Load byte shift counter
mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
shr esi, cl //esi = $00XXXXXX
mov edi, [eax + 9] //Load for one byte right shifted 32 bit value
@Shift8Bits:
mov bl, [eax + ecx] //Load 8 bits of Dividend
//Here we can unroll partial loop 8 bit division to increase execution speed...
mov ch, 8 //Set partial byte counter value
@Do65BitsShift:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
setc bh //Save 65th bit
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
sbb bh, 0 //Use 65th bit in bh
jnc @NoCarryAtCmp //Test...
add edi, ebp //Return previous dividend state
adc esi, edx
@NoCarryAtCmp:
dec ch //Decrement counter
jnz @Do65BitsShift
//End of 8 bit (byte) partial division loop
dec cl //Decrement byte loop shift counter
jns @Shift8Bits //Last jump at cl = 0!!!
//End of 64 bit division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
@RestoreRegisters:
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
@DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only returns 0.
xor edx, edx
jmp @RestoreRegisters
end;
At least one more speed optimisation is possible! After 'Huge Divisor Numbers Shift Optimisation' we can test divisors high bit, if it is 0 we do not need to use extra bh register as 65th bit to store in it. So unrolled part of loop can look like:
shl bl,1 //Shift dividend left for one bit
rcl edi,1
rcl esi,1
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
jnc #NoCarryAtCmpX
add edi, ebp //Restore previous dividend state
adc esi, edx
#NoCarryAtCmpX:
I know the question specified 32-bit code, but the answer for 64-bit may be useful or interesting to others.
And yes, 64b/32b => 32b division does make a useful building-block for 128b % 64b => 64b. libgcc's __umoddi3 (source linked below) gives an idea of how to do that sort of thing, but it only implements 2N % 2N => 2N on top of a 2N / N => N division, not 4N % 2N => 2N.
Wider multi-precision libraries are available, e.g. https://gmplib.org/manual/Integer-Division.html#Integer-Division.
GNU C on 64-bit machines does provide an __int128 type, and libgcc functions to multiply and divide as efficiently as possible on the target architecture.
x86-64's div r/m64 instruction does 128b/64b => 64b division (also producing remainder as a second output), but it faults if the quotient overflows. So you can't directly use it if A/B > 2^64-1, but you can get gcc to use it for you (or even inline the same code that libgcc uses).
The following function compiles (Godbolt compiler explorer) to one or two div instructions (which happen inside a libgcc function call). If there were a faster way, libgcc would probably use that instead.
#include <stdint.h>
uint64_t AmodB(unsigned __int128 A, uint64_t B) {
return A % B;
}
The __umodti3 function it calls calculates a full 128b/128b modulo, but the implementation of that function does check for the special case where the divisor's high half is 0, as you can see in the libgcc source. (libgcc builds the si/di/ti version of the function from that code, as appropriate for the target architecture. udiv_qrnnd is an inline asm macro that does unsigned 2N/N => N division for the target architecture.)
For x86-64 (and other architectures with a hardware divide instruction), the fast path (when high_half(A) < B, guaranteeing div won't fault) is just two not-taken branches, some fluff for out-of-order CPUs to chew through, and a single div r64 instruction, which takes about 50-100 cycles (see footnote 1) on modern x86 CPUs, according to Agner Fog's insn tables. Some other work can be happening in parallel with div, but the integer divide unit is not very pipelined and div decodes to a lot of uops (unlike FP division).
The fallback path still only uses two 64-bit div instructions for the case where B is only 64-bit, but A/B doesn't fit in 64 bits so A/B directly would fault.
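As a rough illustration of that two-step idea (not libgcc's actual code), here is a minimal sketch in GNU C. The function name `mod128_by_64` and the hi/lo parameter split are assumptions made for this example. Reducing the high half first guarantees that the second division's quotient fits in 64 bits, which is the condition under which a single div r64 would not fault.

```c
#include <stdint.h>

/* Hypothetical helper: remainder of the 128-bit value (hi:lo) divided by b.
 * Assumes b != 0 and a compiler with the GNU C unsigned __int128 extension. */
uint64_t mod128_by_64(uint64_t hi, uint64_t lo, uint64_t b)
{
    /* Step 1: reduce the high half, so that r < b. */
    uint64_t r = hi % b;

    /* Step 2: divide (r:lo) by b. Because r < b, the quotient of this
     * 128-bit / 64-bit division fits in 64 bits. */
    unsigned __int128 t = ((unsigned __int128)r << 64) | lo;
    return (uint64_t)(t % b);
}
```

For example, mod128_by_64(1, 0, 10) computes 2^64 % 10, which is 6.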
Note that libgcc's __umodti3 just inlines __udivmoddi4 into a wrapper that only returns the remainder.
Footnote 1: 32-bit div is over 2x faster on Intel CPUs. On AMD CPUs, performance only depends on the size of the actual input values, even if they're small values in a 64-bit register. If small values are common, it might be worth benchmarking a branch to a simple 32-bit division version before doing 64-bit or 128-bit division.
For repeated modulo by the same B
It might be worth considering calculating a fixed-point multiplicative inverse for B, if one exists. For example, with compile-time constants, gcc does the optimization for types narrower than 128b.
uint64_t modulo_by_constant64(uint64_t A) { return A % 0x12345678ABULL; }
movabs rdx, -2233785418547900415
mov rax, rdi
mul rdx
mov rax, rdx # wasted instruction, could have kept using RDX.
movabs rdx, 78187493547
shr rax, 36 # division result
imul rax, rdx # multiply and subtract to get the modulo
sub rdi, rax
mov rax, rdi
ret
x86's mul r64 instruction does 64b*64b => 128b (rdx:rax) multiplication, and can be used as a building block to construct a 128b * 128b => 256b multiply to implement the same algorithm. Since we only need the high half of the full 256b result, that saves a few multiplies.
Modern Intel CPUs have very high performance mul: 3c latency, one per clock throughput. However, the exact combination of shifts and adds required varies with the constant, so the general case of calculating a multiplicative inverse at run-time isn't quite as efficient each time it's used as a JIT-compiled or statically-compiled version (even on top of the pre-computation overhead).
IDK where the break-even point would be. For JIT-compiling, it will be higher than ~200 reuses, unless you cache generated code for commonly-used B values. For the "normal" way, it might possibly be in the range of 200 reuses, but IDK how expensive it would be to find a modular multiplicative inverse for 128-bit / 64-bit division.
libdivide can do this for you, but only for 32 and 64-bit types. Still, it's probably a good starting point.
I have made both versions of the Mod128by64 'Russian peasant' division function: classic and speed optimised. On my 3 GHz PC the speed optimised version can do more than 1,000,000 random calculations per second and is more than three times faster than the classic function.
If we compare the execution time of calculating a 128 by 64 bit modulo with that of a 64 by 64 bit modulo, then this function is only about 50% slower.
Classic Russian peasant:
function Mod128by64Clasic(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = #Dividend
// : edx = #Divisor
//Out: eax:edx as Remainder
asm
//Registers inside routine
//edx:ebp = Divisor
//ecx = Loop counter
//Result = esi:edi
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Load divisor to edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz #DivByZero
push [eax] //Store Dividend to the stack
push [eax + 4]
push [eax + 8]
push [eax + 12]
xor edi, edi //Clear result
xor esi, esi
mov ecx, 128 //Load shift counter
#Do128BitsShift:
shl [esp + 12], 1 //Shift dividend from stack left for one bit
rcl [esp + 8], 1
rcl [esp + 4], 1
rcl [esp], 1
rcl edi, 1
rcl esi, 1
setc bh //Save 65th bit
sub edi, ebp //Compare dividend and divisor
sbb esi, edx //Subtract the divisor
sbb bh, 0 //Use 65th bit in bh
jnc #NoCarryAtCmp //Test...
add edi, ebp //Restore previous dividend state
adc esi, edx
#NoCarryAtCmp:
loop #Do128BitsShift
//End of 128 bit division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
lea esp, [esp + 16] //Release the Dividend copy from the stack (skipped on the Div by 0 path, which never pushed it)
#RestoreRegisters:
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
#DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only return 0.
xor edx, edx
jmp #RestoreRegisters
end;
Speed optimised Russian peasant:
function Mod128by64Oprimized(Dividend: PUInt128; Divisor: PUInt64): UInt64;
//In : eax = #Dividend
// : edx = #Divisor
//Out: eax:edx as Remainder
asm
//Registers inside routine
//Divisor = edx:ebp
//Dividend = ebx:edx //We need 64 bits
//Result = esi:edi
//ecx = Loop counter and Dividend index
push ebx //Store registers to stack
push esi
push edi
push ebp
mov ebp, [edx] //Divisor = edx:ebp
mov edx, [edx + 4]
mov ecx, ebp //Div by 0 test
or ecx, edx
jz #DivByZero
xor edi, edi //Clear result
xor esi, esi
//Start of 64 bit division Loop
mov ecx, 15 //Load byte loop shift counter and Dividend index
#SkipShift8Bits: //Small Dividend numbers shift optimisation
cmp [eax + ecx], ch //Zero test
jnz #EndSkipShiftDividend
loop #SkipShift8Bits //Skip leading zero bytes of the Dividend
#EndSkipShiftDividend:
test edx, $FF000000 //Huge Divisor Numbers Shift Optimisation
jz #Shift8Bits //This Divisor is > $00FFFFFF:FFFFFFFF
mov ecx, 8 //Load byte shift counter
mov esi, [eax + 12] //Do fast 56 bit (7 bytes) shift...
shr esi, cl //esi = $00XXXXXX
mov edi, [eax + 9] //Load for one byte right shifted 32 bit value
#Shift8Bits:
mov bl, [eax + ecx] //Load 8 bit part of Dividend
//Compute 8 bits (unrolled loop)
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove0 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow0
ja #DividentAbove0
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow0
#DividentAbove0:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow0:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove1 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow1
ja #DividentAbove1
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow1
#DividentAbove1:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow1:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove2 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow2
ja #DividentAbove2
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow2
#DividentAbove2:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow2:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove3 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow3
ja #DividentAbove3
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow3
#DividentAbove3:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow3:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove4 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow4
ja #DividentAbove4
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow4
#DividentAbove4:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow4:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove5 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow5
ja #DividentAbove5
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow5
#DividentAbove5:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow5:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove6 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow6
ja #DividentAbove6
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow6
#DividentAbove6:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow6:
shl bl, 1 //Shift dividend left for one bit
rcl edi, 1
rcl esi, 1
jc #DividentAbove7 //dividend hi bit set?
cmp esi, edx //dividend hi part larger?
jb #DividentBelow7
ja #DividentAbove7
cmp edi, ebp //dividend lo part larger?
jb #DividentBelow7
#DividentAbove7:
sub edi, ebp //Subtract the divisor
sbb esi, edx
#DividentBelow7:
//End of compute 8 bits (unrolled loop)
dec cl //Decrement byte loop shift counter
jns #Shift8Bits //Last jump at cl = 0!!!
//End of division loop
mov eax, edi //Load result to eax:edx
mov edx, esi
#RestoreRegisters:
pop ebp //Restore Registers
pop edi
pop esi
pop ebx
ret
#DivByZero:
xor eax, eax //Here you can raise Div by 0 exception, now function only return 0.
xor edx, edx
jmp #RestoreRegisters
end;
I'd like to share a few thoughts.
It's not as simple as MSN proposes I'm afraid.
In the expression:
(((AH % B) * ((2^64 - B) % B)) + (AL % B)) % B
both multiplication and addition may overflow. I think one could take it into account and still use the general concept with some modifications, but something tells me it's going to get really scary.
I was curious how 64 bit modulo operation was implemented in MSVC and I tried to find something out. I don't really know assembly and all I had available was Express edition, without the source of VC\crt\src\intel\llrem.asm, but I think I managed to get some idea what's going on, after a bit of playing with the debugger and disassembly output. I tried to figure out how the remainder is calculated in case of positive integers and the divisor >=2^32. There is some code that deals with negative numbers of course, but I didn't dig into that.
Here is how I see it:
If divisor >= 2^32 both the dividend and the divisor are shifted right as much as necessary to fit the divisor into 32 bits. In other words: if it takes n digits to write the divisor down in binary and n > 32, n-32 least significant digits of both the divisor and the dividend are discarded. After that, the division is performed using hardware support for dividing 64 bit integers by 32 bit ones. The result might be incorrect, but I think it can be proved, that the result may be off by at most 1. After the division, the divisor (original one) is multiplied by the result and the product subtracted from the dividend. Then it is corrected by adding or subtracting the divisor if necessary (if the result of the division was off by one).
It's easy to divide 128 bit integer by 32 bit one leveraging hardware support for 64-bit by 32-bit division. In case the divisor < 2^32, one can calculate the remainder performing just 4 divisions as follows:
Let's assume the dividend is stored in:
DWORD dividend[4] = ...
the remainder will go into:
DWORD remainder;
1) Divide dividend[3] by divisor. Store the remainder in remainder.
2) Divide QWORD (remainder:dividend[2]) by divisor. Store the remainder in remainder.
3) Divide QWORD (remainder:dividend[1]) by divisor. Store the remainder in remainder.
4) Divide QWORD (remainder:dividend[0]) by divisor. Store the remainder in remainder.
After those 4 steps the variable remainder will hold what you are looking for.
(Please don't kill me if I got the endianness wrong. I'm not even a programmer.)
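The four steps above can be sketched in C roughly as follows. This is only an illustration: the function name `rem128_by_32` is made up for this example, and it assumes that dividend[0] holds the least significant 32 bits and that the divisor is nonzero. Each pass is one 64-bit by 32-bit division, and because the running remainder is always smaller than the divisor, the quotient of each pass fits in 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper: remainder of a 128-bit dividend divided by a 32-bit
 * divisor. dividend[3] is the most significant word (an assumption). */
uint32_t rem128_by_32(const uint32_t dividend[4], uint32_t divisor)
{
    uint64_t remainder = 0;

    /* Walk from the most significant word down to the least significant. */
    for (int i = 3; i >= 0; i--) {
        /* Form the QWORD (remainder:dividend[i]) and divide it. */
        uint64_t chunk = (remainder << 32) | dividend[i];
        remainder = chunk % divisor;
    }
    return (uint32_t)remainder;
}
```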
In case the divisor is greater than 2^32-1, I don't have good news. I don't have a complete proof that the result after the shift is off by no more than 1 in the procedure I described earlier, which I believe MSVC is using. I think, however, that it has something to do with the fact that the part that is discarded is at least 2^31 times less than the divisor, the dividend is less than 2^64 and the divisor is greater than 2^32-1, so the result is less than 2^32.
If the dividend has 128 bits, the trick with discarding bits won't work. So in the general case the best solution is probably the one proposed by GJ or caf. (Well, it would probably be the best even if discarding bits worked. Division, multiplication, subtraction and correction on 128-bit integers might be slower.)
I was also thinking about using the floating point hardware. The x87 floating point unit uses an 80-bit precision format with a fraction 64 bits long. I think one can get the exact result of a 64 bit by 64 bit division. (Not the remainder directly, but the remainder can be obtained using multiplication and subtraction like in the "MSVC procedure".) If the dividend is >= 2^64 and < 2^128, storing it in the floating point format seems similar to discarding the least significant bits in the "MSVC procedure". Maybe someone can prove the error in that case is bounded and find it useful. I have no idea if it has a chance to be faster than GJ's solution, but maybe it's worth a try.
The solution depends on what exactly you are trying to solve.
E.g. if you are doing arithmetic in a ring modulo a 64-bit integer, then using Montgomery reduction is very efficient. Of course, this assumes that you use the same modulus many times and that it pays off to convert the elements of the ring into a special representation.
To give just a very rough estimate of the speed of Montgomery reduction: I have an old benchmark that performs a modular exponentiation with a 64-bit modulus and exponent in 1600 ns on a 2.4 GHz Core 2. This exponentiation does about 96 modular multiplications (and modular reductions) and hence needs about 40 cycles per modular multiplication.
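To make the idea concrete, here is a minimal sketch of Montgomery multiplication with R = 2^64 in GNU C. It is not the benchmark code mentioned above. All names (`mont_neg_inv`, `mont_redc`, `mont_mulmod`) are invented for this example, and it assumes an odd modulus n below 2^63 so that the intermediate 128-bit sum cannot overflow. In real use, the conversion into Montgomery form would be done once per operand and reused across many multiplications, which is where the method pays off.

```c
#include <stdint.h>

/* Returns -n^(-1) mod 2^64 for odd n, using Newton iteration. */
static uint64_t mont_neg_inv(uint64_t n)
{
    uint64_t x = n;                 /* n*n == 1 (mod 8): correct to 3 bits */
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;             /* each step doubles the correct bits */
    return (uint64_t)0 - x;
}

/* Montgomery reduction: returns t / 2^64 mod n, for t < n * 2^64. */
static uint64_t mont_redc(unsigned __int128 t, uint64_t n, uint64_t ninv)
{
    uint64_t m = (uint64_t)t * ninv;    /* makes t + m*n divisible by 2^64 */
    uint64_t r = (uint64_t)((t + (unsigned __int128)m * n) >> 64);
    return r >= n ? r - n : r;
}

/* Computes a * b mod n. Assumption: n is odd and n < 2^63. */
uint64_t mont_mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t ninv = mont_neg_inv(n);
    /* Convert to Montgomery form: x -> x * 2^64 mod n (the costly part). */
    uint64_t ar = (uint64_t)(((unsigned __int128)(a % n) << 64) % n);
    uint64_t br = (uint64_t)(((unsigned __int128)(b % n) << 64) % n);
    /* The product in Montgomery form needs only multiplies and shifts. */
    uint64_t pr = mont_redc((unsigned __int128)ar * br, n, ninv);
    /* Convert back out of Montgomery form. */
    return mont_redc(pr, n, ninv);
}
```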
The accepted answer by @caf was really nice and highly rated, yet it contained a bug that went unnoticed for years.
To help test that and other solutions, I am posting a test harness and making it community wiki.
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
unsigned cafMod(unsigned A, unsigned B) {
assert(B);
unsigned X = B;
// while (X < A / 2) { Original code used <
while (X <= A / 2) {
X <<= 1;
}
while (A >= B) {
if (A >= X) A -= X;
X >>= 1;
}
return A;
}
void cafMod_test(unsigned num, unsigned den) {
if (den == 0) return;
unsigned y0 = num % den;
unsigned y1 = cafMod(num, den); // function under test
if (y0 != y1) {
printf("FAIL num:%x den:%x %x %x\n", num, den, y0, y1);
fflush(stdout);
exit(-1);
}
}
unsigned rand_unsigned() {
unsigned x = (unsigned) rand();
return x * 2 ^ (unsigned) rand();
}
void cafMod_tests(void) {
const unsigned i[] = { 0, 1, 2, 3, 0x7FFFFFFF, 0x80000000,
UINT_MAX - 3, UINT_MAX - 2, UINT_MAX - 1, UINT_MAX };
for (unsigned den = 0; den < sizeof i / sizeof i[0]; den++) {
if (i[den] == 0) continue;
for (unsigned num = 0; num < sizeof i / sizeof i[0]; num++) {
cafMod_test(i[num], i[den]);
}
}
cafMod_test(0x8711dd11, 0x4388ee88);
cafMod_test(0xf64835a1, 0xf64835a);
time_t t;
time(&t);
srand((unsigned) t);
printf("%u\n", (unsigned) t);fflush(stdout);
for (long long n = 10000LL * 1000LL * 1000LL; n > 0; n--) {
cafMod_test(rand_unsigned(), rand_unsigned());
}
puts("Done");
}
int main(void) {
cafMod_tests();
return 0;
}
As a general rule, division is slow and multiplication is faster, and bit shifting is faster yet. From what I have seen of the answers so far, most of the answers have been using a brute force approach using bit-shifts. There exists another way. Whether it is faster remains to be seen (AKA profile it).
Instead of dividing, multiply by the reciprocal. Thus, to discover A % B, first calculate the reciprocal of B ... 1/B. This can be done with a few loops using the Newton-Raphson method of convergence. To do this well will depend upon a good set of initial values in a table.
For more details on the Newton-Raphson method of converging on the reciprocal, please refer to http://en.wikipedia.org/wiki/Division_(digital)
Once you have the reciprocal, the quotient Q = A * 1/B.
The remainder R = A - Q*B.
To determine whether this would be faster than brute force (there will be many more multiplies, since we will be using 32-bit registers to simulate 64-bit and 128-bit numbers), profile it.
If B is constant in your code, you can pre-calculate the reciprocal and simply calculate using the last two formulae. This, I am sure will be faster than bit-shifting.
Hope this helps.
If 128-bit unsigned by 63-bit unsigned is good enough, then it can be done in a loop doing at most 63 cycles.
Consider this a proposed solution to MSN's overflow problem, limiting the overflow to 1 bit. We do so by splitting the problem in two: a modular multiplication for the upper half, then adding the results at the end.
In the following example upper corresponds to the most significant 64-bits, lower to the least significant 64-bits and div is the divisor.
#include <stdint.h>
// Computes (upper * 2^64 + lower) % div.
// Requires 0 < div < 2^63 so that no intermediate sum or shift overflows.
uint64_t mod128(uint64_t upper, uint64_t lower, uint64_t div) {
    uint64_t result = 0;
    uint64_t a = (UINT64_MAX % div) + 1; // 2^64 mod div, but may equal div
    if (a == div) { a = 0; }
    upper %= div; // the resulting bit-length determines number of cycles required
    // first we work out modular multiplication of (2^64*upper)%div
    while (upper != 0) {
        if ((upper & 1) == 1) {
            result += a;
            if (result >= div) { result -= div; }
        }
        a <<= 1;
        if (a >= div) { a -= div; }
        upper >>= 1;
    }
    // add up the 2 results and return the modulus
    lower %= div;
    return (lower + result) % div;
}
The only problem is that, if the divisor is 64-bits then we get overflows of 1-bit (loss of information) giving a faulty result.
It bugs me that I haven't figured out a neat way to handle the overflows.
I don't know how to compile the assembler codes, any help is appreciated to compile and test them.
I solved this problem by comparing against gmplib "mpz_mod()" and summing 1 million loop results. It was a long ride to go from slowdown (speedup 0.12) to speedup 1.54 -- that is the reason I think the C codes in this thread will be slow.
Details, including a test harness, are in this thread:
https://www.raspberrypi.org/forums/viewtopic.php?f=33&t=311893&p=1873122#p1873122
This is "mod_256()" with speedup over using gmplib "mpz_mod()", use of __builtin_clzll() for longer shifts was essential:
typedef __uint128_t uint256_t[2];
#define min(x, y) ((x<y) ? (x) : (y))
int clz(__uint128_t u)
{
// unsigned long long h = ((unsigned long long *)&u)[1];
unsigned long long h = u >> 64;
return (h!=0) ? __builtin_clzll(h) : 64 + __builtin_clzll(u);
}
__uint128_t mod_256(uint256_t x, __uint128_t n)
{
if (x[1] == 0) return x[0] % n;
else
{
__uint128_t r = x[1] % n;
int F = clz(n);
int R = clz(r);
for(int i=0; i<128; ++i)
{
if (R>F+1)
{
int h = min(R-(F+1), 128-i);
r <<= h; R-=h; i+=(h-1); continue;
}
r <<= 1; if (r >= n) { r -= n; R=clz(r); }
}
r += (x[0] % n); if (r >= n) r -= n;
return r;
}
}
If you have a recent x86 machine, there are 128-bit registers for SSE2+. I've never tried to write assembly for anything other than basic x86, but I suspect there are some guides out there.
I am 9 years after the battle but here is an interesting O(1) edge case for powers of 2 that is worth mentioning.
#include <stdio.h>
// example with 32 bits and 8 bits.
int main() {
int i = 930;
unsigned char b = (unsigned char) i;
printf("%d", (int) b); // 162, same as 930 % 256
}
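The same observation carries over to a 128-bit dividend. As a small sketch (the name `mod128_pow2` and the hi/lo split are assumptions for this example), when the divisor is 2^k with k at most 64, the remainder is just the low k bits, so the high half of the dividend never matters:

```c
#include <stdint.h>

/* Hypothetical helper: (hi:lo) mod 2^k for 0 <= k <= 64. */
uint64_t mod128_pow2(uint64_t hi, uint64_t lo, unsigned k)
{
    (void)hi;                       /* the high 64 bits cannot affect the result */
    if (k >= 64)
        return lo;                  /* avoid an undefined 64-bit shift */
    return lo & ((UINT64_C(1) << k) - 1);   /* keep only the low k bits */
}
```

With lo = 930 and k = 8 this yields 162, matching the 930 % 256 example above.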
Since there is no predefined 128-bit integer type in C, bits of A have to be represented in an array. Although B (64-bit integer) can be stored in an unsigned long long int variable, it is needed to put bits of B into another array in order to work on A and B efficiently.
After that, B is incremented as Bx2, Bx3, Bx4, ... until it is the greatest B less than A. And then (A-B) can be calculated, using some subtraction knowledge for base 2.
Is this the kind of solution that you are looking for?

Resources