How can I multiply and divide using only bit shifting and adding? - c

How can I multiply and divide using only bit shifting and adding?

To multiply in terms of adding and shifting you want to decompose one of the numbers by powers of two, like so:
21 * 5 = 10101_2 * 101_2 (Initial step)
= 10101_2 * (1 * 2^2 + 0 * 2^1 + 1 * 2^0)
= 10101_2 * 2^2 + 10101_2 * 2^0
= (10101_2 << 2) + (10101_2 << 0) (Decomposed)
= 10101_2 * 4 + 10101_2 * 1
= 10101_2 * 5
= 21 * 5 (Same as initial expression)
(_2 means base 2)
As you can see, multiplication can be decomposed into adding and shifting and back again. This is also why multiplication takes longer than bit shifts or adding - it's O(n^2) rather than O(n) in the number of bits. Real computer systems (as opposed to theoretical computer systems) have a finite number of bits, so multiplication takes a constant multiple of time compared to addition and shifting. If I recall correctly, modern processors, if pipelined properly, can do multiplication just about as fast as addition, by messing with the utilization of the ALUs (arithmetic units) in the processor.

The answer by Andrew Toulouse can be extended to division.
The division by integer constants is considered in detail in the book "Hacker's Delight" by Henry S. Warren (ISBN 9780201914658).
The first idea for implementing division is to write the inverse value of the denominator in base two.
E.g.,
1/3 = (base-2) 0.0101 0101 0101 0101 0101 0101 0101 0101 .....
So,
a/3 ≈ (a >> 2) + (a >> 4) + (a >> 6) + ... + (a >> 30)
for 32-bit arithmetic. (Each shift truncates, so the sum can fall slightly short of the exact quotient.)
By combining the terms in an obvious manner we can reduce the number of operations:
b = (a >> 2) + (a >> 4)
b += (b >> 4)
b += (b >> 8)
b += (b >> 16)
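As a hedged sketch of how the combined series above might look in C: because every right shift truncates, the sum can only undershoot the true quotient, so a final correction step is added. The function name div3_shift_add and the correction loop are illustrative assumptions, not part of the original answer, and the subtraction is treated as an addition of the two's complement.

```c
#include <stdint.h>

/* Approximate a/3 with the shift series, then correct the undershoot.
   Hypothetical helper name; assumes a 32-bit unsigned input. */
uint32_t div3_shift_add(uint32_t a)
{
    uint32_t b = (a >> 2) + (a >> 4);   /* a * 0.0101 (binary), truncated */
    b += b >> 4;                        /* a * 0.01010101 */
    b += b >> 8;                        /* 16 fractional bits */
    b += b >> 16;                       /* 32 fractional bits */

    /* Truncation only loses value, so b <= a/3 here.
       Compute the leftover r = a - 3*b with shifts and adds, then fix up. */
    uint32_t r = a - ((b << 1) + b);
    while (r >= 3) {
        r -= 3;
        b += 1;
    }
    return b;
}
```

For example, div3_shift_add(100) gives 33, the same as 100 / 3 in integer arithmetic.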
There are more exciting ways to calculate division and remainders.
EDIT1:
If the OP means multiplication and division of arbitrary numbers, not the division by a constant number, then this thread might be of use: https://stackoverflow.com/a/12699549/1182653
EDIT2:
One of the fastest ways to divide by integer constants is to exploit modular arithmetic and Montgomery reduction: What's the fastest way to divide an integer by 3?

X * 2 = 1 bit shift left
X / 2 = 1 bit shift right
X * 3 = shift left 1 bit and then add X

x << k == x multiplied by 2 to the power of k
x >> k == x divided by 2 to the power of k
You can use these shifts to do any multiplication operation. For example:
x * 14 == x * 16 - x * 2 == (x << 4) - (x << 1)
x * 12 == x * 8 + x * 4 == (x << 3) + (x << 2)
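The two identities above can be written as small C functions. This is only a sketch; the names mul14 and mul12 are made up for illustration, and unsigned types are assumed so that overflow wraps around instead of being undefined.

```c
#include <stdint.h>

/* x * 14 == x * 16 - x * 2 */
uint32_t mul14(uint32_t x)
{
    return (x << 4) - (x << 1);
}

/* x * 12 == x * 8 + x * 4 */
uint32_t mul12(uint32_t x)
{
    return (x << 3) + (x << 2);
}
```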
To divide a number by a non-power of two, I'm not aware of any easy way, unless you want to implement some low-level logic, use other binary operations and use some form of iteration.

A left shift by 1 position is analogous to multiplying by 2. A right shift is analogous to dividing by 2.
You can add in a loop to multiply. By picking the loop variable and the addition variable correctly, you can bound performance. Once you've explored that, you should use peasant multiplication.

A procedure for dividing integers that uses shifts and adds can be derived in straightforward fashion from decimal longhand division as taught in elementary school. The selection of each quotient digit is simplified, as the digit is either 0 or 1: if the current remainder is greater than or equal to the divisor, the least significant bit of the partial quotient is 1.
Just as with decimal longhand division, the digits of the dividend are considered from most significant to least significant, one digit at a time. This is easily accomplished by a left shift in binary division. Also, quotient bits are gathered by left shifting the current quotient bits by one position, then appending the new quotient bit.
In a classical arrangement, these two left shifts are combined into left shifting of one register pair. The upper half holds the current remainder, the lower half initially holds the dividend. As the dividend bits are transferred to the remainder register by left shift, the unused least significant bits of the lower half are used to accumulate the quotient bits.
Below are x86 assembly language and C implementations of this algorithm. This particular variant of a shift & add division is sometimes referred to as the "non-performing" variant, as the subtraction of the divisor from the current remainder is not performed unless the remainder is greater than or equal to the divisor (Otto Spaniol, "Computer Arithmetic: Logic and Design." Chichester: Wiley 1981, p. 144). In C, there is no notion of the carry flag used by the assembly version in the register pair left shift. Instead, it is emulated, based on the observation that the result of an addition modulo 2^n can be smaller than either addend only if there was a carry out.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h> // CHAR_BIT
#define USE_ASM 0
#if USE_ASM
uint32_t bitwise_division (uint32_t dividend, uint32_t divisor)
{
uint32_t quot;
__asm {
mov eax, [dividend];// quot = dividend
mov ecx, [divisor]; // divisor
mov edx, 32; // bits_left
mov ebx, 0; // rem
$div_loop:
add eax, eax; // (rem:quot) << 1
adc ebx, ebx; // ...
cmp ebx, ecx; // rem >= divisor ?
jb $quot_bit_is_0; // if (rem < divisor)
$quot_bit_is_1: //
sub ebx, ecx; // rem = rem - divisor
add eax, 1; // quot++
$quot_bit_is_0:
dec edx; // bits_left--
jnz $div_loop; // while (bits_left)
mov [quot], eax; // quot
}
return quot;
}
#else
uint32_t bitwise_division (uint32_t dividend, uint32_t divisor)
{
uint32_t quot, rem, t;
int bits_left = CHAR_BIT * sizeof (uint32_t);
quot = dividend;
rem = 0;
do {
// (rem:quot) << 1
t = quot;
quot = quot + quot;
rem = rem + rem + (quot < t);
if (rem >= divisor) {
rem = rem - divisor;
quot = quot + 1;
}
bits_left--;
} while (bits_left);
return quot;
}
#endif
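The carry emulation used in the C version above can be isolated into a tiny helper. This is a minimal sketch with a made-up name (add32_carry_out); it only demonstrates the observation that an unsigned sum smaller than one of its addends means the addition wrapped around.

```c
#include <stdint.h>

/* Returns 1 if a + b does not fit in 32 bits (carry out), else 0.
   Hypothetical helper; the sum is reduced modulo 2^32 by the cast. */
uint32_t add32_carry_out(uint32_t a, uint32_t b)
{
    uint32_t sum = (uint32_t)(a + b);   /* wraps around on overflow */
    return sum < a;                     /* wrapped sum is smaller than an addend */
}
```

In the division loop, quot + quot plays the role of a + b, and the returned bit is what gets shifted into rem.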

I translated the Python code to C. The example given had a minor flaw: if the dividend took up all 32 bits, the shift would fail. I just used 64-bit variables internally to work around the problem:
// Assumes nDivisor > 0 and nDividend >= 0
int No_divide(int nDivisor, int nDividend, int *nRemainder)
{
    int nQuotient = 0;
    int nPos = -1;
    unsigned long long ullDivisor = nDivisor;
    unsigned long long ullDividend = nDividend;
    // Shift the divisor up until it exceeds the dividend
    // (<= so that equal values are handled, as in the Python original)
    while (ullDivisor <= ullDividend)
    {
        ullDivisor <<= 1;
        nPos++;
    }
    ullDivisor >>= 1;
    while (nPos > -1)
    {
        if (ullDividend >= ullDivisor)
        {
            nQuotient += (1 << nPos);
            ullDividend -= ullDivisor;
        }
        ullDivisor >>= 1;
        nPos -= 1;
    }
    *nRemainder = (int) ullDividend;
    return nQuotient;
}

Take two numbers, let's say 9 and 10, and write them in binary: 1001 and 1010.
Start with a result, R, of 0.
Take one of the numbers, 1010 in this case (call it A), and shift it right by one bit. If you shift out a one, add the other number (call it B) to R.
Now shift B left by one bit and repeat until all bits have been shifted out of A.
It's easier to see what's going on when it's written out. Each row shows the value added to R and, on the right, the bit shifted out of A:
      0
   0000   0
  10010   1
 000000   0
1001000   1
-------
1011010
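The procedure described above translates almost line by line into C. This is a sketch under the assumption of 32-bit unsigned operands (so the product is taken modulo 2^32); the function name shift_add_multiply is made up for illustration.

```c
#include <stdint.h>

/* Multiply a by b using only shifts and adds (result modulo 2^32). */
uint32_t shift_add_multiply(uint32_t a, uint32_t b)
{
    uint32_t r = 0;         /* running result R */
    while (a != 0) {
        if (a & 1)          /* the bit about to be shifted out of A is 1 */
            r += b;         /* add the (shifted) B to R */
        a >>= 1;            /* shift A right by one bit */
        b <<= 1;            /* shift B left by one bit */
    }
    return r;
}
```

With a = 10 and b = 9 this adds 10010 and 1001000 (binary), giving 1011010, which is 90.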

Taken from here.
This is only for division (the code below is Java):
int add(int a, int b) {
int partialSum, carry;
do {
partialSum = a ^ b;
carry = (a & b) << 1;
a = partialSum;
b = carry;
} while (carry != 0);
return partialSum;
}
int subtract(int a, int b) {
return add(a, add(~b, 1));
}
int division(int dividend, int divisor) {
boolean negative = false;
if ((dividend & (1 << 31)) == (1 << 31)) { // Check for signed bit
negative = !negative;
dividend = add(~dividend, 1); // Negation
}
if ((divisor & (1 << 31)) == (1 << 31)) {
negative = !negative;
divisor = add(~divisor, 1); // Negation
}
int quotient = 0;
long r;
for (int i = 30; i >= 0; i = subtract(i, 1)) {
r = (divisor << i);
// Left shift divisor until it's smaller than dividend
if (r < Integer.MAX_VALUE && r >= 0) { // Avoid cases where comparison between long and int doesn't make sense
if (r <= dividend) {
quotient |= (1 << i);
dividend = subtract(dividend, (int) r);
}
}
}
if (negative) {
quotient = add(~quotient, 1);
}
return quotient;
}

This should work for multiplication:
.data
.text
.globl main
main:
# $4 * $5 = $2
addi $4, $0, 0x9
addi $5, $0, 0x6
add $2, $0, $0 # initialize product to zero
Loop:
beq $5, $0, Exit # if multiplier is 0,terminate loop
andi $3, $5, 1 # mask out the 0th bit in multiplier
beq $3, $0, Shift # if the bit is 0, skip add
addu $2, $2, $4 # add (shifted) multiplicand to product
Shift:
sll $4, $4, 1 # shift up the multiplicand 1 bit
srl $5, $5, 1 # shift down the multiplier 1 bit
j Loop # go for next
Exit: #
EXIT:
li $v0,10
syscall

The method below implements binary division, assuming both numbers are positive. If subtraction is a concern, it can be implemented with bitwise operators as well.
Code
-(int)binaryDivide:(int)numerator with:(int)denominator
{
if (numerator == 0 || denominator == 1) {
return numerator;
}
if (denominator == 0) {
#ifdef DEBUG
NSAssert(denominator != 0, @"denominator should be greater than 0");
#endif
return INFINITY;
}
// if (numerator <0) {
// numerator = abs(numerator);
// }
int maxBitDenom = [self getMaxBit:denominator];
int maxBitNumerator = [self getMaxBit:numerator];
int msbNumber = [self getMSB:maxBitDenom ofNumber:numerator];
int qoutient = 0;
int subResult = 0;
int remainingBits = maxBitNumerator-maxBitDenom;
if (msbNumber >= denominator) {
qoutient |=1;
subResult = msbNumber - denominator;
}
else {
subResult = msbNumber;
}
while (remainingBits > 0) {
int msbBit = (numerator & (1 << (remainingBits-1)))>0?1:0;
subResult = (subResult << 1) | msbBit;
if(subResult >= denominator) {
subResult = subResult - denominator;
qoutient= (qoutient << 1) | 1;
}
else{
qoutient = qoutient << 1;
}
remainingBits--;
}
return qoutient;
}
-(int)getMaxBit:(int)inputNumber
{
int maxBit = 0;
BOOL isMaxBitSet = NO;
for (int i=0; i<sizeof(inputNumber)*8; i++) {
if (inputNumber & (1<<i)) {
maxBit = i;
isMaxBitSet=YES;
}
}
if (isMaxBitSet) {
maxBit+=1;
}
return maxBit;
}
-(int)getMSB:(int)bits ofNumber:(int)number
{
int numbeMaxBit = [self getMaxBit:number];
return number >> (numbeMaxBit - bits);
}
For multiplication:
-(int)multiplyNumber:(int)num1 withNumber:(int)num2
{
int mulResult = 0;
int ithBit;
BOOL isNegativeSign = (num1<0 && num2>0) || (num1>0 && num2<0);
num1 = abs(num1);
num2 = abs(num2);
for (int i=0; i<sizeof(num2)*8; i++)
{
ithBit = num2 & (1<<i);
if (ithBit>0) {
mulResult += (num1 << i);
}
}
if (isNegativeSign) {
mulResult = ((~mulResult)+1);
}
return mulResult;
}

It is basically multiplying and dividing by powers of 2:
shift left by y = x * 2 ^ y
shift right by y = x / 2 ^ y (integer division)
For example, with eax = 2:
shl eax,2 = 2 * 2 ^ 2 = 8
shr eax,3 = 2 / 2 ^ 3 = 0 (1/4 truncated to an integer)

For anyone interested in a 16-bit x86 solution, there is a piece of code by JasonKnight here (he also includes a signed multiply piece, which I haven't tested). However, that code has issues with large inputs, where the "add bx,bx" part would overflow.
The fixed version:
softwareMultiply:
; INPUT CX,BX
; OUTPUT DX:AX - 32 bits
; CLOBBERS BX,CX,DI
xor ax,ax ; cheap way to zero a reg
mov dx,ax ; 1 clock faster than xor
mov di,cx
or di,bx ; cheap way to test for zero on both regs
jz #done
mov di,ax ; DI used for reg,reg adc
#loop:
shr cx,1 ; divide by two, bottom bit moved to carry flag
jnc #skipAddToResult
add ax,bx
adc dx,di ; reg,reg is faster than reg,imm16
#skipAddToResult:
add bx,bx ; faster than shift or mul
adc di,di
or cx,cx ; fast zero check
jnz #loop
#done:
ret
Or the same in GCC inline assembly:
asm("mov $0,%%ax\n\t"
"mov $0,%%dx\n\t"
"mov %%cx,%%di\n\t"
"or %%bx,%%di\n\t"
"jz done\n\t"
"mov %%ax,%%di\n\t"
"loop:\n\t"
"shr $1,%%cx\n\t"
"jnc skipAddToResult\n\t"
"add %%bx,%%ax\n\t"
"adc %%di,%%dx\n\t"
"skipAddToResult:\n\t"
"add %%bx,%%bx\n\t"
"adc %%di,%%di\n\t"
"or %%cx,%%cx\n\t"
"jnz loop\n\t"
"done:\n\t"
: "=d" (dx), "=a" (ax)
: "b" (bx), "c" (cx)
: "ecx", "edi"
);

Try this. https://gist.github.com/swguru/5219592
import sys

# implement divide operation without using built-in divide operator
def divAndMod_slow(y, x, debug=0):
    r = 0
    while y >= x:
        r += 1
        y -= x
    return r, y

# implement divide operation without using built-in divide operator
def divAndMod(y, x, debug=0):
    ## find the highest position of positive bit of the ratio
    pos = -1
    while y >= x:
        pos += 1
        x <<= 1
    x >>= 1
    if debug: print("y=%d, x=%d, pos=%d" % (y, x, pos))
    if pos == -1:
        return 0, y
    r = 0
    while pos >= 0:
        if y >= x:
            r += (1 << pos)
            y -= x
        if debug: print("y=%d, x=%d, r=%d, pos=%d" % (y, x, r, pos))
        x >>= 1
        pos -= 1
    return r, y

if __name__ == "__main__":
    if len(sys.argv) == 3:
        y = int(sys.argv[1])
        x = int(sys.argv[2])
    else:
        y = 313271356
        x = 7
    print("=== Slow Version ....")
    res = divAndMod_slow(y, x)
    print("%d = %d * %d + %d" % (y, x, res[0], res[1]))
    print("=== Fast Version ....")
    res = divAndMod(y, x, debug=1)
    print("%d = %d * %d + %d" % (y, x, res[0], res[1]))

Related

Bit Twiddling in C - computes 2 times x or returns largest signed number

I am writing a function that computes 2 times the parameter x, but if it overflows it should return the largest positive or negative signed number. The problem is I can only use ! ~ & ^ | + << >>. 32-bit integers are involved.
This is what I have so far:
int boundedMult(int x){
int xSign = x>>31;
int y = x << 1; // Multiply by 2
int signBit = y>>31; // Capture the signBit of the answer
int shift = signBit;
shift |= shift<<1; // Make shift all 1 or 0 depending on the sign bit
shift |= shift<<2;
shift |= shift<<4;
shift |= shift<<8;
shift |= shift<<15;
int answer = ~signBit<<31 | shift; // 01111... If signBit=1 and 1000... If signBit=0
}
Seems fine, right? I also can't use any constants outside of an unsigned byte (0-255) inclusive. I've tried so many approaches but I end up breaking one of these rules in all of them.
Interesting challenge! Here's my solution, I hope I didn't violate any of the constraints by mistake:
#include <stdio.h>
#include <stdint.h>
// work with uint to avoid undefined behavior (signed int overflow is undefined)
static inline int32_t x2(int32_t v) {
uint32_t uv = v;
// our first option: "multiply" by shifting:
uint32_t doubled = uv<<1;
// our second option: clamp to max/min integer:
uint32_t neg = !!(uv >> 31); // 1 if negative
uint32_t bigval = (~0u)>>1; // 0x7fffffff
uint32_t clamped = bigval + neg; // 0x80000000 if neg, 0x7fffffff otherwise
// so, which one will we use?
uint32_t ok = !((v>>31) ^ (v>>30)); // 0 if overflow, 1 otherwise
// note the use of signed value here
uint32_t mask = (~ok)+1; // 0x00000000 if overflow, 0xffffffff otherwise
// choose by masking one option with ones, the other with zeroes
return (mask & doubled) | ((~mask) & clamped);
}
static inline void check(int32_t val, int32_t expect) {
int32_t actual = x2(val);
if ((val & 0x3ffffff) == 0) {
printf("0x%08x...\n", val);
}
if (actual != expect) {
printf("val=%d, expected=%d, actual=%d\n", val, expect, actual);
}
}
int main() {
int32_t v = 0x80000000;
printf("checking negative clamp...\n");
for (; v < -0x40000000; ++v) {
check(v, 0x80000000);
}
printf("checking straight double...\n");
for(; v < 0x40000000; ++v) {
check(v, 2*v);
}
printf("checking positive clamp...\n");
for(; v < 0x7fffffff; ++v) {
check(v, 0x7fffffff);
}
check(0x7fffffff, 0x7fffffff);
printf("All done!\n");
return 0;
}
And it seems to work fine:
gcc -std=c99 -O2 -Wall -Werror -Wextra -pedantic bounded.c -o bounded && ./bounded
checking negative clamp...
0x80000000...
0x84000000...
0x88000000...
0x8c000000...
0x90000000...
0x94000000...
0x98000000...
0x9c000000...
0xa0000000...
0xa4000000...
0xa8000000...
0xac000000...
0xb0000000...
0xb4000000...
0xb8000000...
0xbc000000...
checking straight double...
0xc0000000...
0xc4000000...
0xc8000000...
0xcc000000...
0xd0000000...
0xd4000000...
0xd8000000...
0xdc000000...
0xe0000000...
0xe4000000...
0xe8000000...
0xec000000...
0xf0000000...
0xf4000000...
0xf8000000...
0xfc000000...
0x00000000...
0x04000000...
0x08000000...
0x0c000000...
0x10000000...
0x14000000...
0x18000000...
0x1c000000...
0x20000000...
0x24000000...
0x28000000...
0x2c000000...
0x30000000...
0x34000000...
0x38000000...
0x3c000000...
checking positive clamp...
0x40000000...
0x44000000...
0x48000000...
0x4c000000...
0x50000000...
0x54000000...
0x58000000...
0x5c000000...
0x60000000...
0x64000000...
0x68000000...
0x6c000000...
0x70000000...
0x74000000...
0x78000000...
0x7c000000...
All done!
Using this handy interactive compiler, we can get disassembly for various platforms. Annotated ARM64 assembly:
x2(int):
asr w1, w0, 30 # w1 = v >> 30
cmp w1, w0, asr 31 # compare w1 to (v>>31)
csetm w1, eq # w1 = eq ? -1 : 0
# --- so w1 is "mask"
mov w2, 2147483647 # w2 = 0x7fffffff
mvn w3, w1 # w3 = ~w1
# --- so w3 is ~mask
add w2, w2, w0, lsr 31 # w2 = w2 + (v>>31)
# --- so w2 is "clamped"
and w2, w3, w2 # w2 = w3 & w2
and w0, w1, w0, lsl 1 # w0 = w1 & (v << 1)
orr w0, w2, w0 # w0 = w2 | w0
ret # return w0
Looks pretty efficient to me. Pretty sweet that "doubled" is never saved to a register -- it's simply done as a shift on the input value for one of the and instructions.
Like this?
int boundedMult(int x)
{
int xSign = (x >> 31) & 1;
int resultArray[] = {x + x, x + x, ~(1 << 31), 1 << 31};
int willOverflow = xSign ^ ((x >> 30) & 1);
return resultArray[(willOverflow << 1) + xSign];
}
Just as @Tom Karzes wisely pointed out in a comment, "To see if the result will overflow, all you have to do is see if the two high-order bits are different in the original number."
It can be done without knowing how many bits there are in an int. An overflow means the sign bit changes if you *2 then /2. The change can be detected by xor and ends up being min int in 2's complement.
int t2(int v)
{
int v2=v<<1; // *2
int ovfl=(v2>>1)^v; // smallest negative # if overflow or 0 if not
return (ovfl&v)+(~ovfl&~!!(ovfl&~v)+1)+((~!ovfl+1)&v2);
}
or
template<class T> T t2(T v)
{
T v2=v<<1; // *2
T ovfl=(v2>>1)^v; // smallest negative # if overflow or 0 if not
return (ovfl&v)+(~ovfl&~!!(ovfl&~v)+1)+((~!ovfl+1)&v2);
}
Here's one approach, it's not the best approach, and it can be optimized, but this approach is easy to explain. And the optimizations are reasonably straightforward.
int boundedMult(int x){
int ans = x << 1;
// if the sign bit between ans and x is different an overflow occurred
int ansSign = (ans >> 31) & 1;
int xSign = (x >> 31) & 1;
// but we can't use branching, so instead let's construct a number
int overflowed = ansSign ^ xSign;
// let's shift this up to the signbit
overflowed <<= 31;
// And let's subtract 1 to make it INTMAX or -1, we're doing this because -1
// is all 1s, so we can OR this later on.
overflowed += ~1+1;
// now overflowed contains INTMAX if an overflow has occurred and -1 otherwise
// but we want INTMIN if x is negative, we do this by taking the complement of
// overflowed if xSign is set. A way to take the complement is to xor by
// -1. How can we make a negative 1?
// Four possibilities for xSign and ansSign at this point:
// ansSign xSign Meaning
// 0 0 x positive no overflow
// 1 0 x positive overflow
// 1 1 x negative no overflow
// 0 1 x negative overflow
// We want to detect the case "x negative overflow" without detecting any
// other case, and we want to generate a negative 1, generating a 0 for all
// other cases.
// ansSign xSign -(!ansSign & xSign)
// 0 0 0
// 1 0 0
// 1 1 0
// 0 1 -1
overflowed ^= ~(!ansSign & xSign) + 1;
// Now overflowed contains INTMAX, INTMIN or -1 as appropriate, so let's
// rename it
int aproposMaxIfOverflowed = overflowed;
// But there's one other problem. We only want to overwrite ans if an
// overflow happened. Because -(ansSign ^ xSign) is all 1s if an
// overflow occurred and all zeros if it hasn't, we can use it to blank or keep
// numbers
// This contains all 1s or ans
int ansIfNotOverflowed = ans | (~(ansSign ^ xSign) + 1);
// So now we have ansIfNotOverflowed, and aproposMaxIfOverflowed, we can
// combine these now
return ansIfNotOverflowed & aproposMaxIfOverflowed;
}

Divide by 9 without using division or multiplication operator

I have tried to solve this question but couldn't find a way. Any pointers would be appreciated.
The regular subtraction-based way of doing division is not the intention here; an ingenious use of the shift operators is.
Although an answer has been accepted, I post mine for what it's worth.
UPDATE. This works by multiplying by a recurring binary fraction. In decimal 1/9 = 0.1111111 recurring. In binary, that is 1/1001 = 0.000111000111000111 recurring.
Notice the binary multiplier is in groups of 6 bits, decimal 7 recurring. So what I want to do here, is to multiply the dividend by 7, shift it right 6 bits, and add it to a running quotient. However to keep significance, I do the shift after the addition, and shift the quotient q after the loop ends to align it properly.
There are up to 6 iterations of the calculation loop for a 32 bit int (6 bits * 6 shifts = 36 bits).
#include<stdio.h>
int main(void)
{
unsigned x, y, q, d;
int i, err = 0;
for (x=1; x<100; x++) { // candidates
q = 0; // quotient
y = (x << 3) - x; // y = x * 7
while(y) { // until nothing significant
q += y; // add (effectively) binary 0.000111
y >>= 6; // realign
}
q >>= 6; // align
d = x / 9; // the true answer
if (d != q) {
printf ("%d / 9 = %d (%d)\n", x, q, d); // print any errors
err++;
}
}
printf ("Errors: %d\n", err);
return 0;
}
Unfortunately, this fails for every candidate that is a multiple of 9, for rounding error, due to the same reason that multiplying decimal 27 * 0.111111 = 2.999999 and not 3. So I now complicate the answer by keeping the 4 l.s. bits of the quotient for rounding. The result is it works for all int values limited by the two top nibbles, one for the * 7 and one for the * 16 significance.
#include<stdio.h>
int main(void)
{
unsigned x, y, q, d;
int i, err = 0;
for (x=1; x<0x00FFFFFF; x++) {
q = 8; // quotient with (effectively) 0.5 for rounding
y = (x << 3) - x; // y = x * 7
y <<= 4; // y *= 16 for rounding
while(y) { // until nothing significant
q += y; // add (effectively) binary 0.000111
y >>= 6; // realign
}
q >>= (4 + 6); // the 4 bits significance + recurrence
d = x / 9; // the true answer
if (d != q) {
printf ("%d / 9 = %d (%d)\n", x, q, d); // print any errors
err++;
}
}
printf ("Errors: %d\n", err);
return 0;
}
Here's a solution heavily inspired by Hacker's Delight that really uses only bit shifts:
def divu9(n):
    q = n - (n >> 3)
    q = q + (q >> 6)
    q = q + (q >> 12) + (q >> 24)
    q = q >> 3
    r = n - (((q << 2) << 1) + q)
    return q + ((r + 7) >> 4)
    #return q + (r > 8)
See this answer: https://stackoverflow.com/a/11694778/4907651
Exactly what you're looking for except the divisor is 3.
EDIT: explanation
I will replace the add function with simply + as you're looking for the solution without using * or / only.
In this explanation, we assume we are dividing by 3.
Also, I am assuming you know how to convert decimal to binary and vice versa.
int divideby3 (int num) {
int sum = 0;
while (num > 3) {
sum += (num >> 2);
num = (num >> 2) + (num & 3);
}
if (num == 3)
sum += 1;
return sum;
}
This approach uses bitwise operators:
bitwise AND: &.
bitwise left shift: <<. Shifts binary values left.
bitwise right shift: >>. Shifts binary values right.
bitwise XOR: ^
The loop condition is (num > 3) because the divisor is 3. For a divisor of 9, changing the condition to (num > 9) is not enough on its own: the shift amount and the mask rely on 4 = 3 + 1, so they would have to be adapted as well.
Suppose the number we want to divide is 6.
In binary, 6 is represented as 000110.
Now, we enter while (num > 3) loop. The first statement adds sum (initialised to 0) to num >> 2.
What num >> 2 does:
num in binary initially: 00000000 00000110
after bitwise shift: 00000000 00000001 i.e. 1 in decimal
sum after adding num >> 2 is 1.
Since we know num >> 2 is equal to 1, we add that to num & 3.
num in binary initially: 00000000 00000110
3 in binary: 00000000 00000011
For each bit position in the result of expression a & b, the bit is 1 if both operands contain 1, and 0 otherwise
result of num & 3: 00000000 00000010 i.e. 2 in decimal
num after num = (num >> 2) + (num & 3) equals 1 + 2 = 3
Now, since num is EQUAL to 3, we enter the if (num == 3) branch.
We then add 1 to sum, and return the value. This value of sum is the quotient.
As expected, the value returned is 2.
Hope that wasn't a horrible explanation.
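For a divisor of 9, the same idea needs a different split. One possible sketch (my own assumption, not taken from the linked answer) uses 64 = 7 * 9 + 1: writing num = 64*a + b gives num = 9*(7*a) + (a + b), so 7*a goes to the quotient and a + b is carried forward. The name divide_by_9 is made up for illustration, and non-negative input is assumed.

```c
/* Divide by 9 with shifts, masks, adds and subtracts only.
   Hypothetical adaptation of the divide-by-3 loop; unsigned input assumed. */
unsigned divide_by_9(unsigned num)
{
    unsigned sum = 0;
    while (num >= 64) {
        unsigned a = num >> 6;      /* num = 64*a + b */
        unsigned b = num & 63;
        sum += (a << 3) - a;        /* 64*a = 9*(7*a) + a, so add 7*a */
        num = a + b;                /* what is still left to divide */
    }
    while (num >= 9) {              /* num < 64 here: at most 7 passes */
        sum += 1;
        num -= 9;
    }
    return sum;
}
```

For instance, divide_by_9(1000) takes a = 15 and b = 40 in the first pass (sum = 105, num = 55), and the cleanup loop then adds 6, giving 111.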
Create a loop and at every step subtract 9 from N (N-9, then (N-9)-9, and so on) until N < 9, counting the steps. For example, for 36/9: 36-9=27 (count 1), 27-9=18 (count 2), 18-9=9 (count 3), 9-9=0 (count 4).
So 36/9 = 4.
This http://en.wikipedia.org/wiki/Ancient_Egyptian_multiplication algorithm can do it using only subtraction and binary shifts in log(n) time. However, as far as I know, state-of-the-art hardware already uses either this one or even better algorithms. Therefore, I do not think there is anything you can do (assuming performance is your goal) unless you can somehow avoid the division completely or change your use case so that you can divide by a power of 2, because there are some tricks for these cases.
If you're not allowed to multiply/divide, you're left with addition/subtraction. Dividing by a number tells you how many times the dividend contains the divisor. You can use this directly: how many times can you subtract the divisor from the original value?
int dividend = 85;
int divisor = 9;
int remaining = dividend;
int result = 0;
while (remaining >= divisor)
{
    remaining -= divisor;
    result++;
}
std::cout << dividend << " / " << divisor << " = " << result;
If you need to divide a positive number, you can use the following function:
unsigned int divideBy9(unsigned int num)
{
unsigned int result = 0;
while (num >= 9)
{
result += 1;
num -= 9;
}
return result;
}
In the case of a negative number, you can use a similar approach.
Hope this helps!

How can I implement a modulo operation on unsigned ints with limited hardware in C

I have a machine which only supports 32 bit operations, long long does not work on this machine.
I have one 64 bit quantity represented as two unsigned int 32s.
The question is how can I perform a mod on that 64 bit quantity with a 32 bit divisor.
r = a mod b
where:
a is 64 bit value and b is 32 bit value
I was thinking that I could represent the mod part by doing:
a = a1 * (2 ^ 32) + a2 (where a1 is the top bits and a2 is the bottom bits)
(a1 * (2 ^32) + a2) mod b = ( (a1 * 2 ^ 32) mod b + a2 mod b) mod b
( (a1 * 2 ^ 32) mod b + a2 mod b) mod b = (a1 mod b * 2 ^ 32 mod b + a2 mod b) mod b
but the problem is that 2 ^ 32 mod b may sometimes be equal to 2 ^ 32 and therefore the multiplication will overflow. I have looked at attempting to convert the multiplication into an addition but that also requires me to use 2 ^ 32 which if I mod will again give me 2 ^ 32 :) so I am not sure how to perform an unsigned mod of a 64 bit value with a 32 bit one.
I guess a simple solution to this would be to perform the following operations:
a / b = c
a = a - floor(c) * b
perform 1 until c is equal to 0 and use a as the answer.
but I am not sure how to combine these two integers together to form the 64 bit value
Just to be complete here are some links for binary division and subtractions:
http://www.exploringbinary.com/binary-division/
and a description of binary division algorithm:
http://en.wikipedia.org/wiki/Division_algorithm
Works: Tested with 1000M random combinations against a 64-bit %.
Like grade school long division a/b (but in base 2), subtract b from a if possible, then shift, looping 64 times. Return the remainder.
#define MSBit 0x80000000L
uint32_t mod32(uint32_t a1 /* MSHalf */, uint32_t a2 /* LSHalf */, uint32_t b) {
uint32_t a = 0;
for (int i = 31+32; i >= 0; i--) {
if (a & MSBit) { // Note 1
a <<= 1;
a -= b;
} else {
a <<= 1;
}
if (a1 & MSBit) a++;
a1 <<= 1;
if (a2 & MSBit) a1++;
a2 <<= 1;
if (a >= b)
a -= b;
}
return a;
}
Note 1: This is the sneaky part to do a 33-bit subtraction. Since the code knows a has the MSBit set, 2*a will be greater than b, so a = 2*a - b. This counts on unsigned wrap-around.
[Edit]
Below is a generic modu() that works with any array size a and any size unsigned integer.
#include <stdint.h>
#include <limits.h>
#include <stddef.h> // size_t
// Use any unpadded unsigned integer type
#define UINT uint32_t
#define BitSize (sizeof(UINT) * CHAR_BIT)
#define MSBit ((UINT)1 << (BitSize - 1))
UINT modu(const UINT *aarray, size_t alen, UINT b) {
UINT r = 0;
while (alen-- > 0) {
UINT a = aarray[alen];
for (int i = BitSize; i > 0; i--) {
UINT previous = r;
r <<= 1;
if (a & MSBit) {
r++;
}
a <<= 1;
if ((previous & MSBit) || (r >= b)) {
r -= b;
}
}
}
return r;
}
UINT modu2(UINT a1 /* MSHalf */, UINT a2 /* LSHalf */, UINT b) {
UINT a[] = { a2, a1 }; // Least significant at index 0
return modu(a, sizeof a / sizeof a[0], b);
}
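Do it the same way you would do a long division with pencil and paper. Note that splitting the numerator into 16-bit chunks like this only works when the divisor fits in 16 bits; with a larger divisor, remain << 16 can overflow 32 bits (so the sample denom below would need to be at most 0xFFFF for the printed remainder to be correct).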
#include <stdio.h>
unsigned int numh = 0x12345678;
unsigned int numl = 0x456789AB;
unsigned int denom = 0x17234591;
int main() {
unsigned int numer, quotient, remain;
numer = numh >> 16;
quotient = numer / denom;
remain = numer - quotient * denom;
numer = (remain << 16) | (numh & 0xffff);
quotient = numer / denom;
remain = numer - quotient * denom;
numer = (remain << 16) | (numl >> 16);
quotient = numer / denom;
remain = numer - quotient * denom;
numer = (remain << 16) | (numl & 0xffff);
quotient = numer / denom;
remain = numer - quotient * denom;
printf("%X\n", remain);
return 0;
}

Efficient computation of greatest power of 2 < x [duplicate]

I have a requirement to compute the greatest power of 2 which is < an integer value, x
currently I am using:
#define log2(x) log(x)/log(2)
#define round(x) (int)(x+0.5)
x = round(pow(2,(ceil(log2(n))-1)));
this is in a performance critical function
Is there a more computationally efficient way of calculating x?
You are essentially looking for the highest non-zero bit in your number. Many processors have built-in instructions for this, which in turn are exposed by many compilers. For example, in GCC I would look at __builtin_clz, which
Returns the number of leading 0-bits in x, starting at the most significant bit position.
Together with sizeof(int) * CHAR_BIT and a shift, you can use this to figure out the corresponding pure-power-of-two integer. There's also a version for long integers.
(The CPU instruction is presumably called "CLZ" (count leading zeros), in case you need to look this up for other compilers.)
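A minimal sketch of how this might be put together, assuming GCC or Clang (for __builtin_clz) and a made-up function name, greatest_pow2_below. Since the builtin's result is undefined for an argument of 0, inputs below 2 are handled separately and yield 0 here, which is my own convention.

```c
#include <limits.h>

/* Greatest power of two strictly less than x, or 0 if x < 2.
   Relies on the GCC/Clang builtin __builtin_clz. */
unsigned greatest_pow2_below(unsigned x)
{
    if (x < 2)
        return 0;                   /* no such power; also keeps clz away from 0 */
    unsigned bits = sizeof(unsigned) * CHAR_BIT;
    /* index of the highest set bit of x - 1 */
    unsigned top = bits - 1 - (unsigned)__builtin_clz(x - 1);
    return 1u << top;
}
```

Using x - 1 rather than x makes exact powers of two map to the next lower power, so greatest_pow2_below(8) is 4 while greatest_pow2_below(9) is 8.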
I have an integer log2 function in my c-libutl library (hosted on googlecode if anyone is interested)
/*
** Integer log base 2 of a 32 bits integer values.
** llog2(0) == llog2(1) == 0
*/
unsigned short llog2(unsigned long x)
{
long l = 0;
x &= 0xFFFFFFFF; /* just in case 'long' is more than 32bit */
if (x==0) return 0;
#ifndef UTL_NOASM
#if defined(__POCC__) || defined(_MSC_VER) || defined (__WATCOMC__)
/* Pelles C MS Visual C++ OpenWatcom */
__asm { mov eax, [x]
bsr ecx, eax
mov l, ecx
}
#elif defined(__GNUC__)
l = (unsigned short) ((sizeof(long)*8 -1) - __builtin_clzl(x));
#else
#define UTL_NOASM
#endif
#endif
#ifdef UTL_NOASM /* Make a binary search.*/
if (x & 0xFFFF0000) {l += 16; x >>= 16;} /* 11111111111111110000000000000000 */
if (x & 0xFF00) {l += 8; x >>= 8 ;} /* 1111111100000000*/
if (x & 0xF0) {l += 4; x >>= 4 ;} /* 11110000*/
if (x & 0xC) {l += 2; x >>= 2 ;} /* 1100 */
if (x & 2) {l += 1; } /* 10 */
return l;
#endif
return (unsigned short)l;
}
Then you can simply compute
(1 << llog2(x))
to get the greatest power of two that is less than or equal to x (for x >= 2, use llog2(x - 1) if it must be strictly less than x). Beware of 0! You should handle it separately.
It uses assembler code but can also be forced to plain C code by defining the UTL_NOASM symbol.
The code was tested at the time, but I haven't used it for quite a while and can't say whether it behaves correctly in a 64-bit environment.
Based on Bit Twiddling Hacks: Find the log base 2 of an N-bit integer in O(lg(N)) operations by Sean Eron Anderson (code contributed by Eric Cole and Andrew Shapira):
unsigned int highest_bit (uint32_t v) {
unsigned int r = 0, s;
s = (v > 0xFFFF) << 4; v >>= s; r |= s;
s = (v > 0xFF ) << 3; v >>= s; r |= s;
s = (v > 0xF ) << 2; v >>= s; r |= s;
s = (v > 0x3 ) << 1; v >>= s; r |= s;
return r | (v >> 1);
}
This returns the index of the highest bit of the input; the greatest power of 2 no greater than the input is then 1 << highest_bit(x), and the greatest power of 2 strictly less than the input is thus simply 1 << highest_bit(x-1).
For 64-bit inputs, just change the input type to uint64_t and add the following extra line at the beginning of the function, after the variable declarations:
s = (v > 0xFFFFFFFF) << 5; v >>= s; r |= s;
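Put together, a 64-bit variant might look like the sketch below. It assumes the extra first step must shift by 32, that is (v > 0xFFFFFFFF) << 5, and the name highest_bit64 is only illustrative.

```c
#include <stdint.h>

/* Index of the highest set bit of a 64-bit value (0 for v == 0 or v == 1).
   Each step tests whether the upper half of the remaining range is
   non-empty and, if so, shifts it down and records the shift amount. */
unsigned int highest_bit64(uint64_t v)
{
    unsigned int r = 0, s;
    s = (v > 0xFFFFFFFFu) << 5; v >>= s; r |= s;   /* 32 */
    s = (v > 0xFFFFu)     << 4; v >>= s; r |= s;   /* 16 */
    s = (v > 0xFFu)       << 3; v >>= s; r |= s;   /*  8 */
    s = (v > 0xFu)        << 2; v >>= s; r |= s;   /*  4 */
    s = (v > 0x3u)        << 1; v >>= s; r |= s;   /*  2 */
    return r | (unsigned int)(v >> 1);             /*  1 */
}
```

The greatest power of two no greater than x is then (uint64_t)1 << highest_bit64(x); the cast matters because a plain int 1 cannot be shifted by 32 or more.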
Left and right shift operators do this the best
int MaxPowerOf2(int x)
{
int out = 1;
while(x > 1) { x >>= 1; out <<= 1;}
return out;
}
#include <math.h>
double greatestPower( double x )
{
return floor(log( x ) / log( 2 ));
}
This works because log is a monotonically increasing function. Note that the function returns the exponent; the power itself is 2 raised to that value.
Shifting bits around will most likely be much faster. Probably some bisection method on bits could make it even faster. Nice exercise for an improvement.
#include <stdio.h>
int closestPow2(int x)
{
int p;
if (x <= 1) return 0; /* No such power exists */
x--; /* Account for exact powers of 2, then one power less must be returned */
for (p = 0; x > 0; p++)
{
x >>= 1;
}
return 1<<(p-1);
}
int main(void)
{
printf("%x\n", closestPow2(0x7FFFFFFF));
return 0;
}

optimized itoa function

I am thinking about how to implement the conversion of an integer (4-byte, unsigned) to a string with SSE instructions. The usual routine is to divide the number and store it in a local variable, then invert the string (the inversion routine is missing in this example):
char *convert(unsigned int num, int base) {
static char buff[33];
char *ptr;
ptr = &buff[sizeof(buff) - 1];
*ptr = '\0';
do {
*--ptr="0123456789abcdef"[num%base];
num /= base;
} while(num != 0);
return ptr;
}
But inversion will take extra time. Is there any other algorithm that can be used, preferably with SSE instructions, to parallelize the function?
Terje Mathisen invented a very fast itoa() that does not require lookup tables. If you're not interested in the explanation of how it works, skip down to Performance or Implementation.
More than 15 years ago Terje Mathisen came up with a parallelized itoa() for base 10. The idea is to take a 32-bit value and break it into two chunks of 5 digits. (A quick Google search for "Terje Mathisen itoa" gave this post: http://computer-programming-forum.com/46-asm/7aa4b50bce8dd985.htm)
We start like so:
void itoa(char *buf, uint32_t val)
{
uint32_t lo = val % 100000;
uint32_t hi = val / 100000;
itoa_half(&buf[0], hi);
itoa_half(&buf[5], lo);
}
Now we just need an algorithm that can convert any integer in the domain [0, 99999] to a string. A naive way to do that might be:
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
// Move all but the first digit to the right of the decimal point.
float tmp = val / 10000.0;
for(size_t i = 0; i < 5; i++)
{
// Extract the next digit.
int digit = (int) tmp;
// Convert to a character.
buf[i] = '0' + (char) digit;
// Remove the lead digit and shift left 1 decimal place.
tmp = (tmp - digit) * 10.0;
}
}
Rather than use floating-point, we will use 4.28 fixed-point math because it is significantly faster in our case. That is, we fix the binary point at the 28th bit position such that 1.0 is represented as 2^28. To convert into fixed-point, we simply multiply by 2^28. We can easily round down to the nearest integer by masking with 0xf0000000, and we can extract the fractional portion by masking with 0x0fffffff.
(Note: Terje's algorithm differs slightly in the choice of fixed-point format.)
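To make the 4.28 format concrete, here is a small sketch of the three operations described above. The helper names (fix_from_int, fix_int_part, fix_frac_part) and the FIX_ONE constant are made up for illustration and do not come from Terje's code.

```c
#include <stdint.h>

typedef uint32_t fix4_28;          /* 4 integer bits, 28 fractional bits */

#define FIX_ONE (1u << 28)         /* 1.0 in 4.28 fixed-point */

/* Convert a small integer (0..15) to fixed-point: multiply by 2^28. */
fix4_28 fix_from_int(uint32_t n) { return n << 28; }

/* Integer part: the top 4 bits, shifted down. */
uint32_t fix_int_part(fix4_28 x) { return x >> 28; }

/* Fractional part: the low 28 bits. */
fix4_28 fix_frac_part(fix4_28 x) { return x & 0x0fffffffu; }
```

For example, 3.5 is represented as 3 * FIX_ONE + FIX_ONE / 2. Its integer part is 3, and multiplying its fractional part by 10 moves the next decimal digit, 5, into the integer position.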
So now we have:
typedef uint32_t fix4_28;
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
// Convert `val` to fixed-point and divide by 10000 in a single step.
// N.B. we would overflow a uint32_t if not for the parentheses.
fix4_28 tmp = val * ((1 << 28) / 10000);
for(size_t i = 0; i < 5; i++)
{
int digit = (int)(tmp >> 28);
buf[i] = '0' + (char) digit;
tmp = (tmp & 0x0fffffff) * 10;
}
}
The only problem with this code is that 2^28 / 10000 = 26843.5456, which is truncated to 26843. This causes inaccuracies for certain values. For example, itoa_half(buf, 83492) produces the string "83490". If we apply a small correction in our conversion to 4.28 fixed-point, then the algorithm works for all numbers in the domain [0, 99999]:
// 0 <= val <= 99999
void itoa_half(char *buf, uint32_t val)
{
fix4_28 const f1_10000 = (1 << 28) / 10000;
// 2^28 / 10000 is 26843.5456, but 26843.75 is sufficiently close.
fix4_28 tmp = val * (f1_10000 + 1) - (val / 4);
for(size_t i = 0; i < 5; i++)
{
int digit = (int)(tmp >> 28);
buf[i] = '0' + (char) digit;
tmp = (tmp & 0x0fffffff) * 10;
}
}
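One way to convince yourself of the "works for all numbers" claim is an exhaustive comparison against snprintf. The sketch below assumes the corrected conversion is val * (f1_10000 + 1) - (val / 4); the names itoa_half_fixed and count_mismatches are illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t fix4_28;

/* Corrected 5-digit conversion; writes exactly 5 characters, no terminator. */
void itoa_half_fixed(char *buf, uint32_t val)
{
    fix4_28 const f1_10000 = (1u << 28) / 10000;
    /* Effective multiplier is about 26843.75, slightly above 2^28 / 10000. */
    fix4_28 tmp = val * (f1_10000 + 1) - (val / 4);
    for (size_t i = 0; i < 5; i++) {
        buf[i] = (char)('0' + (tmp >> 28));
        tmp = (tmp & 0x0fffffffu) * 10;
    }
}

/* Number of inputs in [0, 99999] whose digits differ from snprintf's. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t v = 0; v <= 99999; v++) {
        char got[6] = {0};
        char want[6];
        itoa_half_fixed(got, v);
        snprintf(want, sizeof want, "%05u", (unsigned)v);
        if (memcmp(got, want, 5) != 0)
            bad++;
    }
    return bad;
}
```

The correction works because the rounded-up multiplier keeps the error positive and small: even after four multiplications by 10 it stays below one unit of the last digit.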
Terje interleaves the itoa_half part for the low & high halves:
void itoa(char *buf, uint32_t val)
{
fix4_28 const f1_10000 = (1 << 28) / 10000;
fix4_28 tmplo, tmphi;
uint32_t lo = val % 100000;
uint32_t hi = val / 100000;
tmplo = lo * (f1_10000 + 1) - (lo / 4);
tmphi = hi * (f1_10000 + 1) - (hi / 4);
for(size_t i = 0; i < 5; i++)
{
buf[i + 0] = '0' + (char)(tmphi >> 28);
buf[i + 5] = '0' + (char)(tmplo >> 28);
tmphi = (tmphi & 0x0fffffff) * 10;
tmplo = (tmplo & 0x0fffffff) * 10;
}
}
There is an additional trick that makes the code slightly faster if the loop is fully unrolled. The multiply by 10 is implemented as either a LEA+SHL or LEA+ADD sequence. We can save 1 instruction by multiplying instead by 5, which requires only a single LEA. This has the same effect as shifting tmphi and tmplo right by 1 position each pass through the loop, but we can compensate by adjusting our shift counts and masks like this:
uint32_t mask = 0x0fffffff;
uint32_t shift = 28;
for(size_t i = 0; i < 5; i++)
{
buf[i + 0] = '0' + (char)(tmphi >> shift);
buf[i + 5] = '0' + (char)(tmplo >> shift);
tmphi = (tmphi & mask) * 5;
tmplo = (tmplo & mask) * 5;
mask >>= 1;
shift--;
}
This only helps if the loop is fully-unrolled because you can precalculate the value of shift and mask for each iteration.
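For reference, here is the multiply-by-5 loop placed inside a complete function so it can be compiled and checked. This is a sketch under the same assumptions as the code above; the name itoa_mul5 is illustrative, and a compiler would still need to unroll the loop for the trick to pay off.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t fix4_28;

/* Writes exactly 10 zero-padded decimal digits to buf (no terminator). */
void itoa_mul5(char *buf, uint32_t val)
{
    fix4_28 const f1_10000 = (1u << 28) / 10000;
    uint32_t lo = val % 100000;
    uint32_t hi = val / 100000;
    fix4_28 tmplo = lo * (f1_10000 + 1) - (lo / 4);
    fix4_28 tmphi = hi * (f1_10000 + 1) - (hi / 4);
    uint32_t mask = 0x0fffffffu;
    uint32_t shift = 28;

    for (size_t i = 0; i < 5; i++) {
        buf[i + 0] = (char)('0' + (tmphi >> shift));
        buf[i + 5] = (char)('0' + (tmplo >> shift));
        /* x * 5 equals (x * 10) >> 1, so the binary point moves down
           one bit per pass; mask and shift follow it. */
        tmphi = (tmphi & mask) * 5;
        tmplo = (tmplo & mask) * 5;
        mask >>= 1;
        shift--;
    }
}
```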
Finally, this routine produces zero-padded results. You can get rid of the padding by returning a pointer to the first character that is not 0 or the last character if val == 0:
char *itoa_unpadded(char *buf, uint32_t val)
{
char *p;
itoa(buf, val);
p = buf;
// Note: will break on GCC, but you can work around it by using memcpy() to dereference p.
if (*((uint64_t *) p) == 0x3030303030303030)
p += 8;
if (*((uint32_t *) p) == 0x30303030)
p += 4;
if (*((uint16_t *) p) == 0x3030)
p += 2;
if (*((uint8_t *) p) == 0x30)
p += 1;
return min(p, &buf[9]); // buf[9] is the last digit, so val == 0 yields "0"
}
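If the wide loads are a concern (strict aliasing, alignment, reading past the digits), a plain byte loop is a portable alternative. Note that this swaps in a simpler technique than the one above, and the name skip_leading_zeros is illustrative.

```c
#include <stddef.h>

/* Returns a pointer to the first non-'0' character among the first
   ndigits characters of buf, or to the last digit if all are '0'. */
char *skip_leading_zeros(char *buf, size_t ndigits)
{
    size_t i = 0;
    /* Stop one short of the end so that "000...0" still yields "0". */
    while (i + 1 < ndigits && buf[i] == '0')
        i++;
    return &buf[i];
}
```

It is slower than the branch-light version above for long runs of zeros, but it makes no assumptions about buffer alignment or size.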
There is one additional trick applicable to 64-bit (i.e. AMD64) code. The extra, wider registers make it efficient to accumulate each 5-digit group in a register; after the last digit has been calculated, you can smash them together with SHRD, OR them with 0x3030303030303030, and store to memory. This improves performance for me by about 12.3%.
Vectorization
We could execute the above algorithm as-is on the SSE units, but there is almost no gain in performance. However, if we split the value into smaller chunks, we can take advantage of SSE4.1 32-bit multiply instructions. I tried three different splits:
2 groups of 5 digits
3 groups of 4 digits
4 groups of 3 digits
The fastest variant was 4 groups of 3 digits. See below for the results.
Performance
I tested many variants of Terje's algorithm in addition to the algorithms suggested by vitaut and Inge Henriksen. I verified through exhaustive testing of inputs that each algorithm's output matches itoa().
My numbers are taken from a Westmere E5640 running Windows 7 64-bit. I benchmark at real-time priority and locked to core 0. I execute each algorithm 4 times to force everything into the cache. I time 2^24 calls using RDTSCP to remove the effect of any dynamic clock speed changes.
I timed 5 different patterns of inputs:
itoa(0 .. 9) -- nearly best-case performance
itoa(1000 .. 1999) -- longer output, no branch mispredicts
itoa(100000000 .. 999999999) -- longest output, no branch mispredicts
itoa(256 random values) -- varying output length
itoa(65536 random values) -- varying output length and thrashes L1/L2 caches
The data:
ALG        TINY     MEDIUM   LARGE    RND256   RND64K   NOTES
NULL         7 clk    7 clk    7 clk    7 clk    7 clk  Benchmark overhead baseline
TERJE_C     63 clk   62 clk   63 clk   57 clk   56 clk  Best C implementation of Terje's algorithm
TERJE_ASM   48 clk   48 clk   50 clk   45 clk   44 clk  Naive, hand-written AMD64 version of Terje's algorithm
TERJE_SSE   41 clk   42 clk   41 clk   34 clk   35 clk  SSE intrinsic version of Terje's algorithm with 1/3/3/3 digit grouping
INGE_0      12 clk   31 clk   71 clk   72 clk   72 clk  Inge's first algorithm
INGE_1      20 clk   23 clk   45 clk   69 clk   96 clk  Inge's second algorithm
INGE_2      18 clk   19 clk   32 clk   29 clk   36 clk  Improved version of Inge's second algorithm
VITAUT_0     9 clk   16 clk   32 clk   35 clk   35 clk  vitaut's algorithm
VITAUT_1    11 clk   15 clk   33 clk   31 clk   30 clk  Improved version of vitaut's algorithm
LIBC        46 clk  128 clk  329 clk  339 clk  340 clk  MSVCRT12 implementation
My compiler (VS 2013 Update 4) produced surprisingly bad code; the assembly version of Terje's algorithm is just a naive translation, and it's a full 21% faster. I was also surprised at the performance of the SSE implementation, which I expected to be slower. The big surprise was how fast INGE_2, VITAUT_0, and VITAUT_1 were. Bravo to vitaut for coming up with a portable solution that bests even my best effort at the assembly level.
Note: INGE_1 is a modified version of Inge Henriksen's second algorithm because the original has a bug.
INGE_2 is based on the second algorithm that Inge Henriksen gave. Rather than storing pointers to the precalculated strings in a char*[] array, it stores the strings themselves in a char[][5] array. The other big improvement is in how it stores characters in the output buffer. It stores more characters than necessary and uses pointer arithmetic to return a pointer to the first non-zero character. The result is substantially faster -- competitive even with the SSE-optimized version of Terje's algorithm. It should be noted that the microbenchmark favors this algorithm a bit because in real-world applications the 600K data set will constantly blow the caches.
VITAUT_1 is based on vitaut's algorithm with two small changes. The first change is that it copies character pairs in the main loop, reducing the number of store instructions. The second, similar to INGE_2, is that it always copies both final characters and uses pointer arithmetic to return a pointer to the start of the string.
Implementation
Here I give code for the 3 most interesting algorithms.
TERJE_ASM:
; char *itoa_terje_asm(char *buf<rcx>, uint32_t val<edx>)
;
; *** NOTE ***
; buf *must* be 8-byte aligned or this code will break!
itoa_terje_asm:
MOV EAX, 0xA7C5AC47
ADD RDX, 1
IMUL RAX, RDX
SHR RAX, 48 ; EAX = val / 100000
IMUL R11D, EAX, 100000
ADD EAX, 1
SUB EDX, R11D ; EDX = (val % 100000) + 1
IMUL RAX, 214748 ; RAX = (val / 100000) * 2^31 / 10000
IMUL RDX, 214748 ; RDX = (val % 100000) * 2^31 / 10000
; Extract buf[0] & buf[5]
MOV R8, RAX
MOV R9, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R8, 31 ; R8 = buf[0]
SHR R9, 31 ; R9 = buf[5]
; Extract buf[1] & buf[6]
MOV R10, RAX
MOV R11, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R10, 31 - 8
SHR R11, 31 - 8
AND R10D, 0x0000FF00 ; R10 = buf[1] << 8
AND R11D, 0x0000FF00 ; R11 = buf[6] << 8
OR R10D, R8D ; R10 = buf[0] | (buf[1] << 8)
OR R11D, R9D ; R11 = buf[5] | (buf[6] << 8)
; Extract buf[2] & buf[7]
MOV R8, RAX
MOV R9, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R8, 31 - 16
SHR R9, 31 - 16
AND R8D, 0x00FF0000 ; R8 = buf[2] << 16
AND R9D, 0x00FF0000 ; R9 = buf[7] << 16
OR R8D, R10D ; R8 = buf[0] | (buf[1] << 8) | (buf[2] << 16)
OR R9D, R11D ; R9 = buf[5] | (buf[6] << 8) | (buf[7] << 16)
; Extract buf[3], buf[4], buf[8], & buf[9]
MOV R10, RAX
MOV R11, RDX
LEA EAX, [RAX+RAX] ; RAX = (RAX * 2) & 0xFFFFFFFF
LEA EDX, [RDX+RDX] ; RDX = (RDX * 2) & 0xFFFFFFFF
LEA RAX, [RAX+RAX*4] ; RAX *= 5
LEA RDX, [RDX+RDX*4] ; RDX *= 5
SHR R10, 31 - 24
SHR R11, 31 - 24
AND R10D, 0xFF000000 ; R10 = buf[3] << 24
AND R11D, 0xFF000000 ; R11 = buf[8] << 24
AND RAX, 0x80000000 ; RAX = buf[4] << 31
AND RDX, 0x80000000 ; RDX = buf[9] << 31
OR R10D, R8D ; R10 = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24)
OR R11D, R9D ; R11 = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24)
LEA RAX, [R10+RAX*2] ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32)
LEA RDX, [R11+RDX*2] ; RDX = buf[5] | (buf[6] << 8) | (buf[7] << 16) | (buf[8] << 24) | (buf[9] << 32)
; Compact the character strings
SHL RAX, 24 ; RAX = (buf[0] << 24) | (buf[1] << 32) | (buf[2] << 40) | (buf[3] << 48) | (buf[4] << 56)
MOV R8, 0x3030303030303030
SHRD RAX, RDX, 24 ; RAX = buf[0] | (buf[1] << 8) | (buf[2] << 16) | (buf[3] << 24) | (buf[4] << 32) | (buf[5] << 40) | (buf[6] << 48) | (buf[7] << 56)
SHR RDX, 24 ; RDX = buf[8] | (buf[9] << 8)
; Store 12 characters. The last 2 will be null bytes.
OR R8, RAX
LEA R9, [RDX+0x3030]
MOV [RCX], R8
MOV [RCX+8], R9D
; Convert RCX into a bit pointer.
SHL RCX, 3
; Scan the first 8 bytes for a non-zero character.
OR EDX, 0x00000100
TEST RAX, RAX
LEA R10, [RCX+64]
CMOVZ RAX, RDX
CMOVZ RCX, R10
; Scan the next 4 bytes for a non-zero character.
TEST EAX, EAX
LEA R10, [RCX+32]
CMOVZ RCX, R10
SHR RAX, CL ; N.B. RAX >>= (RCX % 64); this works because buf is 8-byte aligned.
; Scan the next 2 bytes for a non-zero character.
TEST AX, AX
LEA R10, [RCX+16]
CMOVZ RCX, R10
SHR EAX, CL ; N.B. RAX >>= (RCX % 32)
; Convert back to byte pointer. N.B. this works because the AMD64 virtual address space is 48-bit.
SAR RCX, 3
; Scan the last byte for a non-zero character.
TEST AL, AL
MOV RAX, RCX
LEA R10, [RCX+1]
CMOVZ RAX, R10
RETN
INGE_2:
uint8_t len100K[100000];
char str100K[100000][5];
void itoa_inge_2_init()
{
memset(str100K, '0', sizeof(str100K));
for(uint32_t i = 0; i < 100000; i++)
{
char buf[6];
itoa(i, buf, 10);
len100K[i] = strlen(buf);
memcpy(&str100K[i][5 - len100K[i]], buf, len100K[i]);
}
}
char *itoa_inge_2(char *buf, uint32_t val)
{
char *p = &buf[10];
uint32_t prevlen;
*p = '\0';
do
{
uint32_t const old = val;
uint32_t mod;
val /= 100000;
mod = old - (val * 100000);
prevlen = len100K[mod];
p -= 5;
memcpy(p, str100K[mod], 5);
}
while(val != 0);
return &p[5 - prevlen];
}
VITAUT_1:
static uint16_t const str100p[100] = {
0x3030, 0x3130, 0x3230, 0x3330, 0x3430, 0x3530, 0x3630, 0x3730, 0x3830, 0x3930,
0x3031, 0x3131, 0x3231, 0x3331, 0x3431, 0x3531, 0x3631, 0x3731, 0x3831, 0x3931,
0x3032, 0x3132, 0x3232, 0x3332, 0x3432, 0x3532, 0x3632, 0x3732, 0x3832, 0x3932,
0x3033, 0x3133, 0x3233, 0x3333, 0x3433, 0x3533, 0x3633, 0x3733, 0x3833, 0x3933,
0x3034, 0x3134, 0x3234, 0x3334, 0x3434, 0x3534, 0x3634, 0x3734, 0x3834, 0x3934,
0x3035, 0x3135, 0x3235, 0x3335, 0x3435, 0x3535, 0x3635, 0x3735, 0x3835, 0x3935,
0x3036, 0x3136, 0x3236, 0x3336, 0x3436, 0x3536, 0x3636, 0x3736, 0x3836, 0x3936,
0x3037, 0x3137, 0x3237, 0x3337, 0x3437, 0x3537, 0x3637, 0x3737, 0x3837, 0x3937,
0x3038, 0x3138, 0x3238, 0x3338, 0x3438, 0x3538, 0x3638, 0x3738, 0x3838, 0x3938,
0x3039, 0x3139, 0x3239, 0x3339, 0x3439, 0x3539, 0x3639, 0x3739, 0x3839, 0x3939, };
char *itoa_vitaut_1(char *buf, uint32_t val)
{
char *p = &buf[10];
*p = '\0';
while(val >= 100)
{
uint32_t const old = val;
p -= 2;
val /= 100;
memcpy(p, &str100p[old - (val * 100)], sizeof(uint16_t));
}
p -= 2;
memcpy(p, &str100p[val], sizeof(uint16_t));
return &p[val < 10];
}
The first step to optimizing your code is getting rid of the arbitrary base support. This is because dividing by a constant is almost surely multiplication, but dividing by base is division, and because '0'+n is faster than "0123456789abcdef"[n] (no memory involved in the former).
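As a sketch of that first step, here is the question's routine specialized to base 10. It also writes into a caller-supplied buffer instead of a static one; that change and the name utoa10 are assumptions made for illustration.

```c
#include <stdint.h>

/* Base-10 only: the compiler can turn "/ 10" and "% 10" into a multiply
   and shifts, and '0' + digit needs no table lookup.
   buf must hold at least 11 bytes (10 digits plus the terminator).
   Returns a pointer to the first digit inside buf. */
char *utoa10(uint32_t num, char buf[11])
{
    char *ptr = &buf[10];
    *ptr = '\0';
    do {
        *--ptr = (char)('0' + num % 10);
        num /= 10;
    } while (num != 0);
    return ptr;
}
```

Because the digits are written from the end of the buffer backwards, no separate reversal pass is needed.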
If you need to go beyond that, you could make lookup tables for each byte in the base you care about (e.g. 10), then vector-add the (e.g. decimal) results for each byte. As in:
00 02 00 80 (input)
0000000000 (place3[0x00])
+0000131072 (place2[0x02])
+0000000000 (place1[0x00])
+0000000128 (place0[0x80])
==========
0000131200 (result)
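The table idea can be sketched in scalar C as follows. This version adds the per-byte decimal digits one column at a time with an explicit carry instead of using vector instructions, and the names place, init_place_tables and itoa_bytes are made up for illustration.

```c
#include <stdint.h>

/* place[k][b][d] = decimal digit d (least significant first) of b * 256^k */
uint8_t place[4][256][10];

void init_place_tables(void)
{
    for (int k = 0; k < 4; k++) {
        for (int b = 0; b < 256; b++) {
            uint64_t v = (uint64_t)b << (8 * k);
            for (int d = 0; d < 10; d++) {
                place[k][b][d] = (uint8_t)(v % 10);
                v /= 10;
            }
        }
    }
}

/* Writes 10 zero-padded digits plus a terminator into out. */
void itoa_bytes(uint32_t val, char out[11])
{
    unsigned carry = 0;
    for (int d = 0; d < 10; d++) {
        unsigned sum = carry;
        /* Add the d-th decimal digit contributed by each input byte. */
        for (int k = 0; k < 4; k++)
            sum += place[k][(val >> (8 * k)) & 0xFF][d];
        out[9 - d] = (char)('0' + sum % 10);
        carry = sum / 10;          /* at most 3, since sum <= 4 * 9 + 3 */
    }
    out[10] = '\0';
}
```

A vectorized version would add the four table rows with packed byte adds and resolve the carries afterwards; the carry handling is the part that needs care.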
This post compares several methods of integer to string conversion aka itoa. The fastest method reported there is fmt::format_int from the {fmt} library which is 5-18 times faster than sprintf/std::stringstream and ~4 times faster than a naive ltoa/itoa implementation (the actual numbers may of course vary depending on platform).
Unlike most other methods fmt::format_int does one pass over the digits. It also minimizes the number of integer divisions using the idea from Alexandrescu's talk Fastware. The implementation is available here.
This is of course if C++ is an option and you are not restricted by the itoa's API.
Disclaimer: I'm the author of this method and the fmt library.
http://sourceforge.net/projects/itoa/
It uses a big static const array of all 4-digit integers for 32-bit or 64-bit conversion to string.
Portable, no need of a specific instruction set.
The only faster version I could find was in assembly code and limited to 32 bits.
Interesting problem. If you're only interested in a radix-10 itoa(), I have made two examples: one that is 3 times as fast as the typical itoa() implementation and one that is 10 times as fast.
First example (3x performance)
The first, which is 3 times as fast as itoa(), uses a single-pass non-reversal software design pattern and is based on the open source itoa() implementation found in groff.
// itoaSpeedTest.cpp : Defines the entry point for the console application.
//
#pragma comment(lib, "Winmm.lib")
#include "stdafx.h"
#include "Windows.h"
#include <iostream>
#include <time.h>
using namespace std;
#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is
#endif
/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647
/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647
/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock */
#define TIMER_INIT
/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();
/** Use to stop the performance timer and output the result to the standard stream */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
/** Array used for fast number character lookup */
const char numbersIn10Radix[10] = {'0','1','2','3','4','5','6','7','8','9'};
/** Array used for fast reverse number character lookup */
const char reverseNumbersIn10Radix[10] = {'9','8','7','6','5','4','3','2','1','0'};
const char *reverseArrayEndPtr = &reverseNumbersIn10Radix[9];
/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm and is 3x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc#jclark.com>, 1989-1992
\author Inge Eivind Henriksen<inge#meronymy.com>, 2013
\note Function was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i)
{
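// Make room for a 32-bit signed integer's digits, a '-' and the '\0'.
// static, so that the returned pointer stays valid after the function returns (not thread-safe)
static char buf[_INT32_MAX_LENGTH + 2];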
char *p = buf + _INT32_MAX_LENGTH + 1;
*--p = '\0';
if (i >= 0)
{
do
{
*--p = numbersIn10Radix[i % 10];
i /= 10;
} while (i);
}
else
{
// Negative integer
do
{
*--p = reverseArrayEndPtr[i % 10];
i /= 10;
} while (i);
*--p = '-';
}
return p;
}
int _tmain(int argc, _TCHAR* argv[])
{
TIMER_INIT
// Make sure we are playing fair here
if (sizeof(int) != sizeof(_INT32))
{
cerr << "Error: integer size mismatch; test would be invalid." << endl;
return -1;
}
const int steps = 100;
{
char intBuffer[20];
cout << "itoa() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
itoa(i, intBuffer, 10);
TIMER_STOP;
}
{
cout << "Int32ToStr() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
Int32ToStr(i);
TIMER_STOP;
}
cout << "Done" << endl;
int wait;
cin >> wait;
return 0;
}
On 64-bit Windows the result from running this example is:
itoa() took:
2909.84 ms.
Int32ToStr() took:
991.726 ms.
Done
On 32-bit Windows the result from running this example is:
itoa() took:
3119.6 ms.
Int32ToStr() took:
1031.61 ms.
Done
Second example (10x performance)
If you don't mind spending some time initializing some buffers then it's possible to optimize the function above to be 10x faster than the typical itoa() implementation. What you need to do is to create string buffers rather than character buffers, like this:
// itoaSpeedTest.cpp : Defines the entry point for the console application.
//
#pragma comment(lib, "Winmm.lib")
#include "stdafx.h"
#include "Windows.h"
#include <iostream>
#include <time.h>
using namespace std;
#ifdef _WIN32
/** a signed 32-bit integer value type */
#define _INT32 __int32
/** a signed 8-bit integer value type */
#define _INT8 __int8
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#else
/** a signed 32-bit integer value type */
#define _INT32 long int // Guess what a 32-bit integer is
/** a signed 8-bit integer value type */
#define _INT8 char
/** an unsigned 8-bit integer value type */
#define _UINT8 unsigned _INT8
#endif
/** minimum allowed value in a signed 32-bit integer value type */
#define _INT32_MIN -2147483647
/** maximum allowed value in a signed 32-bit integer value type */
#define _INT32_MAX 2147483647
/** maximum allowed number of characters in a signed 32-bit integer value type including a '-' */
#define _INT32_MAX_LENGTH 11
#ifdef _WIN32
/** Use to init the clock */
#define TIMER_INIT LARGE_INTEGER frequency;LARGE_INTEGER t1, t2;double elapsedTime;QueryPerformanceFrequency(&frequency);
/** Use to start the performance timer */
#define TIMER_START QueryPerformanceCounter(&t1);
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP QueryPerformanceCounter(&t2);elapsedTime=(t2.QuadPart-t1.QuadPart)*1000.0/frequency.QuadPart;wcout<<elapsedTime<<L" ms."<<endl;
#else
/** Use to init the clock to get better precision than 15 ms on Windows */
#define TIMER_INIT timeBeginPeriod(10);
/** Use to start the performance timer */
#define TIMER_START clock_t start;double diff;start=clock();
/** Use to stop the performance timer and output the result to the standard stream. Less verbose than \c TIMER_STOP_VERBOSE */
#define TIMER_STOP diff=(clock()-start)/(double)CLOCKS_PER_SEC;wcout<<fixed<<diff<<endl;
#endif
/* Set this as large or small as you want, but has to be in the form 10^n where n >= 1, setting it smaller will
make the buffers smaller but the performance slower. If you want to set it larger than 100000 then you
must add some more cases to the switch blocks. Try to make it smaller to see the difference in
performance. It does however seem to become slower if larger than 100000 */
static const _INT32 numElem10Radix = 100000;
/** Array used for fast lookup number character lookup */
const char *numbersIn10Radix[numElem10Radix] = {};
_UINT8 numbersIn10RadixLen[numElem10Radix] = {};
/** Array used for fast lookup number character lookup */
const char *reverseNumbersIn10Radix[numElem10Radix] = {};
_UINT8 reverseNumbersIn10RadixLen[numElem10Radix] = {};
void InitBuffers()
{
char intBuffer[20];
for (int i = 0; i < numElem10Radix; i++)
{
itoa(i, intBuffer, 10);
size_t numLen = strlen(intBuffer);
char *intStr = new char[numLen + 1];
strcpy(intStr, intBuffer);
numbersIn10Radix[i] = intStr;
numbersIn10RadixLen[i] = numLen;
reverseNumbersIn10Radix[numElem10Radix - 1 - i] = intStr;
reverseNumbersIn10RadixLen[numElem10Radix - 1 - i] = numLen;
}
}
/*!
\brief Converts a 32-bit signed integer to a string
\param i [in] Integer
\par Software design pattern
Uses a single pass non-reversing algorithm with string buffers and is 10x as fast as \c itoa().
\returns Integer as a string
\copyright GNU General Public License
\copyright 1989-1992 Free Software Foundation, Inc.
\date 1989-1992, 2013
\author James Clark<jjc#jclark.com>, 1989-1992
\author Inge Eivind Henriksen, 2013
\note This file was originally a part of \a groff, and was refactored & optimized in 2013.
\relates itoa()
*/
const char *Int32ToStr(_INT32 i)
{
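/* Room for INT_DIGITS digits, - and '\0'.
   static, so that the returned pointer stays valid after the function returns (not thread-safe) */
static char buf[_INT32_MAX_LENGTH + 2];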
char *p = buf + _INT32_MAX_LENGTH + 1;
_INT32 modVal;
*--p = '\0';
if (i >= 0)
{
do
{
modVal = i % numElem10Radix;
switch(numbersIn10RadixLen[modVal])
{
case 5:
*--p = numbersIn10Radix[modVal][4];
case 4:
*--p = numbersIn10Radix[modVal][3];
case 3:
*--p = numbersIn10Radix[modVal][2];
case 2:
*--p = numbersIn10Radix[modVal][1];
default:
*--p = numbersIn10Radix[modVal][0];
}
i /= numElem10Radix;
} while (i);
}
else
{
// Negative integer
const char **reverseArray = &reverseNumbersIn10Radix[numElem10Radix - 1];
const _UINT8 *reverseArrayLen = &reverseNumbersIn10RadixLen[numElem10Radix - 1];
do
{
modVal = i % numElem10Radix;
switch(reverseArrayLen[modVal])
{
case 5:
*--p = reverseArray[modVal][4];
case 4:
*--p = reverseArray[modVal][3];
case 3:
*--p = reverseArray[modVal][2];
case 2:
*--p = reverseArray[modVal][1];
default:
*--p = reverseArray[modVal][0];
}
i /= numElem10Radix;
} while (i);
*--p = '-';
}
return p;
}
int _tmain(int argc, _TCHAR* argv[])
{
InitBuffers();
TIMER_INIT
// Make sure we are playing fair here
if (sizeof(int) != sizeof(_INT32))
{
cerr << "Error: integer size mismatch; test would be invalid." << endl;
return -1;
}
const int steps = 100;
{
char intBuffer[20];
cout << "itoa() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
itoa(i, intBuffer, 10);
TIMER_STOP;
}
{
cout << "Int32ToStr() took:" << endl;
TIMER_START;
for (int i = _INT32_MIN; i < i + steps ; i += steps)
Int32ToStr(i);
TIMER_STOP;
}
cout << "Done" << endl;
int wait;
cin >> wait;
return 0;
}
On 64-bit Windows the result from running this example is:
itoa() took:
2914.12 ms.
Int32ToStr() took:
306.637 ms.
Done
On 32-bit Windows the result from running this example is:
itoa() took:
3126.12 ms.
Int32ToStr() took:
299.387 ms.
Done
Why do you use reverse string lookup buffers?
It's possible to do this without the reverse string lookup buffers (thus saving half of the internal memory), but this makes it significantly slower (timed at about 850 ms on 64-bit and 380 ms on 32-bit systems). It's not clear to me exactly why it's so much slower, especially on 64-bit systems. To test this further yourself, simply change the following code:
#define _UINT32 unsigned _INT32
...
static const _UINT32 numElem10Radix = 100000;
...
void InitBuffers()
{
char intBuffer[20];
for (int i = 0; i < numElem10Radix; i++)
{
_itoa(i, intBuffer, 10);
size_t numLen = strlen(intBuffer);
char *intStr = new char[numLen + 1];
strcpy(intStr, intBuffer);
numbersIn10Radix[i] = intStr;
numbersIn10RadixLen[i] = numLen;
}
}
...
const char *Int32ToStr(_INT32 i)
{
static char buf[_INT32_MAX_LENGTH + 2]; // static, so the returned pointer stays valid
char *p = buf + _INT32_MAX_LENGTH + 1;
_UINT32 modVal;
*--p = '\0';
_UINT32 j = i;
do
{
modVal = j % numElem10Radix;
switch(numbersIn10RadixLen[modVal])
{
case 5:
*--p = numbersIn10Radix[modVal][4];
case 4:
*--p = numbersIn10Radix[modVal][3];
case 3:
*--p = numbersIn10Radix[modVal][2];
case 2:
*--p = numbersIn10Radix[modVal][1];
default:
*--p = numbersIn10Radix[modVal][0];
}
j /= numElem10Radix;
} while (j);
if (i < 0) *--p = '-';
return p;
}
That's part of my code in asm. It works only for the range 0-255. It could be made faster, but it shows the direction and the main idea.
4 imuls
1 memory read
1 memory write
You can try to eliminate 2 of the imuls and use leas with shifting instead. However, you won't find anything faster in C/C++/Python ;)
void itoa_asm(unsigned char inVal, char *str)
{
__asm
{
// eax=100's -> (some_integer/100) = (some_integer*41) >> 12
movzx esi,inVal
mov eax,esi
mov ecx,41
imul eax,ecx
shr eax,12
mov edx,eax
imul edx,100
mov edi,edx
// ebx=10's -> (some_integer/10) = (some_integer*205) >> 11
mov ebx,esi
sub ebx,edx
mov ecx,205
imul ebx,ecx
shr ebx,11
mov edx,ebx
imul edx,10
// ecx = 1
mov ecx,esi
sub ecx,edx // -> sub 10's
sub ecx,edi // -> sub 100's
add al,'0'
add bl,'0'
add cl,'0'
//shl eax,
shl ebx,8
shl ecx,16
or eax,ebx
or eax,ecx
mov edi,str
mov [edi],eax
}
}
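The two multiply-and-shift divisions used in the assembly can be written in plain C for comparison. This sketch assumes the same constants (41 with a shift of 12 for division by 100, 205 with a shift of 11 for division by 10); the name u8_to_dec3 is illustrative.

```c
#include <stdint.h>

/* Converts 0..255 to three zero-padded decimal digits plus a terminator. */
void u8_to_dec3(uint8_t v, char out[4])
{
    unsigned h = (v * 41u) >> 12;    /* equals v / 100 for v in [0, 255] */
    unsigned r = v - h * 100u;       /* remainder in [0, 99] */
    unsigned t = (r * 205u) >> 11;   /* equals r / 10 for r in [0, 99] */
    unsigned o = r - t * 10u;        /* ones digit */
    out[0] = (char)('0' + h);
    out[1] = (char)('0' + t);
    out[2] = (char)('0' + o);
    out[3] = '\0';
}
```

The constants are only exact on these small ranges: 41/4096 and 205/2048 are slightly larger than 1/100 and 1/10, and the accumulated error stays below one step as long as the input does not exceed 255 and 99 respectively.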
@Inge Henriksen
I believe your code has a bug:
IntToStr(2701987) == "2701987" //Correct
IntToStr(27001987) == "2701987" //Incorrect
Here's why your code is wrong:
modVal = i % numElem10Radix;
switch (reverseArrayLen[modVal])
{
case 5:
*--p = reverseArray[modVal][4];
case 4:
*--p = reverseArray[modVal][3];
case 3:
*--p = reverseArray[modVal][2];
case 2:
*--p = reverseArray[modVal][1];
default:
*--p = reverseArray[modVal][0];
}
i /= numElem10Radix;
There should be a leading 0 before "1987", which is "01987". But after the first iteration, you get 4 digits instead of 5.
So,
IntToStr(27000000) = "2700" //Incorrect
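One way to avoid the problem is to zero-pad every chunk except the most significant one. The following is a table-free sketch of that rule for unsigned values, not a patch to the code above; the name u32_to_str_chunked is illustrative.

```c
#include <stdint.h>

/* Converts val in chunks of 5 decimal digits. Every chunk except the
   most significant one is written with exactly 5 digits, so inner
   zeros such as the "0" in 27001987 are preserved.
   buf must hold at least 11 bytes; returns a pointer into buf. */
char *u32_to_str_chunked(uint32_t val, char buf[11])
{
    char *p = &buf[10];
    *p = '\0';
    for (;;) {
        uint32_t chunk = val % 100000;
        val /= 100000;
        if (val == 0) {
            /* Most significant chunk: no padding. */
            do {
                *--p = (char)('0' + chunk % 10);
                chunk /= 10;
            } while (chunk != 0);
            return p;
        }
        /* Inner chunk: always emit 5 digits, including leading zeros. */
        for (int i = 0; i < 5; i++) {
            *--p = (char)('0' + chunk % 10);
            chunk /= 10;
        }
    }
}
```

With lookup tables the same rule applies: use the stored length only for the final (most significant) chunk and a fixed width of 5 for all others.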
For unsigned 0 to 9,999,999 with terminating null. (99,999,999 without)
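#include <stdint.h>
#include <string.h>
void itoa(uint64_t u, char *out) // up to 9,999,999 with terminating zero
{
uint64_t acc = 0; // digits packed one per byte; the unused high bytes stay zero
do {
uint64_t n0 = u;
u /= 10;
acc = (acc << 8) | (n0 - u * 10 + '0'); // the most significant digit ends up in the low byte
} while (u);
memcpy(out, &acc, sizeof acc); // always writes 8 bytes; assumes a little-endian target
}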

Resources