Hi, I'm implementing some fixed-point math for embedded systems, and I'm trying to multiply two 16.16 fixed-point numbers without creating a 64-bit temporary. So far, here is the code I came up with that generates the fewest instructions:
int multiply(int x, int y) {
    int result;
    long long temp = x;
    temp *= y;
    temp >>= 16;
    result = temp;
    return result;
}
The problem with this code is that it uses a temporary 64-bit integer, which seems to generate bad assembly code. I'm trying to find an approach that uses two 32-bit integers instead of a 64-bit one. Anyone know how to do this?
Think of your numbers as each composed of two large "digits."
A.B
x C.D
The "base" of the digits is the 2^bit_width, i.e., 2^16, or 65536.
So, the product is
D*B + D*A*65536 + C*B*65536 + C*A*65536*65536
However, to get the product shifted right by 16, you need to divide all these terms by 65536, so
D*B/65536 + D*A + C*B + C*A*65536
In C:
uint32_t fixmul_u(uint32_t x, uint32_t y) {
    uint32_t a = x >> 16;    /* high half of x */
    uint32_t b = x & 0xffff; /* low half of x */
    uint32_t c = y >> 16;    /* high half of y */
    uint32_t d = y & 0xffff; /* low half of y */
    /* uint32_t avoids signed-overflow UB from uint16_t promotion to int */
    return ((d * b) >> 16) + (d * a) + (c * b) + ((c * a) << 16);
}
The signed version is a bit more complicated; it is often easiest to perform the arithmetic on the absolute values of x and y and then fix the sign (unless you overflow, which you can check for rather tediously).
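For instance, a minimal sketch of that approach, reusing the fixmul_u function above and skipping the overflow checks just mentioned (the function name and structure here are my own):

int32_t fixmul_s(int32_t x, int32_t y) {
    uint32_t ux = (x < 0) ? 0u - (uint32_t)x : (uint32_t)x; /* |x| without signed-overflow UB */
    uint32_t uy = (y < 0) ? 0u - (uint32_t)y : (uint32_t)y;
    uint32_t p = fixmul_u(ux, uy);                          /* |x*y| >> 16, truncated */
    return ((x < 0) != (y < 0)) ? -(int32_t)p : (int32_t)p; /* restore the sign */
}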
Can we use bitwise operators for conversion from decimal to bases other than 4, 8, 16, and so on?
I understand how to do that for 4, 8, 16, and so on.
But for conversion from decimal to base 3, or base 12, for example, I don't know.
Is it possible?
I assume in your question you meant conversion from binary to other bases.
All arithmetic operations can be reduced to bitwise operations and shifts. That's what the CPU is doing internally in hardware too.
a + b ==> (a ^ b) + ((a & b) << 1)
The right side still has a + in there so you have to apply the same transformation again and again till you have a left shift larger than the width of your integer type. Or do it bit by bit in a loop.
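In loop form, the same idea might look like this (a sketch of the identity above, not production code):

unsigned bitwise_add(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned carry = (a & b) << 1; /* positions that carry out */
        a ^= b;                        /* sum without the carries */
        b = carry;                     /* fold the carries back in */
    }
    return a;
}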
With two's-complement:
-a ==> ~a + 1
And if you have + and negate, you have -. * is just a bunch of shifts and adds, and / is a bunch of shifts and subtracts. Just consider how you did multiplication and long division in school and bring that down to base 2.
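Building on the adder sketch above, negation, subtraction, and schoolbook base-2 multiplication could be illustrated like this (my own sketch, not optimized):

unsigned bitwise_neg(unsigned a) { return bitwise_add(~a, 1u); } /* -a == ~a + 1 */
unsigned bitwise_sub(unsigned a, unsigned b) { return bitwise_add(a, bitwise_neg(b)); }

unsigned bitwise_mul(unsigned a, unsigned b) {
    unsigned result = 0;
    while (b != 0) {            /* add a shifted copy of a for every set bit of b */
        if (b & 1)
            result = bitwise_add(result, a);
        a <<= 1;
        b >>= 1;
    }
    return result;
}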
For most bases, doing the math with bitwise operations is insane, especially if you derive your code from the basic operations above. The CPU's add, sub and mul operations are just fine and way faster. But if you want to implement printf() for a freestanding environment (like a kernel), you might need to do a division of uint64_t / 10 that your CPU can't do in hardware. The compiler (gcc, clang) also isn't smart enough to do this well and falls back to a general iterative uint64_t / uint64_t long-division algorithm.
But a division can be done by multiplying by the inverse shifted up a few bits and then shifting the result back down. This method works out really well for a division by 10 and you get nicely optimized code:
uint64_t divu10(uint64_t n) {
    uint64_t q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    q = q + (q >> 16);
    q = q + (q >> 32);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}
That is shorter and faster, by an order of magnitude or two, than the general uint64_t / uint64_t long-division function that gcc/clang will call when you write x / 10.
Note: (((q << 2) + q) << 1) is q * 10, another shift-and-add combination that is faster than a plain q * 10 when the CPU doesn't have native 64-bit integers.
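A quick way to convince yourself divu10 behaves (a spot-check sketch of my own, not part of the original answer):

#include <assert.h>
#include <stdint.h>

void check_divu10(void) {
    const uint64_t samples[] = {
        0, 1, 9, 10, 11, 99, 100, 101,
        0xFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(divu10(samples[i]) == samples[i] / 10); /* compare against the compiler's division */
}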
So currently (I don't know if the width is relevant) I have two 128-bit integers whose low bits I want to swap, where shift is part of a structure:
unsigned char shift : 7;
So I get the relevant bits like this:
__uint128_t rawdata = ((*pfirst << shift) >> shift), rawdatasecond = ((*psecond << shift) >> shift);
And then I swap them like this:
*pfirst = ((*pfirst >> (127 - shift)) << (127 - shift)) | rawdatasecond;
*psecond = ((*psecond >> (127 - shift)) << (127 - shift)) | rawdata;
But I feel like I'm missing something, or am I? (I'm not sure how to test this, either.)
Note:
__uint128_t *pfirst, *psecond; //pointer to some integers
As I understand the task, you would like to swap bits 0, ..., n-1 for two integers. This can be done as follows:
static inline void swapbits(__uint128_t *a, __uint128_t *b, unsigned n)
{
    if (n <= 8 * sizeof(*a))
    {
        __uint128_t mask = (n < 8 * sizeof(*a)) ? (((__uint128_t)1) << n) - 1 : ~(__uint128_t)0;
        __uint128_t bits_a = *a & mask; // Get the bits from a
        __uint128_t bits_b = *b & mask; // Get the bits from b
        *a = (*a & ~mask) | bits_b;     // Set bits in a
        *b = (*b & ~mask) | bits_a;     // Set bits in b
    }
}
Testing is a matter of calling the function with different combinations of values for a, b and n and checking that the result is as expected.
Clearly it is not feasible to test all combinations so a number of representative cases must be defined. It is not an exact science, but think about middle and corner cases: n=0, n=1, n=64, n=127, n=128 and a and b having different bit patterns around the left-most and right-most positions as well as around the n'th position.
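For example, the n = 64 case might be spot-checked like this (the test values are my own choices):

#include <assert.h>
#include <stdint.h>

void test_swapbits_n64(void) {
    __uint128_t a = ((__uint128_t)0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
    __uint128_t b = ~a;
    __uint128_t a0 = a, b0 = b;

    swapbits(&a, &b, 64);

    /* low 64 bits exchanged, high 64 bits untouched */
    assert((uint64_t)a == (uint64_t)b0);
    assert((uint64_t)b == (uint64_t)a0);
    assert((a >> 64) == (a0 >> 64));
    assert((b >> 64) == (b0 >> 64));
}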
static void SwapHighNBits(__uint128_t *a, __uint128_t *b, size_t n)
{
    // Calculate number of uninvolved bits.
    size_t NU = 128 - n;
    // Calculate bitwise difference and remove uninvolved bits.
    __uint128_t d = (*a ^ *b) >> NU << NU;
    // Apply difference.
    *a ^= d;
    *b ^= d;
}
A good test suite would include all combinations of bits in the critical positions: The high bit, the lowest bit swapped, the highest bit not swapped, and the lowest bit. That is four bits in each of two operands, so eight bits total, 256 combinations for a specific value of n. Then test values of n from 2 to 126. Other bits can be filled randomly.
For n = 1, the high bit and the lowest bit swapped are identical, so make separate test code for that, or write the common test code carefully to cover that. Similarly, for n = 127, the highest bit not swapped and the lowest bit are identical, and, for n = 128, there are no bits not swapped.
If behavior is defined for n = 0 or n > 128, add test cases for those. Note that the code above does not support n = 0, as NU will be 128, and the shifts are not defined by the C standard. Of course, for n = 0, one can simply return without making any changes.
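The core of such a combinatorial test might look like this (a sketch; the names, the mask-based reference computation, and the random filler are my own):

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static __uint128_t rand128(void) {
    __uint128_t r = 0;
    for (int i = 0; i < 4; ++i)
        r = (r << 32) | (uint32_t)rand();
    return r;
}

static void test_swap_high(unsigned n) { /* for 2 <= n <= 126 */
    const __uint128_t one = 1;
    const __uint128_t crit[4] = { one << 127, one << (128 - n),   /* high bit, lowest bit swapped */
                                  one << (127 - n), one };        /* highest bit not swapped, lowest bit */
    const __uint128_t mask = ~(__uint128_t)0 << (128 - n);        /* the swapped positions */
    for (unsigned c = 0; c < 256; ++c) {
        __uint128_t a = rand128(), b = rand128();
        for (int i = 0; i < 4; ++i) {
            a = (c & (1u << i)) ? (a | crit[i]) : (a & ~crit[i]);
            b = (c & (1u << (i + 4))) ? (b | crit[i]) : (b & ~crit[i]);
        }
        __uint128_t ea = (a & ~mask) | (b & mask); /* expected a */
        __uint128_t eb = (b & ~mask) | (a & mask); /* expected b */
        SwapHighNBits(&a, &b, n);
        assert(a == ea && b == eb);
    }
}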
In Hacker's Delight there is an algorithm to calculate the double-word product of two (signed) words.
The function muldws1 uses four multiplications and five additions to calculate the double word from two words.
Towards the end of that code there is a line commented out
/* w[1] = u*v; // Alternative. */
This alternative uses five multiplications and four additions, i.e. it exchanges an addition for a multiplication.
But I think this alternative method can be improved. I have not said anything about hardware yet. Let's assume a hypothetical CPU which can calculate the lower word of the product of two words but not the upper word (e.g. for 32-bit words 32x32 to lower 32). In this case it seems to me that this algorithm can be improved. Here is what I have come up with
assuming 32-bit words (the same concept would work for 64-bit words).
void muldws1_improved(int w[], int32_t x, int32_t y) {
    uint16_t xl = x;  int16_t xh = x >> 16;
    uint16_t yl = y;  int16_t yh = y >> 16;
    uint32_t lo = x*y;
    int32_t t = xl*yh + xh*yl;
    uint16_t tl = t;  int16_t th = t >> 16;
    uint16_t loh = lo >> 16;
    int32_t cy = loh < tl; // carry
    int32_t hi = xh*yh + th + cy;
    w[0] = hi; w[1] = lo;
}
This uses four multiplications, three additions, and one comparison. This is a smaller improvement than I had hoped for.
Can this be improved? Is there a better way to determine the carry flag? I should point out that I am also assuming the hardware has no carry flag (e.g. no ADDC instruction), but that words can be compared (e.g. word1 < word2).
Edit: as Sander De Dycker pointed out, my function fails the unit tests. Here is a version which passes the unit tests, but it's less efficient. I think it can be improved.
void muldws1_improved_v2(int w[], int32_t x, int32_t y) {
    uint16_t xl = x;  int16_t xh = x >> 16;
    uint16_t yl = y;  int16_t yh = y >> 16;
    uint32_t lo = x*y;
    int32_t t2 = xl*yh;
    int32_t t3 = xh*yl;
    int32_t t4 = xh*yh;
    uint16_t t2l = t2;  int16_t t2h = t2 >> 16;
    uint16_t t3l = t3;  int16_t t3h = t3 >> 16;
    uint16_t loh = lo >> 16;
    uint16_t t = t2l + t3l;
    int32_t carry = (t < t2l) + (loh < t);
    int32_t hi = t4 + t2h + t3h + carry;
    w[0] = hi; w[1] = lo;
}
This uses four multiplications, five additions, and two comparisons, which is worse than the original function.
There were two problems with my muldws1_improved function in my question. One of them is that it missed a carry when I did xl*yh + xh*yl. This is why it failed the unit tests. But the other is that there are signed*unsigned products, which require a lot more machine logic than is seen in the C code (see my edit below). I found a better solution, which is to optimize the unsigned product function muldwu1 first and then do
muldwu1(w,x,y);
w[0] -= ((x<0) ? y : 0) + ((y<0) ? x : 0);
to correct for the sign.
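Put together, the wrapper might look like this (a sketch; muldwu1_improved is the unsigned routine below, and the name and types are my choices):

void muldws_via_unsigned(int32_t w[], int32_t x, int32_t y) {
    uint32_t uw[2];
    muldwu1_improved(uw, (uint32_t)x, (uint32_t)y); /* unsigned 32x32 -> 64 */
    /* reading x and y as unsigned added (x<0 ? y : 0) + (y<0 ? x : 0)
       times 2^32 to the product, so subtract that from the high word */
    w[0] = (int32_t)uw[0] - (((x < 0) ? y : 0) + ((y < 0) ? x : 0));
    w[1] = (int32_t)uw[1];
}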
Here is my attempt at improving muldwu1 using the lower word lo = x*y (yes, this function passes the unit tests from Hacker's Delight).
void muldwu1_improved(uint32_t w[], uint32_t x, uint32_t y) {
    uint16_t xl = x;  uint16_t xh = x >> 16;
    uint16_t yl = y;  uint16_t yh = y >> 16;
    uint32_t lo = x*y;             // 32x32 to 32
    uint32_t t1 = (uint32_t)xl*yh; // 16x16 to 32 (cast avoids promotion to signed int)
    uint32_t t2 = (uint32_t)xh*yl; // 16x16 to 32
    uint32_t t3 = (uint32_t)xh*yh; // 16x16 to 32
    uint32_t t = t1 + t2;
    uint32_t tl = 0xFFFF & t;
    uint32_t th = t >> 16;
    uint32_t loh = lo >> 16;
    uint32_t cy = ((t < t1) << 16) + (loh < tl); // carry
    w[1] = lo;
    w[0] = t3 + th + cy;
}
This uses one fewer addition than the original function from Hacker's Delight, but it has to do two comparisons:

1 mul32x32 to 32
3 mul16x16 to 32
4 add32
5 shift logical (or shuffles)
1 and
2 compare32
-----------
16 operations
Edit:
I was bothered by a statement in Hacker's Delight (2nd Edition) which says, with regard to the mulhs and mulhu algorithms:
The algorithm requires 16 basic RISC instructions in either the signed or unsigned version, four of which are multiplications.
I implemented the unsigned algorithm in only 16 SSE instructions but my signed version required more instructions. I figured out why and I can now answer my own question.
The reason I failed to find a better version than the one in Hacker's Delight is that their hypothetical RISC processor has an instruction which calculates the lower word of the product of two words. In other words, their algorithm is already optimized for this case, and so it's unlikely there is a better version than the one they already have.
The reason they list an alternative is because they assume multiplication (and division) may be more expensive than other instructions and so they left the alternative as a case to optimize on.
So the C code does not hide significant machine logic. It assumes the machine can do word * word to lower word.
Why does this matter? In their algorithm they do first
u0 = u >> 16;
and later
t = u0*v1 + k;
If u = 0x80000000, then u0 = 0xffff8000. However, if your CPU can only take half-word products to get a full word, the upper half word of u0 is ignored and you get the wrong signed result.
In this case you should calculate the unsigned upper word and then correct using hi -= ((x<0) ? y : 0) + ((y<0) ? x : 0); as I already stated.
The reason I am interested in this is that Intel's SIMD instruction sets (SSE2 through AVX2) do not have an instruction which does 64x64 to 64; they only have 32x32 to 64. That's why my signed version requires more instructions.
But AVX512 has a 64x64 to 64 instruction. Therefore with AVX512 the signed version should take the same number of instructions as the unsigned. However, since the 64x64 to 64 instruction may be much slower than the 32x32 to 64 instruction it may make more sense to do the unsigned version anyway and then correct.
The Problem: Exercise 2-8 of The C Programming Language, "Write a function rightrot(x,n) that returns the value of the integer x, rotated to the right by n positions."
I have done this every way that I know how. Here is the issue that I am having. Take a given number for this exercise, say 29, and rotate it right one position.
11101 becomes 11110, or 30. Let's say for the sake of argument that the system we are working on has an unsigned integer type 32 bits in size. Let's further say that we have the number 29 stored in an unsigned integer variable. In memory, the number will have 27 zeros ahead of it. So when we rotate 29 right by one, using one of several algorithms (mine is posted below), we get the number 2147483662. This is obviously not the desired result.
unsigned int rightrot(unsigned x, int n) {
    return (x >> n) | (x << ((sizeof(x) * CHAR_BIT) - n));
}
Technically, this is correct, but I was thinking that the 27 zeros that are in front of 11101 were insignificant. I have also tried a couple of other solutions:
int wordsize(void) { // compute the word size on a given machine...
    unsigned x = ~0;
    int b;
    for (b = 0; x; b++)
        x &= x - 1;
    return b;
}
unsigned int rightrot(unsigned x, int n) {
    unsigned rbit;
    while (n--) {
        rbit = x >> 1;
        x |= (rbit << (wordsize() - 1));
    }
    return x;
}
This last and final solution is the one where I thought that I had it, I will explain where it failed once I get to the end. I am sure that you will see my mistake...
int bitcount(unsigned x) {
    int b;
    for (b = 0; x; b++)
        x &= x - 1;
    return b;
}
unsigned int rightrot(unsigned x, int n) {
    unsigned rbit;
    int shift = bitcount(x);
    while (n--) {
        rbit = x & 1;
        x >>= 1;
        x |= (rbit << shift);
    }
    return x;
}
This solution gives the expected answer of 30 that I was looking for, but if you use a number for x like, oh say, 31 (11111), then there are issues: specifically, the outcome is 47 when using one for n. I did not think of this earlier, but if a number like 8 (1000) is used, then mayhem: there is only one set bit in 8, so the shift is most certainly going to be wrong. My theory at this point is that the first two solutions are correct (mostly) and I am just missing something...
A bitwise rotation is always necessarily within an integer of a given width. In this case, as you're assuming a 32-bit integer, 2147483662 (0b10000000000000000000000000001110) is indeed the correct answer; you aren't doing anything wrong!
0b11110 would not be considered the correct result by any reasonable definition, as continuing to rotate it right using the same definition would never give you back the original input. (Consider that another right rotation would give 0b1111, and continuing to rotate that would have no effect.)
In my opinion, the spirit of the section of the book which immediately precedes this exercise would have the reader do this problem without knowing anything about the size (in bits) of integers, or any other type. The examples in the section do not require that information; I don't believe the exercises should either.
Regardless of my belief, the book had not yet introduced the sizeof operator by section 2.9, so the only way to figure the size of a type is to count the bits "by hand".
But we don't need to bother with all that. We can do bit rotation in n steps, regardless of how many bits there are in the data type, by rotating one bit at a time.
Using only the parts of the language that are covered by the book up to section 2.9, here's my implementation (with integer parameters, returning an integer, as specified by the exercise). The idea: loop n times, shifting x right by one each iteration; if the old low bit of x was 1, set the new high bit.
int rightrot(int x, int n) {
    int lowbit;
    while (n-- > 0) {
        lowbit = x & 1;            /* save low bit */
        x = (x >> 1) & (~0u >> 1); /* shift right by one, and clear the high bit (in case of sign extension) */
        if (lowbit)
            x = x | ~(~0u >> 1);   /* set the high bit if the low bit was set */
    }
    return x;
}
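A quick usage check, tying back to the 29 example from the question (my own illustration, assuming 32-bit int):

#include <stdio.h>

int main(void) {
    printf("%u\n", (unsigned)rightrot(29, 1));               /* 2147483662 = 0x8000000E */
    printf("%u\n", (unsigned)rightrot(rightrot(29, 1), 31)); /* 31 more rotations: back to 29 */
    return 0;
}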
You could find the location of the first '1' in the 32-bit value using binary search. Then note the bit in the LSB location, right shift the value by the required number of places, and put the LSB bit in the location of the first '1'.
int bitcount(unsigned x) {
    int b;
    for (b = 0; x; b++)
        x &= x - 1;
    return b;
}

unsigned rightrot(unsigned x, int n) {
    int b = bitcount(x);
    unsigned a = (x & ~(~0 << n)) << (b - n + 1);
    x >>= n;
    x |= a;
    return x;
}
How do I find the number of trailing 0s in a binary number? Based on the K&R bitcount example for finding 1s in a binary number, I modified it a bit to find the trailing 0s.
int bitcount(unsigned x)
{
    int b;
    for (b = 0; x != 0; x >>= 1)
    {
        if (x & 01)
            break;
        else
            b++;
    }
    return b;
}
I would like a review of this method.
Here's a way to compute the count in parallel for better efficiency:
unsigned int v; // 32-bit word input to count zero bits on right
unsigned int c = 32; // c will be the number of zero bits on the right
v &= -v; // isolate the lowest set bit (unsigned negation is well-defined in C)
if (v) c--;
if (v & 0x0000FFFF) c -= 16;
if (v & 0x00FF00FF) c -= 8;
if (v & 0x0F0F0F0F) c -= 4;
if (v & 0x33333333) c -= 2;
if (v & 0x55555555) c -= 1;
On GCC on the x86 platform you can use __builtin_ctz(x).
On Microsoft compilers for x86 you can use _BitScanForward.
They both emit a bsf instruction.
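A portable wrapper over the two intrinsics might look like this (a sketch; the fallback value of 32 for a zero input is my choice, since bsf leaves the result undefined for 0):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static inline int ctz32(uint32_t v) {
#if defined(__GNUC__) || defined(__clang__)
    return v ? __builtin_ctz(v) : 32; /* __builtin_ctz is undefined for 0 */
#elif defined(_MSC_VER)
    unsigned long idx;
    return _BitScanForward(&idx, v) ? (int)idx : 32;
#else
    int c = 0;
    while (v && !(v & 1)) { v >>= 1; c++; } /* plain loop fallback */
    return v ? c : 32;
#endif
}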
Another approach (I'm surprised it's not mentioned here) would be to build a table of 256 integers, where each element in the array gives the position of the lowest 1 bit for that index. Then, for each byte in the integer, you look up in the table.
Something like this (I haven't taken any time to tweak this, this is just to roughly illustrate the idea):
int bitcount(unsigned x)
{
    static const unsigned char table[256] = { 0 /* TODO: populate with constants */ };
    for (int i = 0; i < (int)sizeof(x); ++i, x >>= 8)
    {
        unsigned char r = table[x & 0xff];
        if (r)
            return r + i*8; // Found a 1...
    }
    // All zeroes...
    return sizeof(x)*8;
}
The idea with some of the table-driven approaches to a problem like this is that if statements cost you something in terms of branch prediction, so you should aim to reduce them. It also reduces the number of bit shifts. Your approach does an if statement and a shift per bit, and this one does one per byte. (Hopefully the optimizer can unroll the for loop, and not issue a compare/jump for that.) Some of the other answers have even fewer if statements than this, but a table approach is simple and easy to understand. Of course you should be guided by actual measurements to see if any of this matters.
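For what it's worth, one way to fill in that TODO, built at startup rather than written out as constants, under the convention the if (r) test implies (table[i] holds the 1-based position of the lowest set bit of i, with table[0] == 0; subtract one if you want the raw trailing-zero count):

static unsigned char table[256];

static void init_table(void) {
    for (int i = 1; i < 256; ++i) {
        int pos = 1, v = i;
        while (!(v & 1)) { /* walk down to the lowest set bit */
            v >>= 1;
            ++pos;
        }
        table[i] = (unsigned char)pos;
    }
    /* table[0] stays 0, meaning "no set bit in this byte" */
}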
I think your method is working (although you might want to use unsigned int). You check the last digit each time, and if it's zero, you discard it and increment the number of trailing zero-bits.
I think for trailing zeroes you don't need a loop.
Consider the following:
What happens with the number (in binary representation, of course) if you subtract 1? Which digits change, which stay the same?
How could you combine the original number and the decremented version such that only bits representing trailing zeroes are left?
If you apply the above steps correctly, you can then find the highest set bit in O(lg n) steps; a sketch of the whole approach follows.
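Spelling the hints out (a sketch of mine, with a simple loop standing in for the O(lg n) highest-bit search):

int trailing_zeros(unsigned x) {
    if (x == 0)
        return 8 * sizeof x;            /* every bit is a trailing zero */
    unsigned ones = (x ^ (x - 1)) >> 1; /* 1s exactly where x had trailing zeros */
    int n = 0;
    while (ones) {                      /* position of highest set bit == count of 1s here */
        ones >>= 1;
        n++;
    }
    return n;
}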
Should be:
int bitcount(unsigned char x)
{
    int b;
    for (b = 0; b < 7; x >>= 1)
    {
        if (x & 1)
            break;
        else
            b++;
    }
    return b;
}
or even
int bitcount(unsigned char x)
{
    int b;
    for (b = 0; b < 7 && !(x & 1); x >>= 1)
        b++;
    return b;
}
or even (yay!)
int bitcount(unsigned char x)
{
    int b;
    for (b = 0; b < 7 && !(x & 1); b++)
        x >>= 1;
    return b;
}
or ...
Ah, whatever, there are 100500 million methods of doing this. Use whatever you need or like.
We can easily get it using bit operations; we don't need to go through all the bits. Pseudocode:
int bitcount(unsigned x) {
    int xor = x ^ (x - 1); // this will have (1 + #trailing 0s) trailing 1s
    return log2(x & xor);  // x & xor has exactly one bit set; its log2 gives the number of trailing zeroes
}
int countTrailZero(unsigned x) {
    if (x == 0) return DEFAULT_VALUE_YOU_NEED;
    return log2(x & -x); // log2 is from <math.h>; x & -x isolates the lowest set bit
}
Explanation:
x & -x isolates the rightmost bit that is set to 1.
e.g. 6 -> "0000,0110", (6 & -6) -> "0000,0010"
You can derive this from two's complement:
x = "a1b", where b represents all trailing zeros.
then
-x = !(x) + 1 = !(a1b) + 1 = (!a)0(!b) + 1 = (!a)0(1...1) + 1 = (!a)1(0...0) = (!a)1b
so
x & (-x) = (a1b) & (!a)1b = (0...0)1(0...0)
You can then get the number of trailing zeros just by taking log2.