Picking good first estimates for Goldschmidt division - c

I'm calculating fixed-point reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.
This is done by just setting the numerator to 1, i.e. the numerator becomes the scalar on the first iteration. To be honest, I'm kind of following the Wikipedia algorithm blindly here. The article says that if the denominator is scaled into the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: let F be the estimated scalar and D be the denominator, then F = 2 - D.
But when doing this, I lose a lot of precision. Say I want to find the reciprocal of 512.00002f. In order to scale the number down, I lose 10 bits of precision in the fraction part, which are shifted out. So, my questions are:
Is there a way to pick a better estimate which does not require normalization? Why? Why not? A mathematical proof of why this is or is not possible would be great.
Also, is it possible to pre-calculate the first estimates so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about ~50 cycles worst case, and that's not taking emulation of clz/bsr into account, nor memory lookups. If it's possible, I'd like to know if doing so increases the error, and by how much.
Here is my testcase. Note: The software implementation of clz on line 13 is from my post here. You can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.
#include <stdio.h>
#include <stdint.h>
const unsigned int BASE = 22ULL;
static unsigned int divfp(unsigned int val, int* iter)
{
/* Numerator, denominator, estimate scalar and previous denominator */
unsigned long long N,D,F, DPREV;
int bitpos;
*iter = 1;
D = val;
/* Get the shift amount + is right-shift, - is left-shift. */
bitpos = 31 - clz(val) - BASE;
/* Normalize into the half-range (0.5, 1.0] */
if(0 < bitpos)
D >>= bitpos;
else
D <<= (-bitpos);
/* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
/* F = 2 - D */
F = (2ULL<<BASE) - D;
/* N = F for the first iteration, because the numerator is simply 1.
So don't waste a 64-bit UMULL on a multiply with 1 */
N = F;
D = ((unsigned long long)D*F)>>BASE;
while(1){
DPREV = D;
F = (2<<(BASE)) - D;
D = ((unsigned long long)D*F)>>BASE;
/* Bail when we get the same value for two denominators in a row.
This means that the error is too small to make any further progress. */
if(D == DPREV)
break;
N = ((unsigned long long)N*F)>>BASE;
*iter = *iter + 1;
}
if(0 < bitpos)
N >>= bitpos;
else
N <<= (-bitpos);
return N;
}
int main(int argc, char* argv[])
{
double fv, fa;
int iter;
unsigned int D, result;
sscanf(argv[1], "%lf", &fv);
D = fv*(double)(1<<BASE);
result = divfp(D, &iter);
fa = (double)result / (double)(1UL << BASE);
printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
printf("iteration: %d\n",iter);
return 0;
}

I could not resist spending an hour on your problem...
This algorithm is described in section 5.5.2 of "Arithmetique des ordinateurs" by Jean-Michel Muller (in French). It is actually a special case of Newton iterations with 1 as the starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized in the range [1/2,1[:
e = 1 - D
Q = N
repeat K times:
Q = Q * (1+e)
e = e*e
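The number of correct bits doubles at each iteration: after K iterations Q = (N/D) * (1 - e0^(2^K)), where e0 = 1 - D is the initial value of e. For 32 bits, 4 iterations are enough when e0 <= 1/4; in the worst case e0 = 1/2 (D = 1/2), 4 iterations give about 16 bits and a fifth is needed. You can also iterate until e becomes too small to modify Q.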
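As a rough check of the convergence behaviour, here is a minimal floating-point sketch of the loop above. The function name goldschmidt_div and the use of double are illustrative assumptions and not part of the book's formulation; the caller is assumed to pass d already normalized into [0.5, 1).

```c
/* Hypothetical helper: approximates n/d for d in [0.5, 1) using k
   iterations of the Q = Q*(1+e), e = e*e loop. */
double goldschmidt_div(double n, double d, int k)
{
    double e = 1.0 - d;   /* initial error term, 0 < e <= 0.5 */
    double q = n;
    for (int i = 0; i < k; i++) {
        q *= 1.0 + e;     /* fold the current correction into the quotient */
        e *= e;           /* the error term squares on every pass */
    }
    return q;
}
```

With d = 0.5, goldschmidt_div(1.0, 0.5, 4) returns 2 - 2^-15 and a fifth iteration returns 2 - 2^-31, which shows how the remaining error depends on the starting value of e.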
Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.
Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).
This simplified algorithm may be implemented for your Q22.10 representation as follows:
// Fixed point inversion
// EB Apr 2010
#include <math.h>
#include <stdio.h>
// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;
// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }
// Return inverse of FP
uint32 inverse(uint32 fp)
{
if (fp == 0) return (uint32)-1; // invalid
// Shift FP to have the most significant bit set
int shl = 0; // normalization shift
uint32 nfp = fp; // normalized FP
while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
uint64 q = 0x100000000ULL; // 2^32
uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
int i;
for (i=0;i<4;i++) // iterate
{
// Both multiplications are actually
// 32x32 bits truncated to the 32 high bits
q += (q*e)>>(uint64)32;
e = (e*e)>>(uint64)32;
printf("Q=0x%llx E=0x%llx\n",q,e);
}
// Here, (Q/2^32) is the inverse of (NFP/2^32).
// We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
return (uint32)(q>>(64-2*BASE-shl));
}
int main()
{
double x = 1.234567;
uint32 xx = toFP(x);
uint32 yy = inverse(xx);
double y = toDouble(yy);
printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}
As noted in the code, the multiplications are not full 32x32->64 bits. E becomes smaller and smaller and fits in 32 bits from the start. Q always fits in 34 bits. We take only the high 32 bits of the products.
The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).
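For readers who want to check the shift amount, one possible derivation is sketched here. It uses the names from the code above, writes X for the real value represented by fp, and ignores truncation error:

```latex
\begin{align*}
\mathrm{fp} &= X \cdot 2^{\mathrm{BASE}}, \qquad
\mathrm{nfp} = \mathrm{fp} \cdot 2^{\mathrm{shl}} \\
q &\approx \frac{2^{64}}{\mathrm{nfp}}
   = \frac{2^{\,64 - \mathrm{BASE} - \mathrm{shl}}}{X} \\
\text{result} &= \frac{1}{X} \cdot 2^{\mathrm{BASE}}
   \approx q \cdot 2^{-(64 - 2\,\mathrm{BASE} - \mathrm{shl})}
\end{align*}
```

The last line is the right shift by 64-2*BASE-shl applied to q in the return statement.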
EDIT. As a follow-up to my comment, here is a second version with an implicit 32nd bit on Q. Both E and Q are now stored in 32 bits:
uint32 inverse2(uint32 fp)
{
if (fp == 0) return (uint32)-1; // invalid
// Shift FP to have the most significant bit set
int shl = 0; // normalization shift for FP
uint32 nfp = fp; // normalized FP
while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
int shr = 64-2*BASE-shl; // normalization shift for Q
if (shr <= 0) return (uint32)-1; // overflow
uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
uint64 q = e; // 2^32 implicit bit, and implicit first iteration
int i;
for (i=0;i<3;i++) // iterate
{
e = (e*e)>>(uint64)32;
q += e + ((q*e)>>(uint64)32);
}
return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
}

A couple of ideas for you, though none that solve your problem directly as stated.
Why this algo for division? Most divides I've seen on ARM use some variant of
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
repeated n bits times with a binary search off of the clz to determine where to start. That's pretty dang fast.
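For readers who do not read ARM assembly, the general idea of producing one quotient bit per step can be sketched in C. This is an illustrative restoring division under my own naming (udiv_shift_subtract is hypothetical), not a translation of the instruction sequence above, and it always runs all 32 steps instead of using clz to skip leading zeros.

```c
#include <stdint.h>

/* Restoring (shift-and-subtract) division, one quotient bit per step.
   Requires den != 0. rem_out may be a null pointer if the remainder
   is not needed. */
uint32_t udiv_shift_subtract(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    uint32_t quo = 0;
    uint64_t rem = 0;                 /* 64-bit so rem << 1 cannot overflow */
    for (int i = 31; i >= 0; i--) {
        rem = (rem << 1) | ((num >> i) & 1u);   /* bring down the next bit */
        quo <<= 1;
        if (rem >= den) {             /* divisor fits: subtract, set the bit */
            rem -= den;
            quo |= 1u;
        }
    }
    if (rem_out)
        *rem_out = (uint32_t)rem;
    return quo;
}
```

Starting the loop at the highest set bit of num (found with clz) would skip the iterations that can only produce zero quotient bits.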
If precision is a big problem, you are not limited to 32/64 bits for your fixed point representation. It'll be a bit slower, but you can do add/adc or sub/sbc to move values across registers. mul/mla are also designed for this kind of work.
Again, not direct answers for you, but possibly a few ideas to go forward with this. Seeing the actual ARM code would probably help me a bit as well.

Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating point number by 10. Mantissa remains the same. Of course unless the exponent hits its minimum value but that shouldn't happen since you're scaling to (0.5, 1].
EDIT: OK, so you're using fixed point. In that case you should allow a different representation of the denominator in your algorithm. The value of D is in (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2-x) < 1 for x < 1, since x * (2-x) = 1 - (1-x)^2). So you should represent the denominator with the binary point at base = 32. This way you will have 32 bits of precision all the time.
EDIT: To implement this you'll have to change the following lines of your code:
//bitpos = 31 - clz(val) - BASE;
bitpos = 31 - clz(val) - 31;
...
//F = (2ULL<<BASE) - D;
//N = F;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
N = F >> (31 - BASE);
D = ((unsigned long long)D*F)>>31;
...
//F = (2<<(BASE)) - D;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
D = ((unsigned long long)D*F)>>31;
...
//N = ((unsigned long long)N*F)>>BASE;
N = ((unsigned long long)N*F)>>31;
Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).

Related

function to convert float to int (huge integers)

This is a university question. Just to make sure :-) We need to implement (float)x
I have the following code which must convert integer x to its floating point binary representation stored in an unsigned integer.
unsigned float_i2f(int x) {
if (!x) return x;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 0;
int t = a;
while(t != 1) {
/* divide by two until t is 0*/
t >>= 1;
e++;
};
/* calculate mantissa */
int m = a << (32 - e);
/* logical right shift */
m = (m >> 9) & ~(((0x1 << 31) >> 9 << 1));
/* add bias for 32bit float */
e += 127;
int res = sign << 31;
res |= (e << 23);
res |= m;
/* lots of printf */
return res;
}
One problem I encounter now is that when my integers are too big then my code fails. I have this control procedure implemented:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
This of course always produces the correct output.
Now when I do some test runs, these are my outputs (GOAL is what it needs to be, result is what I got):
:!make && ./btest -f float_i2f -1 0x80004999
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147464807] 10000000000000000100100110011001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111101101100
result: 11001110111111111111111101101100
GOAL: 11001110111111111111111101101101
So in this case, a 1 is added as the LSB.
Next case:
:!make && ./btest -f float_i2f -1 0x80000001
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147483647] 10000000000000000000000000000001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111111111111
result: 11001110111111111111111111111111
GOAL: 11001111000000000000000000000000
Here 1 is added to the exponent while the mantissa is the complement of it.
I spent hours trying to look it up on the internet and in my books, but I can't find any references to this problem. I guess it has something to do with the fact that the mantissa is only 23 bits. But how do I have to handle it then?
EDIT: THIS PART IS OBSOLETE THANKS TO THE COMMENTS BELOW. int l must be unsigned l.
int x = 2147483647;
float f = (float)x;
int l = f;
printf("l: %d\n", l);
then l becomes -2147483648.
How can this happen? So C is doing the casting wrong?
Hope someone can help me here!
Thx
Markus
EDIT 2:
My updated code is now this:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
I also figured out that the code works for all integers in the range [-2^24, 2^24]. Everything above/below sometimes works but mostly doesn't.
Something is missing, but I really have no idea what. Can anyone help me?
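One likely explanation, offered as a sketch and not as the graded solution: both failing cases above are values where bits fall off the end of the 24-bit significand, and (float)x rounds those to nearest (ties to even) while the code truncates. The function below is a hypothetical variant (float_i2f_rounded is my own name) that assumes IEEE 754 single precision and the default rounding mode.

```c
#include <stdint.h>

/* Hypothetical variant of float_i2f that rounds to nearest, ties to even.
   Returns the IEEE 754 single-precision bit pattern of (float)x. */
uint32_t float_i2f_rounded(int32_t x)
{
    if (x == 0)
        return 0;
    uint32_t sign = (x < 0) ? 0x80000000u : 0u;
    /* magnitude as unsigned; 0u - (uint32_t)x also handles INT32_MIN */
    uint32_t a = (x < 0) ? 0u - (uint32_t)x : (uint32_t)x;
    uint32_t e = 158;                      /* bias 127 + 31 */
    while ((a & 0x80000000u) == 0) {       /* normalize: leading 1 to bit 31 */
        a <<= 1;
        e--;
    }
    uint32_t m = a >> 8;                   /* 24 bits, including the hidden 1 */
    uint32_t dropped = a & 0xFFu;          /* the 8 bits that do not fit */
    if (dropped > 0x80u || (dropped == 0x80u && (m & 1u)))
        m++;                               /* round up, ties go to even */
    if (m >> 24)                           /* rounding carried out of 24 bits */
        e++;                               /* the fraction bits are now all zero */
    return sign | (e << 23) | (m & 0x7FFFFFu);
}
```

For the two inputs from the test runs, -2147464807 and -2147483647, this produces 0xCEFFFF6D and 0xCF000000, which are the GOAL bit patterns shown above.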
The value printed is consistent with the underlying representation of the numbers being cast. Once you understand the binary representation of the number, the result is not surprising.
An implicit conversion is associated with the assignment operator (see C99 Standard 6.5.16). The C99 Standard goes on to say:
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
Your earlier example illustrates undefined behavior due to converting a value outside the range of the destination type: (float)2147483647 rounds to 2147483648.0, which does not fit in an int. Declaring l as unsigned avoids this because 2147483648 is representable in an unsigned int.
The asserts in the following snippet ought to prevent any undefined behavior from occurring.
#include <assert.h>
#include <limits.h>
#include <math.h>
unsigned int convertFloatingPoint(double v) {
double d;
assert(isfinite(v));
d = trunc(v);
assert((d>=0.0) && (d<=(double)UINT_MAX));
return (unsigned int)d;
}
Another way of doing the same thing is to create a union containing a 32-bit integer and a float. The int and float are now just different ways of looking at the same bits of memory:
union {
int myInt;
float myFloat;
} my_union;
my_union.myInt = 0xBFFFF2E5;
printf("float is %f\n", my_union.myFloat);
float is -1.999600
You are telling the compiler to take the number you have (large integer) and make it into a float, not to interpret the number AS float. To do that, you need to tell the compiler to read the number from that address in a different form, so this:
myFloat = *(float *)&myInt ;
That means, if we take it apart, starting from the right:
&myInt - the location in memory that holds your integer.
(float *) - really, I want the compiler use this as a pointer to float, not whatever the compiler thinks it may be.
* - read from the address of whatever is to the right.
myFloat = - set this variable to whatever is to the right.
So, you are telling the compiler: In the location of (myInt), there is a floating point number, now put that float into myFloat.
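The pointer cast above is commonly flagged as a strict-aliasing violation, so compilers are allowed to optimize it in surprising ways. A more conservative sketch uses memcpy, as the control code in the question already does. The function names are my own, and the code assumes float is 32 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float by copying the bytes.
   Assumes sizeof(float) == sizeof(uint32_t). */
float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* The opposite direction: expose the bit pattern of a float. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Compilers typically turn these fixed-size memcpy calls into a single register move, so the safer form usually costs nothing.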

Full variation of Random numbers in C

I am trying to generate 64-bit random numbers using the following code. I want the numbers in binary, but the problem is that I can't get all the bits to vary. I want the numbers to vary as much as possible.
void PrintDoubleAsCBytes(double d, FILE* f)
{
f = fopen("tb.txt","a");
unsigned char a[sizeof(d)];
unsigned i;
memcpy(a, &d, sizeof(d));
for (i = 0; i < sizeof(a); i++){
fprintf(f, "%0*X", (CHAR_BIT + 3) / 4, a[sizeof(d)-1-i]);
}
fprintf(f,"\n");
fclose(f); /*done!*/
}
int main (int argc, char *argv)
{
int limit = 100 ;
double a, b;
double result;
int i ;
printf("limit = %d", limit );
for (i= 0 ; i< limit;i++)
{
a= rand();
b= rand();
result = a * b;
printf ("A= %f B = %f\n",a,b);
printf ("result= %f\n",result);
PrintDoubleAsCBytes(a, stdout); puts("");
PrintDoubleAsCBytes(b, stdout); puts("");
PrintDoubleAsCBytes(result, stdout); puts("");
}
}
OUTPUT FILE
41DAE2D159C00000 //Last bits remain zero, I want them to change as well as in case of the result
41C93D91E3000000
43B534EE7FAEB1C3
41D90F261A400000
41D98CD21CC00000
43C4021C95228080
41DD2C3714400000
41B9495CFF000000
43A70D6CAD0EE321
How do I achieve this? I do not have much experience in software coding.
In Java it is very easy:
Random rng = new Random(); //do this only once
long randLong = rng.nextLong();
double randDoubleFromBits = Double.longBitsToDouble(randLong);
In C I only know of a hack way to do it :)
Since RAND_MAX can be as low as 2^15-1 but is implementation defined, maybe you can get 64 random bits out of rand() by doing masks and bitshifts:
//seed program once at the start
srand(time(NULL));
uint64_t a = rand()&0x7FFF;
uint64_t b = rand()&0x7FFF;
uint64_t c = rand()&0x7FFF;
uint64_t d = rand()&0x7FFF;
uint64_t e = rand()&0x7FFF;
uint64_t random = (a<<60)+(b<<45)+(c<<30)+(d<<15)+e;
Then stuff it in a union and use the other member of the union to interpret its bits as a double. Something like
union
{
double d;
uint64_t l; /* long may be only 32 bits wide, so use a fixed 64-bit type */
} doubleOrLong;
doubleOrLong.l = random;
double randomDouble = doubleOrLong.d;
(I haven't tested this code)
EDIT: Explanation of how it should work
First, srand(time(NULL)); seeds rand with the current timestamp. So you only need to do this once at the start, and if you want to reproduce an earlier RNG series you can reuse that seed if you like.
rand() returns a random, unbiased integer between 0 and RAND_MAX inclusive. RAND_MAX is guaranteed to be at least 2^15-1, which is 0x7FFF. To write the program such that it doesn't matter what RAND_MAX is (for example, it could be 2^16-1, 2^31-1, 2^32-1...), we mask out all but the bottom 15 bits - 0x7FFF is 0111 1111 1111 1111 in binary, or the bottom 15 bits.
Now we have to pack all of our 15-bit random values into 64 bits. The bitshift operator, <<, shifts the bits of the left operand to the left by (right operand) positions. So the final uint64_t we call random has random bits derived from the other variables like so:
aaaa bbbb bbbb bbbb bbbc cccc cccc cccc ccdd dddd dddd dddd deee eeee eeee eeee
But this is still being treated as a uint64_t, not as a double. If you put this uint64_t in a union and then read the union's double member, those same bits are reinterpreted as a double made up of random bits. C permits reading a union member other than the one last written (C++ does not), but some bit patterns are NaNs or infinities, so you should make sure it works the way you expect on your compiler of choice.
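The packing and reinterpretation steps described above can be sketched as follows. The helper names pack15 and random_bits_as_double are hypothetical, and memcpy is used in place of the union purely as an alternative that behaves the same way in C and C++. The code assumes double is 64 bits wide.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Pack five values into 64 bits: the low 4 bits of a, then 15 bits each
   of b, c, d and e, matching the bit layout drawn above. */
uint64_t pack15(uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
    return (a << 60)
         | ((b & 0x7FFF) << 45)
         | ((c & 0x7FFF) << 30)
         | ((d & 0x7FFF) << 15)
         |  (e & 0x7FFF);
}

/* Draw five rand() values and reinterpret the packed bits as a double.
   The result can be any bit pattern, including NaN or infinity. */
double random_bits_as_double(void)
{
    uint64_t bits = pack15(rand(), rand(), rand(), rand(), rand());
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Call srand once at program start, as in the answer above, before the first call to random_bits_as_double.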
Depending on your platform, but assuming IEEE 754, e.g. Wikipedia, why not explicitly handle the internal double format?
(Barring mistakes), this generates random but valid doubles.
[ Haven't quite covered all bases here, e.g. case where exp = 0 or 0x7ff ]
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
double randomDouble(void)
{
    uint64_t buf = 0ull;
    // sign bit
    if (rand() % 2)
        buf = 1ull << 63;
    // exponent
    int exponentLength = 11;
    uint64_t exponentMask = (1ull << exponentLength) - 1;
    int exponentLocation = 63 - exponentLength;
    uint64_t exponent = (uint64_t)rand() & exponentMask;
    buf += exponent << exponentLocation;
    // fraction
    int fractionLength = exponentLocation;
    // 64-bit mask: 1 << 52 would overflow an int
    uint64_t fractionMask = (1ull << fractionLength) - 1;
    // Courtesy of Patashu
    uint64_t a = rand() & 0x7FFF;
    uint64_t b = rand() & 0x7FFF;
    uint64_t c = rand() & 0x7FFF;
    uint64_t d = rand() & 0x7FFF;
    uint64_t fraction = (a << 45) + (b << 30) + (c << 15) + d;
    fraction = fraction & fractionMask;
    buf += fraction;
    // copy the bits into a double (memcpy replaces the C++-only reinterpret_cast)
    double res;
    memcpy(&res, &buf, sizeof res);
    return res;
}
You could use this:
void GenerateRandomDouble(double* d)
{
unsigned char* p = (unsigned char*)d;
unsigned i;
for (i = 0; i < sizeof(*d); i++) /* sizeof(*d), not sizeof(d): d is a pointer */
p[i] = rand();
}
The problem with this method is that your C program may be unable to use some of the values returned by this function, because they're invalid or special floating point values.
But if you're testing your hardware, you could generate random bytes and feed them directly into said hardware without first converting them into a double.
The only place where you need to treat these random bytes as a double is the point of validation of the results returned by the hardware.
At that point you need to look at the bytes and see if they represent a valid value. If they do, you can memcpy() the bytes into a double and use it.
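The "look at the bytes" step can be sketched like this, assuming the bytes have already been assembled into a uint64_t in the host's byte order and that double is IEEE 754 binary64. Both function names are hypothetical. Here "valid" is taken to mean finite, so zeros and subnormals pass while infinities and NaNs are rejected.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 if the pattern encodes a finite binary64 value, i.e. the
   11-bit exponent field is not all ones; 0 for infinities and NaNs. */
int bits_are_finite_double(uint64_t bits)
{
    return ((bits >> 52) & 0x7FF) != 0x7FF;
}

/* Copies the bits into *out and returns 1 only when they are finite;
   otherwise leaves *out untouched and returns 0. */
int bits_to_double_checked(uint64_t bits, double *out)
{
    if (!bits_are_finite_double(bits))
        return 0;
    memcpy(out, &bits, sizeof *out);
    return 1;
}
```

If subnormals should also be rejected for a particular test, an extra check for an all-zero exponent field with a non-zero fraction would be needed.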
The next problem to deal with is overflows/underflows and exceptions resulting from whatever you need to do with these random doubles (addition, multiplication, etc). You need to figure out how to deal with them on your platform (compiler+CPU+OS), whether or not you can safely and reliably detect them.
But that looks like a separate question and it has probably already been asked and answered.

Fixed point multiplication

I need to convert a value from one unit to another according to a non-constant factor. The input value ranges from 0 to 1073676289 and the range value ranges from 0 to 1155625. The conversion can be described like this:
output = input * (range / 1073676289)
My own initial fixed point implementation feels a bit clumsy:
// Input values (examples)
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
Can my code be improved to be simpler or to have better accuracy?
This will give you the best precision with no floating point values and the result will be rounded to the nearest integer value:
output = (input * (long long) range + 536838144) / 1073676289;
The problem is that input * range would overflow a 32-bit integer. Fix that by using a 64-bit integer.
uint_least64_t tmp;
tmp = input;
tmp = tmp * range;
tmp = tmp / 1073676289ul;
output = tmp;
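Combining the two suggestions above (a 64-bit intermediate product plus half the divisor added for rounding) gives a small function like the following. The name scale_rounded and the INPUT_MAX macro are my own; the constant is the maximum input from the question.

```c
#include <stdint.h>

#define INPUT_MAX 1073676289u   /* maximum input value, from the question */

/* Computes input * range / INPUT_MAX rounded to the nearest integer,
   using a 64-bit product so the multiplication cannot overflow. */
uint32_t scale_rounded(uint32_t input, uint32_t range)
{
    uint64_t product = (uint64_t)input * range;
    return (uint32_t)((product + INPUT_MAX / 2) / INPUT_MAX);
}
```

For the example values in the question, scale_rounded(536838144, 1155625) gives 577812.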
A quick trip out to Google brought http://sourceforge.net/projects/fixedptc/ to my attention.
It's a C library in a header for managing fixed-point math in 32- or 64-bit integers.
A little bit of experimentation with the following code:
#include <stdio.h>
#include <stdint.h>
#define FIXEDPT_BITS 64
#include "fixedptc.h"
int main(int argc, char ** argv)
{
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
double output2 = (double)input * ((double)range / 1073676289.0);
uint32_t output3 = fixedpt_toint(fixedpt_xmul(fixedpt_fromint(input), fixedpt_xdiv(fixedpt_fromint(range), fixedpt_fromint(1073676289))));
printf("baseline = %g, better = %d, library = %d\n", output2, output, output3);
return 0;
}
Got me the following results:
baseline = 577812, better = 577776, library = 577812
Showing better precision (matching the floating point) than you were getting with your code. Under the hood it's not doing anything terribly complicated (and doesn't work at all in 32 bits)
/* Multiplies two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_mul(fixedpt A, fixedpt B)
{
return (((fixedptd)A * (fixedptd)B) >> FIXEDPT_FBITS);
}
/* Divides two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_div(fixedpt A, fixedpt B)
{
return (((fixedptd)A << FIXEDPT_FBITS) / (fixedptd)B);
}
But it does show that you can get the precision you want. You'll just need 64 bits to do it
You won't get it any simpler than output = input * (range / 1073676289)
As noted below in the comments, if you are restricted to integer operations then for range < 1073676289: range / 1073676289 == 0, so you would be good to go with:
output = range < 1073676289 ? 0 : input
If that is not what you wanted and you actually want precision then
output = (input * range) / 1073676289
will be the way to go.
If you need to do a lot of those then I suggest you use double and have your compiler vectorise your operations. Precision will be OK too.

Bit Rotation in C

The Problem: Exercise 2-8 of The C Programming Language, "Write a function rightrot(x,n) that returns the value of the integer x, rotated to the right by n positions."
I have done this every way that I know how. Here is the issue that I am having. Take a given number for this exercise, say 29, and rotate it right one position.
11101 becomes 11110, or 30. Let's say for the sake of argument that the system we are working on has an unsigned integer type size of 32 bits. Let's further say that we have the number 29 stored in an unsigned integer variable. In memory the number will have 27 zeros ahead of it. So when we rotate 29 right by one using one of several algorithms (mine is posted below), we get the number 2147483662. This is obviously not the desired result.
unsigned int rightrot(unsigned x, int n) {
return (x >> n) | (x << (sizeof(x) * CHAR_BIT) - n);
}
Technically, this is correct, but I was thinking that the 27 zeros that are in front of 11101 were insignificant. I have also tried a couple of other solutions:
int wordsize(void) { // compute the wordsize on a given machine...
unsigned x = ~0;
int b;
for(b = 0; x; b++)
x &= x-1;
return x;
}
unsigned int rightrot(unsigned x, int n) {
unsigned rbit;
while(n --) {
rbit = x >> 1;
x |= (rbit << wordsize() - 1);
}
return x;
}
This last and final solution is the one where I thought that I had it, I will explain where it failed once I get to the end. I am sure that you will see my mistake...
int bitcount(unsigned x) {
int b;
for(b = 0; x; b++)
x &= x-1;
return b;
}
unsigned int rightrot(unsigned x, int n) {
unsigned rbit;
int shift = bitcount(x);
while(n--) {
rbit = x & 1;
x >>= 1;
x |= (rbit << shift);
}
}
This solution gives the expected answer of 30 that I was looking for, but if you use a number for x like oh say 31 (11111), then there are issues, specifically the outcome is 47, using one for n. I did not think of this earlier, but if a number like 8 (1000) is used then mayhem. There is only one set bit in 8, so the shift is most certainly going to be wrong. My theory at this point is that the first two solutions are correct (mostly) and I am just missing something...
A bitwise rotation is always necessarily within an integer of a given width. In this case, as you're assuming a 32-bit integer, 2147483662 (0b10000000000000000000000000001110) is indeed the correct answer; you aren't doing anything wrong!
0b11110 would not be considered the correct result by any reasonable definition, as continuing to rotate it right using the same definition would never give you back the original input. (Consider that another right rotation would give 0b1111, and continuing to rotate that would have no effect.)
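To make the fixed-width point concrete, here is a minimal sketch of a 32-bit rotate that also tolerates n = 0 and n >= 32. It uses uint32_t from <stdint.h> instead of the book's unsigned, and rotr32 is a name of my own choosing.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n positions. n is reduced modulo 32,
   and the left-shift count is masked so it never reaches 32, which
   would be undefined behavior for a 32-bit operand. */
uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}
```

rotr32(29, 1) is 2147483662 as discussed above, and rotating that result by a further 31 positions returns 29, which is the round-trip property that a width-free definition cannot provide.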
In my opinion, the spirit of the section of the book which immediately precedes this exercise would have the reader do this problem without knowing anything about the size (in bits) of integers, or any other type. The examples in the section do not require that information; I don't believe the exercises should either.
Regardless of my belief, the book had not yet introduced the sizeof operator by section 2.9, so the only way to figure the size of a type is to count the bits "by hand".
But we don't need to bother with all that. We can do bit rotation in n steps, regardless of how many bits there are in the data type, by rotating one bit at a time.
Using only the parts of the language that are covered by the book up to section 2.9, here's my implementation (with integer parameters, returning an integer, as specified by the exercise): Loop n times, x >> 1 each iteration; if the old low bit of x was 1, set the new high bit.
int rightrot(int x, int n) {
int lowbit;
while (n-- > 0) {
lowbit = x & 1; /* save low bit */
x = (x >> 1) & (~0u >> 1); /* shift right by one, and clear the high bit (in case of sign extension) */
if (lowbit)
x = x | ~(~0u >> 1); /* set the high bit if the low bit was set */
}
return x;
}
You could find the location of the first '1' in the 32-bit value using binary search. Then note the bit in the LSB location, right shift the value by the required number of places, and put the LSB bit in the location of the first '1'.
int bitcount(unsigned x) {
int b;
for(b = 0; x; b++)
x &= x-1;
return b;
}
unsigned rightrot(unsigned x, int n) {
int b = bitcount(x);
unsigned a = (x & ~(~0u << n)) << (b - n + 1);
x >>= n;
x |= a;
return x;
}

How to do serialization of float numbers on network?

I found a piece of code to do the serialization of float numbers on network.
uint32_t htonf(float f)
{
uint32_t p;
uint32_t sign;
if (f < 0) { sign = 1; f = -f; }
else { sign = 0; }
p = ((((uint32_t)f)&0x7fff)<<16) | (sign<<31); // whole part and sign
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff; // fraction
return p;
}
Spec: The above code is sort of a naive implementation that stores a float in a 32-bit number. The high bit (31) is used to store the sign of the number ("1" means negative), and the next fifteen bits (30-16) are used to store the whole number portion of the float. Finally, the remaining bits (15-0) are used to store the fractional portion of the number.
The rest is fine, but I cannot figure out what this line means. How does it give us bits 15-0? Why do we need the "*65536.0f"?
p |= (uint32_t)(((f - (int)f) * 65536.0f))&0xffff
Can anyone explain this?
f - (int)f
gives you the fractional part of the number. You want to store this fraction in 16 bits, so think of it as a fraction with 2^16 as the denominator. The numerator is:
(f - (int)f) * 65536.0f
The rest just uses bit shifting to pack it up into the right bits in the 32 bit number. Then that 32 bit int is serialized on the network like any other 32 bit int, and presumably the opposite of the above routine is used to re-create a floating point number.
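The packing and its presumed inverse can be sketched as a pair. The names pack_16_16 and unpack_16_16 are hypothetical, the unpacking routine is my reconstruction of "the opposite of the above routine" and not code from the question, and the input is assumed to satisfy |f| < 32768 so the whole part fits in 15 bits.

```c
#include <stdint.h>

/* Pack: sign in bit 31, whole part in bits 30-16, fraction in bits 15-0.
   Multiplying the fractional part by 65536 (2^16) turns it into the
   numerator of a fraction whose denominator is 2^16. */
uint32_t pack_16_16(float f)
{
    uint32_t sign = 0;
    if (f < 0) { sign = 1; f = -f; }
    uint32_t whole = (uint32_t)f;                 /* truncates toward zero */
    uint32_t frac = (uint32_t)((f - (float)whole) * 65536.0f);
    return (sign << 31) | ((whole & 0x7fff) << 16) | (frac & 0xffff);
}

/* Unpack: divide the 16-bit numerator by 2^16 and add the whole part. */
float unpack_16_16(uint32_t p)
{
    float f = (float)((p >> 16) & 0x7fff);
    f += (float)(p & 0xffff) / 65536.0f;
    return (p >> 31) ? -f : f;
}
```

For example, 3.5f packs to 0x00038000: the whole part 3 lands in bits 30-16 and 0.5 * 65536 = 0x8000 lands in bits 15-0. The packed word would still need htonl/ntohl for byte order on the wire.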
You could use a union.
uint32_t htonf(float f)
{
    union {
        float f1;
        uint32_t i1;
    } u; /* named object: an unnamed union declares no variables at block scope in C */
    u.f1 = f;
    return u.i1;
}
