Full variation of Random numbers in C - c

I am trying to generate 64-bit random numbers using the following code. I want the numbers in binary, but the problem is that I can't get all the bits to vary. I want the numbers to vary as much as possible.
void PrintDoubleAsCBytes(double d, FILE* f)
{
f = fopen("tb.txt","a");
unsigned char a[sizeof(d)];
unsigned i;
memcpy(a, &d, sizeof(d));
for (i = 0; i < sizeof(a); i++){
fprintf(f, "%0*X", (CHAR_BIT + 3) / 4, a[sizeof(d)-1-i]);
}
fprintf(f,"\n");
fclose(f); /*done!*/
}
int main (int argc, char *argv[])
{
int limit = 100 ;
double a, b;
double result;
int i ;
printf("limit = %d", limit );
for (i= 0 ; i< limit;i++)
{
a= rand();
b= rand();
result = a * b;
printf ("A= %f B = %f\n",a,b);
printf ("result= %f\n",result);
PrintDoubleAsCBytes(a, stdout); puts("");
PrintDoubleAsCBytes(b, stdout); puts("");
PrintDoubleAsCBytes(result, stdout); puts("");
}
}
OUTPUT FILE
41DAE2D159C00000 //Last bits remain zero, I want them to change as well as in case of the result
41C93D91E3000000
43B534EE7FAEB1C3
41D90F261A400000
41D98CD21CC00000
43C4021C95228080
41DD2C3714400000
41B9495CFF000000
43A70D6CAD0EE321
How do I achieve this? I do not have much experience in software coding.

In Java it is very easy:
Random rng = new Random(); //do this only once
long randLong = rng.nextLong();
double randDoubleFromBits = Double.longBitsToDouble(randLong);
In C I only know of a hack way to do it :)
Since RAND_MAX can be as low as 2^15-1 but is implementation defined, maybe you can get 64 random bits out of rand() by doing masks and bitshifts:
//seed program once at the start
srand(time(NULL));
uint64_t a = rand()&0x7FFF;
uint64_t b = rand()&0x7FFF;
uint64_t c = rand()&0x7FFF;
uint64_t d = rand()&0x7FFF;
uint64_t e = rand()&0x7FFF;
uint64_t random = (a<<60)+(b<<45)+(c<<30)+(d<<15)+e;
Then stuff it in a union and use the other member of the union to interpret its bits as a double. Something like
union
{
double d;
uint64_t u; /* long may be only 32 bits wide, so use a fixed-width type */
} doubleOrBits;
doubleOrBits.u = random;
double randomDouble = doubleOrBits.d;
(I haven't tested this code)
EDIT: Explanation of how it should work
First, srand(time(NULL)); seeds rand with the current timestamp. So you only need to do this once at the start, and if you want to reproduce an earlier RNG series you can reuse that seed if you like.
rand() returns a random, unbiased integer between 0 and RAND_MAX inclusive. RAND_MAX is guaranteed to be at least 2^15-1, which is 0x7FFF. To write the program such that it doesn't matter what RAND_MAX is (for example, it could be 2^16-1, 2^31-1, 2^32-1...), we mask out all but the bottom 15 bits - 0x7FFF is 0111 1111 1111 1111 in binary, or the bottom 15 bits.
Now we have to pack our 15-bit random chunks into 64 bits. The bitshift operator, <<, shifts the bits of the left operand to the left by the number of positions given by the right operand. So the final uint64_t we call random has random bits derived from the other variables like so:
aaaa bbbb bbbb bbbb bbbc cccc cccc cccc ccdd dddd dddd dddd deee eeee eeee eeee
But this is still being treated as a uint64_t, not as a double. To reinterpret the bits, put the uint64_t in a union and then read the union's double member. In C this kind of type punning reinterprets the stored bytes (in C++ it is undefined behaviour), but the result still depends on your platform's double format, so you should make sure it works the way you expect on your compiler of choice. If it does, you'll (hopefully!) interpret those same bits as a double made up of random bits.
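If you would rather not rely on the union, the same idea can be sketched with memcpy(). This is only an illustration: the names random_u64 and bits_to_double are made up here, and it assumes that double is a 64-bit IEEE 754 type and that masking rand() to 15 bits is acceptable for your purposes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: glue five 15-bit chunks from rand() together.
 * Only the low 4 bits of the first chunk survive, as in the layout above. */
uint64_t random_u64(void)
{
    uint64_t r = 0;
    int i;
    for (i = 0; i < 5; i++)
        r = (r << 15) | (uint64_t)(rand() & 0x7FFF);
    return r;
}

/* Hypothetical helper: reinterpret 64 bits as a double by copying bytes.
 * Assumes sizeof(double) == sizeof(uint64_t). */
double bits_to_double(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Calling bits_to_double(random_u64()) then yields a double whose 64 bits all vary, including patterns that are NaN or infinity.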

Depending on your platform, but assuming IEEE 754, e.g. Wikipedia, why not explicitly handle the internal double format?
(Barring mistakes), this generates random but valid doubles.
[ Haven't quite covered all bases here, e.g. case where exp = 0 or 0x7ff ]
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
double randomDouble(void)
{
uint64_t buf = 0;
// sign bit
if (rand() % 2)
buf = 1ull << 63;
// exponent
int exponentLength = 11;
int exponentMask = (1 << exponentLength) - 1;
int exponentLocation = 63 - exponentLength;
uint64_t exponent = rand() & exponentMask;
buf += exponent << exponentLocation;
// fraction (52 bits wide, so the mask must be computed in 64-bit arithmetic)
int fractionLength = exponentLocation;
uint64_t fractionMask = (1ull << fractionLength) - 1;
// Courtesy of Patashu
uint64_t a = rand()&0x7FFF;
uint64_t b = rand()&0x7FFF;
uint64_t c = rand()&0x7FFF;
uint64_t d = rand()&0x7FFF;
uint64_t fraction = (a<<45)+(b<<30)+(c<<15)+d;
fraction = fraction & fractionMask;
buf += fraction;
// copy the bits into a double; memcpy avoids the aliasing problems of a pointer cast
double res;
memcpy(&res, &buf, sizeof res);
return res;
}

You could use this:
void GenerateRandomDouble(double* d)
{
unsigned char* p = (unsigned char*)d;
unsigned i;
for (i = 0; i < sizeof(*d); i++) /* sizeof(d) would be the size of the pointer */
p[i] = rand();
}
The problem with this method is that your C program may be unable to use some of the values returned by this function, because they're invalid or special floating point values.
But if you're testing your hardware, you could generate random bytes and feed them directly into said hardware without first converting them into a double.
The only place where you need to treat these random bytes as a double is the point of validation of the results returned by the hardware.
At that point you need to look at the bytes and see if they represent a valid value. If they do, you can memcpy() the bytes into a double and use it.
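As a rough sketch of that validation step, the check below inspects the exponent field before copying the bytes into a double. The function name bytes_to_valid_double is hypothetical, "valid" is taken here to mean "not infinity and not NaN", and the code assumes a 64-bit IEEE 754 double stored with the same byte order as a uint64_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 and stores the value in *out if the
 * 8 bytes encode a finite double, returns 0 for infinity or NaN. */
int bytes_to_valid_double(const unsigned char *bytes, double *out)
{
    uint64_t bits;
    memcpy(&bits, bytes, sizeof bits);

    /* An all-ones exponent field (0x7FF) marks infinity and NaN. */
    if (((bits >> 52) & 0x7FF) == 0x7FF)
        return 0;

    memcpy(out, bytes, sizeof *out);
    return 1;
}
```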
The next problem to deal with is overflows/underflows and exceptions resulting from whatever you need to do with these random doubles (addition, multiplication, etc). You need to figure out how to deal with them on your platform (compiler+CPU+OS), whether or not you can safely and reliably detect them.
But that looks like a separate question and it has probably already been asked and answered.

Related

sprintf - producing char array from an int in C

I'm doing an assignment for school to swap the bytes in an unsigned long, and return the swapped unsigned long. ex. 0x12345678 -> 0x34127856.
I figured I'll make a char array, use sprintf to insert the long into a char array, and then do the swapping, stepping through the array. I'm pretty familiar with c++, but C seems a little more low level. I researched a few topics on sprintf, and I tried to make an array, but I'm not sure why it's not working.
unsigned long swap_bytes(unsigned long n) {
char new[64];
sprintf(new, "%l", n);
printf("Char array is now: %s\n", new);
}
TLDR; The correct approach is at the bottom
Preamble
Issues with what you're doing
First off using sprintf for byte swapping is the wrong approach because
it is a MUCH MUCH slower process than using the mathematical properties of bit operations to perform the byte swapping.
A byte is not a digit in a number. (a wrong assumption that you've made in your approach)
It's even more painful when you don't know the size of your integer (is it 32-bits, 64 bits or what)
The correct approach
Use bit manipulation to swap the bytes (see way way below)
The absolutely incorrect implementation with wrong output (because we're ignoring issue #2 above)
There are many technical reasons why sprintf is much slower but suffice it to say that it's so because moving contents of memory around is a slow operation, and of course more data you're moving around the slower it gets:
In your case, by changing a number (which sits in one manipulatable 'word' (think of it as a cell)) into its human readable string-equivalence you are doing two things:
You are converting (let's assume a 64-bit CPU) a single number represented by 8 bytes in a single CPU cell (officially a register) into a human equivalence string and putting it in RAM (memory). Now, each character in the string now takes up at least a byte: So a 16 digit number takes up 16 bytes (rather than 8)
You are then moving these characters around using memory operations (which are slow compared to doing something directly on the CPU, often by orders of magnitude)
Then you're converting the characters back to integers, which is a long and tedious operation
However, since that's the solution that you came up with let's first look at it.
The really wrong code with a really wrong answer
Starting (somewhat) with your code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned long swap_bytes(unsigned long n) {
int i, l;
char new[64]; /* the fact that you used 64 here told me that you made assumption 2 */
sprintf(new, "%lu", n); /* you forgot the `u` here */
printf("The number is: %s\n", new); /* well it shows up :) */
l = strlen(new);
for(i = 0; i < l; i+=4) {
char tmp[2];
tmp[0] = new[i+2]; /* get next two characters */
tmp[1] = new[i+3];
new[i+2] = new[i];
new[i+3] = new[i+1];
new[i] = tmp[0];
new[i+1] = tmp[1];
}
return strtoul(new, NULL, 10); /* convert new back */
}
/* testing swap byte */
int main() {
/* seems to work: */
printf("Swapping 12345678: %lu\n", swap_bytes(12345678));
/* how about 432? (err not) */
printf("Swapping 432: %lu\n", swap_bytes(432));
}
As you can see, the above is not really byte swapping but character swapping. And any attempt to "fix" the above code is nonsensical. For example, how do we deal with an odd number of digits?
Well, I suppose we can pad odd digit counts with a zero:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned long swap_bytes(unsigned long n) {
int i, l;
char new[64]; /* the fact that you used 64 here told me that you made assumption 2 */
sprintf(new, "%lu", n); /* you forgot the `u` here */
printf("The number is: %s\n", new); /* well it shows up :) */
l = strlen(new);
if(l % 2 == 1) { /* check if l is odd */
printf("adding a pad to make n even digit count");
sprintf(new, "0%lu", n);
l++; /* length has increased */
}
for(i = 0; i < l; i+=4) {
char tmp[2];
tmp[0] = new[i+2]; /* get next two characters */
tmp[1] = new[i+3];
new[i+2] = new[i];
new[i+3] = new[i+1];
new[i] = tmp[0];
new[i+1] = tmp[1];
}
return strtoul(new, NULL, 10); /* convert new back */
}
/* testing swap byte */
int main() {
/* seems to work: */
printf("Swapping 12345678: %lu\n", swap_bytes(12345678));
printf("Swapping 432: %lu\n", swap_bytes(432));
/* how about 432516? (err not) */
printf("Swapping 432516: %lu\n", swap_bytes(432516));
}
Now we run into an issue with numbers which are not divisible by 4... Do we pad them with zeros on the right or the left or the middle? err NOT REALLY.
In any event this entire approach is wrong because we're not swapping bytes anyhow, we're swapping characters.
Now what?
So you may be asking
what the heck is my assignment talking about?
Well numbers are represented as bytes in memory, and what the assignment is asking for is for you to get that representation and swap it.
So for example, if we took a number like 12345678, it's actually stored as some sequence of bytes (1 byte == 8 bits). So let's look at the normal math way of representing 12345678 (base 10) in bits (base 2) and then in hexadecimal (base 16), which maps neatly onto bytes:
(12345678)10 = (101111000110000101001110)2
Splitting the binary bits into groups of 4 for visual ease gives:
(12345678)10 = (1011 1100 0110 0001 0100 1110)2
But 4 bits are equal to 1 hex number (0, 1, 2, 3... 9, A, B...F), so we can convert the bits into nibbles (4-bit hex numbers) easily:
(12345678)10 = 1011 | 1100 | 0110 | 0001 | 0100 | 1110
(12345678)10 = B | C | 6 | 1 | 4 | E
But each byte (8-bits) is two nibbles (4-bits) so if we squish this a bit:
(12345678)10 = (BC 61 4E)16
So 12345678 is actually representable in 3 bytes;
However, CPUs have specific sizes for integers, usually powers of two. This is so for a variety of reasons that are beyond the scope of this discussion; suffice it to say that you will get things like 16-bit, 32-bit, 64-bit, 128-bit etc. Most often a CPU of a particular bit-size (say a 64-bit CPU) will be able to manipulate unsigned integers representable in that bit-size directly, without having to store parts of the number in RAM.
Slight Digression
So let's say we have a 32-bit CPU, and somewhere at byte number α in RAM. The CPU could store the number 12345678 as:
> 00 BC 61 4E
> ↑ α ↑ α+1 ↑ α+2 ↑ α+3
(Figure 1)
Here the most significant part of the number, is sitting at the lowest memory address index α
Or the CPU could store it differently, where the least significant part of the number is sitting at the lowest memory.
> 4E 61 BC 00
> ↑ α ↑ α+1 ↑ α+2 ↑ α+3
(Figure 2)
The way a CPU stores a number is called the endianness of the CPU. If the most significant byte sits at the lowest memory address, the CPU is big-endian (Figure 1); if the least significant byte sits at the lowest address, it is little-endian (Figure 2).
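To make the two figures concrete, here is a small sketch that writes a 32-bit value into a byte array in each order. The function names are made up for this illustration, and the code builds the layouts by hand rather than asking the CPU which one it uses.

```c
#include <stdint.h>

/* Hypothetical helper: most significant byte first, as in Figure 1. */
void store_big_endian(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)(value >> 24);
    out[1] = (unsigned char)(value >> 16);
    out[2] = (unsigned char)(value >> 8);
    out[3] = (unsigned char)value;
}

/* Hypothetical helper: least significant byte first, as in Figure 2. */
void store_little_endian(uint32_t value, unsigned char out[4])
{
    out[0] = (unsigned char)value;
    out[1] = (unsigned char)(value >> 8);
    out[2] = (unsigned char)(value >> 16);
    out[3] = (unsigned char)(value >> 24);
}
```

For 12345678 the first function produces 00 BC 61 4E and the second produces 4E 61 BC 00.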
Getting the correct answer (the wrong way)
Now that we have an idea of how things may be stored, let's try and pull this out still using sprintf.
We're going to use a couple of tricks here:
we'll convert the number to hexadecimal and zero-pad it to a fixed width (8 hex digits here)
we'll use printf's (therefore sprintf) format string capability that if we want to use a variable to specify the width of an argument then we can use a * after the % sign like so:
printf("%*d", width, num);
If we set our format string to %0*x we get a hex number that's zero padded in output automatically, so:
sprintf(new, "%0*lx", (int)sizeof(n), n); /* the * width must be an int, and unsigned long needs %lx */
Our program then becomes:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
unsigned long swap_bytes(unsigned long n) {
int i, l;
char new[64] = "";
sprintf(new, "%0*lx", (int)sizeof(n), n); /* the * width must be an int, and unsigned long needs %lx */
printf("The number is: %s\n", new);
l = strlen(new);
for(i = 0; i < l; i+=4) {
char tmp[2];
tmp[0] = new[i+2]; /* get next two characters */
tmp[1] = new[i+3];
new[i+2] = new[i];
new[i+3] = new[i+1];
new[i] = tmp[0];
new[i+1] = tmp[1];
}
return strtoul(new, NULL, 16); /* convert new back */
}
/* testing swap byte */
int main() {
printf("size of unsigned long is %zu\n", sizeof(unsigned long));
printf("Swapping 12345678: %lx\n", swap_bytes(12345678));
/* how about 123456? */
printf("Swapping 123456: %lx\n", swap_bytes(123456));
printf("Swapping 98899: %lx\n", swap_bytes(98899));
}
The output would look something like:
size of unsigned long is 8
The number is: 00bc614e
Swapping 12345678: bc004e61
The number is: 0001e240
Swapping 123456: 10040e2
The number is: 00018253
Swapping 98899: 1005382
Obviously we can change our outputs by using %ld and print the base 10 versions of the numbers, rather than base 16 as is happening above. I'll leave that to you.
Now let's do it the right way
This is however rather terrible, since byte swapping can be done much faster without ever doing the integer to string and string to integer conversion.
Let's see how that's done:
The rather explicit way
Before we go on, just a bit on bit shifting in C:
If I have a number, say 6 = (110)2, and I shift all the bits to the left by 1, I get 12 = (1100)2 (we simply shifted everything to the left, adding zeros on the right as needed).
This is written in C as 6 << 1.
A right shift is similar and can be expressed in C with >>, so if I have a number, say 240 = (11110000)2, and I right-shift it 4 times I get 15 = (1111)2. This is expressed as 240 >> 4.
Now we have unsigned long integers which are (in my case at least) 64 bits long, or 8 bytes long.
Let's say my number is 12345678, which is (00 00 00 00 00 bc 61 4e)16 in hex at 8 bytes long. If I want the value of byte number 2 (counting from zero at the least significant end), I can extract it by taking the number 0xFF (1111 1111, all bits of a byte set to 1), left shifting it until it lines up with byte 2 (so left shift 2*8 = 16 times), performing a bitwise and with the number, and then right shifting the result to get rid of the zeros. This is what it looks like:
0xFF << (2 * 8) = 0xFF0000, and 0xFF0000 & 0000 0000 00bc 614e = 0000 0000 00bc 0000
Now right shift:
0000 0000 00bc 0000 >> (2 * 8) = bc
Another (better) way to do it would be to right shift first and then perform bitwise and with 0xFF to drop all higher bits:
0000 0000 00bc 614e >> 16 = 0000 0000 0000 00bc, and 0000 0000 0000 00bc & 0xFF = bc
We will use the second way and make a macro using #define. We can then put the bytes back at the swapped locations by left shifting each kth byte into position k+1 and each (k+1)st byte into position k.
Here is a sample implementation of this:
#include <stdio.h>
#define GET_BYTE(N, B) (((N) >> (8 * (B))) & 0xFFUL)
unsigned long swap_bytes(unsigned long n)
{
unsigned long rv = 0UL;
size_t k;
printf("number is %016lx\n", n);
for(k = 0; k < sizeof(n); k += 2) {
printf("swapping bytes %zu[%016lx] and %zu[%016lx]\n", k, GET_BYTE(n, k),
k+1, GET_BYTE(n, k+1));
rv += GET_BYTE(n, k) << 8*(k+1); /* byte k moves up to position k+1 */
rv += GET_BYTE(n, k+1) << 8*k; /* byte k+1 moves down to position k */
}
return rv;
}
/* testing swap byte */
int main() {
printf("size of unsigned long is: %zu\n", sizeof(unsigned long));
printf("Swapping 12345678: %lx\n", swap_bytes(12345678));
/* how about 123456? */
printf("Swapping 123456: %lx\n", swap_bytes(123456));
printf("Swapping 98899: %lx\n", swap_bytes(98899));
}
But this can be done so much more efficiently. I leave it here for now. We'll come back to using bit blitting and xor swapping later.
Update with GET_BYTE as a function instead of a macro:
Just for fun we also use a shift operator for multiplying by 8. You can note that left shifting a number by 1 is like multiplying it by 2 (makes sense since in binary 2 is 10 and multiplying by 10 adds a zero to the end and therefore is the same as shifting something left by one space) So multiplying by 8 (1000)2 is like shifting something three spaces over or basically tacking on 3 zeros (overflows notwithstanding):
static inline unsigned long get_byte(const unsigned long n, const unsigned char idx) {
return ((n >> (idx << 3)) & 0xFFUL);
}
Now the really really fun and correct way to do this
Okay so a fast way to swap integers around is to realize that if we have two integers x, and y we can use properties of xor function to swap their values. The basic algorithm is this:
X := X XOR Y
Y := Y XOR X
X := X XOR Y
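As a minimal sketch, the three steps translate to C as follows. The function name xor_swap is made up for this illustration, and the early return guards against both pointers referring to the same object, in which case the XOR steps would zero the value.

```c
/* Hypothetical helper: swap two unsigned ints without a temporary. */
void xor_swap(unsigned int *x, unsigned int *y)
{
    if (x == y)
        return;   /* x ^ x is 0, so swapping an object with itself would clobber it */
    *x ^= *y;     /* X := X XOR Y */
    *y ^= *x;     /* Y := Y XOR X, which is now the original X */
    *x ^= *y;     /* X := X XOR Y, which is now the original Y */
}
```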
Now we know that a char is one byte in C. So we can force the compiler to treat the 8 byte integer as a sequence of 1-byte chars (hehe it's a bit of a mind bender considering everything I said about not doing it in sprintf) but this is different. You have to just think about it a bit.
We'll take the memory address of our integer, cast it to a char pointer (char *) and treat the result as an array of chars. Then we'll use the xor function property above to swap the two consecutive array values.
To do this I am going to use a macro (although we could use a function) but using a function will make the code uglier.
One thing you'll note is that there is the use of ?: in XORSWAP below. That's like an if-then-else in C but with expressions rather than statements, so basically (conditional_expression) ? (value_if_true) : (value_if_false) means if conditional_expression is non-zero the result will be value_if_true, otherwise it will be value_if_false. AND it's important not to xor a value with itself because you will always get 0 as a result and clobber the content. So we use the conditional to check if the addresses of the values we are changing are DIFFERENT from each other. If the addresses are the same (&a == &b) we simply return the value at the address (&a == &b) ? a : (otherwise_do_xor)
So let's do it:
#include <stdio.h>
/* this macro swaps any two non floating C values that are at
* DIFFERENT memory addresses. That's the entire &a == &b ? a : ... business
*/
#define XORSWAP(a, b) ((&(a) == &(b)) ? (a) : ((a)^=(b),(b)^=(a),(a)^=(b)))
unsigned long swap_bytes(const unsigned long n) {
unsigned long rv = n; /* we are not messing with original value */
int k;
for(k = 0; k < sizeof(rv); k+=2) {
/* swap k'th byte with k+1st byte */
XORSWAP(((char *)&rv)[k], ((char *)&rv)[k+1]);
}
return rv;
}
int main()
{
printf("swapped: %lx", swap_bytes(12345678));
return 0;
}
Here endeth the lesson. I hope that you will go through all the examples. If you have any more questions just ask in comments and I'll try to elaborate.
unsigned long swap_bytes(unsigned long n) {
char new[64];
sprintf(new, "%lu", n);
printf("Char array is now: %s\n", new);
}
You need to use %lu (long unsigned) as the format in sprintf(). The compiler should also have given you a warning that the conversion lacks a type because of this.
It doesn't seem like you attempted the swap, could I see your try?

Adding 32 bit signed in C

I have been given this problem and would like to solve it in C:
Assume you have a 32-bit processor and that the C compiler does not support long long (or long int). Write a function add(a,b) which returns c = a+b where a and b are 32-bit integers.
I wrote this code which is able to detect overflow and underflow
#define INT_MIN (-2147483647 - 1) /* minimum (signed) int value */
#define INT_MAX 2147483647 /* maximum (signed) int value */
int add(int a, int b)
{
if (a > 0 && b > INT_MAX - a)
{
/* handle overflow */
printf("Handle over flow\n");
}
else if (a < 0 && b < INT_MIN - a)
{
/* handle underflow */
printf("Handle under flow\n");
}
return a + b;
}
I am not sure how to implement the long using 32 bit registers so that I can print the value properly. Can someone help me with how to use the underflow and overflow information so that I can store the result properly in the c variable, which I think should be 2 32 bit locations? I think that is what the problem is saying when it hints that long is not supported. Would the variable c be 2 32 bit registers put together somehow to hold the correct result so that it can be printed? What action should I perform when the result over or under flows?
Since this is a homework question I'll try not to spoil it completely.
One annoying aspect here is that the result is bigger than anything you're allowed to use (I interpret the ban on long long to also include int64_t, otherwise there's really no point to it). It may be tempting to go for "two ints" for the result value, but it's awkward to interpret the value of that. So I'd go for two uint32_t's and interpret them as two halves of a 64 bit two's complement integer.
Unsigned multiword addition is easy and has been covered many times (just search). The signed variant is really the same if the inputs are sign-extended: (not tested)
uint32_t a_l = a;
uint32_t a_h = -(a_l >> 31); // sign-extend a
uint32_t b_l = b;
uint32_t b_h = -(b_l >> 31); // sign-extend b
// todo: implement the addition
return some struct containing c_l and c_h
It can't overflow the 64 bit result when interpreted signed, obviously. It can (and should, sometimes) wrap.
To print that thing, if that's part of the assignment, first reason about which values c_h can have. There aren't many possibilities. It should be easy to print using existing integer printing functions (that is, you don't have to write a whole multiword-itoa, just handle a couple of cases).
As a hint for the addition: what happens when you add two decimal digits and the result is larger than 9? Why is the low digit of 7+6=13 a 3? Given only 7, 6 and 3, how can you determine the second digit of the result? You should be able to apply all this to base 2^32 as well.
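If you want to check your own attempt afterwards, one possible shape of the result is sketched below. The struct wide_sum and the function add32 are names invented for this sketch, and the carry test is just the base 2^32 version of the decimal hint: the low word wrapped around exactly when it ended up smaller than one of its inputs.

```c
#include <stdint.h>

/* Hypothetical result type: two halves of a 64-bit two's complement value. */
struct wide_sum {
    uint32_t lo;
    uint32_t hi;
};

struct wide_sum add32(int32_t a, int32_t b)
{
    struct wide_sum r;
    uint32_t a_l = (uint32_t)a;
    uint32_t a_h = 0u - (a_l >> 31);   /* sign-extend a: all ones if negative */
    uint32_t b_l = (uint32_t)b;
    uint32_t b_h = 0u - (b_l >> 31);   /* sign-extend b */
    uint32_t carry;

    r.lo = a_l + b_l;                  /* wraps modulo 2^32 */
    carry = (r.lo < a_l);              /* 1 if the low word wrapped */
    r.hi = a_h + b_h + carry;
    return r;
}
```

With these inputs the high word can only be 0 or 0xFFFFFFFF, which is what keeps printing manageable.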
First, the simplest solution that satisfies the problem as stated:
double add(int a, int b)
{
// this will not lose precision, as a double-precision float
// will have more than 33 bits in the mantissa
return (double) a + b;
}
More seriously, the professor probably expected the number to be decomposed into a combination of ints. Holding the sum of two 32-bit integers requires 33 bits, which can be represented with an int and a bit for the carry flag. Assuming unsigned integers for simplicity, adding would be implemented like this:
#include <limits.h> /* for UINT_MAX */
struct add_result {
unsigned int sum;
unsigned int carry:1;
};
struct add_result add(unsigned int a, unsigned int b)
{
struct add_result ret;
ret.sum = a + b;
ret.carry = b > UINT_MAX - a;
return ret;
}
The harder part is doing something useful with the result, such as printing it. As proposed by harold, a printing function doesn't need to do full division, it can simply cover the possible large 33-bit values and hard-code the first digits for those ranges. Here is an implementation, again limited to unsigned integers:
void print_result(struct add_result n)
{
if (!n.carry) {
// no carry flag - just print the number
printf("%u\n", n.sum); /* sum is unsigned, so %d would misprint large values */
return;
}
if (n.sum < 705032704u)
printf("4%09u\n", n.sum + 294967296u);
else if (n.sum < 1705032704u)
printf("5%09u\n", n.sum - 705032704u);
else if (n.sum < 2705032704u)
printf("6%09u\n", n.sum - 1705032704u);
else if (n.sum < 3705032704u)
printf("7%09u\n", n.sum - 2705032704u);
else
printf("8%09u\n", n.sum - 3705032704u);
}
Converting this to signed quantities is left as an exercise.

function to convert float to int (huge integers)

This is a university question. Just to make sure :-) We need to implement (float)x
I have the following code which must convert integer x to its floating point binary representation stored in an unsigned integer.
unsigned float_i2f(int x) {
if (!x) return x;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 0;
int t = a;
while(t != 1) {
/* divide by two until t is 0*/
t >>= 1;
e++;
};
/* calculate mantissa */
int m = a << (32 - e);
/* logical right shift */
m = (m >> 9) & ~(((0x1 << 31) >> 9 << 1));
/* add bias for 32bit float */
e += 127;
int res = sign << 31;
res |= (e << 23);
res |= m;
/* lots of printf */
return res;
}
One problem I encounter now is that when my integers are too big then my code fails. I have this control procedure implemented:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
This of course always produces the correct output.
Now when I do some test runs, these are my outputs (GOAL is what it needs to be, result is what I got)
:!make && ./btest -f float_i2f -1 0x80004999
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147464807] 10000000000000000100100110011001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111101101100
result: 11001110111111111111111101101100
GOAL: 11001110111111111111111101101101
So in this case, a 1 is added as the LSB.
Next case:
:!make && ./btest -f float_i2f -1 0x80000001
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147483647] 10000000000000000000000000000001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111111111111
result: 11001110111111111111111111111111
GOAL: 11001111000000000000000000000000
Here 1 is added to the exponent while the mantissa is the complement of it.
I tried for hours to look it up on the internet and in my books, but I can't find any references to this problem. I guess it has something to do with the fact that the mantissa is only 23 bits. But how do I have to handle it then?
EDIT: THIS PART IS OBSOLETE THANKS TO THE COMMENTS BELOW. int l must be unsigned l.
int x = 2147483647;
float f = (float)x;
int l = f;
printf("l: %d\n", l);
then l becomes -2147483648.
How can this happen? So C is doing the casting wrong?
Hope someone can help me here!
Thx
Markus
EDIT 2:
My updated code is now this:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
I also figured out that the code works for all integers in the range -2^24, 2^24. Everything above/below sometimes works but mostly doesn't.
Something is missing, but I really have no idea what. Can anyone help me?
The answer printed is correct in the sense that it depends entirely on the underlying representation of the numbers being converted. Once you understand the binary representation of the number, the result is no longer surprising.
An implicit conversion is associated with the assignment operator (see C99 Standard 6.5.16). The C99 Standard goes on to say:
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
Your earlier example illustrates undefined behavior caused by converting a value that is outside the range of the destination type: 2147483647 is not exactly representable as a float, so (float)x rounds to 2147483648.0, which does not fit in an int.
The asserts in the following snippet ought to prevent any undefined behavior from occurring.
#include <assert.h>
#include <limits.h>
#include <math.h>
unsigned int convertFloatingPoint(double v) {
double d;
assert(isfinite(v));
d = trunc(v);
assert((d>=0.0) && (d<=(double)UINT_MAX));
return (unsigned int)d;
}
Another way to look at the same bits is to create a union containing a 32-bit integer and a float. The integer and the float are then just different ways of looking at the same piece of memory:
union {
unsigned int myInt;
float myFloat;
} my_union;
my_union.myInt = 0xBFFFF2E5;
printf("float is %f\n", my_union.myFloat);
float is -1.999600
You are telling the compiler to take the number you have (large integer) and make it into a float, not to interpret the number AS float. To do that, you need to tell the compiler to read the number from that address in a different form, so this:
myFloat = *(float *)&myInt ;
That means, if we take it apart, starting from the right:
&myInt - the location in memory that holds your integer.
(float *) - really, I want the compiler to use this as a pointer to float, not whatever the compiler thinks it may be.
* - read from the address of whatever is to the right.
myFloat = - set this variable to whatever is to the right.
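So, you are telling the compiler: In the location of (myInt), there is a floating point number, now put that float into myFloat. Be aware that this pointer cast breaks C's strict aliasing rule, so memcpy() is the safer way to do the same thing.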
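Coming back to the mismatches in the question: both test cases differ from the GOAL exactly where the 8 bits that do not fit into the mantissa would round the result up. The sketch below shows one way to add round-to-nearest-even to the normalisation loop from EDIT 2. The function name int_to_float_bits is made up, it uses unsigned arithmetic instead of the asker's int variables, and it is not written to satisfy whatever operator restrictions the assignment may impose.

```c
#include <stdint.h>

/* Hypothetical helper: bit pattern of (float)x for a 32-bit int x. */
uint32_t int_to_float_bits(int32_t x)
{
    uint32_t sign, a, e, mant, rest;

    if (x == 0)
        return 0;

    sign = (x < 0) ? 0x80000000u : 0u;
    a = (x < 0) ? 0u - (uint32_t)x : (uint32_t)x;   /* magnitude, also safe for INT32_MIN */

    /* Normalise so the leading 1 sits in bit 31; 158 = 127 + 31. */
    e = 158;
    while (!(a & 0x80000000u)) {
        a <<= 1;
        e--;
    }

    mant = a >> 8;      /* 24 bits: the implicit 1 plus 23 stored bits */
    rest = a & 0xFFu;   /* the 8 bits that do not fit */

    /* Round to nearest, ties to even. */
    if (rest > 0x80u || (rest == 0x80u && (mant & 1u))) {
        mant++;
        if (mant >> 24) {   /* rounding carried out of the mantissa */
            mant >>= 1;
            e++;
        }
    }

    return sign | (e << 23) | (mant & 0x7FFFFFu);
}
```

The carry out of the mantissa is what turns 0x80000001 into an all-zero mantissa with the exponent bumped by one, as in the second GOAL line.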

Fixed point multiplication

I need to convert a value from one unit to another according to a non-constant factor. The input value ranges from 0 to 1073676289 and the range value ranges from 0 to 1155625. The conversion can be described like this:
output = input * (range / 1073676289)
My own initial fixed point implementation feels a bit clumsy:
// Input values (examples)
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
Can my code be improved to be simpler or to have better accuracy?
This will give you the best precision with no floating point values and the result will be rounded to the nearest integer value:
output = (input * (long long) range + 536838144) / 1073676289;
The problem is that input * range would overflow a 32-bit integer. Fix that by using a 64-bit integer.
uint_least64_t tmp;
tmp = input;
tmp = tmp * range;
tmp = tmp / 1073676289ul;
output = tmp;
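Putting the 64-bit intermediate and the rounding together, a self-contained sketch could look like this. The names scale_value and INPUT_MAX are made up for the illustration, and the code assumes input never exceeds INPUT_MAX.

```c
#include <stdint.h>

#define INPUT_MAX 1073676289u   /* largest input value, as given in the question */

/* Hypothetical helper: output = input * range / INPUT_MAX, rounded to nearest. */
uint32_t scale_value(uint32_t input, uint32_t range)
{
    uint64_t product = (uint64_t)input * range;   /* cannot overflow 64 bits */
    return (uint32_t)((product + INPUT_MAX / 2) / INPUT_MAX);
}
```

Adding half the divisor before dividing is what turns the truncating integer division into rounding to the nearest integer.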
A quick trip out to google brought http://sourceforge.net/projects/fixedptc/ to my attention
It's a c library in a header for managing fixed point math in 32 or 64 bit integers.
A little bit of experimentation with the following code:
#include <stdio.h>
#include <stdint.h>
#define FIXEDPT_BITS 64
#include "fixedptc.h"
int main(int argc, char ** argv)
{
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
double output2 = (double)input * ((double)range / 1073676289.0);
uint32_t output3 = fixedpt_toint(fixedpt_xmul(fixedpt_fromint(input), fixedpt_xdiv(fixedpt_fromint(range), fixedpt_fromint(1073676289))));
printf("baseline = %g, better = %d, library = %d\n", output2, output, output3);
return 0;
}
Got me the following results:
baseline = 577812, better = 577776, library = 577812
Showing better precision (matching the floating point) than you were getting with your code. Under the hood it's not doing anything terribly complicated (and doesn't work at all in 32 bits)
/* Multiplies two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_mul(fixedpt A, fixedpt B)
{
return (((fixedptd)A * (fixedptd)B) >> FIXEDPT_FBITS);
}
/* Divides two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_div(fixedpt A, fixedpt B)
{
return (((fixedptd)A << FIXEDPT_FBITS) / (fixedptd)B);
}
But it does show that you can get the precision you want. You'll just need 64 bits to do it
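The same widen-then-shift idea works without the library. The sketch below assumes a hypothetical unsigned Q16.16 format (16 fractional bits in a uint32_t) purely for illustration; the type and function names are not part of fixedptc.

```c
#include <stdint.h>

#define UQ_FBITS 16              /* fractional bits (assumed format) */
typedef uint32_t uq16_16;        /* hypothetical unsigned Q16.16 value */

/* Convert a small non-negative integer to Q16.16. */
static uq16_16 uq_from_int(uint32_t v) { return v << UQ_FBITS; }

/* Multiply: widen to 64 bits, then drop the extra fractional bits. */
static uq16_16 uq_mul(uq16_16 a, uq16_16 b)
{
    return (uq16_16)(((uint64_t)a * b) >> UQ_FBITS);
}

/* Divide: pre-scale the dividend in 64 bits so the quotient keeps
   its fractional bits. The caller must ensure b != 0. */
static uq16_16 uq_div(uq16_16 a, uq16_16 b)
{
    return (uq16_16)(((uint64_t)a << UQ_FBITS) / b);
}
```

The 64-bit intermediate is the whole point: without it, the product or the pre-scaled dividend would lose its top bits before the shift or the division could bring it back into range.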
You won't get it any simpler than output = input * (range / 1073676289)
As noted below in the comments, if you are restricted to integer operations, then for range < 1073676289 the quotient range / 1073676289 is 0, so you would be good to go with:
output = range < 1073676289 ? 0 : input
If that is not what you wanted and you actually want precision, then
output = (input * range) / 1073676289
will be the way to go, provided the product is computed in a 64-bit type so that it does not overflow.
If you need to do a lot of those, then I suggest you use double and have your compiler vectorise your operations. Precision will be OK too.
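A batch version of the double approach might look like the sketch below. The function name scale_all is made up for illustration, IEEE 754 doubles are assumed, and whether the loop actually gets vectorised depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Scale n inputs (each 0..1073676289) to 0..range using double arithmetic. */
static void scale_all(const uint32_t *in, uint32_t *out, size_t n, uint32_t range)
{
    /* The ratio is the same for every element, so compute it once. */
    const double factor = (double)range / 1073676289.0;
    for (size_t i = 0; i < n; i++)
        out[i] = (uint32_t)((double)in[i] * factor + 0.5);  /* round to nearest */
}
```

Keeping the loop body free of branches and function calls gives the compiler the best chance of turning it into SIMD code.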

Picking good first estimates for Goldschmidt division

I'm calculating fixed-point reciprocals in Q22.10 with Goldschmidt division for use in my software rasterizer on ARM.
This is done by just setting the numerator to 1, i.e. the numerator becomes the scalar on the first iteration. To be honest, I'm kind of following the Wikipedia algorithm blindly here. The article says that if the denominator is scaled into the half-open range (0.5, 1.0], a good first estimate can be based on the denominator alone: let F be the estimated scalar and D be the denominator, then F = 2 - D.
But when doing this, I lose a lot of precision. Say I want to find the reciprocal of 512.00002f. In order to scale the number down, I lose 10 bits of precision in the fraction part, which are shifted out. So, my questions are:
Is there a way to pick a better estimate which does not require normalization? Why? Why not? A mathematical proof of why this is or is not possible would be great.
Also, is it possible to pre-calculate the first estimates so the series converges faster? Right now, it converges after the 4th iteration on average. On ARM this is about 50 cycles worst case, and that's not taking emulation of clz/bsr into account, nor memory lookups. If it's possible, I'd like to know if doing so increases the error, and by how much.
Here is my testcase. Note: The software implementation of clz on line 13 is from my post here. You can replace it with an intrinsic if you want. clz should return the number of leading zeros, and 32 for the value 0.
#include <stdio.h>
#include <stdint.h>
const unsigned int BASE = 22ULL;
static unsigned int divfp(unsigned int val, int* iter)
{
/* Numerator, denominator, estimate scalar and previous denominator */
unsigned long long N,D,F, DPREV;
int bitpos;
*iter = 1;
D = val;
/* Get the shift amount + is right-shift, - is left-shift. */
bitpos = 31 - clz(val) - BASE;
/* Normalize into the half-range (0.5, 1.0] */
if(0 < bitpos)
D >>= bitpos;
else
D <<= (-bitpos);
/* (FNi / FDi) == (FN(i+1) / FD(i+1)) */
/* F = 2 - D */
F = (2ULL<<BASE) - D;
/* N = F for the first iteration, because the numerator is simply 1.
So don't waste a 64-bit UMULL on a multiply with 1 */
N = F;
D = ((unsigned long long)D*F)>>BASE;
while(1){
DPREV = D;
F = (2<<(BASE)) - D;
D = ((unsigned long long)D*F)>>BASE;
/* Bail when we get the same value for two denominators in a row.
This means that the error is too small to make any further progress. */
if(D == DPREV)
break;
N = ((unsigned long long)N*F)>>BASE;
*iter = *iter + 1;
}
if(0 < bitpos)
N >>= bitpos;
else
N <<= (-bitpos);
return N;
}
int main(int argc, char* argv[])
{
double fv, fa;
int iter;
unsigned int D, result;
sscanf(argv[1], "%lf", &fv);
D = fv*(double)(1<<BASE);
result = divfp(D, &iter);
fa = (double)result / (double)(1UL << BASE);
printf("Value: %8.8lf 1/value: %8.8lf FP value: 0x%.8X\n", fv, fa, result);
printf("iteration: %d\n",iter);
return 0;
}
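The test case above calls clz() without defining it. For anyone who wants to compile it without an intrinsic, here is a portable stand-in that follows the stated contract (number of leading zeros, 32 for the value 0). It is a plain binary search written for illustration, not the implementation from the linked post, and it assumes unsigned int is 32 bits wide.

```c
/* Count leading zero bits of a 32-bit value; returns 32 for 0.
   Each step checks whether the top half of the remaining window is
   empty and, if so, shifts the value up and adds to the count. */
static int clz(unsigned int x)
{
    int n = 0;
    if (x == 0) return 32;
    if ((x & 0xFFFF0000u) == 0) { n += 16; x <<= 16; }
    if ((x & 0xFF000000u) == 0) { n += 8;  x <<= 8;  }
    if ((x & 0xF0000000u) == 0) { n += 4;  x <<= 4;  }
    if ((x & 0xC0000000u) == 0) { n += 2;  x <<= 2;  }
    if ((x & 0x80000000u) == 0) { n += 1; }
    return n;
}
```

It has to be placed before divfp so that the call in divfp sees the declaration.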
I could not resist spending an hour on your problem...
This algorithm is described in section 5.5.2 of "Arithmetique des ordinateurs" by Jean-Michel Muller (in French). It is actually a special case of Newton iterations with 1 as the starting point. The book gives a simple formulation of the algorithm to compute N/D, with D normalized into the range [1/2, 1):
e = 1 - D
Q = N
repeat K times:
Q = Q * (1+e)
e = e*e
The number of correct bits doubles at each iteration. In the case of 32 bits, 4 iterations will be enough. You can also iterate until e becomes too small to modify Q.
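To see the doubling in action, the pseudocode can be transcribed into C with doubles. This is a sketch for experimenting, not the fixed-point version; the function name div_iter is made up. After K passes the product of the (1+e) factors equals (1 - e^(2^K)) / D, so the relative error is e^(2^K) with the initial e = 1 - D. How many passes are needed therefore depends on how close D is to 1/2: for e = 1/2, four passes leave an error of 2^-16 and five leave 2^-32.

```c
#include <assert.h>

/* Approximate n/d with k passes of the multiplicative scheme above.
   Precondition (as in the text): 0.5 <= d < 1. */
static double div_iter(double n, double d, int k)
{
    assert(d >= 0.5 && d < 1.0);
    double e = 1.0 - d;
    double q = n;
    for (int i = 0; i < k; i++) {
        q = q * (1.0 + e);   /* Q = Q * (1+e) */
        e = e * e;           /* e = e*e       */
    }
    return q;   /* (n/d) * (1 - (1-d)^(2^k)), up to rounding */
}
```

For example, div_iter(1.0, 0.5, 4) returns 2 - 2^-15 rather than 2, which is the worst case for four passes.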
Normalization is used because it provides the max number of significant bits in the result. It is also easier to compute the error and number of iterations needed when the inputs are in a known range.
Once your input value is normalized, you don't need to bother with the value of BASE until you have the inverse. You simply have a 32-bit number X normalized in range 0x80000000 to 0xFFFFFFFF, and compute an approximation of Y=2^64/X (Y is at most 2^33).
This simplified algorithm may be implemented for your Q22.10 representation as follows:
// Fixed point inversion
// EB Apr 2010
#include <math.h>
#include <stdio.h>
// Number X is represented by integer I: X = I/2^BASE.
// We have (32-BASE) bits in integral part, and BASE bits in fractional part
#define BASE 22
typedef unsigned int uint32;
typedef unsigned long long int uint64;
// Convert FP to/from double (debug)
double toDouble(uint32 fp) { return fp/(double)(1<<BASE); }
uint32 toFP(double x) { return (int)floor(0.5+x*(1<<BASE)); }
// Return inverse of FP
uint32 inverse(uint32 fp)
{
if (fp == 0) return (uint32)-1; // invalid
// Shift FP to have the most significant bit set
int shl = 0; // normalization shift
uint32 nfp = fp; // normalized FP
while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
uint64 q = 0x100000000ULL; // 2^32
uint64 e = 0x100000000ULL - (uint64)nfp; // 2^32-NFP
int i;
for (i=0;i<4;i++) // iterate
{
// Both multiplications are actually
// 32x32 bits truncated to the 32 high bits
q += (q*e)>>(uint64)32;
e = (e*e)>>(uint64)32;
printf("Q=0x%llx E=0x%llx\n",q,e);
}
// Here, (Q/2^32) is the inverse of (NFP/2^32).
// We have 2^31<=NFP<2^32 and 2^32<Q<=2^33
return (uint32)(q>>(64-2*BASE-shl));
}
int main()
{
double x = 1.234567;
uint32 xx = toFP(x);
uint32 yy = inverse(xx);
double y = toDouble(yy);
printf("X=%f Y=%f X*Y=%f\n",x,y,x*y);
printf("XX=0x%08x YY=0x%08x XX*YY=0x%016llx\n",xx,yy,(uint64)xx*(uint64)yy);
}
As noted in the code, the multiplications are not full 32x32->64 bits. E initially fits in 32 bits and becomes smaller and smaller. Q always fits in 34 bits. We take only the high 32 bits of the products.
The derivation of 64-2*BASE-shl is left as an exercise for the reader :-). If it becomes 0 or negative, the result is not representable (the input value is too small).
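For readers who want to check their answer, here is one way to see it. It assumes, as the code comments state, that the loop leaves q close to 2^64/nfp; X is the real value being inverted, B stands for BASE and s for shl.

```latex
\begin{aligned}
\mathrm{fp} &= X \cdot 2^{B}, \qquad
\mathrm{nfp} = \mathrm{fp} \cdot 2^{s}, \qquad
q \approx \frac{2^{64}}{\mathrm{nfp}} \\
\mathrm{result} &= \frac{1}{X} \cdot 2^{B}
 = \frac{2^{2B}}{\mathrm{fp}}
 = \frac{2^{2B+s}}{\mathrm{nfp}}
 \approx q \cdot 2^{-(64 - 2B - s)}
\end{aligned}
```

Multiplying by 2^-(64-2B-s) is the right shift in the return statement. For the sample input 1.234567 with BASE = 22, the top set bit of fp is bit 22, so shl = 9 and the shift is 64 - 44 - 9 = 11.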
EDIT. As a follow-up to my comment, here is a second version with an implicit 32-th bit on Q. Both E and Q are now stored on 32 bits:
uint32 inverse2(uint32 fp)
{
if (fp == 0) return (uint32)-1; // invalid
// Shift FP to have the most significant bit set
int shl = 0; // normalization shift for FP
uint32 nfp = fp; // normalized FP
while ( (nfp & 0x80000000) == 0 ) { nfp <<= 1; shl++; } // use "clz" instead
int shr = 64-2*BASE-shl; // normalization shift for Q
if (shr <= 0) return (uint32)-1; // overflow
uint64 e = 1 + (0xFFFFFFFF ^ nfp); // 2^32-NFP, max value is 2^31
uint64 q = e; // 2^32 implicit bit, and implicit first iteration
int i;
for (i=0;i<3;i++) // iterate
{
e = (e*e)>>(uint64)32;
q += e + ((q*e)>>(uint64)32);
}
return (uint32)(q>>shr) + (1<<(32-shr)); // insert implicit bit
}
A couple of ideas for you, though none that solve your problem directly as stated.
Why this algo for division? Most divides I've seen in ARM use some variant of
adcs hi, den, hi, lsl #1
subcc hi, hi, den
adcs lo, lo, lo
repeated n bits times with a binary search off of the clz to determine where to start. That's pretty dang fast.
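In C terms, the underlying idea is a bit-at-a-time shift-and-subtract divide. The sketch below is not a translation of the ARM sequence above, and it always runs all 32 steps instead of using clz to skip the leading ones; the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit divide by repeated shift and conditional subtract.
   Returns the quotient; stores the remainder in *rem_out if non-null. */
static uint32_t udiv_shift_subtract(uint32_t num, uint32_t den, uint32_t *rem_out)
{
    assert(den != 0);
    uint64_t rem = 0;    /* can reach 33 bits before the subtract */
    uint32_t quot = 0;
    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next numerator bit. */
        rem = (rem << 1) | ((num >> bit) & 1u);
        /* If the divisor fits, subtract it and set this quotient bit. */
        if (rem >= den) {
            rem -= den;
            quot |= (uint32_t)1 << bit;
        }
    }
    if (rem_out) *rem_out = (uint32_t)rem;
    return quot;
}
```

Starting the loop at the numerator's highest set bit, found with clz, is what removes the wasted leading steps.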
If precision is a big problem, you are not limited to 32/64 bits for your fixed point representation. It'll be a bit slower, but you can do add/adc or sub/sbc to move values across registers. mul/mla are also designed for this kind of work.
Again, not direct answers for you, but possibly a few ideas to go forward with this. Seeing the actual ARM code would probably help me a bit as well.
Mads, you are not losing any precision at all. When you divide 512.00002f by 2^10, you merely decrease the exponent of your floating point number by 10. The mantissa remains the same. Unless, of course, the exponent hits its minimum value, but that shouldn't happen since you're scaling to (0.5, 1].
EDIT: OK, so you're using a fixed decimal point. In that case you should allow a different representation of the denominator in your algorithm. The value of D is in (0.5, 1] not only at the beginning but throughout the whole calculation (it's easy to prove that x * (2-x) < 1 for x < 1, since x * (2-x) = 1 - (1-x)^2). So you should represent the denominator with the decimal point at base = 32. This way you will have 32 bits of precision all the time.
EDIT: To implement this you'll have to change the following lines of your code:
//bitpos = 31 - clz(val) - BASE;
bitpos = 31 - clz(val) - 31;
...
//F = (2ULL<<BASE) - D;
//N = F;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
N = F >> (31 - BASE);
D = ((unsigned long long)D*F)>>31;
...
//F = (2<<(BASE)) - D;
//D = ((unsigned long long)D*F)>>BASE;
F = -D;
D = ((unsigned long long)D*F)>>31;
...
//N = ((unsigned long long)N*F)>>BASE;
N = ((unsigned long long)N*F)>>31;
Also in the end you'll have to shift N not by bitpos but some different value which I'm too lazy to figure out right now :).

Resources