function to convert float to int (huge integers) - c

This is a university question. Just to make sure :-) We need to implement (float)x
I have the following code which must convert integer x to its floating point binary representation stored in an unsigned integer.
unsigned float_i2f(int x) {
if (!x) return x;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 0;
int t = a;
while(t != 1) {
/* divide by two until t is 0*/
t >>= 1;
e++;
};
/* calculate mantissa */
int m = a << (32 - e);
/* logical right shift */
m = (m >> 9) & ~(((0x1 << 31) >> 9 << 1));
/* add bias for 32bit float */
e += 127;
int res = sign << 31;
res |= (e << 23);
res |= m;
/* lots of printf */
return res;
}
One problem I encounter now is that when my integers are too big then my code fails. I have this control procedure implemented:
float f = (float)x;
unsigned int r;
memcpy(&r, &f, sizeof(unsigned int));
This of course always produces the correct output.
Now when I do some test runs, this are my outputs (GOAL is what It needs to be, result is what I got)
:!make && ./btest -f float_i2f -1 0x80004999
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147464807] 10000000000000000100100110011001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111101101100
result: 11001110111111111111111101101100
GOAL: 11001110111111111111111101101101
So in this case, a 1 is added as the LSB.
Next case:
:!make && ./btest -f float_i2f -1 0x80000001
make: Nothing to be done for `all'.
Score Rating Errors Function
x: [-2147483647] 10000000000000000000000000000001
sign: 1
expone: 01001110100000000000000000000000
mantis: 00000000011111111111111111111111
result: 11001110111111111111111111111111
GOAL: 11001111000000000000000000000000
Here 1 is added to the exponent while the mantissa is the complement of it.
I tried hours to look ip up on the internet plus in my books etc but I can't find any references to this problem. I guess It has something to do with the fact that the mantissa is only 23 bits. But how do I have to handle it then?
EDIT: THIS PART IS OBSOLETE THANKS TO THE COMMENTS BELOW. int l must be unsigned l.
int x = 2147483647;
float f = (float)x;
int l = f;
printf("l: %d\n", l);
then l becomes -2147483648.
How can this happen? So C is doing the casting wrong?
Hope someone can help me here!
Thx
Markus
EDIT 2:
My updated code is now this:
unsigned float_i2f(int x) {
if (x == 0) return 0;
/* get sign of x */
int sign = (x>>31) & 0x1;
/* absolute value of x */
int a = sign ? ~x + 1 : x;
/* calculate exponent */
int e = 158;
int t = a;
while (!(t >> 31) & 0x1) {
t <<= 1;
e--;
};
/* calculate mantissa */
int m = (t >> 8) & ~(((0x1 << 31) >> 8 << 1));
m &= 0x7fffff;
int res = sign << 31;
res |= (e << 23);
res |= m;
return res;
}
I also figured out that the code works for all integers in the range -2^24, 2^24. Everything above/below sometimes works but mostly doesn't.
Something is missing, but I really have no idea what. Can anyone help me?

The answer printed is absolutely correct as it's totally dependent on the underlying representation of numbers being cast. However, If we understand the binary representation of the number, you won't get surprised with this result.
To understand an implicit conversion is associated with the assignment operator (ref C99 Standard 6.5.16). The C99 Standard goes on to say:
6.3.1.4 Real floating and integer
When a finite value of real floating type is converted to an integer type other than _Bool, the fractional part is discarded (i.e., the value is truncated toward zero). If the value of the integral part cannot be represented by the integer type, the behavior is undefined.
Your earlier example illustrates undefined behavior due to assigning a value outside the range of the destination type. Trying to assign a negative value to an unsigned type, not from converting floating point to integer.
The asserts in the following snippet ought to prevent any undefined behavior from occurring.
#include <limits.h>
#include <math.h>
unsigned int convertFloatingPoint(double v) {
double d;
assert(isfinite(v));
d = trunc(v);
assert((d>=0.0) && (d<=(double)UINT_MAX));
return (unsigned int)d;
}
Another way for doing the same thing, Create a union containing a 32-bit integer and a float. The int and float are now just different ways of looking at the same bit of memory;
union {
int myInt;
float myFloat;
} my_union;
my_union.myInt = 0x BFFFF2E5;
printf("float is %f\n", my_union.myFloat);
float is -1.999600
You are telling the compiler to take the number you have (large integer) and make it into a float, not to interpret the number AS float. To do that, you need to tell the compiler to read the number from that address in a different form, so this:
myFloat = *(float *)&myInt ;
That means, if we take it apart, starting from the right:
&myInt - the location in memory that holds your integer.
(float *) - really, I want the compiler use this as a pointer to float, not whatever the compiler thinks it may be.
* - read from the address of whatever is to the right.
myFloat = - set this variable to whatever is to the right.
So, you are telling the compiler: In the location of (myInt), there is a floating point number, now put that float into myFloat.

Related

How to check that a value fits in a type without invoking undefined behaviour?

I am looking to check if a double value can be represented as an int (or the same for any pair of floating point an integer types). This is a simple way to do it:
double x = ...;
int i = x; // potentially undefined behaviour
if ((double) i != x) {
// not representable
}
However, it invokes undefined behaviour on the marked line, and triggers UBSan (which some will complain about).
Questions:
Is this method considered acceptable in general?
Is there a reasonably simple way to do it without invoking undefined behaviour?
Clarifications, as requested:
The situation I am facing right now involves conversion from double to various integer types (int, long, long long) in C. However, I have encountered similar situations before, thus I am interested in answers both for float -> integer and integer -> float conversions.
Examples of how the conversion may fail:
Float -> integer conversion may fail is the value is not a whole number, e.g. 3.5.
The source value may be out of the range of the target type (larger or small than max and min representable values). For example 1.23e100.
The source values may be +-Inf or NaN, NaN being tricky as any comparison with it returns false.
Integer -> float conversion may fail when the float type does not have enough precision. For example, typical double have 52 binary digits compared to 63 digits in a 64-bit integer type. For example, on a typical 64-bit system, (long) (double) ((1L << 53) + 1L).
I do understand that 1L << 53 (as opposed to (1L << 53) + 1) is technically exactly representable as a double, and that the code I proposed would accept this conversion, even though it probably shouldn't be allowed.
Anything I didn't think of?
Create range limits exactly as FP types
The "trick" is to form the limits without loosing precision.
Let us consider float to int.
Conversion of float to int is valid (for example with 32-bit 2's complement int) for -2,147,483,648.9999... to 2,147,483,647.9999... or nearly INT_MIN -1 to INT_MAX + 1.
We can take advantage that integer_MAX is always a power-of-2 - 1 and integer_MIN is -(power-of-2) (for common 2's complement).
Avoid the limit of FP_INT_MIN_minus_1 as it may/may not be exactly encodable as a FP.
// Form FP limits of "INT_MAX plus 1" and "INT_MIN"
#define FLOAT_INT_MAX_P1 ((INT_MAX/2 + 1)*2.0f)
#define FLOAT_INT_MIN ((float) INT_MIN)
if (f < FLOAT_INT_MAX_P1 && f - FLOAT_INT_MIN > -1.0f) {
// Within range.
Use modff() to detect a fraction if desired.
}
More pedantic code would use !isnan(f) and consider non-2's complement encoding.
Using known limits and floating-point number validity. Check what's inside limits.h header.
You can write something like this:
#include <limits.h>
#include <math.h>
// Of course, constants used are specific to "int" type... There is others for other types.
if ((isnormal(x)) && (x>=INT_MIN) && (x<=INT_MAX) && (round(x)==x))
// Safe assignation from double to int.
i = (int)x ;
else
// Handle error/overflow here.
ERROR(.....) ;
Code relies on lazy boolean evaluation, obviously.
Please refer to IEEE 754 representation of floating point numbers in Memory
https://en.wikipedia.org/wiki/IEEE_754
Take double as an example:
Sign bit: 1 bit
Exponent: 11 bits
Fraction: 52 bits
There are three special values to point out here:
If the exponent is 0 and the fractional part of the mantissa is 0, the number is ±0
If the exponent is 2047 and the fractional part of the mantissa is 0, the number is ±∞
If the exponent is 2047 and the fractional part of the mantissa is non-zero, the number is NaN.
This is an example of convert from double to int on 64-bit, just for reference
#include <stdint.h>
#define EXPBITS 11
#define FRACTIONBITS 52
#define GENMASK(n) (((uint64_t)1 << (n)) - 1)
#define EXPBIAS GENMASK(EXPBITS - 1)
#define SIGNMASK (~GENMASK(FRACTIONBITS + EXPBITS))
#define EXPMASK (GENMASK(EXPBITS) << FRACTIONBITS)
#define FRACTIONMASK GENMASK(FRACTIONBITS)
int double_to_int(double src, int *dst)
{
union {
double d;
uint64_t i;
} y;
int exp;
int sign;
int maxbits;
uint64_t fraction;
y.d = src;
sign = (y.i & SIGNMASK) ? 1 : 0;
exp = (y.i & EXPMASK) >> FRACTIONBITS;
fraction = (y.i & FRACTIONMASK);
// 0
if (fraction == 0 && exp == 0) {
*dst = 0;
return 0;
}
exp -= EXPBIAS;
// not a whole number
if (exp < 0)
return -1;
// out of the range of int
maxbits = sizeof(*dst) * 8 - 1;
if (exp >= maxbits && !(exp == maxbits && sign && fraction == 0))
return -2;
// not a whole number
if (fraction & GENMASK(FRACTIONBITS - exp))
return -3;
// convert to int
*dst = src;
return 0;
}

Multiplying single precision float with 2

This is a homework question. I already found a lot of code online, including some code in StackOverflow. But I just want the concept not the code. I want to implement it myself. So the function I want to implement is:
float_twice - Return bit-level equivalent of expression 2*f for floating point argument f.
Both the argument and result are passed as unsigned int's, but they are to be interpreted as the bit-level representation of single-precision floating point values.
I want to know how to do this. I know floating point representation. And read wiki page on how to multiply two floats, but didn't understand it. I just want to know the concept/algorithm for it.
Edit:
Thanks everyone. Based on your suggestions I wrote the following code:
unsigned float_twice(unsigned uf) {
int s = (uf >> 31) << 31;
int e = ((uf >> 23) & 0xFF) << 23;
int f = uf & 0x7FFF;
// if exponent is all 1's then its a special value NaN/infinity
if (e == 0xFF000000){
return uf;
} else if (e > 0){ //if exponent is bigger than zero(not all zeros', not al 1's,
// then its in normal form, add a number to the exponent
return uf + (1 << 23);
} else { // if not exponent not all 1's and not bigger than zero, then its all
// 0's, meaning denormalized form, and we have to add one to fraction
return uf +1;
} //end of if
} //end of function
You could do something like this (although some would claim that it breaks strict-aliasing rules):
unsigned int func(unsigned int n)
{
float x = *(float*)&n;
x *= 2;
return *(unsigned int*)&x;
}
void test(float x)
{
unsigned int n = *(unsigned int*)&x;
printf("%08X\n",func(n));
}
In any case, you'll have to assert that the size of float is equal to the size of int on your platform.
If you just want to take an unsigned int operand and perform on it the equivalent operation of multiplying a float by 2, then you can simply add 1 to the exponent part of it (located in bits 20-30):
unsigned int func(unsigned int n)
{
return n+(1<<20);
}

Fixed point multiplication

I need to convert a value from one unit to another according to a non constant factor. The input value range from 0 to 1073676289 and the range value range from 0 to 1155625. The conversion can be described like this:
output = input * (range / 1073676289)
My own initial fixed point implementation feels a bit clumsy:
// Input values (examples)
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
Can my code be improved to be simpler or to have better accuracy?
This will give you the best precision with no floating point values and the result will be rounded to the nearest integer value:
output = (input * (long long) range + 536838144) / 1073676289;
The problem is that input * range would overflow a 32-bit integer. Fix that by using a 64-bit integer.
uint64_least_t tmp;
tmp = input;
tmp = tmp * range;
tmp = tmp / 1073676289ul;
output = temp;
A quick trip out to google brought http://sourceforge.net/projects/fixedptc/ to my attention
It's a c library in a header for managing fixed point math in 32 or 64 bit integers.
A little bit of experimentation with the following code:
#include <stdio.h>
#include <stdint.h>
#define FIXEDPT_BITS 64
#include "fixedptc.h"
int main(int argc, char ** argv)
{
unsigned int input = 536838144; // min 0, max 1073676289
unsigned int range = 1155625; // min 0, max 1155625
// Conversion
unsigned int tmp = (input >> 16) * ((range) >> 3u);
unsigned int output = (tmp / ((1073676289) >> 16u)) << 3u;
double output2 = (double)input * ((double)range / 1073676289.0);
uint32_t output3 = fixedpt_toint(fixedpt_xmul(fixedpt_fromint(input), fixedpt_xdiv(fixedpt_fromint(range), fixedpt_fromint(1073676289))));
printf("baseline = %g, better = %d, library = %d\n", output2, output, output3);
return 0;
}
Got me the following results:
baseline = 577812, better = 577776, library = 577812
Showing better precision (matching the floating point) than you were getting with your code. Under the hood it's not doing anything terribly complicated (and doesn't work at all in 32 bits)
/* Multiplies two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_mul(fixedpt A, fixedpt B)
{
return (((fixedptd)A * (fixedptd)B) >> FIXEDPT_FBITS);
}
/* Divides two fixedpt numbers, returns the result. */
static inline fixedpt
fixedpt_div(fixedpt A, fixedpt B)
{
return (((fixedptd)A << FIXEDPT_FBITS) / (fixedptd)B);
}
But it does show that you can get the precision you want. You'll just need 64 bits to do it
You won't get it any simpler then output = input * (range / 1073676289)
As noted below in the comments if you are restircted to integer operations then for range < 1073676289: range / 1073676289 == 0 so you would be good to go with:
output = range < 1073676289 ? 0 : input
If that is not what you wanted and you actually want precision then
output = (input * range) / 1073676289
will be the way to go.
If you need to do a lot of those then i suggest you use double and have your compiler vectorise your operations. Precision will be ok too.

Bit Rotation in C

The Problem: Exercise 2-8 of The C Programming Language, "Write a function rightrot(x,n) that returns the value of the integer x, rotated to the right by n positions."
I have done this every way that I know how. Here is the issue that I am having. Take a given number for this exercise, say 29, and rotate it right one position.
11101 and it becomes 11110 or 30. Let's say for the sake of argument that the system we are working on has an unsigned integer type size of 32 bits. Let's further say that we have the number 29 stored in an unsigned integer variable. In memory the number will have 27 zeros ahead of it. So when we rotate 29 right one using one of several algorithms mine is posted below, we get the number 2147483662. This is obviously not the desired result.
unsigned int rightrot(unsigned x, int n) {
return (x >> n) | (x << (sizeof(x) * CHAR_BIT) - n);
}
Technically, this is correct, but I was thinking that the 27 zeros that are in front of 11101 were insignificant. I have also tried a couple of other solutions:
int wordsize(void) { // compute the wordsize on a given machine...
unsigned x = ~0;
int b;
for(b = 0; x; b++)
x &= x-1;
return x;
}
unsigned int rightrot(unsigned x, int n) {
unsigned rbit;
while(n --) {
rbit = x >> 1;
x |= (rbit << wordsize() - 1);
}
return x;
This last and final solution is the one where I thought that I had it, I will explain where it failed once I get to the end. I am sure that you will see my mistake...
int bitcount(unsigned x) {
int b;
for(b = 0; x; b++)
x &= x-1;
return b;
}
unsigned int rightrot(unsigned x, int n) {
unsigned rbit;
int shift = bitcount(x);
while(n--) {
rbit = x & 1;
x >>= 1;
x |= (rbit << shift);
}
}
This solution gives the expected answer of 30 that I was looking for, but if you use a number for x like oh say 31 (11111), then there are issues, specifically the outcome is 47, using one for n. I did not think of this earlier, but if a number like 8 (1000) is used then mayhem. There is only one set bit in 8, so the shift is most certainly going to be wrong. My theory at this point is that the first two solutions are correct (mostly) and I am just missing something...
A bitwise rotation is always necessarily within an integer of a given width. In this case, as you're assuming a 32-bit integer, 2147483662 (0b10000000000000000000000000001110) is indeed the correct answer; you aren't doing anything wrong!
0b11110 would not be considered the correct result by any reasonable definition, as continuing to rotate it right using the same definition would never give you back the original input. (Consider that another right rotation would give 0b1111, and continuing to rotate that would have no effect.)
In my opinion, the spirit of the section of the book which immediately precedes this exercise would have the reader do this problem without knowing anything about the size (in bits) of integers, or any other type. The examples in the section do not require that information; I don't believe the exercises should either.
Regardless of my belief, the book had not yet introduced the sizeof operator by section 2.9, so the only way to figure the size of a type is to count the bits "by hand".
But we don't need to bother with all that. We can do bit rotation in n steps, regardless of how many bits there are in the data type, by rotating one bit at a time.
Using only the parts of the language that are covered by the book up to section 2.9, here's my implementation (with integer parameters, returning an integer, as specified by the exercise): Loop n times, x >> 1 each iteration; if the old low bit of x was 1, set the new high bit.
int rightrot(int x, int n) {
int lowbit;
while (n-- > 0) {
lowbit = x & 1; /* save low bit */
x = (x >> 1) & (~0u >> 1); /* shift right by one, and clear the high bit (in case of sign extension) */
if (lowbit)
x = x | ~(~0u >> 1); /* set the high bit if the low bit was set */
}
return x;
}
You could find the location of the first '1' in the 32-bit value using binary search. Then note the bit in the LSB location, right shift the value by the required number of places, and put the LSB bit in the location of the first '1'.
int bitcount(unsigned x) {
int b;
for(b = 0; x; b++)
x &= x-1;
return b;
}
unsigned rightrot(unsigned x,int n) {
int b = bitcount(x);
unsigned a = (x&~(~0<<n))<<(b-n+1);
x>> = n;
x| = a;
}

Converting double to float without relying on the FPU rounding mode

Does anyone have handy the snippets of code to convert an IEEE 754 double to the immediately inferior (resp. superior) float, without changing or assuming anything about the FPU's current rounding mode?
Note: this constraint probably implies not using the FPU at all. I expect the simplest way to do it in these conditions is to read the bits of the double in a 64-bit long and to work with that.
You can assume the endianness of your choice for simplicity, and that the double in question is available through the d field of the union below:
union double_bits
{
long i;
double d;
};
I would try to do it myself but I am certain I would introduce hard-to-notice bugs for denormalized or negative numbers.
I think the following works, but I will state my assumptions first:
floating-point numbers are stored in IEEE-754 format on your implementation,
No overflow,
You have nextafterf() available (it's specified in C99).
Also, most likely, this method is not very efficient.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
int main(int argc, char *argv[])
{
/* Change to non-zero for superior, otherwise inferior */
int superior = 0;
/* double value to convert */
double d = 0.1;
float f;
double tmp = d;
if (argc > 1)
d = strtod(argv[1], NULL);
/* First, get an approximation of the double value */
f = d;
/* Now, convert that back to double */
tmp = f;
/* Print the numbers. %a is C99 */
printf("Double: %.20f (%a)\n", d, d);
printf("Float: %.20f (%a)\n", f, f);
printf("tmp: %.20f (%a)\n", tmp, tmp);
if (superior) {
/* If we wanted superior, and got a smaller value,
get the next value */
if (tmp < d)
f = nextafterf(f, INFINITY);
} else {
if (tmp > d)
f = nextafterf(f, -INFINITY);
}
printf("converted: %.20f (%a)\n", f, f);
return 0;
}
On my machine, it prints:
Double: 0.10000000000000000555 (0x1.999999999999ap-4)
Float: 0.10000000149011611938 (0x1.99999ap-4)
tmp: 0.10000000149011611938 (0x1.99999ap-4)
converted: 0.09999999403953552246 (0x1.999998p-4)
The idea is that I am converting the double value to a float value—this could be less than or greater than the double value depending upon the rounding mode. When converted back to double, we can check if it is smaller or greater than the original value. Then, if the value of the float is not in the right direction, we look at the next float number from the converted number in the original number's direction.
To do this job more accurately than just re-combine mantissa and exponent bit's check this out:
http://www.mathworks.com/matlabcentral/fileexchange/23173
regards
I posted code to do this here: https://stackoverflow.com/q/19644895/364818 and copied it below for your convenience.
// d is IEEE double, but double is not natively supported.
static float ConvertDoubleToFloat(void* d)
{
unsigned long long x;
float f; // assumed to be IEEE float
unsigned long long sign ;
unsigned long long exponent;
unsigned long long mantissa;
memcpy(&x,d,8);
// IEEE binary64 format (unsupported)
sign = (x >> 63) & 1; // 1
exponent = ((x >> 52) & 0x7FF); // 11
mantissa = (x >> 0) & 0x000FFFFFFFFFFFFFULL; // 52
exponent -= 1023;
// IEEE binary32 format (supported)
exponent += 127; // rebase
exponent &= 0xFF;
mantissa >>= (52-23); // left justify
x = mantissa | (exponent << 23) | (sign << 31);
memcpy(&f,&x,4);
return f;
}

Resources