Compare 2 floats by their bitwise representation in C

I had this question on my exam and couldn't really solve it; I would appreciate some help.
Fill in the blanks only; the function must return true if and only if x<y.
Assume x and y cannot be NaN (but can be +-inf). No casting is allowed; use only ux, uy, sx, sy.
bool func(float x, float y) {
unsigned* uxp = ______________ ;
unsigned* uyp = ______________ ;
unsigned ux = *uxp;
unsigned uy = *uyp;
unsigned sx = (ux>>31);
unsigned sy = (uy>>31);
return ___________________________;
}

Presumably the assignment assumes float uses IEEE-754 binary32 and unsigned is 32 bits.
It is not proper to alias float objects with an unsigned type, although some C implementations support it. Instead, you can create a compound literal union, initialize its float member with the float value, and access its unsigned member. (This is supported by the C standard but not by C++.)
After that, it is simply a matter of dividing the comparison into cases depending on the sign bits:
#include <stdbool.h>
bool func(float x, float y) {
unsigned* uxp = & (union { float f; unsigned u; }) {x} .u;
unsigned* uyp = & (union { float f; unsigned u; }) {y} .u;
unsigned ux = *uxp;
unsigned uy = *uyp;
unsigned sx = (ux>>31);
unsigned sy = (uy>>31);
return
sx && sy ? uy < ux : // Negative values are in "reverse" order.
sx && !sy ? (uy | ux) & 0x7fffffffu : // Negative x is always less than positive y except for x = -0 and y = +0.
!sx && sy ? 0 : // Positive x is never less than negative y.
ux < uy ; // Positive values are in "normal" order.
}
#include <stdio.h>
int main(void)
{
// Print expected values and function values for comparison.
printf("1, %d\n", func(+3, +4));
printf("1, %d\n", func(-3, +4));
printf("0, %d\n", func(+3, -4));
printf("0, %d\n", func(-3, -4));
printf("0, %d\n", func(+4, +3));
printf("1, %d\n", func(-4, +3));
printf("0, %d\n", func(+4, -3));
printf("1, %d\n", func(-4, -3));
}
Sample output:
1, 1
1, 1
0, 0
0, 0
0, 0
1, 1
0, 0
1, 1
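The sample run above only covers ordinary values. As a hedged sketch, the same case analysis can be written with memcpy instead of the compound-literal union, which makes it easy to try the signed zeros as well. The names float_bits and less_by_bits are made up for this illustration, and the code assumes float is IEEE-754 binary32, as the answer does.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy the object representation of a float into a 32-bit unsigned integer.
   Assumes float is IEEE-754 binary32 (same assumption as the answer above). */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical helper: true iff x < y, for non-NaN x and y. */
static bool less_by_bits(float x, float y) {
    uint32_t ux = float_bits(x), uy = float_bits(y);
    uint32_t sx = ux >> 31, sy = uy >> 31;
    if (sx && sy)  return uy < ux;                         /* both negative: reversed order */
    if (sx && !sy) return ((ux | uy) & 0x7fffffffu) != 0;  /* false only for -0 versus +0 */
    if (!sx && sy) return false;                           /* non-negative is never below negative */
    return ux < uy;                                        /* both non-negative: normal order */
}
```

Calling less_by_bits(-0.0f, 0.0f) takes the second branch, and both magnitude fields are zero, so the result is false, matching -0 == +0 in floating-point comparison.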

Related

Check if a number is +-Inf or NaN

For robustness, I want to check whether a float is an IEEE-754 +-Inf or an IEEE-754 NaN. My code is below; I want to know if it is correct:
#define PLUS_INFINITE (1.0f/0.0f)
#define MINUS_INFINITE (-1.0f/0.0f)
#define NAN (0.0f/0.0f)
float Local_Var;
/*F is a float number.*/
if((unsigned long)(F) == 0x7f800000ul)
{
Local_Var = PLUS_INFINITE;
}
elseif((unsigned long)(F) == 0xff800000ul)
{
Local_Var = MINUS_INFINITE;
}
/*fraction = anything except all 0 bits (since all 0 bits represents infinity).*/
elseif((((unsigned long)(F) & 0x007ffffful) != 0ul )
&&((unsigned long)(F) == 0x7f800000ul))
||
(((unsigned long)(F) & 0x807ffffful) != 0ul )
&&
((unsigned long)(F) == 0xff800000ul))
{
Local_Var = NAN;
}
else{}
C99 has macros for the classification of floating-point numbers:
fpclassify(x) returns one of:
FP_NAN: x is not a number;
FP_INFINITE: x is plus or minus infinite;
FP_ZERO: x is zero;
FP_SUBNORMAL: x is too small to be represented in normalized format or
FP_NORMAL: normal floating-point number, i.e. none of the above.
There are also shortcuts that check for one of these classes and return non-zero if x belongs to it:
isfinite(x)
isnormal(x)
isnan(x)
isinf(x)
The argument x can be any floating-point expression; the macros detect the type of the argument and work for float and double.
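A minimal sketch of how the classification macro might be used follows. The helper name fp_class_name is invented for this example. Note that it takes a double, so a subnormal float passed to it is promoted and reported as normal; calling fpclassify directly on the float avoids that.

```c
#include <math.h>

/* Hypothetical helper: name the C99 floating-point class of x. */
static const char *fp_class_name(double x) {
    switch (fpclassify(x)) {
    case FP_NAN:       return "nan";
    case FP_INFINITE:  return "infinite";
    case FP_ZERO:      return "zero";
    case FP_SUBNORMAL: return "subnormal";
    case FP_NORMAL:    return "normal";
    default:           return "unknown"; /* an implementation may define extra classes */
    }
}
```

For example, fp_class_name(1.0) yields "normal" and fp_class_name(INFINITY) yields "infinite".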
EDIT: Since you don't want to use (or cannot use) <math.h>, you could use other properties of nan and inf to classify your numbers:
nan compares false to all numbers, including to itself;
inf is greater than FLT_MAX;
-inf is smaller than -FLT_MAX.
So:
#include <stdlib.h>
#include <stdio.h>
#include <float.h>
int main()
{
float f[] = {
0.0, 1.0, FLT_MAX, 0.0 / 0.0, 1.0/0.0, -1.0/0.0
};
int i;
for (i = 0; i < 6; i++) {
float x = f[i];
int is_nan = (x != x);
int is_inf = (x < -FLT_MAX || x > FLT_MAX);
printf("%20g%4d%4d\n", x, is_nan, is_inf);
}
return 0;
}
In this solution, you must adapt the limits if you want to use double.
Casting floats to longs like that is wrong. It should be either a union, or a type-punned pointer.
Here's a working example from dietlibc (with doubles):
https://github.com/ensc/dietlibc/blob/master/lib/__isinf.c
https://github.com/ensc/dietlibc/blob/master/lib/__isnan.c
Musl has a shorter fpclassify, and also proper constants for floats:
http://git.musl-libc.org/cgit/musl/tree/src/math/__fpclassifyf.c
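To illustrate why the cast in the question does not look at the bits, here is a small sketch contrasting a value conversion with a bit reinterpretation. The function names are hypothetical, and the second function assumes float is IEEE-754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Value conversion: what a cast such as (unsigned long)F does.
   The fractional part is discarded, and the behavior is undefined when the
   truncated value does not fit (negative, too large, infinite or NaN). */
static unsigned long value_of(float f) {
    return (unsigned long)f;
}

/* Bit reinterpretation: copy the 32 bits of the float as stored.
   Assumes float is IEEE-754 binary32. */
static uint32_t bits_of(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

For 1.0f the first function returns 1, while the second returns 0x3f800000, which is the pattern the masks in the question are meant to be compared against.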
It is best to use the fpclassify() macros from @M Oehm's answer.
Alternatives:
float F;
if (F <= FLT_MAX) {
if (F >= -FLT_MAX) {
puts("Finite");
} else {
puts("-Infinity");
}
} else {
if (F > 0) {
puts("+Infinity");
} else {
puts("NaN");
}
}
If the code wants to inspect the bits directly, assuming float is in binary32 format:
assert(sizeof (float) == sizeof (uint32_t));
union {
float f;
uint32_t u32;
} x;
x.f = F;
The masks depend on the relative endianness of float and uint32_t, which is usually the same.
// Is F one of the 3 special: +inf, -inf, NaN?
// Parentheses are required: == binds tighter than &.
if ((x.u32 & 0x7F800000) == 0x7F800000) {
if (x.u32 & 0x007FFFFF) {
puts("NaN");
} else if (x.u32 & 0x80000000) {
puts("-Inf");
} else {
puts("+Inf");
}
}

casting signed to double different result than casting to float then double

As part of an assignment I am working out whether the expression (double) (float) x == (double) x
always returns 1 or not (x is a signed integer).
It works for every value except INT_MAX, and I was wondering why. If I print the values, they both show the same value, even for INT_MAX.
x = INT_MAX ;
printf("Signed X: %d\n",x);
float fx1 = (float)x;
double dx1 = (double)x;
double dfx = (double)(float)x;
printf("(double) x: %g\n",dx1);
printf("(float) x: %f \n",fx1);
printf("(double)(float)x: %g\n",dfx);
if((double) (float) x == (double) x){
printf("RESULT:%d\n", ((double)(float) x == (double) x));
}
EDIT: the entire program:
#include<stdio.h>
#include<stdlib.h>
#include<limits.h>
int main(int argc, char *argv[]){
//create random values
int x = INT_MAX ;
printf("Signed X: %d\n",x);
float fx1 = (float)x;
double dx1 = (double)x;
double dfx = (double)(float)x;
printf("(double) x: %g\n",dx1);
printf("(float) x: %f \n",fx1);
printf("(double)(float)x: %g\n",dfx);
if((double) (float) x == (double) x){
printf("RESULT:%d\n", ((double)(float) x == (double) x));
}
return 0;
}//end of main function
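int and float most likely have the same number of bits in their representation, namely 32. A float, however, splits those bits into a sign bit, an exponent and a significand (mantissa), so the significand has fewer than 31 bits (24 in IEEE-754 binary32, counting the implicit leading bit). INT_MAX = 2^31 - 1 needs 31 significant bits, so converting it to float rounds it to 2^31, while the conversion to double is exact. The printed values look the same only because %g shows six significant digits.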
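The effect is not limited to INT_MAX. As a rough sketch (the helper name survives_float is invented here, and the code assumes a 32-bit int and an IEEE-754 binary32 float), the round trip can be checked for any value:

```c
/* Hypothetical helper: does x survive an int -> float -> double round trip?
   Assumes float is IEEE-754 binary32 with a 24-bit significand. */
static int survives_float(int x) {
    volatile float f = (float)x;   /* volatile forces rounding to float precision */
    return (double)f == (double)x;
}
```

With these assumptions 16777216 (2^24) survives, but 16777217 does not, because it needs 25 significant bits. INT_MAX fails for the same reason.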

Function to scale float values to (0-100) in C

I am trying to convert a float variable into an integer of value between 0 and 100. The float is always positive. the corresponding integer value should reflect the size of the float value compared to the maximum value for a 32-bit float, e.g. 0.0 converts to 0 and 3.402823466 E + 38 converts to a 100, and anything else goes in between.
Here is what I have so far but I keep getting -1 as the output for any non-zero input.
int convFloat(float x){
int y;
y = (int) (x/3.4e38) * 100;
return y;
}
What am I doing wrong here?
This:
y = (int) (x/3.4e38) * 100;
// ^--------------^
// cast (x/3.4e38)to int
Should be:
y = (int) ((x/3.4e38) * 100);
// ^----------------------^
// cast ((x/3.4e38) * 100)to int
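Put together, the corrected function might look like the sketch below. It swaps the literal 3.4e38 for FLT_MAX from <float.h> and does the arithmetic in double; both are choices made for this illustration rather than part of the answer above. The name conv_float is likewise hypothetical.

```c
#include <float.h>

/* Map a non-negative finite float onto 0..100 relative to FLT_MAX.
   The multiplication happens before the conversion to int, so the
   fraction is not truncated to 0 first. */
static int conv_float(float x) {
    return (int)(((double)x / FLT_MAX) * 100.0);
}
```

Here conv_float(0.0f) gives 0 and conv_float(FLT_MAX) gives 100. Because the scale is linear, almost every everyday value maps to 0.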
((union { float f; uint32_t u; }){ val }.u>>23&255)*100/255

How to check if float can be exactly represented as an integer

I'm looking for a reasonably efficient way of determining if a floating point value (double) can be exactly represented by an integer data type (long, 64 bit).
My initial thought was to check the exponent to see if it was 0 (or more precisely 127). But that won't work because 2.0 would be e=1 m=1...
So basically, I am stuck. I have a feeling that I can do this with bit masks, but I'm just not getting my head around how to do that at this point.
So how can I check to see if a double is exactly representable as a long?
Thanks
I think I have found a way to clamp a double into an integer in a standard-conforming fashion (this is not really what the question is about, but it helps a lot). First, we need to see why the obvious code is not correct.
// INCORRECT CODE
uint64_t double_to_uint64 (double x)
{
if (x < 0.0) {
return 0;
}
if (x > UINT64_MAX) {
return UINT64_MAX;
}
return x;
}
The problem here is that in the second comparison, UINT64_MAX is being implicitly converted to double. The C standard does not specify exactly how this conversion works, only that it is to be rounded up or down to a representable value. This means that the second comparison may be false, even if it should mathematically be true (which can happen when UINT64_MAX is rounded up, and 'x' is mathematically between UINT64_MAX and (double)UINT64_MAX). As such, the conversion of double to uint64_t can result in undefined behavior in that edge case.
Surprisingly, the solution is very simple. Consider that while UINT64_MAX may not be exactly representable in a double, UINT64_MAX+1, being a power of two (and not too large), certainly is. So, if we first round the input to an integer, the comparison x > UINT64_MAX is equivalent to x >= UINT64_MAX+1, except for possible overflow in the addition. We can fix the overflow by using ldexp instead of adding one to UINT64_MAX. That being said, the following code should be correct.
/* Input: a double 'x', which must not be NaN.
* Output: If 'x' is less than zero, then zero;
* otherwise, if 'x' is greater than UINT64_MAX, then UINT64_MAX;
* otherwise, 'x', rounded down to an integer.
*/
uint64_t double_to_uint64 (double x)
{
assert(!isnan(x));
double y = floor(x);
if (y < 0.0) {
return 0;
}
if (y >= ldexp(1.0, 64)) {
return UINT64_MAX;
}
return y;
}
Now, back to your question: is x exactly representable in a uint64_t? Only if it was neither rounded nor clamped.
/* Input: a double 'x', which must not be NaN.
* Output: If 'x' is exactly representable in an uint64_t,
* then 1, otherwise 0.
*/
int double_representable_in_uint64 (double x)
{
assert(!isnan(x));
return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 64));
}
The same algorithm can be used for integers of different size, and also for signed integers with a minor modification. The code that follows does some very basic tests of the uint32_t and uint64_t versions (only false positives can possibly be caught), but is also suitable for manual examination of the edge cases.
#include <inttypes.h>
#include <math.h>
#include <limits.h>
#include <assert.h>
#include <stdio.h>
uint32_t double_to_uint32 (double x)
{
assert(!isnan(x));
double y = floor(x);
if (y < 0.0) {
return 0;
}
if (y >= ldexp(1.0, 32)) {
return UINT32_MAX;
}
return y;
}
uint64_t double_to_uint64 (double x)
{
assert(!isnan(x));
double y = floor(x);
if (y < 0.0) {
return 0;
}
if (y >= ldexp(1.0, 64)) {
return UINT64_MAX;
}
return y;
}
int double_representable_in_uint32 (double x)
{
assert(!isnan(x));
return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 32));
}
int double_representable_in_uint64 (double x)
{
assert(!isnan(x));
return (floor(x) == x && x >= 0.0 && x < ldexp(1.0, 64));
}
int main ()
{
{
printf("Testing 32-bit\n");
for (double x = 4294967295.999990; x < 4294967296.000017; x = nextafter(x, INFINITY)) {
uint32_t y = double_to_uint32(x);
int representable = double_representable_in_uint32(x);
printf("%f -> %" PRIu32 " representable=%d\n", x, y, representable);
assert(!representable || (double)(uint32_t)x == x);
}
}
{
printf("Testing 64-bit\n");
double x = ldexp(1.0, 64) - 40000.0;
for (double x = 18446744073709510656.0; x < 18446744073709629440.0; x = nextafter(x, INFINITY)) {
uint64_t y = double_to_uint64(x);
int representable = double_representable_in_uint64(x);
printf("%f -> %" PRIu64 " representable=%d\n", x, y, representable);
assert(!representable || (double)(uint64_t)x == x);
}
}
}
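The signed variant mentioned above is not spelled out in the answer, so the following is only a sketch of what the "minor modification" could look like for int64_t: the lower bound becomes -2^63 (inclusive) and the upper bound 2^63 (exclusive). Instead of floor() it uses a truncating round trip, which is safe here only because the range is checked first. The function name is hypothetical.

```c
#include <stdint.h>

/* Sketch: 1 if x is exactly representable in int64_t, otherwise 0. */
static int double_representable_in_int64(double x) {
    /* 0x1p63 is 2^63 written as a hexadecimal floating constant (C99). */
    if (!(x >= -0x1p63 && x < 0x1p63)) {
        return 0;               /* out of range, infinite, or NaN */
    }
    /* In range, so the conversion to int64_t is well defined (it truncates). */
    return (double)(int64_t)x == x;
}
```

Unlike the unsigned versions above, this one does not assert on NaN; a NaN simply fails the range test and yields 0.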
Here's one method that could work in most cases. I'm not sure if/how it will break if you give it NaN, INF, very large (overflow) numbers...
(Though I think they will all return false - not exactly representable.)
You could:
Cast it to an integer.
Cast it back to a floating-point.
Compare with original value.
Something like this:
double val = ... ; // Value
if ((double)(long long)val == val){
// Exactly representable
}
floor() and ceil() are also fair game (though they may fail if the value overflows an integer):
floor(val) == val
ceil(val) == val
And here's a messy bit-mask solution:
This uses union type-punning and assumes IEEE double-precision. Union type-punning is only explicitly sanctioned in C99 TC3 and later.
int representable(double x){
// Handle corner cases:
if (x == 0)
return 1;
// -2^63 is representable as a signed 64-bit integer, but +2^63 is not.
if (x == -9223372036854775808.)
return 1;
// Warning: Union type-punning is only explicitly sanctioned in C99 TC3 or later.
union{
double f;
uint64_t i;
} val;
val.f = x;
uint64_t exp = val.i & 0x7ff0000000000000ull;
uint64_t man = val.i & 0x000fffffffffffffull;
man |= 0x0010000000000000ull; // Implicit leading 1-bit.
int shift = (exp >> 52) - 1075;
// Out of range
if (shift < -52 || shift > 10)
return 0;
// Test mantissa
if (shift < 0){
shift = -shift;
return ((man >> shift) << shift) == man;
}else{
return ((man << shift) >> shift) == man;
}
}
You can use the modf function to split a float into the integer and fraction parts. modf is in the standard C library.
#include <math.h>
#include <limits.h>
double val = ...
double i;
long l;
/* check if fractional part is 0 */
if (modf(val, &i) == 0.0) {
/* val is an integer. check if it can be stored in a long */
if (val >= LONG_MIN && val <= LONG_MAX) {
/* can be exactly represented by a long */
l = val;
}
}
How to check if float can be exactly represented as an integer ?
I'm looking for a reasonably efficient way of determining if a floating point value double can be exactly represented by an integer data type long, 64 bit.
Range (LONG_MIN, LONG_MAX) and fraction (modf()) tests are needed. Also watch out for not-a-numbers.
The usual idea is to test like (double)(long)x == x, but to avoid its direct usage. (long)x, when x is out of range, is undefined behavior (UB).
The valid range of conversion for (long)x is LONG_MIN - 1 < x < LONG_MAX + 1 as code discards any fractional part of x during the conversion. So code needs to test, using FP math, if x is in range.
#include <limits.h>
#include <math.h>
#include <stdbool.h>
#define DBL_LONG_MAXP1 (2.0*(LONG_MAX/2+1))
#define DBL_LONG_MINM1 (2.0*(LONG_MIN/2-1))
bool double_to_long_exact_possible(double x) {
if (x < DBL_LONG_MAXP1) {
double whole_number_part;
// modf() splits x into whole and fractional parts; frexp() would not.
if (modf(x, &whole_number_part) != 0.0) {
return false; // Fractional part exist.
}
#if -LONG_MAX == LONG_MIN
// rare non-2's complement machine
return x > DBL_LONG_MINM1;
#else
return x - LONG_MIN > -1.0;
#endif
}
return false; // Too large or NaN
}
Any IEEE floating-point double or float value with a magnitude at or above 2^52 or 2^23, respectively, will be a whole number. Adding 2^52 or 2^23 to a positive number whose magnitude is less than that will cause it to be rounded to a whole number. Subtracting the value that was added will yield a whole number which will equal the original iff the original was a whole number. Note that this algorithm will fail with some numbers larger than 2^52, but it isn't needed for numbers that big.
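A minimal sketch of that add-then-subtract idea for double follows. It assumes IEEE-754 binary64, the default round-to-nearest mode, and an input in the range 0 <= x < 2^52; the function name is made up for the example.

```c
/* Hypothetical helper: 1 if x (with 0 <= x < 2^52) is a whole number. */
static int is_whole_below_2p52(double x) {
    const double big = 4503599627370496.0;  /* 2^52 */
    /* The sum is rounded to a whole number. volatile keeps the compiler
       from using excess precision or folding the two operations away. */
    volatile double t = x + big;
    t = t - big;
    return t == x;
}
```

For 0.5 the sum rounds to 2^52, the subtraction gives 0, and the comparison with 0.5 fails, so the function returns 0.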
Could you use a modulus operation to check if the double is divisible by one... or am I completely misunderstanding the question? (In C the % operator only accepts integer operands, so this needs fmod from <math.h>.)
double val = ... ; // Value
if(fmod(val, 1.0) == 0.0) {
// val is evenly divisible by 1 and is therefore a whole number
}

Subtraction without minus sign in C

How can I subtract two integers in C without the - operator?
int a = 34;
int b = 50;
You can convert b to negative value using negation and adding 1:
int c = a + (~b + 1);
printf("%d\n", c);
-16
This is two's complement negation. The processor does this when you use the '-' operator to negate a value or subtract it.
Converting a float is simpler: just flip the sign bit (shoosh gave an example of how to do this).
EDIT:
Ok, guys. I give up. Here is my compiler independent version:
#include <stdio.h>
unsigned int adder(unsigned int a, unsigned int b) {
unsigned int loop = 1;
unsigned int sum = 0;
unsigned int ai, bi, ci;
while (loop) {
ai = a & loop;
bi = b & loop;
ci = sum & loop;
sum = sum ^ ai ^ bi; // add i-th bit of a and b, and add carry bit stored in sum i-th bit
loop = loop << 1;
if ((ai&bi)|(ci&ai)|(ci&bi)) sum = sum^loop; // add carry bit
}
return sum;
}
unsigned int sub(unsigned int a, unsigned int b) {
return adder(a, adder(~b, 1)); // add negation + 1 (two's complement here)
}
int main() {
unsigned int a = 35;
unsigned int b = 40;
printf("%u - %u = %d\n", a, b, sub(a, b)); // printf function isn't compiler independent here
return 0;
}
I'm using unsigned int so that any compiler will treat it the same.
If you want to subtract negative values, do it this way:
unsigned int negative15 = adder(~15u, 1);
Now we are completely independent of signed-value conventions. In my approach all ints are stored as two's complement, so you have to be careful with bigger ints (they have to start with a 0 bit).
Pontus is right, 2's complement is not mandated by the C standard (even if it is the de facto hardware standard). +1 for Phil's creative answers; here's another approach to getting -1 without using the standard library or the -- operator.
C mandates three possible representations, so you can sniff which is in operation and get a different -1 for each:
negation= ~1;
if (negation+1==0) /* one's complement arithmetic */
minusone= ~1;
else if (negation+2==0) /* two's complement arithmetic */
minusone= ~0;
else /* sign-and-magnitude arithmetic */
minusone= ~0x7FFFFFFE;
r= a+b*minusone;
The value 0x7FFFFFFE would depend on the width (number of ‘value bits’) of the type of integer you were interested in; if unspecified, you have more work to find that out!
+ No bit setting
+ Language independent
+ Can be adjusted for different number types (int, float, etc)
- Almost certainly not your C homework answer (which is likely to be about bits)
Expand a-b:
a-b = a + (-b)
= a + (-1).b
Manufacture -1:
float:             pi = 2.0*asin(1.0);   // asin(1.0) alone is pi/2
(with    minusone_flt = sin(3.0/2.0*pi);
math.h)            or = cos(pi)
                   or = log10(0.1)
complex: minusone_cpx = (0,1)**2; // i squared
integer: minusone_int = 0; minusone_int--; // or convert one of the floats above
+ No bit setting
+ Language independent
+ Independent of number type (int, float, etc)
- Requires a>b (ie positive result)
- Almost certainly not your C homework answer (which is likely to be about bits)
a - b = c
restricting ourselves to the number space 0 <= c < (a+b):
(a - b) mod(a+b) = c mod(a+b)
a mod(a+b) - b mod(a+b) = c mod(a+b)
simplifying the second term:
(-b).mod(a+b) = (a+b-b).mod(a+b)
= a.mod(a+b)
substituting:
a.mod(a+b) + a.mod(a+b) = c.mod(a+b)
2a.mod(a+b) = c.mod(a+b)
if a>b, then a-b>0, so:
c.mod(a+b) = c
c = 2a.mod(a+b)
So, if a is always greater than b, then this would work.
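The derivation above can be tried out with a few lines of C. This is only a sketch: the name sub_by_mod is invented, unsigned is used to keep the arithmetic well defined, and it assumes a >= b > 0 and that 2a does not wrap around. The case b = 0 has to be excluded because the modulus would then be a itself and the result would be 0 rather than a.

```c
/* Sketch of the identity a - b = (2a) mod (a + b).
   Assumes a >= b > 0 and that a + a does not wrap around. */
static unsigned sub_by_mod(unsigned a, unsigned b) {
    return (a + a) % (a + b);
}
```

For a = 50 and b = 34 this computes 100 mod 84, which is 16.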
Given that encoding integers to support two's complement is not mandated in C, iterate until done. If they want you to jump through flaming hoops, no need to be efficient about it!
int subtract(int a, int b)
{
if ( b < 0 )
return a+abs(b);
while (b-- > 0)
--a;
return a;
}
Silly question... probably silly interview!
For subtracting in C two integers you only need:
int subtract(int a, int b)
{
return a + (~b) + 1;
}
I don't believe that there is a simple and elegant solution for float or double numbers as there is for integers. So you can transform your float numbers into arrays and apply an algorithm similar to the one simulated here
If you want to do it for floats, start from a positive number and change its sign bit like so:
float f = 3;
*(int*)&f |= 0x80000000;
// now f is -3.
float m = 4 + f;
// m = 1
You can also do this for doubles using the appropriate 64 bit integer. in visual studio this is __int64 for instance.
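The pointer cast above accesses a float through an int pointer, which many compilers tolerate but the aliasing rules do not guarantee. A hedged alternative sketch uses memcpy and flips the sign bit with XOR, so it also handles a negative operand. The function names are hypothetical and the code assumes float is IEEE-754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Return x with its sign bit flipped, without using the minus operator.
   Assumes float is IEEE-754 binary32. */
static float flip_sign(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bits */
    bits ^= UINT32_C(0x80000000);     /* bit 31 is the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* a - b computed as a + (-b). */
static float float_sub(float a, float b) {
    return a + flip_sign(b);
}
```

With this, float_sub(4.0f, 3.0f) evaluates 4 + (-3) and gives 1.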
I suppose this
b - a = ~( a + ~b)
Assembly (accumulator) style:
int result = a;
result -= b;
As the question asked for integers not ints, you could implement a small interpreter that uses Church numerals.
Create a lookup table for every possible case of int-int!
Not tested. Without using 2's complement:
#include <stdlib.h>
#include <stdio.h>
int sillyNegate(int x) {
if (x <= 0)
return abs(x);
else {
// setlocale(LC_ALL, "C"); // if necessary.
char buffer[256];
snprintf(buffer, 255, "%c%d", 0x2d, x);
sscanf(buffer, "%d", &x);
return x;
}
}
Assuming the length of an int is much less than 255, and the snprintf/sscanf round-trip won't produce any unspecified behavior (right? right?).
The subtraction can be computed using a - b == a + (-b).
Alternative:
#include <math.h>
int moreSillyNegate(int x) {
return x * ilogb(0.5); // ilogb(0.5) == -1;
}
This would work using integer overflow:
#include<limits.h>
int subtractWithoutMinusSign(int a, int b){
return a + (b * (INT_MAX + INT_MAX + 1));
}
This also works for floats (assuming you make a float version…)
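Signed overflow such as INT_MAX + INT_MAX + 1 is undefined behavior in C, so the function above depends on what the compiler happens to do. A variant of the same idea that stays within defined behavior uses unsigned arithmetic, where UINT_MAX plays the role of -1 modulo 2^N. This is a sketch with an invented function name, not the answer's own code.

```c
#include <limits.h>

/* Sketch: multiply by UINT_MAX, which acts as -1 modulo 2^N.
   Unsigned arithmetic wraps by definition, so nothing here is undefined. */
static unsigned sub_wrap(unsigned a, unsigned b) {
    return a + b * UINT_MAX;
}
```

When a < b the result is the difference reduced modulo 2^N; for example sub_wrap(35u, 40u) is UINT_MAX - 4, which corresponds to -5 if the value is reinterpreted as two's complement.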
Within the range of the data type, the bitwise complement (~) of a value is its negation decreased by 1 (on a two's complement machine), for example:
~1 --------> -2
~2 --------> -3
and so on. This small code snippet demonstrates the observation:
#include<stdio.h>
int main()
{
int a , b;
a=10;
b=~a; // b-----> -11
printf("%d\n",a+~b+1);// equivalent to a-b
return 0;
}
Output: 21
Note: This is valid only within the range of the data type; for a 32-bit int, for example, the rule applies only to values in the range [-2,147,483,648, 2,147,483,647].
Thank you. I hope this helps.
Iff:
the minuend is greater than or equal to 0, or
the subtrahend is greater than or equal to 0, or
the subtrahend and the minuend are both less than 0,
multiply the subtrahend by -1 and add the result to the minuend:
MIN + (SUB * -1)
Else multiply the subtrahend by 1 and add the result to the minuend:
MIN + (SUB * 1)
Example:
#include <stdio.h>
int subtract (int a, int b)
{
if ( a >= 0 || b >= 0 || ( a < 0 && b < 0 ) )
{
return a + (b * -1);
}
return a + (b * 1);
}
int main (void)
{
int x = -1;
int y = -5;
printf("%d - %d = %d", x, y, subtract(x, y) );
}
Output:
-1 - -5 = 4
int num1, num2, count = 0;
Console.WriteLine("Enter two numbers");
num1 = int.Parse(Console.ReadLine());
num2 = int.Parse(Console.ReadLine());
if (num1 < num2)
{
num1 = num1 + num2;
num2 = num1 - num2;
num1 = num1 - num2;
}
for (; num2 < num1; num2++)
{
count++;
}
Console.WriteLine("The difference is " + count);
#include <stdio.h>
int main(void)
{
int a=5;
int b=7;
while(b--)a--; // decrement a once per unit of b (assumes b >= 0)
printf("sub=%d\n",a);
return 0;
}
