Finding the smallest integer that can not be represented as an IEEE-754 32 bit float [duplicate] - c

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Which is the first integer that an IEEE 754 float is incapable of representing exactly?
Firstly, this IS a homework question, just to clear this up immediately. I'm not looking for a spoon fed solution of course, just maybe a little pointer to the right direction.
So, my task is to find the smallest positive integer that can not be represented as an IEEE-754 float (32 bit). I know that testing for equality on something like "5 == 5.00000000001" will fail, so I thought I'd simply loop over all the numbers and test for that in this fashion:
#include <stdio.h>
int main(int argc, char **argv)
{
unsigned int i; /* Loop counter. No need to initialize here. */
/* Header output */
printf("IEEE floating point rounding failure detection\n\n");
/* Main program processing */
/* Loop over every integer number */
for (i = 0;; ++i)
{
float result = (float)i;
/* TODO: Break condition for integer wrapping */
/* Test integer representation against the IEEE-754 representation */
if (result != i)
break; /* Break the loop here */
}
/* Result output */
printf("The smallest integer that can not be precisely represented as IEEE-754"
" is:\n\t%u\n", i);
return 0;
}
This failed. Then I tried to subtract the integer "i" from the floating point "result" that is "i" hoping to achieve something of a "0.000000002" that I could try and detect, which failed, too.
Can someone point out a property of floating-point numbers that I can rely on to get the desired break condition?
-------------------- Update below ---------------
Thanks for help on this one! I learned multiple things here:
My original thought was indeed correct and determined the result on the machine it was intended to be run on (Solaris 10, 32 bit), yet failed to work on my Linux systems (64 bit and 32 bit).
The changes that Hans Passant suggested made the program work on my systems as well; there seem to be some platform differences going on here that I didn't expect.
Thanks to everyone!

The problem is that your equality test is a floating-point test. The i variable is converted to float first, and that of course produces the same float. Convert the float back to int to get an integer equality test:
float result = (float)i;
int truncated = (int)result;
if (truncated != i) break;
If it starts with the digits 16 then you found the right one. Convert it to hex and explain why that was the one that failed for a grade bonus.

I think you should reason about the representation of floating-point numbers as (base, sign, significand, exponent).
Here is an excerpt from Wikipedia that can give you a clue:
A given format comprises:
* Finite numbers, which may be either base 2 (binary) or base 10 (decimal). Each finite number is most simply described by three integers: s = a sign (zero or one), c = a significand (or 'coefficient'), q = an exponent. The numerical value of a finite number is (−1)^s × c × b^q, where b is the base (2 or 10). For example, if the sign is 1 (indicating negative), the significand is 12345, the exponent is −3, and the base is 10, then the value of the number is −12.345.

That would be FLT_MAX+1. See float.h.
Edit: or actually not. Check the modf() function in math.h

Related

Decimal To Binary Conversion in C using For

I am not able to convert from decimal to binary in C. Every time, I get an output that is one less than the desired output. For example, 5 should be 101 but shows up as 100, and 4 should be 100 but shows up as 99.
#include<stdio.h>
#include<math.h>
void main() {
int a,b=0;
int n;
printf("Enter a Decimal Number\n");
scanf("%d",&n);
for(int i=0;n>0;i++) {
a=n%2;
n=n/2;
b=b+(pow(10,i)*a);
}
printf("%d",b);
}
My output is always one less than the correct answer and I don't know why. It fixes the problem if I take b as 1 instead of 0 in the beginning, but I don't know why. Please help. I started C just a few days ago.
pow is a floating-point function; it takes a double argument and returns a double value. In the C implementation you are using, pow is badly implemented. It does not always produce a correct result even when the correct result is exactly representable. Stop using it for integer arithmetic.
Rewrite the code to compute the desired power of ten using integer arithmetic.
Also, do not compute binary numerals by encoding them as a decimal within an int type. It is wasteful and quickly runs into the bounds of the type. Use either bits within an unsigned type or an array of char. When scanf("%d",&n); executes, it converts the input string into binary and stores that in n. So n is already binary; you do not need to decode it. Use a loop to find its highest set bit. Then use another loop to print each bit from that position down to the least significant bit.
This code seems fine. I quickly tested it on an online compiler and it seems to be working okay.
I am very sure it has to do with different versions of compilers.
compiler which I tested your code in: https://www.onlinegdb.com/online_c_compiler
Edit:
The pow() function is not reliable when used with integers, since the integer you pass into it as a parameter is implicitly converted to double and the result is returned as a double. When you store this value in an integer again, the fractional part is dropped. Some compilers seem to produce a "correct" result with their version of pow() while some don't.
Instead, you can use a different approach to solve your decimal to binary conversion without errors in general use:
#include<stdio.h>
void main() {
int remainder,result = 0,multiplier = 1;
int input;
printf("Enter a Decimal Number\n");
scanf("%d",&input);
while(input){
remainder = input%2;
result = remainder*multiplier + result;
multiplier*=10;
input/=2;
}
printf("The binary version of the decimal value is: %d",result);
}

create number 0 only using int in the C

#include <stdio.h>
struct real_num
{
int int_num;
int frac_num;
};
void main()
{
struct real_num num1;
printf("input the number : ");
scanf("%d.%d",&num1.int_num,&num1.frac_num);
printf("%d.%d",num1.int_num,num1.frac_num);
}
I input 12.012, but the buffer stores 12.12. I want 012, but it stores 12.
What should I do? I want to store 012 (using only int).
Numbers are a matter of arithmetic. 1, 01, 1.0, 1.000, 0x01, 1e0 all describe the same number: whichever representation you use has the same mathematical properties, and behaves identically in calculation (ignoring the matter of computer storage of numbers as int or float or double... which is again another matter entirely).
The representation of a number is a matter of sequences of characters, or strings. Representations of numbers can be formatted differently, and can be in different bases, but can't be calculated with directly by a computer. To store leading zeroes, you need a string, not an int.
You typically convert from number representation to number at input, and from number to number representation at output. You would achieve your stated desire by not converting from number representation to number at input, but leaving it as a string.
You don't want to store 012, you want to store 0.012.
The value 0.012 in binary is (approximately):
0.00000011000100100110111010010111b
..and the value 12.012 is (approximately):
1100.00000011000100100110111010010111b
Note that 0.012 is impossible to store precisely in binary because it would consume an infinite number of bits; in the same way that 1/3 can't be written precisely in decimal (0.333333333.....) because you'd need an infinite number of digits.
Let's look at 12.012. In hex it's this:
0x0000000C.03126E97
This makes it easier to see how the number would be stored in a pair of 32-bit integers. The integer part in one 32-bit integer, and the fractional part in another 32-bit integer.
The first problem is that you're using signed 32-bit integers, which means that one of the bits of the fraction is wasted for a sign bit. Essentially, you're using a "sign + 31 bit integer + wasted bit + 31 bit fraction" fixed point format. It'd be easier and better to use an unsigned integer for the fractional bits.
The second problem is that standard C functions don't support fixed point formats. This means that you either have to write your own "string to fixed point" and "fixed point to string" conversion routines, or you have to use C's floating point conversion routines and write your own "floating point to fixed point" and "fixed point to floating point" conversion routines.
Note that the latter is harder (floating point is messy), slower, and less precise (double floating point format only supports 53 bits of precision while you can store 62 bits of precision).
A fraction does not consist of a single integer. A fraction consists of 2 integers: numerator/denominator.
Code needs to keep track of width of the fraction input. Could use "%n" to record offset in scan.
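#include <stdio.h>
struct real_number {
int ipart; /* integer part */
int num; /* fraction digits read as an integer (numerator) */
int den_pow10; /* number of fraction digits (denominator is 10^den_pow10) */
};
int main(void) {
struct real_number num1;
printf("input the number : ");
fflush(stdout);
int n1 = 0;
int n2 = 0;
/* %n records how many characters have been consumed so far */
scanf("%d.%n%d%n", &num1.ipart, &n1, &num1.num, &n2);
if (n2 == 0) {
fprintf(stderr, "bad input\n");
return -1;
}
num1.den_pow10 = n2 - n1;
/* Pad the fraction with leading zeros back to its original width */
printf("Result %d.%0*d\n", num1.ipart, num1.den_pow10, num1.num);
return 0;
}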
Input/Output
input the number : 12.00056
Result 12.00056

get the float number's exponents [duplicate]

This question already has answers here:
How to get the sign, mantissa and exponent of a floating point number
(7 answers)
Closed 6 years ago.
I just started learning floating point and get to know the SME stuff. I'm still very confused about the mantissa... Can somebody explain to me how can I get the exp part of the float. I am sorry if that's a super stupid and basic question but I am having a hard time understanding it...
Also how do I implement the following function... clearly my implementation is wrong. But how do I do it?
// Extract the 8-bit exponent field of single precision
// floating point number f and return it as an unsigned byte
unsigned char get_exponent_field(float f)
{
// TODO: Your code here.
int bias = 127;
int expp = (int)f;
unsigned char E = expp-bias;
return E;
}
If you want to extract the IEEE-754 single precision exponent from a float value (in excess 127 notation), you can use the float functions, or you can use a simple union with a shift and mask to do the same:
unsigned float_getexp (float f)
{
union {
unsigned u;
float f;
} uf;
uf.f = f;
return (uf.u >> 23) & 0xff;
}
If you want the actual (unbiased) exponent, i.e. the number of places the binary point is shifted during normalization prior to hidden bit removal, just subtract 127 from the value returned, or, if you want the function itself to return that value, subtract it before the return.
Give it a try and let me know if you have questions. (note: the type should be unsigned for your exponent, instead of the int you have).
First, get your floating-point number and calculate its binary form by converting the integral and fractional parts separately. Once you've got that, say you've got 11010.101 (base 2). Normalize the binary string: 1.1010101 x 2^4. Next, add your excess value, say excess 15, to the exponent of the scientific-notation value, which would give you 19 (base ten). Convert this to base two; this will be your exponent.
This is just the structure of the operation, plug in your own bias, etc.

Count number of digits after `.` in floating point numbers?

This is one interview question.
How do you compute the number of digits after the . in a floating point number?
e.g. if given 3.554 output=3
for 43.000 output=0.
My code snippet is here
double no =3.44;
int count =0;
while(no!=((int)no))
{
count++;
no=no*10;
}
printf("%d",count);
Some numbers cannot be represented exactly by a floating-point type. For example, there is no 73.487 in the double type; the value C actually stores to approximate it is about 73.486999999999995.
How can I solve this, given that my code goes into an infinite loop for such numbers?
Note: In the IEEE 754 specification, a 32-bit float is divided into 1+8+23 bits: 1 sign bit, 8 exponent bits, and 23 mantissa bits (24 bits of precision counting the implicit leading bit).
I doubt this is what you want since the question is asking for something that's not usually meaningful with floating point numbers, but here is the answer:
int digits_after_decimal_point(double x)
{
int i;
for (i=0; x!=rint(x); x+=x, i++);
return i;
}
The problem isn't really solvable as stated, since floating-point is typically represented in binary, not in decimal. As you say, many (in fact most) decimal numbers are not exactly representable in floating-point.
On the other hand, all numbers that are exactly representable in binary floating-point are decimals with a finite number of digits -- but that's not particularly useful if you want a result of 2 for 3.44.
When I run your code snippet, it says that 3.44 has 2 digits after the decimal point -- because 3.44 * 10.0 * 10.0 just happens to yield exactly 344.0. That might not happen for another number like, say, 3.43 (I haven't tried it).
When I try it with 1.0/3.0, it goes into an infinite loop. Adding some printfs shows that no becomes exactly 33333333333333324.0 after 17 iterations -- but that number is too big to be represented as an int (at least on my system), and converting it to int has undefined behavior.
And for large numbers, repeatedly multiplying by 10 will inevitably give you a floating-point overflow. There are ways to avoid that, but they don't solve the other problems.
If you store the value 3.44 in a double object, the actual value stored (at least on my system) is exactly 3.439999999999999946709294817992486059665679931640625, which has 51 decimal digits in its fractional part. Suppose you really want to compute the number of decimal digits after the point in 3.439999999999999946709294817992486059665679931640625. Since 3.44 and 3.439999999999999946709294817992486059665679931640625 are effectively the same number, there's no way for any C function to distinguish between them and know whether it should return 2 or 51 (or 50 if you meant 3.43999999999999994670929481799248605966567993164062, or ...).
You could probably detect that the stored value is "close enough" to 3.44, but that makes it a much more complex problem -- and it loses the ability to determine the number of decimal digits in the fractional part of 3.439999999999999946709294817992486059665679931640625.
The question is meaningful only if the number you're given is stored in some format that can actually represent decimal fractions (such as a string), or if you add some complex requirement for determining which decimal fraction a given binary approximation is meant to represent.
There's probably a reasonable way to do the latter by looking for the unique decimal fraction whose nearest approximation in the given floating-point type is the given binary floating-point number.
The question could be interpreted as such:
Given a floating point number, find the shortest decimal representation that would be re-interpreted as the same floating point value with correct rounding.
Once formulated like this, the answer is Yes we can - see this algorithm:
Printing floating point numbers quickly and accurately. Robert G. Burger and R. Kent Dybvig. ACM SIGPLAN 1996 Conference on Programming Language Design and Implementation, June 1996
http://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf
See also references from Compute the double value nearest preferred decimal result for a Smalltalk implementation.
Sounds like you need to either use sprintf to get an actual rounded version, or have the input be a string (and not parsed to a float).
Either way, once you have a string version of the number, counting characters after the decimal should be trivial.
This is my logic for counting the number of digits, with number = 245.98:
1. Take the input as a string:
char str[10] = "245.98";
2. Convert the string to int, to count the number of digits before the decimal point:
int atoi(const char *string)
3. Use n/10 inside a while loop to count the digits.
4. For the digits after the decimal point, get the length of the string using strlen(n), then increment i inside while (a[i] != '.').
5. Finally, add the outputs of step 3 and step 4.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
char num[100] = "345653.8768";
int count=0;
int i=0;
int len;
int before_decimal = atoi(num);
int after_decimal;
int total_Count;
printf("Converting string to int : %d\n", before_decimal);
//Let's count the number of digits before the decimal point
while(before_decimal!=0){
before_decimal = before_decimal/10;
count++;
}
printf("number of digits before decimal are %d\n",count);
//Let's get the number of digits after the decimal point
// first get the length of the string
len = strlen(num);
printf("Total number of digits including '.' are =%d\n",len);
//Now find the position of the '.' (stop at the end of the string if there is none)
while(num[i]!='.' && num[i]!='\0'){
i++;
}
// total length - number of characters before '.' - 1 (for the '.' itself)
after_decimal = (num[i]=='.') ? len-i-1 : 0;
printf("Number of digits after decimal points are %d\n",after_decimal);
//Let's add both counts now,
// i.e. the number of digits before and after the decimal point
total_Count = count+ after_decimal;
printf("Total number of digits are :%d\n",total_Count);
return 0;
}
Output:
Converting string to int : 345653
number of digits before decimal are 6
Total number of digits including '.' are =11
Number of digits after decimal points are 4
Total number of digits are :10
There are no general exact solutions. But you can convert the value to string and don't count the part exceeding the type's precision and exclude the trailing 0s or 9s. This will work for more cases but it still won't return the correct answer for all.
For example, double's accuracy is about 15 digits if the input is a decimal string from the user (17 digits for a binary-decimal-binary round trip), so for 73.486999999999995 there are 15 - 2 = 13 digits after the radix point (minus the 2 digits in the int part). After that there are still many 9s in the fractional part; subtract them from the count too. Here there are ten 9s, which means there are 13 - 10 = 3 decimal digits. If you use 17 digits, the last digit may be just garbage, so exclude it before counting the 9s or 0s.
Alternatively, just start from the 15th or 16th digit and iterate backwards until you see the first non-0 and non-9 digit. Count the remaining digits and you'll get 3 in this case. Of course, while iterating you must also make sure that the trailing run is all 0s or all 9s.
Request: e.g. if given 3.554 output = 3, for 43.000 output = 0
Problem: that's already a decimal like 0.33345. When this gets converted to a double, it might be something like 0.333459999...125. The goal is merely to determine that 0.33345 is a shorter decimal that will produce the same double. The solution is to convert it to a string with the right number of digits that results in the same original value.
int digits(double v){
int d=0; while(d < 50){
string t=DoubleToString(v,d); double vt = StrToDouble(t);
if(MathAbs(v-vt) < 1e-15) break;
++d;
}
return d;
}
double v=0.33345; PrintFormat("v=%g, d=%i", v,digits(v));// v=0.33345, d=5
v=0.01; PrintFormat("v=%g, d=%i", v,digits(v));// v=0.01, d=2
v=0.00001; PrintFormat("v=%g, d=%i", v,digits(v));// v=1e-05, d=5
v=5*0.00001; PrintFormat("v=%g, d=%i", v,digits(v));// v=5e-05, d=5
v=5*.1*.1*.1; PrintFormat("v=%g, d=%i", v,digits(v));// v=0.005, d=3
v=0.05; PrintFormat("v=%g, d=%i", v,digits(v));// v=0.05, d=2
v=0.25; PrintFormat("v=%g, d=%i", v,digits(v));// v=0.25, d=2
v=1/3.; PrintFormat("v=%g, d=%i", v,digits(v));// v=0.333333, d=15
What you can do is multiply the number by various powers of 10, round that to the nearest integer, and then divide by the same number of powers of 10. When the final result compares different from the original number, you've gone one digit too far.
I haven't read it in a long time, so I don't know how it relates to this idea, but How to Print Floating-Point Numbers Accurately from PLDI 1990 and 2003 Retrospective are probably very relevant to the basic problem.

Why do I need 17 significant digits (and not 16) to represent a double?

Can someone give me an example of a floating point number (double precision), that needs more than 16 significant decimal digits to represent it?
I have found in this thread that sometimes you need up to 17 digits, but I am not able to find an example of such a number (16 seems enough to me).
Can somebody clarify this?
My other answer was dead wrong.
#include <stdio.h>
int
main(int argc, char *argv[])
{
unsigned long long n = 1ULL << 53;
unsigned long long a = 2*(n-1);
unsigned long long b = 2*(n-2);
printf("%llu\n%llu\n%d\n", a, b, (double)a == (double)b);
return 0;
}
Compile and run to see:
18014398509481982
18014398509481980
0
a and b are just 2*(2^53-1) and 2*(2^53-2).
Those are 17-digit base-10 numbers. When rounded to 16 digits, they are the same. Yet a and b clearly only need 53 bits of precision to represent in base-2. So if you take a and b and cast them to double, you get your counter-example.
The correct answer is the one by Nemo above. Here I am just pasting a simple Fortran program with an example of two numbers that need 17 digits of precision to print, showing that one does need the (es23.16) format to print double precision numbers without losing any precision:
program test
implicit none
integer, parameter :: dp = kind(0.d0)
real(dp) :: a, b
a = 1.8014398509481982e+16_dp
b = 1.8014398509481980e+16_dp
print *, "First we show, that we have two different 'a' and 'b':"
print *, "a == b:", a == b, "a-b:", a-b
print *, "using (es22.15)"
print "(es22.15)", a
print "(es22.15)", b
print *, "using (es23.16)"
print "(es23.16)", a
print "(es23.16)", b
end program
it prints:
First we show, that we have two different 'a' and 'b':
a == b: F a-b: 2.0000000000000000
using (es22.15)
1.801439850948198E+16
1.801439850948198E+16
using (es23.16)
1.8014398509481982E+16
1.8014398509481980E+16
I think the guy on that thread is wrong, and 16 base-10 digits are always enough to represent an IEEE double.
My attempt at a proof would go something like this:
Suppose otherwise. Then, necessarily, two distinct double-precision numbers must be represented by the same 16-significant-digit base-10 number.
But two distinct double-precision numbers must differ by at least one part in 2^53, which is greater than one part in 10^16. And no two numbers differing by more than one part in 10^16 could possibly round to the same 16-significant-digit base-10 number.
This is not completely rigorous and could be wrong. :-)
Dig into the single and double precision basics, wean yourself off the notion of this or that many (16-17) DECIMAL digits, and start thinking in (53) BINARY digits. The necessary examples may be found here at stackoverflow if you spend some time digging.
And I fail to see how you can award a best answer to anyone giving a DECIMAL answer without qualified BINARY explanations. This stuff is straight-forward but it is not trivial.
The largest continuous range of integers that can be exactly represented by a double (8-byte IEEE) is -2^53 to 2^53 (-9007199254740992. to 9007199254740992.). The numbers -2^53-1 and 2^53+1 cannot be exactly represented by a double.
Therefore, no more than 16 significant decimal digits to the left of the decimal point will exactly represent a double in the continuous range.

Resources