Why is my float division off by 0.00390625? - c

float a=67107842,b=512;
float c=a/b;
printf("%lf\n",c);
Why is c 131070.000000 instead of the correct value 131070.00390625?

Your compiler's float type is probably using the 32-bit IEEE 754 single-precision format.
67107842 is a 26-bit binary number:
11111111111111110000000010
The single-precision format represents most numbers as 1.x multiplied by some (positive or negative) power of two, where 23 bits are stored after the binary point, with the leading 1. being implied (very small numbers are an exception).
But 67107842 would require 24 bits after the binary point (to be represented as 1.111111111111111000000001 multiplied by 2^25). As there is only room to store 23 bits, the final 1 gets lost. So it is the value in a that is wrong in this case, not the division - a actually contains 67107840 (11111111111111110000000000), which is exactly 131070 * 512.
You can see this if you print a as well:
printf("%lf %lf %lf\n", a, b, c);
gives
67107840.000000 512.000000 131070.000000

Try changing a and c to be type "double", rather than float. That will give you better precision / accuracy. (Floats have about 6 or so significant digits; doubles have more than twice that.)

A float typically uses the 32-bit IEEE-754 single-precision representation, and is good for only approximately 6 significant decimal figures. A double is good for 15, and where supported an 80-bit long double gets to about 18 or 19 significant figures.
Note that on some compilers there is no distinction between double and long double, or even no support for long double at all.
One solution is to use an arbitrary-precision numeric library, or to use a decimal floating-point library rather than the built-in binary floating-point support. Decimal floating point is not intrinsically more precise (though often such libraries support larger, more precise types), but will not show up the artefacts that occur when displaying a decimal representation of a binary floating-point value. Decimal floating point is also likely to be much slower, since it is not typically implemented in hardware.

Related

Matlab "single" precision vs C floating point?

My Matlab script reads a string value "0.001044397222448" from a file, and after parsing the file, this value printed in the console shows as double precision:
value_double =
0.001044397222448
After I convert this number to single using value_float = single(value_double), the value shows as:
value_float =
0.0010444
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
My problem is that later on, after I compare this with analogous C code, I get differences. In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000. So the C code keeps good precision. But, does Matlab?
The number 0.001044397222448 (like the vast majority of decimal fractions) cannot be exactly represented in binary floating point.
As a single-precision float, it's most closely represented as (hex) 0x0.88e428 × 2^-9, which in decimal is 0.001044397242367267608642578125.
In double precision, it's most closely represented as 0x0.88e427d4327300 × 2^-9, which in decimal is 0.001044397222447999984407118745366460643708705902099609375.
Those are what the numbers are, internally, in both C and Matlab.
Everything else you see is an artifact of how the numbers are printed back out, possibly rounded and/or truncated.
When I said that the single-precision representation "in decimal is 0.001044397242367267608642578125", that's mildly misleading, because it makes it look like there are 28 or more digits' worth of precision. Most of those digits, however, are an artifact of the conversion from base 2 back to base 10. As other answers have noted, single-precision floating point actually gives you only about 7 decimal digits of precision, as you can see if you notice where the single- and double-precision equivalents start to diverge:
0.001044397242367267608642578125
0.001044397222447999984407118745366460643708705902099609375
            ^
            difference
Similarly, double precision gives you roughly 16 decimal digits worth of precision, as you can see if you compare the results of converting a few previous and next mantissa values:
0x0.88e427d43272f8 0.00104439722244799976756668424826557384221814572811126708984375
0x0.88e427d4327300 0.001044397222447999984407118745366460643708705902099609375
0x0.88e427d4327308 0.00104439722244800020124755324246734744519926607608795166015625
0x0.88e427d4327310 0.0010443972224480004180879877395682342466898262500762939453125
                                       ^
                                       changes
This also demonstrates why you can never exactly represent your original value 0.001044397222448 in binary. If you're using double, you can have 0.00104439722244799998, or you can have 0.0010443972224480002, but you can't have anything in between. (You'd get a little less close with float, and you could get considerably closer with long double, but you'll never get your exact value.)
In C, and whether you're using float or double, you can ask for as little or as much precision as you want when printing things with %f, and under a high-quality implementation you'll always get properly-rounded results. (Of course the results you get will always be the result of rounding the actual, internal value, not necessarily the decimal value you started with.) For example, if I run this code:
printf("%.5f\n", 0.001044397222448);
printf("%.10f\n", 0.001044397222448);
printf("%.15f\n", 0.001044397222448);
printf("%.20f\n", 0.001044397222448);
printf("%.30f\n", 0.001044397222448);
printf("%.40f\n", 0.001044397222448);
printf("%.50f\n", 0.001044397222448);
printf("%.60f\n", 0.001044397222448);
printf("%.70f\n", 0.001044397222448);
I see these results, which as you can see match the analysis above.
(Note that this particular example is using double, not float.)
0.00104
0.0010443972
0.001044397222448
0.00104439722244799998
0.001044397222447999984407118745
0.0010443972224479999844071187453664606437
0.00104439722244799998440711874536646064370870590210
0.001044397222447999984407118745366460643708705902099609375000
0.0010443972224479999844071187453664606437087059020996093750000000000000
I'm not sure how Matlab prints things.
In answer to your specific questions:
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
As a float, it is really "truncated" to a number which, converted back to decimal, is exactly 0.001044397242367267608642578125. But as we've seen, most of those digits are essentially meaningless, and the result can more properly be thought of as being about 0.0010443972.
In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000
So C got the same answer I did -- but, again, most of those digits are not meaningful.
So the C code keeps good precision. But, does Matlab?
I'd be willing to bet that Matlab keeps the same internal precision for ordinary floats and doubles.
MATLAB uses IEEE-754 binary64 for its double-precision type and binary32 for single-precision. When 0.001044397222448 is rounded to the nearest value representable in binary64, the result is 4816432068447840 × 2^-62 = 0.001044397222447999984407118745366460643708705902099609375.
When that is rounded to the nearest value representable in binary32, the result is 8971304 × 2^-33 = 0.001044397242367267608642578125.
Various software (C, Matlab, others) displays floating-point numbers in diverse ways, with more or fewer digits. The above values are the exact numbers represented by the floating-point data, per the IEEE 754 specification, and they are the values the data has when used in arithmetic operations.
All single precisions should be the same
So here is the thing. According to documentation, both matlab and C comply with the IEEE 754 standard. Which means that there should not be any difference between what is actually stored in memory.
You could compute the binary representation by hand, but according to this handy website (thanks @Danijel), the representation of 0.001044397222448 should be 0x3a88e428.
The question is how precise is your representation? It is a bit tricky with floating point, but the short answer is that your number is accurate up to the 9th decimal and has decimal digits represented up to the 33rd decimal place. If you want the long answer, see the two paragraphs at the end of this post.
A display issue
The fact that you are not seeing the same thing when you print does not mean that you don't have the same bits in memory (and you should have the exact same bytes in memory in C and MATLAB). The only reason you see a difference on your display is because the print functions truncate your number. If you print the 33 decimals in each language you should not have any difference.
To do so in matlab use: fprintf('%.33f', value_float);
To do so in C use: printf("%.33f\n", gf);
About floating point precision
Now in more detail, the question was: how precise is this representation? The tricky thing with floating point is that the precision of the representation depends on what number you are representing. The representation is 32 bits wide and is divided into 1 bit for the sign, 8 for the exponent and 23 for the fraction.
The number can be computed as sign * 2^(exponent-127) * 1.fraction. This means that the maximal error/precision (depending on how you want to call it) is basically 2^(exponent-127-23), where the 23 stands for the 23 bits of the fraction. (There are a few edge cases that I won't elaborate on.) In our case the exponent is 117, which means your precision is 2^(117-127-23) = 1.16415321826934814453125e-10. That means that your single-precision float should represent your number accurately up to the 9th decimal; after that it is up to luck.
Further details
I know this is a rather short explanation. For more details, this post explains the floating point imprecision more precisely and this website gives you some useful info and allows you to play visually with the representation.

large numbers and float and double in C

I need to deal with very large matrices and/or large numbers and I don't know why
double result = 2251.000000 * 9488.000000 + 7887.000000 * 8397.000000;
gives me the correct output of 87584627.000000.
Same with int result.
However, if I use float result = 2251.000000f + ... etc,
it gives me 87584624.000000 and I have no idea why!
Can somebody tell me what I'm missing?
The most common format for floating point numbers in C is the IEEE-754 format, described in this wikipedia article. The binary32 format corresponds to a float, and the binary64 format corresponds to a double.
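A float has just over 7 decimal digits of precision. The answer to your equation has 8 significant digits, which is more than a float can be relied on to hold: near 87 million, adjacent float values are 8 apart, so 87584627 is not representable and the nearest float is 87584624.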
A double has almost 16 decimal digits of precision, and therefore does have an exact representation of the answer. Therefore, in general, when you are doing general purpose mathematics, you should be using doubles. However, it's important to note that even a double may not have enough precision for every application. For example, the national debt of the United States is 18,149,752,816,959.61 which barely fits into a double.

float strange imprecision error in c [duplicate]

Something strange happened to me today: when I compile and run this code, the output isn't what I expected. The code simply adds floating-point values to an array of float and then prints them out.
The simple code:
#include <stdio.h>

int main(void){
    float r[10];
    int z;
    int i=34;
    for(z=0;z<10;z++){
        i=z*z*z;
        r[z]=i;
        r[z]=r[z]+0.634;   /* 0.634 is a double; the sum is rounded back to float */
        printf("%f\n",r[z]);
    }
    return 0;
}
the output:
0.634000
1.634000
8.634000
27.634001
64.634003
125.634003
216.634003
343.634003
512.633972
729.633972
Note that from 27 onward, digits appear after the .634 that should not be there. Does anyone know why this happens? Is it caused by floating-point approximation?
P.S I have a linux debian system, 64 bit
thanks all
A number may be represented in the following form:
[sign] [mantissa] * 2^[exponent]
So there will be rounding or relative errors when there is less space in memory than the value needs.
From wiki:
Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point.
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original [4]).
Edit (Edward's comment): Larger (more bits) floating point representations allow for greater precision.
Yes, this is a floating-point approximation error, also called round-off error. Floating-point representation uses quantization to cover a large range of numbers, so it can represent only discrete steps and rounds every in-between number to the nearest step. This causes an error whenever the wanted number is not one of these steps.
In addition to the other useful answers, it can be illustrative to print more digits than the default:
#include <stdio.h>

int main(void){
    float r[10];
    int z;
    int i=34;
    for(z=0;z<10;z++){
        i=z*z*z;
        r[z]=i;
        r[z]=r[z]+0.634;
        printf("%.30f\n",r[z]);   /* 30 decimals expose the stored value */
    }
    return 0;
}
gives
0.634000003337860107421875000000
1.633999943733215332031250000000
8.633999824523925781250000000000
27.634000778198242187500000000000
64.634002685546875000000000000000
125.634002685546875000000000000000
216.634002685546875000000000000000
343.634002685546875000000000000000
512.633972167968750000000000000000
729.633972167968750000000000000000
In particular, note that 0.634 isn't actually "0.634", but instead is the closest number representable by a float.
"float" has only about six digit precision, so it isn't unexpected that you get errors that large.
If you used "double", you would have about 15 digits precision. You would have an error, but you would get for example 125.634000000000003 and not 125.634003.
So you will always get rounding errors and your results will not be quite what you expect, but by using double the effect will be minimal. Warning: If you do things like adding 125 + 0.634 and then subtract 125, the result will (most likely) not be 0.634. No matter whether you use float or double. But with double, the result will be very, very close to 0.634.
In principle, given the choice of float and double, you should never use float, unless you have a very, very good reason.

Max value of datatypes in C

I am trying to understand the maximum value that I can store in C. I tried doing printf("%f", pow(2, x)). The answer holds good until x = 1023. It says Inf when x = 1024.
I am sorry that it is a basic question but I am trying to understand how C assigns datatypes' sizes based on my machine.
I have a Mac (64-bit processor). My understanding is that, since my processor is a 64-bit one, it can do calculations up to the value 2^64. Clearly pow(2, 1023) is greater than that, but my program works fine up to x = 1023. How is this possible? Does the GNU compiler have something to do with this?
If this is a duplicate of other question kindly give the link.
In C the pow() function returns a double, and the double type is typically a 64-bit IEEE format representation of a floating point number.
The basic idea of floating point is to express a number in the same general way as e.g. 1.234 × 10^56. Here you have a mantissa 1.234 and an exponent 56. C++, and probably also C, allows decimal representation for floating point numbers (but not for integer types), but in practice the internal representation will be binary, with a power of 2 rather than a power of 10.
The limit you ran up against was the supported range for the exponent in your compiler's representation of double numbers; probably 64-bit IEEE 754.
The limits of the various built-in integral numerical types are available as symbolic constants from <limits.h>. The limits of the built-in floating point types are available as symbolic constants from <float.h>. See the table over at cppreference.com for more details.
In C++ these limits are also available via the numeric_limits class template from <limits>.
"64-bit processor" typically means that it can deal with integers that contain at most 64 bits at a time (i.e. in a single instruction), not that it can only process numbers with 64 binary digits or less. Using arbitrary precision arithmetic you can do calculations on numbers that are arbitrarily large, provided that you have enough memory (and time), just like how us humans can do operations on big values with only 10 fingers. Read more here: What is the biggest number you can generate using a 64-bit processor?
However, pow(2, 1023) is a little different. It's not an integer but a floating-point number (of type double in C) represented by a sign, a mantissa and an exponent, like this: (-1)^sign × 1 × 2^1023. Not all the digits are stored, so it's only accurate to the first few digits. However, most systems use binary floating-point types, so they can store the precise value of a power of 2 up to a large exponent, depending on the exponent range. Most modern systems' floating-point types conform to the IEEE-754 standard, with double mapping to binary64 (double precision), so the maximum value will be
2^1023 × (1 + (1 − 2^-52)) ≈ 1.7976931348623157 × 10^308
The maximum value for a double is DBL_MAX. This is defined by <float.h> in C, or <cfloat> in C++. The numeric value may vary across systems, but you can always refer to it by the macro DBL_MAX.
You can print this:
printf("%f\n", DBL_MAX);
The integer data types all have similar macros defined in <limits.h>: e.g. ULLONG_MAX is the biggest value for unsigned long long. If printing with printf make sure to use the correct format specifier.

Why does the addition of two float numbers is incorrect in C?

I have a problem with the addition of two float numbers.
Code below:
float a = 30000.0f;
float b = 449972832.0f;
printf("%f\n", a+b);
Why is the output 450002816.000000? (The correct result should be 450002832.)
Floats are not represented exactly in C - see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers and http://en.wikipedia.org/wiki/Single_precision, so calculations with float can only give an approximate result.
This is especially apparent for larger values, since the possible difference can be represented as a percentage of the value. In case of adding/subtracting two values, you get the worse precision of both (and of the result).
Floating-point values cannot represent all integer values.
Remember that single-precision floating-point numbers only have 24 (or 23, depending on how you count) bits of precision (i.e. significant figures). So as values get larger, you begin to lose low-end precision, which is why the result of your calculation isn't quite "correct".
From wikipedia
Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
So your number doesn't actually fit in float. You can use double instead.

Resources