large numbers and float and double in C - c

I need to deal with very large matrices and/or large numbers and I don't know why
double result = 2251.000000 * 9488.000000 + 7887.000000 * 8397.000000;
gives me the correct output of 87584627.000000.
Same with int result.
However, if I use float result = 2251.000000f + ... etc,
it gives me 87584624.000000 and I have no idea why!
Can somebody tell me what I'm missing?

The most common format for floating point numbers in C is the IEEE-754 format, described in this wikipedia article. The binary32 format corresponds to a float, and the binary64 format corresponds to a double.
A float has just over 7 decimal digits of precision. Since the answer to your equation has 8 significant digits, the answer cannot be exactly represented as a float.
A double has almost 16 decimal digits of precision, and therefore does have an exact representation of the answer. Therefore, in general, when you are doing general purpose mathematics, you should be using doubles. However, it's important to note that even a double may not have enough precision for every application. For example, the national debt of the United States is 18,149,752,816,959.61 which barely fits into a double.

Related

Matlab "single" precision vs C floating point?

My Matlab script reads a string value "0.001044397222448" from a file, and after parsing the file, this value printed in the console shows as double precision:
value_double =
0.001044397222448
After I convert this number to singe using value_float = single(value_double), the value shows as:
value_float =
0.0010444
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
My problem is that later on, after I compare this with analogous C code, I get differences. In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000. So the C code keeps good precision. But, does Matlab?
The number 0.001044397222448 (like the vast majority of decimal fractions) cannot be exactly represented in binary floating point.
As a single-precision float, it's most closely represented as (hex) 0x0.88e428 × 2-9, which in decimal is 0.001044397242367267608642578125.
In double precision, it's most closely represented as 0x0.88e427d4327300 × 2-9, which in decimal is 0.001044397222447999984407118745366460643708705902099609375.
Those are what the numbers are, internally, in both C and Matlab.
Everything else you see is an artifact of how the numbers are printed back out, possibly rounded and/or truncated.
When I said that the single-precision representation "in decimal is 0.001044397242367267608642578125", that's mildly misleading, because it makes it look like there are 28 or more digits' worth of precision. Most of those digits, however, are an artifact of the conversion from base 2 back to base 10. As other answers have noted, single-precision floating point actually gives you only about 7 decimal digits of precision, as you can see if you notice where the single- and double-precision equivalents start to diverge:
0.001044397242367267608642578125
0.001044397222447999984407118745366460643708705902099609375
^
difference
Similarly, double precision gives you roughly 16 decimal digits worth of precision, as you can see if you compare the results of converting a few previous and next mantissa values:
0x0.88e427d43272f8 0.00104439722244799976756668424826557384221814572811126708984375
0x0.88e427d4327300 0.001044397222447999984407118745366460643708705902099609375
0x0.88e427d4327308 0.00104439722244800020124755324246734744519926607608795166015625
0x0.88e427d4327310 0.0010443972224480004180879877395682342466898262500762939453125
^
changes
This also demonstrates why you can never exactly represent your original value 0.001044397222448 in binary. If you're using double, you can have 0.00104439722244799998, or you can have 0.0010443972224480002, but you can't have anything in between. (You'd get a little less close with float, and you could get considerably closer with long double, but you'll never get your exact value.)
In C, and whether you're using float or double, you can ask for as little or as much precision as you want when printing things with %f, and under a high-quality implementation you'll always get properly-rounded results. (Of course the results you get will always be the result of rounding the actual, internal value, not necessarily the decimal value you started with.) For example, if I run this code:
printf("%.5f\n", 0.001044397222448);
printf("%.10f\n", 0.001044397222448);
printf("%.15f\n", 0.001044397222448);
printf("%.20f\n", 0.001044397222448);
printf("%.30f\n", 0.001044397222448);
printf("%.40f\n", 0.001044397222448);
printf("%.50f\n", 0.001044397222448);
printf("%.60f\n", 0.001044397222448);
printf("%.70f\n", 0.001044397222448);
I see these results, which as you can see match the analysis above.
(Note that this particular example is using double, not float.)
0.00104
0.0010443972
0.001044397222448
0.00104439722244799998
0.001044397222447999984407118745
0.0010443972224479999844071187453664606437
0.00104439722244799998440711874536646064370870590210
0.001044397222447999984407118745366460643708705902099609375000
0.0010443972224479999844071187453664606437087059020996093750000000000000
I'm not sure how Matlab prints things.
In answer to your specific questions:
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
As a float, it is really "truncated" to a number which, converted back to decimal, is exactly 0.001044397242367267608642578125. But as we've seen, most of those digits are essentially meaningless, and the result can more properly thought of as being about 0.0010443972.
In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000
So C got the same answer I did -- but, again, most of those digits are not meaningful.
So the C code keeps good precision. But, does Matlab?
I'd be willing to bet that Matlab keeps the same internal precision for ordinary floats and doubles.
MATLAB uses IEEE-754 binary64 for its double-precision type and binary32 for single-precision. When 0.001044397222448 is rounded to the nearest value representable in binary64, the result is 4816432068447840•2−62 = 0.001044397222447999984407118745366460643708705902099609375.
When that is rounded to the nearest value representable in binary32, the result is 8971304•2−33 = 0.001044397242367267608642578125.
Various software (C, Matlab, others) displays floating-point numbers in diverse ways, with more or fewer digits. The above values are the exact numbers represented by the floating-point data, per the IEEE 754 specification, and they are the values the data has when used in arithmetic operations.
All single precisions should be the same
So here is the thing. According to documentation, both matlab and C comply with the IEEE 754 standard. Which means that there should not be any difference between what is actually stored in memory.
You could compute the binary representation by hand but according to this(thanks #Danijel) handy website, the representation of 0.001044397222448 should be 0x3a88e428.
The question is how precise is your representation? It is a bit tricky with floating point but the short answer is your number is accurate up to the 9th decimal and has decimal represented up to the 33rd decimal. If you want the long answer see the tow paragraphs at the end of this post.
A display issue
The fact that you are not seeing the same thing when you print does not mean that you don't have the same bits in memory (and you should have the exact same bytes in memory in C and MATLAB). The only reason you see a difference on your display is because the print functions truncate your number. If you print the 33 decimals in each language you should not have any difference.
To do so in matlab use: fprintf('%.33f', value_float);
To do so in c use printf('%.33f\n', gf);
About floating point precision
Now in more details, the question was: how precise is this representation? Well the tricky thing with floating point is that the precision of the representation depends on what number you are representing. The representation is over 32 bits and is divide with 1 bit for the sign, 8 for the exponent and 23 for the fraction.
The number can be computed as sign * 2^(exponent-127) * 1.fraction. This basically means that the maximal error/precision (depending on how you want to call it) is basically 2^(exponent-127-23), the 23 is here to represent the 23 bytes of the fraction. (There are a few edge cases, I won't elaborate on it). In our case the exponent is 117, which means your precision is 2^(117-127-23) = 1.16415321826934814453125e-10. That means that your single precision float should represent your number accurately up to the 9th decimal, after that it is up to luck.
Further details
I know this is a rather short explanation. For more details, this post explains the floating point imprecision more precisely and this website gives you some useful info and allows you to play visually with the representation.

printf behaviour in C

I am trying to understand what is the difference between the following:
printf("%f",4.567f);
printf("%f",4.567);
How does using the f suffix change/influence the output?
How using the 'f' changes/influences the output?
The f at the end of a floating point constant determines the type and can affect the value.
4.567 is floating point constant of type and precision of double. A double can represent exactly typical about 264 different values. 4.567 is not one on them*1. The closest alternative typically is exactly
4.56700000000000017053025658242404460906982421875 // best
4.56699999999999928235183688229881227016448974609375 // next best double
4.567f is floating point constant of type and precision of float. A float can represent exactly typical about 232 different values. 4.567 is not one on them. The closest alternative typically is exactly
4.566999912261962890625 // best
4.56700038909912109375 // next best float
When passed to printf() as part of the ... augments, a float is converted to double with the same value.
So the question becomes what is the expected difference in printing?
printf("%f",4.56700000000000017053025658242404460906982421875);
printf("%f",4.566999912261962890625);
Since the default number of digits after the decimal point to print for "%f" is 6, the output for both rounds to:
4.567000
To see a difference, print with more precision or try 4.567e10, 4.567e10f.
45670000000.000000 // double
45669998592.000000 // float
Your output may slightly differ to to quality of implementation issues.
*1 C supports many floating point encodings. A common one is binary64. Thus typical floating-point values are encoded as an sign * binary fraction * 2exponent. Even simple decimal values like 0.1 can not be represented exactly as such.

float strange imprecision error in c [duplicate]

This question already has answers here:
Is floating point math broken?
(31 answers)
Closed 8 years ago.
today happened to me a strange thing, when I try to compile and execute the output of this code isn't what I expected. Here is the code that simply add floating values to an array of float and then print it out.
The simple code:
int main(){
float r[10];
int z;
int i=34;
for(z=0;z<10;z++){
i=z*z*z;
r[z]=i;
r[z]=r[z]+0.634;
printf("%f\n",r[z]);
}
}
the output:
0.634000
1.634000
8.634000
27.634001
64.634003
125.634003
216.634003
343.634003
512.633972
729.633972
note that from the 27 appears numbers after the .634 that should not be there. Anyone know why this happened? It's an event caused by floating point approximation?..
P.S I have a linux debian system, 64 bit
thanks all
A number maybe represented in the following form:
[sign] [mantissa] * 2[exponent]
So there will be rounding or relative errors when the space is less in memory.
From wiki:
Single-precision floating-point format is a computer number format that occupies 4 bytes (32 bits) in computer memory and represents a wide dynamic range of values by using a floating point.
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original [4]).
Edit (Edward's comment): Larger (more bits) floating point representations allow for greater precision.
Yes, this is a floating point approximation error or Round-off error. Floating point numbers representation uses quantization to represent a large range of numbers, so it only represent steps and round all the in-between numbers to the nearest step. This cause error if the wanted number is not one of these steps.
In addition to the other useful answers, it can be illustrative to print more digits than the default:
int main(){
float r[10];
int z;
int i=34;
for(z=0;z<10;z++){
i=z*z*z;
r[z]=i;
r[z]=r[z]+0.634;
printf("%.30f\n",r[z]);
}
}
gives
0.634000003337860107421875000000
1.633999943733215332031250000000
8.633999824523925781250000000000
27.634000778198242187500000000000
64.634002685546875000000000000000
125.634002685546875000000000000000
216.634002685546875000000000000000
343.634002685546875000000000000000
512.633972167968750000000000000000
729.633972167968750000000000000000
In particular, note that 0.634 isn't actually "0.634", but instead is the closest number representable by a float.
"float" has only about six digit precision, so it isn't unexpected that you get errors that large.
If you used "double", you would have about 15 digits precision. You would have an error, but you would get for example 125.634000000000003 and not 125.634003.
So you will always get rounding errors and your results will not be quite what you expect, but by using double the effect will be minimal. Warning: If you do things like adding 125 + 0.634 and then subtract 125, the result will (most likely) not be 0.634. No matter whether you use float or double. But with double, the result will be very, very close to 0.634.
In principle, given the choice of float and double, you should never use float, unless you have a very, very good reason.

Why does the addition of two float numbers is incorrect in C?

I have a problem with the addition of two float numbers.
Code below:
float a = 30000.0f;
float b = 4499722832.0f;
printf("%f\n", a+b);
Why the output result is 450002816.000000? (The correct one should be 450002832.)
Float are not represented exactly in C - see http://en.wikipedia.org/wiki/Floating_point#IEEE_754:_floating_point_in_modern_computers and http://en.wikipedia.org/wiki/Single_precision, so calculations with float can only give an approximate result.
This is especially apparent for larger values, since the possible difference can be represented as a percentage of the value. In case of adding/subtracting two values, you get the worse precision of both (and of the result).
Floating-point values cannot represent all integer values.
Remember that single-precision floating-point numbers only have 24 (or 23, depending on how you count) bits of precision (i.e. significant figures). So as values get larger, you begin to lose low-end precision, which is why the result of your calculation isn't quite "correct".
From wikipedia
Single precision, called "float" in the C language family, and "real" or "real*4" in Fortran. This is a binary format that occupies 32 bits (4 bytes) and its significand has a precision of 24 bits (about 7 decimal digits).
So your number doesn't actually fit in float. You can use double instead.

Why is my float division off by 0.00390625?

float a=67107842,b=512;
float c=a/b;
printf("%lf\n",c);
Why is c 131070.000000 instead of the correct value 131070.00390625?
Your compiler's float type is probably using the 32-bit IEEE 754 single-precision format.
67107842 is a 26-bit binary number:
11111111111111110000000010
The single-precision format represents most numbers as 1.x multipled by some (positive or negative) power of two, where 23 bits are stored after the binary place, with the leading 1. being implied (very small numbers are an exception).
But 67107842 would require 24 bits after the binary place (to be represented as 1.111111111111111000000001 multipled by 225). As there is only room to store 23 bits, the final 1 gets lost. So it is the value in a that is wrong in this case, not the division - a actually contains 67107840 (11111111111111110000000000), which is exactly 131070 * 512.
You can see this if you print a as well:
printf("%lf %lf %lf\n", a, b, c);
gives
67107840.000000 512.000000 131070.000000
Try changing a and c to be type "double", rather than float. That will give you better precision / accuracy. (Floats have about 6 or so significant digits; doubles have more than twice that.)
A float typically uses 32bit IEEE-754 single precision representation, and is good for only approximately 6 significant decimal figures. A double is good for 15, and where supported an 80 bit long double gets to 20 significant figures.
Note that on some compilers there is no distinction between double and long double, or even no support for long double at all.
One solution is to use an arbitrary-precision numeric library, or to use a decimal-floating point library rather then the built-in binary floating point support. Decimal floating point is not intrinsically more precise (though often such libraries support larger, more precise types), but will not show up the artefacts that occur when displaying a decimal representation of a binary floating point value. Decimal floating point is also likely to be much slower, since it is not typically implemented in hardware.

Resources