Max value of datatypes in C

I am trying to understand the maximum value that I can store in C. I tried doing printf("%f", pow(2, x)). The answer holds good until x = 1023. It says Inf when x = 1024.
I am sorry that it is a basic question but I am trying to understand how C assigns datatypes' sizes based on my machine.
I have a Mac (64-bit processor). My understanding is that, my processor being a 64-bit one, it should only be able to do calculations up to the value 2^64. Clearly pow(2, 1023) is greater than that. But my program works fine until x = 1023. How is this possible? Does the GNU compiler have something to do with this?
If this is a duplicate of other question kindly give the link.

In C the pow() function returns a double, and the double type is typically a 64-bit IEEE format representation of a floating point number.
The basic idea of floating point is to express a number in the same general way as e.g. 1.234×10^56. Here you have a mantissa 1.234 and an exponent 56. C++, and also C, allows this decimal notation for floating point literals (but not for integer types), but in practice the internal representation will be binary, with a power of 2 rather than a power of 10.
The limit you ran up against was the supported range for the exponent in your compiler's representation of double numbers; probably 64-bit IEEE 754.
The limits of the various built-in integral numerical types are available as symbolic constants from <limits.h>. The limits of the built-in floating point types are available as symbolic constants from <float.h>. See the table over at cppreference.com for more details.
In C++ these limits are also available via the numeric_limits class template from <limits>.
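As a concrete illustration in C, here is a minimal sketch that prints a few of those constants (all of the macro names below are standard ones from <limits.h> and <float.h>):

#include <limits.h>
#include <float.h>
#include <stdio.h>

int main(void) {
    printf("INT_MAX      = %d\n", INT_MAX);        /* largest int */
    printf("DBL_MAX      = %g\n", DBL_MAX);        /* largest finite double */
    printf("DBL_MAX_EXP  = %d\n", DBL_MAX_EXP);    /* 1024 for IEEE binary64: 2^1024 overflows */
    printf("DBL_MANT_DIG = %d\n", DBL_MANT_DIG);   /* 53 significand bits for IEEE binary64 */
    return 0;
}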

"64-bit processor" typically means that it can deal with integers that contain at most 64 bits at a time (i.e. in a single instruction), not that it can only process numbers with 64 binary digits or less. Using arbitrary precision arithmetic you can do calculations on numbers that are arbitrarily large, provided that you have enough memory (and time), just like how us humans can do operations on big values with only 10 fingers. Read more here: What is the biggest number you can generate using a 64-bit processor?
However pow(2, 1023) is a little bit different. It's not an integer but a floating-point number (of type double in C) represented by a sign, a mantissa and an exponent, like this: (-1)^sign × 1 × 2^1023. Not all the digits are stored, so it's only accurate to the first few digits. However, most systems use binary floating-point types, so they can store the precise value of a power of 2 up to a large exponent, depending on the exponent range. Most modern systems' floating-point types conform to the IEEE-754 standard, with double mapping to binary64 (double precision), therefore the maximum value will be
2^1023 × (1 + (1 − 2^−52)) ≈ 1.7976931348623157 × 10^308
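To see the questioner's experiment hit exactly that ceiling, here is a minimal sketch (assuming IEEE-754 double and a hosted C environment) comparing pow(2, 1023) and pow(2, 1024) against DBL_MAX:

#include <math.h>
#include <float.h>
#include <stdio.h>

int main(void) {
    double ok  = pow(2, 1023);   /* the largest power of two below DBL_MAX */
    double inf = pow(2, 1024);   /* exceeds the 11-bit exponent range, overflows to infinity */
    printf("2^1023  = %g (finite: %d)\n", ok, isfinite(ok));
    printf("2^1024  = %g (finite: %d)\n", inf, isfinite(inf));
    printf("DBL_MAX = %g\n", DBL_MAX);
    return 0;
}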

The maximum value for a double is DBL_MAX. This is defined by <float.h> in C, or <cfloat> in C++. The numeric value may vary across systems, but you can always refer to it by the macro DBL_MAX.
You can print this:
printf("%f\n", DBL_MAX);
The integer data types all have similar macros defined in <limits.h>: e.g. ULLONG_MAX is the biggest value for unsigned long long. If printing with printf, make sure to use the correct format specifier.
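For example, a small sketch of printing some of the integer limits with matching format specifiers (a mismatched specifier is undefined behaviour):

#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("INT_MAX    = %d\n",   INT_MAX);     /* int                -> %d   */
    printf("UINT_MAX   = %u\n",   UINT_MAX);    /* unsigned int       -> %u   */
    printf("LLONG_MAX  = %lld\n", LLONG_MAX);   /* long long          -> %lld */
    printf("ULLONG_MAX = %llu\n", ULLONG_MAX);  /* unsigned long long -> %llu */
    return 0;
}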

Related

Matlab "single" precision vs C floating point?

My Matlab script reads a string value "0.001044397222448" from a file, and after parsing the file, this value printed in the console shows as double precision:
value_double =
0.001044397222448
After I convert this number to single using value_float = single(value_double), the value shows as:
value_float =
0.0010444
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
My problem is that later on, after I compare this with analogous C code, I get differences. In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000. So the C code keeps good precision. But, does Matlab?
The number 0.001044397222448 (like the vast majority of decimal fractions) cannot be exactly represented in binary floating point.
As a single-precision float, it's most closely represented as (hex) 0x0.88e428 × 2^−9, which in decimal is 0.001044397242367267608642578125.
In double precision, it's most closely represented as 0x0.88e427d4327300 × 2^−9, which in decimal is 0.001044397222447999984407118745366460643708705902099609375.
Those are what the numbers are, internally, in both C and Matlab.
Everything else you see is an artifact of how the numbers are printed back out, possibly rounded and/or truncated.
When I said that the single-precision representation "in decimal is 0.001044397242367267608642578125", that's mildly misleading, because it makes it look like there are 28 or more digits' worth of precision. Most of those digits, however, are an artifact of the conversion from base 2 back to base 10. As other answers have noted, single-precision floating point actually gives you only about 7 decimal digits of precision, as you can see if you notice where the single- and double-precision equivalents start to diverge:
0.001044397242367267608642578125
0.001044397222447999984407118745366460643708705902099609375
            ^
            difference
Similarly, double precision gives you roughly 16 decimal digits worth of precision, as you can see if you compare the results of converting a few previous and next mantissa values:
0x0.88e427d43272f8 0.00104439722244799976756668424826557384221814572811126708984375
0x0.88e427d4327300 0.001044397222447999984407118745366460643708705902099609375
0x0.88e427d4327308 0.00104439722244800020124755324246734744519926607608795166015625
0x0.88e427d4327310 0.0010443972224480004180879877395682342466898262500762939453125
                                   ^
                                changes
This also demonstrates why you can never exactly represent your original value 0.001044397222448 in binary. If you're using double, you can have 0.00104439722244799998, or you can have 0.0010443972224480002, but you can't have anything in between. (You'd get a little less close with float, and you could get considerably closer with long double, but you'll never get your exact value.)
In C, and whether you're using float or double, you can ask for as little or as much precision as you want when printing things with %f, and under a high-quality implementation you'll always get properly-rounded results. (Of course the results you get will always be the result of rounding the actual, internal value, not necessarily the decimal value you started with.) For example, if I run this code:
printf("%.5f\n", 0.001044397222448);
printf("%.10f\n", 0.001044397222448);
printf("%.15f\n", 0.001044397222448);
printf("%.20f\n", 0.001044397222448);
printf("%.30f\n", 0.001044397222448);
printf("%.40f\n", 0.001044397222448);
printf("%.50f\n", 0.001044397222448);
printf("%.60f\n", 0.001044397222448);
printf("%.70f\n", 0.001044397222448);
I see these results, which as you can see match the analysis above.
(Note that this particular example is using double, not float.)
0.00104
0.0010443972
0.001044397222448
0.00104439722244799998
0.001044397222447999984407118745
0.0010443972224479999844071187453664606437
0.00104439722244799998440711874536646064370870590210
0.001044397222447999984407118745366460643708705902099609375000
0.0010443972224479999844071187453664606437087059020996093750000000000000
I'm not sure how Matlab prints things.
In answer to your specific questions:
What is the real value of this variable, that I later use in my Simulink simulation? Is it really truncated/rounded to 0.0010444?
As a float, it is really "truncated" to a number which, converted back to decimal, is exactly 0.001044397242367267608642578125. But as we've seen, most of those digits are essentially meaningless, and the result can more properly be thought of as being about 0.0010443972.
In the C code the value is read as float gf = 0.001044397222448f; and it prints out as 0.001044397242367267608642578125000
So C got the same answer I did -- but, again, most of those digits are not meaningful.
So the C code keeps good precision. But, does Matlab?
I'd be willing to bet that Matlab keeps the same internal precision for ordinary floats and doubles.
MATLAB uses IEEE-754 binary64 for its double-precision type and binary32 for single-precision. When 0.001044397222448 is rounded to the nearest value representable in binary64, the result is 4816432068447840 × 2^−62 = 0.001044397222447999984407118745366460643708705902099609375.
When that is rounded to the nearest value representable in binary32, the result is 8971304 × 2^−33 = 0.001044397242367267608642578125.
Various software (C, Matlab, others) displays floating-point numbers in diverse ways, with more or fewer digits. The above values are the exact numbers represented by the floating-point data, per the IEEE 754 specification, and they are the values the data has when used in arithmetic operations.
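As a quick cross-check in C (assuming IEEE-754 float and double), printing both conversions with plenty of digits reproduces the two exact values quoted above:

#include <stdio.h>

int main(void) {
    double d = 0.001044397222448;    /* rounded to the nearest binary64 */
    float  f = 0.001044397222448f;   /* rounded to the nearest binary32 */
    printf("double: %.60f\n", d);    /* digits beyond the exact value print as zeros */
    printf("float : %.60f\n", f);
    return 0;
}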
All single precisions should be the same
So here is the thing. According to documentation, both MATLAB and C comply with the IEEE 754 standard, which means that there should not be any difference between what is actually stored in memory.
You could compute the binary representation by hand, but according to this (thanks @Danijel) handy website, the representation of 0.001044397222448 should be 0x3a88e428.
The question is: how precise is your representation? It is a bit tricky with floating point, but the short answer is that your number is accurate up to the 9th decimal place and has digits represented up to the 33rd decimal place. If you want the long answer, see the two paragraphs at the end of this post.
A display issue
The fact that you are not seeing the same thing when you print does not mean that you don't have the same bits in memory (and you should have the exact same bytes in memory in C and MATLAB). The only reason you see a difference on your display is because the print functions truncate your number. If you print the 33 decimals in each language you should not have any difference.
To do so in MATLAB use: fprintf('%.33f', value_float);
To do so in C use: printf("%.33f\n", gf);
About floating point precision
Now in more detail, the question was: how precise is this representation? Well, the tricky thing with floating point is that the precision of the representation depends on what number you are representing. The representation is 32 bits, divided into 1 bit for the sign, 8 for the exponent and 23 for the fraction.
The number can be computed as (-1)^sign × 2^(exponent−127) × 1.fraction. This basically means that the maximal error/precision (depending on how you want to call it) is 2^(exponent−127−23), where the 23 accounts for the 23 bits of the fraction. (There are a few edge cases; I won't elaborate on them.) In our case the stored exponent is 117, which means your precision is 2^(117−127−23) = 2^−33 = 1.16415321826934814453125e-10. That means that your single precision float should represent your number accurately up to the 9th decimal place; after that it is up to luck.
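A quick numerical check of that spacing (one ulp at this magnitude), using nextafterf from <math.h> and assuming an IEEE-754 float:

#include <math.h>
#include <stdio.h>

int main(void) {
    float f = 0.001044397222448f;
    /* Distance to the next representable float above f: one ulp at this magnitude. */
    float ulp = nextafterf(f, INFINITY) - f;
    printf("value = %.33f\n", f);
    printf("ulp   = %.30g\n", ulp);   /* expected: 2^-33 = 1.16415321826934814453125e-10 */
    return 0;
}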
Further details
I know this is a rather short explanation. For more details, this post explains the floating point imprecision more precisely and this website gives you some useful info and allows you to play visually with the representation.

platform independent way to reduce precision of floating point constant values

The use case:
I have some large data arrays containing floating point constants.
The file defining that array is generated and the template can be easily adapted.
I would like to make some tests, how reduced precision does influence the results in terms of quality, but also in compressibility of the binary.
Since I do not want to change other source code than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (set the lower ones to 0). But since floating point literals are written in decimal, it is difficult to specify the numbers in a way that guarantees the binary representation contains all zeros in the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(float) /* some macro */
static const float32_t veryLargeArray[] = {
    FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
    // ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.
The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2n (for example, to remove eight bits, use 0x1p8f for the second argument). The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)
IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.
#define RemoveBits(x, p) (float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x)))
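As a hedged, self-contained sketch of how this might be used for the questioner's array (float32_t is assumed to be IEEE binary32 and is typedef'd to float here purely for illustration; translation-time evaluation is assumed to round the same way as run-time evaluation):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef float float32_t;   /* assumption: float is IEEE binary32 on this platform */

#define RemoveBits(x, p) (float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x)))

/* Remove 8 low significand bits (with rounding) from each constant: p = 2^8 = 0x1p8f. */
static const float32_t veryLargeArray[] = {
    RemoveBits(23.423f, 0x1p8f),
    RemoveBits(0.000023f, 0x1p8f),
    RemoveBits(290.2342f, 0x1p8f),
};

int main(void) {
    for (size_t i = 0; i < sizeof veryLargeArray / sizeof veryLargeArray[0]; i++) {
        uint32_t bits;
        memcpy(&bits, &veryLargeArray[i], sizeof bits);
        /* The 8 low bits of the stored representation should now be zero. */
        printf("%-12g 0x%08x\n", (double) veryLargeArray[i], (unsigned) bits);
    }
    return 0;
}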
What you're asking for can be done with varying degrees of partial portability, but not absolute unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms (including 32-bit x86) with extended-precision under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
#define FP_REDUCE(x,p) ((x)+(p)-(p))
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, a preprocessor-token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
Edit: I believe this is fixable, so as not to need an absolute precision but rather automatically scale to the value, but it depends on correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works I will later integrate the result with this answer.
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
#include <stdint.h>
union fr { float x; uint32_t r; };
#define FP_REDUCE(x) ((union fr){.r=(union fr){x}.r & (0xffffffffu<<n)}.x)
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want to make it round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.
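Here is a block-scope sketch of that approach (needing only C99 compound literals, so it sidesteps the static-initializer question; n is passed as a macro parameter rather than a free variable):

#include <stdint.h>
#include <stdio.h>

union fr { float x; uint32_t r; };

/* Clear the n low bits of the stored representation; this truncates toward zero. */
#define FP_REDUCE(v, n) ((union fr){ .r = (union fr){ (v) }.r & (0xffffffffu << (n)) }.x)

int main(void) {
    float vals[] = { 23.423f, 0.000023f, 290.2342f };
    for (int i = 0; i < 3; i++) {
        float reduced = FP_REDUCE(vals[i], 8);   /* drop the 8 low significand bits */
        printf("%.9g -> %.9g\n", (double) vals[i], (double) reduced);
    }
    return 0;
}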

Very long definition of PI

I'm debugging some old C code and it has a definition #define PI 3.14... where ... is about 50 other digits.
Why is this? I said I could reduce the number to about 16 decimal places but my boss snarled at me, saying that the other digits are there for platform independence and forward compatibility. But will it slow the program down?
No, this will not slow down the program, unless you are running on an incredibly underpowered 1MHz DSP chip that has to do floating point arithmetic in software as opposed to passing it off to a dedicated FPU. This would mean that any mathematical operations that use floating point data are much slower than just using integer arithmetic.
In general, greater precision is only going to introduce a slowdown if the most time-consuming part of your program is doing a lot of calculations in rapid succession, and floating point calculations are especially slow. On a modern CPU, this is generally not the case, with the possible exception of certain chips that cause an 80-cycle stall on things like floating point underflow. That kind of issue likely exceeds the domain of this question.
First, it's better to use a common definition of PI, such as M_PI from <math.h>, where it is defined as #define M_PI 3.14159265358979323846 (strictly speaking M_PI is a POSIX/common extension rather than part of the C standard). If you insist, you can go ahead and define it manually.
Also, the best precision currently available in C is the equivalent of about 19 digits.
According to Wikipedia, the 80-bit "Intel" IEEE 754 extended-precision long double, which is 80 bits padded to 16 bytes in memory, has a 64-bit mantissa with no implicit bit, which gets you 19.26 decimal digits. This has been the almost universal standard for long double for ages, but recently things have started to change.
The newer 128-bit quad-precision format has 112 mantissa bits plus an implicit bit, which gets you 34 decimal digits. GCC implements this as the __float128 type and there is (if memory serves) a compiler option to set long double to it.
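A quick way to see what your own toolchain's long double actually offers (values vary by platform and compiler flags) is to print the <float.h> constants:

#include <float.h>
#include <stdio.h>

int main(void) {
    printf("sizeof(long double) = %zu bytes\n", sizeof(long double));
    printf("LDBL_MANT_DIG       = %d bits\n",   LDBL_MANT_DIG);   /* 64 for x87 extended, 113 for quad */
    printf("LDBL_DIG            = %d digits\n", LDBL_DIG);        /* guaranteed decimal digits */
    return 0;
}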
Personally, if I were required to use our own definition of pi, I'd write something like this:
#ifndef M_PI
#define PI 3.14159265358979323846264338327950288419716939937510
#else
#define PI M_PI
#endif
If the latest C standard supports an even wider floating point primitive data type, it's pretty much a guarantee that constants in the math library would be updated to support this.
References
More Precise Floating point Data Types than double?, Accessed 2014-03-13, <https://stackoverflow.com/questions/15659668/more-precise-floating-point-data-types-than-double>
Math constant PI value in C, Accessed 2014-03-13, <https://stackoverflow.com/questions/9912151/math-constant-pi-value-in-c>
The number of digits in a macro definition almost certainly will have no effect at all on run-time performance.
Macro expansion is textual. That means that if you have:
#define PI 3.14159... /* 50 digits */
then any time you refer to PI in code to which that definition is visible, it will be as if you had written out 3.14159....
C has just three floating-point types: float, double, and long double. Their sizes and precisions are implementation-defined, but they're typically 32 bits, 64 bits, and something wider than 64 bits (the size of long double typically varies more from system to system than the other two do.)
If you use PI in an expression, it will be evaluated as a value of some specific type. And in fact, since there's no suffix on the literal, it will be of type double (an f suffix would make it float, an L suffix long double).
So if you write:
double x = PI / 2.0;
it's as if you had written:
double x = 3.14159... / 2.0;
The compiler will probably evaluate the division at compile time generating a value of type double. Any extra precision in the literal will be discarded.
To see this, you can try writing a small program that uses the PI macro and examining an assembly listing.
For example:
#include <stdio.h>

#define PI 3.141592653589793238462643383279502884198716939937510582097164

int main(void) {
    double x = PI;
    printf("x = %g\n", x);
}
On my x86_64 system, the generated machine code has no reference to the full precision value. The instruction corresponding to the initialization is:
movabsq $4614256656552045848, %rax
where 4614256656552045848 is a 64-bit integer corresponding to the binary IEEE double-precision representation of a number as close as possible to 3.141592653589793238462643383279502884198716939937510582097164.
The actual stored floating-point value on my system happens to be exactly:
3.1415926535897931159979634685441851615905761718750000000000000000
of which only about 16 decimal digits are significant.

fixed point fx notation and converting

I have a value in fx1.15 notation. The underlying integer value is 63183 (register value).
Now, according to Wikipedia, the complete length is 15 bits. The value does not fit inside, right?
So assuming it is a fx1.16 value, how do I convert it to a human readable value?
To convert a fixed-point value into something human-readable, do a floating-point divide by 2 to the number of fractional bits. For example, if there are 15 fractional bits, 2^15 = 32768, so you would use something like this:
int x = <fixed-point-value-in-1.15-format>;
printf("x = %g\n", x / 32768.0);
Now converting fixed-point numbers to floating-point and invoking printf() are expensive operations, and they usually destroy any performance gained by using fixed-point. I presume you are only doing this for diagnostic purposes.
Also, note that if your platform is doing fixed-point because floating-point operations are forbidden or not available, then you'll have to do something different, along the lines of manually doing the decimal conversion. Model the integer as the underlying floating-point value multiplied by 32768 and go from there. There's some useful fixed-point code here.
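If floating point really is unavailable, here is a hedged sketch of that manual route for a signed Q1.15 value, using integer arithmetic only and printing five decimal places:

#include <stdio.h>
#include <stdint.h>

/* Print a signed Q1.15 fixed-point value without using floating point. */
static void print_q15(int16_t raw)
{
    int32_t v = raw;
    if (v < 0) {
        putchar('-');
        v = -v;
    }
    uint32_t ipart = (uint32_t) v >> 15;       /* bits above the 15-bit fraction */
    uint32_t frac  = (uint32_t) v & 0x7fffu;   /* the 15 fraction bits */
    /* Scale the fraction to 5 decimal places, rounding to nearest: frac/32768 * 10^5. */
    uint32_t dec = (uint32_t) (((uint64_t) frac * 100000u + 16384u) / 32768u);
    printf("%u.%05u\n", ipart, dec);
}

int main(void) {
    print_q15(0x4000);            /* prints 0.50000  */
    print_q15((int16_t) -0x2000); /* prints -0.25000 */
    return 0;
}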
p.s. I'm not sure you're still interested in this answer, ashirk, (I wrote it more for others), but if you are, welcome to Stack Overflow!

Why is my float division off by 0.00390625?

float a=67107842,b=512;
float c=a/b;
printf("%lf\n",c);
Why is c 131070.000000 instead of the correct value 131070.00390625?
Your compiler's float type is probably using the 32-bit IEEE 754 single-precision format.
67107842 is a 26-bit binary number:
11111111111111110000000010
The single-precision format represents most numbers as 1.x multiplied by some (positive or negative) power of two, where 23 bits are stored after the binary point, with the leading 1. being implied (very small numbers are an exception).
But 67107842 would require 24 bits after the binary point (to be represented as 1.111111111111111000000001 multiplied by 2^25). As there is only room to store 23 bits, the final 1 gets lost. So it is the value in a that is wrong in this case, not the division - a actually contains 67107840 (11111111111111110000000000), which is exactly 131070 * 512.
You can see this if you print a as well:
printf("%lf %lf %lf\n", a, b, c);
gives
67107840.000000 512.000000 131070.000000
Try changing a and c to be type "double", rather than float. That will give you better precision / accuracy. (Floats have about 6 or so significant digits; doubles have more than twice that.)
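For reference, a minimal sketch of the same calculation with double, whose 53 significand bits hold 67107842 exactly:

#include <stdio.h>

int main(void) {
    double a = 67107842, b = 512;
    double c = a / b;
    printf("%lf\n", c);    /* prints 131070.003906 (default 6 decimal places) */
    printf("%.5f\n", c);   /* prints 131070.00391 */
    return 0;
}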
A float typically uses the 32-bit IEEE-754 single precision representation, and is good for only approximately 6 significant decimal figures. A double is good for 15, and where supported an 80-bit long double gets you about 19 significant figures.
Note that on some compilers there is no distinction between double and long double, or even no support for long double at all.
One solution is to use an arbitrary-precision numeric library, or to use a decimal floating-point library rather than the built-in binary floating point support. Decimal floating point is not intrinsically more precise (though often such libraries support larger, more precise types), but it will not show up the artefacts that occur when displaying a decimal representation of a binary floating point value. Decimal floating point is also likely to be much slower, since it is not typically implemented in hardware.
