Integer to floating conversion in C

float f1 = 123.125;
int i1 = -150;
f1 = i1; // integer to floating conversion
printf("%i assigned to an float produces %f\n", i1, f1);
Output:
-150 assigned to an float produces -150.000000
My question is why the result has 6 zeros (000000) after the . and not 7 or 8 or some number?

That's just what printf does. See the man page where it says
f, F
The double argument shall be converted to decimal notation in the style "[-]ddd.ddd", where the number of digits after the radix character is equal to the precision specification. If the precision is missing, it shall be taken as 6; if the precision is explicitly zero and no '#' flag is present, no radix character shall appear. If a radix character appears, at least one digit appears before it. The low-order digit shall be rounded in an implementation-defined manner.
(emphasis mine)
It has nothing to do with how 150 is represented as a floating point number in memory (and in fact, it's promoted to a double because printf is varargs).

The number of zeros you see is a result of the default precision used by the %f printf conversion. It's basically unrelated to the integer to floating point conversion.

Because the C standard (§7.19.6.1) says that in the absence of information to the contrary, %f will print 6 decimal places.
f,F A double argument representing a floating-point number is converted to
decimal notation in the style [−]ddd.ddd, where the number of digits after
the decimal-point character is equal to the precision specification. If the
precision is missing, it is taken as 6; if the precision is zero and the # flag is
not specified, no decimal-point character appears.

printf is just showing its default number of decimal places. Floating point arithmetic is not always exact, but that is not the issue here: -150 is represented exactly.
From the documentation:
The default number of digits after the
decimal point is six, but this can be
changed with a precision field. If a
decimal point appears, at least one
digit appears before it. The "double"
value is rounded to the correct number
of decimal places.

Related

What's the difference between printf("%.d", 0) and printf("%.1d", 0)?

I'm working on recoding printf and I've been stuck for a while on the precision field. I read that the default precision for the d conversion specifier is 1.
So I supposed that there is no difference between %.d and %.1d, but when I test:
printf(".d =%.d, .1d= %.1d", 0, 0);
I do find one:
.d =, .1d= 0
If you use . after % without specifying the precision, it is set to zero.
From the printf page on cppreference.com:
. followed by integer number or *, or neither that specifies
precision of the conversion. In the case when * is used, the precision
is specified by an additional argument of type int. If the value of
this argument is negative, it is ignored. If neither a number nor *
is used, the precision is taken as zero.
It defaults to 1 if you use %d (without .):
printf("d = %d, 1d= %1d", 0, 0);
# Output: d = 0, 1d= 0
The C18 standard - ISO/IEC 9899:2018 - (emphasis mine) states:
"An optional precision that gives the minimum number of digits to appear for the d, i, o, u, x, and X conversions, the number of digits to appear after the decimal-point character for a, A, e, E, f, and F conversions, the maximum number of significant digits for the g and G conversions, or the maximum number of bytes to be written for s conversions. The precision takes the form of a period (.) followed either by an asterisk * (described later) or by an optional non negative decimal integer; if only the period is specified, the precision is taken as zero. If a precision appears with any other conversion specifier, the behavior is undefined."
Source: C18, §7.21.6.1/4
This means %.d is equivalent to %.0d, and therefore different from %.1d.
Furthermore:
"d,i - The int argument is converted to signed decimal in the style [-]dddd. The precision specifies the minimum number of digits to appear; if the value being converted can be represented in fewer digits, it is expanded with leading zeros. The default precision is 1. The result of converting a zero value with a precision of zero is no characters."
Source: C18, §7.21.6.1/8
That means if you convert a 0 value by using %.d in a printf() call, the result is guaranteed to be no characters printed (which matches your test result).
When the precision is set to zero or its value is omitted, as in
printf( "%.d", x );
then according to the description of the conversion specifiers d and i (7.21.6.1 The fprintf function)
The int argument is converted to signed decimal in the style [−]dddd.
The precision specifies the minimum number of digits to appear; if the
value being converted can be represented in fewer digits, it is
expanded with leading zeros. The default precision is 1. The result
of converting a zero value with a precision of zero is no
characters.
Here is a demonstrative program
#include <stdio.h>
int main(void)
{
printf( "%.d\n", 0 );
printf( "%.0d\n", 0 );
printf( "%.1d\n", 0 );
return 0;
}
Its output is two empty lines followed by
0
That is, when the precision is equal to 0 or is omitted, and 0 is passed as the argument, nothing is output.

Rounding floats in C

While testing the float type and printing it with its format specifier %f, I was testing its rounding behaviour.
I've declared the variable as float and gave it the value 5.123456. As you know float must represent at least 6 significant figures.
I then changed its value to 5.1234567 and printed the value with %f. It baffles me why it prints out as 5.123456. But if I change the variable value to 5.1234568, it prints out as 5.123457. It rounds properly.
If I haven't made myself clear or the explanation is very confusing:
float a = 5.1234567;
printf("%f", a);
// prints out as 5.123456
a = 5.1234568;
printf("%f", a);
// prints out as 5.123457
I've compiled using CodeBlocks and MinGW, same result.
OP is experiencing the effects of double rounding.
First, the values 5.123456, 5.1234567, etc. are rounded by the compiler to the closest representable float. Then printf() is rounding the float value to the closest 0.000001 decimal textual representation.
I've declared the variable as float and gave it the value 5.123456. As you know float must represent at least 6 significant figures.
A float can represent about 2^32 different values. 5.123456 is not one of them. The closest value a typical float can represent is 5.12345600128173828125 and that is correct for 6 significant digits: 5.12345...
float x = 5.123456f;
// 5.123455524444580078125 representable float just smaller than 5.123456
// 5.123456 OP's code
// 5.12345600128173828125 representable float just larger than 5.123456 (best)
// The following prints 7 significant digits
// %f prints 6 places after the decimal point.
printf("%f", 5.123456f); // --> 5.123456
With 5.1234567, the closest float has an exact value of 5.123456478118896484375. When using "%f", this is expected to print rounded to the closest 0.000001, that is 5.123456.
float x = 5.1234567f;
// 5.123456478118896484375 representable float just smaller than 5.1234567 (best)
// 5.1234567 OP's code
// 5.1234569549560546875 representable float just larger than 5.1234567
// %f prints 6 places after the decimal point.
printf("%f", 5.1234567f); // --> 5.123456
Significant digits is not the number of digit after the decimal point. It is the number of digits starting with the left-most (most significant) digit.
To print a float to 6 significant figures, use "%.*e".
See Printf width specifier to maintain precision of floating-point value for more details.
float x = 5.1234567;
printf("%.*e\n", 6 - 1, x); // 5.12346e+00
// x xxxxx 6 significant digits
There is no exact float representation for the number 5.1234567 you intend to show here.
If you check here:
https://www.h-schmidt.net/FloatConverter/IEEE754.html
You can see that this number is converted into 5.1234565, or the double 5.1234564781188965, and this rounds down,
while the number 5.1234568 is stored as the nearest float, which as a double is 5.123456954956055, and this rounds up.
There are two levels of rounding going on:
Your constant of 5.1234567 gets rounded to the nearest value which can be represented by a float (5.123456478...).
The float gets rounded to 6 digits when printed.
It will become obvious if you print the value with more digits.
What it comes down to is that the mantissa of a float has 23 bits and this is not the same as 6 decimal digits (or any number of digits really). Even some apparently simple values like 0.1 don't have an exact float representation.

How big of a number can you store in double and float in c?

I am trying to figure out exactly how big a number I can store in a float and in a double. But they do not store values the way I expected, unlike the integer type. A double should hold 8 bytes of information, which is enough to hold variable a, but it does not hold it correctly: it shows 1234567890123456768, in which the last 2 digits are different. And when I store 2147483648, or change its last digit, in the float variable b, it shows the same value 2147483648, which is supposed to be the limit. So what's going on?
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a;
    float b;
    int c;
    a = 1234567890123456789;
    b = 2147483648;
    c = 2147483647;
    // sizeof yields a size_t, which is printed with %zu rather than %d
    printf("Bytes of double: %zu\n", sizeof(double));
    printf("Bytes of integer: %zu\n", sizeof(int));
    printf("Bytes of float: %zu\n", sizeof(float));
    printf("\n");
    printf("You can count up to %.0f in 4 bytes\n", pow(2,32));
    printf("You can count up to %.0f with + or - sign in 4 bytes\n", pow(2,31));
    printf("You can count up to %.0f in 8 bytes\n", pow(2,64));
    printf("You can count up to %.0f with + or - sign in 8 bytes\n", pow(2,63));
    printf("\n");
    printf("double number: %.0f\n", a);
    printf("floating point: %.0f\n", b);
    printf("integer: %d\n", c);
    return 0;
}
The answer to the question of what is the largest (finite) number that can be stored in a floating point type would be FLT_MAX or DBL_MAX for float and double, respectively.
However, that doesn't mean that the type can precisely represent every smaller number or integer (in fact, not even close).
First you need to understand that not all bits of a floating point number are “equal”. A floating point number has an exponent (8 bits in IEEE-754 standard float, 11 bits in double), and a mantissa (23 and 52 bits in float, and double respectively). The number is obtained by multiplying the mantissa (which has an implied leading 1-bit and binary point) by 2^exponent (after normalizing the exponent; its binary value is not used directly). There is also a separate sign bit, so the following applies to negative numbers as well.
As the exponent changes, the distance between consecutive values of the mantissa changes as well, i.e., the greater the exponent, the further apart consecutive representable values of the floating point number are. Thus you may be able to store one number of a given magnitude precisely, but not the “next” number. One should also remember that some seemingly simple fractions can not be represented precisely with any number of binary digits (e.g., 1/10, one tenth, is an infinitely repeating sequence in binary, like 1/3, one third, is in decimal).
When it comes to integers, you can precisely represent every integer up to 2^(mantissa_bits + 1) magnitude. Thus an IEEE-754 float can represent all integers up to 2^24 and a double up to 2^53 (in the last half of these ranges the consecutive floating point values are exactly one integer apart, since the entire mantissa is used for the integer part only). There are individual larger integers that can be represented, but they are spaced more than one integer apart, i.e., you can represent some integers greater than 2^(mantissa_bits + 1) but every integer only up to that magnitude.
For example:
float f = powf(2.0f, 24.0f);
float f1 = f + 1.0f, f2 = f1 + 2.0f;
double d = pow(2.0, 53.0);
double d1 = d + 1.0, d2 = d + 2.0;
(void) printf("2**24 float = %.0f, +1 = %.0f, +2 = %.0f\n", f, f1, f2);
(void) printf("2**53 double = %.0f, +1 = %.0f, +2 = %.0f\n", d, d1, d2);
Outputs:
2**24 float = 16777216, +1 = 16777216, +2 = 16777218
2**53 double = 9007199254740992, +1 = 9007199254740992, +2 = 9007199254740994
As you can see, adding 1 to 2^(mantissa_bits + 1) makes no difference since the result is not representable, but adding 2 does produce the correct answer (as it happens, at this magnitude the representable numbers are two integers apart since the multiplier has doubled).
 
TL;DR An IEEE-754 float can precisely represent all integers up to 2^24 and double up to 2^53, but only some integers of greater magnitude (the spacing of representable values depends on the magnitude).
sizeof(double) is 8, true, but double needs some bits to store the exponent part as well.
Assuming IEEE-754 is used, double can represent integers precisely only up to 2^53, which is less than 1234567890123456789.
See also Double-precision floating-point format.
You can use these constants to find the limits:
FLT_MAX
DBL_MAX
LDBL_MAX
From CPP reference
You can print the actual limits of the standard types by printing the constants defined in the 'limits.h' header (integer types) and the 'float.h' header (floating types). For C++ the equivalent is 'std::numeric_limits'.
Hardware cannot represent real numbers with unlimited precision; it represents a floating type with a fixed number of bits. Since you don't have an infinite length for floating types, a double can only hold a value to a specific precision. Most hardware uses the IEEE-754 standard to represent floating types.
To get more precision you could try 'long double' (depending on the hardware this may be wider than double, up to quadruple precision), AVX/SSE registers, big-num libraries, or you could implement it yourself.
The sizeof an object only reports the memory space it occupies. It does not show the valid range. It would be quite possible to have an unsigned int with e.g. 2**16 (65536) possible values occupy 32 bits in memory.
For floating point objects, it is more difficult. They consist of (simplified) two fields: an integer mantissa and an exponent (see details in the linked article). Both with a fixed width.
As the mantissa only has a limited range, trailing bits are truncated or rounded and the exponent is corrected, if required. This is one reason one should never use floating point types to store precise values like currency.
In decimal (note: computers use binary representation) with 4 digit mantissa:
1000 --> 1.000e3
12345678 --> 1.234e7
The parameters for your implementation are defined in float.h, similar to limits.h which provides parameters for integers.
On Linux, #include <values.h> (a legacy header; the standard <float.h> is also available)
On Windows, #include <float.h>
There is a fairly comprehensive list of defines

Why aren't the rightmost digits zeros (C/Linux)?

If you print a float with more precision than is stored in memory, aren't the extra places supposed to have zeros in them? I have code that is something like this:
double z[2*N] = {0};
...
for( n=1; n<=2*N; n++) {
fprintf( u1, "%.25g", z[n-1]);
fputc( n<2*N ? ',' : '\n', u1);
}
Which is creating output like this:
0,0.7071067811865474617150085,....
A float should have only 17 decimal places (right? Don't 53 bits come out to 17 decimal places?). If that's so, then the 18th, 19th... 25th places should have zeros. Notice in the above output that they have digits other than 0 in them.
Am I misunderstanding something? If so, what?
No. 53 bits means that about the first 17 significant decimal digits are what you can trust, but the double is stored in binary while we print it in base 10, and a binary fraction generally needs many more decimal digits to be written out exactly. The later digits are there because 1/2^53 is not 1/10^n for any n, i.e.,
1/2^53 = .0000000000000001110223024625156540423631668090820312500000000
The string printed by your implementation shows the exact value of the double in your example, and this is permitted by the C standard, as I show below.
First, we should understand what the floating-point object represents. The C standard does a poor job of this, but, presuming your implementation uses the IEEE 754 floating-point standard, a normal floating-point object represents exactly (-1)^s • 2^e • (1+f) for some sign bit s (0 or 1), exponent e (in range for the specific type, -1022 to 1023 for double), and fraction f (also in range, 52 bits after a radix point for double). Many people use the object to approximate nearby values, but, according to the standard, the object only represents the one value it is defined to be.
The value you show, 0.7071067811865474617150085, is exactly representable as a double (sign bit 0, exponent -1, and fraction bits [in hexadecimal] .6a09e667f3bcc). It is important to understand the double with this value represents exactly that value; it does not represent nearby values, such as 0.707106781186547461715.
Now that we know the value being passed to fprintf, we can consider what the C standard says about this. First, the C standard defines a constant named DECIMAL_DIG. C 2011 5.2.4.2.2 11 defines this to be the number of decimal digits such that any floating-point number in the widest supported type can be rounded to that many decimal digits and back again without change to the value. The precision you passed to fprintf, 25, is likely greater than the value of DECIMAL_DIG on your system.
In C 2011 7.21.6.1 13, the standard says “If the number of significant decimal digits is more than DECIMAL_DIG but the source value is exactly representable with DECIMAL_DIG digits, then the result should be an exact representation with trailing zeros. Otherwise, the source value is bounded by two adjacent decimal strings L < U , both having DECIMAL_DIG significant digits; the value of the resultant decimal string D should satisfy L ≤ D ≤ U, with the extra stipulation that the error should have a correct sign for the current rounding direction.”
This wording allows the compiler some wiggle room. The intent is that the result must be accurate enough that it can be converted back to the original double with no error. It may be more accurate, and some C implementations will produce the exactly correct value, which is permitted since it satisfies the paragraph above.
Incidentally, the value you show is not the double closest to sqrt(2)/2. That value is +0x1.6A09E667F3BCDp-1 = 0.70710678118654757273731092936941422522068023681640625.
There is enough precision to represent 0.7071067811865474617150085 in double precision floating point. The 64 bit output is actually 3FE6A09E667F3BCC
The formula used to evaluate the number is an exponentiation, so you cannot say that 53 bits will take 17 decimal places.
EDIT:
Look at the example below in the wiki article for another instance:
0.333333333333333314829616256247390992939472198486328125
=2^(−54) × 15 5555 5555 5555 base16
=2^(−2) × (15 5555 5555 5555 base16 × 2^(−52) )
You are asking about float, but your code uses double.
Anyway, neither float nor double always has the same number of decimal digits. A float is assigned 32 bits (4 bytes) in the IEEE 754 floating point representation.
From Wikipedia:
The IEEE 754 standard specifies a binary32 as having:
Sign bit: 1 bit
Exponent width: 8 bits
Significand precision: 24 (23 explicitly stored)
This gives from 6 to 9 significant decimal digits precision (if a
decimal string with at most 6 significant decimal is converted to IEEE
754 single precision and then converted back to the same number of
significant decimal, then the final string should match the original;
and if an IEEE 754 single precision is converted to a decimal string
with at least 9 significant decimal and then converted back to single,
then the final number must match the original).
In the case of double, from Wikipedia again:
Double-precision binary floating-point is a commonly used format on
PCs, due to its wider range over single-precision floating point, in
spite of its performance and bandwidth cost. As with single-precision
floating-point format, it lacks precision on integer numbers when
compared with an integer format of the same size. It is commonly known
simply as double. The IEEE 754 standard specifies a binary64 as
having:
Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)
This gives from 15 - 17 significant
decimal digits precision. If a decimal string with at most 15
significant decimal is converted to IEEE 754 double precision and then
converted back to the same number of significant decimal, then the
final string should match the original; and if an IEEE 754 double
precision is converted to a decimal string with at least 17
significant decimal and then converted back to double, then the final
number must match the original.
On the other hand, you can't expect that if you print a float with more precision than is really stored, the remaining digits will be filled with 0s. The compiler can't guess what you intend.

What is p-notation in C programming?

I'm learning C right now and there is a conversion specifier %a which writes a number in p-notation as opposed to %e which writes something in e-notation (exponential notation).
What is p-notation?
You use %a to get a hexadecimal representation of a floating-point number. This might be useful if you are a student learning floating-point representations, or if you want to be able to read and write an exact floating-point number with no rounding error (but not very human-readable).
This format specifier, along with many others, was added as part of the C99 standard. Dinkumware have an excellent C99 library reference free online; it's PJ Plauger's company, and he had a lot to do with both C89 and C99 standard libraries. Link above is to printing functions; the general library reference is http://www.dinkumware.com/manuals/default.aspx
Here is an extract from the c99 standard, section 7.19.6.1 (7) which shows the details for %a or %A (similar to the mac details given by dmckee above):
A double argument representing a
floating-point number is converted in
the style [−]0xh.hhhhp±d, where there
is one hexadecimal digit (which is
nonzero if the argument is a
normalized floating-point number and
is otherwise unspecified) before the
decimal-point character and the
number of hexadecimal digits after it
is equal to the precision; if the
precision is missing and FLT_RADIX is
a power of 2, then the precision is
sufficient for an exact representation
of the value; if the precision is
missing and FLT_RADIX is not a power
of 2, then the precision is sufficient
to distinguish values of type
double, except that trailing zeros may
be omitted; if the precision is zero
and the # flag is not specified, no
decimal-point character appears. The
letters abcdef are used for a
conversion and the letters ABCDEF for
A conversion. The A conversion
specifier produces a number with X and
P instead of x and p. The exponent
always contains at least one digit,
and only as many more digits as
necessary to represent the decimal
exponent of 2. If the value is zero,
the exponent is zero.
From the printf(3) man page on my Mac OS X box (therefore the BSD c standard library implementation):
aA
The double argument is rounded and converted to hexadecimal
notation in the style [-]0xh.hhhp[+-]d, where the number of digits
after the hexadecimal-point character is equal to the precision
specification. If the precision is missing, it is taken as
enough to represent the floating-point number exactly, and no
rounding occurs. If the precision is zero, no hexadecimal-point
character appears. The p is a literal character `p', and the
exponent consists of a positive or negative sign followed by a
decimal number representing an exponent of 2. The A conversion
uses the prefix ``0X'' (rather than ``0x''), the letters
``ABCDEF'' (rather than ``abcdef'') to represent the hex digits,
and the letter `P' (rather than `p') to separate the mantissa and
exponent.
The 'p' (or 'P') serves to separate the (hexadecimal) mantissa from the exponent, which is a power of 2 written in decimal.
These specifiers are not in my K&R, and the man page is not specific about what standard (if any) specifies them.
I just checked my Debian 5.0 box (using glibc 2.7) which also has it; that man page says that it is c99 related (again, no reference to any particular standard).
This might be useful: http://www.cppreference.com/wiki/c/io/printf
Specifically, here are the format specifiers you can use in printf (w/o modifiers like .02 etc):
Code    Format
%c      character
%d      signed integers
%i      signed integers
%I64d   long long (8B integer), MS-specific
%I64u   unsigned long long (8B integer), MS-specific
%e      scientific notation, with a lowercase “e”
%E      scientific notation, with an uppercase “E”
%f      floating point
%g      use %e or %f, whichever is shorter
%G      use %E or %f, whichever is shorter
%o      octal
%s      a string of characters
%u      unsigned integer
%x      unsigned hexadecimal, with lowercase letters
%X      unsigned hexadecimal, with uppercase letters
%p      a pointer
%n      the argument shall be a pointer to an integer into which is placed the number of characters written so far
There is no %a format specifier (as far as I'm aware, and certainly not in any of the common implementations).
There is a %p format specifier which prints a pointer address.
Ref.
UPDATE: please see other posts.
