Different rounding between assignment and printf - c

I have a program with two variables of type int.
int num;
int other_num;
/* change the values of num and other_num with conditional increments */
printf ("result = %f\n", (float) num / other_num);
float r = (float) num / other_num;
printf ("result = %f\n", r);
The value written in the first printf is different from the value written by the second printf (by 0.000001, when printed with 6 decimal places).
Before the division, the values are:
num = 10201
other_num = 2282
I've printed the resulting numbers to 15 decimal places. Those numbers diverge in the 7th decimal place which explains the difference in the 6th one.
Here are the numbers with 15 decimal places:
4.470201577563540
4.470201492309570
I'm aware of floating point rounding issues, but I was expecting the calculated result to be the same when performed in the assignment and in the printf argument.
Why is this expectation incorrect?
Thanks.

Probably because FLT_EVAL_METHOD is something other than 0 on your system.
In the first case, the expression (float) num / other_num has nominal type float, but is possibly evaluated at higher precision (if you're on x86, probably long double). It's then converted to double for passing to printf.
In the second case, you assign the result to a variable of type float, which forces dropping of excess precision. The float is then promoted to double when passed to printf.
Of course without actual numbers, this is all just guesswork. If you want a more definitive answer, provide complete details on your problem.
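The effect described above can be probed with a small sketch. The helper names (eval_method, quotient_as_float, quotient_as_double) are made up for illustration, and the volatile store is just one common way to force a result down to true float precision; this is a way to experiment, not a diagnosis of the asker's system.

```c
#include <float.h>

/* Hypothetical helper: reports how floating-point expressions are evaluated
   (0 = in the declared type, 1 = float/double evaluated as double,
   2 = everything evaluated as long double, -1 = indeterminable). */
int eval_method(void)
{
    return FLT_EVAL_METHOD;
}

/* Hypothetical helper: the quotient rounded to float.
   The volatile store makes the compiler drop any excess precision. */
float quotient_as_float(int num, int other_num)
{
    volatile float r = (float) num / other_num;
    return r;
}

/* Hypothetical helper: the same quotient carried out in double. */
double quotient_as_double(int num, int other_num)
{
    return (double) num / other_num;
}
```

Printing both quotients for 10201 and 2282 with %.15f should reproduce the two values quoted in the question; if eval_method() returns something other than 0, the first printf in the question may have printed the wider result.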

The point is where the result of the expression actually lives during the execution of the program. C values can live in memory (which includes caches) or just in registers, if the compiler decides that this kind of optimization is possible in the specific case.
In the first printf, the expression result is kept in a register, as the value is only used within the same C statement, so the compiler decides (correctly) that it would be useless to store it anywhere else; as a result, the value is held as a double or long double, depending on the architecture.
In the second case, the compiler did not perform such an optimization: the value is stored in a variable on the stack, which is memory, not a register; the value is therefore rounded to float precision (a 24-bit significand, of which 23 bits are stored explicitly).
More examples are provided by streflop and its documentation.

Related

Is it defined what will happen if you shift a float?

I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning it to ry (also a float). This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float. The cast to an int is UB for float with a whole number value outside the [INT_MIN ... INT_MAX] range.
So code is UB for about 38% of all typical float - the large valued ones, NaNs and infinities.
For typical float, a cast to int128_t is defined for nearly all float.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
#include <math.h>

// round the y position to the nearest 64th value
float round_to_64th(float x) {
    if (isfinite(x)) {
        float ipart;
        // The modf functions break the argument value into integral and fractional parts
        float frac = modff(x, &ipart);
        x = ipart + roundf(frac*64)/64;
    }
    return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate to the nearest multiple of 64 or 2⁶.
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
A shift-like operation on a floating-point value can be emulated by scaling it by a power of two:
Left shift:
ShiftFloat(py,6,1);
Right shift:
ShiftFloat(py,6,0);
float ShiftFloat(float x, int count, int ismultiplication)
{
    /* Multiply by 2 per step for a "left shift", by 0.5 per step for a "right shift" */
    const float factor = ismultiplication ? 2.0f : 0.5f;
    float value = x;
    for (int i = 0; i < count; ++i)
    {
        value *= factor;
    }
    return value;
}
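The standard library already offers this kind of power-of-two scaling through ldexpf from <math.h>. The two wrapper names below are hypothetical and only there to mirror the left/right wording used above.

```c
#include <math.h>

/* Hypothetical wrapper: x * 2^count, the floating-point analogue of x << count */
float shift_like_left(float x, int count)
{
    return ldexpf(x, count);
}

/* Hypothetical wrapper: x / 2^count, the floating-point analogue of x >> count */
float shift_like_right(float x, int count)
{
    return ldexpf(x, -count);
}
```

Unlike an integer right shift, this keeps the fractional part: shift_like_right(96.0f, 6) yields 1.5f rather than 1.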

Size of a float variable and compilation

I'm struggling to understand the behavior of gcc here. A float is 4 bytes on my architecture, but I can still store an 8-byte real value in a float, and my compiler says nothing about it.
For example, I have:
#include <stdio.h>
int main(int argc, char** argv){
float someFloatNumb = 0xFFFFFFFFFFFF;
printf("%i\n", sizeof(someFloatNumb));
printf("%f\n", someFloatNumb);
printf("%i\n", sizeof(281474976710656));
return 0;
}
I expected the compiler to insult me, or display a disclaimer of some sort, because I shouldn't be able to do something like that; at least, I think it's some kind of twisted wizardry.
The program simply runs:
4
281474976710656.000000
8
So, if I print the size of someFloatNumb, I get 4 bytes, which is expected. But the stored value isn't what I expected, as the output shows.
So I have a few questions:
Does sizeof(variable) simply get the variable type and return sizeof(type), which in this case would explain the result?
Does/Can gcc grow the capacity of a type? (managing multiple variables behind the curtains to allow us that sort of things)
1)
Does sizeof(variable) simply get the variable type and return sizeof(type), which in this case would explain the result ?
Except for variable-length arrays, sizeof doesn't evaluate its operand. So yes, all it cares about is the type. So sizeof(someFloatNumb) is equivalent to sizeof(float), which is 4. This explains printf("%i\n", sizeof(someFloatNumb));.
2)
[..] But I can still store a 8 bytes real value in a float, and my compiler says nothing about it.
Does/Can gcc grow the capacity of a type ? (managing multiple variables behind the curtains to allow us that sort of things)
No. Capacity doesn't grow. You simply misunderstood how floats are represented/stored. sizeof(float) being 4 doesn't mean it can't store values larger than 2^32 (assuming 1 byte == 8 bits). See Floating point representation.
The maximum value a float can represent is defined by the constant FLT_MAX (see <float.h>). sizeof(someFloatNumb) simply yields how many bytes the object (someFloatNumb) takes up in memory, which doesn't directly determine the range of values it can represent.
This explains why printf("%f\n", someFloatNumb); prints the value as expected (and there's no automatic "capacity growth").
3)
printf("%i\n", sizeof(281474976710656));
This is slightly more involved. As said before in (1), sizeof only cares about the type here. But the type of 281474976710656 is not necessarily int.
The C standard defines the type of an unsuffixed decimal integer constant as the first of int, long int, and long long int that can represent its value. See https://stackoverflow.com/a/42115024/1275169 for an explanation.
On my system 281474976710656 can't be represented in an int, so its type is long int, which is likely to be the case on your system as well. So what you see is essentially equivalent to sizeof(long).
There's no portable way to determine the type of integer constants. But since you are using gcc, you could use a little trick with typeof:
typeof(281474976710656) x;
printf("%s", x); /* deliberately using '%s' to generate warning from gcc. */
generates:
warning: format ‘%s’ expects argument of type ‘char *’, but argument 2
has type ‘long int’ [-Wformat=]
printf("%s", x);
P.S.: sizeof yields a size_t, for which the correct format specifier is %zu. So that's what you should be using in your 1st and 3rd printf statements.
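Where a C11 compiler is available, _Generic offers an alternative to the typeof trick that does not rely on a compiler warning. The macro and function names below (TYPE_NAME, show_constant_type) are made up for this sketch, and only a handful of types are listed.

```c
#include <stdio.h>

/* Hypothetical macro: maps an expression to the name of its type.
   Types not listed here are reported as "other". */
#define TYPE_NAME(x) _Generic((x), \
    int: "int", \
    long: "long", \
    long long: "long long", \
    unsigned int: "unsigned int", \
    unsigned long: "unsigned long", \
    unsigned long long: "unsigned long long", \
    float: "float", \
    double: "double", \
    default: "other")

/* Hypothetical helper: prints the type and size of the constant from the question */
void show_constant_type(void)
{
    /* sizeof yields a size_t, so %zu is the matching conversion */
    printf("%s, %zu bytes\n",
           TYPE_NAME(281474976710656), sizeof(281474976710656));
}
```

Like sizeof, _Generic does not evaluate its controlling expression; it only looks at the type.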
This doesn't store "8 bytes" of data, that value gets converted to an integer by the compiler, then converted to a float for assignment:
float someFloatNumb = 0xFFFFFFFFFFFF; // 6 bytes of data
Since float can represent large values, this isn't a big deal, but you will lose a lot of precision if you're only using 32-bit floats. Notice there's a slight but important difference here:
float value = 281474976710656.000000;
int value = 281474976710655;
This is because float becomes an approximation when it runs out of precision.
Capacities don't "grow" for standard C types. You'll have to use a "bignum" library for that.
But I can still store a 8 bytes real value in a float, and my compiler says nothing about it.
That's not what's happening.
float someFloatNumb = 0xFFFFFFFFFFFF;
0xFFFFFFFFFFFF is an integer constant. Its value, expressed in decimal, is 281474976710655, and its type is probably either long or long long. (Incidentally, that value can be stored in 48 bits, but most systems don't have a 48-bit integer type, so it will probably be stored in 64 bits, of which the high-order 16 bits will be zero.)
When you use an expression of one numeric type to initialize an object of a different numeric type, the value is converted. This conversion doesn't depend on the size of the source expression, only on its numeric value. For an integer-to-float conversion, the result is the closest representation to the integer value. There may be some loss of precision (and in this case, there is). Some compilers may have options to warn about loss of precision, but the conversion is perfectly valid so you probably won't get a warning by default.
Here's a small program to illustrate what's going on:
#include <stdio.h>
int main(void) {
long long ll = 0xFFFFFFFFFFFF;
float f = 0xFFFFFFFFFFFF;
printf("ll = %lld\n", ll);
printf("f = %f\n", f);
}
The output on my system is:
ll = 281474976710655
f = 281474976710656.000000
As you can see, the conversion has lost some precision. 281474976710656 is an exact power of two, and floating-point types generally can represent those exactly. There's a very small difference between the two values because you chose an integer value that's very close to one that can be represented exactly. If I change the value:
#include <stdio.h>
int main(void) {
long long ll = 0xEEEEEEEEEEEE;
float f = 0xEEEEEEEEEEEE;
printf("ll = %lld\n", ll);
printf("f = %f\n", f);
}
the apparent loss of precision is much larger:
ll = 262709978263278
f = 262709979381760.000000
0xFFFFFFFFFFFF == 281474976710655
If you init a float with that value, it will end up being
0xFFFFFFFFFFFF +1 == 0x1000000000000 == 281474976710656 == 1<<48
That fits easily in a 4-byte float: simple mantissa, small exponent.
It does however NOT store the correct value (one lower) because that IS hard to store in a float.
Note that the " +1" does not imply incrementation. It ends up one higher because the representation can only get as close as off-by-one to the attempted value. You may consider that "rounding up to the next power of 2 multiplied by whatever the mantissa can store". The mantissa, by the way, is usually interpreted as a fraction between 0 and 1.
Getting closer would indeed require the 48 bits of your initialisation in the mantissa, plus however many bits are used to store the exponent, and maybe a few more for other details.
Look at the value printed: 0xFFFF...FFFF is an odd value, but the value printed in your example is even. You are feeding the float variable an integer value that is converted to float. The conversion loses precision, as expected for the value used, which doesn't fit in the 23 bits reserved for the target variable's mantissa. You finally get an approximation, which is the value 0x1000000....0000 (the next value, which is the closest representable value to the one you used, as @Yunnosch posted in his answer).

Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.
There is indeed no difference between the two in principle, but compilers take some liberties when it comes to floating-point computations. For example, compilers are free to use higher precision for intermediate results, but higher still means different, so the results may vary.
Some compilers provide switches to always drop the extra precision and convert all intermediate results to the prescribed floating-point type (say 64-bit double-precision numbers). This will make the code slower, however.
Specifically, the number 45.33 cannot be represented exactly as a floating-point value (it's a periodic number when expressed in binary, so it would require an infinite number of bits). When you multiply this value by 100 you may not get an integer, but something very close to one (just below or just above).
Conversion or a cast to int is performed by truncation, so something very close to 4533 but below it becomes 4532, while something just above becomes 4533, even if the difference is incredibly tiny.
To avoid problems, be sure to account for numeric accuracy. If you are doing a computation that depends on exact values of floating-point numbers, then you're using the wrong tool.
@6502 has given you the theory; here's how to look at things experimentally:
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
On your machine, the multiplication is evidently done with 64-bits of precision, and I would expect that v100 will print as 4532.999....

How does ' %f ' work in C?

Hey, I need to know how %f works, that is, how
printf("%f",number);
extracts a floating-point number from the series of bits in number.
Consider the code:
main()
{
int i=1;
printf("\nd %d\nf %f",i,i);
}
Output is :
d 1
f -0.000000
So ultimately it doesn't depend on the variable 'i', but just on the usage of %d and %f (or whatever). I just need to know how %f extracts the float number corresponding to the series of bits in 'i'.
To all those who misunderstood my question: I know that %f can't be used with an integer and would load garbage values if the size of an integer was smaller than that of a float. In my case, int and float are both 4 bytes.
Let me be clear: if the value of i is 1, then the corresponding binary value of i will be this:
0000 0000 0000 0000 0000 0000 0000 0001 [32 bits]
How would %f extract -0.0000, as in this case, from this series of bits? (How does it know where to put the decimal point, etc.? I can't find it in IEEE 754.)
[PLEASE DO CORRECT ME IF I AM WRONG IN MY EXPLANATION OR ASSUMPTION]
It's undefined behavior to use "%f" to an int, so the answer to your question is: you don't need to know, and you shouldn't do it.
The output depends on the format specifier (like "%f") rather than on the type of the argument i because variadic functions (like printf() or scanf()) have no way of knowing the types of the arguments in the variable part.
As others have said, giving mismatched "%" specifier and arguments is undefined behavior, and, according to the C standard, anything can happen.
What does happen, in this case, on most modern computers, is this:
printf looks at the place in memory where the data should have been, interprets whatever data it finds there as a floating-point number, and prints that number.
Since printf is a function that can take a variable number of arguments, all floats are converted to doubles before being sent to the function, so printf expects to find a double, which (on normal modern computers) is 64 bits. But you send an int, which is only 32 bits, so printf will look at the 32 bits from the int, and 32 more bits of garbage that just happened to be there. When you tried this, it seems that the combination was a bit pattern corresponding to the double floating-point value -0.0.
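The promotion of float to double can be seen in any variadic function, not just printf. In this sketch, sum_values is a made-up function that, like printf, has to be told what to expect and then reads every variable argument as a double.

```c
#include <stdarg.h>

/* Hypothetical variadic function: sums `count` floating-point arguments.
   A float passed through "..." arrives as a double, so va_arg must ask
   for double; asking for any other type would be undefined behavior. */
double sum_values(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(ap, double);
    }
    va_end(ap);
    return total;
}
```

A call such as sum_values(2, 1.5f, 2.5) works because the float argument is promoted before the call; passing an int there instead would be the same mistake as printf("%f", i).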
Well.
It's easy to see how an integer can be packed into bytes, but how do you represent decimals?
The simplest technique is fixed point: of the n bits, the first m are before the point and the rest after. This is not a very good representation, however. Bits are wasted on some numbers, and it has uniform precision, while in real life, most desired decimals are between 0 and 1.
Enter floating point. The IEEE 754 spec defines a way of interpreting bits that has, since then, been almost universally accepted. It has very high near-zero precision, is compact, expandable and allows for very large numbers as well.
The linked articles are a good read.
You can output a floating-point number (float x;) manually by treating the value as a "black box" and extracting the digits one-by-one.
First, check if x < 0. If so, output a minus-sign - and negate the number. Now we know that it is positive.
Next, output the integer portion. Assign the floating-point number to an integer variable, which will truncate it, ie. int integer = x;. Then determine how many digits there are using the base-10 logarithm log10(). Note, log10(0) is undefined, so you'll have to handle zero as a special case. Then iterate from 0 up to the number of digits, each time dividing by 10^digit_index to move the desired digit into the unit's position, and take the 10-residue (modulus).
for (i=digits; i>=0; i--)
    dig = (int)(integer / pow(10,i)) % 10; /* pow() returns a double, so convert before applying % */
Then, output the decimal point ..
For the fractional part, subtract the integer from the original (absolute-value, remember) floating-point number. And output each digit in a similar way, but this time multiplying by 10^frac_digits. You won't be able to predict the number of significant fractional digits this way, so just use a fixed precision (constant number of fractional digits).
I have C code to fill a string with the representation of a floating-point number here, although I make no claims as to its readability.
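Independently of the code linked above, the steps can be sketched as follows. This version swaps the log10()/pow() digit extraction for repeated division by 10, which avoids floating-point powers; format_fixed is a hypothetical name, fractional digits are truncated rather than rounded, and the function assumes x is finite and small enough in magnitude to fit in an unsigned long.

```c
#include <stddef.h>

/* Hypothetical helper: writes x in fixed notation with frac_digits fractional
   digits into buf and returns the length written.
   Assumptions: x is finite, |x| fits in unsigned long, and buf has room for
   at least 23 + frac_digits characters. */
size_t format_fixed(double x, int frac_digits, char *buf)
{
    size_t n = 0;
    char tmp[24];
    int t = 0;

    if (x < 0) {                        /* sign first, then work on |x| */
        buf[n++] = '-';
        x = -x;
    }

    unsigned long ipart = (unsigned long) x;   /* conversion truncates */
    double frac = x - (double) ipart;

    /* Integer digits come out least significant first, so buffer and reverse */
    do {
        tmp[t++] = (char) ('0' + ipart % 10);
        ipart /= 10;
    } while (ipart != 0);
    while (t > 0) {
        buf[n++] = tmp[--t];
    }

    buf[n++] = '.';

    /* Each multiplication by 10 moves the next fractional digit in front of the point */
    for (int i = 0; i < frac_digits; i++) {
        frac *= 10.0;
        int d = (int) frac;
        buf[n++] = (char) ('0' + d);
        frac -= d;
    }

    buf[n] = '\0';
    return n;
}
```

Because the digits are produced by truncation, a value that is not exactly representable in binary may print one unit lower in the last place than printf would show.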
IEEE formats store the number as a normalized binary fraction. It's more similar to scientific notation, like 3.57×10^2 instead of 357.0. So it is stored as an exponent-mantissa pair. Being "normalized" means there's actually an implicit additional 1 bit at the front of the mantissa that is not stored. Hopefully that's enough to help you understand a more detailed description of the format from elsewhere.
Remember, we're in binary, so there's no "decimal point". And with the exponent-mantissa notation, there isn't even a binary point in the format. It's implicitly represented in the exponent.
On the tangentially-related issue of passing floats to printf, remember that this is a variadic function. So it does not declare types of arguments that it receives, and all arguments passed undergo automatic conversions. So, float will automatically promote to double. So what you're doing is (substituting hex for brevity), passing 2 64-bit values:
double f, double f
0xabcdefgh 0xijklmnop 0xabcdefgh 0xijklmnop
Then you tell printf to interpret this sequence of words as an int followed by a double. So the 32-bit int seen by printf is only the first half of the floating-point number, and then the floating-point number seen by printf has its words reversed. The fourth word is never used.
To get the integer representation, you'll need to use type-punning with a pointer.
printf("%d %f\n", *(int *)&f, f);
Which reads (from right-to-left): take the address of the float, treat it as a pointer-to-int, follow the pointer.
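Reading a float through an int pointer conflicts with C's aliasing rules, so a frequently used alternative is to copy the bytes with memcpy. The sketch assumes float is a 32-bit IEEE-754 type; float_bits is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw bit pattern of a float.
   Assumes sizeof(float) == sizeof(uint32_t). */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy the object representation */
    return bits;
}
```

The result can then be printed with a matching specifier, for example by casting to unsigned long and using %lx.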

Wrong output from printf of a number

int main()
{
double i=4;
printf("%d",i);
return 0;
}
Can anybody tell me why this program gives output of 0?
When you create a double initialised with the value 4, its 64 bits are filled according to the IEEE-754 standard for double-precision floating-point numbers. A float is divided into three parts: a sign, an exponent, and a fraction (also known as a significand, coefficient, or mantissa). The sign is one bit and denotes whether the number is positive or negative. The sizes of the other fields depend on the size of the number. To decode the number, the following formula is used:
1.Fraction × 2^(Exponent - 1023)
In your example, the sign bit is 0 because the number is positive, the fractional part is 0 because the number is initialised as an integer, and the exponent part contains the value 1025 (2 with an offset of 1023). The result is:
1.0 × 2^2
Or, as you would expect, 4. The binary representation of the number (divided into sections) looks like this:
0 10000000001 0000000000000000000000000000000000000000000000000000
Or, in hexadecimal, 0x4010000000000000. When passing a value to printf using the %d specifier, it attempts to read sizeof(int) bytes from the parameters you passed to it. In your case, sizeof(int) is 4, or 32 bits. Since the first (rightmost) 32 bits of the 64-bit floating-point number you supply are all 0, it stands to reason that printf produces 0 as its integer output. If you were to write:
printf("%d %d", i);
Then you might get 0 1074790400, where the second number is equivalent to 0x40100000. I hope you see why this happens. Other answers have already given the fix for this: use the %f format specifier and printf will correctly accept your double.
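The fields described above can be inspected directly without relying on a mismatched printf. This sketch assumes double is a 64-bit IEEE-754 type; the helper names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: raw 64-bit pattern of a double */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Hypothetical helper: the 11-bit biased exponent field */
unsigned exponent_field(double d)
{
    return (unsigned) ((double_bits(d) >> 52) & 0x7FF);
}

/* Hypothetical helper: the 52-bit fraction field */
uint64_t fraction_field(double d)
{
    return double_bits(d) & 0xFFFFFFFFFFFFFULL;
}
```

For 4.0 the pattern is 0x4010000000000000: the exponent field is 1025, the fraction field is 0, the low 32 bits are all zero, and the high 32 bits are 0x40100000, which is 1074790400 in decimal.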
Jon Purdy gave you a wonderful explanation of why you were seeing this particular result. However, bear in mind that the behavior is explicitly undefined by the language standard:
7.19.6.1.9: If a conversion specification is invalid, the behavior is undefined. If any argument is not the correct type for the corresponding conversion specification, the behavior is undefined.
(emphasis mine) where "undefined behavior" means
3.4.3.1: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
IOW, the compiler is under no obligation to produce a meaningful or correct result. Most importantly, you cannot rely on the result being repeatable. There's no guarantee that this program would output 0 on other platforms, or even on the same platform with different compiler settings (it probably will, but you don't want to rely on it).
%d is for integers:
int main()
{
int i=4;
double f = 4;
printf("%d",i); // prints 4
printf("%0.f",f); // prints 4
return 0;
}
Because the language allows you to screw up and you happily do it.
More specifically, '%d' is the formatting for an int and therefore printf("%d") consumes as many bytes from the arguments as an int takes. But a double is much larger, so printf only gets a bunch of zeros. Use '%lf'.
Because "%d" specifies that you want to print an int, but i is a double. Try printf("%f\n", i); instead (the \n specifies a new-line character).
The simple answer to your question is, as others have said, that you're telling printf to print an integer number (for example a variable of the type int) whilst passing it a double-precision number (as your variable is of the type double), which is wrong.
Here's a snippet from the printf(3) linux programmer's manual explaining the %d and %f conversion specifiers:
d, i The int argument is converted to signed decimal notation. The
precision, if any, gives the minimum number of digits that must
appear; if the converted value requires fewer digits, it is
padded on the left with zeros. The default precision is 1.
When 0 is printed with an explicit precision 0, the output is
empty.
f, F The double argument is rounded and converted to decimal notation
in the style [-]ddd.ddd, where the number of digits after the
decimal-point character is equal to the precision specification.
If the precision is missing, it is taken as 6; if the precision
is explicitly zero, no decimal-point character appears. If a
decimal point appears, at least one digit appears before it.
To make your current code work, you can do two things. The first alternative has already been suggested - substitute %d with %f.
The other thing you can do is to cast your double to an int, like this:
printf("%d", (int) i);
The more complex answer(addressing why printf acts like it does) was just answered briefly by Jon Purdy. For a more in-depth explanation, have a look at the wikipedia article relating to floating point arithmetic and double precision.
Because i is a double and you tell printf to use it as if it were an int (%d).
@jagan, regarding the sub-question:
"What is Left most third byte. Why it is 00000001? Can somebody explain?"
10000000001 is for 1025 in binary format.

Resources