Implicit conversion of float to int and possibility of loss of value - c

I'm learning about data-types in C.
Our course material details as follows
When we assign different variables of different data-types, there is a
possibility of loss of value.
float f = 100.6537;
int i = f;
After execution of the above code, i = 100. So correct me if I'm wrong: assigning a float to an int just chops off the fractional part and assigns only the integral value to the left of the decimal point? And the loss of value here is the removal of the digits after the decimal point?
But when I do,
int i = 100;
float f = i;
I think that there is no loss of value here ?

Not every int can be represented as a float. The float lacks enough "places" to represent all possible values. Remember, sizeof(int)==sizeof(float) on many machines. In IEEE-754 format you only get 24 bits of "value" in a float.
In other words:
int = snnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
float = seeeeeeeennnnnnnnnnnnnnnnnnnnnnn
Where the e part is the exponent. Note how the int has a lot more bits to represent the numerical value.
For anything that fits neatly in a 24 bit number you should be fine, but it's worth testing on your hardware to be sure.
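A minimal sketch of where the loss starts, assuming float is IEEE-754 binary32 with a 24-bit significand (the helper name survives_float_round_trip is made up for illustration):

```c
#include <stdint.h>

/* Returns 1 if converting v to float and back yields v again, 0 otherwise.
   Assumption: float is IEEE-754 binary32 (24-bit significand). */
int survives_float_round_trip(int32_t v)
{
    volatile float f = (float)v;      /* volatile forces a real float store */
    return (int64_t)f == (int64_t)v;  /* compare in a type wide enough for both */
}
```

Under that assumption, 100 and 16777216 (2^24) survive the round trip, while 16777217 (2^24 + 1) does not, because it needs 25 significant bits.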

Related

Is it defined what will happen if you shift a float?

I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning it to ry (also a float). This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float values. The cast to int is UB for any float whose whole-number part lies outside the [INT_MIN ... INT_MAX] range.
So the code is UB for about 38% of all typical float values: the large-valued ones, NaNs and infinities.
For typical float, a cast to a 128-bit integer type (int128_t is not standard C; GCC and Clang provide __int128 on some targets) is defined for nearly all float values.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
#include <math.h>

// round the y position to the nearest 64th value
float round_to_64th(float x) {
    if (isfinite(x)) {
        float ipart;
        // The modf functions break the argument value into integral and fractional parts
        float frac = modff(x, &ipart);
        x = ipart + roundf(frac*64)/64;
    }
    return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate down to a multiple of 64 (2⁶).
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
The closest floating-point equivalent of the shift operators, written without extra helper functions, is the following:
Left shift:
ShiftFloat(py,6,1);
Right shift:
ShiftFloat(py,6,0);
float ShiftFloat(float x, int count, int ismultiplication)
{
    /* Each step doubles the value ("left shift") or halves it ("right shift"). */
    float factor = ismultiplication ? 2.0f : 0.5f;
    float value = x;
    for (int i = 0; i < count; ++i)
    {
        value *= factor;
    }
    return value;
}
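The standard library already offers this kind of scaling: ldexpf(x, n) from <math.h> computes x times 2 to the power n. The wrappers below are a small sketch; their names (shift_left_f, shift_right_f) are invented for this example.

```c
#include <math.h>

/* Floating-point analogue of x << n: multiply by 2^n. */
float shift_left_f(float x, int n)
{
    return ldexpf(x, n);
}

/* Floating-point analogue of x >> n: divide by 2^n (no truncation happens). */
float shift_right_f(float x, int n)
{
    return ldexpf(x, -n);
}
```

Unlike an integer right shift, shift_right_f keeps the fractional part, so 96.0f shifted right by 6 gives 1.5f rather than 1.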

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
printf("%d %d",b,c); // 14 15, why?
return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant but I was doing the test in CodeBlocks. However, if I type the same lines of code in some on-line compiler ( this one for example) I get an answer of 15 for the two printed variables.
... why I get two different numbers ...
Aside from the usual floating-point issues, b and c are computed along different paths: c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the
type;
1 evaluate operations and constants of type float and double to the
range and precision of the double type, evaluate long double
operations and constants to the range and precision of the long double
type;
2 evaluate all operations and constants to the range and precision of the
long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
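To see which evaluation mode applies to a given build, the FLT_EVAL_METHOD macro can be inspected directly. The sketch below maps the values listed in the standard's text to short descriptions; the function names are made up for illustration.

```c
#include <float.h>

/* Describes a FLT_EVAL_METHOD value using the cases listed in the C standard. */
const char *describe_eval_method(int method)
{
    switch (method) {
    case 0:  return "each operation uses the precision of its own type";
    case 1:  return "float and double operations are evaluated as double";
    case 2:  return "all operations are evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

/* The description for the compiler and flags actually in use. */
const char *current_eval_method(void)
{
    return describe_eval_method(FLT_EVAL_METHOD);
}
```

Printing current_eval_method() with printf("%s\n", ...) on the two compilers from the question would likely show different modes, which would be consistent with the 14 versus 15 discrepancy.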
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.
This is indeed an interesting question. Here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits of mantissa plus one implicit bit. For details on the representation, see the Wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011... (base 2). (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression and then truncating it to an int, whereas for c the long double result is first rounded to double and then truncated to int.
So the two values are not obtained by the same process, and this may lead to different results because floating-point types do not provide exact arithmetic.

Converting int16 to float in C

How do I convert a 16 bit int to a floating point number?
I have a signed 16 bit variable which I'm told I need to display with an accuracy of 3 decimal places, so I presume this would involve a conversion to float?
I've tried the below, which just copies my 16 bits into a float, but this doesn't seem right.
float myFloat = 0;
int16_t myInt = 0x3e00;
memcpy(&myFloat, &myInt, sizeof(int));
I've also read about the half-precision floating-point format but am unsure how to handle this... if I need to.
I'm using GCC.
Update:
The source of the data is a char array [2] which I get from an I2C interface. I then stitch this together into a signed int.
Can anyone help?
I have a signed 16 bit variable which i'm told i need to display with
an accuracy of 3 decimal places
If someone told you the integer value can be displayed this way, they should start with a C beginners' course.
The only possibility is that the integer value has been scaled (multiplied). For example, the value 12.456 can be stored in an integer if it is multiplied by 1000. If this is the case:
float flv;
int intv = 12456;
flv = (float)intv / 1000.0f;
You can also print this scaled integer without converting to float:
printf("%s%d.%03d\n", intv < 0 ? "-": "", abs(intv / 1000), abs(intv % 1000));
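Putting the pieces from the question together, a sketch might look like the following. Two things here are assumptions, not facts from the question: that the device sends the high byte first, and that the value is scaled by 1000. Both need checking against the sensor's datasheet. The function names are invented for this example.

```c
#include <stdint.h>

/* Assumption: the device sends the high byte first (big-endian). */
int16_t combine_i2c_bytes(uint8_t high, uint8_t low)
{
    uint16_t raw = (uint16_t)(((uint16_t)high << 8) | low);
    /* Reinterprets the bit pattern as two's complement, which is what
       common compilers such as GCC do for this conversion. */
    return (int16_t)raw;
}

/* Assumption: the raw value is the real quantity multiplied by 1000. */
float scaled_to_float(int16_t raw)
{
    return (float)raw / 1000.0f;
}
```

With these assumptions, the bytes 0x30 and 0xA8 combine to 12456, which scales to roughly 12.456.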

Rounding float to int

Self studying coding (noob here), the answer to a practice problem is as follows:
amount = (int) round (c);
Where c is a float.
Is it safe to say that this line converts the float to an integer through rounding?
I tried researching methods of converting floats to integers but none used the syntax as above.
You should look at the return value of round.
If it returns a float, then your int casting will not lose precision and will convert the float to an int.
If it returns an int, then the conversion happens in the function, and there is no need to try converting it again.
This is of course if you really wish to round the number. If you want 10.8 to become 11, then your code is a possible solution, but if you want it to become 10, then just convert (cast) it to an int.
I would just do amount = int(c)
Here is a full example
amount = 10.3495829
amount = int(amount)
print(amount)
It should print 10!
A float has a larger range than an int, so converting a float to an int is a narrowing conversion. You can do it by simply down-casting:
int value = (int) 9.99f; // Will return 9
Just note that this typecast truncates everything after the decimal point; it won't perform any rounding or flooring operation on the value.
As you can see from the above example, if you have a float of 9.999, (down) casting to an integer will return 9. However, if you need rounding, use the Math.round() method, which converts a float to its nearest integer by adding 0.5 to its value and then taking the floor.
Java tutorial , primitive datatypes
Java language specifications, casting of primitive types

Fixed point numbers in C without float

In C, is it possible to represent a fixed point number in binary form so it can be transmitted without the use of floats?
I know how to convert a float or double to the desired fixed point representation but I'm stuck when it shall be done without floating points. The problem is that the system I have to develop on has this limitation.
My idea is to create a struct which holds the full representation and a processable integer and fractional part. And after creating the struct with either only the received binary representation or the integer and fractional values there shall be a function which does the conversion.
Update:
My Question seems not to be precise enough so I'll add some details.
Within my code I have to create and receive numbers in a certain fixed point representation. As described by the answers below, this is nothing but a pointer to a sequence of bits. My problem is that I have to create this sequence of bits when sending, or interpret it when receiving the information.
This conversion is my problem. Ignoring signedness, it is quite an easy thing to do when you can use a float to convert from (code not tested, but must work like this):
float sourceValue = 12.223445f;
int intPart = 0;
float fractPart = 0.0f;
//integer part is easy, just cast it
intPart = (int)sourceValue;
//the fractional part is the rest
fractPart = sourceValue - intPart;
//multiplying the fract part by the precision of the fixed point number (Q9.25)
//gets us the fractional part in the desired representation
u_int64_t factor = 1;
factor = factor << 25;
u_int64_t fractFixed = fractPart * factor;
The rest can be done by some shifting and the use of logical bit operators.
But how can I do this without a float in the middle, starting with something like this:
int intPart = 12;
int fractPart = 223445;
Is it even possible? As I said, I'm kind of stuck here.
Thanks for your help!
I don't know what you are really up to, but a fixed-point number can be viewed as an integer number with a constant factor applied to it.
For example, if you want to express a number in the interval [0; 1) in 16 bits, you can map it to the range [0; 65536) by simply multiplying it with 65536.
This said, it completely depends on how your integer values look like and how they are intended to be represented. In almost any case, you can apply a multiplication or division to it and are done.
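As a sketch of the "multiply or divide" idea using only integers: if the fractional part arrives as decimal digits with a known scale (for example 223445 meaning 0.223445, so a scale of 1000000), the Q9.25 fraction bits are fract * 2^25 / scale. The function name, the parameters and the decimal-scale convention are assumptions made for this example, and sign handling is left out, as in the question.

```c
#include <stdint.h>

#define FRACT_BITS 25u

/* Builds an unsigned Q9.25 value from an integer part and a decimal
   fraction given as fract_digits / fract_scale (e.g. 223445 / 1000000).
   Assumes fract_digits < fract_scale and int_part < 512.
   The fraction is truncated, not rounded. */
uint64_t to_q9_25(uint32_t int_part, uint32_t fract_digits, uint32_t fract_scale)
{
    /* Widen to 64 bits before shifting so the product cannot overflow. */
    uint64_t fract_bits = ((uint64_t)fract_digits << FRACT_BITS) / fract_scale;
    return ((uint64_t)int_part << FRACT_BITS) | fract_bits;
}
```

Going the other way, value >> 25 recovers the integer part and value & ((1u << 25) - 1) recovers the fraction bits.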
Everything boils down to bits, be it an integer, float, etc. All you need is the memory base address and the size of that certain memory. For example,
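float src = 0.5f;
float dest;
unsigned char bytes[sizeof(src)];
memcpy(bytes, &src, sizeof(src));   /* copy the object representation out */
memcpy(&dest, bytes, sizeof(dest)); /* and back in, without pointer casts */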
should give you dest equal to src.
Hope this helped.

Resources