Is it defined what will happen if you shift a float? - c

I am following this video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to integer, shifting it back and forth, subtracting something, and then assigning it to ry (also a float). This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, shift that value down by six bits, and then shift it back up by six bits (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or they could simply be deleted by a peephole optimisation step).
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?

is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float values. The cast to int is UB for any float whose whole-number value lies outside the [INT_MIN ... INT_MAX] range.
So the code is UB for about 38% of all typical float values: the large-magnitude ones, NaNs, and infinities.
For typical float, a cast to a 128-bit type like int128_t would be defined for nearly all float values.
To get to OP's goal, code could use the below, which I believe to be well defined for all float values. If anything, use it to assess the correctness of one's own crafted code.
#include <math.h>   // isfinite, modff, roundf

// round the y position to the nearest 64th value
float round_to_64th(float x) {
    if (isfinite(x)) {
        float ipart;
        // modff() breaks the argument value into integral and fractional parts
        float frac = modff(x, &ipart);
        x = ipart + roundf(frac * 64) / 64;
    }
    return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate down to the nearest multiple of 64 (2⁶).
It is still UB for many float values.
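For reference, a minimal sketch of what that truncation amounts to for a non-negative, in-range py (the value 237.8 is just an illustration):
#include <stdio.h>

int main(void)
{
    float py = 237.8f;                 // any non-negative value within int range
    int ry_i = ((int)py >> 6) << 6;    // (int)py is 237; clearing the low 6 bits gives 192
    // For such values this equals ((int)py / 64) * 64, i.e. truncation down to a
    // multiple of 64; the tutorial then subtracts 0.0001 to land just below it.
    printf("%d\n", ry_i);              // prints 192
    return 0;
}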

That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6: the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.

In short, the closest recreation of the shift operators for floating-point values, without resorting to bit manipulation, is to multiply or divide by two the appropriate number of times:
Left shift (multiply by 2^count):
ShiftFloat(py, 6, 1);
Right shift (divide by 2^count):
ShiftFloat(py, 6, 0);
float ShiftFloat(float x, int count, int ismultiplication)
{
    // Multiply by 2 per step for a "left shift", by 0.5 per step for a "right shift".
    float factor = ismultiplication ? 2.0f : 0.5f;
    float value = x;
    for (int i = 0; i < count; ++i)
    {
        value *= factor;
    }
    return value;
}
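For what it's worth, the standard library can do this scaling in one call; a minimal sketch using ldexpf from <math.h> (shift_float is just an illustrative wrapper name):
#include <math.h>

// Scale x by 2^n without touching its bit pattern:
// a positive n acts like a left shift, a negative n like a right shift.
float shift_float(float x, int n)
{
    return ldexpf(x, n);   // computes x * 2^n
}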

Related

Decompose a floating-point number into its integral and fractional part

I am implementing a fractional delay line algorithm.
One of the tasks involved is the decomposition of a floating-point value into its integral and fractional part.
I know there are a lot of posts about this topic on SO and I probably read most of them.
However I haven’t found one post that deals with the specifics of this scenario.
The algorithm must be using 64-bit floating-point values.
Input floating-point values are guaranteed to always be positive. (delay times cannot be negative)
The output integer part has to be represented by an integer datatype.
The integer datatype must have enough bits so that the double-to-integer conversion occurs without the risk of overflowing.
Issues resulting from floating-point values lacking an exact internal representation must be avoided.
(i.e. 9223372036854775809.0 might be internally represented as 9223372036854775808.9999998 and when cast to integer it erroneously becomes 9223372036854775808)
The implementation should work regardless of rounding mode or compiler optimization settings.
So I wrote a function:
double my_modf(double x, int64_t *intPartOut);
As you can see its signature is similar to the modf() function in the C standard library.
The first implementation I came up with is:
double my_modf(double x, int64_t *intPartOut)
{
    double y;
    double fracPart = modf(x, &y);
    *intPartOut = (int64_t)y;
    return fracPart;
}
I have also been experimenting with this implementation, which, at least on my machine, runs faster than the previous one; however, I doubt its robustness.
double my_modf(double x, int64_t *intPartOut)
{
    int64_t y = (int64_t)x;
    *intPartOut = y;
    return x - y;
}
...and this is my latest attempt:
double my_modf(double x, int64_t *intPartOut)
{
    *intPartOut = llround(x);
    return x - floor(x);
}
I can't make up my mind as to which implementation would be best to use, or if there are other implementations that I haven't considered that would better accomplish the following goals.
I am looking for the (1) most robust and (2) most efficient implementation to decompose a floating-point number into its integral and fractional part, taking into consideration the list of points mentioned above.
Given that the maximum value of the integer part of the floating-point input x is 2⁶³−1 and that x is non-negative, then both:
double my_modf(double x, int64_t *intPartOut)
{
    double y;
    double fracPart = modf(x, &y);
    *intPartOut = y;
    return fracPart;
}
and:
double my_modf(double x, int64_t *intPartOut)
{
    int64_t y = x;
    *intPartOut = y;
    return x - y;
}
will correctly return the integer part in intPartOut and the fractional part in the return value regardless of rounding mode.
GCC 9.2 for x86_64 does a better job optimizing the latter version, and so does Apple Clang 11.0.0.
llround will not return the integer part as desired because it rounds to the nearest integer rather than truncating.
Issues about x containing errors cannot be resolved with the information provided in the question. The routines shown above have no error; they return exactly the integer and fractional parts of their input.
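As a quick check, here is a usage sketch of the second (truncating) version; the sample value is only illustrative:
#include <stdint.h>
#include <stdio.h>

double my_modf(double x, int64_t *intPartOut)
{
    int64_t y = x;
    *intPartOut = y;
    return x - y;
}

int main(void)
{
    int64_t whole;
    double frac = my_modf(3.75, &whole);
    printf("%lld + %f\n", (long long)whole, frac);   // prints: 3 + 0.750000
    return 0;
}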
Updated answer after reading your comment below.
If you are already sure the values are within [0, 2⁶³−1], then a simple cast will be faster than llround(), since that function may also check for overflow (on my system the manual page says so; the C standard, however, does not require it).
On my machine for example (x86-64 Nehalem) casting is a single instruction (cvttsd2si) and llround() is obviously more than one.
Am I guaranteed to get the right result with a simple cast (truncation) or is it safer to round?
Depends on what you mean by "right". If the value in the double can be represented exactly by an int64_t, then sure, you're going to get exactly the same value. However, if the value has a fractional part, that part is simply discarded (truncation toward zero) when casting. If you want to round the value in a different way, that's another story and you'll have to use one of ceil(), floor() or round().
If you are also sure that no values will be +/- Infinity or NaN (in which case you can use -Ofast), then your second implementation should be the fastest if you want truncation, while the third should be the fastest if you want to floor() the value.

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation of why I get two different numbers, namely 14 and 15, as output from the following code?
#include <stdio.h>

int main()
{
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    double a = (Vmax-Vmin)/step;
    int b = (Vmax-Vmin)/step;
    int c = a;
    printf("%d %d", b, c); // 14 15, why?
    return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant, but I was doing the test in CodeBlocks. However, if I type the same lines of code into an online compiler (this one, for example) I get an answer of 15 for both printed variables.
... why I get two different numbers ...
Aside from the usual floating-point issues, the values of b and c are arrived at by different computation paths. c is calculated by first saving the value in the double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1  indeterminable;
 0  evaluate all operations and constants just to the range and precision of the type;
 1  evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
 2  evaluate all operations and constants to the range and precision of the long double type.
C11dr §5.2.4.2.2 9
OP reported FLT_EVAL_METHOD == 2.
By saving the quotient in double a = (Vmax-Vmin)/step;, the precision is forced down to double, whereas int b = (Vmax-Vmin)/step; may be computed as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double in one case versus remaining a long double in the other. One value ends up as 15 (or just above), the other just under 15. Truncation to int amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion from a floating-point number to int truncates, which is unforgiving with values near a whole number. It is often better to round() or lround(). The best solution is situation dependent.
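For example, a sketch of the rounding approach applied to the question's code (lround is declared in <math.h> and rounds to the nearest integer instead of truncating):
#include <math.h>
#include <stdio.h>

int main(void)
{
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    // Rounding to nearest means a quotient that lands a hair below 15
    // still comes out as 15, on either evaluation path.
    long b = lround((Vmax - Vmin) / step);
    printf("%ld\n", b);   // prints 15
    return 0;
}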
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
Now you do some math on these numbers. The first operation, the subtraction, results in the precise result:
10.111001100110011001100110011001100110011001100110011
- 1.0110011001100110011001100110011001100110011001100110
--------------------------------------------------------
1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
1.1000000000000000000000000000000000000000000000000000
/ .00011001100110011001100110011001100110011001100110011010
--------------------------------------------------------
1110.1111111111111111111111111111111111111111111111111100001111111111111
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
1110.1111111111111111111111111111111111111111111111111100001111111111111
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
1110.1111111111111111111111111111111111111111111111111100001111111111111
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011...[2] . (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in "analysis of C programs for FLT_EVAL_METHOD==2".
If FLT_EVAL_METHOD==2:
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression and then truncating it to an int, whereas c is obtained by evaluating a long double expression, rounding it to double on assignment, and then truncating that to int.
So the two values are not obtained by the same process, and this may lead to different results, because floating types do not provide exact arithmetic.
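Since a cast, like an assignment, removes any extra range and precision, one way to make both conversions take the same path is to force the quotient down to double first; a sketch:
#include <stdio.h>

int main(void)
{
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    double a = (Vmax - Vmin) / step;
    // The (double) cast drops any extra long double precision, just as the
    // assignment to a does, so b now follows the same path as c.
    int b = (int)(double)((Vmax - Vmin) / step);
    int c = a;
    printf("%d %d\n", b, c);   // prints 15 15
    return 0;
}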

Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.
There is indeed no difference between the two, but compilers are allowed to take some freedom when it comes to floating-point computations. For example, compilers are free to use higher precision for intermediate results, but higher precision still means different precision, so the results may vary.
Some compilers provide switches to always drop the extra precision and convert all intermediate results to the prescribed floating-point format (say, 64-bit double-precision numbers). This will make the code slower, however.
Specifically, the number 45.33 cannot be represented exactly as a floating-point value (it is a periodic number when expressed in binary and would require an infinite number of bits). When multiplying by 100 you may therefore not get an integer, but something very close (just below or just above).
Conversion or cast to int is performed by truncation, so something very close to 4533 but just below it becomes 4532, while a value just above becomes 4533, even if the difference is incredibly tiny (say, 1E-300).
To avoid trouble, be sure to account for numeric accuracy issues. If you are doing a computation that depends on exact values of floating-point numbers, then you're using the wrong tool.
#6502 has given you the theory; here's how to look at things experimentally:
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80 bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... rounds to 4533 when converted to 64 bits.
On your machine, the multiplication is evidently done with 64 bits of precision, and I would expect that v100 will print as 4532.999....
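And if the goal is simply the nearest integer rather than whatever truncation happens to hit, rounding before the conversion sidesteps the issue entirely; a minimal sketch:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double value = 45.33;
    // lround() rounds to the nearest integer, so it does not matter whether
    // the intermediate product lands just above or just below 4533.0.
    int multResultInt = (int)lround(value * 100.0);
    printf("multResultInt = %d\n", multResultInt);   // prints 4533
    return 0;
}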

Arithmetic Bit Shift of Double Variable Data Type in C

I am trying to arithmetic bit shift a double data type in C. I was wondering if this is the correct way to do it:
NOTE: firdelay[ ][ ] is declared in main as
double firdelay[8][12]
void function1(double firdelay[][12]) {
    int *shiftptr;
    // Cast address of element of 2D matrix (type double) to integer pointer
    *shiftptr = (int *) (&firdelay[0][5]);
    // Dereference integer pointer and shift right by 12 bits
    *shiftptr >>= 12;
}
Bitwise shifting a floating point data type will not give you the result you're looking for.
In Simulink, the Shift Arithmetic block only does bit shifting for integer data types. If you feed it a floating point type it divides the input signal by 2^N where N is the number of bits to shift specified in the mask dialog box.
EDIT:
Since you don't have the capability to perform any floating-point math, your options are:
1. understand the layout of a single-precision floating-point number, then figure out how to manipulate it bitwise to achieve division, or
2. convert whatever algorithm you're porting to use fixed-point data types instead of floating point.
I'd recommend option 2; it's a whole lot easier than option 1.
Bit-shifting a floating-point data type (reinterpreted as an int) will give you gibberish (take a look at the diagrams of the binary representation here to see why).
If you want to multiply/divide by a power of 2, then you should do that explicitly.
According to the poorly worded and very unclear documentation, it seems that "bit shifting" in Simulink takes two arguments for floating point values and has the effect of multiplying a floating point value by two raised to the difference of the arguments.
You can use ldexp(double_number, bits_to_pseudo_shift) to obtain this behavior. The function ldexp is found in <math.h>.
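For instance, a sketch of the 12-bit "right shift" from the question, written as a scale by 2⁻¹² (the function signature mirrors the one in the question):
#include <math.h>

void function1(double firdelay[][12])
{
    // Divide by 2^12 = 4096, which is what an arithmetic right shift by
    // 12 bits means for the value, without reinterpreting the double's bits.
    firdelay[0][5] = ldexp(firdelay[0][5], -12);
}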
There is no correct way to do this. Both operands of << must be of some integer type.
What you're doing is interpreting ("type-punning") a double object as if it were an int object, and then shifting the resulting int value. Even if double and int happen to be the same size, this is very unlikely to do anything useful. (And even if it is useful, it makes more sense to shift unsigned values rather than signed values).
One potential use case for this is capturing the mantissa bits, exponent bits, and sign bit, if you are interested in them. For this you can use a union:
union doubleBits {
    double d;
    long long l;   // must be a 64-bit integer type to cover all bits of the double
};
You can take your double and store it in the union:
union doubleBits myUnion;
myUnion.d = myDouble;
And bit-shift the integer member of the union like so:
myUnion.l >>= 1;
Since the number of bits for each portion of the double is defined, this is one way to extract the underlying bit representation. This is one use case where it could be feasible to want to get the raw underlying bits. I'm not familiar with Simulink, but if that is possibly why the double was being shifted in the first place, this could be a way to achieve that behavior in C. The fact that it was always 12 bits makes me think otherwise, but just in case figured it was worth pointing out for others who stumble upon this question.
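For instance, a minimal sketch of pulling those three fields out through the union; the sample value and the printed output assume the usual IEEE-754 binary64 layout (1 sign bit, 11 exponent bits, 52 mantissa bits):
#include <stdio.h>

union doubleBits {
    double d;
    long long l;   // 64-bit integer view of the same storage
};

int main(void)
{
    union doubleBits myUnion;
    myUnion.d = -6.25;
    int sign = (int)((myUnion.l >> 63) & 1);
    int exponent = (int)((myUnion.l >> 52) & 0x7FF);      // biased by 1023
    long long mantissa = myUnion.l & 0xFFFFFFFFFFFFFLL;   // low 52 bits
    printf("sign=%d exponent=%d mantissa=0x%llx\n",
           sign, exponent, (unsigned long long)mantissa);
    // prints: sign=1 exponent=1025 mantissa=0x9000000000000
    return 0;
}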
There's a way to achieve this: just add your n to the exponent part of the bitwise representation of the double.
Convert your double to a 64-bit integer by reinterpreting its bits (using a union, for instance), extract bits 52 to 62 (the 11 exponent bits), add your shift, and put the result back into the exponent.
You should take into account the special values of double (±infinity, NaN, and zero):
double operator_shift_left(double a, int n)
{
    union
    {
        long long l;
        double d;
    } r;
    r.d = a;
    switch (r.l)
    {
        case 0x0000000000000000: // 0
        case 0x8000000000000000: // -0
        case 0x7FF0000000000000: // positive infinity
        case 0xFFF0000000000000: // negative infinity
        case 0x7FF0000000000001: // NaN
        case 0x7FF8000000000001: // NaN
        case 0x7FFFFFFFFFFFFFFF: // NaN
            return a;            // (only these particular NaN bit patterns are caught)
    }
    int nexp = (((r.l >> 52) & 0x7FF) + n); // new exponent
    if (nexp < 0)  // underflow
    {
        r.l = r.l & 0x8000000000000000;
        // returns +/- 0
        return r.d;
    }
    if (nexp > 2047)  // overflow
    {
        r.l = (r.l & 0x8000000000000000) | 0x7FF0000000000000;
        // returns +/- infinity
        return r.d;
    }
    // returns the number with the new exponent
    r.l = (r.l & 0x800FFFFFFFFFFFFF) | (((long long)nexp) << 52);
    return r.d;
}
(There's probably some x64 processor instruction that does this?)
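A quick sanity check of that routine (assuming the function above is in scope; the values are chosen only as an illustration):
#include <stdio.h>

int main(void)
{
    printf("%g\n", operator_shift_left(3.0, 2));    // prints 12 (3 * 2^2)
    printf("%g\n", operator_shift_left(80.0, -4));  // prints 5  (80 / 2^4)
    return 0;
}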

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.
All integers can have an exact floating-point representation if your floating-point type has enough mantissa bits. Since double uses 53 bits for its mantissa, it can store all 32-bit ints exactly: the integer's value fits entirely in the significand, so no bits are lost to rounding.
If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C does when a conversion from floating point to integer overflows (it is, in fact, undefined behavior)... but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;

class Test
{
    static void Main()
    {
        FloorSameInteger(long.MaxValue/2);
        FloorSameInteger(long.MaxValue-2);
    }

    static void FloorSameInteger(long original)
    {
        double convertedToDouble = original;
        double flooredToDouble = Math.Floor(convertedToDouble);
        long flooredToLong = (long) flooredToDouble;

        Console.WriteLine("Original value: {0}", original);
        Console.WriteLine("Converted to double: {0}", convertedToDouble);
        Console.WriteLine("Floored (as double): {0}", flooredToDouble);
        Console.WriteLine("Converted back to long: {0}", flooredToLong);
        Console.WriteLine();
    }
}
Results:
Original value: 4611686018427387903
Converted to double: 4.61168601842739E+18
Floored (as double): 4.61168601842739E+18
Converted back to long: 4611686018427387904

Original value: 9223372036854775805
Converted to double: 9.22337203685478E+18
Floored (as double): 9.22337203685478E+18
Converted back to long: -9223372036854775808
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise: there are more long values than there are doubles (given the NaN bit patterns), and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32-bit integers are representable as doubles.
I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.
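A sketch of that workaround using nextafter from <math.h>; the wrapper name is made up here, and as noted this only compensates for an error in the last place, not for accumulated error:
#include <math.h>

// Nudge x up by one representable step before flooring, so a value that
// ended up just below an integer (e.g. one ulp under 3.0) floors to 3, not 2.
double forgiving_floor(double x)
{
    return floor(nextafter(x, INFINITY));
}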
