I am implementing a fractional delay line algorithm.
One of the tasks involved is the decomposition of a floating-point value into its integral and fractional part.
I know there are a lot of posts about this topic on SO and I probably read most of them.
However I haven’t found one post that deals with the specifics of this scenario.
The algorithm must be using 64-bit floating-point values.
Input floating-point values are guaranteed to always be positive. (delay times cannot be negative)
The output integer part has to be represented by an integer datatype.
The integer datatype must have enough bits so that the double-to-integer conversion occurs without the risk of overflowing.
Issues resulting from floating-point values lacking an exact internal representation must be avoided.
(e.g. 9223372036854775809.0 might be internally represented as 9223372036854775808.9999998 and when cast to integer it erroneously becomes 9223372036854775808)
The implementation should work regardless of rounding mode or compiler optimization settings.
So I wrote a function:
double my_modf(double x, int64_t *intPartOut);
As you can see its signature is similar to the modf() function in the C standard library.
The first implementation I came up with is:
double my_modf(double x, int64_t *intPartOut)
{
    double y;
    double fracPart = modf(x, &y);
    *intPartOut = (int64_t)y;
    return fracPart;
}
I have also been experimenting with this implementation which - at least on my machine - runs faster than the previous, however I doubt its robustness.
double my_modf(double x, int64_t *intPartOut)
{
    int64_t y = (int64_t)x;
    *intPartOut = y;
    return x - y;
}
...and this is my latest attempt:
double my_modf(double x, int64_t *intPartOut)
{
    *intPartOut = llround(x);
    return x - floor(x);
}
I can't make up my mind as to which implementation would be best to use, or if there are other implementations that I haven't considered that would better accomplish the following goals.
I am looking for the (1) most robust and (2) most efficient implementation to decompose a floating-point number into its integral and fractional part, taking into consideration the list of points mentioned above.

Given that the maximum value of the integer part of the floating-point input x is 2^63−1 and that x is non-negative, then both:
double my_modf(double x, int64_t *intPartOut)
{
    double y;
    double fracPart = modf(x, &y);
    *intPartOut = y;
    return fracPart;
}

and

double my_modf(double x, int64_t *intPartOut)
{
    int64_t y = x;
    *intPartOut = y;
    return x - y;
}
will correctly return the integer part in intPartOut and the fractional part in the return value regardless of rounding mode.
GCC 9.2 for x86_64 does a better job optimizing the latter version, and so does Apple Clang 11.0.0.
llround will not return the integer part as desired because it rounds to the nearest integer rather than truncating.
Issues about x containing errors cannot be resolved with the information provided in the question. The routines shown above have no error; they return exactly the integer and fractional parts of their input.
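As a minimal sketch of the cast-based version with its precondition made explicit: the helper name split_nonneg and the choice to return an error code are my assumptions, not part of the question. It rejects NaN, negative values, and anything at or above 2^63, so the conversion to int64_t is always defined.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper (name and error convention are assumptions):
   splits a non-negative double into integer and fractional parts.
   Returns 0 on success, -1 if x is NaN, negative, or too large. */
int split_nonneg(double x, int64_t *intPartOut, double *fracPartOut)
{
    /* 9223372036854775808.0 is 2^63, the first double that no longer
       fits in int64_t. Written so that NaN fails the test as well. */
    if (!(x >= 0.0 && x < 9223372036854775808.0))
        return -1;

    int64_t i = (int64_t)x;        /* conversion truncates toward zero */
    *intPartOut = i;
    *fracPartOut = x - (double)i;  /* i converts back to double exactly */
    return 0;
}
```

For x = 2.5 this stores 2 and 0.5, whereas llround(2.5) returns 3, which is why llround is not a drop-in replacement here.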

Updated answer after reading your comment below.
If you are already sure the values are within [0, 2^63-1] then a simple cast will be faster than llround() since this function may also check for overflow (on my system, the manual page states so, however the C standard does not require it).
On my machine for example (x86-64 Nehalem) casting is a single instruction (cvttsd2si) and llround() is obviously more than one.
Am I guaranteed to get the right result with a simple cast (truncation) or is it safer to round?
Depends on what you mean with "right". If the value in the double is a whole number that an int64_t can represent, then sure, you're going to get exactly the same value. However, if the double has a fractional part, the cast truncates it (rounds toward zero). If you want to round the value in a different way, that's another story and you'll have to use one of ceil(), floor() or round().
If you also are sure that no values will be +/- Infinity or NaN (and in that case you can use -Ofast), then your second implementation should be the fastest if you want truncation, while the third should be the fastest if you want to floor() the value.


Is it defined what will happen if you shift a float?

I am following This video tutorial to implement a raycaster. It contains this code:
if(ra > PI) { ry = (((int)py>>6)<<6)-0.0001; rx=(py-ry)*aTan+px; yo=-64; xo=-yo*aTan; }//looking up
I hope I have transcribed this correctly. In particular, my question is about casting py (it's declared as float) to an integer, shifting it back and forth, subtracting something, and then assigning it to ry (also a float). This line of code is entered at time 7:24, where he also explains that he wants to
round the y position to the nearest 64th value
(I'm unsure if that means the nearest multiple of 64 or the nearest (1/64), but I know that the 6 in the source is derived from the number 64, being 2⁶)
For one thing, I think that it would be valid for the compiler to load (say) a 32-bit float into a machine register, and then shift that value down by six spaces, and then shift it back up by six spaces (these two operations could interfere with the mantissa, or the exponent, or maybe something else, or these two operations could be deleted by a peephole optimisation step.)
Also I think it would be valid for the compiler to make demons fly out of your nose when this statement is executed.
So my question is, is (((int)py>>6)<<6) defined in C when py is float?
is (((int)py>>6)<<6) defined in C when py is float?
It is certainly undefined behavior (UB) for many float. The cast to an int is UB for float with a whole number value outside the [INT_MIN ... INT_MAX] range.
So code is UB for about 38% of all typical float - the large valued ones, NaNs and infinities.
For typical float, a cast to int128_t is defined for nearly all float.
To get to OP's goal, code could use the below, which I believe to be well defined for all float.
If anything, use the below to assess the correctness of one's crafted code.
// round the y position to the nearest 64th value
float round_to_64th(float x) {
  if (isfinite(x)) {
    float ipart;
    // The modf functions break the argument value into integral and fractional parts
    float frac = modff(x, &ipart);
    x = ipart + roundf(frac*64)/64;
  }
  return x;
}
"I'm unsure if that means the nearest multiple of 64 or the nearest (1/64)"
On review, OP's code is attempting to truncate to the nearest multiple of 64 or 2⁶.
It is still UB for many float.
That code doesn't shift a float because the bitshift operators aren't defined for floating-point types. If you try it you will get a compiler error.
Notice that the code is (int)py >> 6, the float is cast to an int before the shift operation. The integer value is what is being shifted.
If your question is "what will happen if you shift a float?", the answer is it won't compile. Example on Compiler Explorer.
The closest equivalent of the shift operators for floating-point values, without using additional functions, is repeated scaling by a power of two:
Left shift by count: x * 2^count
Right shift by count: x / 2^count
float ShiftFloat(float x, int count, int ismultiplication)
{
    float value = x;
    /* ismultiplication != 0: left shift (multiply by 2 per step);
       ismultiplication == 0: right shift (multiply by 0.5 per step). */
    for (int i = 0; i < count; ++i)
        value *= ismultiplication ? 2.0f : 0.5f;
    return value;
}

Is there any accuracy gain when casting to double and back when doing float division?

What is the difference between the following two snippets?
/* Version 1: divide as float */
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = f1 / f2;

/* Version 2: divide as double, then convert back to float */
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = (double)f1 / (double)f2;
I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?
Some practical guidelines for using this kind of cast would be nice as well.
I am going to assume IEEE 754 binary floating point arithmetic, with float 32 bit and double 64 bit.
In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.
Conversion from float to double is exact. For infinite, NaN, or zero-divisor inputs it makes no difference. Given a finite number result, the IEEE 754 standard requires the result to be the result of the real number division f1/f2, rounded to the type being used in the division.
If it is done as a float division that is the closest float to the exact result. If it is done as double division, it will be the closest double with an additional rounding step for the assignment to result.
For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.
For simple conversion, if the answer is very close to halfway between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, Innocuous Double Rounding of Basic Arithmetic Operations by Pierre Roux, claiming proof that double rounding is harmless for several operations, including division, under conditions that are implied by the assumptions I made at the start of this answer.
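A quick empirical check of the two variants, assuming IEEE 754 binary32 float and binary64 double as stated at the start of this answer. The function names are mine. This only spot-checks a few inputs; it is not a substitute for the proof in the paper.

```c
#include <math.h>

/* Divide directly in float. */
float div_float(float f1, float f2)
{
    return f1 / f2;
}

/* Divide in double, then round the quotient back to float. */
float div_via_double(float f1, float f2)
{
    return (float)((double)f1 / (double)f2);
}
```

For 1.0e30f / 1.0e-30f both return +infinity: the double quotient 1e60 is finite, but it overflows when it is converted back to float.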
If the result of an individual floating-point addition, subtraction, multiply, or divide, is immediately stored to a float, there will be no accuracy improvement using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using them. In Turbo Pascal circa 1986 code like:
Function TriangleArea(A: Single; B: Single; C: Single): Single;
Var S: Extended; (* S stands for Semi-perimeter *)
Begin
  S := (A+B+C) * 0.5;
  TriangleArea := Sqrt((S-A)*(S-B)*(S-C)*S)
End;
would extend all operands of floating-point operations to type Extended (80-bit float), and then convert them back to single- or double-precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results; the failure of languages to provide a variable type which could hold intermediate results led to people's unfairly criticizing the concept of a higher-precision intermediate result type, when the real problem was that languages failed to support it properly.
Anyway, if one were to write the above method into a modern language like C#:
public static float triangleArea(float a, float b, float c)
{
    double s = (a + b + c) * 0.5;
    return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
}
the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7 while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].
Note that even though code like the above will work perfectly in some environments but yield completely bogus results in others, compilers will generally not give any warning about the situation.
Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid problems caused by loss of promotion (e.g. the above formula uses five additions, four multiplications, and a square root; rewriting the formula as:
increases the number of additions to eight, but will work correctly even if they are performed at single precision.
"Accuracy gain when casting to double and back when doing float division?"
The result depends on other factors aside from only the 2 posted methods.
C allows evaluation of float operations to happen at different levels depending on FLT_EVAL_METHOD. (See below table) If the current setting is 1 or 2, the two methods posted by OP will provide the same answer.
Depending on other code and compiler optimization levels, the quotient result may be used at wider precision in subsequent calculations in either of OP's cases.
Because of this, a float division that would overflow or become 0.0 (a result with total loss of precision) because of extreme float values may in fact not overflow or underflow once optimized for subsequent calculations, as the quotient is carried forward as double.
To compel the quotient to become a float for future calculations in the midst of potential optimizations, code often uses volatile:
volatile float result = f1 / f2;
C does not specify the precision of math operations, yet common application of standards like IEEE 754 provides that a single operation, such as a binary32 divide, will result in the closest representable answer. Should the divide occur in a wider format like double or long double, then the conversion of the wider quotient back to float experiences another rounding step that on rare occasions will result in a different answer than the direct float/float.
Values of FLT_EVAL_METHOD:
-1  indeterminable;
 0  evaluate all operations and constants just to the range and precision of the type;
 1  evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
 2  evaluate all operations and constants to the range and precision of the long double type.
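To find out which row of the table applies on a given platform, one can simply read the FLT_EVAL_METHOD macro from <float.h> (available since C99). A small sketch; the two helper names are hypothetical:

```c
#include <float.h>

/* Hypothetical helper: short description for a FLT_EVAL_METHOD value. */
const char *eval_method_name(int method)
{
    switch (method) {
    case 0:  return "each operation in the range and precision of its type";
    case 1:  return "float and double operations evaluated as double";
    case 2:  return "all operations evaluated as long double";
    default: return "indeterminable or implementation-defined";
    }
}

/* Value of FLT_EVAL_METHOD as seen by this translation unit. */
int current_eval_method(void)
{
    return FLT_EVAL_METHOD;
}
```

Printing eval_method_name(current_eval_method()) once at startup is usually enough to tell whether the two division variants can differ at all.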
Practical guidelines:
Use float vs. double to conserve space when needed. (float is usually narrower than double, rarely the same width.) If precision is important, use double (or long double).
Using float vs. double to improve speed may or may not work, as a platform's native operations may all be double. It may be faster, the same, or slower: profile to find out. Much of C was originally designed with double as the only level at which FP was carried out, aside from conversions between double and float. Later C added functions like sinf() to facilitate faster, direct float operations. So the more modern the compiler/platform, the more likely float will be faster. Again: profile to find out.

Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.
There is indeed no difference between the two, but compilers take some liberties when it comes to floating-point computations. For example, compilers are free to use higher precision for the intermediate results of computations, but higher still means different, so the results may vary.
Some compilers provide switches to always drop extra precision and convert all intermediate results to the prescribed floating point numbers (say 64bit double-precision numbers). This will make the code slower, however.
Specifically, the number 45.33 cannot be represented exactly as a floating-point value (it is a periodic number when expressed in binary and would require an infinite number of bits). When you multiply this value by 100 you may not get an integer, but something very close to one (just below or just above).
Conversion or a cast to int truncates, so something very close to 4533 but below it becomes 4532, while something just above becomes 4533, even if the difference is incredibly tiny, say 1E-300.
To avoid having problems be sure to account for numeric accuracy problems. If you are doing a computation that depends on exact values of floating point numbers then you're using the wrong tool.
#6502 has given you the theory, here's how to look at things experimentally
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
On your machine, the multiplication is evidently done with 64-bits of precision, and I would expect that v100 will print as 4532.999....

Can I calculate error introduced by doubles?

Suppose I have an irrational number like \sqrt{3}. As it is irrational, it has no finite decimal representation. So when you try to express it with an IEEE 754 double, you will introduce an error.
A decimal representation with a lot of digits is:
Now, when I calculate \sqrt{3}, I get 1.732051:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
int main() {
    double myVar = sqrt (3);
    printf("as double:\t%f\n", myVar);
    return 0;
}
According to Wolfram|Alpha, I have an error of 1.11100... × 10^-7.
Is there any way I can calculate the error myself?
(I don't mind switching to C++, Python or Java. I could probably also use Mathematica, if there is no simple alternative)
Just to clarify: I don't want a solution that works only for sqrt{3}. I would like to get a function that gives me the error for any number. If that is not possible, I would at least like to know how Wolfram|Alpha gets more values.
My try
While writing this question, I found this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrtl
#include <float.h> // needed for LDBL_DIG
int main() {
    long double r = sqrtl(3.0L);
    printf("Precision: %d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
    return 0;
}
With this one, I can get the error down to 2.0 * 10^-18 according to Wolfram|Alpha. So I thought this might be close enough to get a good estimation of the error. I wrote this:
#include <stdio.h> // printf
#include <math.h> // needed for sqrt
#include <float.h>
int main() {
    double myVar = sqrt (3);
    long double r = sqrtl(3.0L);
    long double error = abs(r-myVar) / r;
    printf("Double:\t\t%f\n", myVar);
    printf("Precision:\t%d digits; %.*Lg\n",LDBL_DIG,LDBL_DIG,r);
    printf("Error:\t\t%.*Lg\n", LDBL_DIG, error);
    return 0;
}
But it outputs:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 0
How can I fix that to get the error?
What Every Computer Scientist Should Know About Floating-Point Arithmetic by Goldberg is the definitive guide you are looking for.
printf rounds doubles to 6 places when you use %f without a precision.
double x = 1.3;
long double y = 1.3L;
long double err = y - (double) x;
printf("Error %.20Lf\n", err);
My output: -0.00000000000000004445
If the result is 0, your long double and double are the same.
One way to obtain an interval that is guaranteed to contain the real value of the computation is to use interval arithmetic. Then, comparing the double result to the interval tells you how far the double computation is, at worst, from the real computation.
Frama-C's value analysis can do this for you with option -all-rounding-modes.
double Frama_C_sqrt(double x);

double sqrt(double x)
{
  return Frama_C_sqrt(x);
}

double y;

int main(){
  y = sqrt(3.0);
}
Analyzing the program with:
frama-c -val t.c -float-normal -all-rounding-modes
[value] Values at end of function main:
y ∈ [1.7320508075688772 .. 1.7320508075688774]
This means that the real value of sqrt(3), and thus the value that would be in variable y if the program computed with real numbers, is within the double bounds [1.7320508075688772 .. 1.7320508075688774].
Frama-C's value analysis does not support the long double type, but if I understand correctly, you were only using long double as reference to estimate the error made with double. The drawback of that method is that long double is itself imprecise. With interval arithmetic as implemented in Frama-C's value analysis, the real value of the computation is guaranteed to be within the displayed bounds.
You have a mistake in printing Double: 1.732051 here printf("Double:\t\t%f\n", myVar);
The actual value of double myVar is
1.732050807568877281 //18 digits
so 1.732050807568877281-1.732050807568877281 is zero
According to the C standard printf("%f", d) will default to 6 digits after the decimal point. This is not the full precision of your double.
It might be that double and long double happen to be the same on your architecture. I have different sizes for them on my architecture and get a non-zero error in your example code.
You want fabsl instead of abs when calculating the error, at least when using C. (In C, abs is integer.) With this substitution, I get:
Double: 1.732051
Precision: 18 digits; 1.73205080756887729
Error: 5.79643049346087304e-17
(Calculated on Mac OS X 10.8.3 with Apple clang 4.0.)
Using long double to estimate the errors in double is a reasonable approach for a few simple calculations, except:
If you are calculating the more accurate long double results, why bother with double?
Error behavior in sequences of calculations is hard to describe and can grow to the point where long double is not providing an accurate estimate of the exact result.
There exist perverse situations where long double gets less accurate results than double. (Mostly encountered when somebody constructs an example to teach students a lesson, but they exist nonetheless.)
In general, there is no simple and efficient way to calculate the error in a floating-point result in a sequence of calculations. If there were, it would be effectively a means of calculating a more accurate result, and we would use that instead of the floating-point calculations alone.
In special cases, such as when developing math library routines, the errors resulting from a particular sequence of code are studied carefully (and the code is redesigned as necessary to have acceptable error behavior). More often, error is estimated either by performing various “experiments” to see how much results fluctuate with varying inputs or by studying general mathematical behavior of systems.
You also asked “I would like to get a function that gives me the error for any number.” Well, that is easy: given any number x and the calculated result x', the error is exactly x' – x. The actual problem is you probably do not have a description of x that can be used to evaluate that expression easily. In your example, x is sqrt(3). Obviously, then, the error is x' – sqrt(3), and x' is exactly 1.732050807568877193176604123436845839023590087890625. Now all you need to do is evaluate sqrt(3). In other words, numerically evaluating the error is about as hard as numerically evaluating the original number.
Is there some class of numbers you want to perform this analysis for?
Also, do you actually want to calculate the error or just a good bound on the error? The latter is somewhat easier, although it remains hard for sequences of calculations. For all elementary operations, IEEE 754 requires the produced result to be the result that is nearest the mathematically exact result (in the appropriate direction for the rounding mode being used). In round-to-nearest mode, this implies that each result is at most 1/2 ULP (unit of least precision) away from the exact result. For operations such as those found in the standard math library (sine, logarithm, et cetera), most libraries will produce results within a few ULP of the exact result.

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.
All integers can have exact floating point representation if your floating point type supports the required mantissa bits. Since double uses 53 bits for mantissa, it can store all 32-bit ints exactly. After all, you could just set the value as mantissa with zero exponent.
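A small sketch of that claim, assuming int is no wider than 53 bits (true for the usual 32-bit int; with a 64-bit int the cast back could overflow). The function names are hypothetical.

```c
#include <limits.h>
#include <math.h>

/* Returns 1 if i survives a round trip through double unchanged.
   Assumes int fits in the 53-bit significand of a double. */
int int_roundtrips(int i)
{
    double d = (double)i;  /* exact under the assumption above */
    return (int)d == i;
}

/* The example from the question: floor(3.5) is exactly 3.0. */
int floor_example_is_exact(void)
{
    double d = floor(3.0 + 0.5);
    return d == 3.0 && (int)d == 3;
}
```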
If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C does when conversions from floating point to integer overflow... but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;

class Test
{
    static void Main()
    {
        // Inputs taken from the output shown below
        FloorSameInteger(4611686018427387903L);
        FloorSameInteger(9223372036854775805L);
    }

    static void FloorSameInteger(long original)
    {
        double convertedToDouble = original;
        double flooredToDouble = Math.Floor(convertedToDouble);
        long flooredToLong = (long) flooredToDouble;
        Console.WriteLine("Original value: {0}", original);
        Console.WriteLine("Converted to double: {0}", convertedToDouble);
        Console.WriteLine("Floored (as double): {0}", flooredToDouble);
        Console.WriteLine("Converted back to long: {0}", flooredToLong);
    }
}
Original value: 4611686018427387903
Converted to double: 4.61168601842739E+18
Floored (as double): 4.61168601842739E+18
Converted back to long:
Original value: 9223372036854775805
Converted to double: 9.22337203685478E+18
Floored (as double): 9.22337203685478E+18
Converted back to long:
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise - there are more long values than doubles (given the NaN values) and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32 bit integers are representable as doubles.
I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.
