Fast float quantize, scaled by precision? - c

Since float precision reduces for larger values, in some cases it may be useful to quantize the value based on its size - instead of quantizing by an absolute value.
A naive approach could be to detect the precision and scale it up:
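#include <math.h>
float quantize(float value, float quantize_scale) {
    /* nextafterf() takes two arguments: the start value and a direction. */
    float factor = (nextafterf(fabsf(value), INFINITY) - fabsf(value)) * quantize_scale;
    return floorf((value / factor) + 0.5f) * factor;
}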
However this seems too heavy.
Instead, it should be possible to mask out bits in the float's mantissa
to simulate something like casting to a 16-bit float and back, for example.
Not being an expert in float bit twiddling, I couldn't say whether the resulting float would be valid (or would need normalizing).
For speed, when exact behavior regarding rounding isn't important, what is a fast way to quantize floats, taking their magnitude into account?

The Veltkamp-Dekker splitting algorithm will split a floating-point number into high and low parts. Sample code is below.
If there are s bits in the significand (53 in IEEE 754 64-bit binary), and the value Scale in the code below is 2^b, then *x0 receives the high s-b bits of x, and *x1 receives the remaining bits, which you may discard (or remove from the code below, so it is never calculated). If b is known at compile time, e.g., the constant 43, you can replace Scale with the appropriate constant, such as 0x1p43. Otherwise, you must produce 2^b in some way.
This requires round-to-nearest mode. IEEE 754 arithmetic suffices, but other reasonable arithmetic may be okay too. It rounds ties to even.
This assumes that x * (Scale + 1) does not overflow. The operations must be evaluated in the same precision as the value being separated. (double for double, float for float, and so on. If the compiler evaluates float expressions with double, this would break. A workaround would be to convert the inputs to the widest floating-point type supported, perform the split in that type [with Scale adjusted correspondingly], and then convert back.)
void Split(double *x0, double *x1, double x)
{
    double d = x * (Scale + 1); /* Scale is 2^b, e.g. 0x1p43 */
    double t = d - x;
    *x0 = d - t;
    *x1 = x - *x0;
}


Why casting double to int might give different results?

I am using fixed decimal point number (using uint16_t) to store percentage with 2 fractional digits. I have found that the way I am casting the double value to integer makes a difference in the resulting value.
const char* testString = "99.85";
double percent = atof(testString);
double hundred = 100;
uint16_t reInt1 = (uint16_t)(hundred * percent);
double stagedDouble = hundred * percent;
uint16_t reInt2 = (uint16_t)stagedDouble;
Example output:
percent: 99.850000
stagedDouble: 9985.000000
reInt1: 9984
reInt2: 9985
The error is visible in about 47% of all values between 0 and 10000 (of the fixed point representation). It does not appear at all when casting with stagedDouble. And I do not understand why the two integers are different. I am using GCC 6.3.0.
Edit: I improved the code snippet to show the percent variable and to use the same coefficient in both statements. Changing 100 into a double looks like a change that might affect the output, but it makes no difference in my program.
Is percent a float? If so, look at what types you're multiplying.
reInt1 is double * float and stagedDouble is int * float. Mixing up floating point math can cause these types of rounding errors.
Changing the 100's to be both double or both int results in the same answer.
The reported behavior is consistent with percent being declared float, and the use of IEEE-754 basic 32-bit and 64-bit binary floating-point for float and double.
uint16_t reInt1 = (uint16_t)(100.0 * percent);
Since 100.0 is a double constant, this converts percent to double, performs a multiplication in double, and converts the result to uint16_t. The multiplication may have a very slight rounding error, up to ½ ULP of the double format, a relative error around 2^-53.
double stagedDouble = 100 * percent;
uint16_t reInt2 = (uint16_t)stagedDouble;
Since 100 is an int constant, this converts 100 to float, performs a multiplication in float, and converts the result to uint16_t. The rounding error in the multiplication may be up to ½ ULP of the float format, a relative error around 2^-24.
Since all of the values are near hundredths of an integer, a 50:50 ratio of errors up:down would make about half the results just under what is needed for the integer threshold. In the multiplications, all those with values that are 0, 25, 50, or 75 one-hundredths would be exact (because 25/100 is ¼, which is exactly representable in binary floating-point), so 96/100 would have rounding errors. If the directions of the float and double rounding errors behave as independent, uniform random variables, about half would round in different directions, producing different results, giving about 48% mismatches, which is consistent with the 47% reported in the question.
(However, when I measure the actual results, I get 42% differences between the float and double methods. I suspect that has something to do with the trailing bits in the float multiplication before rounding—the distribution might not act like a uniform distribution of two possibilities. It may be the OP’s code prepares the percent values in some way other than dividing an integer value by 100.)

Nonintuitive result of the assignment of a double precision number to an int variable in C

Could someone give me an explanation why I get two different
numbers, resp. 14 and 15, as an output from the following code?
#include <stdio.h>
int main()
{
    double Vmax = 2.9;
    double Vmin = 1.4;
    double step = 0.1;
    double a =(Vmax-Vmin)/step;
    int b = (Vmax-Vmin)/step;
    int c = a;
    printf("%d %d",b,c); // 14 15, why?
    return 0;
}
I expect to get 15 in both cases but it seems I'm missing some fundamentals of the language.
I am not sure if it's relevant, but I was doing the test in CodeBlocks. However, if I type the same lines of code into an online compiler (this one, for example), I get an answer of 15 for both printed variables.
... why I get two different numbers ...
Aside from the usual floating-point issues, b and c are computed along different paths: c is calculated by first saving the value as double a.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
C allows intermediate floating-point math to be computed using wider types. Check the value of FLT_EVAL_METHOD from <float.h>.
Except for assignment and cast (which remove all extra range and precision), ...
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
2 evaluate all operations and constants to the range and precision of the long double type.
C11dr §5.2.4.2.2 9
OP reported 2
By saving the quotient in double a = (Vmax-Vmin)/step;, precision is forced to double whereas int b = (Vmax-Vmin)/step; could compute as long double.
This subtle difference results from (Vmax-Vmin)/step (computed perhaps as long double) being saved as a double versus remaining a long double. One as 15 (or just above), and the other just under 15. int truncation amplifies this difference to 15 and 14.
On another compiler, the results may both have been the same due to FLT_EVAL_METHOD < 2 or other floating-point characteristics.
Conversion to int from a floating-point number is severe with numbers near a whole number. Often better to round() or lround(). The best solution is situation dependent.
This is indeed an interesting question, here is what happens precisely in your hardware. This answer gives the exact calculations with the precision of IEEE double precision floats, i.e. 52 bits mantissa plus one implicit bit. For details on the representation, see the wikipedia article.
Ok, so you first define some variables:
double Vmax = 2.9;
double Vmin = 1.4;
double step = 0.1;
The respective values in binary will be
Vmax = 10.111001100110011001100110011001100110011001100110011
Vmin = 1.0110011001100110011001100110011001100110011001100110
step = .00011001100110011001100110011001100110011001100110011010
If you count the bits, you will see that I have given the first bit that is set plus 52 bits to the right. This is exactly the precision at which your computer stores a double. Note that the value of step has been rounded up.
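Now you do some math on these numbers. The first operation, the subtraction, gives an exact result:
  10.111001100110011001100110011001100110011001100110011
-  1.0110011001100110011001100110011001100110011001100110
=  1.1000000000000000000000000000000000000000000000000000
Then you divide by step, which has been rounded up by your compiler:
   1.1000000000000000000000000000000000000000000000000000
/   .00011001100110011001100110011001100110011001100110011010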
Due to the rounding of step, the result is a tad below 15. Unlike before, I have not rounded immediately, because that is precisely where the interesting stuff happens: Your CPU can indeed store floating point numbers of greater precision than a double, so rounding does not take place immediately.
So, when you convert the result of (Vmax-Vmin)/step directly to an int, your CPU simply cuts off the bits after the fractional point (this is how the implicit double -> int conversion is defined by the language standards):
cutoff to int: 1110
However, if you first store the result in a variable of type double, rounding takes place:
rounded: 1111.0000000000000000000000000000000000000000000000000
cutoff to int: 1111
And this is precisely the result you got.
The "simple" answer is that those seemingly-simple numbers 2.9, 1.4, and 0.1 are all represented internally as binary floating point, and in binary, the number 1/10 is represented as the infinitely-repeating binary fraction 0.00011001100110011... (base 2). (This is analogous to the way 1/3 in decimal ends up being 0.333333333... .) Converted back to decimal, those original numbers end up being things like 2.8999999999, 1.3999999999, and 0.0999999999. And when you do additional math on them, those .0999999999's tend to proliferate.
And then the additional problem is that the path by which you compute something -- whether you store it in intermediate variables of a particular type, or compute it "all at once", meaning that the processor might use internal registers with greater precision than type double -- can end up making a significant difference.
The bottom line is that when you convert a double back to an int, you almost always want to round, not truncate. What happened here was that (in effect) one computation path gave you 15.0000000001 which truncated down to 15, while the other gave you 14.999999999 which truncated all the way down to 14.
See also question 14.4a in the C FAQ list.
An equivalent problem is analyzed in analysis of C programs for FLT_EVAL_METHOD==2.
double a =(Vmax-Vmin)/step;
int b = (Vmax-Vmin)/step;
int c = a;
computes b by evaluating a long double expression and then truncating it to an int, whereas for c the long double result is first converted to double and then to int.
So the two values are not obtained by the same process, and this may lead to different results because floating-point types do not provide exact arithmetic.

Is there any accuracy gain when casting to double and back when doing float division?

What is the difference between two following?
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = f1 / f2;
float f1 = some_number;
float f2 = some_near_zero_number;
float result;
result = (double)f1 / (double)f2;
I am especially interested in very small f2 values which may produce +infinity when operating on floats. Is there any accuracy to be gained?
Some practical guidelines for using this kind of cast would be nice as well.
I am going to assume IEEE 754 binary floating point arithmetic, with float 32 bit and double 64 bit.
In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps.
Conversion from float to double is exact. For infinite, NaN, or zero-divisor inputs it makes no difference. Given a finite result, the IEEE 754 standard requires the result to be that of the real-number division f1/f2, rounded to the type being used in the division.
If it is done as a float division that is the closest float to the exact result. If it is done as double division, it will be the closest double with an additional rounding step for the assignment to result.
For most inputs, the two will give the same answer. Any overflow or underflow that did not happen on the division because it was done in double will happen instead on the conversion.
For simple conversion, if the answer is very close to half way between two float values, the two rounding steps may pick the wrong float. I had assumed this could also apply to division results. However, Pascal Cuoq, in a comment on this answer, has called attention to a very interesting paper, Innocuous Double Rounding of Basic Arithmetic Operations by Pierre Roux, claiming proof that double rounding is harmless for several operations, including division, under conditions that are implied by the assumptions I made at the start of this answer.
If the result of an individual floating-point addition, subtraction, multiply, or divide, is immediately stored to a float, there will be no accuracy improvement using double for intermediate values. In cases where operations are chained together, however, accuracy will often be improved by using a higher-precision intermediate type, provided that one is consistent in using them. In Turbo Pascal circa 1986 code like:
Function TriangleArea(A, B, C: Single): Single;
Var S: Extended; (* S stands for Semi-perimeter *)
Begin
  S := (A+B+C) * 0.5;
  TriangleArea := Sqrt((S-A)*(S-B)*(S-C)*S)
End;
would extend all operands of floating-point operations to type Extended (80-bit float), and then convert them back to single- or double-precision when storing to variables of those types. Very nice semantics for numerical processing. Turbo C of that era behaved similarly, but rather unhelpfully failed to provide any numeric type capable of holding intermediate results; the failure of languages to provide a variable type which could hold intermediate results led to people's unfairly criticizing the concept of a higher-precision intermediate result type, when the real problem was that languages failed to support it properly.
Anyway, if one were to write the above method into a modern language like C#:
public static float triangleArea(float a, float b, float c)
{
    double s = (a + b + c) * 0.5;
    // The method returns float, so the double result must be cast to float.
    return (float)(Math.Sqrt((s - a) * (s - b) * (s - c) * s));
}
the code would work well if the compiler happens to promote the operands of the addition to double before performing the computation, but that's something it may or may not do. If the compiler performs the calculation as float, precision may be horrid. When using the above formula to compute the area of an isosceles triangle with long sides of 16777215 and a short side of 4, for example, eager promotion will yield a correct result of 3.355443E+7 while performing the math as float will, depending upon the order of the operands, yield 5.033165E+7 [more than 50% too big] or 16777214.0 [more than 50% too small].
Note that even though code like the above will work perfectly in some environments but yield completely bogus results in others, compilers will generally not give any warning about the situation.
Although individual operations on float which are going to be immediately stored to float can be done just as accurately with type float as they could be with type double, eagerly promoting operands will often help considerably when operations are combined. In some cases, rearranging operations may avoid problems caused by loss of promotion (e.g. the above formula uses five additions, four multiplications, and a square root; rewriting the formula as:
increases the number of additions to eight, but will work correctly even if they are performed at single precision.
"Accuracy gain when casting to double and back when doing float division?"
The result depends on other factors aside from only the 2 posted methods.
C allows evaluation of float operations to happen at different levels depending on FLT_EVAL_METHOD. (See below table) If the current setting is 1 or 2, the two methods posted by OP will provide the same answer.
Depending on other code and compiler optimization levels, the quotient result may be used at wider precision in subsequent calculations in either of OP's cases.
Because of this, a float division that would overflow or become 0.0 (a result with total loss of precision) due to extreme float values may in fact not overflow or underflow if, after optimization, the quotient is carried forward as a double into subsequent calculations.
To compel the quotient to become a float for future calculations in the midst of potential optimizations, code often uses volatile:
volatile float result = f1 / f2;
C does not specify the precision of math operations, yet commonly applied standards like IEEE 754 provide that a single operation such as a binary32 divide yields the closest representable answer. Should the divide occur in a wider format like double or long double, the conversion of the wider quotient back to float is another rounding step that on rare occasions will result in a different answer than the direct float/float.
-1 indeterminable;
0 evaluate all operations and constants just to the range and precision of the type;
1 evaluate operations and constants of type float and double to the range and precision of the double type, evaluate long double operations and constants to the range and precision of the long double type;
2 evaluate all operations and constants to the range and precision of the long double type.
Practical guidelines:
Use float vs. double to conserve space when needed. (float is usually narrower, rarely the same, as double) If precision is important, use double (or long double).
Using float instead of double to improve speed may or may not work, as a platform's native operations may all be double. It may be faster, the same, or slower: profile to find out. Much of C was originally designed with double as the only level at which FP was carried out, aside from conversions between double and float. Later versions of C added functions like sinf() to facilitate faster, direct float operations. So the more modern the compiler/platform, the more likely float will be faster. Again: profile to find out.

Precision loss / rounding difference when directly assigning double result to an int

Is there a reason why converting from a double to an int performs as expected in this case:
double value = 45.33;
double multResult = (double) value*100.0; // assign to double
int convert = multResult; // assign to int
printf("convert = %d\n", convert); // prints 4533 as expected
But not in this case:
double value = 45.33;
int multResultInt = (double) value*100.0; // assign directly to int
printf("multResultInt = %d\n", multResultInt); // prints 4532??
It seems to me there should be no difference. In the second case the result is still first stored as a double before being converted to an int unless I am not understanding some difference between casts and hard assignments.
There is indeed no difference between the two in principle, but compilers take some liberties when it comes to floating-point computations. For example, compilers are free to use higher precision for intermediate results, but higher still means different, so the results may vary.
Some compilers provide switches to always drop extra precision and convert all intermediate results to the prescribed floating-point format (say, 64-bit double precision). This will make the code slower, however.
Specifically, the number 45.33 cannot be represented exactly as a floating-point value (it is a periodic number when expressed in binary and would require an infinite number of bits). When you multiply this value by 100 you may not get an integer, but something very close to one (just below or just above).
Conversion or cast to int is performed using truncation, so something very close to 4533 but below it becomes 4532, while something at or just above it becomes 4533, even if the difference is incredibly tiny.
To avoid having problems be sure to account for numeric accuracy problems. If you are doing a computation that depends on exact values of floating point numbers then you're using the wrong tool.
@6502 has given you the theory; here's how to look at things experimentally:
double v = 45.33;
int x = v * 100.0;
printf("x=%d v=%.20lf v100=%.20lf\n", x, v, v * 100.0 );
On my machine, this prints
x=4533 v=45.32999999999999829470 v100=4533.00000000000000000000
The value 45.33 does not have an exact representation when encoded as a 64-bit IEEE-754 floating point number. The actual value of v is slightly lower than the intended value due to the limited precision of the encoding.
So why does multiplying by 100.0 fix the problem on some machines? One possibility is that the multiplication is done with 80-bits of precision and then rounded to fit into a 64-bit result. The 80-bit number 4532.999... will round to 4533 when converted to 64-bits.
On your machine, the multiplication is evidently done with 64-bits of precision, and I would expect that v100 will print as 4532.999....

Truncating a double to a float in C

This is a very simple question, but an important one since it affects my whole project tremendously.
Suppose I have the following code snippet:
unsigned int x = 0xffffffff;
float f = (float)((double)x * (double)2.328306436538696e-010); // x/2^32
I would expect that f be something like 0.99999, but instead, it rounds up to 1, since it's the closest float approximation. That's not good since I need float values on the interval of [0,1), not [0,1]. I'm sure it's something simple, but I'd appreciate some help.
In C (since C99), you can change the rounding direction with fesetround from libm:
#include <stdio.h>
#include <fenv.h>
int main()
{
    fesetround(FE_DOWNWARD); // round toward negative infinity from here on
    // volatile -- uncomment for GNU gcc and whoever else doesn't support FENV
    unsigned long x = 0xffffffff;
    float f = (float)((double)x * (double)2.328306436538696e-010); // x/2^32
    printf("%.50f\n", f);
    return 0;
}
Tested with IBM XL, Sun Studio, clang, GNU gcc. This gives me 0.99999994039535522460937500000000000000000000000000 in all cases
The value above which a double rounds to 1 or more when converted to float in the default IEEE 754 rounding mode is 0x1.ffffffp-1 (in C99's hexadecimal notation, since your question is tagged “C”).
Your options are:
turn the FPU rounding mode to round-downward before the conversion, or
multiply by (0x1.ffffffp-1 / 0xffffffffp0) (give or take one ULP) to exploit the full single-precision range [0, 1) without getting the value 1.0f.
Method 2 leads to using the constant 0x1.ffffff01fffffp-33:
double factor = nextafter(0x1.ffffffp-1 / 0xffffffffp0, 0.0);
unsigned int x = 0xffffffff;
float f = (float)((double)x * factor);
printf("factor:%a\nunrounded:%a\nresult:%a\n", factor, (double)x * factor, f);
You could just truncate the value to maximum precision (keeping the 24 high bits) and divide by 2^24 to get the closest value a float can represent without being rounded to 1:
unsigned int i = 0xffffffff;
float value = (float)(i>>8)/(1<<24);
printf("%.20f\n", value);
printf("%a\n", value);
>>> 0.99999994039535522461
>>> 0x1.fffffep-1
There's not much you can do - your int holds 32 bits but the mantissa of a float holds only 24. Rounding is going to happen. You could change the processor rounding mode to round down instead of to nearest, but that is going to cause some side effects that you want to avoid especially if you don't restore the rounding mode when you are finished.
There's nothing wrong with the formula you're using, it's producing the most accurate answer possible for the given input. There's just an end case that's failing a hard requirement. There's nothing wrong with testing for the specific end case and replacing it with the closest value that meets the requirement:
if (f >= 1.0f)
f = 0.99999994f;
0.999999940395355224609375 is the closest value that an IEEE-754 float can take without being equal to 1.0.
My eventual solution was to just shrink my constant multiplier. It was probably the best solution, since there was no point in multiplying by a double anyway: the extra precision was not visible after conversion to a float.
So 2.328306436538696e-010 was changed to 2.3283063e-010.
