platform independent way to reduce precision of floating point constant values

platform independent way to reduce precision of floating point constant values - c

The use case:
I have some large data arrays containing floating point constants that.
The file defining that array is generated and the template can be easily adapted.
I would like to make some tests, how reduced precision does influence the results in terms of quality, but also in compressibility of the binary.
Since I do not want to change other source code than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (set the lower ones to 0). But since floating point literals are in decimal, there are some difficulties, specifying numbers in a way that the binary representation does contain all zeros at the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(float) /* some macro */
static const float32_t veryLargeArray[] = {
FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
// ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.

The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2n (for example, to remove eight bits, use 0x1p8f for the second argument). The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)
IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.
#define RemoveBits(x, p) (float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x))))

What you're asking for can be done with varying degrees of partial portability, but not absolute unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms (including 32-bit x86) with extended-precision under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
#define FP_REDUCE(x,p) ((x)+(p)-(p))
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, a preprocessor-token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
Edit: I believe this is fixable, so as not to need an absolute precision but rather automatically scale to the value, but it depends on correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works I will later integrate the result with this answer.
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
union { float x; uint32_t r; } fr;
#define FP_REDUCE(x) ((union fr){.r=(union fr){x}.r & (0xffffffffu<<n)}.x)
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want to make it round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.

Related

Why doesn't this cause overflow?

#include <stdio.h>
int main() {
double b = 3.14;
double c = -1e20;
c = -1e20 + b;
return 0;
}
As long as I know type "double" has 52 bits of fraction. To conform 3.14's exponent to -1e20, 3.14's faction part goes over 60 bits, which never fits to 52 bits.
In my understanding, rest of fraction bits other than 52, which roughly counts 14 bits, invades unassigned memory space, like this.
rough drawing
So I examined memory map in debug mode (gdb), suspecting that the bits next to the variable b or c would be corrupted. But I couldn't see any changes. What am I missing here?

You mix up 2 very different things:
Buffer overflow/overrun
Your added image shows what happens when you overflow your buffer.
Like definining char[100] and writing to index 150. Then the memory layout is important as you might corrupt neighboring variables.
Overflow in values of a data type
What your code shows can only be an overflow of values.
If you do int a= INT_MAX; a++ you get an integer overflow.
This only affects the resulting value.
It does not cause the variable to grow in size. An int will always stay an int.
You do not invade any memory area outside your data type.
Depending on data type and architecture the overflowing bits could just be chopped off or some saturation could be applied to set the value to maximum/minimum representable value.
I did not check for yout values but without inspecting the value of c in a debugger or printing it, you cannot tell anything about the overflow there.

Floating-point arithmetic is not defined to work by writing out all the bits of the operands, performing the arithmetic using all the bits involved, and storing those bits in memory. Rather, the way elementary floating-point operations work is that each operation is performed “as if it first produced an intermediate result correct to infinite precision and with unbounded range” and then rounded to a result that is representable in the floating-point format. That “as if” is important. It means that when computer processor designers are designing the floating-point arithmetic instructions, they figure out how to compute what the final rounded result would be. The processor does not always need to “write out” all the bits to do that.
Consider an example using decimal floating-point with four significant digits. If we add 6.543•1020 and 1.037•17 (equal to 0.001037•1020), the infinite-precision result would be 6.544037•1020, and then rounding that to the nearest number representable in the four-significant-digit format would give 6.544•1020. But we do not have to write out the infinite-precision result to compute that. We can compute the result is 6.544•1020 plus a tiny fraction, and then we can discard that fraction without actually writing out its digits. This is what processor designers do. The add, multiply, and other instructions compute the main part of a result, and they carefully manage information about the other parts to determine whether they would cause the result to round upward or downward in its last digit.
The resulting behavior is that, given any two operands in the format used for double, the computer always produces a result in that same format. It does not produce any extra bits.
Supplement
There are 53 bits in the fraction portion of the format commonly used for double. (This is the IEEE-754 binary64 format, also called double precision.) The fraction portion is called the significand. (You may see it referred to as a mantissa, but that is an old term for the fraction portion of a logarithm. The preferred term is “significand.” Significands are linear; mantissas are logarithmic.) You may see some people describe there being 52 bits for the significand, but that refers to a part of the encoding of the floating-point value, and it is only part of it.
Mathematically, a floating-point representation is defined to be s•f•be, where b is a fixed numeric base, s provides a sign (+1 or −1), f is a number with a fixed number of digits p in base b, and e is an exponent within fixed limits. p is called the precision of the format, and it is 53 for the binary64 format. When this number is encoded into bits, the last 52 bits of f are stored in the significand field, which is where the 52 comes from. However, the first bit is also encoded, by way of the exponent field. Whenever the stored exponent field is not zero (or the special value of all one bits), it means the first bit of f is 1. When the stored exponent field is zero, it means the first bit of f is 0. So there are 53 bits present in the encoding.

c: change variable type without casting

I'm changing an uint32_t to a float but without changing the actual bits.
Just to be sure: I don't wan't to cast it. So float f = (float) i is the exact opposite of what I wan't to do because it changes bits.
I'm going to use this to convert my (pseudo) random numbers to float without doing unneeded math.
What I'm currently doing and what is already working is this:
float random_float( uint64_t seed ) {
// Generate random and change bit format to ieee
uint32_t asInt = (random_int( seed ) & 0x7FFFFF) | (0x7E000000>>1);
// Make it a float
return *(float*)(void*)&asInt; // <-- pretty ugly and nees a variable
}
The Question: Now I'd like to get rid of the asInt variable and I'd like to know if there is a better / not so ugly way then getting the address of this variable, casting it twice and dereferencing it again?

You could try union - as long as you make sure the types are identical in memory sizes:
union convertor {
int asInt;
float asFloat;
};
Then you can assign your int to asFloat (or the other way around if you want to). I use it a lot when I need to do bitwise operations on one hand and still get a uint32_t representation on the number on the other hand
[EDIT]
Like many of the commentators rightfully state, you must take into consideration values that are not presentable by integers like like NAN, +INF, -INF, +0, -0.

So you seem to want to generate floating point numbers between 0.5 and 1.0 judging from your code.
Assuming that your microcontroller has a standard C library with floating point support, you can do this all standards compliant without actually involving any floating point operations, all you need is the ldexp function that itself doesn't actually do any floating point math.
This would look something like this:
return ldexpf((1 << 23) + random_thing_smaller_than_23_bits(), -24);
The trick here is that we happen to know that IEEE754 binary32 floating point numbers have integer precision between 2^23 and 2^24 (I could be off-by-one here, double check please, I'm translating this from some work I've done on doubles). So the compiler should know how to convert that number to a float trivially. Then ldexp multiplies that number by 2^-24 by just changing the bits in the exponent. No actual floating point operations involved and no undefined behavior, the code is fully portable to any standard C implementation with IEEE754 numbers. Double check the generated code, but a good compiler and c library should not use any floating point instructions here.
If you want to peek at some experiments I've done around generating random floating point numbers you can peek at this github repo. It's all about doubles, but should be trivially translatable to floats.

Reinterpreting the binary representation of an int to a float would result in major problems:
There are a lot of undefined codes in the binary representation of a float.
Other codes represent special conditions, like NAN, +INF, -INF, +0, -0 (sic!), etc.
Also, if that is a random value, even if catching all non-value representations, that would yield a very bad random distribution.
If you are working on an MCU without FPU, you should better think about avoiding float at all. An alternative might be fraction or scaled integers. There are many implementations of algorithms which use float, but can be easily converted to fixed point types with acceptable loss of precision (or even none at all). Some might even yield more precision than float (note that single precision float has only 23 bits of mantissa, an int32 would have 31 bits (+ 1 sign for either), same for a fractional or fixed scaled int.
Note that C11 added (optional) support for _Frac. You might want to research on that.
Edit:
According you your comments, you seem to convert the int to a float in range 0..<1. For that, you can assemble the float using bit operations on an uint32_t (e.g. the original value). You just need to follow the IEEE format (presumed your toolchain does comply to the C standard! See wikipedia.
The result (still uint32_t) can then be reinterpreted by a union or pointer as described by others already. Pack that in a system-dependent, well-commented library and dig it deep. Do not forget to check about endianess and alignment (likely both the same for float and uint32_t, but important for the bit-ops).

Max value of datatypes in C

I am trying to understand the maximum value that I can store in C. I tried doing printf("%f", pow(2, x)). The answer holds good until x = 1023. It says Inf when x = 1024.
I am sorry that it is a basic question but I am trying to understand how C assigns datatypes' sizes based on my machine.
I have a Mac (64-bit processor). A clear understanding that I have is that my processor being a 64-bit one, it will be able to do calculations up to the value (264). Clearly pow(2, 1023) is greater than that. But my program is working fine till x = 1023. How is this possible? Is GNU compiler has something to do with this?
If this is a duplicate of other question kindly give the link.

In C the pow() functions returns a double, and the double type is typically a 64-bit IEEE format representation of a floating point number.
The basic idea of floating point is to express a number in the same general way as e.g. 1.234×1056. Here you have a mantissa 1.234 and an exponent 56. C++, and probably also C, allows decimal representation for floating point numbers (but not for integer types), but in practice the internal representation will be binary, with a power of 2 rather than a power of 10.
The limit you ran up against was the supported range for the exponent in your compiler's representation of double numbers; probably 64-bit IEEE 754.
The limits of the various built-in integral numerical types are available as symbolic constants from <limits.h>. The limits of the built-in floating point types are available as symbolic constants from <float.h>. See the table over at cppreference.com for more details.
In C++ these limits are also available via the numeric_limits class template from <limits>.

"64-bit processor" typically means that it can deal with integers that contain at most 64 bits at a time (i.e. in a single instruction), not that it can only process numbers with 64 binary digits or less. Using arbitrary precision arithmetic you can do calculations on numbers that are arbitrarily large, provided that you have enough memory (and time), just like how us humans can do operations on big values with only 10 fingers. Read more here: What is the biggest number you can generate using a 64-bit processor?
However pow(2, 1023) is a little bit different. It's not an integer but a floating-point number (of type double in C) represented by a sign, a mantissa and an exponent like this (-1)sign × 1 × 21023. Not all the digits are stored so it's only accurate to the first few digits. However most systems use binary floating-point types so they can store the precise value of a power of 2 up to a large exponent depending on the exponent range. Most modern systems' floating-point types conform to IEEE-754 standard with double maps to binary64/double precision, therefore the maximum value will be
21023 × (1 + (1 − 2−52)) ≈ 1.7976931348623157 × 10308

The maximum value for a double is DBL_MAX. This is defined by <float.h> in C, or <cfloat> in C++. The numeric value may vary across systems, but you can always refer to it by the macro DBL_MAX.
You can print this:
printf("%f\n", DBL_MAX);
The integer data types all have similar macros defined in <limits.h>: e.g. ULLONG_MAX is the biggest value for unsigned long long. If printing with printf make sure to use the correct format specifier.

How to avoid floating point round off error in unit tests?

I'm trying to write unit tests for some simple vector math functions that operate on arrays of single precision floating point numbers. The functions use SSE intrinsics and I'm getting false positives (at least I think) when running the tests on a 32-bit system (the tests pass on 64-bit). As the operation runs through the array, I accumulate more and more round off error. Here is a snippet of unit test code and output (my actual question(s) follow):
Test Setup:
static const int N = 1024;
static const float MSCALAR = 42.42f;
static void setup(void) {
input = _mm_malloc(sizeof(*input) * N, 16);
ainput = _mm_malloc(sizeof(*ainput) * N, 16);
output = _mm_malloc(sizeof(*output) * N, 16);
expected = _mm_malloc(sizeof(*expected) * N, 16);
memset(output, 0, sizeof(*output) * N);
for (int i = 0; i < N; i++) {
input[i] = i * 0.4f;
ainput[i] = i * 2.1f;
expected[i] = (input[i] * MSCALAR) + ainput[i];
}
}
My main test code then calls the function to be tested (which does the same calculation used to generate the expected array) and checks its output against the expected array generated above. The check is for closeness (within 0.0001) not equality.
Sample output:
0.000000 0.000000 delta: 0.000000
44.419998 44.419998 delta: 0.000000
...snip 100 or so lines...
2043.319946 2043.319946 delta: 0.000000
2087.739746 2087.739990 delta: 0.000244
...snip 100 or so lines...
4086.639893 4086.639893 delta: 0.000000
4131.059570 4131.060059 delta: 0.000488
4175.479492 4175.479980 delta: 0.000488
...etc, etc...
I know I have two problems:
On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
So my question is, what is the proper way to write meaningful and portable unit tests for math operations on floating point data?
*By portable I mean should pass on both 32 and 64 bit architectures.

Per a comment, we see that the function being tested is essentially:
for (int i = 0; i < N; ++i)
D[i] = A[i] * b + C[i];
where A[i], b, C[i], and D[i] all have type float. When referring to the data of a single iteration, I will use a, c, and d for A[i], C[i], and D[i].
Below is an analysis of what we could use for an error tolerance when testing this function. First, though, I want to point out that we can design the test so that there is no error. We can choose the values of A[i], b, C[i], and D[i] so that all the results, both final and intermediate results, are exactly representable and there is no rounding error. Obviously, this will not test the floating-point arithmetic, but that is not the goal. The goal is to test the code of the function: Does it execute instructions that compute the desired function? Simply choosing values that would reveal any failures to use the right data, to add, to multiply, or to store to the right location will suffice to reveal bugs in the function. We trust that the hardware performs floating-point correctly and are not testing that; we just want to test that the function was written correctly. To accomplish this, we could, for example, set b to a power of two, A[i] to various small integers, and C[i] to various small integers multiplied by b. I could detail limits on these values more precisely if desired. Then all results would be exact, and any need to allow for a tolerance in comparison would vanish.
That aside, let us proceed to error analysis.
The goal is to find bugs in the implementation of the function. To do this, we can ignore small errors in the floating-point arithmetic, because the kinds of bugs we are seeking almost always cause large errors: The wrong operation is used, the wrong data is used, or the result is not stored in the desired location, so the actual result is almost always very different from the expected result.
Now the question is how much error should we tolerate? Because bugs will generally cause large errors, we can set the tolerance quite high. However, in floating-point, “high” is still relative; an error of one million is small compared to values in the trillions, but it is too high to discover errors when the input values are in the ones. So we ought to do at least some analysis to decide the level.
The function being tested will use SSE intrinsics. This means it will, for each i in the loop above, either perform a floating-point multiply and a floating-point add or will perform a fused floating-point multiply-add. The potential errors in the latter are a subset of the former, so I will use the former. The floating-point operations for a*b+c do some rounding so that they calculate a result that is approximately a•b+c (interpreted as an exact mathematical expression, not floating-point). We can write the exact value calculated as (a•b•(1+e0)+c)•(1+e1) for some errors e0 and e1 with magnitudes at most 2-24, provided all the values are in the normal range of the floating-point format. (2-24 is the maximum relative error that can occur in any correctly rounded elementary floating-point operation in round-to-nearest mode in the IEEE-754 32-bit binary floating-point format. Rounding in round-to-nearest mode changes the mathematical value by at most half the value of the least significant bit in the significand, which is 23 bits below the most significant bit.)
Next, we consider what value the test program produces for its expected value. It uses the C code d = a*b + c;. (I have converted the long names in the question to shorter names.) Ideally, this would also calculate a multiply and an add in IEEE-754 32-bit binary floating-point. If it did, then the result would be identical to the function being tested, and there would be no need to allow for any tolerance in comparison. However, the C standard allows implementations some flexibility in performing floating-point arithmetic, and there are non-conforming implementations that take more liberties than the standard allows.
A common behavior is for an expression to be computed with more precision than its nominal type. Some compilers may calculate a*b + c using double or long double arithmetic. The C standard requires that results be converted to the nominal type in casts or assignments; extra precision must be discarded. If the C implementation is using extra precision, then the calculation proceeds: a*b is calculated with extra precision, yielding exactly a•b, because double and long double have enough precision to exactly represent the product of any two float values. A C implementation might then round this result to float. This is unlikely, but I allow for it anyway. However, I also dismiss it because it moves the expected result to be closer to the result of the function being tested, and we just need to know the maximum error that can occur. So I will continue, with the worse (more distant) case, that the result so far is a•b. Then c is added, yielding (a•b+c)•(1+e2) for some e2 with magnitude at most 2-53 (the maximum relative error of normal numbers in the 64-bit binary format). Finally, this value is converted to float for assignment to d, yielding (a•b+c)•(1+e2)•(1+e3) for some e3 with magnitude at most 2-24.
Now we have expressions for the exact result computed by a correctly operating function, (a•b•(1+e0)+c)•(1+e1), and for the exact result computed by the test code, (a•b+c)•(1+e2)•(1+e3), and we can calculate a bound on how much they can differ. Simple algebra tells us the exact difference is a•b•(e0+e1+e0•e1-e2-e3-e2•e3)+c•(e1-e2-e3-e2•e3). This is a simple function of e0, e1, e2, and e3, and we can see its extremes occur at endpoints of the potential values for e0, e1, e2, and e3. There are some complications due to interactions between possibilities for the signs of the values, but we can simply allow some extra error for the worst case. A bound on the maximum magnitude of the difference is |a•b|•(3•2-24+2-53+2-48)+|c|•(2•2-24+2-53+2-77).
Because we have plenty of room, we can simplify that, as long as we do it in the direction of making the values larger. E.g., it might be convenient to use |a•b|•3.001•2-24+|c|•2.001•2-24. This expression should suffice to allow for rounding in floating-point calculations while detecting nearly all implementation errors.
Note that the expression is not proportional to the final value, a*b+c, as calculated either by the function being tested or by the test program. This means that, in general, tests using a tolerance relative to the final values calculated by the function being tested or by the test program are wrong. The proper form of a test should be something like this:
double tolerance = fabs(input[i] * MSCALAR) * 0x3.001p-24 + fabs(ainput[i]) * 0x2.001p-24;
double difference = fabs(output[i] - expected[i]);
if (! (difference < tolerance))
// Report error here.
In summary, this gives us a tolerance that is larger than any possible differences due to floating-point rounding, so it should never give us a false positive (report the test function is broken when it is not). However, it is very small compared to the errors caused by the bugs we want to detect, so it should rarely give us a false negative (fail to report an actual bug).
(Note that there are also rounding errors computing the tolerance, but they are smaller than the slop I have allowed for in using .001 in the coefficients, so we can ignore them.)
(Also note that ! (difference < tolerance) is not equivalent to difference >= tolerance. If the function produces a NaN, due to a bug, any comparison yields false: both difference < tolerance and difference >= tolerance yield false, but ! (difference < tolerance) yields true.)

On 32-bit machines, differences between 387 and SSE floating point arithmetic units. I believe 387 uses more bits for intermediate values.
If you are using GCC as 32-bit compiler, you can tell it to generate SSE2 code still with options -msse2 -mfpmath=sse. Clang can be told to do the same thing with one of the two options and ignores the other one (I forget which). In both cases the binary program should implement strict IEEE 754 semantics, and compute the same result as a 64-bit program that also uses SSE2 instructions to implement strict IEEE 754 semantics.
Non-exact representation of my 42.42 value that I'm using to generate expected values.
The C standard says that a literal such as 42.42f must be converted to either the floating-point number immediately above or immediately below the number represented in decimal. Moreover, if the literal is representable exactly as a floating-point number of the intended format, then this value must be used. However, a quality compiler (such as GCC) will give you(*) the nearest representable floating-point number, of which there is only one, so again, this is not a real portability issue as long as you are using a quality compiler (or at the very least, the same compiler).
Should this turn out to be a problem, a solution is to write an exact representation of the constants you intend. Such an exact representation can be very long in decimal format (up to 750 decimal digits for the exact representation of a double) but is always quite compact in C99's hexadecimal format: 0x1.535c28p+5 for the exact representation of the float nearest to 42.42. A recent version of the static analysis platform for C programs Frama-C can provide the hexadecimal representation of all inexact decimal floating-point constants with option -warn-decimal-float:all.
(*) barring a few conversion bugs in older GCC versions. See Rick Regan's blog for details.

fixed point fx notation and converting

I have a fx1.15 notation. The underlying integer value is 63183 (register value).
Now, according to wikipedia the the complete length is 15 bits. The value does not fit inside, right?
So assuming it is a fx1.16 value, how do I convert it to a human readable value?

To convert a fixed-point value into something human-readable, do a floating-point divide by 2 to the number of fractional bits. For example, if there are 15 fractional bits, 2^15 = 32768, so you would use something like this:
int x = <fixed-point-value-in-1.15-format>
printf("x = %g\n", x / 32768.0);
Now converting fixed-point numbers to floating-point and invoking printf() are expensive operations, and they usually destroy any performance gained by using fixed-point. I presume you are only doing this for diagnostic purposes.
Also, note that if your platform is doing fixed-point because floating-point operations are forbidden or not available, then you'll have to do something different, along the lines of manually doing the decimal conversion. Model the integer as the underlying floating-point value multiplied by 32768 and go from there. There's some useful fixed-point code here.
p.s. I'm not sure you're still interested in this answer, ashirk, (I wrote it more for others), but if you are, welcome to Stack Overflow!

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight