How should I obtain the fractional part of a floating-point value? - c

I have a variable x of type float, and I need its fractional part. I know I can get it with
x - floorf(x), or
fmodf(x, 1.0f)
My questions: Is one of these always preferable to the other? Are they effectively the same? Is there a third alternative I might consider?
Notes:
If the answer depends on the processor I'm using, let's make it x86_64, and if you can elaborate about other processors that would be nice.
Please make sure and refer to the behavior on negative values of x. I don't mind this behavior or that, but I need to know what the behavior is.

Is there a third alternative I might consider?
There's the dedicated function for it. modff exists to decompose a number into its integral and fractional parts.
float modff( float arg, float* iptr );
Decomposes given floating point value arg into integral and fractional
parts, each having the same type and sign as arg. The integral part
(in floating-point format) is stored in the object pointed to by iptr.

I'd say that x - floorf(x) is pretty good (exact), except in corner cases
it has the wrong sign bit for negative zero or any other negative whole float
(we might expect the fraction part to wear the same sign bit).
it does not work that well with inf
modff does respect the sign bit of -0.0 for both the integral and the fractional part, and returns +/-0.0 as the fractional part of +/-inf, at least if the implementation supports the IEC 60559 standard (IEEE 754).
A rationale for inf could be: since every float greater than 2^precision has a null fraction part, then it must be true for infinite float too.
That's minor, but nonetheless different.
EDIT: Err, of course, as pointed out by @StoryTeller-UnslanderMonica, the most obvious flaw of x - floor(x) is the case of a negative floating-point value with a fractional part: applied to -2.25, for example, it returns +0.75, which is not what we expect...
Since the c99 label is used, x - truncf(x) would be more correct, but it still suffers from the minor problems on which I initially focused.

Related

platform independent way to reduce precision of floating point constant values

The use case:
I have some large data arrays containing floating point constants.
The file defining that array is generated and the template can be easily adapted.
I would like to run some tests to see how reduced precision influences the results, both in terms of quality and in terms of the compressibility of the binary.
Since I do not want to change other source code than the generated file, I am looking for a way to reduce the precision of the constants.
I would like to limit the mantissa to a fixed number of bits (set the lower ones to 0). But since floating point literals are in decimal, there are some difficulties, specifying numbers in a way that the binary representation does contain all zeros at the lower mantissa bits.
The best case would be something like:
#define FP_REDUCE(float) /* some macro */
static const float32_t veryLargeArray[] = {
FP_REDUCE(23.423f), FP_REDUCE(0.000023f), FP_REDUCE(290.2342f),
// ...
};
#undef FP_REDUCE
This should be done at compile time and it should be platform independent.
The following uses the Veltkamp-Dekker splitting algorithm to remove n bits (with rounding) from x, where p = 2^n (for example, to remove eight bits, use 0x1p8f for the second argument). The casts to float32_t coerce the results to that type, as the C standard otherwise permits implementations to use more precision within expressions. (Double-rounding could produce incorrect results in theory, but this will not occur when float32_t is the IEEE basic 32-bit binary format and the C implementation computes this expression in that format or the 64-bit format or wider, as the former is the desired format and the latter is wide enough to represent intermediate results exactly.)
IEEE-754 binary floating-point is assumed, with round-to-nearest. Overflow occurs if x•(p+1) rounds to infinity.
#define RemoveBits(x, p) ((float32_t) (((float32_t) ((x) * ((p)+1))) - (float32_t) (((float32_t) ((x) * ((p)+1))) - (x))))
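As a rough illustration of how this macro could be dropped into the generated file, the sketch below uses it in a static initializer and adds a helper to inspect the resulting bit pattern. The typedef of float32_t to float, the array name reduced and the helper float_bits are assumptions made for this example; the macro is written with balanced parentheses.

```c
#include <stdint.h>
#include <string.h>

typedef float float32_t; /* assumption: float is IEEE 754 binary32 */

/* Veltkamp-Dekker split: t = x*(p+1); high part = t - (t - x). */
#define RemoveBits(x, p) \
    ((float32_t) (((float32_t) ((x) * ((p)+1))) - \
     (float32_t) (((float32_t) ((x) * ((p)+1))) - (x))))

/* Remove eight low significand bits from each constant at compile time. */
static const float32_t reduced[] = {
    RemoveBits(23.423f, 0x1p8f),
    RemoveBits(0.000023f, 0x1p8f),
    RemoveBits(290.2342f, 0x1p8f),
};

/* Hypothetical helper: return the raw bit pattern of a float. */
static uint32_t float_bits(float32_t f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

If the split behaves as described, the low eight bits of each stored significand, float_bits(reduced[i]) & 0xFF, should be zero.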
What you're asking for can be done with varying degrees of partial portability, but not absolute unless you want to run the source file through your own preprocessing tool at build time to reduce the precision. If that's an option for you, it's probably your best one.
Short of that, I'm going to assume at least that your floating point types are base 2 and obey Annex F/IEEE semantics. This should be a reasonable assumption, but the latter is false with gcc on platforms (including 32-bit x86) with extended-precision under the default standards-conformance profile; you need -std=cNN or -fexcess-precision=standard to fix it.
One approach is to add and subtract a power of two chosen to cause rounding to the desired precision:
#define FP_REDUCE(x,p) ((x)+(p)-(p))
Unfortunately, this works in absolute precisions, not relative, and requires knowing the right value p for the particular x, which is going to be equal to the value of the leading base-2 place of x, times 2 raised to the power of FLT_MANT_DIG minus the bits of precision you want. This cannot be evaluated as a constant expression for use as an initializer, but you can write it in terms of FLT_EPSILON and, if you can assume C99+, a preprocessor-token-pasting to form a hex float literal, yielding the correct value for this factor. But you still need to know the power of two for the leading digit of x; I don't see any way to extract that as a constant expression.
Edit: I believe this is fixable, so as not to need an absolute precision but rather automatically scale to the value, but it depends on correctness of a work in progress. See Is there a correct constant-expression, in terms of a float, for its msb?. If that works I will later integrate the result with this answer.
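A small sketch of the add-and-subtract approach, with p worked out by hand for one constant. The casts to float are an addition of mine to guard against excess precision, and the names reduced_example and bits_of are hypothetical. For 23.423f the leading bit is at 2^4; keeping 16 of the 24 significand bits gives p = 2^4 * 2^(24-16) = 2^12.

```c
#include <stdint.h>
#include <string.h>

/* Add and subtract p so that rounding in (x + p) discards the low bits of x.
   The casts force each step to float (assumption: IEEE 754 binary32,
   round-to-nearest). */
#define FP_REDUCE(x, p) ((float) ((float) ((x) + (p)) - (p)))

/* p = 2^12 chosen by hand for a value in [16, 32), keeping 16 bits. */
static const float reduced_example = FP_REDUCE(23.423f, 0x1p12f);

/* Hypothetical helper: return the raw bit pattern of a float. */
static uint32_t bits_of(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

The drawback described above is visible here: the right p depends on the magnitude of each constant, so the same p would not suit 0.000023f.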
Another approach I like, if your compiler supports compound literals in static initializers and if you can assume IEEE type representations, is using a union and masking off bits:
union fr { float f; uint32_t r; };
#define FP_REDUCE(v) ((union fr){.r = (union fr){v}.r & (0xffffffffu << n)}.f)
where n is the number of bits you want to drop. This will round towards zero rather than to nearest; if you want to make it round to nearest, it should be possible by adding an appropriate constant to the low bits before masking, but you have to take care about what happens when the addition overflows into the exponent bits.
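For checking what the masking does to a given value, a run-time equivalent can be handy. This is only a sketch under the same IEEE assumption; drop_low_bits is a made-up name, and memcpy is used instead of the union so that it works outside initializers.

```c
#include <stdint.h>
#include <string.h>

/* Clear the n lowest bits of the stored significand (rounds toward zero).
   Assumes float is IEEE 754 binary32 and n < 32. */
static float drop_low_bits(float x, unsigned n)
{
    uint32_t r;
    memcpy(&r, &x, sizeof r);  /* reinterpret the float's bits */
    r &= 0xffffffffu << n;     /* zero the low n bits */
    memcpy(&x, &r, sizeof x);
    return x;
}
```

Because the sign bit is untouched, negative values are truncated toward zero as well.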

c: change variable type without casting

I'm changing a uint32_t to a float but without changing the actual bits.
Just to be sure: I don't want to cast it. So float f = (float) i is the exact opposite of what I want to do because it changes bits.
I'm going to use this to convert my (pseudo) random numbers to float without doing unneeded math.
What I'm currently doing and what is already working is this:
float random_float( uint64_t seed ) {
// Generate random and change bit format to ieee
uint32_t asInt = (random_int( seed ) & 0x7FFFFF) | (0x7E000000>>1);
// Make it a float
return *(float*)(void*)&asInt; // <-- pretty ugly and needs a variable
}
The question: Now I'd like to get rid of the asInt variable, and I'd like to know if there is a better / less ugly way than taking the address of this variable, casting it twice and dereferencing it again?
You could try union - as long as you make sure the types are identical in memory sizes:
union convertor {
int asInt;
float asFloat;
};
Then you can store your integer in asInt and read it back through asFloat (or the other way around if you want to). I use it a lot when I need to do bitwise operations on one hand and still get a uint32_t representation of the number on the other hand.
[EDIT]
As many of the commenters rightfully state, you must take into consideration values that are not representable by integers, like NAN, +INF, -INF, +0, -0.
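A minimal sketch of the union approach, using uint32_t rather than int so that the sizes are known to match. The function names float_from_bits and half_to_one are invented for this example; the bit pattern 0x3F000000 is the same exponent the question builds with 0x7E000000>>1, and IEEE 754 binary32 is assumed.

```c
#include <stdint.h>

union convertor {
    uint32_t asInt;
    float asFloat;
};

/* C11: refuse to compile if the two members differ in size. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Reinterpret a bit pattern as a float without converting the value. */
static float float_from_bits(uint32_t bits)
{
    union convertor c;
    c.asInt = bits;
    return c.asFloat;
}

/* Keep 23 random mantissa bits and force the exponent for [0.5, 1.0). */
static float half_to_one(uint32_t random_bits)
{
    return float_from_bits((random_bits & 0x7FFFFFu) | 0x3F000000u);
}
```

Since the exponent field is fixed, this particular use can never produce NaN or infinity, whatever the random bits are.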
So you seem to want to generate floating point numbers between 0.5 and 1.0 judging from your code.
Assuming that your microcontroller has a standard C library with floating point support, you can do this all standards compliant without actually involving any floating point operations, all you need is the ldexp function that itself doesn't actually do any floating point math.
This would look something like this:
return ldexpf((1 << 23) + random_thing_smaller_than_23_bits(), -24);
The trick here is that we happen to know that IEEE754 binary32 floating point numbers have integer precision between 2^23 and 2^24 (I could be off-by-one here, double check please, I'm translating this from some work I've done on doubles). So the compiler should know how to convert that number to a float trivially. Then ldexp multiplies that number by 2^-24 by just changing the bits in the exponent. No actual floating point operations involved and no undefined behavior, the code is fully portable to any standard C implementation with IEEE754 numbers. Double check the generated code, but a good compiler and c library should not use any floating point instructions here.
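Spelled out as a complete function, the idea could look like the sketch below. The name half_to_one_ldexp and the masking of the input to 23 bits are my assumptions; the integer passed to ldexpf lies in [2^23, 2^24), which a binary32 float represents exactly.

```c
#include <math.h>
#include <stdint.h>

/* Map 23 random bits to a float in [0.5, 1.0) without bit reinterpretation.
   (2^23 + r) is exactly representable as a float; ldexpf scales by 2^-24. */
static float half_to_one_ldexp(uint32_t random_bits)
{
    uint32_t r = random_bits & 0x7FFFFFu; /* keep 23 bits */
    return ldexpf((float)((UINT32_C(1) << 23) + r), -24);
}
```

Whether the compiler avoids floating-point instructions here depends on the toolchain, so inspecting the generated code is still worthwhile.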
If you want to peek at some experiments I've done around generating random floating point numbers you can peek at this github repo. It's all about doubles, but should be trivially translatable to floats.
Reinterpreting the binary representation of an int to a float would result in major problems:
There are a lot of undefined codes in the binary representation of a float.
Other codes represent special conditions, like NAN, +INF, -INF, +0, -0 (sic!), etc.
Also, if that is a random value, even if catching all non-value representations, that would yield a very bad random distribution.
If you are working on an MCU without an FPU, you should consider avoiding float altogether. An alternative might be fractional or scaled integers. There are many implementations of algorithms which use float but can easily be converted to fixed-point types with acceptable loss of precision (or even none at all). Some might even yield more precision than float (note that single precision float has only 23 bits of mantissa, while an int32 would have 31 bits (+ 1 sign bit for either); the same holds for a fractional or fixed scaled int).
Note that the Embedded C extension (ISO/IEC TR 18037) specifies optional fixed-point types such as _Fract; it is not part of C11 itself, so check whether your compiler supports it. You might want to research that.
Edit:
According to your comments, you seem to convert the int to a float in the range 0..<1. For that, you can assemble the float using bit operations on a uint32_t (e.g. the original value). You just need to follow the IEEE format (presuming your toolchain complies with the C standard!). See Wikipedia.
The result (still uint32_t) can then be reinterpreted by a union or pointer as described by others already. Pack that in a system-dependent, well-commented library and bury it deep. Do not forget to check endianness and alignment (likely both the same for float and uint32_t, but important for the bit-ops).

In C, is specifying 2.0f the same as 2.000000f?

Are these lines the same?
float a = 2.0f;
and
float a = 2.000000f;
Yes, it is. No matter what representation you use, when the code is compiled, the number will be converted to a unique binary representation. There's only one way of representing 2 in the IEEE 754 binary32 standard used in modern computers to represent float numbers.
The only thing the C99 standard has to say on the matter is this (section 6.4.4.2):
For decimal floating constants ... the result is either
the nearest representable value, or the larger or smaller representable value immediately
adjacent to the nearest representable value, chosen in an implementation-defined manner.
That bit about "implementation-defined" means that technically an implementation could choose to do something different in each case. Although in practice, nothing weird is going to happen for a value like 2.
It's important to bear in mind that the C standards don't require IEEE-754.
Yes, they are the same.
Simple check:
http://codepad.org/FOQsufB4
#include <stdio.h>
int main(void) {
printf("%d", 2.0f == 2.000000f);
return 0;
}
^ Will output 1 (true)
Yes, sure, it is the same: extra zeros on the right are ignored, just like zeros on the left.

fixed point fx notation and converting

I have a fx1.15 notation. The underlying integer value is 63183 (register value).
Now, according to Wikipedia the complete length is 15 bits. The value does not fit inside, right?
So assuming it is a fx1.16 value, how do I convert it to a human readable value?
To convert a fixed-point value into something human-readable, do a floating-point divide by 2 to the number of fractional bits. For example, if there are 15 fractional bits, 2^15 = 32768, so you would use something like this:
int x = <fixed-point-value-in-1.15-format>
printf("x = %g\n", x / 32768.0);
Now converting fixed-point numbers to floating-point and invoking printf() are expensive operations, and they usually destroy any performance gained by using fixed-point. I presume you are only doing this for diagnostic purposes.
Also, note that if your platform is doing fixed-point because floating-point operations are forbidden or not available, then you'll have to do something different, along the lines of manually doing the decimal conversion. Model the integer as the underlying floating-point value multiplied by 32768 and go from there. There's some useful fixed-point code here.
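Applied to the register value from the question, the division could be sketched as follows. The function names are hypothetical, and the signed variant assumes the register holds a two's-complement Q1.15 value (1 sign bit, 15 fractional bits), which is one possible reading of "fx1.15".

```c
#include <stdint.h>

/* Signed reading: two's-complement Q1.15, range [-1, 1). */
static double q15_to_double(uint16_t raw)
{
    int32_t value = raw >= 0x8000u ? (int32_t)raw - 0x10000 : (int32_t)raw;
    return value / 32768.0; /* divide by 2^15 */
}

/* Unsigned reading: 1 integer bit + 15 fractional bits, range [0, 2). */
static double uq1_15_to_double(uint16_t raw)
{
    return raw / 32768.0;
}
```

For 63183 the signed reading gives roughly -0.0718 and the unsigned reading roughly 1.928, e.g. via printf("x = %g\n", q15_to_double(63183)); which one is right depends on how the hardware defines the format.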
p.s. I'm not sure you're still interested in this answer, ashirk, (I wrote it more for others), but if you are, welcome to Stack Overflow!

Is there a C rounding function like MATLAB's round function?

I need a C rounding function which rounds numbers like MATLAB's round function. Is there one? If you don't know how MATLAB's round function works see this link:
MATLAB round function
I was thinking I might just write my own simple round function to match MATLAB's functionality.
Thanks,
DemiSheep
This sounds similar to the round() function from math.h
These functions shall round their
argument to the nearest integer value
in floating-point format, rounding
halfway cases away from zero,
regardless of the current rounding
direction.
There's also lrint(), which gives you a long return value, though lrint() and friends obey the current rounding direction - you'll have to set that using fesetround(); the various rounding directions are found here.
Check out the standard header <fenv.h>, specifically the fesetround() function and the four macros FE_DOWNWARD, FE_TOWARDZERO, FE_TONEAREST and FE_UPWARD. This controls how floating point values are rounded to integers. Make sure your implementation (i.e., C compiler / C library) actually supports this (by checking the return value of fesetround() and the documentation of your implementation).
Functions honoring these settings include (from <math.h>):
llrint()
llrintf()
llrintl()
lrint()
lrintf()
lrintl()
rint()
rintf()
rintl()
nearbyint()
nearbyintf()
nearbyintl()
depending on your needs (parameter type and return type, with or without inexact floating point exception).
NOTE: round(), roundf() and roundl(), as well as the lround() and llround() families, do look like they belong in the list above, but they do not honor the rounding mode set by fesetround(): they always round halfway cases away from zero!
Refer to your most favourite standard library documentation for the exact details.
No, C (before C99) doesn't have a round function. The typical approach is something like this:
double sign(double x) {
if (x < 0.0)
return -1.0;
return 1.0;
}
double round(double x) {
return (long long)(x + 0.5 * sign(x));
}
This rounds to an integer, assuming the original number is in the range that can be represented by a long long. If you want to round to a specific number of places after the decimal point, that can be a bit harder. If the numbers aren't too large or too small, you can multiply by 10^N, round to an integer, and divide by 10^N again (keeping in mind that this may introduce some rounding errors of its own).
If there isn't a round() function in the standard library, you could, if dealing with floating-point numbers, arbitrarily evaluate each value, analyze the number in the place after the place you want to round to, check to see if it's greater, equal-to, or less-than 5; Then, if the value is less than 5, you can floor() the number you're ultimately looking at. If the value of the digit after the place you're rounding to is 5 or greater, you can proceed to having the function floor() the number being evaluated, then add 1.
I apologize for any inefficiency tied to this.
If I'm not mistaken you are looking for something like floor and ceil and you shall find them in <math.h>
The documentation specifies
Y = round(X) rounds the elements of X to the nearest integers.
Note the plural: as per regular MATLAB operations, it operates on all elements of a matrix. The C equivalents posted above only deal with a single value at once. If you can use C++, check out std::valarray. If not, then a good ol' for loop is your friend.

Resources