Getting absolute value from a binary int using bit arithmetic - C

flt32 flt32_abs (flt32 x) {
    int mask = x >> 31;   /* all ones if x is negative, all zeros otherwise */
    printMask(mask, 32);
    puts("Original");
    printMask(x, 32);
    x = x ^ mask;         /* flips every bit when x is negative */
    puts("after XOR");
    printMask(x, 32);
    x = x - mask;         /* adds 1 when x is negative, completing the two's-complement negation */
    puts("after x-mask");
    printMask(x, 32);
    return x;
}
Here's my code; calling the function on the value -32 returns .125. I'm confused because it's a pretty straightforward formula for abs on bits, but I seem to be missing something. Any ideas?

Is flt32 a type for floating point or fixed point numbers?
I suspect it's a type for fixed point arithmetic and you are not using it correctly. Let me explain it.
A fixed-point number uses, as the name says, a fixed position for the decimal digit; this means it uses a fixed number of bits for the decimal part. It is, in fact, a scaled integer.
I guess the flt32 type you are using uses the most significant 24 bits for the whole part and the least significant 8 bits for the decimal part; the real-number value of a 32-bit representation is the value of that same 32-bit pattern interpreted as an integer, divided by 256 (i.e. 2^8).
For example, the 32-bit number 0x00000020 is interpreted as integer as 32. As fixed-point number using 8 bits for the decimal part, its value is 0.125 (=32/256).
The code you posted is correct but you are not using it correctly.
The number -32 encoded as fixed-point number using 8 decimal digits is 0xFFFFE000 which is the integer representation of -8192 (=-32*256). The algorithm correctly produces 8192 which is 0x00002000 (=32*256); this is also 32 when it is interpreted as fixed-point.
If you pass -32 to the function without taking care to encode it as fixed-point, it correctly converts it to 32 and returns this value. But 32 (0x00000020) is 0.125 (=1/8=32/256) when it is interpreted as fixed-point (what I assume the function printMask() does).
How can you test the code correctly?
You probably have a function that creates fixed-point numbers from integers. Use it to get the correct representation of -32 and pass that value to the flt32_abs() function.
In case you don't have such a function, it is easy to write it. Just multiply the integer with 256 (or even better, left-shift it 8 bits) and that's all:
fx32 int_to_fx32(int x)
{
    return x << 8;
}
The fixed-point libraries usually use macros for such conversions because they produce faster code. Expressed as macro, it looks like this:
#define int_to_fx32(x) ((x) << 8)
Now you do the test:
fx32 negative = int_to_fx32(-32);
fx32 positive = fx32_abs(negative);
// This should print 32
printMask(positive, 32);
// This should print 8192
printf("%d", positive);
// This should print -8192
printf("%d", negative);
// This should print 0.125
printMask(32, 32);
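For completeness, here is a minimal, self-contained version of that test that you could compile and run. The fx32 typedef and the fx32_to_double() helper are my own names for illustration (a real fixed-point library will have its own equivalents), and I use multiplication rather than a left shift so the negative case stays well defined:

#include <stdio.h>
#include <stdint.h>

typedef int32_t fx32;              /* assumption: 24.8 fixed point stored in a 32-bit int */

static fx32 int_to_fx32(int x) {
    return (fx32)(x * 256);        /* scale by 2^8 to place the binary point */
}

static double fx32_to_double(fx32 x) {
    return (double)x / 256.0;      /* undo the scaling to recover the real value */
}

static fx32 fx32_abs(fx32 x) {
    fx32 mask = x >> 31;           /* all ones if negative, all zeros otherwise */
    return (x ^ mask) - mask;
}

int main(void) {
    fx32 negative = int_to_fx32(-32);
    fx32 positive = fx32_abs(negative);
    printf("raw %d = %g\n", (int)negative, fx32_to_double(negative)); /* raw -8192 = -32 */
    printf("raw %d = %g\n", (int)positive, fx32_to_double(positive)); /* raw  8192 =  32 */
    return 0;
}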

int flt32_abs (int x) {
^^^ ^^^
int mask=x>>31;
x=x^mask;
x=x-mask;
return x;
}
I've been able to fix this and obtain the result of 32 by changing float to int; otherwise the code wouldn't build, giving the error:
error: invalid operands of types 'float' and 'int' to binary 'operator>>'
For an explanation of why binary operations on floats are not allowed in C++, see
How to perform a bitwise operation on floating point numbers
I would like to ask more experienced developers, why did the code even build for OP? Relaxed compiler settings, I guess?

Related

Bitwise splitting the mantissa of an IEEE 754 double? How to access the bit structure

(sorry, I'm coming up with some funny ideas... bear with me...)
Let's say I have a 'double' value, consisting of:
sign   exponent      (implicit).mantissa
0      10000001001   (1).0011010010101010000001000001100010010011011101001100
representing 1234.6565 if I'm right.
I'd like to be able to access the fields of sign, exponent, implicit and mantissa separately as bits!, and manipulate them with bitwise operations like AND, OR, XOR ... or string operations like 'left', mid etc.
And then I'd like to puzzle a new double together from the manipulated bits.
e.g. setting the sign bit to 1 would make the number negative, adding or subtracting 1 to/from the exponent would double/halve the value, stripping all bits behind the position indicated by the recalculated (unbiased) value of the exponent would convert the value to an integer and so on.
Other tasks would/could be to find the last set bit, calculate how much it contributes to the value, check if the last bit is '1' (binary 'odd') or '0' (binary 'even') and the like.
I have seen similar in programs, just can't find it on the fly. I may remember something with 'reinterpret cast' or similar? I think there are libraries or toolkits or 'howtos' around which offer access to such, and hope here are reading people who can point me to such.
I'd like a solution near simple processor instructions and simple C code. I'm working on Debian Linux and compiling with the gcc that was installed by default.
Starting point is any double value I can address as 'x';
starting point 2 is that I'm not! an experienced programmer :-(
How can I do this easily, and get it working with good performance?
This is straightforward, if a bit esoteric.
Step 1 is to access the individual bits of a float or double. There are a number of ways of doing this, but the commonest are to use a char * pointer, or a union. For our purposes today let's use a union. [There are subtleties to this choice, which I'll address in a footnote.]
#include <stdint.h>

union doublebits {
    double d;
    uint64_t bits;
};

union doublebits x;
x.d = 1234.6565;
So now x.bits lets us access the bits and bytes of our double value as a 64-bit unsigned integer. First, we could print them out:
printf("bits: %llx\n", x.bits);
This prints
bits: 40934aa04189374c
and we're on our way.
The rest is "simple" bit manipulation.
We'll start by doing it the brute-force, obvious way:
int sign = x.bits >> 63;
int exponent = (x.bits >> 52) & 0x7ff;
long long mantissa = x.bits & 0xfffffffffffff;
printf("sign = %d, exponent = %d, mantissa = %llx\n", sign, exponent, mantissa);
This prints
sign = 0, exponent = 1033, mantissa = 34aa04189374c
and these values exactly match the bit decomposition you showed in your question, so it looks like you were right about the number 1234.6565.
What we have so far are the raw exponent and mantissa values.
As you know, the exponent is offset, and the mantissa has an implicit leading "1", so let's take care of those:
exponent -= 1023;
mantissa |= 1ULL << 52;
(Actually this isn't quite right. Soon enough we're going to have to address some additional complications having to do with denormalized numbers, and infinities and NaNs.)
Now that we have the true mantissa and exponent, we can do some math to recombine them, to see if everything is working:
double check = (double)mantissa * pow(2, exponent);
But if you try that, it gives the wrong answer, and it's because of a subtlety that, for me, is always the hardest part of this stuff: Where is the decimal point in the mantissa, really?
(Actually, it's not a "decimal point", anyway, because we're not working in decimal. Formally it's a "radix point", but that sounds too stuffy, so I'm going to keep using "decimal point", even though it's wrong. Apologies to any pedants whom this rubs the wrong way.)
When we did mantissa * pow(2, exponent) we assumed a decimal point, in effect, at the right end of the mantissa, but really, it's supposed to be 52 bits to the left of that (where that number 52 is, of course, the number of explicit mantissa bits). That is, our hexadecimal mantissa 0x134aa04189374c (with the leading 1 bit restored) is actually supposed to be treated more like 0x1.34aa04189374c. We can fix this by adjusting the exponent, subtracting 52:
double check = (double)mantissa * pow(2, exponent - 52);
printf("check = %f\n", check);
So now check is 1234.6565 (plus or minus some roundoff error). And that's the same number we started with, so it looks like our extraction was correct in all respects.
But we have some unfinished business, because for a fully general solution, we have to handle "subnormal" (also known as "denormalized") numbers, and the special representations inf and NaN.
These wrinkles are controlled by the exponent field. If the exponent (before subtracting the bias) is exactly 0, this indicates a subnormal number, that is, one whose mantissa is not in the normal range of (decimal) 1.00000 to 1.99999. A subnormal number does not have the implicit leading "1" bit, and the mantissa ends up being in the range from 0.00000 to 0.99999. (This also ends up being the way the ordinary number 0.0 has to be represented, since it obviously can't have that implicit leading "1" bit!)
On the other hand, if the exponent field has its maximum value (that is, 2047, or 2^11 - 1, for a double) this indicates a special marker. In that case, if the mantissa is 0, we have an infinity, with the sign bit distinguishing between positive and negative infinity. Or, if the exponent is max and the mantissa is not 0, we have a "not a number" marker, or NaN. The specific nonzero value in the mantissa can be used to distinguish between different kinds of NaN, like "quiet" and "signaling" ones, although it turns out the particular values that might be used for this aren't standard, so we'll ignore that little detail.
(If you're not familiar with infinities and NaNs, they're what IEEE-754 says that certain operations are supposed to return when the proper mathematical result is, well, not an ordinary number. For example, sqrt(-1.0) returns NaN, and 1./0. typically gives inf. There's a whole set of IEEE-754 rules about infinities and NaNs, such as that atan(inf) returns π/2.)
The bottom line is that instead of just blindly tacking on the implicit 1 bit, we have to check the exponent value first, and do things slightly differently depending on whether the exponent has its maximum value (indicating specials), an in-between value (indicating ordinary numbers), or 0 (indicating subnormal numbers):
if(exponent == 2047) {
    /* inf or NaN */
    if(mantissa != 0)
        printf("NaN\n");
    else if(sign)
        printf("-inf\n");
    else
        printf("inf\n");
} else if(exponent != 0) {
    /* ordinary value */
    mantissa |= 1ULL << 52;
} else {
    /* subnormal */
    exponent++;
}
exponent -= 1023;
That last adjustment, adding 1 to the exponent for subnormal numbers, reflects the fact that subnormals are "interpreted with the value of the smallest allowed exponent, which is one greater" (per the Wikipedia article on subnormal numbers).
I said this was all "straightforward, if a bit esoteric", but as you can see, while extracting the raw mantissa and exponent values is indeed pretty straightforward, interpreting what they actually mean can be a challenge!
If you already have raw exponent and mantissa numbers, going back in the other direction — that is, constructing a double value from them — is just about as straightforward:
sign = 1;
exponent = 1024;
mantissa = 0x921fb54442d18;
x.bits = ((uint64_t)sign << 63) | ((uint64_t)exponent << 52) | mantissa;
printf("%.15f\n", x.d);
This answer is getting too long, so for now I'm not going to delve into the question of how to construct appropriate exponent and mantissa numbers from scratch for an arbitrary real number. (Me, I usually do the equivalent of x.d = atof(the number I care about), and then use the techniques we've been discussing so far.)
Your original question was about "bitwise splitting", which is what we've been discussing. But it's worth noting that there's a much more portable way to do all this, if you don't want to muck around with raw bits, and if you don't want/need to assume that your machine uses IEEE-754. If you just want to split a floating-point number into a mantissa and an exponent, you can use the standard library frexp function:
int exp;
double mant = frexp(1234.6565, &exp);
printf("mant = %.15f, exp = %d\n", mant, exp);
This prints
mant = 0.602859619140625, exp = 11
and that looks right, because 0.602859619140625 × 2^11 = 1234.6565 (approximately). (How does it compare to our bitwise decomposition? Well, our mantissa was 0x34aa04189374c, or 0x1.34aa04189374c, which in decimal is 1.20571923828125, which is twice the mantissa that frexp just gave us. But our exponent was 1033 - 1023 = 10, which is one less, so it comes out in the wash: 1.20571923828125 × 2^10 = 0.602859619140625 × 2^11 = 1234.6565.)
There's also a function ldexp that goes in the other direction:
double x2 = ldexp(mant, exp);
printf("%f\n", x2);
This prints 1234.656500 again.
Footnote: When you're trying to access the raw bits of something, as of course we've been doing here, there are some lurking portability and correctness questions having to do with something called strict aliasing. Strictly speaking, and depending on who you ask, you may need to use an array of unsigned char as the other part of your union, not uint64_t as I've been doing here. And there are those who say that you can't portably use a union at all, that you have to use memcpy to copy the bytes into a completely separate data structure, although I think they're talking about C++, not C.
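In case it's useful, here is roughly what that memcpy variant might look like. Like the rest of this answer, it assumes double and uint64_t are both 64 bits:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double d = 1234.6565;
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits);            /* copy the object representation out */
    printf("bits: %llx\n", (unsigned long long)bits);

    bits ^= 1ULL << 63;                        /* flip the sign bit */
    memcpy(&d, &bits, sizeof d);               /* and copy it back in */
    printf("negated: %f\n", d);                /* prints -1234.656500 */
    return 0;
}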

How does '%f' work in C?

Hey, I need to know how %f works, that is, how
printf("%f",number);
extracts a floating-point number from the series of bits in number.
Consider the code:
#include <stdio.h>

int main(void)
{
    int i = 1;
    printf("\nd %d\nf %f", i, i);
}
Output is :
d 1
f -0.000000
So ultimately it doesn't depend on the variable 'i', but just on the use of %d and %f (or whatever). I just need to know how %f extracts the float number corresponding to the series of bits in 'i'.
To all those who misunderstood my question: I know that %f can't be used with an integer and would load garbage values if the size of an integer were smaller than that of a float. In my case the sizes of int and float are both 4 bytes.
Let me be clear: if the value of i is 1, then the corresponding binary value of i will be this:
0000 0000 0000 0000 0000 0000 0000 0001 [32 bits]
How would %f extract -0.000000, as in this case, from this series of bits? (How does it know where to put the decimal point, etc.? I can't work that out from IEEE 754.)
[PLEASE DO CORRECT ME IF I AM WRONG IN MY EXPLANATION OR ASSUMPTION]
It's undefined behavior to use "%f" with an int, so the answer to your question is: you don't need to know, and you shouldn't do it.
The output depends on the format specifier (like "%f") rather than on the type of the argument because variadic functions (like printf() or scanf()) have no way of knowing the types of the arguments in the variable part.
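To see why the callee is entirely at the mercy of the format string, here is a toy variadic function (my own illustration, not anything from the standard library): it pulls out whatever types the format characters tell it to, exactly as printf does.

#include <stdarg.h>
#include <stdio.h>

/* 'd' means "fetch an int", 'f' means "fetch a double";
   the function has no other way of knowing what was passed */
static void print_args(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    for (; *fmt; fmt++) {
        if (*fmt == 'd')
            printf("%d\n", va_arg(ap, int));
        else if (*fmt == 'f')
            printf("%f\n", va_arg(ap, double));   /* float arguments are promoted to double */
    }
    va_end(ap);
}

int main(void) {
    print_args("df", 1, 2.5);    /* fine: the format matches the arguments */
    /* print_args("f", 1); */    /* mismatched, undefined behavior, just like "%f" with an int */
    return 0;
}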
As others have said, giving mismatched "%" specifier and arguments is undefined behavior, and, according to the C standard, anything can happen.
What does happen, in this case, on most modern computers, is this:
printf looks at the place in memory where the data should have been, interprets whatever data it finds there as a floating-point number, and prints that number.
Since printf is a function that can take a variable number of arguments, all floats are converted to doubles before being sent to the function, so printf expects to find a double, which (on normal modern computers) is 64 bits. But you send an int, which is only 32 bits, so printf will look at the 32 bits from the int, and 32 more bits of garbage that just happened to be there. When you tried this, it seems that the combination was a bit pattern corresponding to the double floating-point value -0.0.
Well.
It's easy to see how an integer can be packed into bytes, but how do you represent decimals?
The simplest technique is fixed point: of the n bits, the first m are before the point and the rest after. This is not a very good representation, however. Bits are wasted on some numbers, and it has uniform precision, while in real life, most desired decimals are between 0 and 1.
Enter floating point. The IEEE 754 spec defines a way of interpreting bits that has, since then, been almost universally accepted. It has very high near-zero precision, is compact, expandable and allows for very large numbers as well.
The linked articles are a good read.
You can output a floating-point number (float x;) manually by treating the value as a "black box" and extracting the digits one-by-one.
First, check if x < 0. If so, output a minus-sign - and negate the number. Now we know that it is positive.
Next, output the integer portion. Assign the floating-point number to an integer variable, which will truncate it, i.e. int integer = x;. Then determine how many digits there are using the base-10 logarithm log10(). Note, log10(0) is undefined, so you'll have to handle zero as a special case. Then iterate from the number of digits down to 0, each time dividing by 10^digit_index to move the desired digit into the units position, and taking the remainder mod 10.
for (i = digits; i >= 0; i--)
    dig = (integer / (int)pow(10, i)) % 10;
Then, output the decimal point '.'.
For the fractional part, subtract the integer from the original (absolute-value, remember) floating-point number. And output each digit in a similar way, but this time multiplying by 10^frac_digits. You won't be able to predict the number of significant fractional digits this way, so just use a fixed precision (constant number of fractional digits).
I have C code to fill a string with the representation of a floating-point number here, although I make no claims as to its readability.
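For the curious, here is a minimal sketch of the digit-by-digit approach described above (my own illustration, not the linked code): it uses a fixed number of fractional digits and does no rounding and no inf/NaN handling.

#include <stdio.h>
#include <math.h>

static void print_float(float x, int frac_digits) {
    if (x < 0) { putchar('-'); x = -x; }       /* handle the sign, then work with |x| */

    int integer = (int)x;                      /* truncates toward zero */
    int digits = (integer == 0) ? 0 : (int)log10((double)integer);

    for (int i = digits; i >= 0; i--)          /* most significant digit first */
        putchar('0' + (integer / (int)pow(10.0, i)) % 10);

    putchar('.');

    float frac = x - (float)integer;           /* fractional part only */
    for (int i = 0; i < frac_digits; i++) {
        frac *= 10.0f;                         /* shift the next digit into the units place */
        int d = (int)frac;
        putchar('0' + d);
        frac -= (float)d;
    }
    putchar('\n');
}

int main(void) {
    print_float(-123.456f, 6);   /* prints something like -123.455997 */
    return 0;
}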
IEEE formats store the number as a normalized binary fraction. It's more similar to scientific notation, like 3.57×10^2 instead of 357.0. So it is stored as an exponent-mantissa pair. Being "normalized" means there's actually an implicit additional 1 bit at the front of the mantissa that is not stored. Hopefully that's enough to help you understand a more detailed description of the format from elsewhere.
Remember, we're in binary, so there's no "decimal point". And with the exponent-mantissa notation, there isn't even a binary point in the format. It's implicitly represented in the exponent.
On the tangentially-related issue of passing floats to printf, remember that this is a variadic function. So it does not declare types of arguments that it receives, and all arguments passed undergo automatic conversions. So, float will automatically promote to double. So what you're doing is (substituting hex for brevity), passing 2 64-bit values:
double f, double f
0xabcdefgh 0xijklmnop 0xabcdefgh 0xijklmnop
Then you tell printf to interpret this sequence of words as an int followed by a double. So the 32-bit int seen by printf is only the first half of the floating-point number, and then the floating-point number seen by printf has its words reversed. The fourth word is never used.
To get the integer representation, you'll need to use type-punning with a pointer.
printf("%d %f\n", *(int *)&f, f);
Which reads (from right-to-left): take the address of the float, treat it as a pointer-to-int, follow the pointer.

Integer representation as float, clarification needed

Take
int x = 5;
float y = x;
//"I know it's a float .. you know it's a float .. but take it's address
// and pretend you're looking at an integer and then dereference it"
printf("%d\n", *(int*)&y); //1084227584
Why am i seeing this number?
5 in binary is 0101
5 can be thought of as (1.25 * 2^2), which means it can be represented as:
[sign bit] - 0
[8 bits worth of exp] - 129 (129-127=2) - 1000|0001
[23 bits of .xxxxxxx] - 25 - 1100|1
Put together, i have
[sign bit][8 bits worth of exp][23 bits worth of .xxx]
0 10000001 11001000000000 //2126336
What am i missing please?
Others have pointed out it's not portable... but you know this already, and you've specified 64-bit OS X. Basically, you have the mantissa wrong. 1.25 is represented with an implicit leading bit for 1.0. The first explicit bit of the mantissa represents 0.5 and the second bit 0.25. So the mantissa is actually: 01000000000000000000000.
The sign bit 0, and biased exponent 10000001, followed by the mantissa gives:
0x40a00000 which is 1084227584 decimal.
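If it helps, here is a small check of that decomposition (using memcpy rather than the pointer cast, just to sidestep aliasing questions; it assumes float and uint32_t are both 32 bits):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    float y = 5;
    uint32_t bits;
    memcpy(&bits, &y, sizeof bits);            /* grab the bit pattern of the float */

    printf("hex: 0x%08x\n", (unsigned)bits);                    /* 0x40a00000 */
    printf("dec: %u\n", (unsigned)bits);                        /* 1084227584 */
    printf("sign:     %u\n", (unsigned)(bits >> 31));           /* 0 */
    printf("exponent: %u\n", (unsigned)((bits >> 23) & 0xff));  /* 129, i.e. 2 after removing the bias */
    printf("mantissa: 0x%06x\n", (unsigned)(bits & 0x7fffff));  /* 0x200000, i.e. 0100...0 */
    return 0;
}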
Why am I seeing this number? Because you are printing a float as an int.
I know, I know, you clearly already know this, but the bottom line is that the behaviour is undefined. On your system, are ints and floats the same size? Have you looked up the standard your compiler uses to store floating-point values?

Why does a C floating-point type modify the actual input of 125.1 to 125.099998 on output?

I wrote the following program:
#include <stdio.h>

int main(void)
{
    float f;
    printf("\nInput a floating-point no.: ");
    scanf("%f", &f);
    printf("\nOutput: %f\n", f);
    return 0;
}
I am on Ubuntu and used GCC to compile the above program. Here is my sample run and output I want to inquire about:
Input a floating-point no.: 125.1
Output: 125.099998
Why does the precision change?
Because the number 125.1 is impossible to represent exactly with floating-point numbers. This happens in most programming languages. Use e.g. printf("%.1f", f); if you want to print the number with one decimal, but be warned: the number itself is not exactly equal to 125.1.
Thank you all for your answers. Although almost all of you helped me look in the right direction I could not understand the exact reason for this behavior. So I did a bit of research in addition to reading the pages you guys pointed me to. Here is my understanding for this behavior:
Single Precision Floating Point numbers typically use 4 bytes for storage on x86/x86-64 architectures. However not all 32 bits (4 bytes = 32 bits) are used to store the magnitude of the number.
For storing as a single precision floating type, the input stream is formatted in the following notation (somewhat similar to scientific notation):
(-1)^s x 1.m x 2^(e-127), where
s = sign of the number, range:{0,1} - takes up 1 bit
m = mantissa (fractional portion) of the number - takes up 23 bits
e = exponent of the number offset by 127, range:{0,..,255} - takes up 8 bits
and then stored in memory as
0th byte 1st byte 2nd byte 3rd byte
mmmmmmmm mmmmmmmm emmmmmmm seeeeeee
Therefore the decimal number 125.1 is first converted to binary form but limited to 24 bits so that the mantissa is represented by no more than 23 bits. After conversion to binary form:
125.1 = 1111101.00011001100110011
NOTE: 0.1 in decimal would need infinitely many bits in binary, but the computer limits the fractional representation here to 17 bits so that the complete representation does not exceed 24 bits.
Now converting it into the specified notation we get:
125.1 = 1.111101 00011001100110011 x 2^6
= (-1)^0 x 1.111101 00011001100110011 x 2^(133-127)
which implies
s = 0
m = 11110100011001100110011
e = 133 = 10000101
Therefore, 125.1 will be stored in memory as:
0th byte 1st byte 2nd byte 3rd byte
mmmmmmmm mmmmmmmm emmmmmmm seeeeeee
00110011 00110011 11111010 01000010
On being passed to the printf() function the output stream is generated by converting the binary form to the decimal form. The bytes are actually stored in reverse order (from the input stream) and hence read in this order:
3rd byte 2nd byte 1st byte 0th byte
seeeeeee emmmmmmm mmmmmmmm mmmmmmmm
01000010 11111010 00110011 00110011
Next, it is converted into the specific notation for conversion
(-1)^0 x 1.111101 00011001100110011 x 2^(133-127)
On simplifying the above representation further:
= 1.111101 00011001100110011 x 2^6
= 1111101.00011001100110011
And finally converting it to decimal:
= 125.0999984741210938
but printf's %f prints only six digits after the decimal point by default (and single precision only carries about 7 significant decimal digits anyway), so the output is rounded off to 125.099998.
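A quick way to see all of this in action (just a sketch; the digits past the sixth place simply expose more of the stored value shown above):

#include <stdio.h>

int main(void) {
    float f = 125.1f;
    printf("%f\n", f);      /* 125.099998 : the default six digits after the point */
    printf("%.1f\n", f);    /* 125.1      : rounded to one digit */
    printf("%.13f\n", f);   /* 125.0999984741211 : closer to the value actually stored */
    return 0;
}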
Think about a fixed point representation first.
2^3=8 2^2=4 2^1=2 2^0=1 2^-1=1/2 2^-2=1/4 2^-3=1/8 2^-4=1/16
If we want to represent a fraction then we set the bits to the right of the point, so 5.5 is represented as 01011000.
But if we want to represent 5.6, there is no exact fractional representation; we have to settle for a nearby value such as 01011001 == 5.5625
1/2 + 1/16 = 0.5625
2^-1 + 2^-4
Because it's the closest representation of 125.1; remember that single-precision floating point is just 32 bits.
If I tell you to write down 1/3 as a decimal number, you realize there are numbers which have no finite representation. 0.1 is the exact representation of 1/10, so in decimal this problem does not appear, BUT that is only true in decimal representation. In binary representation, 0.1 is one of those numbers that require infinitely many digits. Since your number must be cut off somewhere, something is lost.
Floating-point numbers cannot represent every value exactly; they all have limited accuracy. When converting from a number in text to a float (with scanf or otherwise), you're in another world with different kinds of numbers, and precision may be lost. The same goes when converting from a float to a string: you decide how many digits you want. You can't know "how many digits there are" in a float before converting it to text or another format that can keep that information. This all has to do with how floats are stored:
significant_digits * base^exponent
The normal type used for floating point in C is double, not float. Your float is implicitly converted to a double, and because float is less precise, the difference from the closest representable number to 125.1 is more apparent (and printf's default precision is tailored for use with doubles). Try this instead:
#include <stdio.h>

int main(void)
{
    double f;
    printf("\nInput a floating-point no.: ");
    scanf("%lf", &f);
    printf("\nOutput: %f\n", f);
    return 0;
}

Does floor() return something that's exactly representable?

In C89, floor() returns a double. Is the following guaranteed to work?
double d = floor(3.0 + 0.5);
int x = (int) d;
assert(x == 3);
My concern is that the result of floor might not be exactly representable in IEEE 754. So d gets something like 2.99999, and x ends up being 2.
For the answer to this question to be yes, all integers within the range of an int have to be exactly representable as doubles, and floor must always return that exactly represented value.
All integers can have exact floating point representation if your floating point type supports the required mantissa bits. Since double uses 53 bits for mantissa, it can store all 32-bit ints exactly. After all, you could just set the value as mantissa with zero exponent.
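If you want to convince yourself, here is a quick sketch that round-trips the extremes of the int range through a double (this assumes a 64-bit IEEE-754 double and a 32-bit int):

#include <stdio.h>
#include <limits.h>

int main(void) {
    int values[] = { INT_MIN, -1, 0, 1, INT_MAX };

    for (int i = 0; i < 5; i++) {
        double d = values[i];            /* exact: every 32-bit int fits in the 53-bit significand */
        int back = (int)d;               /* exact again, so the round trip is lossless */
        printf("%d -> %.1f -> %d (%s)\n", values[i], d, back,
               values[i] == back ? "exact" : "NOT exact");
    }
    return 0;
}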
If the result of floor() isn't exactly representable, what do you expect the value of d to be? Surely if you've got the representation of a floating point number in a variable, then by definition it's exactly representable isn't it? You've got the representation in d...
(In addition, Mehrdad's answer is correct for 32 bit ints. In a compiler with a 64 bit double and a 64 bit int, you've got more problems of course...)
EDIT: Perhaps you meant "the theoretical result of floor(), i.e. the largest integer value less than or equal to the argument, may not be representable as an int". That's certainly true. Simple way of showing this for a system where int is 32 bits:
int max = 0x7fffffff;
double number = max;
number += 10.0;
double f = floor(number);
int oops = (int) f;
I can't remember offhand what C says happens when a conversion from floating point to integer overflows (in fact the behavior is undefined), but it's going to happen here.
EDIT: There are other interesting situations to consider too. Here's some C# code and results - I'd imagine at least similar things would happen in C. In C#, double is defined to be 64 bits and so is long.
using System;

class Test
{
    static void Main()
    {
        FloorSameInteger(long.MaxValue/2);
        FloorSameInteger(long.MaxValue-2);
    }

    static void FloorSameInteger(long original)
    {
        double convertedToDouble = original;
        double flooredToDouble = Math.Floor(convertedToDouble);
        long flooredToLong = (long) flooredToDouble;

        Console.WriteLine("Original value: {0}", original);
        Console.WriteLine("Converted to double: {0}", convertedToDouble);
        Console.WriteLine("Floored (as double): {0}", flooredToDouble);
        Console.WriteLine("Converted back to long: {0}", flooredToLong);
        Console.WriteLine();
    }
}
Results:
Original value: 4611686018427387903
Converted to double: 4.61168601842739E+18
Floored (as double): 4.61168601842739E+18
Converted back to long: 4611686018427387904

Original value: 9223372036854775805
Converted to double: 9.22337203685478E+18
Floored (as double): 9.22337203685478E+18
Converted back to long: -9223372036854775808
In other words:
(long) floor((double) original)
isn't always the same as original. This shouldn't come as any surprise - there are more long values than doubles (given the NaN values) and plenty of doubles aren't integers, so we can't expect every long to be exactly representable. However, all 32 bit integers are representable as doubles.
I think you're a bit confused about what you want to ask. floor(3 + 0.5) is not a very good example, because 3, 0.5, and their sum are all exactly representable in any real-world floating point format. floor(0.1 + 0.9) would be a better example, and the real question here is not whether the result of floor is exactly representable, but whether inexactness of the numbers prior to calling floor will result in a return value different from what you would expect, had all numbers been exact. In this case, I believe the answer is yes, but it depends a lot on your particular numbers.
I invite others to criticize this approach if it's bad, but one possible workaround might be to multiply your number by (1.0+0x1p-52) or something similar prior to calling floor (perhaps using nextafter would be better). This could compensate for cases where an error in the last binary place of the number causes it to fall just below rather than exactly on an integer value, but it will not account for errors which have accumulated over a number of operations. If you need that level of numeric stability/exactness, you need to either do some deep analysis or use an arbitrary-precision or exact-math library which can handle your numbers correctly.
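Here is roughly what that nextafter idea might look like in C (a sketch only; as noted above it only papers over a one-ulp error and is not a general fix):

#include <stdio.h>
#include <math.h>

/* Nudge the value up by one representable step before flooring, so a result
   sitting one bit below an integer (because of rounding error) still floors
   to that integer. */
static double floor_nudged(double x) {
    return floor(nextafter(x, INFINITY));
}

int main(void) {
    double x = 4.35 * 100;    /* 434.99999999999994 with IEEE-754 doubles */
    printf("x               = %.17g\n", x);
    printf("floor(x)        = %g\n", floor(x));          /* 434 */
    printf("floor_nudged(x) = %g\n", floor_nudged(x));   /* 435 */
    return 0;
}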

Resources