This may seem an unusual question, but I would like to find a function that maps a double (the C type) to a long (the C type). It is not necessary to preserve the double's information. The most important thing is:
double a,b;
long c,d;
c = f(a);
d = f(b);
This must be true:
if (a < b) then c < d, for all doubles a, b and the corresponding longs c, d
Thank you to all of you.
Your requirement is feasible if the following two conditions hold:
The compiler defines sizeof(double) the same as sizeof(long)
The hardware uses IEEE 754 double-precision binary floating-point format
While the 2nd condition holds on every widely-used platform, the 1st condition does not.
If both conditions do hold on your platform, then you can implement the function as follows:
long f(double x)
{
if (x > 0)
return double_to_long(x);
if (x < 0)
return -double_to_long(-x);
return 0;
}
You have several different ways to implement the conversion function:
long double_to_long(double x)
{
long y;
memcpy(&y,&x,sizeof(x));
return y;
}
long double_to_long(double x)
{
long y;
y = *(long*)&x;
return y;
}
long double_to_long(double x)
{
union
{
double x;
long y;
}
u;
u.x = x;
return u.y;
}
Please note that the second option is not recommended, because it breaks the strict-aliasing rule.
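A minimal sketch of how the pieces above fit together, assuming IEEE 754 binary64. It uses int64_t rather than long so that it also runs on platforms where long is 32 bits. The names double_bits and ordered_key are illustrative stand-ins for double_to_long and f, and NaN inputs are not given any special treatment.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the bits of a double into a 64-bit integer (the memcpy variant). */
int64_t double_bits(double x)
{
    int64_t y;
    assert(sizeof y == sizeof x);   /* assumption: double is 64 bits wide */
    memcpy(&y, &x, sizeof x);
    return y;
}

/* Hypothetical stand-in for f: a < b implies ordered_key(a) < ordered_key(b). */
int64_t ordered_key(double x)
{
    if (x > 0)
        return double_bits(x);      /* positive patterns already sort ascending */
    if (x < 0)
        return -double_bits(-x);    /* mirror the magnitude below zero */
    return 0;                       /* +0.0 and -0.0 share one key; NaN also lands here */
}
```

With this sketch, ordered_key(-2.0) < ordered_key(-1.0) < ordered_key(0.0) < ordered_key(1.0), and both zeros map to the same key.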
There are four basic transformations from floating-point to integer types:
floor - Rounds towards negative infinity, i.e. next lowest integer.
ceil[ing] - Rounds towards positive infinity, i.e. next highest integer.
trunc[ate] - Rounds towards zero, i.e. strips the floating-point portion and leaves the integer.
round - Rounds towards the nearest integer.
None of these transformations will give the behaviour you specify, but floor will permit the slightly weaker condition (a < b) implies (c <= d).
If a double value uses more space to represent than a long, then there is no mapping that can meet your initial constraint, thanks to the pigeonhole principle. Basically, since the double type can represent many more distinct values than a long type, there is no way to preserve the strict partial order of the < relationship, as multiple double values would be forced to map to the same long value.
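To illustrate the weaker condition, here is a small sketch. floor_to_long is a hypothetical helper built from a truncating cast (so it needs no library call), and it assumes x is finite and within the range of long.

```c
#include <assert.h>

/* Hypothetical helper: floor of x as a long, built from a truncating cast.
   Assumes x is finite and within the range of long. */
long floor_to_long(double x)
{
    assert(x == x);                        /* reject NaN */
    long t = (long)x;                      /* the cast truncates toward zero */
    return (x < (double)t) ? t - 1 : t;    /* negative non-integers need one more step down */
}
```

floor_to_long(1.2) and floor_to_long(1.7) both yield 1, so a < b only guarantees c <= d, never c < d.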
See also:
Difference between Math.Floor() and Math.Truncate() (Stack Overflow)
Pigeonhole principle (Wikipedia)
Use frexp() to get you mostly there. It splits the number into exponent and significand (fraction).
Assume long is at least the same size as double; otherwise this is pointless (pigeonhole principle).
#include <assert.h>
#include <math.h>
long f(double x) {
assert(sizeof(long) >= sizeof(double));
#define EXPOWIDTH 11
#define FRACWIDTH 52
int ipart;
double fraction = frexp(fabs(x), &ipart);
long lg = ipart;
lg += (1L << EXPOWIDTH)/2;
if (lg < 0) lg = 0;
if (lg >= (1L << EXPOWIDTH)) lg = (1L << EXPOWIDTH) - 1;
lg <<= FRACWIDTH;
lg += (long) (fraction * (1L << FRACWIDTH));
if (x < 0) {
lg = -lg;
}
return lg;
}
Notes:
The proper value for EXPOWIDTH depends on DBL_MAX_EXP and DBL_MIN_EXP and particulars of the double type.
This solution maps some distinct double values near the extremes of double's range to the same long. I will look and test more later.
Otherwise as commented above: overlay the two types.
As long is often 2's complement and double is laid out in a sign-magnitude fashion, extra work is needed when the double is negative. Also watch out for -0.0.
long f(double x) {
assert(sizeof x == sizeof (long));
union {
double d;
long lg;
} u = { x + 0.0 }; // + 0.0 turns -0.0 into +0.0 (default rounding mode); *1.0 would keep the sign
// If 2's complement - which is the common situation
if (u.lg < 0) {
u.lg = LONG_MIN - u.lg; // negated magnitude bits; LONG_MAX - u.lg would overflow here
}
return u.lg;
}
I'm looking for an alternative to the ceil() and floor() functions in C, because I am not allowed to use them in a project.
What I have built so far is a back-and-forth use of the cast operator: I convert the floating-point value (a double in my case) to an int and then, because I need the closest integers above and below the given value to be double values as well, convert back to double:
#include <stdio.h>
int main(void) {
double original = 124.576;
double floorint;
double ceilint;
int f;
int c;
f = (int)original; // Truncation toward zero (equals floor only for non-negative values)
c = f + 1;
floorint = (double)f;
ceilint = (double)c;
printf("Original Value: %lf, Floor Int: %lf , Ceil Int: %lf", original, floorint, ceilint);
}
Output:
Original Value: 124.576000, Floor Int: 124.000000 , Ceil Int: 125.000000
In this example I would not normally need to convert the integer values c and f back to double, but I need them as double in my real program. Consider that a requirement of the task.
Although the output shows the desired values and seems right so far, I am still concerned about whether this method is really correct and appropriate. More precisely: does it introduce any bad behavior into the program, or cost performance compared to other alternatives, if there are any?
Do you know a better alternative? And if so, why this one should be better?
Thank you very much.
Do you know a better alternative? And if so, why this one should be better?
OP's code fails when:
original is already a whole number.
original is negative, like -1.5. Truncation is not floor there.
original is just outside the int range.
original is not-a-number.
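The first two failure cases can be reproduced with a short sketch. naive_floor and naive_ceil are hypothetical names that simply wrap the question's cast-based approach. The out-of-range and NaN cases are not exercised because converting such values to int is undefined behavior.

```c
#include <assert.h>

/* The question's approach, reproduced as two hypothetical helpers. */
double naive_floor(double x) { return (double)(int)x; }        /* truncation */
double naive_ceil(double x)  { return (double)((int)x + 1); }  /* truncation + 1 */

/* Self-check documenting the two wrong results. */
void naive_rounding_pitfalls(void)
{
    assert(naive_ceil(124.0) == 125.0);   /* whole number: ceil should be 124.0 */
    assert(naive_floor(-1.5) == -1.0);    /* negative: floor should be -2.0 */
}
```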
Alternative construction
double my_ceil(double x)
Using the cast-to-some-integer-type trick is a problem when x is outside the integer range. So check first whether x is inside the range of a wide enough integer (one whose precision exceeds that of double). x values outside that range are already whole numbers. I recommend going for the widest integer type, (u)intmax_t.
Remember that a cast to an integer is a round toward 0 and not a floor. Different handling needed if x is negative/positive when code is ceil() or floor(). OP's code missed this.
I'd avoid if (x >= INTMAX_MAX) { as that involves (double) INTMAX_MAX whose rounding and then precise value is "chosen in an implementation-defined manner". Instead, I'd compare against INTMAX_MAX_P1. some_integer_MAX is a Mersenne Number and with 2's complement, ...MIN is a negated "power of 2".
#include <inttypes.h>
#define INTMAX_MAX_P1 ((INTMAX_MAX/2 + 1)*2.0)
double my_ceil(double x) {
if (x >= INTMAX_MAX_P1) {
return x;
}
if (x < INTMAX_MIN) {
return x;
}
intmax_t i = (intmax_t) x; // this rounds towards 0
if (x < 0 || x == i) return i; // negative x is already rounded up (testing i < 0 would miss -1 < x < 0).
return i + 1.0;
}
As x may be a not-a-number, it is more useful to reverse the comparison, since any relational comparison involving a NaN is false.
double my_ceil(double x) {
if (x >= INTMAX_MIN && x < INTMAX_MAX_P1) {
intmax_t i = (intmax_t) x; // this rounds towards 0
if (x < 0 || x == i) return i; // negative x is already rounded up (testing i < 0 would miss -1 < x < 0).
return i + 1.0;
}
return x;
}
double my_floor(double x) {
if (x >= INTMAX_MIN && x < INTMAX_MAX_P1) {
intmax_t i = (intmax_t) x; // this rounds towards 0
if (x > 0 || x == i) return i; // positive x is already rounded down (testing i > 0 would miss 0 < x < 1).
return i - 1.0;
}
return x;
}
You're missing an important step: you need to check if the number is already integral, so for ceil assuming non-negative numbers (generalisation is trivial), use something like
double ceil(double f){
if (f >= LLONG_MAX){
// f will be integral unless you have a really funky platform
return f;
} else {
long long i = f;
return 0.0 + i + (f != i); // to obviate potential long long overflow
}
}
Another missing piece of the puzzle, which is covered by my enclosing if, is to check whether f is within the bounds of a long long. On common platforms, if f is outside the bounds of a long long then it is integral anyway.
Note that floor is trivial for non-negative numbers, because truncation to long long is always towards zero.
I'd like to check if a long long variable can be safely cast into a double. DBL_MAX doesn't help, because there are integers smaller than that which are not representable by double, while some of integers larger than 2^53 can still fit.
Is there a reliable way to do this?
Can a compiler optimise out a statement like the one below?
(long long)((double)a) == a (where a is a long long)
This does not ask for the largest integer that can be represented as a double. I am asking for a general function that can check whether any given long long value can be converted to double exactly.
OP's method is a good start.
(long long)((double)a) == a
Yet it has a problem. E.g. with long long a = LLONG_MAX; ((double)a) results in a rounded value exceeding LLONG_MAX.
The following will certainly not overflow double.
(Pathological exception: LLONG_MIN exceeds -DBL_MAX).
volatile double b = (double) a;
Converting back to long long and testing against a is sufficient to meet OP's goal. We only need to ensure b is in long long range (@gnasher729). Let us assume 2's complement and that double uses FLT_RADIX != 10. In that case, the lowest long long is a power-of-2 and the highest is a power-of-2 minus 1, and conversion to double can be made exact with careful calculation of the long long limits, as follows.
bool check_ll(long long a) {
const double d_longLong_min = LLONG_MIN;
const double d_longLong_max_plus_1 = (LLONG_MAX/2 + 1)*2.0;
volatile double b = (double) a;
if (b < d_longLong_min || b >= d_longLong_max_plus_1) {
return false;
}
return (long long) b == a;
}
[edit simplify - more general]
A test of b near LLONG_MIN is only needed when long long does not use 2's complement
bool check_ll2(long long a) {
volatile double b = (double) a;
const double d_longLong_max_plus_1 = (LLONG_MAX/2 + 1)*2.0;
#if LLONG_MIN == -LLONG_MAX
const double d_longLong_min_minus_1 = (LLONG_MIN/2 - 1)*2.0;
if (b <= d_longLong_min_minus_1 || b >= d_longLong_max_plus_1) {
return false;
}
#else
if (b >= d_longLong_max_plus_1) {
return false;
}
#endif
return (long long) b == a;
}
I would not expect a compiler to be able to optimize out (long long)((double)a) == a. In any case, by using an intermediate volatile double, the code prevents that.
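A usage sketch of the same idea, assuming a 64-bit 2's complement long long and an IEEE 754 binary64 double. converts_exactly is an illustrative name, not part of the answer above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Same idea as check_ll2 above, for a 2's complement long long. */
bool converts_exactly(long long a)
{
    const double max_plus_1 = (LLONG_MAX / 2 + 1) * 2.0;  /* 2^63 for a 64-bit long long, exact */
    volatile double b = (double)a;
    if (b >= max_plus_1)
        return false;            /* rounded up past LLONG_MAX; casting back would be undefined */
    return (long long)b == a;
}

void converts_exactly_examples(void)
{
    assert(converts_exactly(1LL << 53));          /* 2^53 is representable */
    assert(!converts_exactly((1LL << 53) + 1));   /* 2^53 + 1 is not */
    assert(!converts_exactly(LLONG_MAX));         /* rounds up to 2^63 */
    assert(converts_exactly(LLONG_MIN));          /* -2^63 is a power of two */
}
```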
I'm not sure you can check this conversion before you cast, but fenv.h seems like it can help you for after-cast checking. FE_INEXACT can allow you to check if the operation you just performed could not be exactly stored.
http://www.cplusplus.com/reference/cfenv/FE_INEXACT/
Can two floating point values (IEEE 754 binary64) be compared as integers? Eg.
long long a = * (long long *) ptr_to_double1,
b = * (long long *) ptr_to_double2;
if (a < b) {...}
assuming the size of long long and double is the same.
YES - Comparing the bit-patterns for two floats as if they were integers (aka "type-punning") produces meaningful results under some restricted scenarios...
Identical to floating-point comparison when:
Both numbers are positive, positive-zero, or positive-infinity.
One positive and one negative number, and you are using a signed integer comparison.
Inverse of floating-point comparison when:
Both numbers are negative, negative-zero, or negative-infinity.
One positive and one negative number, and you are using an unsigned integer comparison.
Not comparable to floating-point comparison when:
Either number is one of the NaN values - Floating point comparisons with a NaN always return false, and this simply can't be modeled in integer operations where exactly one of the following is always true: (A < B), (A == B), (B < A).
Negative floating-point numbers are a bit funky b/c they are handled very differently than in the 2's complement arithmetic used for integers. Doing an integer +1 on the representation for a negative float will make it a bigger negative number.
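The cases listed above can be checked with a short C sketch, assuming IEEE 754 binary64 doubles and 2's complement integers. bits_of is a hypothetical helper that uses memcpy instead of a pointer cast.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as a signed 64-bit integer without aliasing problems. */
int64_t bits_of(double x)
{
    int64_t i;
    memcpy(&i, &x, sizeof i);   /* assumption: double is 64 bits wide */
    return i;
}

void raw_bit_compare_cases(void)
{
    assert(bits_of(1.0) < bits_of(2.0));     /* both positive: same order */
    assert(bits_of(-1.0) < bits_of(1.0));    /* mixed signs, signed compare: same order */
    assert(bits_of(-2.0) > bits_of(-1.0));   /* both negative: order is inverted */
    assert(bits_of(-0.0) != bits_of(0.0));   /* equal as doubles, different bit patterns */
}
```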
With a little bit manipulation, you can make both positive and negative floats comparable with integer operations (this can come in handy for some optimizations):
int32 float_to_comparable_integer(float f) {
uint32 bits = std::bit_cast<uint32>(f);
const uint32 sign_bit = bits & 0x80000000ul;
// Modern compilers turn this IF-statement into a conditional move (CMOV) on x86,
// which is much faster than a branch that the cpu might mis-predict.
if (sign_bit) {
bits = 0x80000000ul - bits; // negate the magnitude so that larger negative floats get smaller integers
}
return static_cast<int32>(bits);
}
Again, this does not work for NaN values, which always return false from comparisons, and have multiple valid bit representations:
Signaling NaNs (w/ sign bit): Anything between 0xFF800001, and 0xFFBFFFFF.
Signaling NaNs (w/o sign bit): Anything between 0x7F800001, and 0x7FBFFFFF.
Quiet NaNs (w/ sign bit): Anything between 0xFFC00000, and 0xFFFFFFFF.
Quiet NaNs (w/o sign bit): Anything between 0x7FC00000, and 0x7FFFFFFF.
IEEE-754 bit format: http://www.puntoflotante.net/FLOATING-POINT-FORMAT-IEEE-754.htm
More on Type-Punning: https://randomascii.wordpress.com/2012/01/23/stupid-float-tricks-2/
No. Two floating point values (IEEE 754 binary64) cannot compare simply as integers with if (a < b).
IEEE 754 binary64
The order of the values of double is not the same order as integers (unless you are on a rare sign-magnitude machine). Think positive vs. negative numbers.
double has values like 0.0 and -0.0 which have the same value but different bit patterns.
double has "Not-a-number"s that do not compare like their binary equivalent integer representation.
If both the double values were x > 0 and not "Not-a-number", endian, aliasing, and alignment, etc. were not an issue, OP's idea would work.
Alternatively, a more complex if() ... condition would work - see below
[non-IEEE 754 binary64]
Some double use an encoding where there are multiple representations of the same value. This would differ from an "integer" compare.
Tested code: needs 2's complement, same endian for double and the integers, does not account for NaN.
int compare(double a, double b) {
union {
double d;
int64_t i64;
uint64_t u64;
} ua, ub;
ua.d = a;
ub.d = b;
// Cope with -0.0 right away
if (ua.u64 == 0x8000000000000000) ua.u64 = 0;
if (ub.u64 == 0x8000000000000000) ub.u64 = 0;
// Signs differ?
if ((ua.i64 < 0) != (ub.i64 < 0)) {
return ua.i64 >= 0 ? 1 : -1;
}
// If numbers are negative
if (ua.i64 < 0) {
ua.u64 = -ua.u64;
ub.u64 = -ub.u64;
}
return (ua.u64 > ub.u64) - (ua.u64 < ub.u64);
}
Thanks to @David C. Rankin for a correction.
Test code
void testcmp(double a, double b) {
int t1 = (a > b) - (a < b);
int t2 = compare(a, b);
if (t1 != t2) {
printf("%le %le %d %d\n", a, b, t1, t2);
}
}
#include <float.h>
void testcmps() {
// Various interesting `double`
static const double a[] = {
-1.0 / 0.0, -DBL_MAX, -1.0, -DBL_MIN, -0.0,
+0.0, DBL_MIN, 1.0, DBL_MAX, +1.0 / 0.0 };
int n = sizeof a / sizeof a[0];
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
testcmp(a[i], a[j]);
}
}
puts("!");
}
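If you strictly cast the bit value of a floating point number to its correspondingly-sized signed integer (as you've done), then signed integer comparison of the results will be identical to the comparison of the original floating-point values when at least one operand is non-negative, excluding NaN values and the +0.0/-0.0 pair. When both operands are negative the integer order is reversed, because the floating-point format is sign-magnitude. Put another way, with that sign caveat the comparison is meaningful for all representable finite and infinite numeric values.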
In other words, for double-precision (64-bits), this comparison will be valid if the following tests pass:
long long exponentMask = 0x7ff0000000000000;
long long mantissaMask = 0x000fffffffffffff;
bool isNumber = ((x & exponentMask) != exponentMask) // Not exp 0x7ff
|| ((x & mantissaMask) == 0); // Infinities
for each operand x.
Of course, if you can pre-qualify your floating-point values, then a quick isNaN() test would be much more clear. You'd have to profile to understand performance implications.
There are two parts to your question:
Can two floating point numbers be compared? The answer to this is yes. It is perfectly valid to compare the magnitudes of floating point numbers. Generally you want to avoid equality comparisons because of rounding issues, but
if (a < b)
will work just fine.
Can two floating point numbers be compared as integers? This answer is also yes, but this will require casting. This question should help with that answer: convert from long long to int and the other way back in c++
What's going on here:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %lf\n", pow(17, 12));
printf("17^13 = %lf\n", pow(17, 13));
printf("17^14 = %lf\n", pow(17, 14));
}
I get this output:
17^12 = 582622237229761.000000
17^13 = 9904578032905936.000000
17^14 = 168377826559400928.000000
13 and 14 do not match Wolfram Alpha; compare (pow() output first, exact value second):
12: 582622237229761.000000
582622237229761
13: 9904578032905936.000000
9904578032905937
14: 168377826559400928.000000
168377826559400929
Moreover, it's not wrong by some strange fraction - it's wrong by exactly one!
If this is down to me reaching the limits of what pow() can do for me, is there an alternative that can calculate this? I need a function that can calculate x^y, where x^y is always less than ULLONG_MAX.
pow works with double numbers. These represent numbers of the form s * 2^e where s is a 53 bit integer. Therefore double can store all integers below 2^53, but only some integers above 2^53. In particular, it can only represent even numbers > 2^53, since for e > 0 the value is always a multiple of 2.
17^13 needs 54 bits to represent exactly, so e is set to 1 and hence the calculated value becomes even number. The correct value is odd, so it's not surprising it's off by one. Likewise, 17^14 takes 58 bits to represent. That it too is off by one is a lucky coincidence (as long as you don't apply too much number theory), it just happens to be one off from a multiple of 32, which is the granularity at which double numbers of that magnitude are rounded.
For exact integer exponentiation, you should use integers all the way. Write your own double-free exponentiation routine. Use exponentiation by squaring if y can be large, but I assume it's always less than 64, making this issue moot.
The numbers you get are too big to be represented with a double accurately. A double-precision floating-point number has essentially 53 significant binary digits and can represent all integers up to 2^53 or 9,007,199,254,740,992.
For higher numbers, the last digits get truncated and the result of your calculation is rounded to the next number that can be represented as a double. For 17^13, which is only slightly above the limit, this is the closest even number. For numbers greater than 2^54 this is the closest number that is divisible by four, and so on.
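The spacing described above can be observed directly. This is a small sketch assuming IEEE 754 binary64 and the default round-to-nearest mode. through_double is a hypothetical helper that converts an integer to double and back.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert to double and back to see what survives. */
uint64_t through_double(uint64_t n)
{
    volatile double d = (double)n;   /* volatile forces a real double-precision store */
    return (uint64_t)d;
}

void spacing_examples(void)
{
    /* 17^12 is below 2^53, so it is represented exactly. */
    assert(through_double(582622237229761ULL) == 582622237229761ULL);
    /* 17^13 is just above 2^53, where representable values are 2 apart. */
    assert(through_double(9904578032905937ULL) == 9904578032905936ULL);
    /* 17^14 needs 58 bits, where representable values are 32 apart. */
    assert(through_double(168377826559400929ULL) == 168377826559400928ULL);
}
```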
If your input arguments are non-negative integers, then you can implement your own pow.
Recursively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
if (y == 0)
return 1;
if (y == 1)
return x;
return pow(x,y/2)*pow(x,y-y/2);
}
Iteratively:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y--)
res *= x;
return res;
}
Efficiently:
unsigned long long pow(unsigned long long x,unsigned int y)
{
unsigned long long res = 1;
while (y > 0)
{
if (y & 1)
res *= x;
y >>= 1;
x *= x;
}
return res;
}
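A usage sketch for the values from the question. The function is the exponentiation-by-squaring version, renamed to the hypothetical ipow so that it cannot clash with pow() from <math.h>. Results that exceed ULLONG_MAX wrap around silently, so this assumes the caller keeps x^y in range.

```c
#include <assert.h>

/* Exponentiation by squaring on unsigned integers; overflow wraps silently. */
unsigned long long ipow(unsigned long long x, unsigned int y)
{
    unsigned long long res = 1;
    while (y > 0) {
        if (y & 1)
            res *= x;   /* fold in the current power when its bit is set */
        y >>= 1;
        x *= x;         /* x, x^2, x^4, ... */
    }
    return res;
}

void ipow_examples(void)
{
    assert(ipow(17, 12) == 582622237229761ULL);
    assert(ipow(17, 13) == 9904578032905937ULL);      /* odd, unlike the pow() result */
    assert(ipow(17, 14) == 168377826559400929ULL);
}
```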
A small addition to the other good answers: on x86 the x87 80-bit extended format is usually available, and most C compilers support it via the long double type. This format can represent every integer up to 2^64 without gaps.
<math.h> has an analogue of pow() intended for long double numbers: powl(). Note also that the format specifier for long double values differs from the one for double: it is %Lf. So the correct program using the long double type looks like this:
#include <stdio.h>
#include <math.h>
int main(void) {
printf("17^12 = %Lf\n", powl(17, 12));
printf("17^13 = %Lf\n", powl(17, 13));
printf("17^14 = %Lf\n", powl(17, 14));
}
As Stephen Canon noted in comments there is no guarantee that this program should give exact result.
I was looking at another question (here) where someone was looking for a way to get the square root of a 64 bit integer in x86 assembly.
This turns out to be very simple. The solution is to convert to a floating point number, calculate the sqrt and then convert back.
I need to do something very similar in C however when I look into equivalents I'm getting a little stuck. I can only find a sqrt function which takes in doubles. Doubles do not have the precision to store large 64bit integers without introducing significant rounding error.
Is there a common math library that I can use which has a long double sqrt function?
There is no need for long double; the square root can be calculated with double (if it is IEEE-754 64-bit binary). The rounding error in converting a 64-bit integer to double is nearly irrelevant in this problem.
The rounding error is at most one part in 2^53. This causes an error in the square root of at most one part in 2^54. The sqrt itself has a rounding error of less than one part in 2^53, due to rounding the mathematical result to the double format. The sum of these errors is tiny; the largest possible square root of a 64-bit integer (rounded to 53 bits) is 2^32, so an error of three parts in 2^54 is less than .00000072.
For a uint64_t x, consider sqrt(x). We know this value is within .00000072 of the exact square root of x, but we do not know its direction. If we adjust it to sqrt(x) - 0x1p-20, then we know we have a value that is less than, but very close to, the square root of x.
Then this code calculates the square root of x, truncated to an integer, provided the operations conform to IEEE 754:
uint64_t y = sqrt(x) - 0x1p-20;
if (2*y < x - y*y)
++y;
(2*y < x - y*y is equivalent to (y+1)*(y+1) <= x except that it avoids wrapping the 64-bit integer if y+1 is 2^32.)
Function sqrtl(), taking a long double, is part of C99.
Note that your compilation platform does not have to implement long double as 80-bit extended-precision. It is only required to be as wide as double, and Visual Studio implements it as a plain double. GCC and Clang do compile long double to 80-bit extended-precision on Intel processors.
Yes, the standard library has sqrtl() (since C99).
If you only want to calculate sqrt for integers, using divide and conquer should find the result in max 32 iterations:
uint64_t mysqrt (uint64_t a)
{
uint64_t min=0;
//uint64_t max=1<<32;
uint64_t max=((uint64_t) 1) << 32; //chux' bugfix
while(1)
{
if (max <= 1 + min)
return min;
uint64_t sqt = min + (max - min)/2;
uint64_t sq = sqt*sqt;
if (sq == a)
return sqt;
if (sq > a)
max = sqt;
else
min = sqt;
}
}
Debugging is left as an exercise for the reader.
Here we collect several observations in order to arrive at a solution:
1. In standard C >= 1999, it is guaranteed that non-negative integers have the bit representation one would expect for any base-2 number.
----> Hence, we can trust bit manipulation of this type of number.
2. If x has an unsigned integer type, then x >> 1 == x / 2 and x << 1 == x * 2.
(!) But: it is very probable that the bit operations are faster than their arithmetical counterparts.
3. sqrt(x) is mathematically equivalent to exp(log(x)/2.0).
4. If we consider truncated logarithms and base-2 exponentials for integers, we can obtain a fair estimate: IntExp2( IntLog2(x) / 2) "==" IntSqrtDn(x), where "==" is informal notation meaning almost equal to (in the sense of a good approximation).
5. If we write IntExp2( IntLog2(x) / 2 + 1) "==" IntSqrtUp(x), we obtain an "above" approximation for the integer square root.
6. The approximations obtained in (4.) and (5.) are a little rough (they enclose the true value of sqrt(x) between two consecutive powers of 2), but they can be a very good starting point for any algorithm that searches for the square root of x.
7. The Newton algorithm for the square root can work well for integers, if we have a good first approximation to the real solution.
http://en.wikipedia.org/wiki/Integer_square_root
The final algorithm needs some mathematical verification to be sure that it always works properly, but I will not do that right now. I will show you the final program instead:
#include <stdio.h> /* For printf()... */
#include <stdint.h> /* For uintmax_t... */
#include <math.h> /* For sqrt() .... */
int IntLog2(uintmax_t n) {
if (n == 0) return -1; /* Error */
int L;
for (L = 0; n >>= 1; L++)
;
return L; /* It takes < 64 steps for long long */
}
uintmax_t IntExp2(int n) {
if (n < 0)
return 0; /* Error */
uintmax_t E;
for (E = 1; n-- > 0; E <<= 1)
;
return E; /* It takes < 64 steps for long long */
}
uintmax_t IntSqrtDn(uintmax_t n) { return IntExp2(IntLog2(n) / 2); }
uintmax_t IntSqrtUp(uintmax_t n) { return IntExp2(IntLog2(n) / 2 + 1); }
int main(void) {
uintmax_t N = 947612934; /* Try here your number! */
uintmax_t sqrtn = IntSqrtDn(N), /* 1st approx. to sqrt(N) by below */
sqrtn0 = IntSqrtUp(N); /* 1st approx. to sqrt(N) by above */
/* The following means while( abs(sqrt-sqrt0) > 1) { stuff... } */
/* However, we take care of subtractions on unsigned arithmetic, just in case... */
while ( (sqrtn > sqrtn0 + 1) || (sqrtn0 > sqrtn+1) )
sqrtn0 = sqrtn, sqrtn = (sqrtn0 + N/sqrtn0) / 2; /* Newton iteration */
printf("N==%ju, sqrt(N)==%g, IntSqrtDn(N)==%ju, IntSqrtUp(N)==%ju, sqrtn==%ju, sqrtn*sqrtn==%ju\n\n", /* %ju is the conversion for uintmax_t */
N, sqrt(N), IntSqrtDn(N), IntSqrtUp(N), sqrtn, sqrtn*sqrtn);
return 0;
}
The last value stored in sqrtn is the integer square root of N.
The last line of the program just shows all the values, for verification purposes.
So, you can try different values of N and see what happens.
If we add a counter inside the while-loop, we'll see that no more than a few iterations happen.
Remark: It is necessary to verify that the condition abs(sqrtn-sqrtn0)<=1 is always achieved when working in the integer-number setting. If not, we shall have to fix the algorithm.
Remark2: In the initialization statements, observe that sqrtn0 == sqrtn * 2 == sqrtn << 1. This saves some calculations.
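Regarding the termination concern in the first remark, a commonly used variant avoids it: start from an estimate that is at or above the true root and keep iterating only while the estimate decreases. The sketch below is that variant, not the program above. isqrt_newton is an illustrative name, and the starting value is the "above" approximation from observation 5.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root (rounded down) by Newton's method on integers. */
uint64_t isqrt_newton(uint64_t n)
{
    if (n == 0)
        return 0;
    /* Start from a power of two that is >= sqrt(n): 2^(floor(log2 n)/2 + 1). */
    int log2n = 0;
    for (uint64_t t = n; t >>= 1; )
        log2n++;
    uint64_t x0 = (uint64_t)1 << (log2n / 2 + 1);
    uint64_t x1 = (x0 + n / x0) / 2;
    while (x1 < x0) {               /* estimates decrease until they reach the floor */
        x0 = x1;
        x1 = (x0 + n / x0) / 2;
    }
    return x0;
}

void isqrt_newton_examples(void)
{
    assert(isqrt_newton(947612934) == 30783);   /* the N used in the program above */
    assert(isqrt_newton(UINT64_MAX) == 4294967295ULL);
}
```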
// sqrt_i64 returns the integer square root of v.
int64_t sqrt_i64(int64_t v) {
uint64_t q = 0, b = 1, r = v;
for( b <<= 62; b > 0 && b > r; b >>= 2);
while( b > 0 ) {
uint64_t t = q + b;
q >>= 1;
if( r >= t ) {
r -= t;
q += b;
}
b >>= 2;
}
return q;
}
The for loop may be optimized by using the clz machine code instruction.
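One way this might look, as a sketch that assumes GCC or Clang, whose __builtin_clzll counts leading zero bits and typically compiles to a clz-style instruction. sqrt_i64_clz is an illustrative name. The assert on the sign is an addition that the original function does not have.

```c
#include <assert.h>
#include <stdint.h>

/* Same digit-by-digit algorithm; the start value of b comes from clz. */
int64_t sqrt_i64_clz(int64_t v)
{
    assert(v >= 0);                           /* added check: no real root otherwise */
    uint64_t q = 0, b = 0, r = (uint64_t)v;
    if (r != 0) {                             /* __builtin_clzll(0) is undefined */
        int top = 63 - __builtin_clzll(r);    /* index of the highest set bit */
        b = (uint64_t)1 << (top & ~1);        /* largest power of 4 that is <= r */
    }
    while (b > 0) {
        uint64_t t = q + b;
        q >>= 1;
        if (r >= t) {
            r -= t;
            q += b;
        }
        b >>= 2;
    }
    return (int64_t)q;
}
```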