Implement ceil() in C

I want to implement my own ceil() in C. I searched through the library sources and found an implementation here, but it seems pretty difficult to understand. I want clean and elegant code.
I also searched on SO and found some answers here, but none of them seems correct. One of the answers is:
#define CEILING_POS(X) ((X-(int)(X)) > 0 ? (int)(X+1) : (int)(X))
#define CEILING_NEG(X) ((X-(int)(X)) < 0 ? (int)(X-1) : (int)(X))
#define CEILING(X) ( ((X) > 0) ? CEILING_POS(X) : CEILING_NEG(X) )
AFAIK, the return type of ceil() is not int. Will a macro be type-safe here?
Furthermore, will the above implementation work for negative numbers?
What would be the best way to implement it?
Can you provide clean code?

The macro you quoted definitely won't work correctly for numbers that are greater than INT_MAX but which can still be represented exactly as a double.
The only way to implement ceil() correctly (assuming you can't implement it using an equivalent assembly instruction) is to do bit-twiddling on the binary representation of the floating point number, as is done in the s_ceil.c source file behind your first link. Understanding how the code works requires an understanding of the floating point representation of the underlying platform -- the representation is most probably going to be IEEE 754 -- but there's no way around this.
Edit:
Some of the complexities in s_ceil.c stem from the special cases it handles (NaNs, infinities) and the fact that it needs to do its work without being able to assume that a 64-bit integral type exists.
The basic idea of all the bit-twiddling is to mask off the fractional bits of the mantissa and add 1 to it if the number is greater than zero... but there's a bit of additional logic involved as well to make sure you do the right thing in all cases.
Here's an illustrative version of ceil() for floats that I cobbled together. Beware: it does not handle the special cases correctly and it is not tested extensively -- so don't actually use it. It does, however, serve to illustrate the principles involved in the bit-twiddling. I've tried to comment the routine extensively, but the comments do assume that you understand how floating point numbers are represented in IEEE 754 format.
union float_int
{
    float f;
    int   i;
};

float myceil(float x)
{
    union float_int val;
    val.f = x;

    // Extract sign, exponent and mantissa
    // (the bias is removed from the exponent)
    int sign     = (val.i >> 31) & 1;
    int exponent = ((val.i & 0x7fffffff) >> 23) - 127;
    int mantissa = val.i & 0x7fffff;

    // Is the exponent less than zero?
    if(exponent < 0)
    {
        // In this case, x is in the open interval (-1, 1)
        if(x <= 0.0f)
            return 0.0f;
        else
            return 1.0f;
    }
    else
    {
        // Construct a bit mask that will mask off the
        // fractional part of the mantissa
        int mask = 0x7fffff >> exponent;

        // Is x already an integer (i.e. are all the
        // fractional bits zero)?
        if((mantissa & mask) == 0)
            return x;
        else
        {
            // If x is positive, we need to add 1 to it
            // before clearing the fractional bits
            if(!sign)
            {
                mantissa += 1 << (23 - exponent);

                // Did the mantissa overflow?
                if(mantissa & 0x800000)
                {
                    // The mantissa can only overflow if all the
                    // integer bits were previously 1 -- so we can
                    // just clear out the mantissa and increment
                    // the exponent
                    mantissa = 0;
                    exponent++;
                }
            }

            // Clear the fractional bits
            mantissa &= ~mask;
        }
    }

    // Put sign, exponent and mantissa together again
    val.i = (sign << 31) | ((exponent + 127) << 23) | mantissa;
    return val.f;
}

Nothing you will write is more elegant than using the standard library implementation. No code at all is always more elegant than elegant code.
That aside, this approach has two major flaws:
If X is greater than INT_MAX + 1 or less than INT_MIN - 1, the behavior of your macro is undefined. This means that your implementation may give incorrect results for nearly half of all floating-point numbers. You will also raise the invalid flag, contrary to IEEE-754.
It gets the edge cases for -0, +/-infinity, and nan wrong. In fact, the only edge case it gets right is +0.
You can implement ceil in manner similar to what you tried, like so (this implementation assumes IEEE-754 double precision):
#include <math.h>

double ceil(double x) {
    // All floating-point numbers larger than 2^52 are exact integers, so we
    // simply return x for those inputs. We also handle ceil(nan) = nan here.
    if (isnan(x) || fabs(x) >= 0x1.0p52) return x;
    // Now we know that |x| < 2^52, and therefore we can use conversion to
    // long long to force truncation of x without risking undefined behavior.
    const double truncation = (long long)x;
    // If the truncation of x is smaller than x, then it is one less than the
    // desired result. If it is greater than or equal to x, it is the result.
    // Adding one cannot produce a rounding error because `truncation` is an
    // integer smaller than 2^52.
    const double ceiling = truncation + (truncation < x);
    // Finally, we need to patch up one more thing; the standard specifies that
    // ceil(-small) be -0.0, whereas we will have 0.0 right now. To handle this
    // correctly, we apply the sign of x to the result.
    return copysign(ceiling, x);
}
Something like that is about as elegant as you can get and still be correct.
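For instance, a quick sanity check of the edge cases discussed above (a hypothetical test driver; it assumes the function above is renamed my_ceil so it does not collide with the ceil() declared in <math.h>):

#include <stdio.h>

double my_ceil(double x); // the implementation above, renamed

int main(void) {
    const double tests[] = { 2.4, -2.4, -0.5, 0.0, -0.0, 1.0 / 0.0, 0.0 / 0.0 };
    for (int i = 0; i < 7; i++)
        printf("my_ceil(%+g) = %+g\n", tests[i], my_ceil(tests[i]));
    return 0;
}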
I flagged a number of concerns with the (generally good!) implementation that Martin put in his answer. Here's how I would implement his approach:
#include <stdint.h>
#include <string.h>

static inline uint64_t toRep(double x) {
    uint64_t r;
    memcpy(&r, &x, sizeof x);
    return r;
}

static inline double fromRep(uint64_t r) {
    double x;
    memcpy(&x, &r, sizeof x);
    return x;
}
double ceil(double x) {
    const uint64_t signbitMask     = UINT64_C(0x8000000000000000);
    const uint64_t significandMask = UINT64_C(0x000fffffffffffff);

    const uint64_t xrep = toRep(x);
    const uint64_t xabs = xrep & ~signbitMask;

    // If |x| is larger than 2^52 or x is NaN, the result is just x.
    if (xabs >= toRep(0x1.0p52)) return x;

    if (xabs < toRep(1.0)) {
        // If x is in (-1.0, 0.0], the result is copysign(0.0, x).
        // We can generate this value by clearing everything except the signbit.
        if (x <= 0.0) return fromRep(xrep & signbitMask);
        // Otherwise x is in (0.0, 1.0), and the result is 1.0.
        else return 1.0;
    }

    // Now we know that the (unbiased) exponent of x is strictly in the range
    // [0, 51], which means that x contains both integral and fractional bits.
    // We generate a mask covering the fractional bits.
    const int exponent = (int)(xabs >> 52) - 0x3ff;
    const uint64_t fractionalBits = significandMask >> exponent;

    // If x is negative, we want to truncate, so we simply mask off the
    // fractional bits.
    if (xrep & signbitMask) return fromRep(xrep & ~fractionalBits);

    // x is positive; to force rounding to go away from zero, we first *add*
    // the fractionalBits to x, then truncate the result. The add may
    // overflow the significand into the exponent, but this produces the
    // desired result (zero significand, incremented exponent), so we just
    // let it happen.
    return fromRep((xrep + fractionalBits) & ~fractionalBits);
}
One thing to note about this approach is that it does not raise the inexact floating-point flag for non-integral inputs. That may or may not be a concern for your usage. The first implementation that I listed does raise the flag.
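If you want to observe the difference, here is a minimal sketch using the C99 floating-point environment (note that strict conformance also calls for #pragma STDC FENV_ACCESS ON, which not all compilers honor):

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    feclearexcept(FE_INEXACT);
    double r = ceil(1.5);
    // Whether the flag is raised depends on which ceil() implementation
    // the call resolves to.
    printf("ceil(1.5) = %g, FE_INEXACT raised: %d\n",
           r, fetestexcept(FE_INEXACT) != 0);
    return 0;
}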

I don't think a function-like macro is a good solution: it isn't type-safe, and its arguments may be evaluated more than once (a problem if they have side effects). You should rather write a clean and elegant function.
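To see the multiple-evaluation problem concretely, here is a small sketch reusing the CEILING macro quoted in the question; the argument's side effect fires several times:

#include <stdio.h>

#define CEILING_POS(X) ((X-(int)(X)) > 0 ? (int)(X+1) : (int)(X))
#define CEILING_NEG(X) ((X-(int)(X)) < 0 ? (int)(X-1) : (int)(X))
#define CEILING(X) ( ((X) > 0) ? CEILING_POS(X) : CEILING_NEG(X) )

int main(void) {
    double x = 1.5;
    // x++ appears many times in the expansion; several of those evaluations
    // are unsequenced with respect to each other, so the behavior is undefined.
    int r = CEILING(x++);
    printf("r = %d, x = %g\n", r, x);
    return 0;
}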

As I would have expected more jokes in the answers, I will try a couple:
#define CEILING(X) ceil(X)
Bonus: a macro with not so many side effects.
If you don't care too much about negative zeros:
#define CEILING(X) (-floor(-(X)))
If you care about negative zero, then:
#define CEILING(X) (NEGATIVE_ZERO - floor(-(X)))
A portable definition of NEGATIVE_ZERO is left as an exercise...
Bonus: it will also set the FP flags (OVERFLOW, INVALID, INEXACT).

Related

Alternative to ceil() and floor() to get the closest integer values, above and below of a floating point value?

I'm looking for an alternative to the ceil() and floor() functions in C, because I am not allowed to use them in a project.
What I have built so far is a tricky back-and-forth that uses the cast operator: the floating-point value (in my case a double) is converted to an int and, since I need the closest integers above and below the given value as double values as well, converted back to double:
#include <stdio.h>

int main(void) {
    double original = 124.576;
    double floorint;
    double ceilint;
    int f;
    int c;

    f = (int)original; // Truncation to closest floor integer value
    c = f + 1;
    floorint = (double)f;
    ceilint = (double)c;

    printf("Original Value: %lf, Floor Int: %lf , Ceil Int: %lf", original, floorint, ceilint);
}
Output:
Original Value: 124.576000, Floor Int: 124.000000 , Ceil Int: 125.000000
For this example normally I would not need the ceil and floor integer values of c and f to be converted back to double but I need them in double in my real program. Consider that as a requirement for the task.
Although the output gives the desired values and seems right so far, I'm still concerned whether this method is really appropriate or, to put it more clearly, whether it brings any bad behavior or issues into the program, or costs performance compared with other alternatives, if there are any.
Do you know a better alternative? And if so, why would it be better?
Thank you very much.
Do you know a better alternative? And if so, why would it be better?
OP's code fails when:
original is already a whole number.
original is negative, like -1.5. Truncation is not floor there.
original is just outside int range.
original is not-a-number.
Alternative construction
double my_ceil(double x)
Using the cast-to-some-integer-type trick is a problem when x is outside that integer type's range. So first check whether x is inside the range of a wide enough integer (one whose precision exceeds double's); x values outside that range are already whole numbers. I recommend going for the widest integer, (u)intmax_t.
Remember that a cast to an integer rounds toward 0, which is not a floor: negative and positive x need different handling depending on whether the code is ceil() or floor(). OP's code missed this.
I'd avoid if (x >= INTMAX_MAX) {, as that involves (double) INTMAX_MAX, whose rounding, and hence precise value, is "chosen in an implementation-defined manner". Instead, I'd compare against INTMAX_MAX_P1. some_integer_MAX is a Mersenne number and, with 2's complement, ..._MIN is a negated power of 2, so both INTMAX_MAX_P1 and INTMAX_MIN convert to double exactly.
#include <inttypes.h>

#define INTMAX_MAX_P1 ((INTMAX_MAX/2 + 1)*2.0)

double my_ceil(double x) {
    if (x >= INTMAX_MAX_P1) {
        return x;
    }
    if (x < INTMAX_MIN) {
        return x;
    }
    intmax_t i = (intmax_t) x;     // this rounds towards 0
    if (i < 0 || x == i) return i; // negative x is already rounded up
    return i + 1.0;
}
As x may be not-a-number, it is more useful to reverse the comparison, since a relational compare involving a NaN is false.
double my_ceil(double x) {
    if (x >= INTMAX_MIN && x < INTMAX_MAX_P1) {
        intmax_t i = (intmax_t) x;     // this rounds towards 0
        if (i < 0 || x == i) return i; // negative x is already rounded up
        return i + 1.0;
    }
    return x;
}

double my_floor(double x) {
    if (x >= INTMAX_MIN && x < INTMAX_MAX_P1) {
        intmax_t i = (intmax_t) x;     // this rounds towards 0
        if (i > 0 || x == i) return i; // positive x is already rounded down
        return i - 1.0;
    }
    return x;
}
You're missing an important step: you need to check whether the number is already integral. For ceil, assuming non-negative numbers (generalisation is trivial), use something like:
#include <limits.h>

double ceil(double f){
    if (f >= LLONG_MAX){
        // f will be integral unless you have a really funky platform
        return f;
    } else {
        long long i = f;
        return 0.0 + i + (f != i); // to obviate potential long long overflow
    }
}
Another missing piece in the puzzle, which is covered off by my enclosing if, is to check if f is within the bounds of a long long. On common platforms if f was outside the bounds of a long long then it would be integral anyway.
Note that floor is trivial due to the fact that truncation to long long is always towards zero.
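For completeness, here is a floor() along the same lines (a sketch under the same assumptions, generalised to negative inputs):

#include <limits.h>

double my_floor(double f){
    if (f >= LLONG_MAX || f <= LLONG_MIN){
        // f will be integral unless you have a really funky platform
        return f;
    } else {
        long long i = f;          // truncation is towards zero
        return 0.0 + i - (f < i); // step down when f was negative and fractional
    }
}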

Converting SIGNED fractions to UNSIGNED fixed point for addition and multiplication

How can we convert floating point numbers to their "fixed-point representations", and use their "fixed-point representations" in fixed-point operations such as addition and multiplication? The result in the fixed-point operation must yield to the correct answer when converted back to floating point.
Say:
(double)(xa_double) + (double)(xb_double) = ?
Then we convert both addends to a fixed point representation (integer),
(int)(xa_fixed) + (int)(xb_fixed) = (int) (xsum_fixed)
To get (double)(xsum_double), we convert (int)(xsum_fixed) back to floating point, which should yield the same answer,
FixedToDouble(xsum_fixed) => xsum_double
Specifically, if the range of the values of xa_double and xb_double is between -1.65 and 1.65, I want to convert xa_double and xb_double in their respective 10-bit fixed point representations (0x0000 to 0x03FF)
WHAT I HAVE TRIED
#include <math.h>

#define fixed_MAX 1023
#define fixed_MIN 0
#define Value_MAX 1.65
#define Value_MIN (-1.65)

double slope = ((fixed_MAX) - (fixed_MIN)) / ((Value_MAX) - (Value_MIN));

int DoubleToFixed(double x)
{
    return round(((x) - Value_MIN)*slope + fixed_MIN); // via interpolation method
}

double FixedToDouble(int x)
{
    return (double)((((x) + fixed_MIN)/slope) + Value_MIN);
}

int sum_fixed(int x, int y)
{
    return (x + y - (1.65*slope)); // analysis, just basic math
}

int subtract_fixed(int x, int y)
{
    return (x - y + (1.65*slope));
}

int product_fixed(int x, int y)
{
    return (((x * y) - (slope*slope*((1.65*FixedToDouble(x)) + (1.65*FixedToDouble(y)) + (1.65*1.65))) + (slope*slope*1.65)) / slope);
}
And if I want to add (double)(1.00) + (double)(2.00), which should yield (double)(3.00),
With my code,
xsum_fixed = DoubleToFixed(1.00) + DoubleToFixed(2.00);
xsum_double = FixedToDouble(xsum_fixed);
I get the answer:
xsum_double = 3.001613
which is very close to the correct answer, (double)(3.00).
Also, if I perform multiplication and subtraction I get 2.004839 and -1.001613, respectively.
HERE'S THE CATCH:
So I know my code is working, but how can I perform addition, multiplication and subtraction on these fixed-point representations without having INTERNAL FLOATING POINT OPERATIONS AND NUMBERS?
So in the code above, the functions sum_fixed, product_fixed, and subtract_fixed have internal floating point numbers (slope and 1.65, 1.65 being the MAX float input). I derived my code by basic math, really.
So I want to implement add, subtract, and product functions without any internal floating point operations or numbers.
UPDATE:
I also found simpler code for converting fractional numbers to fixed-point:
const int scale = 16; // 16 fractional bits in 32 bits
#define FractionMask ((1 << scale) - 1)
#define DoubleToFixed(x) (int)((x) * (double)(1 << scale))
#define FixedToDouble(x) ((double)(x) / (double)(1 << scale))
#define FractionPart(x) ((x) & FractionMask)
#define MUL(x,y) (((long long)(x) * (long long)(y)) >> scale)
#define DIV(x,y) (((long long)(x) << scale) / (y))
However, this converts only UNSIGNED fractions to UNSIGNED fixed-point, and I want to convert SIGNED fractions (-1.65 to 1.65) to UNSIGNED fixed-point (0x0000 to 0x03FF). How can I do this with the code above? Does the range or the number of bits have something to do with the conversion process? Is this code only for positive fractions?
credits to @chux
You can have the mantissa of the floating point representation of your number be equal to its fixed point representation. Since FP addition shifts the smaller operand's mantissa until both operands have the same exponent, you can force this shift by adding a certain 'magic number'. For double, it is 1<<(52-precision) (52 is the size of double's mantissa; 'precision' is the required number of binary precision digits). So the conversion would look like this:
union { double f; long long i; } u = { xfloat + (1ll << (52 - precision)) }; // shift x's mantissa
long long xfixed = u.i & ((1ll << 52) - 1);                                  // extract the mantissa
After that you can use xfixed in integer math (for multiplication, you'd have to shift the result right by 'precision'). To convert it back to double, simply multiply it by 1.0/(1 << precision);
Note that it doesn't handle negatives. If you need them, you'd have to convert them to the complementary representation manually (first fabs the double, then negate the int result if the input was negative).
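A sketch of the round trip for a positive input (assuming IEEE 754 double, matching endianness, and 10 fractional bits of precision):

#include <stdio.h>

int main(void) {
    const int precision = 10; // number of fractional bits
    double x = 1.25;
    // Adding the magic number forces x's mantissa into fixed-point position.
    union { double f; long long i; } u = { x + (1ll << (52 - precision)) };
    long long xfixed = u.i & ((1ll << 52) - 1); // extract the mantissa
    printf("fixed = %lld\n", xfixed);                          // 1280 = 1.25 * 2^10
    printf("back  = %f\n", xfixed * (1.0 / (1 << precision))); // 1.250000
    return 0;
}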

Compare floating point numbers as integers

Can two floating point values (IEEE 754 binary64) be compared as integers? Eg.
long long a = * (long long *) ptr_to_double1,
b = * (long long *) ptr_to_double2;
if (a < b) {...}
assuming the size of long long and double is the same.
YES - Comparing the bit-patterns for two floats as if they were integers (aka "type-punning") produces meaningful results under some restricted scenarios...
Identical to floating-point comparison when:
Both numbers are positive, positive-zero, or positive-infinity.
One positive and one negative number, and you are using a signed integer comparison.
Inverse of floating-point comparison when:
Both numbers are negative, negative-zero, or negative-infinity.
One positive and one negative number, and you are using an unsigned integer comparison.
Not comparable to floating-point comparison when:
Either number is one of the NaN values - Floating point comparisons with a NaN always returns false, and this simply can't be modeled in integer operations where exactly one of the following is always true: (A < B), (A == B), (B < A).
Negative floating-point numbers are a bit funky because they are handled very differently than in the 2's complement arithmetic used for integers. Doing an integer +1 on the representation of a negative float makes it a bigger negative number.
With a little bit manipulation, you can make both positive and negative floats comparable with integer operations (this can come in handy for some optimizations):
#include <bit>
#include <cstdint>

int32_t float_to_comparable_integer(float f) {
    uint32_t bits = std::bit_cast<uint32_t>(f);
    const uint32_t sign_bit = bits & 0x80000000ul;
    // Modern compilers turn this IF-statement into a conditional move (CMOV)
    // on x86, which is much faster than a branch that the cpu might mis-predict.
    if (sign_bit) {
        // Negative floats sort in reverse order as raw bits; re-map them so
        // that integer comparison matches floating-point comparison.
        bits = 0x80000000ul - bits;
    }
    return static_cast<int32_t>(bits);
}
Again, this does not work for NaN values, which always return false from comparisons, and have multiple valid bit representations:
Signaling NaNs (w/ sign bit): Anything between 0xFF800001, and 0xFFBFFFFF.
Signaling NaNs (w/o sign bit): Anything between 0x7F800001, and 0x7FBFFFFF.
Quiet NaNs (w/ sign bit): Anything between 0xFFC00000, and 0xFFFFFFFF.
Quiet NaNs (w/o sign bit): Anything between 0x7FC00000, and 0x7FFFFFFF.
IEEE-754 bit format: http://www.puntoflotante.net/FLOATING-POINT-FORMAT-IEEE-754.htm
More on Type-Punning: https://randomascii.wordpress.com/2012/01/23/stupid-float-tricks-2/
No. Two floating point values (IEEE 754 binary64) cannot simply be compared as integers with if (a < b).
IEEE 754 binary64
The values of double are not ordered the same way as integers (unless you are on a rare sign-magnitude machine). Think positive vs. negative numbers.
double has the values 0.0 and -0.0, which compare equal but have different bit patterns.
double has "Not-a-number"s that do not compare like their binary equivalent integer representations.
If both double values were x > 0 and not "Not-a-number", and endianness, aliasing, alignment, etc. were not an issue, OP's idea would work.
Alternatively, a more complex if() ... condition would work -- see below.
[non-IEEE 754 binary64]
Some double implementations use an encoding where there are multiple representations of the same value. This would differ from an "integer" compare.
Tested code: needs 2's complement and the same endianness for double and the integers; it does not account for NaN.
#include <stdint.h>

int compare(double a, double b) {
    union {
        double d;
        int64_t i64;
        uint64_t u64;
    } ua, ub;
    ua.d = a;
    ub.d = b;

    // Cope with -0.0 right away
    if (ua.u64 == 0x8000000000000000) ua.u64 = 0;
    if (ub.u64 == 0x8000000000000000) ub.u64 = 0;

    // Signs differ?
    if ((ua.i64 < 0) != (ub.i64 < 0)) {
        return ua.i64 >= 0 ? 1 : -1;
    }

    // If both numbers are negative, map the bit patterns so that
    // unsigned comparison matches numeric order.
    if (ua.i64 < 0) {
        ua.u64 = -ua.u64;
        ub.u64 = -ub.u64;
    }

    return (ua.u64 > ub.u64) - (ua.u64 < ub.u64);
}
Thanks to #David C. Rankin for a correction.
Test code
#include <float.h>
#include <stdio.h>

void testcmp(double a, double b) {
    int t1 = (a > b) - (a < b);
    int t2 = compare(a, b);
    if (t1 != t2) {
        printf("%le %le %d %d\n", a, b, t1, t2);
    }
}

void testcmps() {
    // Various interesting `double`
    static const double a[] = {
        -1.0 / 0.0, -DBL_MAX, -1.0, -DBL_MIN, -0.0,
        +0.0, DBL_MIN, 1.0, DBL_MAX, +1.0 / 0.0 };
    int n = sizeof a / sizeof a[0];
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            testcmp(a[i], a[j]);
        }
    }
    puts("!");
}
If you strictly cast the bit value of a floating point number to its correspondingly-sized signed integer (as you've done), then signed integer comparison of the results will be identical to the comparison of the original floating-point values, excluding NaN values. Put another way, this comparison is legitimate for all representable finite and infinite numeric values.
In other words, for double-precision (64-bits), this comparison will be valid if the following tests pass:
long long exponentMask = 0x7ff0000000000000;
long long mantissaMask = 0x000fffffffffffff;

bool isNumber = ((x & exponentMask) != exponentMask) // Not exp 0x7ff
             || ((x & mantissaMask) == 0);           // Infinities
for each operand x.
Of course, if you can pre-qualify your floating-point values, then a quick isNaN() test would be much more clear. You'd have to profile to understand performance implications.
There are two parts to your question:
Can two floating point numbers be compared? The answer to this is yes. It is perfectly valid to compare the sizes of floating point numbers. Generally you want to avoid equality comparisons due to truncation issues (see here), but
if (a < b)
will work just fine.
Can two floating point numbers be compared as integers? This answer is also yes, but this will require casting. This question should help with that answer: convert from long long to int and the other way back in c++

Rounding up integer without using float, double, or division

It's an embedded platform; that's why there are such restrictions.
original equation: 0.02035*c*c - 2.4038*c
Did this:
int32_t val = 112; // this value is arbitrary
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
result = result>>24;
The precision is still poor. When we multiply val*0x535A8, is there a way we can further improve the precision by rounding up, but without using any float, double, or division?
The problem is not precision. You're using plenty of bits.
I suspect the problem is that you're comparing two different methods of converting to int. The first is a cast of a double, the second is a truncation by right-shifting.
Converting floating point to integer simply drops the fractional part, leading to a round towards zero; right-shifting does a round down or floor. For positive numbers there's no difference, but for negative numbers the two methods will be 1 off from each other. See an example at http://ideone.com/rkckuy and some background reading at Wikipedia.
Your original code is easy to fix:
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
if (result < 0)
    result += 0xffffff;
result = result >> 24;
See the results at http://ideone.com/D0pNPF
You might also just decide that the right shift result is OK as is. The conversion error isn't greater than it is for the other method, just different.
Edit: If you want to do rounding instead of truncation the answer is even easier.
int32_t result = (val*((val * 0x535A8) - 0x2675F70));
result = (result + (1L << 23)) >> 24;
I'm going to join in with some of the others in suggesting that you use a constant expression to replace those magic constants with something that documents how they were derived.
static const int32_t a = (int32_t)(0.02035 * (1L << 24) + 0.5);
static const int32_t b = (int32_t)(2.4038 * (1L << 24) + 0.5);
int32_t result = (val*((val * a) - b));
How about just scaling your constants by 100000? The equation becomes 2035*c*c - 240380*c, and the largest intermediate value, 2035*120*120 = 29304000, is far below the 2^31 limit. So maybe there is no need to do real bit-tweaking here.
As noted by Joe Hass, your problem is that you shift your precision bits into the dustbin.
Whether you scale by a power of 2 or by a power of 10 does not actually matter: just pretend your decimal point is not behind the last bit but at the shifted position. If you keep computing with the result, scaling by a power of 2 is likely easier to handle. If you just want to output the result, scale by powers of ten as proposed above, convert the digits, and insert the decimal point 5 characters from the right, as sketched below.
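As a sketch of that output path (using the factor-100000 scaling suggested above; val is a stand-in input, and sign handling for results in (-1, 0) is omitted):

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    int32_t val = 112;
    // 0.02035 and 2.4038 scaled by 100000 become 2035 and 240380.
    int32_t scaled = 2035*val*val - 240380*val;
    // Insert the decimal point 5 characters from the right when printing.
    printf("%" PRId32 ".%05" PRId32 "\n",
           scaled / 100000,
           (scaled < 0 ? -scaled : scaled) % 100000);
    return 0;
}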
Givens:
Let's assume 1 <= c <= 120,
original equation: 0.02035*c*c - 2.4038*c
then -70.98586 < f(c) < 4.585
--> -71 <= result <= 5, rounding f(c) to the nearest int32_t.
Arguments A = 0.02035 and B = 2.4038 may change a bit with subsequent compiles, but not at run-time.
Allow the coder to input values like 0.02035 & 2.4038. The key component, shown here and by others, is to scale a factor like 0.02035 by some power of 2, evaluate the equation (simplified into the form (A*c - B)*c), and then scale the result back.
Important features:
1. When determining A and B, ensure the compile-time floating point multiplication and the final conversion occur via a round and not a truncation. With positive values, the + 0.5 achieves that. Without a rounded answer, UD_A*UD_Scaling could end up just under a whole number and truncate away 0.999999 when converting to int32_t.
2. Instead of doing expensive division at run-time, we do >> (right shift). By adding half the divisor (as suggested by @Joe Hass) before the division, we get a nicely rounded answer. It is important not to code in / here: some_signed_int / 4 and some_signed_int >> 2 do not round the same way. With 2's complement, >> truncates toward INT_MIN whereas / truncates toward 0.
#include <inttypes.h>
#include <stdio.h>

#define UD_A       (0.02035)
#define UD_B       (2.4038)
#define UD_Shift   (24)
#define UD_Scaling ((int32_t) 1 << UD_Shift)
#define UD_ScA     ((int32_t) (UD_A*UD_Scaling + 0.5))
#define UD_ScB     ((int32_t) (UD_B*UD_Scaling + 0.5))

int main(void) {
    for (int32_t val = 1; val <= 120; val++) {
        int32_t result = ((UD_ScA*val - UD_ScB)*val + UD_Scaling/2) >> UD_Shift;
        printf("%" PRId32 " %" PRId32 "\n", val, result);
    }
    return 0;
}
Example differences:
val, OP equation, OP code, This code
1, -2.38345, -3, -2
54, -70.46460, -71, -70
120, 4.58400, 4, 5
This is a new answer. My old +1 answer deleted.
If your input uses at most 7 bits and you have 32 bits available, then your best bet is to shift everything left by as many bits as possible and work with that:
int32_t result;
result = (val * (int32_t)(0.02035 * 0x1000000)) - (int32_t)(2.4038 * 0x1000000);
result >>= 8; // make room for another 7 bit multiplication
result *= val;
result >>= 16;
Constant conversion will be done by an optimising compiler at compile time.

Find which power of 2 range a number falls within? (In C)

As in whether it falls within 2^3 - 2^4, 2^4 - 2^5, etc. The number returned would be the EXPONENT itself (minus an offset).
How could this be done as quickly and efficiently as possible? This function will be called a lot in a program that is EXTREMELY dependent on speed. This is my current code, but it is far too inefficient, as it uses a for loop.
static inline size_t getIndex(size_t numOfBytes)
{
    int i = 3;
    for (; i < 32; i++)
    {
        if (numOfBytes < (1 << i))
            return i - OFFSET;
    }
    return (NUM_OF_BUCKETS - 1);
}
Thank you very much!
What you're after is simply log2(n), as far as I can tell.
It might be worth cheating and using some inline assembly if your target architecture(s) have instructions that can do this. See the Wikipedia entry on "find first set" for lots of discussion and information about hardware support.
One way to do it would be to find the highest-order bit that is set to 1. I'm not sure that is efficient, though, since you'd still have to do n checks in the worst case.
Maybe you could do a binary-search style approach: check if the value is greater than 2^16; if so, check if it's greater than 2^24 (assuming 32 bits here), and if not, check if it's greater than 2^20, etc. That would be log(n) checks, but I'm not sure how a bit check compares to a full int comparison in efficiency.
You could get some perf data on either.
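A sketch of that binary-search idea (a hypothetical helper for 32-bit values; it returns floor(log2(v)) for v > 0):

#include <stdint.h>

static int log2_binsearch(uint32_t v) {
    int r = 0;
    // Each comparison halves the range the exponent can lie in.
    if (v >= (uint32_t)1 << 16) { v >>= 16; r += 16; }
    if (v >= (uint32_t)1 <<  8) { v >>=  8; r +=  8; }
    if (v >= (uint32_t)1 <<  4) { v >>=  4; r +=  4; }
    if (v >= (uint32_t)1 <<  2) { v >>=  2; r +=  2; }
    if (v >= (uint32_t)1 <<  1) { r += 1; }
    return r;
}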
There is a particularly efficient algorithm using de Bruijn sequences described on Sean Eron Anderson's excellent Bit Twiddling Hacks page:
uint32_t v; // find the log base 2 of 32-bit v
int r;      // result goes here

static const int MultiplyDeBruijnBitPosition[32] =
{
    0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
    8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
};

v |= v >> 1; // first round down to one less than a power of 2
v |= v >> 2;
v |= v >> 4;
v |= v >> 8;
v |= v >> 16;

r = MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
It works in 13 operations without branching!
You are basically trying to compute: floor(log2(x))
Take the logarithm to the base 2, then take the floor.
The most portable way to do this in C is to use the logf() function, which finds the log to the base e, then adjust: log2(x) == logf(x) / logf(2.0)
See the answer here: How to write log base(2) in c/c++
If you just cast the resulting float value to int, you compute floor() at the same time.
But, if it is available to you and you can use it, there is an extremely fast way to compute log2() of a floating point number: logbf()
From the man page:
The integer constant FLT_RADIX, defined in <float.h>, indicates the radix used for the system's floating-point representation. If FLT_RADIX is 2, logb(x) is equal to floor(log2(x)), except that it is probably faster.
http://linux.die.net/man/3/logb
If you think about how floating-point numbers are stored, you realize that the value floor(log2(x)) is part of the number: it is the (biased) exponent field. If you just extract that value, you are done: a little bit of shifting and bit-masking, subtract the bias from the exponent, and there you have it. This is the fastest way possible to compute floor(log2(x)) for any float value x.
http://en.wikipedia.org/wiki/Single_precision
But actually logbf() converts the result to a float before giving it to you, and handles errors. If you write your own function to extract the exponent as an integer, it will be slightly faster, and an integer is what you want anyway. If you want to write your own function, you need to use a C union to gain access to the bits inside the float; trying to play with pointers will get you warnings or errors related to "type-punning", at least on GCC. I have written this code before, as an inline function.
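Here is what such an inline function might look like (a minimal sketch, assuming IEEE 754 single precision and a positive, finite, normalized x):

#include <stdint.h>

// floor(log2(x)) read straight from the exponent field. Assumes x > 0,
// finite and normalized; the union avoids type-punning warnings.
static inline int float_floor_log2(float x) {
    union { float f; uint32_t bits; } u = { x };
    return (int)((u.bits >> 23) & 0xff) - 127; // remove the exponent bias
}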
If you only have a small range of numbers to test, you could possibly cast your numbers to integer and then use a lookup table.
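For instance, a small sketch for inputs known to fit in 8 bits (the table and init function are hypothetical names):

// log2_table[i] holds floor(log2(i)) for 1 <= i < 256.
static unsigned char log2_table[256];

static void init_log2_table(void) {
    for (int i = 2; i < 256; i++)
        log2_table[i] = (unsigned char)(log2_table[i / 2] + 1);
}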
You can make use of the floating-point representation:
double n_bytes = numOfBytes;
Taking the exponent bits should give you the result, as floating-point numbers are represented as:
(-1)^S × (1 + M) × 2^E
where:
S - Sign
M - Mantissa
E - Exponent
To construct the mask and shift, you would have to read about the exact bit pattern of the floating-point type you are using. The CPU's floating-point support does most of the work for you.
An even better way would be to use the built-in function:
double frexp(double x, int *exp);
Floating point representation
#include <limits.h> // For CHAR_BIT.
#include <math.h>   // For frexp.
#include <stdio.h>  // For printing results, as a demonstration.

// These routines assume 0 < x.

/* This requires GCC (or any other compiler that supplies __builtin_clz). It
   should perform well on any machine with a count-leading-zeroes instruction
   or something similar.
*/
static int log2A(unsigned int x)
{
    return sizeof x * CHAR_BIT - 1 - __builtin_clz(x);
}

/* This requires that a double be able to exactly represent any unsigned int.
   (This is true for 32-bit integers and 64-bit IEEE 754 floating-point.) It
   might perform well on some machines and poorly on others.
*/
static int log2B(unsigned int x)
{
    int exponent;
    frexp(x, &exponent);
    return exponent - 1;
}

int main(void)
{
    // Demonstrate the routines.
    for (unsigned int x = 1; x; x <<= 1)
        printf("0x%08x: log2A -> %2d, log2B -> %2d.\n", x, log2A(x), log2B(x));
    return 0;
}
This is generally fast on any machine with a hardware floating point unit:
((union { float val; uint32_t repr; }){ x }.repr >> 23) - 0x7f
The only assumptions it makes are that the floating point format is IEEE and that integer and floating point endianness match, both of which are true on basically all real-world systems (certainly all modern ones).
Edit: When I've used this in the past, I didn't need it for large numbers. Eric points out that it will give the wrong result for ints that don't fit in float. Here is a revised (albeit possibly slower) version that fixes that and supports values up to 52 bits (in particular, all 32-bit positive integer inputs):
((union { double val; uint64_t repr; }){ x }.repr >> 52) - 0x3ff
Also note that I'm assuming x is positive (not just non-negative: also nonzero). If x is negative you'll get a bogus result, and if x is 0, you'll get a large negative result (approximating negative infinity as the logarithm).
