I'm stuck trying to figure out how to convert the last two "if" statements of the following code to a branchless form.
int u, x, y;
x = rand() % 100 - 50;
y = rand() % 100 - 50;
u = rand() % 4;
if ( y > x) u = 5;
if (-y > x) u = 4;
Or, in case the above turns out to be too difficult, you can consider them as:
if (x > 0) u = 5;
if (y > 0) u = 4;
I think what gets me is the fact that those statements have no else branch. If they did, I could probably have adapted a variation of a branchless abs (or max/min) function.
The rand() functions you see aren't part of the real code. I added them like this just to hint at the expected ranges that the variables x, y and u can possibly have at the time the two branches happen.
Assembly machine code is allowed for the purpose.
EDIT:
After a bit of braingrinding I managed to put together a working branchless version:
int u, x, y;
x = rand() % 100 - 50;
y = rand() % 100 - 50;
u = rand() % 4;
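u += (5-u)*((unsigned int)(x-y) >> 31); /* y > x: u = 5 (assumes 32-bit int) */
u += (4-u)*((unsigned int)(x+y) >> 31); /* -y > x: u = 4, applied last so it wins as in the if version */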
Unfortunately, due to the integer arithmetic involved, the original version with if statements turns out to be about 30% faster.
Compiler knows where the party is at.
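For comparison, here is a minimal sketch of a multiply-free variant that builds all-ones or all-zero masks from the comparison results. The function names pick_branchy and pick_masked are made up for illustration, and whether this beats the plain if statements depends on the compiler and CPU, so treat it as something to measure rather than a guaranteed win.

```c
/* Reference: the two original if statements. */
static int pick_branchy(int u, int x, int y)
{
    if ( y > x) u = 5;
    if (-y > x) u = 4;
    return u;
}

/* Branchless sketch (hypothetical helper): each comparison yields 0 or 1,
   and negating it gives a mask of 0 or -1 (all bits set). */
static int pick_masked(int u, int x, int y)
{
    const int m5 = -(y > x);    /* -1 if  y > x, else 0 */
    const int m4 = -(-y > x);   /* -1 if -y > x, else 0 */
    u = (u & ~m5) | (5 & m5);   /* select 5 where m5 is set */
    u = (u & ~m4) | (4 & m4);   /* select 4 last, so it takes priority */
    return u;
}
```

Both functions should agree for every x and y in the -50..49 range and every u in 0..3.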
[All: this answer was written with the assumption that the calls on rand() were part of the problem. I offer improvement below under that assumption.
OP belatedly clarifies he only used rand to tell us ranges (and presumably distribution) of the values of x and y. Unclear if he meant for the value for u, too. Anyway, enjoy my improved answer to the problem he didn't really pose].
I think you'd be better off recoding this as:
int u, x, y;
x = rand() % 100 - 50;
y = rand() % 100 - 50;
if (-y > x) u = 4; /* tested first: in the original code this assignment wins when both conditions hold */
else if ( y > x) u = 5;
else u = rand() % 4;
This calls the last rand only 1/4 as often as OP's original code.
Since I assume rand (and the divides) are much more expensive
than compare-and-branch, this would be a significant savings.
If your rand generator produces a lot of truly random bits (e.g. 16) on each call as it should, you can call it just once (I've assumed rand is more expensive than divide, YMMV):
int u, x, y, t;
t = rand() ;
u = t % 4;
t = t >> 2;
x = t % 100 - 50;
y = ( t / 100 ) %100 - 50;
if (-y > x) u = 4; /* tested first: in the original code this assignment wins when both conditions hold */
else if ( y > x) u = 5;
I think that the rand function in the MS C library is not good enough for this if you want really random values. I had to code my own; turned out faster anyway.
You might also get rid of the divide, by using multiplication by a reciprocal (untested):
int u, x, y;
unsigned int t;
unsigned long long t2;
t = rand();
u = t % 4;
{ // Compute the value of x * 2^32 (before the -50 offset) in a 64-bit integer by multiplying.
// The reciprocal below should be folded into a single constant at compile time.
// The remaining multiply can be done by one machine instruction
// (typically 32bits * 32bits --> 64bits) widely found in processors.
// The "4" has the same effect as the t = t >> 2 in the previous version
t2 = (unsigned long long)t * (unsigned int)((1ULL << 32) / (4 * 100));
}
x = (int)(t2 >> 32) - 50; // take the upper word (if compiler won't, do this in assembler); stays below 50 only while t < 40000
{ // compute y from the fractional remainder of the above multiply,
// which is sitting in the lower 32 bits of the t2 product
y = (int)(((t2 & 0xFFFFFFFFULL) * 100) >> 32) - 50;
}
if (-y > x) u = 4; // tested first: in the original code this assignment wins when both conditions hold
else if ( y > x) u = 5;
If your compiler won't produce the "right" instructions, it should be straightforward to write assembly code to do this.
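As a smaller, self-contained illustration of the reciprocal idea, here is a sketch of unsigned division by the constant 100 using one 32x32 -> 64-bit multiply and a shift. The names div100 and mod100 are made up for this example; the constant 0x51EB851F is ceil(2^37 / 100), which gives the exact quotient for every 32-bit unsigned input. Compilers typically generate something like this on their own when the divisor is a compile-time constant.

```c
#include <stdint.h>

/* Quotient n / 100 without a divide instruction (hypothetical helper).
   0x51EB851F == ceil(2^37 / 100); the product fits in 64 bits. */
static uint32_t div100(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0x51EB851FULL) >> 37);
}

/* Remainder n % 100, recovered from the quotient. */
static uint32_t mod100(uint32_t n)
{
    return n - 100u * div100(n);
}
```

With these, x = (int)mod100(t) - 50 would stand in for t % 100 - 50.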
Some tricks use array indices; they may be quite fast if the compiler/CPU has one-step instructions to convert comparison results to 0-1 values (e.g. x86's "sete" and similar).
int ycpx[3];
/* ... */
ycpx[0] = 4;
ycpx[1] = u;
ycpx[2] = 5;
u = ycpx[(-y <= x) << (y > x)]; /* index 0 if -y > x, else 2 if y > x, else 1 */
Alternate form
int v1[2];
int v2[2];
/* ... */
v1[0] = u;
v1[1] = 5;
v2[1] = 4;
v2[0] = v1[y > x];
u = v2[-y > x];
Almost unreadable...
NOTE: In both cases the initialization of array elements containing 4 and 5 may be included in declaration and arrays may be made static if reentrancy is not a problem for you.
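To make the table lookup easy to check against the if statements, here is a sketch that wraps it in a function. The name pick_table is made up for illustration. The index is 0 when -y > x (so 4 takes priority, as in the original code), 2 when only y > x holds, and 1 otherwise.

```c
/* Table-lookup sketch (hypothetical helper). */
static int pick_table(int u, int x, int y)
{
    const int table[3] = { 4, u, 5 };
    /* (-y <= x) is 0 or 1; shifting it left by (y > x) gives 0, 1 or 2. */
    return table[(-y <= x) << (y > x)];
}
```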
I need to calculate the entropy and, due to the limitations of my system, I need to use restricted C features (no loops, no floating point support) while keeping as much precision as possible. I have already figured out how to estimate the floor of log2 of an integer using bitwise operations. Nevertheless, I need to increase the precision of the results. Since no floating point operations are allowed, is there any way to calculate log2(x/y) with x < y so that the result would be something like log2(x/y)*10000, getting the precision I need through integer arithmetic?
You will base an algorithm on the formula
log2(x/y) = K*(-log(x/y));
where
K = -1.0/log(2.0); // you can precompute this constant before run-time
a = (y-x)/y;
-log(x/y) = a + a^2/2 + a^3/3 + a^4/4 + a^5/5 + ...
If you write the loop correctly—or, if you prefer, unroll the loop to code the same sequence of operations looplessly—then you can handle everything in integer operations:
(y^N*(1*2*3*4*5*...*N)) * (-log(x/y))
= y^(N-1)*(2*3*4*5*...*N)*(y-x) + y^(N-2)*(1*3*4*5*...*N)*(y-x)^2 + ...
Of course, ^, the power operator, binding tighter than *, is not a C operator, but you can implement that efficiently in the context of your (perhaps unrolled) loop as a running product.
The N is an integer large enough to afford desired precision but not so large that it overruns the number of bits you have available. If unsure, then try N = 6 for instance. Regarding K, you might object that that is a floating-point number, but this is not a problem for you because you are going to precompute K, storing it as a ratio of integers.
SAMPLE CODE
This is a toy code but it works for small values of x and y such as 5 and 7, thus sufficing to prove the concept. In the toy code, larger values can silently overflow the default 64-bit registers. More work would be needed to make the code robust.
#include <stddef.h>
#include <stdlib.h>
// Your program will not need the below headers, which are here
// included only for comparison and demonstration.
#include <math.h>
#include <stdio.h>
const size_t N = 6;
const long long Ky = 1 << 10; // denominator of K
// Your code should define a precomputed value for Kx here.
int main(const int argc, const char *const *const argv)
{
// Your program won't include the following library calls but this
// does not matter. You can instead precompute the value of Kx and
// hard-code its value above with Ky.
const long long Kx = lrintl((-1.0/log(2.0))*Ky); // numerator of K
printf("K == %lld/%lld\n", Kx, Ky);
if (argc != 3) exit(1);
// Read x and y from the command line.
const long long x0 = atoll(argv[1]);
const long long y = atoll(argv[2]);
printf("x/y == %lld/%lld\n", x0, y);
if (x0 <= 0 || y <= 0 || x0 > y) exit(1);
// If 2*x <= y, then, to improve accuracy, double x repeatedly
// until 2*x > y. Each doubling offsets the log2 by 1. The offset
// is to be recovered later.
long long x = x0;
int integral_part_of_log2 = 0;
while (1) {
const long long trial_x = x << 1;
if (trial_x > y) break;
x = trial_x;
--integral_part_of_log2;
}
printf("integral_part_of_log2 == %d\n", integral_part_of_log2);
// Calculate the denominator of -log(x/y).
long long yy = 1;
for (size_t j = N; j; --j) yy *= j*y;
// Calculate the numerator of -log(x/y).
long long xx = 0;
{
const long long y_minus_x = y - x;
for (size_t i = N; i; --i) {
long long term = 1;
size_t j = N;
for (; j > i; --j) {
term *= j*y;
}
term *= y_minus_x;
--j;
for (; j; --j) {
term *= j*y_minus_x;
}
xx += term;
}
}
// Convert log to log2.
xx *= Kx;
yy *= Ky;
// Restore the aforementioned offset.
for (; integral_part_of_log2; ++integral_part_of_log2) xx -= yy;
printf("log2(%lld/%lld) == %lld/%lld\n", x0, y, xx, yy);
printf("in floating point, this ratio of integers works out to %g\n",
(1.0*xx)/(1.0*yy));
printf("the CPU's floating-point unit computes the log2 to be %g\n",
log2((1.0*x0)/(1.0*y)));
return 0;
}
Running this on my machine with command-line arguments of 5 7, it outputs:
K == -1477/1024
x/y == 5/7
integral_part_of_log2 == 0
log2(5/7) == -42093223872/86740254720
in floating point, this ratio of integers works out to -0.485279
the CPU's floating-point unit computes the log2 to be -0.485427
Accuracy would be substantially improved by N = 12 and Ky = 1 << 20, but for that you need either thriftier code or more than 64 bits.
THRIFTIER CODE
Thriftier code, wanting more effort to write, might represent numerator and denominator in prime factors. For example, it might represent 500 as [2 0 3], meaning (2^2)(3^0)(5^3).
Yet further improvements might occur to your imagination.
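One possible variation, offered only as a sketch: keep every intermediate value in Q30 fixed point (an integer scaled by 2^30) so that the same series fits in 64 bits without the large factorial products. The name log2_ratio_x10000, the term count of 24, and the Q30 scaling are my own choices, not part of the code above. The loops have fixed or bounded trip counts, so they could be unrolled if loops are truly unavailable. It assumes 0 < x <= y < 2^31.

```c
#include <stdint.h>

#define Q_BITS  30              /* fractional bits of the fixed-point format */
#define LN2_Q30 744261118LL     /* round(ln(2) * 2^30), precomputed */
#define TERMS   24              /* series terms; a < 0.5 so they shrink fast */

/* Returns approximately log2(x/y) * 10000 (hypothetical helper). */
static int64_t log2_ratio_x10000(int64_t x, int64_t y)
{
    int64_t shifts = 0;
    /* Double x until 2*x > y; each doubling offsets log2 by 1. */
    while ((x << 1) <= y) { x <<= 1; ++shifts; }

    /* a = (y - x)/y in Q30, with 0 <= a < 0.5 after the doubling. */
    const int64_t a = ((y - x) << Q_BITS) / y;

    /* -log(x/y) = a + a^2/2 + a^3/3 + ..., accumulated in Q30. */
    int64_t power = a, sum = 0;
    for (int n = 1; n <= TERMS; ++n) {
        sum += power / n;
        power = (power * a) >> Q_BITS;
    }

    /* Convert the natural log to log2, scale by 10000, restore the offset. */
    return -(sum * 10000 / LN2_Q30) - shifts * 10000;
}
```

For the 5 and 7 example above this should come out near -4854, in line with the floating-point value of -0.485427.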
AN ALTERNATE APPROACH
For an alternate approach, though it might not meet your requirements precisely as you have stated them, @phuclv has given the suggestion I would be inclined to follow if your program were mine: work the problem in reverse, guessing a value c/d for the logarithm and then computing 2^(c/d), presumably via a Newton-Raphson iteration. Personally, I like the Newton-Raphson approach better. See sect. 4.8 here (my original).
MATHEMATICAL BACKGROUND
Several sources including mine already linked explain the Taylor series underlying the first approach and the Newton-Raphson iteration of the second approach. The mathematics unfortunately is nontrivial, but there you have it. Good luck.
I want to create a modulo-like function which can work with double-precision floats rather than ints. Another important factor is that the function must round towards negative infinity, rather than zero.
I have a couple of methods which work, but I believe them to be slow for a function which will be called many times in loops:
// A suggested method
double reduce_range(double x, const double max) {
x /= max; // Normalize to [0,1)
x -= (int) x;
x += 1.0;
x -= (int) x;
return x * max; // Denormalize
}
// My own simple implementation
double reduce_range(const double x, const double max) {
return x - floor(x / max) * max;
}
Both seem to work, but the second uses floor (which seems to be a bit of a bottleneck for these sorts of things) and the first repeatedly casts to int and subtracts. Is there not some faster way to do this (or to allow the compiler to take care of it)?
Alternatively, how about this:
double reduce_range(double x, const double max) {
x = fmod(x, max);
if(x < 0) x += max;
return x;
}
Is it going to be greatly slowed down by the branching if?
Edit: some example inputs and outputs:
(5.0, 7.0) >> 5.0
(8.5, 7.0) >> 1.5
(-2.3, 7.0) >> 4.7
If you are worried about the branch, then possibly this might be better, if it's cheaper to load an integer into the fpu:
x += max * (x < 0);
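Another option, sketched here under the assumption that x / max always fits in a long long, is to build the floor from a truncating cast plus a comparison, which avoids both the library call and the explicit if. The name reduce_range_sketch is made up for this example. For a negative x extremely close to zero the result can round up to max itself, so check that edge case if it matters.

```c
/* Floored modulo for doubles without floor() or fmod() (hypothetical helper).
   Assumes max > 0 and that x / max fits in a long long. */
static double reduce_range_sketch(double x, double max)
{
    const double t = x / max;
    long long q = (long long)t;   /* truncates toward zero */
    q -= (t < (double)q);         /* step down once if truncation rounded up */
    return x - (double)q * max;
}
```

For the example inputs in the question this gives 5.0, 1.5 and (up to rounding) 4.7.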
I have written this code in C, where each of a,b,cc,ma,mb,mcc,N,k is an int. But as per the specification of the problem, N and k can be as big as 10^9. 10^9 can be stored in an int variable on my machine, but the intermediate and final values of a,b,cc,ma,mb,mcc will be much bigger for larger values of N and k, and cannot be stored even in an unsigned long long int variable.
Now, I want to print the value of mcc % 1000000007, as you can see in the code. I know that some clever modular arithmetic tricks in the body of the for loop can produce the correct output without any overflow and can also make the program time efficient. Being new to modular arithmetic, I failed to solve this. Can someone point out those steps to me?
ma=1;mb=0;mcc=0;
for(i=1; i<=N; ++i){
a=ma;b=mb;cc=mcc;
ma = k*a + 1;
mb = k*b + k*(k-1)*a*a;
mcc = k*cc + k*(k-1)*a*(3*b+(k-2)*a*a);
}
printf("%d\n",mcc%1000000007);
My attempt:
I declared a,b,cc,ma,mb,mcc as long long and did the following. Could it be optimized further?
ma=1;mb=0;cc=0;
ok = (k*(k-1))%MOD; /* reduce here, otherwise as*ok below can overflow long long */
for(i=1; i<=N; ++i){
a=ma;b=mb;
as = (a*a)%MOD;
ma = (k*a + 1)%MOD;
temp1 = (k*b)%MOD;
temp2 = (as*ok)%MOD;
mb = (temp1+temp2)%MOD;
temp1 = (k*cc)%MOD;
temp2 = (as*(k-2))%MOD;
temp3 = (3*b)%MOD;
temp2 = (temp2+temp3)%MOD;
temp2 = (temp2*a)%MOD;
temp2 = (ok*temp2)%MOD;
cc = (temp1 + temp2)%MOD;
}
printf("%lld\n",cc);
Let's look at a small example:
mb = (k*b + k*(k-1)*a*a)%MOD;
Here, k*b and k*(k-1)*a*a can overflow, and so can their sum. Taking into account that
(x + y) mod m = (x mod m + y mod m) mod m
we can rewrite this (x= k*b, y=k*(k-1)*a*a and m=MOD)
mb = ((k*b) % MOD + (k*(k-1)*a*a) %MOD) % MOD
Now we can go one step further. Since
x * y mod m = (x mod m * y mod m) mod m
we can also rewrite the multiplication k*(k-1)*a*a % MOD with x=k*(k-1) and y=a*a to
(((k*(k-1)) %MOD) * ((a*a) %MOD)) % MOD
I'm sure you can do the rest. While you can sprinkle % MOD all over the place, you should carefully consider whether you need it or not, taking John's hint into account:
Adding two n-digit numbers produces a number of up to n+1 digits, and
multiplying an n-digit number by an m-digit number produces a result
with up to n + m digits.
As such, there are places where you will need to use the modulus properties, and there are some where you surely don't need them, but this is your part of the work ;).
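Putting the two identities together, the whole loop might be sketched as follows. The helper names mulmod and mcc_mod are made up for this example, and it assumes k >= 1 and that long long is 64 bits wide. Every product has both factors reduced below MOD first, so it stays under roughly 10^18.

```c
#define MOD 1000000007LL

/* (a * b) mod MOD with both factors reduced first (hypothetical helper). */
static long long mulmod(long long a, long long b)
{
    return ((a % MOD) * (b % MOD)) % MOD;
}

/* Runs the recurrence from the question and returns mcc mod MOD. */
static long long mcc_mod(long long N, long long k)
{
    long long ma = 1, mb = 0, mcc = 0;
    const long long kk1 = mulmod(k, k - 1);             /* k*(k-1) mod MOD */
    const long long km2 = ((k - 2) % MOD + MOD) % MOD;  /* k-2, kept non-negative */
    for (long long i = 1; i <= N; ++i) {
        const long long a = ma, b = mb, cc = mcc;
        const long long aa = mulmod(a, a);
        ma = (mulmod(k, a) + 1) % MOD;
        mb = (mulmod(k, b) + mulmod(kk1, aa)) % MOD;
        const long long inner = (3 * b + mulmod(km2, aa)) % MOD;
        mcc = (mulmod(k, cc) + mulmod(kk1, mulmod(a, inner))) % MOD;
    }
    return mcc;
}
```

Working the original formulas by hand for N = 2 and k = 3 gives mcc = 834, which is a handy small case to test against. Note that the loop is still O(N), so N near 10^9 remains slow regardless of the overflow fix.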
That's a good exercise to build a template class along these lines:
template <int N>
class modulo_int_t
{
public:
modulo_int_t(int value) : value_(value % N) {}
modulo_int_t<N> operator+(const modulo_int_t<N> &rhs)
{
return modulo_int_t<N>(value_ + rhs.value_) ;
}
// fill in the other operations
private:
int value_ ;
} ;
Then write the operations using modulo_int_t<1000000007> objects instead of int.
Disclaimer: make use of long long where appropriate and take care of negative differences...
I found many posts about bitwise division and I understand most bitwise usage, but I can't work out one specific kind of division. I want to divide a given number (let's say 100) by every possible multiple of 2 (note: I mean multiples of 2, not powers of 2!)
For example: 100/2, 100/4, 100/6, 100/8, 100/10...100/100
I also know that because I am using unsigned int the answers will be rounded, for example 100/52=0, but that doesn't really matter, because I can either skip those answers or print them, no problem. My concern is mostly how I can divide by 6 or 10, etc. (multiples of 2). It does not need to be done in C, because I can manage to translate any code you give me from Java to C.
Following the math shown for the accepted solution to the division by 3 question, you can derive a recurrence for the division algorithm:
To compute (int)(X / Y)
Let k be such that 2^k ≥ Y and 2^(k-1) < Y
(note, 2^k = (1 << k))
Let d = 2^k - Y
Then, if A = (int)(X / 2^k) and B = X % 2^k,
X = (1 << k) * A + B
= (1 << k) * A - Y * A + Y * A + B
= d * A + Y * A + B
= Y * A + (d * A + B)
Thus,
X/Y = A + (d * A + B)/Y
In other words,
If S(X, Y) := X/Y, then S(X, Y) := A + S(d * A + B, Y).
This recurrence can be implemented with a simple loop. The stopping condition for the loop is when the numerator falls below 2^k. The function divu implements the recurrence, using only bitwise operators and using unsigned types. Helper functions for the math operations are left unimplemented, but shouldn't be too hard (the linked answer provides a full add implementation already). The rs() function is for "right-shift", which does sign extension on the unsigned input. The function div is the actual API for int, and checks for divide by zero and negative y before delegating to divu. negate does 2's complement negation.
static unsigned divu (unsigned x, unsigned y) {
unsigned k = 0;
unsigned pow2 = 0;
unsigned mask = 0;
unsigned diff = 0;
unsigned sum = 0;
while ((1 << k) < y) k = add(k, 1);
pow2 = (1 << k);
mask = sub(pow2, 1);
diff = sub(pow2, y);
while (x >= pow2) {
sum = add(sum, rs(x, k));
x = add(mul(diff, rs(x, k)), (x & mask));
}
if (x >= y) sum = add(sum, 1);
return sum;
}
int div (int x, int y) {
assert(y);
if (y > 0) return divu(x, y);
return negate(divu(x, negate(y)));
}
This implementation depends on signed int using 2's complement. For maximal portability, div should convert negative arguments to 2's complement before calling divu. Then, it should convert the result from divu back from 2's complement to the native signed representation.
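To check the recurrence itself before writing the bitwise helpers, here is a sketch that uses the ordinary +, * and >> operators in place of add, mul and rs. The name divu_sketch is made up for this example, and it assumes 1 <= y <= 2^31 with a 32-bit unsigned.

```c
/* Division by the recurrence S(X, Y) = A + S(d*A + B, Y) (hypothetical helper).
   Ordinary operators stand in for the unimplemented bitwise helpers. */
static unsigned divu_sketch(unsigned x, unsigned y)
{
    unsigned k = 0;
    while ((1u << k) < y) ++k;          /* smallest k with 2^k >= y */
    const unsigned pow2 = 1u << k;
    const unsigned mask = pow2 - 1u;    /* x & mask == x % 2^k */
    const unsigned d = pow2 - y;        /* d = 2^k - y */
    unsigned sum = 0;
    while (x >= pow2) {
        const unsigned a = x >> k;      /* A = x / 2^k */
        sum += a;
        x = d * a + (x & mask);         /* new numerator d*A + B */
    }
    if (x >= y) ++sum;                  /* at most one more y fits */
    return sum;
}
```

Tracing 100 / 6 by hand: k = 3 and d = 2, the numerator goes 100, 28, 10, 4 while the sum goes 12, 15, 16, and the final 4 is below 6, so the result is 16.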
The following code works for positive numbers. When the dividend or the divisor or both are negative, have flags to change the sign of the answer appropriately.
int divi(long long m, long long n)
{
if(m==0 || n==0 || m<n)
return 0;
long long a,b;
int f=0;
a=n;b=1;
while(a<=m)
{
b = b<<1;
a = a<<1;
f=1;
}
if(f)
{
b = b>>1;
a = a>>1;
}
b = b + divi(m-a,n);
return b;
}
Use the operator / for integer division as much as you can.
For instance, when you want to divide 100 by 6 or 10 you should write 100/6 or 100/10.
When you mention bitwise division, do you mean (1) an implementation of operator / or (2) division by a power of two?
For (1) a processor should have an integer division unit. If not, the compiler should provide a good implementation.
For (2) you can use 100>>2 instead of 100/4 (for non-negative values). If the divisor is known at compile time, then a good compiler should automatically use the shift instruction.
In the below code, is there a way to avoid the if statement?
s = 13; /*Total size*/
b = 5; /*Block size*/
x = 0;
b1 = b;
while(x < s)
{
if(x + b > s)
b1 = s-x;
SendData(x, b1); /*SendData(offset,length);*/
x += b1;
}
Thanks much!
I don't know, maybe you'll think this:
s = 13; /*Total size*/
b = 5; /*Block size*/
x = 0;
while(x + b < s)
{
SendData(x, b); /*SendData(offset,length);*/
x += b;
}
SendData(x, s - x); /* remaining bytes; s%b would be 0 when s is a multiple of b */
is better?
Don't waste your time on pointless micro-optimizations your compiler probably does for you anyway.
Program for the programmer; not the computer. Compilers get better and better, but programmers don't.
If it makes your program more readable (@PaulPRO's answer), then do it. Otherwise, don't.
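In that spirit, one readable way to write the loop is to compute the block length as the smaller of b and the bytes remaining. The sketch below uses a stand-in SendData that only records its arguments so the behaviour can be checked; the names send_all, sent_offset, sent_length and sent_count are made up for illustration, and it assumes b > 0. The ternary may or may not compile to a conditional move, depending on the compiler.

```c
#define MAX_CALLS 16

static int sent_offset[MAX_CALLS];
static int sent_length[MAX_CALLS];
static int sent_count;

/* Stand-in for the real SendData(offset, length): records the call only. */
static void SendData(int offset, int length)
{
    if (sent_count < MAX_CALLS) {
        sent_offset[sent_count] = offset;
        sent_length[sent_count] = length;
    }
    ++sent_count;
}

/* Sends s bytes in blocks of at most b bytes (hypothetical helper). */
static void send_all(int s, int b)
{
    sent_count = 0;
    for (int x = 0; x < s; ) {
        const int remaining = s - x;
        const int b1 = remaining < b ? remaining : b;  /* min(b, remaining) */
        SendData(x, b1);
        x += b1;
    }
}
```

With s = 13 and b = 5 this makes three calls: (0, 5), (5, 5) and (10, 3).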
You can use a conditional move or branchless integer select to assign b1 without an if-statement:
// if a >= 0, return x, else y
// assumes 32-bit int and an arithmetic (sign-propagating) right shift
inline int isel( int a, int x, int y ) // inlining is important here
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
// mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
return x + ((y - x) & mask);
}
// ...
while(x < s)
{
b1 = isel( x + b - s, s-x, b1 );
SendData(x, b1); /*SendData(offset,length);*/
x += b1;
}
This is only a useful optimization on in-order processors, though. It won't make any difference on a modern PC's x86, which has a fast branch and a reorder unit. It might be useful on some embedded systems (like a Playstation), where pipeline latency matters more for performance than instruction count. I've used it to shave a few microseconds in tight loops.
In theory a compiler "should" be able to turn a ternary expression (b = (a > 0 ? x : y)) into a conditional move, but I've never met one that did.
Of course, in a larger sense everyone who says that this is a pointless optimization compared to the cost of SendData() is correct. The difference between a cmov and a branch is about 4 nanoseconds, which is negligible compared to the cost of a network call. Spending your time fixing this branch which happens once per network call is like driving across town to save 1¢ on gasoline.
If you try to remove the if(), you might change your logic and then have to spend a lot of time on testing. I see only one potential change:
s = 13;
b = 5;
x = 0;
b1 = b;
while(x < s)
{
const unsigned int total = x + b; // <--- introduce 'total'
if(total > s)
b1 = s-x;
SendData(x, b1);
x = total; // <--- reusing it
}