A #define statement is:
#define _INTSIZEOF(n) ( (sizeof(n) + sizeof(int) - 1) & ~(sizeof(int) - 1) )
I have been told that its purpose is alignment.
How does it work? Thanks in advance.
The above macro simply aligns the size of n to the nearest greater-or-equal sizeof(int) boundary.
The basic algorithm for aligning value a to the nearest greater-or-equal arbitrary boundary b is to
Divide a by b rounding up, and then
Multiply the quotient by b again.
In the domain of unsigned (or just positive) values the first step is achieved by the following popular trick
q = (a + b - 1) / b
// where `/` is ordinary C-style integer division (rounding down)
// Now `q` is `a` divided by `b` rounded up
Combining this with the second step we get the following
aligned_a = (a + b - 1) / b * b
In aligned_a you get the desired aligned value.
Applying this algorithm to the problem at hand one would arrive at the following implementation of _INTSIZEOF macro
#define _INTSIZEOF(n)\
( (sizeof(n) + sizeof(int) - 1) / sizeof(int) * sizeof(int) )
This is already good enough.
However, if you know in advance that the alignment boundary is a power of 2, you can "optimize" the calculations by replacing the divide+multiply sequence with a simple bitwise operation
aligned_a = (a + b - 1) & ~(b - 1)
That is exactly what's done in the above original implementation of _INTSIZEOF macro.
This "optimization" may make sense with some compilers (although I would expect a modern compiler to figure it out by itself). However, considering that the above _INTSIZEOF(n) macro is apparently intended to serve as a compile-time expression (it does not depend on any run-time values, barring VLA objects/types passed as n), there's not much point in optimizing it that way.
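The two formulas above can be compared directly. The following is a minimal sketch, not taken from any particular header; the function names align_up_div and align_up_mask are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round a up to the next multiple of b by dividing (rounding up) and
   multiplying back. Works for any b > 0, provided a + b - 1 does not wrap. */
static size_t align_up_div(size_t a, size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b * b;
}

/* Same result via a bit mask. Only valid when b is a power of two. */
static size_t align_up_mask(size_t a, size_t b)
{
    assert(b > 0 && (b & (b - 1)) == 0);
    return (a + b - 1) & ~(b - 1);
}
```

With b = sizeof(int), align_up_mask(sizeof(n), b) computes the same value as _INTSIZEOF(n) on platforms where sizeof(int) is a power of two.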
Here's a hint:
A common method to do ceil(a/b) is:
(a + (b-1)) / b
b * ( (a + b - 1) / b ) = (a + b - 1) & ~(b - 1)
To see why the above holds, consider this:
Part I (why q = (a + b - 1) / b produces the number we are looking for):
... note that we want q to be the number of b's that are in a, but rounded up (i.e., if after integer division, there is a remainder, then that remainder should be rounded up to b and hence q incremented by 1).
there exists Q and R such that a = Qb + R, and hence a + b - 1 = Qb + b - 1 + R. If we perform integer division on a + b - 1 by b we would get Q + (b-1+R)/b. The 2nd part of this will be zero if R is zero and 1 if R is not zero (note R is guaranteed to be less than b).
Part II (the macro):
now if b is a power of two, then integer division of a + b - 1 by b is simply a right shift of the exponent of b (i.e., b = 2^n, then shift right by n places).
in addition, multiplication by b is a left shift (shift left by n places)
hence combined, all we are doing is clearing the rightmost n bits to zero, and this is accomplished by masking: ~(b-1) gives us 1111...111000...0 where the number of 1s is equal to n (b = 2^n)
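Part II can be checked with a small sketch that performs the shift pair literally and compares it with the mask. The helper names are hypothetical, and n is kept small so that nothing overflows.

```c
#include <assert.h>

/* Round a up to a multiple of b = 2^n by shifting right, then left, by n. */
static unsigned round_up_shift(unsigned a, unsigned n)
{
    assert(n < 16);                    /* keep the sketch far from overflow */
    unsigned b = 1u << n;
    return ((a + b - 1) >> n) << n;    /* divide by b, then multiply by b */
}

/* The same rounding, clearing the low n bits with the mask ~(b - 1). */
static unsigned round_up_mask(unsigned a, unsigned n)
{
    assert(n < 16);
    unsigned b = 1u << n;
    return (a + b - 1) & ~(b - 1);
}
```

For example, with n = 2 (b = 4) the value 13 becomes 16 with either function, and 12 stays 12.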
I am trying to implement big integer addition in CUDA using the following code
__global__ void add(unsigned *A, unsigned *B, unsigned *C/*output*/, int radix){
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    A[id ] = A[id] + B[id];
    C[id ] = A[id]/radix;
    __syncthreads();
    A[id] = A[id]%radix + ((id>0)?C[id -1]:0);
    __syncthreads();
    C[id] = A[id];
}
but it does not work properly, and I also don't know how to handle the extra carry bit. Thanks
TL;DR build a carry-lookahead adder where each individual adder adds modulo radix, instead of modulo 2
Additions need incoming carries
The problem in your model is that you have a rippling carry. See Rippling carry adders.
If you were in an FPGA that wouldn't be a problem because they have dedicated logic to do that fast (carry chains, they're cool). But alas, you're on a GPU !
That is, for a given id, you only know the input carry (thus whether you are going to sum A[id]+B[id] or A[id]+B[id]+1) when all the sums with smaller id values have been computed. As a matter of fact, initially, you only know the first carry.
A[3]+B[3] + ? A[2]+B[2] + ? A[1]+B[1] + ? A[0]+B[0] + 0
| | | |
v v v v
C[3] C[2] C[1] C[0]
Characterize the carry output
And each sum also has a carry output, which isn't on the drawing. So you have to think of the addition in this larger scheme as a function with 3 inputs and 2 outputs : (C, c_out) = add(A, B, c_in)
In order to not wait O(n) for the sum to complete (where n is the number of items your sum is cut into), you can precompute all the possible results at each id. That isn't such a huge load of work, since A and B don't change, only the carries. So you have 2 possible outputs : (c_out0, C) = add(A, B, 0) and (c_out1, C') = add(A, B, 1).
Now with all these results, we need to basically implement a carry lookahead unit.
For that, we need to figure out two functions of each sum's carry output, P and G :
P a.k.a. all of the following definitions
Propagate
"if a carry comes in, then a carry will go out of this sum"
c_out1 && !c_out0
A + B == radix-1
G a.k.a. all of the following definitions
Generate
"whatever carry comes in, a carry will go out of this sum"
c_out1 && c_out0
c_out0
A + B >= radix
So in other terms, c_out = G or (P and c_in). So now we have a start of an algorithm that can tell us easily for each id the carry output as a function of its carry input directly :
1. At each id, compute C[id] = A[id]+B[id]+0
2. Get G[id] = C[id] > radix -1
3. Get P[id] = C[id] == radix-1
Logarithmic tree
Now we can finish in O(log(n)), even though treeish things are nasty on GPUs, but still shorter than waiting. Indeed, from 2 additions next to each other, we can get a group G and a group P :
4. For id and id+1 :
5. step = 2
6. if id % step == 0, do steps 6 through 10, otherwise, do nothing
7. group_P = P[id] and P[id+step/2]
8. group_G = (P[id+step/2] and G[id]) or G[id+step/2]
9. c_in[id+step/2] = G[id] or (P[id] and c_in[id])
10. step = step * 2
11. if step < n, go to 5
At the end (after repeating steps 5-10 for every level of your tree, with fewer ids every time), everything will be expressed in terms of the Ps and Gs which you computed, and c_in[0], which is 0. On the Wikipedia page there are formulas for grouping by 4 instead of 2, which will get you an answer in O(log_4(n)) instead of O(log_2(n)).
Hence the end of the algorithm :
12. At each id, get c_in[id]
13. return (C[id]+c_in[id]) % radix
Take advantage of hardware
What we really did in this last part was mimic the circuitry of a carry-lookahead adder with logic. However, we already have adders in the hardware that do similar things (by definition).
Let us replace our definitions of P and G based on radix by those based on 2, like the logic inside our hardware, mimicking a sum of 2 bits a and b at each stage : P = a ^ b (xor) and G = a & b (bitwise and). In other words, a = P | G and b = G. So if we create an integer intP and an integer intG, where each bit is respectively the P and G we computed from each id's sum (limiting us to 64 sums), then the addition (intP | intG) + intG has the exact same carry propagation as our elaborate logical scheme.
The reduction to form these integers will still be a logarithmic operation I guess, but that was to be expected.
The interesting part is that each bit of the sum is a function of its carry input. Indeed, every bit of the sum is ultimately a function of 3 bits: (a+b+c_in) % 2.
If at that bit P == 1, then a + b == 1, thus a+b+c_in % 2 == !c_in
Otherwise, a+b is either 0 or 2, and a+b+c_in % 2 == c_in
Thus we can trivially form the integer (or rather bit-array) int_cin = ((P|G)+G) ^ P with ^ being xor.
Thus we have an alternate ending to our algorithm, replacing steps 4 and later :
4. at each id, shift P and G by id : P = P << id and G = G << id
5. do an OR-reduction to get intG and intP which are the OR of all the P and G for id 0..63
6. Compute (once) int_cin = ((P|G)+G) ^ P
7. at each id, get c_in = (int_cin & (1ULL << id)) ? 1 : 0;
8. return (C[id]+c_in) % radix
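The alternate ending can be sketched sequentially in plain C to check the bit trick before porting it to a kernel. This is only an illustration under stated assumptions: the function name add_digits is made up, digits are stored least significant first, at most 64 digits are handled, and each loop iteration stands in for what one thread would do.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two n-digit numbers in base radix (little-endian digit arrays).
   Writes n digits to sum[] and returns the final carry out (0 or 1).
   Assumes n <= 64, every digit < radix, and 2 * radix fits in uint32_t. */
static unsigned add_digits(const uint32_t *a, const uint32_t *b,
                           uint32_t *sum, int n, uint32_t radix)
{
    uint64_t intP = 0, intG = 0;
    assert(n >= 1 && n <= 64);

    /* Per-digit sums with carry-in 0, plus the OR-reduction of P and G. */
    for (int id = 0; id < n; ++id) {
        sum[id] = a[id] + b[id];
        if (sum[id] == radix - 1) intP |= (uint64_t)1 << id;  /* propagate */
        if (sum[id] >= radix)     intG |= (uint64_t)1 << id;  /* generate  */
    }

    /* One hardware addition resolves the whole carry chain. */
    uint64_t int_cin = ((intP | intG) + intG) ^ intP;

    /* Carry out of the top digit: c_out = G or (P and c_in). */
    unsigned top_cin = (unsigned)((int_cin >> (n - 1)) & 1);
    unsigned top_p   = (unsigned)((intP >> (n - 1)) & 1);
    unsigned top_g   = (unsigned)((intG >> (n - 1)) & 1);

    /* Apply each digit's carry-in and reduce modulo radix. */
    for (int id = 0; id < n; ++id) {
        uint32_t c_in = (uint32_t)((int_cin >> id) & 1);
        sum[id] = (sum[id] + c_in) % radix;
    }
    return top_g | (top_p & top_cin);
}
```

For instance, adding 999 and 1 in base 10 gives digits 0, 0, 0 and a carry out of 1, which is the "extra carry bit" from the question.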
PS : Also, watch out for integer overflow in your arrays, if radix is big. If it isn't then the whole thing doesn't really make sense I guess...
PPS : in the alternate ending, if you have more than 64 items, characterize them by their P and G as if radix was 2^64, and re-run the same steps at a higher level (reduction, get c_in) and then get back to the lower level apply 7 with P+G+carry in from higher level
Let's suppose we have normally distributed random int values from function:
unsigned int myrand();
The commonest way to shrink its range to [0, A] (int A) is to do as follows:
(double)myrand() / UINT_MAX * A
Now I need to do the same for values in range of __int64:
unsigned __int64 max64;
unsigned __int64 r64 = myrand();
r64 <<= 32;
r64 |= myrand();
r64 = normalize(r64, max64);
The problem is to normalize return range by some __int64 because it could not be placed in double. I wouldn't like to use various libraries for big numbers due to performance reasons. Is there a way to shrink return range quickly and easily while saving normal distribution of values?
The method that you give
(double)myrand() / UINT_MAX * A
is already broken. For example, if A = 1 and you want integers in the range [0, 1] you will only ever get a value of 1 if myrand () returned UINT_MAX. If you meant the range [0, A), that is only the value 0, then it is still broken because it will in that case return a value outside the range. No matter what, you are introducing a bias.
If you want A+1 different values from 0 to A inclusive, and 2^32 ≤ A < 2^64, you proceed as follows:
Step 1: Calculate a 64 bit random number R as you did. If A is one less than a power of two, you return R shifted by the right amount.
Step 2: Find how many different random values would be mapped to the same output value. Mathematically, that number is floor (2^64 / (A + 1)). 2^64 is too large, but that is no problem because it is equal to 1 + floor ((2^64 - (A + 1)) / (A + 1)), calculated in C or C++ as D = 1 + (- (A + 1)) / (A + 1) if A has type uint64_t.
Step 3: Find how many different random values should be mapped by calculating N = D * (A + 1). If R >= N then go back to Step 1.
Step 4: Return R / D.
No floating point arithmetic needed. The result is totally unbiased. If A < 2^32 you fall back to the 32 bit version (or you use the 64 bit version as well, but it calls myrand() twice as often as needed).
Of course you calculate D and N only once unless A changes.
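Steps 1 to 4 could look roughly like this in C. It is a sketch under assumptions: the myrand() below is a stand-in xorshift32 generator (it never returns 0, so it is only approximately uniform) and should be replaced by the real source; rand64 and rand_upto are made-up names; and the power-of-two case uses a mask on the low bits rather than the shift mentioned in Step 1.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the question's myrand(); hypothetical, replace with your own. */
static uint32_t rng_state = 2463534242u;   /* any nonzero seed */

static uint32_t myrand(void)
{
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 17;
    rng_state ^= rng_state << 5;
    return rng_state;
}

/* Step 1: glue two 32-bit values into one 64-bit value. */
static uint64_t rand64(void)
{
    uint64_t r = myrand();
    r <<= 32;
    r |= myrand();
    return r;
}

/* Returns a value in [0, A] without modulo or floating-point bias. */
static uint64_t rand_upto(uint64_t A)
{
    uint64_t m = A + 1;            /* number of outputs; wraps to 0 for A == UINT64_MAX */
    if ((m & (m - 1)) == 0)        /* m is 0 or a power of two: keep the low bits */
        return rand64() & A;

    uint64_t D = 1 + (0 - m) / m;  /* Step 2: floor(2^64 / m) */
    uint64_t N = D * m;            /* Step 3: count of usable raw values */
    assert(N != 0);                /* cannot wrap, since m is not a power of two */

    uint64_t R;
    do {
        R = rand64();              /* reject raw values in the biased tail */
    } while (R >= N);
    return R / D;                  /* Step 4 */
}
```

As noted above, D and N depend only on A, so a caller that draws many values for the same A may want to compute them once and cache them.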
Maybe you can use "long double" if it is available in your platform.
Given three integers, a, b and c with a,b <= c < INT_MAX I need to compute (a * b) % c but a * b can overflow if the values are too large, which gives the wrong result.
Is there a way to compute this directly through bithacks, i.e. without using a type that won't overflow for the values in question?
Karatsuba's algorithm is not really needed here. It is enough to split your operands just once.
Let's say, for simplicity's sake, that your numbers are 64-bit unsigned integers. Let k=2^32. Then
a=a1+k*a2
b=b1+k*b2
(a1+k*a2)*(b1+k*b2) % c =
(a1*b1 % c + k*a1*b2 % c + k*a2*b1 % c + k*k*a2*b2 % c) % c
Now a1*b1 % c can be computed immediately, the rest could be computed by alternately performing x <<= 1 and x %= c 32 or 64 times (since (u*v)%c=((u%c)*v)%c). This could ostensibly overflow if c >= 2^63. However, the nice thing is that this pair of operations need not be performed literally. Either x < c/2 and then you only need a shift (and there's no overflow), or x >= c/2 and
2*x % c = 2*x - c = x - (c-x).
(and there's no overflow again).
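The overflow-free doubling step can be turned into a complete routine. The sketch below is a simpler bit-by-bit (double-and-add) variant rather than the 32-bit split described above; it relies on the same observation that 2*x % c never needs to be computed literally. The names double_mod, add_mod and mul_mod are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* (2 * x) % c for x < c, without ever forming 2 * x. */
static uint64_t double_mod(uint64_t x, uint64_t c)
{
    return (x < c - x) ? x + x : x - (c - x);
}

/* (x + y) % c for x, y < c, without ever forming x + y when it would wrap. */
static uint64_t add_mod(uint64_t x, uint64_t y, uint64_t c)
{
    return (x >= c - y) ? x - (c - y) : x + y;
}

/* (a * b) % c for any 64-bit operands, using only 64-bit arithmetic. */
static uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t c)
{
    assert(c > 0);
    uint64_t result = 0;
    a %= c;
    while (b > 0) {
        if (b & 1)
            result = add_mod(result, a, c);  /* add the current multiple of a */
        a = double_mod(a, c);                /* a becomes 2a mod c */
        b >>= 1;
    }
    return result;
}
```

This takes up to 64 iterations per product, so it is slower than a 128-bit multiply where one is available, but it depends on nothing beyond standard unsigned arithmetic.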
Several of the major compilers offer a 128-bit integer type, with which you can do this computation without overflow.
I have come across this problem many times but I am unable to solve it. There is always some case or other that gives a wrong answer, or else the program I write is too slow. Formally I am talking about calculating
nCk mod p where p is a prime n is a large number, and 1<=k<=n.
What have I tried:
I know the recursive formulation of factorial and then modelling it as a dynamic programming problem, but I feel that it is slow. The recursive formulation is (nCk) + (nCk-1) = (n+1Ck). I took care of the modulus while storing values in array to avoid overflows but I am not sure that just doing a mod p on the result will avoid all overflows as it may happen that one needs to remove.
To compute nCr, there's a simple algorithm based on the rule nCr = (n - 1)C(r - 1) * n / r:
def nCr(n,r):
    if r == 0:
        return 1
    return n * nCr(n - 1, r - 1) // r
Now in modular arithmetic we don't quite have division, but we have modular inverses, which (when working modulo a prime p, and as long as r is not a multiple of p) are just as good
def nCrModP(n, r, p):
    if r == 0:
        return 1
    return n * nCrModP(n - 1, r - 1, p) * modinv(r, p) % p
Here's one implementation of modinv on rosettacode
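For comparison, here is a C sketch of the same idea with the recursion unrolled into a loop. It does not use the Rosetta Code modinv; instead it assumes p is prime and gets the inverse from Fermat's little theorem (x^(p-2) mod p). The names pow_mod and ncr_mod_p are hypothetical, and p is restricted to less than 2^32 so that products fit in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* base^exp mod p by repeated squaring; assumes p < 2^32. */
static uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t p)
{
    uint64_t result = 1 % p;
    base %= p;
    while (exp > 0) {
        if (exp & 1)
            result = result * base % p;
        base = base * base % p;
        exp >>= 1;
    }
    return result;
}

/* C(n, r) mod p for prime p < 2^32 and r < p (so that r! is invertible). */
static uint64_t ncr_mod_p(uint64_t n, uint64_t r, uint64_t p)
{
    assert(p >= 2 && p < (1ULL << 32) && r <= n && r < p);
    uint64_t num = 1, den = 1;
    for (uint64_t i = 1; i <= r; ++i) {
        num = num * ((n - r + i) % p) % p;  /* (n-r+1) * ... * n */
        den = den * i % p;                  /* r! */
    }
    /* Divide by r! by multiplying with its inverse den^(p-2) mod p. */
    return num * pow_mod(den, p - 2, p) % p;
}
```

Computing a single inverse at the end, rather than one per step, keeps the cost at O(r + log p) multiplications.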
Not sure what you mean by "storing values in array", but I assume the array serves as a lookup table while running, to avoid redundant calculations and speed things up. This should take care of the speed problem. Regarding the overflows: you can perform the modulo operation at any stage of the computation and repeat it as often as you want, and the result will still be correct.
First, let's work with the case where p is relatively small.
Take the base-p expansions of n and k: write n = n_0 + n_1 p + n_2 p^2 + ... + n_m p^m and k = k_0 + k_1 p + ... + k_m p^m where each n_i and each k_i is at least 0 but less than p. A theorem (which I think is due to Edouard Lucas) states that C(n,k) = C(n_0, k_0) * C(n_1, k_1) * ... * C(n_m, k_m) (mod p), where C(n_i, k_i) is taken to be 0 whenever k_i > n_i. This reduces to taking a mod-p product of numbers in the "n is relatively small" case below.
Second, if n is relatively small, you can just compute binomial coefficients using dynamic programming on the formula C(n,k) = C(n-1,k-1) + C(n-1,k), reducing mod p at each step. Or do something more clever.
Third, if k is relatively small (and less than p), you should be able to compute n!/(k!(n-k)!) mod p by computing n!/(n-k)! as n * (n-1) * ... * (n-k+1), reducing modulo p after each product, then multiplying by the modular inverses of each number between 1 and k.
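The first two cases combine naturally: build Pascal's triangle mod p once, then multiply the per-digit binomials given by Lucas' theorem. The following is a sketch under assumptions; the name lucas_ncr and the bound MAX_P are invented for the example, and the table is rebuilt on every call for simplicity.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_P 64  /* hypothetical upper bound on p for this sketch */

/* C(n, k) mod p for a small prime p, via Lucas' theorem. */
static uint32_t lucas_ncr(uint64_t n, uint64_t k, uint32_t p)
{
    static uint32_t table[MAX_P][MAX_P];
    assert(p >= 2 && p <= MAX_P);
    if (k > n)
        return 0;

    /* Pascal's triangle mod p: table[i][j] = C(i, j) mod p for j <= i < p. */
    for (uint32_t i = 0; i < p; ++i) {
        table[i][0] = 1 % p;
        for (uint32_t j = 1; j <= i; ++j) {
            uint32_t above = (j < i) ? table[i - 1][j] : 0;
            table[i][j] = (table[i - 1][j - 1] + above) % p;
        }
    }

    /* Multiply C(n_i, k_i) over the base-p digits of n and k. */
    uint32_t result = 1;
    while (n > 0 || k > 0) {
        uint32_t ni = (uint32_t)(n % p);
        uint32_t ki = (uint32_t)(k % p);
        if (ki > ni)
            return 0;              /* C(n_i, k_i) = 0 makes the product 0 */
        result = result * table[ni][ki] % p;
        n /= p;
        k /= p;
    }
    return result;
}
```

For example, C(10, 3) = 120 and 120 mod 7 = 1; in base 7 the digits are (1, 3) and (0, 3), giving C(3, 3) * C(1, 0) = 1.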
Let us say we have x and y and both are signed integers in C, how do we find the most accurate mean value between the two?
I would prefer a solution that does not take advantage of any machine/compiler/toolchain specific workings.
The best I have come up with is:(a / 2) + (b / 2) + !!(a % 2) * !!(b %2) Is there a solution that is more accurate? Faster? Simpler?
What if we know if one is larger than the other a priori?
Thanks.
Editor's Note: Please note that the OP expects answers that are not subject to integer overflow when input values are close to the maximum absolute bounds of the C int type. This was not stated in the original question, but is important when giving an answer.
Added after the accepted answer (4 years later)
I would expect the function int average_int(int a, int b) to:
1. Work over the entire range of [INT_MIN..INT_MAX] for all combinations of a and b.
2. Have the same result as (a+b)/2, as if using wider math.
When a type int2x (an integer type twice as wide as int) exists, #Santiago Alessandri's approach works well.
int avgSS(int a, int b) {
    return (int) ( ((int2x) a + b) / 2);
}
Otherwise a variation on #AProgrammer:
Note: wider math is not needed.
int avgC(int a, int b) {
    if ((a < 0) == (b < 0)) { // a,b same sign
        return a/2 + b/2 + (a%2 + b%2)/2;
    }
    return (a+b)/2;
}
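The claim that avgC matches (a+b)/2 computed in wider math can be spot-checked on edge cases. This is a minimal sketch, assuming long long is wider than int; avg_ref and count_mismatches are hypothetical helper names, and avgC is copied from above.

```c
#include <assert.h>
#include <limits.h>

/* Copy of avgC from above. */
static int avgC(int a, int b)
{
    if ((a < 0) == (b < 0)) {            /* a, b same sign */
        return a/2 + b/2 + (a%2 + b%2)/2;
    }
    return (a + b) / 2;                  /* opposite signs: sum cannot overflow */
}

/* Reference result using wider math. */
static int avg_ref(int a, int b)
{
    assert(sizeof(long long) > sizeof(int));
    return (int)(((long long)a + b) / 2);
}

/* Counts disagreements between avgC and avg_ref over a grid of edge cases. */
static int count_mismatches(void)
{
    static const int v[] = { INT_MIN, INT_MIN + 1, -3, -2, -1, 0,
                             1, 2, 3, INT_MAX - 1, INT_MAX };
    const int n = (int)(sizeof v / sizeof v[0]);
    int bad = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (avgC(v[i], v[j]) != avg_ref(v[i], v[j]))
                ++bad;
    return bad;
}
```

A grid like this is no proof, but it covers the sign combinations and the extremes where the other candidates listed below go wrong.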
A solution with more tests, but without %
All below solutions "worked" to within 1 of (a+b)/2 when overflow did not occur, but I was hoping to find one that matched (a+b)/2 for all int.
#Santiago Alessandri Solution works as long as the range of int is narrower than the range of long long - which is usually the case.
((long long)a + (long long)b) / 2
#AProgrammer, the accepted answer, fails about 1/4 of the time to match (a+b)/2. Example inputs like a == 1, b == -2
a/2 + b/2 + (a%2 + b%2)/2
#Guy Sirton, Solution fails about 1/8 of the time to match (a+b)/2. Example inputs like a == 1, b == 0
int sgeq = ((a<0)==(b<0));
int avg = ((!sgeq)*(a+b)+sgeq*(b-a))/2 + sgeq*a;
#R.., Solution fails about 1/4 of the time to match (a+b)/2. Example inputs like a == 1, b == 1
return (a-(a|b)+b)/2+(a|b)/2;
#MatthewD, now deleted solution fails about 5/6 of the time to match (a+b)/2. Example inputs like a == 1, b == -2
unsigned diff;
signed mean;
if (a > b) {
    diff = a - b;
    mean = b + (diff >> 1);
} else {
    diff = b - a;
    mean = a + (diff >> 1);
}
If (a^b)<=0 you can just use (a+b)/2 without fear of overflow.
Otherwise, try (a-(a|b)+b)/2+(a|b)/2. -(a|b) is at least as large in magnitude as both a and b and has the opposite sign, so this avoids the overflow.
I did this quickly off the top of my head so there might be some stupid errors. Note that there are no machine-specific hacks here. All behavior is completely determined by the C standard and the fact that it requires twos-complement, ones-complement, or sign-magnitude representation of signed values and specifies that the bitwise operators work on the bit-by-bit representation. Nope, the relative magnitude of a|b depends on the representation...
Edit: You could also use a+(b-a)/2 when they have the same sign. Note that this will give a bias towards a. You can reverse it and get a bias towards b. My solution above, on the other hand, gives bias towards zero if I'm not mistaken.
Another try: One standard approach is (a&b)+(a^b)/2. In twos complement it works regardless of the signs, but I believe it also works in ones complement or sign-magnitude if a and b have the same sign. Care to check it?
Edit: version fixed by #chux - Reinstate Monica:
if ((a < 0) == (b < 0)) { // a,b same sign
    return a/2 + b/2 + (a%2 + b%2)/2;
} else {
    return (a+b)/2;
}
Original answer (I'd have deleted it if it hadn't been accepted).
a/2 + b/2 + (a%2 + b%2)/2
Seems the simplest one fitting the bill of no assumption on implementation characteristics (it has a dependency on C99, which specifies the result of / as "truncated toward 0", while it was implementation-defined in C90).
It has the advantage of having no test (and thus no costly jumps) and all divisions/remainder are by 2 so the use of bit twiddling techniques by the compiler is possible.
For unsigned integers the average is the floor of (x+y)/2. The same formula fails for signed integers whose sum is an odd negative number, because the floor is then one less than the average truncated toward zero.
You can read up more at Hacker's Delight in section 2.5
The code to calculate average of 2 signed integers without overflow is
int t = (a & b) + ((a ^ b) >> 1);
unsigned t_u = (unsigned)t;
int avg = t + ( (t_u >> 31 ) & (a ^ b) );
(this assumes a 32-bit int and an arithmetic right shift for negative values)
I have checked its correctness using the Z3 SMT solver
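The same formula can be wrapped in a function with fixed-width types to make the assumptions explicit. This is a sketch, not the book's listing: the name avg_hd is made up, and the code relies on >> being an arithmetic shift for negative int32_t values, which the C standard leaves implementation-defined (the assert checks it at run time).

```c
#include <assert.h>
#include <stdint.h>

/* Average of two 32-bit signed integers, rounded toward zero, no overflow. */
static int32_t avg_hd(int32_t a, int32_t b)
{
    assert((-1 >> 1) == -1);   /* sketch relies on arithmetic right shift */

    /* Floor average: shared bits plus half of the differing bits. */
    int32_t t = (a & b) + ((a ^ b) >> 1);

    /* If t is negative and a + b is odd, floor rounded away from zero:
       add 1 to get truncation toward zero instead. */
    uint32_t fix = ((uint32_t)t >> 31) & (uint32_t)(a ^ b);
    return t + (int32_t)fix;
}
```

For example, avg_hd(1, -2) yields 0 and avg_hd(-3, -2) yields -2, matching (a+b)/2 in C99.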
Just a few observations that may help:
"Most accurate" isn't necessarily unique with integers. E.g. for 1 and 4, 2 and 3 are an equally "most accurate" answer. Mathematically (not C integers):
(a+b)/2 = a+(b-a)/2 = b+(a-b)/2
Let's try breaking this down:
If sign(a)!=sign(b) then a+b will not overflow. This case can be determined by comparing the most significant bit in a two's complement representation.
If sign(a)==sign(b) then if a is greater than b, (a-b) will not overflow. Otherwise (b-a) will not overflow. EDIT: Actually neither will overflow.
What are you trying to optimize exactly? Different processor architectures may have different optimal solutions. For example, in your code replacing the multiplication with an AND may improve performance. Also in a two's complement architecture you can simply (a & b & 1).
I'm just going to throw some code out, not looking too fast but perhaps someone can use and improve:
int sgeq = ((a<0)==(b<0));
int avg = ((!sgeq)*(a+b)+sgeq*(b-a))/2 + sgeq*a;
I would do this, convert both to long long(64 bit signed integers) add them up, this won't overflow and then divide the result by 2:
((long long)a + (long long)b) / 2
If you want the decimal part, store it as a double.
It is important to note that the result will fit in a 32 bit integer.
If you are using the highest-rank integer, then you can use:
((double)a + (double)b) / 2
This answer fits to any number of integers:
int[] array = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
decimal avg = 0;
for (int i = 0; i < array.Length; i++){
    avg = (array[i] - avg) / (i+1) + avg;
}
// expects avg == 5.0 for this test
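The same running-mean update can be written in C. This is a sketch that swaps decimal for double (so results are approximate for large inputs); the function name running_mean is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of n integers, updated one element at a time so that no
   running sum is ever formed (and therefore none can overflow). */
static double running_mean(const int *values, size_t n)
{
    assert(n > 0);
    double avg = 0.0;
    for (size_t i = 0; i < n; ++i) {
        /* Move avg toward values[i] by 1/(i+1) of the distance. */
        avg += ((double)values[i] - avg) / (double)(i + 1);
    }
    return avg;
}
```

Because each step only stores the current mean, the intermediate values stay within the range of the inputs, unlike a plain sum followed by one division.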