Algorithm to determine the maximum sum array by complementing bit mask - arrays

This question posed by my co-worker bamboozled me. I cannot even come up with a clean brute-force solution. To state the problem:
Given an array of size n containing non-negative integers, k = [10, 40, 1, 200, 5000, ..., n], a bit mask of size n, i.e. mask = 1001....b_n where |mask| = n. and an integer representing contiguous bits that can be complemented S = 3, find a configuration of mask that yields maximum sum array.
The complement size S is used to pick S contiguous bit from the bit mask and replacing it by its complement.
For example, if mask = 100001 with S = 2 you could
Change mask to 010001 by applying the mask at MSB
You can iteratively keep on complementing at any bit in the mask till you find array of maximum size.
Here is what I've come up:
Find all the 2^n bit mask configurations then apply them to find the maximum sum array
Given the initial mask configuration see if there exists a path to the maximum sum array configuration found in step 1.
Again mine is an exponential solution. Any efficient ones are appreciated.

Start off with the trivial observation that you would never apply your given bitmask G, which simply consists of S 1s, more than once on the same stretch of your original mask, M - this is because bitwise xor is commutative and associative allowing you to reorder as you please, and xor'ing a bitmask to itself gives you all 0s.
Given a bitmask B of length S, and an integral index ind in [0,n), let BestSum(ind, B) be the best possible sum that can be obtained on [ind:n) slice of your input array k when M'[ind, ind + S) = B, where M' is the final state of your mask after performing all the operations. Let us write B = b.B', where b is the MSB and consider the two possibilities for b:
b = M[ind] : In this case, you will not apply G at M[ind] and hence BestSum(ind, B) = b*k[ind] + max(BestSum(ind + 1, B'.0), BestSum(ind + 1, B'.1)).
b != M[ind] : In this case, you will apply G at M[ind] and hence BestSum(ind, B) = b*k[ind] + max(BestSum(ind + 1, (~B').0), BestSum(ind + 1, (~B').1)).
This, along with the boundary conditions, gives you a DP with runtime O(n*2^S). The best solution would be max over all BestSum(0, B).
Note that we have brushed all reachability issues under the carpet of "boundary conditions". Let us address that now - if, for a given ind and B, there is no final configuration M' such that M'[ind, ind + S) = B, define BestSum(ind, B) = -inf. That will ensure that the only cases where you need to answer unreachability is indeed the boundary - i.e., ind = n - S. The only values of (n-S, B) that are reachable at (n-S, M[n-S:n)) and (n-S, M[n-S:n) ^ G), thus handling the boundary with ease.

Would the following work?
Use DFS to expand a tree with all the possibilities (do one flip in each depth), the recursion's ending condition is:
Reached a state of all masks are 1
Keep coming back to the same position, means we can never reach the state will all masks are 1. (I am not sure how exactly we can detect this though.)

Related

Find Minimum Operand to Maximize sum of bitwise AND operator

Given an array of integers Arr and an integer K, bitwise AND is to be performed on each element A[i] with an integer X
Let Final sum be defined as follows:
Sum of ( A[i] AND X ) for all values of i ( 0 to length of array-1 )
Return the integer X subject to following constraints:
Final sum should be maximum
X should contain exactly K bits as 1 in its binary representation
If multiple values of X satisfy the above conditions, return the minimum possible X
Input:
Arr : [8,4,2]
K = 2
Output: X=12
12 Contains exactly 2 bits in its binary and is the smallest number that gives maximum possible answer for summation of all (A[i] AND X)
Approach Tried :
Took bitwise OR for all numbers in the array in binary and retained the first K bits of the binary that had 1 , made remaining bits 0, convert back to int
Passed 7/12 Test Cases
Can someone help me out with what mistake am I making with regards to the approach or suggest a better approach ? Thanks in advance.
Consider an input like [ 8, 4, 4, 4 ], K = 1. Your algorithm will give 8 but the correct answer is 4. Just because a given bit is more significant doesn't mean that it will automatically contribute more to the sum, as there might be more than twice as many elements of the array that use a smaller bit.
My suggestion would be to compute a weight for each bit of your potential X -- the number of elements of the array that have that bit set times the value of that bit (2i for bit i). Then find the K bits with the largest weight.
To do this, you need to know how big your integers are -- if they are 32 bits, you need to compute just 32 weights. If they might be bigger you need more. Depending on your programming language you may also need to worry about overflow with your weight calculations (or with the sum calculation -- is this a true sum, or a sum mod 2n for some n?). If some elements of the array might be negative, how are negatives represented (2s complement?) and how does that interact with AND?
Let dp[k][i] represent the maximum sum(a & X), a ∈ array, where i is the highest bit index in X and k is the number of bits in X. Then:
dp[1][i]:
sum(a & 2^i)
dp[k][i]:
sum(a & 2^i) + max(dp[k-1][j])
for j < i
sum(a & 2^i) can be precalculated for all values of i in O(n * m), where m is the word size. max(dp[k-1][j]) is monotonically increasing over j and we want to store the earliest instance of each max to minimise the resulting X.
For each k, therefore, we iterate over m is. Overall time complexity O(k * m + n * m), where m is the word size.

Big integer addition code

I am trying to immplement big integer addition in CUDA using the following code
__global__ void add(unsigned *A, unsigned *B, unsigned *C/*output*/, int radix){
int id = blockIdx.x * blockDim.x + threadIdx.x;
A[id ] = A[id] + B[id];
C[id ] = A[id]/radix;
__syncthreads();
A[id] = A[id]%radix + ((id>0)?C[id -1]:0);
__syncthreads();
C[id] = A[id];
}
but it does not work properly and also i don't now how to handle the extra carry bit. Thanks
TL;DR build a carry-lookahead adder where each individual additionner adds modulo radix, instead of modulo 2
Additions need incoming carries
The problem in your model is that you have a rippling carry. See Rippling carry adders.
If you were in an FPGA that wouldn't be a problem because they have dedicated logic to do that fast (carry chains, they're cool). But alas, you're on a GPU !
That is, for a given id, you only know the input carry (thus whether you are going to sum A[id]+B[id] or A[id]+B[id]+1) when all the sums with smaller id values have been computed. As a matter of fact, initially, you only know the first carry.
A[3]+B[3] + ? A[2]+B[2] + ? A[1]+B[1] + ? A[0]+B[0] + 0
| | | |
v v v v
C[3] C[2] C[1] C[0]
Characterize the carry output
And each sum also has a carry output, which isn't on the drawing. So you have to think of the addition in this larger scheme as a function with 3 inputs and 2 outputs : (C, c_out) = add(A, B, c_in)
In order to not wait O(n) for the sum to complete (where n is the number of items your sum is cut into), you can precompute all the possible results at each id. That isn't such a huge load of work, since A and B don't change, only the carries. So you have 2 possible outputs : (c_out0, C) = add(A, B, 0) and (c_out1, C') = add(A, B, 1).
Now with all these results, we need to basically implement a carry lookahead unit.
For that, we need to figure out to functions of each sum's carry output P and G :
P a.k.a. all of the following definitions
Propagate
"if a carry comes in, then a carry will go out of this sum"
c_out1 && !c_out0
A + B == radix-1
G a.k.a. all of the following definitions
Generate
"whatever carry comes in, a carry will go out of this sum"
c_out1 && c_out0
c_out0
A + B >= radix
So in other terms, c_out = G or (P and c_in). So now we have a start of an algorithm that can tell us easily for each id the carry output as a function of its carry input directly :
At each id, compute C[id] = A[id]+B[id]+0
Get G[id] = C[id] > radix -1
Get P[id] = C[id] == radix-1
Logarithmic tree
Now we can finish in O(log(n)), even though treeish things are nasty on GPUs, but still shorter than waiting. Indeed, from 2 additions next to each other, we can get a group G and a group P :
For id and id+1 :
step = 2
if id % step == 0, do steps 6 through 10, otherwise, do nothing
group_P = P[id] and P[id+step/2]
group_G = (P[id+step/2] and G[id]) or G[id+step/2]
c_in[id+step/2] = G[id] or (P[id] and c_in[id])
step = step * 2
if step < n, go to 5
At the end (after repeating steps 5-10 for every level of your tree with less ids every time), everything will be expressed in terms of Ps and Gs which you computed, and c_in[0] which is 0. On the wikipedia page there are formulas for the grouping by 4 instead of 2, which will get you an answer in O(log_4(n)) instead of O(log_2(n)).
Hence the end of the algorithm :
At each id, get c_in[id]
return (C[id]+c_in[id]) % radix
Take advantage of hardware
What we really did in this last part, was mimic the circuitry of a carry-lookahead adder with logic. However, we already have additionners in the hardware that do similar things (by definition).
Let us replace our definitions of P and G based on radix by those based on 2 like the logic inside our hardware, mimicking a sum of 2 bits a and b at each stage : if P = a ^ b (xor), and G = a & b (logical and). In other words, a = P or G and b = G. So if we create a intP integer and a intG integer, where each bit is respectively the P and G we computed from each ids sum (limiting us to 64 sums), then the addition (intP | intG) + intG has the exact same carry propagation as our elaborate logical scheme.
The reduction to form these integers will still be a logarithmic operation I guess, but that was to be expected.
The interesting part, is that each bit of the sum is function of its carry input. Indeed, every bit of the sum is eventually function of 3 bits a+b+c_in % 2.
If at that bit P == 1, then a + b == 1, thus a+b+c_in % 2 == !c_in
Otherwise, a+b is either 0 or 2, and a+b+c_in % 2 == c_in
Thus we can trivially form the integer (or rather bit-array) int_cin = ((P|G)+G) ^ P with ^ being xor.
Thus we have an alternate ending to our algorithm, replacing steps 4 and later :
at each id, shift P and G by id : P = P << id and G = G << id
do an OR-reduction to get intG and intP which are the OR of all the P and G for id 0..63
Compute (once) int_cin = ((P|G)+G) ^ P
at each id, get `c_in = int_cin & (1 << id) ? 1 : 0;
return (C[id]+c_in) % radix
PS : Also, watch out for integer overflow in your arrays, if radix is big. If it isn't then the whole thing doesn't really make sense I guess...
PPS : in the alternate ending, if you have more than 64 items, characterize them by their P and G as if radix was 2^64, and re-run the same steps at a higher level (reduction, get c_in) and then get back to the lower level apply 7 with P+G+carry in from higher level

Normalize random value of _int64

Let's suppose we have noramlly distributed random int values from function:
unsigned int myrand();
The commonest way to shrink its range to [0, A] (int A) is to do as follows:
(double)rand() / UINT_MAX * A
Now I need to do the same for values in range of __int64:
unsigned __int64 max64;
unsigned __int64 r64 = myrand();
r64 <<= 32;
r64 |= myrand();
r64 = normalize(r64, max64);
The problem is to normalize return range by some __int64 because it could not be placed in double. I wouldn't like to use various libraries for big numbers due to performance reasons. Is there a way to shrink return range quickly and easily while saving normal distribution of values?
The method that you give
(double)myrand() / UINT_MAX * A
is already broken. For example, if A = 1 and you want integers in the range [0, 1] you will only ever get a value of 1 if myrand () returned UINT_MAX. If you meant the range [0, A), that is only the value 0, then it is still broken because it will in that case return a value outside the range. No matter what, you are introducing a bias.
If you want A+1 different values from 0 to A inclusive, and 2^32 ≤ A < 2^64, you proceed as follows:
Step 1: Calculate a 64 bit random number R as you did. If A is one less than a power of two, you return R shifted by the right amount.
Step 2: Find how many different random values would be mapped to the same output value. Mathematically, that number is floor (2^64 / (A + 1)). 2^64 is too large, but that is no problem because it is equal to 1 + floor ((2^64 - (A + 1)) / (A + 1)), calculated in C or C++ as D = 1 + (- (A + 1)) / (A + 1) if A has type uint64_t.
Step 3: Find how many different random values should be mapped by calculating N = D * (A + 1). If R >= N then go back to Step 1.
Step 4: Return R / D.
No floating point arithmetic needed. The result is totally unbiased. If A < 2^32 you fall back to the 32 bit version (or you use the 64 bit version as well, but it calls myrandom () twice as often as needed).
Of course you calculate D and N only once unless A changes.
Maybe you can use "long double" if it is available in your platform.

Find missing element

I saw the following question:
Given an array A of integers, find an integer k that is not present in A. Assume that the integers are 32-bit signed integers.
How to solve it?
Thank you
//// update -- This is the provided solution - but I don't think it is correct ////
Consider a very simple hash function F(x)=x mod (n+1). we can build a bit-vector of length
n+1 that is initialized to 0 and for every element in A, we set bit F(A[i]) to 1.Since there are only n elements in the array, we can find the missing element easily.
I think the above solution is wrong.
For example,
A [ 2, 100, 4 ], then both 4 and 100 will match to the same place.
If I'm interpreting the question correctly, (find any integer not in A) and n is the number of elements in A, then the answer is correct as given.
The fact that the hash function can have collisions doesn't really matter; By the pigeonhole principle in reverse, there will be some bit in the bit vector that is not set - because it has more bits than there are elements in A. The corresponding value is an integer not in A. In the case where the hash function has collisions, there will actually be more than one bit not set, which works in favor of the algorithm's correctness rather than against it.
To illustrate, here's how your example would work out:
A = [2, 100, 4]
n = length(A) = 3
f(x) = x mod 4
map(f,A) = [2, 0, 0]
so the final bit vector will be:
[1,0,1,0]
From there, we can arbitrarily choose any integer corresponding to any 0 bit, which in this case is any odd number.
max (items) + 1
springs to mind. Of course it would fail if one of the elements was 2^31-1 and another was -(2^32).
Failing that, sort them and look for an interval.
Since there appear to be no upperbounds or other constraints:
for (int i = int.MinValue; i <= int.MaxValue; i++)
{
if (! A.Contains(i))
{
// found it !
break;
}
}

How to map a long integer number to a N-dimensional vector of smaller integers (and fast inverse)?

Given a N-dimensional vector of small integers is there any simple way to map it with one-to-one correspondence to a large integer number?
Say, we have N=3 vector space. Can we represent a vector X=[(int16)x1,(int16)x2,(int16)x3] using an integer (int48)y? The obvious answer is "Yes, we can". But the question is: "What is the fastest way to do this and its inverse operation?"
Will this new 1-dimensional space possess some very special useful properties?
For the above example you have 3 * 32 = 96 bits of information, so without any a priori knowledge you need 96 bits for the equivalent long integer.
However, if you know that your x1, x2, x3, values will always fit within, say, 16 bits each, then you can pack them all into a 48 bit integer.
In either case the technique is very simple you just use shift, mask and bitwise or operations to pack/unpack the values.
Just to make this concrete, if you have a 3-dimensional vector of 8-bit numbers, like this:
uint8_t vector[3] = { 1, 2, 3 };
then you can join them into a single (24-bit number) like so:
uint32_t all = (vector[0] << 16) | (vector[1] << 8) | vector[2];
This number would, if printed using this statement:
printf("the vector was packed into %06x", (unsigned int) all);
produce the output
the vector was packed into 010203
The reverse operation would look like this:
uint8_t v2[3];
v2[0] = (all >> 16) & 0xff;
v2[1] = (all >> 8) & 0xff;
v2[2] = all & 0xff;
Of course this all depends on the size of the individual numbers in the vector and the length of the vector together not exceeding the size of an available integer type, otherwise you can't represent the "packed" vector as a single number.
If you have sets Si, i=1..n of size Ci = |Si|, then the cartesian product set S = S1 x S2 x ... x Sn has size C = C1 * C2 * ... * Cn.
This motivates an obvious way to do the packing one-to-one. If you have elements e1,...,en from each set, each in the range 0 to Ci-1, then you give the element e=(e1,...,en) the value e1+C1*(e2 + C2*(e3 + C3*(...Cn*en...))).
You can do any permutation of this packing if you feel like it, but unless the values are perfectly correlated, the size of the full set must be the product of the sizes of the component sets.
In the particular case of three 32 bit integers, if they can take on any value, you should treat them as one 96 bit integer.
If you particularly want to, you can map small values to small values through any number of means (e.g. filling out spheres with the L1 norm), but you have to specify what properties you want to have.
(For example, one can map (n,m) to (max(n,m)-1)^2 + k where k=n if n<=m and k=n+m if n>m--you can draw this as a picture of filling in a square like so:
1 2 5 | draw along the edge of the square this way
4 3 6 v
8 7
if you start counting from 1 and only worry about positive values; for integers, you can spiral around the origin.)
I'm writing this without having time to check details, but I suspect the best way is to represent your long integer via modular arithmetic, using k different integers which are mutually prime. The original integer can then be reconstructed using the Chinese remainder theorem. Sorry this is a bit sketchy, but hope it helps.
To expand on Rex Kerr's generalised form, in C you can pack the numbers like so:
X = e[n];
X *= MAX_E[n-1] + 1;
X += e[n-1];
/* ... */
X *= MAX_E[0] + 1;
X += e[0];
And unpack them with:
e[0] = X % (MAX_E[0] + 1);
X /= (MAX_E[0] + 1);
e[1] = X % (MAX_E[1] + 1);
X /= (MAX_E[1] + 1);
/* ... */
e[n] = X;
(Where MAX_E[n] is the greatest value that e[n] can have). Note that these maximum values are likely to be constants, and may be the same for every e, which will simplify things a little.
The shifting / masking implementations given in the other answers are a generalisation of this, for cases where the MAX_E + 1 values are powers of 2 (and thus the multiplication and division can be done with a shift, the addition with a bitwise-or and the modulus with a bitwise-and).
There is some totally non portable ways to make this real fast using packed unions and direct accesses to memory. That you really need this kind of speed is suspicious. Methods using shifts and masks should be fast enough for most purposes. If not, consider using specialized processors like GPU for wich vector support is optimized (parallel).
This naive storage does not possess any usefull property than I can foresee, except you can perform some computations (add, sub, logical bitwise operators) on the three coordinates at once as long as you use positive integers only and you don't overflow for add and sub.
You'd better be quite sure you won't overflow (or won't go negative for sub) or the vector will become garbage.
#include <stdint.h> // for uint8_t
long x;
uint8_t * p = &x;
or
union X {
long L;
uint8_t A[sizeof(long)/sizeof(uint8_t)];
};
works if you don't care about the endian. In my experience compilers generate better code with the union because it doesn't set of their "you took the address of this, so I must keep it in RAM" rules as quick. These rules will get set off if you try to index the array with stuff that the compiler can't optimize away.
If you do care about the endian then you need to mask and shift.
I think what you want can be solved using multi-dimensional space filling curves. The link gives a lot of references on this, which in turn give different methods and insights. Here's a specific example of an invertible mapping. It works for any dimension N.
As for useful properties, these mappings are related to Gray codes.
Hard to say whether this was what you were looking for, or whether the "pack 3 16-bit ints into a 48-bit int" does the trick for you.

Resources