This question, posed by my co-worker, bamboozled me. I cannot even come up with a clean brute-force solution. To state the problem:
Given an array of size n containing non-negative integers, k = [10, 40, 1, 200, 5000, ..., k_n], a bit mask of size n, i.e. mask = 1001...b_n with |mask| = n, and an integer S (e.g. S = 3) representing the number of contiguous bits that can be complemented, find a configuration of mask that yields the maximum sum array.
The complement size S is used to pick S contiguous bits from the bit mask and replace them by their complement.
For example, if mask = 100001 with S = 2 you could
Change mask to 010001 by applying the complement at the MSB (the first 2 bits).
You can iteratively keep on complementing at any bit position in the mask till you find the maximum sum array.
Here is what I've come up with:
Find all the 2^n bit mask configurations, then apply them to find the maximum sum array.
Given the initial mask configuration, see if there exists a path to the maximum sum array configuration found in step 1.
Again mine is an exponential solution. Any efficient ones are appreciated.
Start off with the trivial observation that you would never apply your given bitmask G, which simply consists of S 1s, more than once on the same stretch of your original mask, M - this is because bitwise xor is commutative and associative allowing you to reorder as you please, and xor'ing a bitmask to itself gives you all 0s.
Given a bitmask B of length S, and an integral index ind in [0,n), let BestSum(ind, B) be the best possible sum that can be obtained on [ind:n) slice of your input array k when M'[ind, ind + S) = B, where M' is the final state of your mask after performing all the operations. Let us write B = b.B', where b is the MSB and consider the two possibilities for b:
b = M[ind] : In this case, you will not apply G at M[ind] and hence BestSum(ind, B) = b*k[ind] + max(BestSum(ind + 1, B'.0), BestSum(ind + 1, B'.1)).
b != M[ind] : In this case, you will apply G at M[ind] and hence BestSum(ind, B) = b*k[ind] + max(BestSum(ind + 1, (~B').0), BestSum(ind + 1, (~B').1)).
This, along with the boundary conditions, gives you a DP with runtime O(n*2^S). The best solution would be max over all BestSum(0, B).
Note that we have brushed all reachability issues under the carpet of "boundary conditions". Let us address that now - if, for a given ind and B, there is no final configuration M' such that M'[ind, ind + S) = B, define BestSum(ind, B) = -inf. That will ensure that the only cases where you need to answer unreachability are indeed at the boundary - i.e., ind = n - S. The only values of (n-S, B) that are reachable are (n-S, M[n-S:n)) and (n-S, M[n-S:n) ^ G), thus handling the boundary with ease.
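For concreteness, here is a hedged sketch of an equivalent forward DP (my own formulation, not the BestSum recurrence above): it scans the array once, and the state is the pattern of the previous S-1 decisions of whether G was applied, which also gives roughly O(n * 2^S) time; the function name and 0-based indexing are assumptions.
#include <algorithm>
#include <bitset>
#include <limits>
#include <vector>

long long maxMaskedSum(const std::vector<long long>& k,
                       const std::vector<int>& mask,   // initial bits, 0/1, size n
                       int S) {                        // assumes 1 <= S <= n
    const int n = (int)k.size();
    const long long NEG = std::numeric_limits<long long>::min() / 4;
    const int W = 1 << (S - 1);                        // window of previous S-1 decisions
    // dp[w]: best sum over bits already fixed, given that the previous S-1
    // "apply G starting here" choices form bit pattern w (most recent = bit 0).
    std::vector<long long> dp(W, NEG), ndp(W);
    dp[0] = 0;                                         // nothing applied yet
    for (int i = 0; i < n; ++i) {
        std::fill(ndp.begin(), ndp.end(), NEG);
        for (int w = 0; w < W; ++w) {
            if (dp[w] == NEG) continue;
            for (int z = 0; z <= 1; ++z) {             // z: apply G starting at i?
                if (z == 1 && i > n - S) break;        // no room for S bits here
                int flips = (int)std::bitset<32>(w).count() + z;
                int bit = mask[i] ^ (flips & 1);       // final value of bit i
                int nw = ((w << 1) | z) & (W - 1);     // slide the decision window
                ndp[nw] = std::max(ndp[nw], dp[w] + bit * k[i]);
            }
        }
        dp.swap(ndp);
    }
    return *std::max_element(dp.begin(), dp.end());
}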
Would the following work?
Use DFS to expand a tree of all the possibilities (do one flip at each depth); the recursion's ending conditions are:
We reached a state where all mask bits are 1.
We keep coming back to the same state, which means we can never reach the state where all mask bits are 1. (I am not sure how exactly we can detect this, though; the sketch below uses a visited set for it.)
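A minimal sketch of that idea, using BFS with an explicit visited set to detect repeated states (assuming n <= 64 so the mask fits in one 64-bit word; names are mine, and this is still exponential in the worst case):
#include <cstdint>
#include <queue>
#include <unordered_set>

bool canReachAllOnes(std::uint64_t mask, int n, int S) {
    const std::uint64_t target = (n == 64) ? ~0ULL : ((1ULL << n) - 1);
    const std::uint64_t G = (S == 64) ? ~0ULL : ((1ULL << S) - 1);   // S contiguous 1s
    std::unordered_set<std::uint64_t> visited{mask};
    std::queue<std::uint64_t> frontier;
    frontier.push(mask);
    while (!frontier.empty()) {
        std::uint64_t m = frontier.front();
        frontier.pop();
        if (m == target) return true;
        for (int p = 0; p + S <= n; ++p) {                 // complement bits p..p+S-1
            std::uint64_t next = m ^ (G << p);
            if (visited.insert(next).second) frontier.push(next);
        }
    }
    return false;                                          // search exhausted: unreachable
}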
Suppose you have a __m128 variable holding 4 SP (single-precision float) values, and you want the minimum one. Is there any intrinsic function available, or anything other than a naive linear comparison among the values?
Right now my solution is the following (suppose the input __m128 variable is x):
x = _mm_min_ps(x, (__m128)_mm_srli_si128((__m128i)x, 4));
min = _mm_min_ss(x, (__m128)_mm_srli_si128((__m128i)x, 8))[0];
This is quite horrible but it works (by the way, is there anything like _mm_srli_si128 but for the __m128 type?).
There is no single instruction/intrinsic but you can do it with two shuffles and two mins:
__m128 _mm_hmin_ps(__m128 v)
{
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v;
}
The output vector will contain the min of all the elements in the input vector, replicated throughout the output vector.
Paul R's answer is great! (@Paul R, if you read this: thank you!) I just wanted to try to explain how it actually works, for anyone new to SSE stuff like me. Of course I might be wrong somewhere, so any corrections are welcome!
How does _mm_shuffle_ps work?
First of all, SSE registers have indexes that go in reverse to what you might expect, like this:
[6, 9, 8, 5] // values
3 2 1 0 // indexes
This order of indexing makes vector left-shifts move data from low to high indices, just like left-shifting the bits in an integer. The most-significant element is at the left.
_mm_shuffle_ps can mix the contents of two registers:
// __m128 a : (a3, a2, a1, a0)
// __m128 b : (b3, b2, b1, b0)
__m128 two_from_a_and_two_from_b = _mm_shuffle_ps(b, a, _MM_SHUFFLE(3, 2, 1, 0));
// (in _MM_SHUFFLE the first two indexes, 3 and 2, pick elements from the second
//  operand a; the last two, 1 and 0, pick elements from the first operand b)
// two_from_a_and_two_from_b : (a3, a2, b1, b0)
Here, we only want to shuffle the values of one register, not two. We can do that by passing v as both parameters, like this (you can see this in Paul R's function):
// __m128 v : (v3, v2, v1, v0)
__m128 v_rotated_left_by_1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3));
// v_rotated_left_by_1 : (v2, v1, v0, v3) // i.e. move all elements left by 1 with wraparound
I'm going to wrap it in a macro for readability though:
#define mm_shuffle_one(v, pattern) _mm_shuffle_ps(v, v, pattern)
(It can't be a function because the pattern argument to _mm_shuffle_ps must be constant at compile time.)
Here's a slightly modified version of the actual function – I added intermediate names for readability, as the compiler optimizes them out anyway:
inline __m128 _mm_hmin_ps(__m128 v) {
    __m128 v_rotated_left_by_1 = mm_shuffle_one(v, _MM_SHUFFLE(2, 1, 0, 3));
    __m128 v2 = _mm_min_ps(v, v_rotated_left_by_1);
    __m128 v2_rotated_left_by_2 = mm_shuffle_one(v2, _MM_SHUFFLE(1, 0, 3, 2));
    __m128 v3 = _mm_min_ps(v2, v2_rotated_left_by_2);
    return v3;
}
Why are we shuffling the elements the way we are? And how do we find the smallest of four elements with just two min operations?
I had some trouble following how you can min 4 floats with just two vectorized min operations, but I understood it when I manually followed which values are min'd together, step by step. (Though it's likely more fun to do it on your own than read it)
Say we've got v:
[7,6,9,5] v
First, we min the values of v and v_rotated_left_by_1:
[7,6,9,5] v
3 2 1 0 // (just the indices of the elements)
[6,9,5,7] v_rotated_left_by_1
2 1 0 3 // (the indexes refer to v, and we rotated it left by 1, so the indices are shifted)
--------- min
[6,6,5,5] v2
3 2 1 0 // (explained
2 1 0 3 // below )
Each column under an element of v2 tracks which indexes of v were min'd together to get that element.
So, going column-wise left to right:
v2[3] == 6 == min(v[3], v[2])
v2[2] == 6 == min(v[2], v[1])
v2[1] == 5 == min(v[1], v[0])
v2[0] == 5 == min(v[0], v[3])
Now the second min:
[6,6,5,5] v2
3 2 1 0
2 1 0 3
[5,5,6,6] v2_rotated_left_by_2
1 0 3 2
0 3 2 1
--------- min
[5,5,5,5] v3
3 2 1 0
2 1 0 3
1 0 3 2
0 3 2 1
Voila! Each column under v3 contains (3,2,1,0) - each element of v3 has been min'd with all the elements of v - so each element contains the minimum of the whole vector v.
After using the function, you can extract the minimum value with float _mm_cvtss_f32(__m128):
__m128 min_vector = _mm_hmin_ps(my_vector);
float minval = _mm_cvtss_f32(min_vector);
***
This is just a tangential thought, but what I found interesting is that this approach could be extended to sequences of arbitrary length, rotating the result of the previous step by 1, 2, 4, 8, ..., 2**ceil(log2(len(v))) (I think) at each step.
That's cool from a theoretical perspective - if you can compare two sequences element-wise simultaneously, you can find the minimum/maximum1 of a sequence in logarithmic time!
1 This extends to all horizontal folds/reductions, like sum. Same shuffles, different vertical operation.
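For instance, a broadcast horizontal sum is the same two rotations with the vertical op changed from min to add (a small sketch of that generalization; the function name is mine):
// Same two rotations as _mm_hmin_ps above, but with a vertical add: every
// element of the result holds v0 + v1 + v2 + v3.
__m128 _mm_hsum_ps(__m128 v)
{
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 1, 0, 3)));
    v = _mm_add_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 0, 3, 2)));
    return v;
}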
However, AVX (256-bit vectors) makes 128-bit boundaries special, and harder to shuffle across. If you only want a scalar result, extract the high half so every step narrows the vector width in half. (Like in Fastest way to do horizontal float vector sum on x86, which has more efficient shuffles than 2x shufps for 128-bit vectors, avoiding some movaps instructions when compiling without AVX.)
But if you want the result broadcast to every element like @Paul R's answer, you'd want to do in-lane shuffles (i.e. rotate within the 4 elements of every lane), then swap halves, or rotate 128-bit lanes.
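For a concrete version of the narrow-first idea for a scalar result (a sketch assuming AVX is available; the function name is mine), fold the high 128-bit half onto the low half and then keep halving within 128 bits:
#include <immintrin.h>

// Reduce a __m256 of 8 floats to its scalar minimum.
float hmin256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);              // elements 0..3
    __m128 hi = _mm256_extractf128_ps(v, 1);            // elements 4..7
    __m128 m  = _mm_min_ps(lo, hi);                     // 8 candidates -> 4
    m = _mm_min_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));  // pairwise mins
    m = _mm_min_ss(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));  // final min in lane 0
    return _mm_cvtss_f32(m);
}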
I am looking for a fast algorithm:
I have an int array of size n, and the goal is to find all patterns in the array where
x1, x2, x3 are different elements in the array and x1 + x2 = x3.
For example, for an int array of size 3, [1, 2, 3], there's only one possibility: 1+2 = 3 (treating 1+2 and 2+1 as the same).
I am thinking about using pairs and hash maps to make the algorithm fast (the fastest one I have now is still O(n^2)).
Please share your ideas for this problem, thank you.
Edit: The answer below applies to a version of this problem in which you only want one triplet that adds up like that. When you want all of them, since there are potentially at least O(n^2) possible outputs (as pointed out by ex0du5), and even O(n^3) in pathological cases of repeated elements, you're not going to beat the simple O(n^2) algorithm based on hashing (mapping from a value to the list of indices with that value).
This is basically the 3SUM problem. With potentially unboundedly large elements, the best known algorithms are approximately O(n^2), but we've only proved that it can't be faster than O(n lg n) for most models of computation.
If the integer elements lie in the range [u, v], you can do a slightly different version of this in O(n + (v-u) lg (v-u)) with an FFT. I'm going to describe a process to transform this problem into that one, solve it there, and then figure out the answer to your problem based on this transformation.
The problem that I know how to solve with FFT is to find a length-3 arithmetic sequence in an array: that is, a sequence a, b, c with c - b = b - a, or equivalently, a + c = 2b.
Unfortunately, the last step of the transformation back isn't as fast as I'd like, but I'll talk about that when we get there.
Let's call your original array X, which contains integers x_1, ..., x_n. We want to find indices i, j, k such that x_i + x_j = x_k.
Find the minimum u and maximum v of X in O(n) time. Let u' be min(u, u*2) and v' be max(v, v*2).
Construct a binary array (bitstring) Z of length v' - u' + 1; Z[i] will be true if either X or its double [x_1*2, ..., x_n*2] contains u' + i. This is O(n) to initialize; just walk over each element of X and set the two corresponding elements of Z.
As we're building this array, we can save the indices of any duplicates we find into an auxiliary list Y. Once Z is complete, we just check for 2 * x_i for each x_i in Y. If any are present, we're done; otherwise the duplicates are irrelevant, and we can forget about Y. (The only situation slightly more complicated is if 0 is repeated; then we need three distinct copies of it to get a solution.)
Now, a solution to your problem, i.e. x_i + x_j = x_k, will appear in Z as three evenly-spaced ones, since some simple algebraic manipulations give us 2*x_j - x_k = x_k - 2*x_i. Note that the elements on the ends are our special doubled entries (from 2X) and the one in the middle is a regular entry (from X).
Consider Z as a representation of a polynomial p, where the coefficient for the term of degree i is Z[i]. If X is [1, 2, 3, 5], then Z is 1111110001 (because we have 1, 2, 3, 4, 5, 6, and 10); p is then 1 + x + x^2 + x^3 + x^4 + x^5 + x^9.
Now, remember from high school algebra that the coefficient of x^c in the product of two polynomials is the sum, over all a, b with a + b = c, of the first polynomial's coefficient for x^a times the second's coefficient for x^b. So, if we consider q = p^2, the coefficient of x^(2j) (for a j with Z[j] = 1) will be the sum over all i of Z[i] * Z[2*j - i]. But since Z is binary, that's exactly the number of triplets of evenly-spaced ones in Z centered at j. Note that (j, j, j) is always such a triplet, so we only care about coefficients with value > 1.
We can then use a Fast Fourier Transform to find p^2 in O(|Z| log |Z|) time, where |Z| is v' - u' + 1. We get out another array of coefficients; call it W.
Loop over each x_k in X. (Recall that our desired evenly-spaced ones are all centered on an element of X, not 2*X.) If the corresponding W for twice this element, i.e. W[2*(x_k - u')], is 1, we know it's not the center of any nontrivial progressions and we can skip it. (As argued before, it should only be a positive integer.)
Otherwise, it might be the center of a progression that we want (so we need to find i and j). But, unfortunately, it might also be the center of a progression that doesn't have our desired form. So we need to check. Loop over the other elements x_i of X, and check whether there's a triple 2*x_i, x_k, 2*x_j for some j, by checking whether Z[2*(x_k - x_i) - u'] is set. If so, we have an answer; if we make it through all of X without a hit, then the FFT found only spurious answers, and we have to check another element of W.
This last step is therefore O(n * (1 + the number of x_k with W[2*(x_k - u')] > 1 that aren't actually centers of solutions)), which may be as bad as O(n^2), which is obviously not okay. There should be a way to avoid generating these spurious answers in the output W; if we knew that any appropriate W coefficient definitely had an answer, this last step would be O(n) and all would be well.
I think it's possible to use a somewhat different polynomial to do this, but I haven't gotten it to actually work. I'll think about it some more....
Partially based on this answer.
It has to be at least O(n^2), as there are n(n-1)/2 different sums to check against the other members. You have to compute all of those, because any pair's sum may equal any other member (start with one example and permute the elements to convince yourself that all must be checked). Or look at the Fibonacci sequence for something concrete.
So calculating that and looking up members in a hash table gives amortised O(n^2). Or use an ordered tree if you need best worst-case.
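A rough sketch of that hash-based O(n^2) pass (names are mine; the value-to-indices map handles duplicates, and as noted above the output itself can be large):
#include <cstdio>
#include <unordered_map>
#include <vector>

void findSumTriples(const std::vector<int>& x) {
    std::unordered_map<long long, std::vector<int>> at;    // value -> indices holding it
    for (int i = 0; i < (int)x.size(); ++i) at[x[i]].push_back(i);

    for (int i = 0; i < (int)x.size(); ++i) {
        for (int j = i + 1; j < (int)x.size(); ++j) {       // 1+2 treated same as 2+1
            long long s = (long long)x[i] + x[j];
            auto it = at.find(s);
            if (it == at.end()) continue;
            for (int k : it->second)
                if (k != i && k != j)                        // three distinct positions
                    std::printf("x[%d] + x[%d] = x[%d]\n", i, j, k);
        }
    }
}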
You essentially need to find all the different sums of value pairs so I don't think you're going to do any better than O(n^2). But you can optimize by sorting the list and reducing duplicate values, then only pairing a value with anything equal or greater, and stopping when the sum exceeds the maximum value in the list.
I am not even sure if this can be done in polynomial time.
Problem:
Given two arrays of real numbers,
A = (a[1], a[2], ..., a[n]),
B = (b[1], b[2], ..., b[n]), (b[j] > 0, j = 1, 2, ..., n)
and a number k, find a subset A' of A (A' = (a[i(1)], a[i(2)], ..., a[i(k)])) which contains exactly k elements, such that (sum a[i(j)])/(sum b[i(j)]) is maximized, where j = 1, 2, ..., k.
For example, if k == 3, and {a[1], a[5], a[7]} is the result, then
(a[1] + a[5] + a[7])/(b[1] + b[5] + b[7])
should be larger than any other combination. Any clue?
Assuming that the entries of B are positive (it sounds as though this special case might be useful to you), there is an O(n^2 log n) algorithm.
Let's first solve the problem of deciding, for a particular t, whether there exists a solution such that
(sum a[i(j)])/(sum b[i(j)]) >= t.
Clearing the denominator, this condition is equivalent to
sum (a[i(j)] - t*b[i(j)]) >= 0.
All we have to do is choose the k largest values of a[i(j)] - t*b[i(j)].
Now, in order to solve the problem when t is unknown, we use a kinetic algorithm. Think of t as being a time variable; we are interested in the evolution of a one-dimensional physical system with n particles having initial positions A and velocities -B. Each particle crosses each other particle at most once, so the number of events is O(n^2). In between crossings, the optimum of sum (a[i(j)] - t*b[i(j)]) changes linearly, because the same subset of size k remains optimal.
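For illustration, here is a hedged sketch of the decision step described above; for the unknown-t part it uses plain bisection on t instead of the kinetic sweep, and the names, bracketing bounds, and iteration count are my own assumptions:
#include <algorithm>
#include <functional>
#include <vector>

// Can we pick k indices with (sum a) / (sum b) >= t, assuming all b[i] > 0 and 1 <= k <= n?
bool canAchieve(double t, const std::vector<double>& a,
                const std::vector<double>& b, int k) {
    std::vector<double> c(a.size());
    for (size_t i = 0; i < a.size(); ++i) c[i] = a[i] - t * b[i];
    // move the k largest values of a[i] - t*b[i] to the front
    std::nth_element(c.begin(), c.begin() + (k - 1), c.end(), std::greater<double>());
    double s = 0;
    for (int i = 0; i < k; ++i) s += c[i];
    return s >= 0;
}

double bestRatio(const std::vector<double>& a,
                 const std::vector<double>& b, int k) {
    double lo = -1e9, hi = 1e9;                  // assumed bounds wide enough for the data
    for (int iter = 0; iter < 100; ++iter) {     // bisect t to the desired tolerance
        double mid = 0.5 * (lo + hi);
        if (canAchieve(mid, a, b, k)) lo = mid; else hi = mid;
    }
    return lo;
}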
If B can contain negative numbers, then this is NP-Hard.
Because of the NP-hardness of this problem:
Given k and an array B, is there a subset of B of size k which sums to zero?
The A becomes immaterial in that case.
Of course, from your comment it seems like B must contain positive numbers.
I'm having a problem understanding a piece of code in an MTD driver.
#define ROUNDUP(x, y) ((((x)+((y)-1))/(y))*(y))
...
static struct mtd_partition bcm947xx_parts[] =
{
{
.name = "boot",
.size = 0,
.offset = 0,
.mask_flags = MTD_WRITEABLE
},
{
.name = "linux",
.size = 0,
.offset = 0
},
{
.name = "rootfs",
.size = 0,
.offset = 0,
.mask_flags = MTD_WRITEABLE
},
{
.name = "nvram",
.size = 0,
.offset = 0
},
{
.name = 0,
.size = 0,
.offset = 0
}
};
...
i = (sizeof(bcm947xx_parts)/sizeof(struct mtd_partition)) - 2;
bcm947xx_parts[i].size = ROUNDUP(NVRAM_SPACE, mtd->erasesize);
bcm947xx_parts[i].offset = size - bcm947xx_parts[i].size;
So here are my questions:
1) why is it necessary to round up the size of the partition?
2) could you help to understand how the rounding works?
3) The flash driver in the boot loader on the same platform doesn't do the rounding for this specific partition, so the flash layout has different offsets on the kernel side and in the bootloader. What is the reason for this?
Thanks in advance for any valuable comments!
(1) Flash memory comes in multiples of its erase size. (Apparently. At least, this is what the quoted code tells me.) This means there is a gap between the end of the NVRAM and whatever comes next. This gap is less than the size of one erase size. In flash, it's convenient to not put two objects with different rewrite schedules in a single erase block -- changing either object requires the flash storage controller to copy the block to a temporary store, apply a partial update to the store, erase the block (slow-ish), and write the updated block to the main store. (It can reuse a different previously erased block and thread it back in the place of the original block. But this is considered a high tech optimization.)
(2) How to parse macros:
((((x)+((y)-1))/(y))*(y))
Step 1, remove the parens around the arguments that make sure that complicated expressions passed as arguments don't suddenly rebind in unexpected ways due to operator precedence.
(((x+(y-1))/y)*y)
Step 2, remove paranoid parens for operations that clearly have the indicated precedence.
(x+y-1)/y*y
Step 3, use your C parsing rules, not your algebra rules. If x and y are integral types (not enough information in your code to be certain of this), then the division is integer division, so translate from C to math.
floor((x+y-1)/y)*y
Step 4, read. If x is a multiple of y, then since y-1 is too small to push the sum up to the next multiple of y, the operation just gives back x. If x is 1 more than a multiple of y, then the +y-1 brings the numerator exactly up to the next multiple of y, and the result is the smallest multiple of y that is larger than x. In fact, if x is anywhere between 1 and y-1 more than a multiple of y, the "+y-1" bumps the numerator up to the next multiple of y (but not as far as the one after it), and the result of rounding up is the smallest multiple of y larger than x.
What we find, therefore is that ROUNDUP(x,y) rounds x up to the smallest multiple of y that happens to be greater than or equal to x. Additionally, this macro evaluates its second argument more than once: don't put expressions with side effects in the second slot unless you want those side effects to happen three times per call. (Consider int i = 3; ROUNDUP(6,i++) and wonder which subexpressions are evaluated before and which after each of the three increments of i.)
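As a quick sanity check of that reading, a tiny illustrative program (the values are arbitrary, chosen to look like an erase size):
#include <stdio.h>

#define ROUNDUP(x, y) ((((x)+((y)-1))/(y))*(y))

int main(void)
{
    printf("%#x\n", ROUNDUP(0x20000u, 0x10000u));   /* exact multiple -> 0x20000 */
    printf("%#x\n", ROUNDUP(0x20001u, 0x10000u));   /* 1 over a multiple -> 0x30000 */
    printf("%#x\n", ROUNDUP(0x2ffffu, 0x10000u));   /* y-1 over a multiple -> 0x30000 */
    return 0;
}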
(3) No idea. No one told the bootloader writer that NVRAMs only come in multiples of erasesize?