SUBSET-SUM, upper bound on number of solutions - theory

As you probably know, the SUBSET-SUM problem is defined as determining whether some subset of a set of whole numbers sums to a specified whole number. (There is another definition of subset-sum, in which a set of integers must sum to zero, but let's use this definition for now.)
For example, ((1,2,4,5),6) is true because (2,4) sums to 6. We say that (2,4) is a "solution".
Furthermore, ((1,5,10),7) is false because no subset of the arguments sums to 7.
My question is: given a set of argument numbers for SUBSET-SUM, is there a polynomial upper bound on the number of possible solutions? In the first example there were two solutions, (2,4) and (1,5).
We know that since SUBSET-SUM is NP-complete, deciding it in polynomial time is probably impossible. However, my question is not about the decision time; I'm asking strictly about the size of the list of solutions.
Obviously the size of the power set of the argument numbers is an upper bound on the size of the solution list, but that grows exponentially. My intuition is that there should be a polynomial bound, but I cannot prove it.
NB: I know this sounds like a homework question, but please trust me, it isn't. I am trying to teach myself certain aspects of CS theory, and this is where my thoughts have taken me.

No; take the numbers
(1, 2, 1+2, 4, 8, 4+8, 16, 32, 16+32, ..., 2^(2n-2), 2^(2n-1), 2^(2n-2)+2^(2n-1))
and ask about forming the sum 1 + 2 + 4 + ... + 2^(2n-2) + 2^(2n-1) = 2^(2n) - 1. (For example: with n = 3 take the set (1, 2, 3, 4, 8, 12, 16, 32, 48) and ask about the subsets summing to 63.)
You can form 1+2 either using 1 and 2 or using 1+2.
You can form 4+8 either using 4 and 8 or using 4+8.
....
You can form 2^(2n-2) + 2^(2n-1) either using 2^(2n-2) and 2^(2n-1) or using 2^(2n-2)+2^(2n-1).
The choices are independent, so there are at least 2^n = 2^(m/3) solutions, where m is the size of your set. I bet this can be strengthened considerably, but it already shows there is no polynomial bound.
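For a concrete check, here is a small brute-force sketch (the helper names are mine, not from the answer) that builds this family of instances and counts the subsets hitting the target; for these small n the count comes out to exactly 2^n:

from itertools import combinations

def counterexample_instance(n):
    """Build n triples (a, b, a+b) with a = 2^(2k), b = 2^(2k+1), plus the target sum."""
    nums, target = [], 0
    for k in range(n):
        a, b = 2 ** (2 * k), 2 ** (2 * k + 1)
        nums += [a, b, a + b]
        target += a + b
    return nums, target

def count_solutions(nums, target):
    """Brute-force count of subsets (by index) summing to target."""
    return sum(
        1
        for r in range(len(nums) + 1)
        for combo in combinations(range(len(nums)), r)
        if sum(nums[i] for i in combo) == target
    )

for n in range(1, 6):
    nums, target = counterexample_instance(n)
    print(n, count_solutions(nums, target), 2 ** n)   # the two counts agree for these n

Intuitively, working modulo 4 forces each triple to contribute exactly its own pair-sum, so the blocks really are independent and the count is exactly 2^n for this construction.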

Sperner's Theorem provides a nice (albeit non-polynomial) upper bound, at least in the case when the numbers in the set are strictly greater than zero (as seems to be the case in your problem).
The family of all subsets with a given sum forms a Sperner family, which is a collection of subsets in which no subset in the family is itself a subset of any other subset in the family. This is where the assumption that the elements are strictly greater than zero is used. Sperner's theorem states that the maximum size of such a Sperner family is the binomial coefficient n choose floor(n/2).
If you drop the assumption that the n numbers are distinct, it is easy to see that this upper bound cannot be improved (just take all numbers equal to 1 and let the target sum be floor(n/2)). I don't know whether it can be improved under the assumption that the numbers are distinct.
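A quick numeric illustration of that tight case (a sketch of my own, not from the answer): with n copies of 1 and target floor(n/2), the number of solutions is exactly n choose floor(n/2), so the Sperner bound is attained:

from itertools import combinations
from math import comb

def count_subset_sum_solutions(nums, target):
    """Brute-force count of index subsets of nums summing to target."""
    return sum(
        1
        for r in range(len(nums) + 1)
        for combo in combinations(range(len(nums)), r)
        if sum(nums[i] for i in combo) == target
    )

for n in range(1, 13):
    solutions = count_subset_sum_solutions([1] * n, n // 2)
    assert solutions == comb(n, n // 2)   # matches the Sperner bound exactly
    print(n, solutions)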


Algorithm so that I can index 2^n combinations in a way so I can backtrack from any index value of 1 to 2^n without using an array

I am trying to do something but it is outside my field. To explain, let's set n = 3 to simplify things, where n is the total number of parameters in this example: A, B, C. These parameters can each have a state of ON or OFF (aka 0 or 1).
The total number of combinations of these parameters is 2^n = 8 in this case, which can be visualized as:
ABC
1: 000
2: 111
3: 100
4: 010
5: 001
6: 110
7: 011
8: 101
Of course the above list can be sorted in (2^n)! = 40320 ways.
I want an algorithm so that I can calculate the state of any of my parameters (0 or 1) given a number from 1 to 2^n. For example, if I am given the number 3, then using the table above I know the state of A is 1 and B and C are 0. Of course you can have a table/array to look it up for a given ordering, but even for relatively small values of n you need a huge table.
I'm not familiar with indexing methods like this, which is why I need help.
Kind regards
Just realised you can actually look at it another way: what you want is a function encrypting N bits to another set of N bits. In practice this is the same as format-preserving encryption. The question is, do you care whether:
- all 2^n cases are covered, or just a large enough number close to 2^n (you have to choose the right encryption/hash method);
- you want to do this one way or both ways (that is, do you ever want to ask: I have this number corresponding to that number, which permutation am I using?).
If the answer is no to both, you can just find an FPE algorithm that doesn't require you to generate the whole table (some do).
I have seen another problem of finding all subsets of a given set using bitmasks. You can use the same concept in your case. This link contains a good tutorial.
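For the natural binary ordering (a 0-based index from 0 to 2^n - 1, rather than the particular shuffled table in the question), the bitmask idea looks like this; the function name is mine:

def parameter_states(index, n):
    """Decode a 0-based index into the ON/OFF states of n parameters.

    Parameter 0 is the most significant bit, matching a column order like A, B, C.
    """
    return [(index >> (n - 1 - bit)) & 1 for bit in range(n)]

# Example: index 4 with n = 3 gives [1, 0, 0], i.e. A=1, B=0, C=0.
print(parameter_states(4, 3))

Any other fixed ordering of the 2^n combinations can then be obtained by composing this with a permutation of the index, which is exactly what the format-preserving-encryption suggestion above provides without storing a table.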

Algorithm - How to select one number from each column in an array so that their sum is as close as possible to a particular value

I have an m x n matrix of real numbers. I want to choose a single value from each column such that the sum of my selected values is as close as possible to a pre-specified total.
I am not an experienced programmer (although I have an experienced friend who will help). I would like to achieve this using Matlab, Mathematica or C++ (MySQL if necessary).
The code only needs to run a few times, once every few days - it does not necessarily need to be optimised. I will have 16 columns and about 12 rows.
Normally I would suggest dynamic programming, but there are a few features of this situation suggesting an alternative approach. First, the performance demands are light; this program will be run only a couple times, and it doesn't sound as though a running time on the order of hours would be a problem. Second, the matrix is fairly small. Third, the matrix contains real numbers, so it would be necessary to round and then do a somewhat sophisticated search to ensure that the optimal possibility was not missed.
Instead, I'm going to suggest the following semi-brute-force approach. 12**16 ~ 1.8e17, the total number of possible choices, is too many, but 12**9 ~ 5.2e9 is doable with brute force, and 12**7 ~ 3.6e7 fits comfortably in memory. Compute all possible choices for the first seven columns. Sort these possibilities by total. For each possible choice for the last nine columns, use an efficient search algorithm to find the best mate among the first seven. (If you have a lot of memory, you could try eight and eight.)
I would attempt a first implementation in C++, using std::sort and std::lower_bound from the <algorithm> standard header. Measure it; if it's too slow, then try an in-memory B+-tree (does Boost have one?).
I spent some more time thinking about how to implement what I wrote above in the simplest way possible. Here's an approach that will work well for a 12 by 16 matrix on a 64-bit machine with roughly 4 GB of memory.
The number of choices for the first eight columns is 12**8. Each choice is represented by a 4-byte integer between 0 and 12**8 - 1. To decode a choice index i, the row for the first column is given by i % 12; then update i /= 12, and the row for the second column is given by the new i % 12, et cetera.
A vector holding all choices requires roughly 12**8 * 4 bytes, or about 1.6 GB. Two such vectors require 3.2 GB. Prepare one for the first eight columns and one for the last eight. Sort them by the sum of the entries that they indicate. Use saddleback search to find the best combination. (Initialize an iterator into the first vector and a reverse iterator into the second. While neither iterator is at its end, compare the current combination against the current best and update the current best if necessary. If the current combination sums to less than the target, increment the first iterator. If the sum is greater than the target, increment the second iterator.)
I would estimate that this requires less than 50 lines of C++.
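Here is a compact sketch of that meet-in-the-middle idea, in Python rather than the C++ described above, on a toy-sized matrix (function name and example values are mine; it returns only the best achievable sum, and tracking which rows produced it is a straightforward extension):

import bisect
from itertools import product

def closest_selection_sum(matrix, target):
    """matrix[i][j] is the value in row i, column j; pick one value per column so the
    total is as close as possible to target, by splitting the columns into two halves."""
    n_cols = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_cols)]
    left, right = cols[: n_cols // 2], cols[n_cols // 2:]

    left_sums = sorted(sum(choice) for choice in product(*left))
    best = None
    for choice in product(*right):
        partial = sum(choice)
        # Find the left-half sums bracketing the remaining amount and keep the closer one.
        i = bisect.bisect_left(left_sums, target - partial)
        for j in (i - 1, i):
            if 0 <= j < len(left_sums):
                total = partial + left_sums[j]
                if best is None or abs(total - target) < abs(best - target):
                    best = total
    return best

# Toy example: 3 rows, 4 columns, aiming for 10.
matrix = [[1.0, 2.5, 3.0, 0.5],
          [4.0, 1.5, 2.0, 2.5],
          [0.5, 3.5, 1.0, 4.0]]
print(closest_selection_sum(matrix, 10.0))

The sorted list plus binary search plays the role of std::sort and std::lower_bound; for the full 12-by-16 case with 12**8 sums per half you would want the C++ (or an array-based) version rather than plain Python.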
Without knowing the range of values that might fill the arrays, how about something generic like this:
1. Divide the remaining target by the number of columns not yet picked from.
2. Pick the number from the next column closest to that value, and subtract it from the remaining target.
3. Repeat from 1 until a number has been picked from each column (a sketch follows below).
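A minimal sketch of that greedy heuristic (this is my reading of the steps, with the target reduced after each pick); it is fast but gives no guarantee of hitting the optimum:

def greedy_selection(matrix, target):
    """Pick one value per column, greedily steering the running total toward target."""
    n_cols = len(matrix[0])
    remaining = target
    picks = []
    for j in range(n_cols):
        goal = remaining / (n_cols - j)            # ideal contribution for this column
        column = [row[j] for row in matrix]
        value = min(column, key=lambda v: abs(v - goal))
        picks.append(value)
        remaining -= value
    return picks, sum(picks)

matrix = [[1.0, 2.5, 3.0, 0.5],
          [4.0, 1.5, 2.0, 2.5],
          [0.5, 3.5, 1.0, 4.0]]
print(greedy_selection(matrix, 10.0))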

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count the distinct elements of a list in O(n) time and using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number on most processors.)
Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of a seen "00...1" prefix for each substream. Then it estimates the final value by taking the mean of the per-substream estimates.
That's the main idea of this algorithm. There are some missing details (the correction for low estimates, for example), but it's all well written in the paper. Sorry for the terrible English.
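A rough sketch of that idea (not the full HyperLogLog estimator; the bias corrections are omitted and the function name, m = 64 and the 32-bit width are my choices): hash each item, route it to one of m substreams, remember the longest run of leading zero bits seen in each substream, and average the per-substream guesses:

import hashlib

def crude_cardinality_estimate(items, m=64, hash_bits=32):
    """Very rough distinct-count estimate: m substreams, max leading-zero run in each."""
    max_zeros = [0] * m
    for item in items:
        h = int(hashlib.sha1(str(item).encode("utf-8")).hexdigest(), 16)
        bucket = h % m                               # which substream gets this item
        value = (h // m) & ((1 << hash_bits) - 1)    # separate bits for the zero-run trick
        zeros = hash_bits - value.bit_length()       # leading zeros in a hash_bits-wide word
        max_zeros[bucket] = max(max_zeros[bucket], zeros)
    # A max zero run of z suggests roughly 2^z distinct items in that substream, so the
    # total is about m times the average guess, i.e. simply the sum of the guesses.
    return sum(2.0 ** z for z in max_zeros)

print(crude_cardinality_estimate(range(100000)))     # right order of magnitude, but noisy

Note that duplicates hash to the same value, so they can never increase a substream's maximum; that is why this estimates distinct items rather than total items. The plain arithmetic mean used here is noisy and biased, and the smarter averaging that fixes it is exactly what the LogLog and HyperLogLog papers contribute, as the next answer explains.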
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) space. Why is there big O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element can be 1, another element "is this big string". So if you have a huge list (or a huge stream of elements) it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it will start with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have probably looked through on the order of 2^k elements. So this is a good starting point: with a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the prefix of zeros in the binary representations, and this will give you a reasonable estimate.
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, almost never evenly distributed, and can take any values). But using a good hash function you can assume that the output bits are evenly distributed, and most hash functions have outputs between 0 and 2^k - 1 (SHA-1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements, up to a maximum cardinality of 2^k, by storing only one number of about log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (it is a little bit smarter with the estimate, but still we are close).
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one random outlier element with a long prefix of zeros can spoil everything. One way to improve it is to use many hash functions, count the max for each hash function, and in the end average them out. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to see what was going on, so I will give an example. Assume you have two elements and your hash function, which gives values from 0 to 2^10 - 1, produced two values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 bucket 5 will store 1
0110 000011 bucket 6 will store 4
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
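A small sketch of that splitting step, using the answer's two example hash values (344 and 387 as 10-bit hashes with 16 buckets, so 4 bucket bits and 6 remaining bits; the function name is mine):

def split_hash(h, bucket_bits=4, total_bits=10):
    """Split a hash into (bucket index, count of leading zeros in the remaining bits)."""
    rest_bits = total_bits - bucket_bits
    bucket = h >> rest_bits                    # top bits select the bucket
    rest = h & ((1 << rest_bits) - 1)          # low bits feed the zero-run trick
    leading_zeros = rest_bits - rest.bit_length()
    return bucket, leading_zeros

for h in (344, 387):
    print(h, split_hash(h))   # 344 -> bucket 5 stores 1, 387 -> bucket 6 stores 4

Each bucket keeps only the maximum such count it has ever seen, which is the tiny per-bucket state the answer refers to.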
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
And I want to finish with a recent paper, which presents an improved version of the HyperLogLog algorithm (up until now I didn't have time to fully understand it, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should be distributed evenly over the range. Let's say the range is 10 bits, representing values up to 1024. Then observe the minimum value. Let's say it is 10. Then the cardinality will be estimated to be about 100 (since 10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Finding a repeating sequence at the end of a sequence of numbers

My problem is this: I have a large sequence of numbers. I know that, after some point, it becomes periodic - that is, there are k numbers at the beginning of the sequence, and then there are m more numbers that repeat for the rest of the sequence. As an example to make this more clear, the sequence might look like this: [1, 2, 5, 3, 4, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, ...], where k is 5 and m is 4, and the repeating block is then [2, 1, 1, 3]. As is clear from this example, I can have repeating bits inside of the larger block, so it doesn't help to just look for the first instances of repetition.
However, I do not know what k or m are - my goal is to take the sequence [a_1, a_2, ... , a_n] as an input and output the sequence [a_1, ... , a_k, [a_(k+1), ... , a_(k+m)]] - basically truncating the longer sequence by listing the majority of it as a repeating block.
Is there an efficient way to do this problem? Also, likely harder but more ideal computationally - is it possible to do this as I generate the sequence in question, so that I have to generate a minimal amount? I've looked at other, similar questions on this site, but they all seem to deal with sequences without the beginning non-repeating bit, and often without having to worry about internal repetition.
If it helps/would be useful, I can also get into why I am looking at this and what I will use it for.
Thanks!
EDITS: First, I should have mentioned that I do not know if the input sequence ends at exactly the end of a repeated block.
The real-world problem that I am attempting to work on is writing a nice, closed-form expression for continued fraction expansions (CFEs) of quadratic irrationals (actually, the negative CFE). It is very simple to generate partial quotients* for these CFEs to any degree of accuracy - however, at some point the tail of the CFE for a quadratic irrational becomes a repeating block. I need to work with the partial quotients in this repeating block.
My current thoughts are this: perhaps I can adapt some of the algorithms suggested that work from the right to work with one of these sequences. Alternatively, perhaps there is something in the proof of why quadratic irrationals are periodic that will help me see why they begin to repeat, which will help me come up with some easy criteria to check.
*If I am writing a continued fraction expansion as [a_0, a_1, ...], I refer to the a_i's as partial quotients.
Some background info can be found here for those interested: http://en.wikipedia.org/wiki/Periodic_continued_fraction
You can use a rolling hash to achieve linear time complexity and O(1) space complexity (I think this is the case, since I don't believe you can have an infinite repeating sequence with two frequencies which are not multiples of each other).
Algorithm: You just keep two rolling hashes which expand like this:
                       _______  _______  _______
                      /       \/       \/       \
...2038975623895769874883301010883301010883301010
                                                ||
                                              [][]
                                            [ ][ ]
                                          [  ][  ]
                                        [   ][   ]
                                      [    ][    ]
                                    [     ][     ]
                                  [      ][      ]
                                [       ][       ]
Keep on doing this for the entire sequence. The first pass will only detect repetitions repeated 2*n times for some value of n. However, that's not our goal: our goal in the first pass is to detect all possible periods, which this does. As we go along the sequence performing this process, we also keep track of all relatively prime periods that we will need to check later:
periods = Set(int)
periodsToFurthestReach = Map(int -> int)

for hash1, hash2 in expandedPairOfRollingHashes(sequence):
    L = hash1.length
    if hash1 == hash2:
        if L is not a multiple of any known period:
            periods.add(L)
            periodsToFurthestReach[L] = 2*L
        else (L is a multiple of some periods):
            for all periods P of which L is a multiple:
                periodsToFurthestReach[P] = 2*L
After this process, we have a list of all periods and how far they've reached. Our answer is probably the one with the furthest reach, but we check all other periods for repetition (fast because we know the periods we're checking for). If this is computationally difficult, we can optimize by pruning away periods (which stop repeating) as we're going through the list, very much like the sieve of Eratosthenes, by keeping a priority queue of when we next expect a period to repeat.
At the end, we double-check the result to make sure there was no hash collision (in the unlikely event there is one, blacklist it and repeat).
Here I assumed your goal was to minimize non-repeating-length, and not give a repeating element which can be further factored; you can modify this algorithm to find all other compressions, if they exist.
So, ninjagecko provided a good working answer to the question I posed. Thanks very much! However, I ended up finding a more efficient, mathematically based way to do the specific case that I am looking at - that is, writing out a closed form expression for the continued fraction expansion of a quadratic irrational. Obviously this solution will only work for this specific case, rather than the general case that I asked about, but I thought it might be useful to put it here in case others have a similar question.
Basically, I remembered that a quadratic irrational is reduced if and only if its continued fraction expansion is purely periodic - as in, it repeats right from the beginning, without any leading terms.
When you work out the continued fraction expansion of a number x, you basically set x_0 to be x, and then you form your sequence [a_0; a_1, a_2, a_3, ... ] by defining a_n = floor(x_n) and x_(n+1) = 1/(x_n - a_n). Normally you would just continue this until you reach a desired precision. However, for our purposes, we just run this method until x_k is a reduced quadratic irrational (which occurs if it is bigger than 1 and its conjugate is between -1 and 0). Once this happens, we know that a_k is the first term of our repeating block. Then, when we find x_(k+m+1) equal to x_k, we know that a_(k+m) is the last term in our repeating block.
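As a concrete illustration of that stopping rule, here is a sketch for the special case x = sqrt(D) with D a non-square positive integer (the ordinary continued fraction rather than the negative one discussed above). It keeps x_n in the form (m + sqrt(D))/d so that the "reduced" test and the repeat test are exact integer comparisons; all the names are mine:

from math import isqrt

def sqrt_cfe_with_period(D):
    """Return (leading_terms, repeating_block) of the continued fraction of sqrt(D)."""
    s = isqrt(D)
    if s * s == D:
        return [s], []                  # perfect square: no periodic part

    def is_reduced(m, d):
        # x = (m + sqrt(D))/d is reduced iff x > 1 and its conjugate lies in (-1, 0).
        x_gt_1 = (d - m <= 0) or (d - m) ** 2 < D
        conj_lt_0 = m < 0 or m * m < D
        conj_gt_m1 = (m + d > 0) and D < (m + d) ** 2
        return x_gt_1 and conj_lt_0 and conj_gt_m1

    m, d = 0, 1                         # x_0 = (0 + sqrt(D))/1
    terms, start_state, k = [], None, None
    while True:
        a = (m + s) // d                # a_n = floor(x_n)
        if start_state is None and is_reduced(m, d):
            start_state, k = (m, d), len(terms)    # first reduced x_n: block starts here
        elif start_state == (m, d):
            return terms[:k], terms[k:]            # same x_n seen again: block is complete
        terms.append(a)
        m = a * d - m                   # x_(n+1) = 1/(x_n - a_n), kept in (m + sqrt(D))/d form
        d = (D - m * m) // d

print(sqrt_cfe_with_period(7))      # expect ([2], [1, 1, 1, 4])
print(sqrt_cfe_with_period(23))     # expect ([4], [1, 3, 1, 8])

For general quadratic irrationals (and for the negative expansion) the same idea applies, but the bookkeeping for (m, d) and the sign conventions need a little more care.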
Search from the right:
    does a_n == a_(n-1)?
    does (a_n, a_(n-1)) == (a_(n-2), a_(n-3))?
    ...
This is clearly O(m^2). The only available bound appears to be m < n/2, so it's O(n^2).
Is this acceptable for your application? (Are we doing your homework for you, or is there an actual real-world problem here?)
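A direct sketch of that right-to-left idea (my own quadratic-time reading of it; it needs at least two full periods present at the end of the input, and like any purely right-anchored check it can be fooled if the sequence is cut at an unlucky point):

def split_periodic_tail(seq):
    """Return (prefix, block) splitting seq into a lead-in and a block that repeats to the end."""
    n = len(seq)
    for length in range(1, n // 2 + 1):
        if seq[n - length:] != seq[n - 2 * length: n - length]:
            continue
        # The last two blocks of this length match; extend the periodicity leftwards.
        start = n - 2 * length
        while start > 0 and seq[start - 1] == seq[start - 1 + length]:
            start -= 1
        return seq[:start], seq[start:start + length]
    return seq, []

# The question's example: k = 5 leading terms, then [2, 1, 1, 3] repeating.
print(split_periodic_tail([1, 2, 5, 3, 4, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3]))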
This page lists several good cycle-detection algorithms and gives an implementation of an algorithm in C.
Consider the sequence once it has repeated a number of times. It will end e.g. ...12341234123412341234. If you take the repeating part of the string up to just before the last cycle of repeats, and then slide it along by the length of that cycle, you will find that you have a long match between a substring at the end of the sequence and the same substring slid to the left a distance which is small compared with its length.
Conversely, if you have a string where a[x] = a[x + k] for a large number of x, then you also have a[x] = a[x + k] = a[x + 2k] = a[x + 3k]... so a string that matches itself when slid a short distance compared to its length must contain repeats.
If you look at http://en.wikipedia.org/wiki/Suffix_array, you will see that you can build the list of all suffixes of a string, in sorted order, in linear time, and also an array which tells you how many characters each suffix has in common with the previous suffix in sorted order. If you look for the entry with the largest value of this, this would be my candidate for a string going ..1234123412341234, and the distance between the starting points of the two suffixes would tell you the length at which the sequence repeats. (but in practice some sort of rolling hash search like http://en.wikipedia.org/wiki/Rabin-Karp might be quicker and easier, although there are quite codeable linear time Suffix Array algorithms, like "Simple Linear Work Suffix Array Construction" by Karkkainen and Sanders).
Suppose that you apply this algorithm when the number of characters available is 8, 16, 32, 64, .... 2^n, and you finally find a repeat at 2^p. How much time have you wasted in earlier stages? 2^(p-1) + 2^(p-2) + ..., which sums to about 2^p, so the repeated searches are only a constant overhead.

Most mutually distant k elements (clustering?)

I have a simple machine learning question:
I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are most far apart. That is, I want to choose 10 different elements so as to maximize the minimum distance over all pairings within the 10.
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
1. Cluster the n elements into 20 clusters.
2. Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
3. Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756 (a sketch of this step follows the edit note below).
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
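A sketch of the brute-force step 3 (maximizing the minimum pairwise distance over all 20-choose-10 subsets; dist is assumed to be a symmetric 2D distance matrix over the 20 surviving candidates, and the names are mine):

from itertools import combinations

def best_subset_bruteforce(dist, k=10):
    """Exhaustively find the k indices maximizing the minimum pairwise distance."""
    n = len(dist)
    best_subset, best_score = None, float("-inf")
    for subset in combinations(range(n), k):
        score = min(dist[i][j] for i, j in combinations(subset, 2))
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Example: 20 random points on a line, choose the 10 most spread out.
import random
points = sorted(random.random() for _ in range(20))
dist = [[abs(a - b) for b in points] for a in points]
print(best_subset_bruteforce(dist))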
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
Let me know if any of this is unclear, notation or otherwise.
Nice question.
I'm not sure whether it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
Repeat 100 times:
Greedily select the datapoint whose removal decreases the objective function the least, and remove it.
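A sketch of that greedy-removal baseline (here the objective is the minimum pairwise distance among the points still in play; names and the toy data are mine):

import random

def greedy_removal_baseline(dist, keep=10):
    """Repeatedly drop the point whose removal hurts the min pairwise distance the least."""
    remaining = set(range(len(dist)))

    def min_pairwise(indices):
        idx = sorted(indices)
        return min(dist[a][b] for i, a in enumerate(idx) for b in idx[i + 1:])

    while len(remaining) > keep:
        # Try every removal and keep the one leaving the best (largest) objective.
        best_point = max(remaining, key=lambda p: min_pairwise(remaining - {p}))
        remaining.remove(best_point)
    return remaining

# Toy data: 30 random points in the unit square.
points = [(random.random(), random.random()) for _ in range(30)]
dist = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for bx, by in points] for ax, ay in points]
print(sorted(greedy_removal_baseline(dist)))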
