Finding a repeating sequence at the end of a sequence of numbers - arrays

My problem is this: I have a large sequence of numbers. I know that, after some point, it becomes periodic - that is, there are k numbers at the beginning of the sequence, and then there are m more numbers that repeat for the rest of the sequence. As an example to make this more clear, the sequence might look like this: [1, 2, 5, 3, 4, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3, ...], where k is 5 and m is 4, and the repeating block is then [2, 1, 1, 3]. As is clear from this example, I can have repeating bits inside of the larger block, so it doesn't help to just look for the first instances of repetition.
However, I do not know what k or m are - my goal is to take the sequence [a_1, a_2, ... , a_n] as an input and output the sequence [a_1, ... , a_k, [a_(k+1), ... , a_(k+m)]] - basically truncating the longer sequence by listing the majority of it as a repeating block.
Is there an efficient way to do this problem? Also, likely harder but more ideal computationally - is it possible to do this as I generate the sequence in question, so that I have to generate a minimal amount? I've looked at other, similar questions on this site, but they all seem to deal with sequences without the beginning non-repeating bit, and often without having to worry about internal repetition.
If it helps/would be useful, I can also get into why I am looking at this and what I will use it for.
Thanks!
EDITS: First, I should have mentioned that I do not know if the input sequence ends at exactly the end of a repeated block.
The real-world problem that I am attempting to work on is writing a nice, closed-form expression for continued fraction expansions (CFEs) of quadratic irrationals (actually, the negative CFE). It is very simple to generate partial quotients* for these CFEs to any degree of accuracy - however, at some point the tail of the CFE for a quadratic irrational becomes a repeating block. I need to work with the partial quotients in this repeating block.
My current thoughts are this: perhaps I can adapt some of the algorithms suggested that work from the right to work with one of these sequences. Alternatively, perhaps there is something in the proof of why quadratic irrationals are periodic that will help me see why they begin to repeat, which will help me come up with some easy criteria to check.
*If I am writing a continued fraction expansion as [a_0, a_1, ...], I refer to the a_i's as partial quotients.
Some background info can be found here for those interested: http://en.wikipedia.org/wiki/Periodic_continued_fraction

You can use a rolling hash to achieve linear time complexity and O(1) space complexity (I think this is the case, since I don't believe you can have an infinite repeating sequence with two frequencies which are not multiples of each other).
Algorithm: You just keep two rolling hashes which expand like this:
                        _______  _______  _______
                       /       \/       \/       \
...2038975623895769874883301010883301010883301010
                                             [][]
                                           [ ][ ]
                                         [  ][  ]
                                       [   ][   ]
                                     [    ][    ]
                                   [     ][     ]
(two adjacent windows of equal length, anchored at the current end of the sequence and growing as new terms arrive; when their hashes match, the window length is a candidate period)
Keep on doing this for the entire sequence. The first pass will only detect repetitions repeated 2*n times for some value of n. However that's not our goal: our goal in the first pass is to detect all possible periods, which this does. As we go along the sequence performing this process, we also keep track of all relatively prime periods we will need to later check:
periods = Set(int)
periodsToFurthestReach = Map(int -> int)
for hash1, hash2 in expandedPairOfRollingHashes(sequence):
    L = hash1.length
    if hash1 == hash2:
        if L is not a multiple of any known period:
            periods.add(L)
            periodsToFurthestReach[L] = 2*L
        else (L is a multiple of some periods):
            for all periods P for which L is a multiple:
                periodsToFurthestReach[P] = 2*L
After this process, we have a list of all periods and how far they've reached. Our answer is probably the one with the furthest reach, but we check all other periods for repetition (fast because we know the periods we're checking for). If this is computationally difficult, we can optimize by pruning away periods (which stop repeating) as we're going through the list, very much like the sieve of Eratosthenes, by keeping a priority queue of when we next expect a period to repeat.
At the end, we double-check the result to make sure there was no hash collision (in the unlikely event there is one, blacklist the colliding hash and repeat).
Here I assumed your goal was to minimize non-repeating-length, and not give a repeating element which can be further factored; you can modify this algorithm to find all other compressions, if they exist.
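To make the window-comparison idea concrete, here is a small sketch (my code, not ninjagecko's, and non-streaming: it hashes the whole sequence up front instead of expanding a pair of rolling hashes as the terms arrive). For every window length L it compares the last two length-L windows in O(1) using a polynomial rolling hash, so the whole scan is roughly linear; anything it reports is only a candidate period and should still be verified directly, as noted above.

function candidatePeriods(a) {
    const MOD = 1000000007n, BASE = 131n;            // BigInt to avoid overflow
    const n = a.length;
    const H = [0n], POW = [1n];                      // prefix hashes and powers of BASE
    for (let i = 0; i < n; i++) {
        H.push((H[i] * BASE + BigInt(a[i])) % MOD);
        POW.push((POW[i] * BASE) % MOD);
    }
    const hash = (l, r) =>                           // hash of a[l..r)
        ((H[r] - H[l] * POW[r - l]) % MOD + MOD) % MOD;

    const periods = [];
    for (let L = 1; 2 * L <= n; L++) {
        // windows a[n-2L..n-L) and a[n-L..n): equal hashes => L is a candidate
        if (hash(n - 2 * L, n - L) === hash(n - L, n)) periods.push(L);
    }
    return periods;                                  // verify candidates before trusting them
}

For the question's example [1, 2, 5, 3, 4, 2, 1, 1, 3, 2, 1, 1, 3, 2, 1, 1, 3] this returns [4]; once a verified period m is in hand, k is simply the smallest index from which a[i] == a[i+m] holds all the way to the end.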

So, ninjagecko provided a good working answer to the question I posed. Thanks very much! However, I ended up finding a more efficient, mathematically based way to do the specific case that I am looking at - that is, writing out a closed form expression for the continued fraction expansion of a quadratic irrational. Obviously this solution will only work for this specific case, rather than the general case that I asked about, but I thought it might be useful to put it here in case others have a similar question.
Basically, I remembered that a quadratic irrational is reduced if and only if its continued fraction expansion is purely periodic - as in, it repeats right from the beginning, without any leading terms.
When you work out the continued fraction expansion of a number x, you basically set x_0 to be x, and then you form your sequence [a_0; a_1, a_2, a_3, ...] by defining a_n = floor(x_n) and x_(n+1) = 1/(x_n - a_n). Normally you would just continue this until you reach a desired precision. However, for our purposes, we just run this method until x_k is a reduced quadratic irrational (which happens once it is bigger than 1 and its conjugate is between -1 and 0). Once this happens, we know that a_k is the first term of our repeating block. Then, when we find x_(k+m) equal to x_k, we know that a_(k+m-1) is the last term in our repeating block.
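For what it's worth, here is a sketch of that procedure (my own illustrative code, not a vetted library: it computes the regular CFE of x = (P + sqrt(D))/Q, uses floating-point sqrt(D) for the floors, so exact integer arithmetic would be safer for large D, and assumes D is not a perfect square, Q > 0, and Q divides D - P^2, which you can always arrange by scaling).

function cfExpansion(P, Q, D) {
    const sqrtD = Math.sqrt(D);
    const isReduced = (p, q) => {                 // reduced: x > 1 and conjugate in (-1, 0)
        const x = (p + sqrtD) / q, conj = (p - sqrtD) / q;
        return x > 1 && conj > -1 && conj < 0;
    };
    const lead = [], period = [], seen = new Set();
    let repeating = false;
    while (true) {
        if (!repeating && isReduced(P, Q)) repeating = true;
        if (repeating) {
            const key = P + "," + Q;
            if (seen.has(key)) break;             // the state x_k recurred: period is closed
            seen.add(key);
        }
        const a = Math.floor((P + sqrtD) / Q);    // a_n = floor(x_n)
        (repeating ? period : lead).push(a);
        P = a * Q - P;                            // these two lines are x_(n+1) = 1/(x_n - a_n)
        Q = (D - P * P) / Q;
    }
    return { lead, period };
}

For example, cfExpansion(0, 1, 14) returns { lead: [3], period: [1, 2, 1, 6] }, i.e. sqrt(14) = [3; 1, 2, 1, 6] with the part after the semicolon repeating.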

Search from the right:
does a_n == a_(n-1)?
does (a_n, a_(n-1)) == (a_(n-2), a_(n-3))?
...
This is clearly O(m^2). The only available bound appears to be that m < n/2, so it's O(n^2).
Is this acceptable for your application? (Are we doing your homework for you, or is there an actual real-world problem here?)
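For reference, here is a direct sketch of that check (nothing clever, just the literal right-to-left comparison; whether two matching copies are enough evidence of a real period is up to the caller):

function smallestTailPeriod(a) {
    const n = a.length;
    for (let m = 1; 2 * m <= n; m++) {
        let ok = true;
        for (let i = 0; i < m; i++) {
            if (a[n - 1 - i] !== a[n - 1 - m - i]) { ok = false; break; }
        }
        if (ok) return m;          // the last m elements equal the m elements before them
    }
    return null;                   // no repetition visible yet
}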

This page lists several good cycle-detection algorithms and gives an implementation of an algorithm in C.
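One classic from that family is Floyd's tortoise-and-hare. A sketch (mine, in JavaScript rather than the page's C) for the case where the sequence is produced by iterating a function x_(n+1) = f(x_n), which is exactly the situation for continued fraction expansions; it returns the pre-period mu (the k from the question) and the cycle length lam (the m), assuming states are primitives comparable with ===:

function floyd(f, x0) {
    let tortoise = f(x0), hare = f(f(x0));        // phase 1: find a meeting point inside the cycle
    while (tortoise !== hare) { tortoise = f(tortoise); hare = f(f(hare)); }

    let mu = 0;                                   // phase 2: find where the cycle starts
    tortoise = x0;
    while (tortoise !== hare) { tortoise = f(tortoise); hare = f(hare); mu++; }

    let lam = 1;                                  // phase 3: measure the cycle length
    hare = f(tortoise);
    while (tortoise !== hare) { hare = f(hare); lam++; }

    return { mu, lam };
}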

Consider the sequence once it has repeated a number of times. It will end e.g. ...12341234123412341234. If you take the repeating part of the string up to just before the last cycle of repeats, and then slide it along by the length of that cycle, you will find that you have a long match between a substring at the end of the sequence and the same substring slid to the left a distance which is small compared with its length.
Conversely, if you have a string where a[x] = a[x + k] for a large number of x, then you also have a[x] = a[x + k] = a[x + 2k] = a[x + 3k]... so a string that matches itself when slid a short distance compared to its length must contain repeats.
If you look at http://en.wikipedia.org/wiki/Suffix_array, you will see that you can build the list of all suffixes of a string, in sorted order, in linear time, along with an array that tells you how many characters each suffix has in common with the previous suffix in sorted order. The entry with the largest such value would be my candidate for a string going ...1234123412341234, and the distance between the starting points of the two suffixes tells you the length at which the sequence repeats. (In practice some sort of rolling hash search like http://en.wikipedia.org/wiki/Rabin-Karp might be quicker and easier, although there are quite codeable linear-time suffix array algorithms, like "Simple Linear Work Suffix Array Construction" by Karkkainen and Sanders.)
Suppose that you apply this algorithm when the number of characters available is 8, 16, 32, 64, .... 2^n, and you finally find a repeat at 2^p. How much time have you wasted in earlier stages? 2^(p-1) + 2^(p-2) + ..., which sums to about 2^p, so the repeated searches are only a constant overhead.
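A sketch of that doubling wrapper (generateTerms and findRepeat are hypothetical placeholders: the first produces the first len terms of your sequence, the second is whatever detector you plug in, returning null when it finds nothing; the loop terminates only if the sequence really is eventually periodic):

function detectWithDoubling(generateTerms, findRepeat) {
    for (let len = 8; ; len *= 2) {
        const result = findRepeat(generateTerms(len));
        if (result !== null) return result;   // total work stays within a constant factor of the last pass
    }
}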

Related

sequences with the same order in an array - Identify sequences

I'm looking for a hint towards a solution of the problem:
Suppose there's an array with some numbers in ascending order and some in descending, for example [1,2,5,9,6,3,2,4,7,8] has sequences asc [1,2,5,9], desc [(9),6,3,2], asc [(2),4,7,8].
Now this isn't a problem: I could simply loop through the array and add the numbers to some data structure, and when the direction changes, store this structure somewhere and start filling the next one.
What I've found tricky is if I want to have a threshold of some sort. For example: [0,50,100,99,98,97,105,160]
So the sequence in descending order [(100), 99, 98, 97] could be neglected, because overall change is -3, whereas the sequence was increasing much more dramatically (+100) and as a result, the algorithm identifies only one sequence in ascending order.
I have tried the same method as above, simply adding all sequences to a data structure and then comparing the change in values of two consecutive items (100 vs -3 means -3 can be neglected). But then the problem is if I have, say, this situation
(written only as the change in values from the start to the end of each sequence):
[+100, -3, +1, -50]
In this situation I cannot neglect the descending movement, because the numbers start to descend, then slightly ascend, and then go down pretty significantly again.
And it gets really confusing with something like this:
[+100, -3, +3, -3, +3, -50]
Here is a quick sketch of what I am trying to achieve: the black lines represent the initial data in the array, and the thin red lines are the desired resulting output.
Could somebody point me in the right direction? How would I approach this situation? Compare multiple sequences at a time, slowly combining them together? Maybe I would need to go through the sequences multiple times?
I'm not sure if I've come across a problem like this before, and I don't know of a working algorithm for it. This is a problem I've faced myself while trying to analyse some data.
If I understand correctly, you expect your curve to be a succession of alternatively increasing and decreasing sequences, with a bit of added noise.
The usual way to get rid of noise is to filter data. There are millions of ways to do that, most of them requiring frequency analysis, but in your case you could probably get good enough results with something simple.
The main point is that the relevant variable is not the values in the array, but their variations.
Given N values, consider the array of N-1 elements holding the differences between two consecutive values.
[0,50,100,99,98,97,105,160] -> 50, 50, -1, -1, -1, 8, 55
Now eliminate all values whose absolute value is below a given threshold (say 10, for instance):
-> 50, 50, 0, 0, 0, 0, 55
You can then detect a rising sequence by looking at streaks of all positive or null values (and likewise for decreasing sequences, considering zero or negative values).
As for all filtering processes, you will have to find a sweet spot for your threshold. Too low and it will fail to eliminate insignificant variations, too high and it will wipe out significant slope inversions.
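A sketch of that difference-and-threshold idea (the function and field names are mine; run boundaries are chosen so the turning point is shared by both neighbouring runs, as in the question's example):

function segments(values, threshold) {
    const diffs = [];
    for (let i = 1; i < values.length; i++) {
        const d = values[i] - values[i - 1];
        diffs.push(Math.abs(d) < threshold ? 0 : d);     // kill insignificant variations
    }
    const runs = [];
    let start = 0, dir = 0;                              // dir: +1 ascending, -1 descending
    for (let i = 0; i < diffs.length; i++) {
        const s = Math.sign(diffs[i]);
        if (s !== 0 && dir !== 0 && s !== dir) {         // direction flipped: close the run
            runs.push({ from: start, to: i, dir });
            start = i;
            dir = s;
        } else if (s !== 0 && dir === 0) {
            dir = s;
        }
    }
    runs.push({ from: start, to: values.length - 1, dir });
    return runs;                                         // from/to are indices into values
}

With [0,50,100,99,98,97,105,160] and a threshold of 10 this reports a single ascending run, and with [1,2,5,9,6,3,2,4,7,8] and a threshold of 1 it reports the three runs from the question's first example.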
I don't know if I understand your problem correctly, but I had to do this kind of dimensionality reduction many times before, so I wrote a small javascript library to do so. It uses the Perceptually Important Points algorithm.
In the algorithm you can define a custom metric of the distance between three consecutive points (to measure how much a single point adds in entropy).
Here is a demonstration (in JS). It works kind of like a heap, where you remove points that do not contribute much to the overall entropy:
for (var i = 0; i < data.length; i++)
    heap.add(data[i]);

while (heap.minValue() < threshold)
    heap.removeMin();
And here is the library.

How does the HyperLogLog algorithm work?

I've been learning about different algorithms in my spare time recently, and one that I came across which appears to be very interesting is called the HyperLogLog algorithm - which estimates how many unique items are in a list.
This was particularly interesting to me because it brought me back to my MySQL days, when I saw that "Cardinality" value (which until recently I always assumed was calculated, not estimated).
So I know how to write an algorithm in O(n) that will calculate how many unique items are in an array. I wrote this in JavaScript:
function countUniqueAlgo1(arr) {
    var Table = {};
    var numUnique = 0;
    var numDataPoints = arr.length;
    for (var j = 0; j < numDataPoints; j++) {
        var val = arr[j];
        if (Table[val] != null) {
            continue;
        }
        Table[val] = 1;
        numUnique++;
    }
    return numUnique;
}
But the problem is that my algorithm, while O(n), uses a lot of memory (storing values in Table).
I've been reading this paper about how to count duplicates in a list in O(n) time and using minimal memory.
It explains that by hashing and counting bits or something one can estimate within a certain probability (assuming the list is evenly distributed) the number of unique items in a list.
I've read the paper, but I can't seem to understand it. Can someone give a more layperson's explanation? I know what hashes are, but I don't understand how they are used in this HyperLogLog algorithm.
The main trick behind this algorithm is that if you, observing a stream of random integers, see an integer whose binary representation starts with some known prefix, there is a higher chance that the cardinality of the stream is 2^(size of the prefix).
That is, in a random stream of integers, ~50% of the numbers (in binary) start with "1", 25% start with "01", and 12.5% start with "001". This means that if you observe a random stream and see a "001", there is a higher chance that this stream has a cardinality of 8.
(The prefix "00..1" has no special meaning. It's there just because it's easy to find the most significant bit in a binary number in most processors)
Of course, if you observe just one integer, the chance this value is wrong is high. That's why the algorithm divides the stream into "m" independent substreams and keeps the maximum length of a seen "00...1" prefix for each substream. Then it estimates the final value by taking the mean of these per-substream values.
That's the main idea of this algorithm. There are some missing details (the correction for low estimate values, for example), but it's all well written in the paper. Sorry for the terrible English.
A HyperLogLog is a probabilistic data structure. It counts the number of distinct elements in a list. But in comparison to a straightforward way of doing it (having a set and adding elements to the set) it does this in an approximate way.
Before looking at how the HyperLogLog algorithm does this, one has to understand why you need it. The problem with the straightforward way is that it consumes O(distinct elements) of space. Why is there big O notation here instead of just the number of distinct elements? Because elements can be of different sizes: one element could be the number 1, another the string "is this big string". So if you have a huge list (or a huge stream of elements), it will take a lot of memory.
Probabilistic Counting
How can one get a reasonable estimate of the number of unique elements? Assume that you have a string of length m which consists of {0, 1} with equal probability. What is the probability that it will start with 0, with 2 zeros, with k zeros? It is 1/2, 1/4 and 1/2^k. This means that if you have encountered a string starting with k zeros, you have approximately looked through 2^k elements. So this is a good starting point. Having a list of elements that are evenly distributed between 0 and 2^k - 1, you can track the maximum length of the zero prefix among the binary representations, and this will give you a reasonable estimate.
The problem is that the assumption of having evenly distributed numbers from 0 to 2^k - 1 is too hard to achieve (the data we encounter is mostly not numbers, is almost never evenly distributed, and can be between any values). But using a good hashing function you can assume that the output bits will be evenly distributed, and most hashing functions have outputs between 0 and 2^k - 1 (SHA-1 gives you values between 0 and 2^160). So what we have achieved so far is that we can estimate the number of unique elements, up to a maximum cardinality of 2^k, by storing only one number of size log(k) bits. The downside is that we have a huge variance in our estimate. A cool thing is that we have almost recreated the 1984 probabilistic counting paper (it is a little bit smarter with the estimate, but still we are close).
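As a quick illustration of the estimator described so far (a rough sketch of my own; FNV-1a is just a convenient stand-in for "a good hashing function", and the result has the huge variance discussed next):

function fnv1a32(s) {                            // small, well-known 32-bit hash
    let h = 0x811c9dc5;                          // FNV offset basis
    for (let i = 0; i < s.length; i++) {
        h ^= s.charCodeAt(i);
        h = Math.imul(h, 0x01000193);            // FNV prime
    }
    return h >>> 0;
}

function naiveEstimate(items) {
    let maxRank = 0;
    for (const item of items) {
        const rank = Math.clz32(fnv1a32(String(item))) + 1;  // leading zeros + 1
        if (rank > maxRank) maxRank = rank;
    }
    return Math.pow(2, maxRank);                 // single-register estimate: very high variance
}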
LogLog
Before moving further, we have to understand why our first estimate is not that great. The reason is that one random occurrence of an element whose hash has a long prefix of zeros can spoil everything. One way to improve this is to use many hash functions, take the max for each hash function, and in the end average them out. This is an excellent idea, which will improve the estimate, but the LogLog paper used a slightly different approach (probably because hashing is kind of expensive).
They used one hash but divided it into two parts. One part is called a bucket (the total number of buckets is 2^x) and the other is basically the same as our hash. It was hard for me to get what was going on, so I will give an example. Assume you have two elements and your hash function, which gives values from 0 to 2^10 - 1, produced 2 values: 344 and 387. You decided to have 16 buckets. So you have:
0101 011000 -> bucket 5 will store 1 (one leading zero in the remaining bits)
0110 000011 -> bucket 6 will store 4 (four leading zeros in the remaining bits)
By having more buckets you decrease the variance (you use slightly more space, but it is still tiny). Using math skills they were able to quantify the error (which is 1.3/sqrt(number of buckets)).
HyperLogLog
HyperLogLog does not introduce any new ideas, but mostly uses a lot of math to improve the previous estimate. Researchers have found that if you remove 30% of the biggest numbers from the buckets you significantly improve the estimate. They also used another algorithm for averaging numbers. The paper is math-heavy.
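Putting the bucket idea together, a toy version might look like the following (illustrative only: it applies the raw HyperLogLog-style harmonic-mean estimate without the small- and large-range corrections from the papers; hash32 is assumed to be any well-mixed 32-bit hash, e.g. the fnv1a32 sketch above):

function estimateDistinct(items, hash32, b = 10) {
    const m = 1 << b;                                    // 2^b buckets
    const M = new Array(m).fill(0);
    for (const item of items) {
        const h = hash32(item);
        const bucket = h >>> (32 - b);                   // first b bits pick the bucket
        const rest = ((h << b) | (1 << (b - 1))) >>> 0;  // remaining bits, with a sentinel to cap the rank
        const rank = Math.clz32(rest) + 1;               // leading zeros of the remainder + 1
        if (rank > M[bucket]) M[bucket] = rank;
    }
    let sum = 0;
    for (const r of M) sum += Math.pow(2, -r);           // harmonic mean of the 2^M[j]
    const alpha = 0.7213 / (1 + 1.079 / m);              // standard bias-correction constant
    return alpha * m * m / sum;
}

Call it as estimateDistinct(data, x => fnv1a32(String(x))); the relative error shrinks on the order of 1/sqrt(number of buckets), which is where figures like the 1.3/sqrt(m) quoted above come from.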
And I want to finish with a recent paper, which describes an improved version of the HyperLogLog algorithm (up until now I didn't have time to fully understand it, but maybe later I will improve this answer).
The intuition is that if your input is a large set of random numbers (e.g. hashed values), they should distribute evenly over a range. Let's say the range is up to 10 bits, representing values up to 1024. Then observe the minimum value. Let's say it is 10. Then the cardinality will be estimated to be about 100 (since 10 × 100 ≈ 1024).
Read the paper for the real logic of course.
Another good explanation with sample code can be found here:
Damn Cool Algorithms: Cardinality Estimation - Nick's Blog

Most mutually distant k elements (clustering?)

I have a simple machine learning question:
I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart from each other. That is, I want to
Maximize:
Choose 10 different elements.
Return min distance over (all pairings within the 10).
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
Cluster the n elements into 20 clusters.
Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756.
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
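If it helps, the simple rounding option mentioned above (take the 10 entries where y_i is highest) is only a few lines; a sketch:

function roundRelaxation(y, k = 10) {
    return y.map((v, i) => [v, i])          // pair each relaxed value with its index
            .sort((a, b) => b[0] - a[0])    // largest y_i first
            .slice(0, k)
            .map(([, i]) => i);             // keep only the indices
}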
Let me know if any of this is unclear, notation or otherwise.
Nice question.
I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
Repeat 100 times:
Greedily select the data point whose removal decreases the objective function the least, and remove it.
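In code, that baseline might look like this (a sketch; D is assumed to be the full symmetric n-by-n distance matrix, and the objective is the minimum pairwise distance among the points that are kept):

function greedyFarthestSubset(D, k = 10) {
    const keep = new Set(D.map((_, i) => i));            // start with all n indices
    const minDistAfterRemoving = (skip) => {
        let best = Infinity;
        for (const i of keep) {
            if (i === skip) continue;
            for (const j of keep) {
                if (j === skip || j <= i) continue;
                if (D[i][j] < best) best = D[i][j];
            }
        }
        return best;
    };
    while (keep.size > k) {
        let bestPoint = -1, bestValue = -Infinity;
        for (const p of keep) {
            const v = minDistAfterRemoving(p);           // objective if we drop p
            if (v > bestValue) { bestValue = v; bestPoint = p; }
        }
        keep.delete(bestPoint);                          // drop the point that hurts least
    }
    return [...keep];
}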

SUBSET-SUM, upper bound on number of solutions

As you probably know, the SUBSET-SUM problem is defined as determining whether a subset of a set of whole numbers sums to a specified whole number. (There is another definition of subset-sum, where a group of integers sums to zero, but let's use this definition for now.)
For example, ((1,2,4,5), 6) is true because (2,4) sums to 6. We say that (2,4) is a "solution".
Furthermore, ((1,5,10), 7) is false because no subset of the arguments sums to 7.
My question is: given a set of argument numbers for SUBSET-SUM, is there a polynomial upper bound on the number of possible solutions? In the first example there were two solutions, (2,4) and (1,5).
We know that since SUBSET-SUM is NP-complete, deciding it in polynomial time is probably impossible. However, my question is not about the decision time; I'm asking strictly about the size of the list of solutions.
Obviously the size of the power set of the argument numbers can be an upper bound on solution list size, however this has exponential growth. My intuition is that there should be a polynomial bound, but I cannot prove this.
NB: I know this sounds like a homework question, but please trust me, it isn't. I am trying to teach myself certain aspects of CS theory, and this is where my thoughts have taken me.
No; take numbers:
(1, 2, 1+2, 4, 8, 4+8, 16, 32, 16+32, ..., 2^(2n-2), 2^(2n-1), 2^(2n-2)+2^(2n-1))
and ask about forming the sum 1 + 2 + 4 + ... + 2^(2n-2) + 2^(2n-1). (For example: with n = 3, take the set (1, 2, 3, 4, 8, 12, 16, 32, 48) and ask about the subsets summing to 63.)
You can form 1+2 either using 1 and 2 or using 1+2.
You can form 4+8 either using 4 and 8 or using 4+8.
....
You can form 2^(2n-2) + 2^(2n-1) either using 2^(2n-2) and 2^(2n-1) or using 2^(2n-2)+2^(2n-1).
The choices are independent, so there are at least 2^n = 2^(m/3) solutions, where m is the size of your set. I bet this can be strengthened considerably, but it already shows there is no polynomial bound.
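If you want to convince yourself, a brute-force count over the n = 3 example is tiny (a sketch; it prints 8, one solution for each way of picking "the two powers" or "their sum" in each triple):

function countSubsetsWithSum(nums, target) {
    let count = 0;
    for (let mask = 0; mask < (1 << nums.length); mask++) {   // enumerate all 2^m subsets
        let sum = 0;
        for (let i = 0; i < nums.length; i++) {
            if (mask & (1 << i)) sum += nums[i];
        }
        if (sum === target) count++;
    }
    return count;
}
console.log(countSubsetsWithSum([1, 2, 3, 4, 8, 12, 16, 32, 48], 63));   // 8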
Sperner's Theorem provides a nice (albeit non-polynomial) upper bound, at least in the case when the numbers in the set are strictly greater than zero (as seems to be the case in your problem).
The family of all subsets with a given sum forms a Sperner family, which is a collection of subsets where no subset in the family is itself a subset of any other subset in the family. This is where the assumption that the elements are strictly greater than zero is used. Sperner's theorem states that the maximum size of such a Sperner family is given by the binomial coefficient n choose floor(n/2).
If you drop the assumption that the n numbers are distinct, it is easy to see that this upper bound can't be improved (just take all numbers = 1 and let the target sum be floor(n/2)). I don't know if it can be improved under the assumption that the numbers are distinct.

Minimize function in adjacent items of an array

I have an array (arr) of elements, and a function (f) that takes 2 elements and returns a number.
I need a permutation of the array, such that f(arr[i], arr[i+1]) is as little as possible for each i in arr (and it should loop, i.e. it should also minimize f(arr[arr.length - 1], arr[0])).
Also, f works sort of like a distance, so f(a,b) == f(b,a)
I don't need the optimum solution if it's too inefficient, but one that works reasonably well and is fast, since I need to calculate them pretty much in real time (I don't know what the length of arr is, but I think it could be something around 30).
What does "such that f(arr[i], arr[i+1]) is as little as possible for each i in arr" mean? Do you want to minimize the sum? Do you want to minimize the largest of those? Do you want to minimize f(arr[0],arr[1]) first, then among all solutions that minimize this, pick the one that minimizes f(arr[1],arr[2]), and so on?
If you want to minimize the sum, this is exactly the Traveling Salesman Problem in its full generality (well, "metric TSP", maybe, if your f's indeed form a metric). There are clever optimizations to the naive solution that will give you the exact optimum and run in reasonable time for about n=30; you could use one of those, or one of the heuristics that give you approximations.
If you want to minimize the maximum, it is a simpler problem although still NP-hard: you can do binary search on the answer; for a particular value d, draw edges between pairs with f(x,y) <= d and check whether the resulting graph has a Hamiltonian cycle.
If you want to minimize it lexicographically, it's trivial: pick the pair with the shortest distance and put it as arr[0], arr[1], then pick the arr[2] that is closest to arr[1], and so on.
Depending on where your f(,)s are coming from, this might be a much easier problem than TSP; it would be useful for you to mention that as well.
You're not entirely clear what you're optimizing - the sum of the f(a[i],a[i+1]) values, the max of them, or something else?
In any event, with your speed limitations, greedy is probably your best bet - pick an element to make a[0] (it doesn't matter which due to the wraparound), then choose each successive element a[i+1] to be the one that minimizes f(a[i],a[i+1]).
That's going to be O(n^2), but with 30 items, unless this is in an inner loop or something that will be fine. If your f() really is associative and commutative, then you might be able to do it in O(n log n). Clearly no faster by reduction to sorting.
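A sketch of that greedy pass (f is your pair function; the wrap-around edge f(last, first) is simply whatever the greedy choices leave you with):

function greedyOrder(arr, f) {
    const remaining = arr.slice(1);
    const order = [arr[0]];                              // the starting element is arbitrary
    while (remaining.length > 0) {
        const last = order[order.length - 1];
        let bestIdx = 0;
        for (let i = 1; i < remaining.length; i++) {
            if (f(last, remaining[i]) < f(last, remaining[bestIdx])) bestIdx = i;
        }
        order.push(remaining.splice(bestIdx, 1)[0]);     // append the nearest remaining element
    }
    return order;                                        // O(n^2) overall
}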
I don't think the problem is well-defined in this form:
Let's instead define n functions g_i : Perms -> Reals by
g_i(p) = f(a^p[i], a^p[i+1]), wrapping around when i+1 > n.
To say you want to minimize f over all permutations really implies you can pick a value of i and minimize g_i over all permutations, but for any p which minimizes g_i, a related but different permutation minimizes g_j (just conjugate the permutation). So it makes no sense to speak of minimizing f over permutations for each i separately.
Unless we know something more about the structure of f(x,y) this is an NP-hard problem. Given a graph G and any vertices x,y let f(x,y) be 1 if there is no edge and 0 if there is an edge. What the problem asks is an ordering of the vertices so that the maximum f(arr[i],arr[i+1]) value is minimized. Since for this function it can only be 0 or 1, returning a 0 is equivalent to finding a Hamiltonian path in G and 1 is saying that no such path exists.
The function would have to have some sort of structure that disallows this example for it to be tractable.
