What would be an appropriate algorithm or strategy to cluster the patterns in a multidimensional array of numbers in which the elements have different lengths?
An example would be an array with these elements:
0: [4,2,8,5,3,2,8]
1: [1,3,6,2]
2: [8,3,8]
3: [3,2,5,2,1,8]
The goal is to find and cluster the patterns inside those lists of numbers. For instance in element "3" there is the pattern: "2,5,2,8" (not contiguous) which can also be found in element "0".
The numbers of the pattern are not contiguous in either element "0" or element "3", but they appear in the same order.
Note: the example uses integers for clarity, but the real data will use floats, and instead of having to be exactly equal, two values will be taken as a "match" when they differ by no more than a given threshold.
.
Edit 2:
Although Abhishek Bansai's approach is helpful, if we choose only the longest common subsequence we may miss other important patterns. For instance, take these two sequences:
0: [4,5,2,1,3,6,8,9]
1: [2,1,3,4,5,6,7,8]
The longest common subsequence would be [2,1,3,6,8] but there is another important subsequence [4,5,6,8] that we would be missing.
.
Edit 1:
The answer from Abhishek Bansai seems a very good way to go about this.
It's the Longest Common Subsequence algorithm:
Comparing each element with each of the other elements using this algorithm will return all the patterns, and the next step would be to generate clusters out of those patterns.
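A minimal sketch of that pairwise step, assuming the standard dynamic-programming LCS with the equality test replaced by a threshold check (as the note about floats asks). The function name `lcs_with_tolerance` and the parameter `tol` are made up for illustration, and it returns only one longest subsequence, so it does not address the concern raised in Edit 2.

```python
from typing import List, Sequence


def lcs_with_tolerance(a: Sequence[float], b: Sequence[float], tol: float = 0.0) -> List[float]:
    """Return one longest common subsequence of a and b.

    Two values count as a match when they differ by at most `tol`.
    The returned values are taken from `a`.
    """
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if abs(a[i] - b[j]) <= tol:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the top-left corner to recover one subsequence.
    out: List[float] = []
    i = j = 0
    while i < n and j < m:
        if abs(a[i] - b[j]) <= tol:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out
```

For elements "0" and "3" of the example this recovers the pattern 2,5,2,8. Running it on every pair of elements costs O(n*m) per pair.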
Since you seem to be more interested in finding a "likeness" between sequences by looking for all the matches (per edits 1,2), you'll find that there is a vast body of research in the field of Sequence Alignment. From the wiki:
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
Given an array of sets find the one that does not belong:
example: [[a,b,c,d], [a,b,f,g], [a,b,h,i], [j,k,l,m]]
output: [j,k,l,m]
We can see above that the first three sets have a common subset [a,b] and the last one does not. Note: There may be a case where the outlier set does have elements contained in the input group. In this case we have to find the set that has the least in common with the other sets.
I have tried iterating over the input list and keeping a count for each character (in a hash).
In a second pass, find which set has the smallest total.
In the example above, the last set would have a sum of counts of 4:
j*1 + k*1 + l*1 + m*1.
I'd like to know if there are better ways to do this.
Your description:
find the set that has the least in common with the other sets
Doing this as a general application would require computing similarity with each individual pair of sets; this does not seem to be what you describe algorithmically. Also, it's an annoying O(n^2) algorithm.
I suggest the following amendment and clarification:
find the set that least conforms to the mean of the entire list of sets.
This matches your description much better, and can be done in two simple passes, O(n*m) where you have n sets of size m.
The approach you outlined does the job quite nicely: count the occurrences of each element in all of the sets, O(nm). Then score each set according to the elements it contains, also O(nm). Keep track of which set has the lowest score.
For additional "accuracy", you could sort the scores and look for gaps in the scoring -- this would point out multiple outliers.
If you do this in Python, use the Counter class for your tally.
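A rough sketch of the two passes with `collections.Counter`. The function name `least_conforming` is invented here, and each input "set" is assumed to be any iterable of hashable items.

```python
from collections import Counter


def least_conforming(sets):
    """Return (outlier, scores): the set with the lowest score, plus all scores."""
    # Pass 1: tally how many sets each element appears in.
    tally = Counter(x for s in sets for x in set(s))
    # Pass 2: a set's score is the sum of the tallies of its elements.
    scores = [sum(tally[x] for x in set(s)) for s in sets]
    lowest = min(range(len(sets)), key=scores.__getitem__)
    return sets[lowest], scores
```

On the example from the question, the first three sets score 8 each and [j,k,l,m] scores 4. Sorting `scores` and looking for gaps, as suggested above, is a one-liner on top of this.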
You shouldn't be looking for the smallest sum of element counts, because that depends on the size of the set. But if you subtract the size of the set from the sum, the result is 0 only if the set is disjoint from all the others. Another option is to look at the maximum count among its elements. If the maximum is one for a set, then its elements belong only to that set.
There are many functions you can use. As the note states:
Note: There may be a case where the outlier set does have elements contained in the input group. In this case we have to find the set that has the least in common with the other sets.
The previous functions are not optimal. A better function would count the number of shared elements. Set the value of an element to 1 if it's in multiple sets and 0 if it appears only once.
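Something like the following, as a sketch of that last scoring function (the name `shared_element_scores` is made up): each element contributes 1 if it occurs in more than one set and 0 otherwise, so the score no longer grows just because a set is large.

```python
from collections import Counter


def shared_element_scores(sets):
    """Score each set by how many of its elements also occur in another set."""
    # Number of sets each element appears in.
    tally = Counter(x for s in sets for x in set(s))
    return [sum(1 for x in set(s) if tally[x] > 1) for s in sets]
```

The set with the lowest score is then the one with the least in common with the others.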
I'm not sure whether to post this in Mathematics or here, but it's Algorithms, so I'm going to try here.
Essentially, I have an algorithm to find the median value in an array. It is, in essence, the quick select algorithm.
What I want to do is find arrays of numbers that satisfy the average case. I.e., when the array length is 5, the average number of basic operations is 6. I want to find the relationship between the arrays that output 6, so that I can programmatically build an array of length x and get the number of operations.
I've been generating a tonne of arrays to try and find a pattern by hand, and I can't see it. I have been using the permutations of {1,2,3,4,5}; going higher than that becomes too unwieldy to look at (minimum 720 arrays), and going lower leaves too little variation to find a pattern.
The way I found 6 basic operations was to run the algorithm over every permutation of {1,2,3,4,5} and output the results into a list, which I then piped into Python and ran sum(list)/len(list). I then manually went through and found the arrays that output 6, and looked at them. First alone, and then in the group, to see if I could find any characteristics.
The first pivot is always zero.
I'm either looking for some kind of formula to generate these arrays, or a way to analyse the data to obtain the pattern.
Edit:
I should clarify that I am looking for a way to programmatically generate arrays that meets the criteria of 'average case' for the quick select.
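One way to take the hand work out of the experiment described above is to group the permutations by cost automatically. The sketch below is only an illustration: the question does not show the actual algorithm or say what counts as a "basic operation", so it assumes a quick select that always pivots on index 0 and counts one operation per comparison against the pivot. The names `quickselect_ops` and `group_by_cost` are hypothetical, and the counts it produces need not reproduce the figure of 6.

```python
from collections import defaultdict
from itertools import permutations


def quickselect_ops(arr, k):
    """Return (k-th smallest element, comparisons used); k is 0-based.

    Assumes distinct elements and always pivots on index 0.
    """
    items = list(arr)
    ops = 0
    while True:
        pivot = items[0]
        smaller, larger = [], []
        for x in items[1:]:
            ops += 1  # one comparison against the pivot
            (smaller if x < pivot else larger).append(x)
        if k < len(smaller):
            items = smaller
        elif k == len(smaller):
            return pivot, ops
        else:
            k -= len(smaller) + 1
            items = larger


def group_by_cost(n):
    """Map each operation count to the permutations of 1..n that produce it."""
    groups = defaultdict(list)
    for perm in permutations(range(1, n + 1)):
        _, ops = quickselect_ops(perm, (n - 1) // 2)  # median position
        groups[ops].append(perm)
    return groups
```

With the groups in hand, the arrays sharing a given cost can be inspected or compared programmatically rather than by eye.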
Say, I have a sorted array of n elements. I want to find 2 different keys k1 and k2 in this array using Binary search.
A basic solution would be to apply binary search to each key separately: two calls for two keys, for a total cost of about 2 log n comparisons.
Can we solve this problem using any other approach(es) for different k keys, k < n ?
Each search you complete can be used to subdivide the input to make it more efficient. For example suppose the element corresponding to k1 is at index i1. If k2 > k1 you can restrict the second search to i1..n, otherwise restrict it to 0..i1.
Best case is when your search keys are sorted also, so every new search can begin where the last one was found.
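A small sketch of the sorted-keys case, using the `lo` argument of `bisect.bisect_left` from the standard library so that each search starts where the previous one ended. The function name `search_sorted_keys` and the use of -1 for "not found" are choices made here for illustration.

```python
from bisect import bisect_left


def search_sorted_keys(arr, keys):
    """Map each key to its index in the sorted list arr, or -1 if absent."""
    result = {}
    lo = 0
    for key in sorted(keys):
        # Everything before lo is smaller than the previous key,
        # so it cannot contain this key either.
        lo = bisect_left(arr, key, lo)
        result[key] = lo if lo < len(arr) and arr[lo] == key else -1
    return result
```

The big O is unchanged, but later searches run over a shrinking suffix of the array.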
You can reduce the real complexity (although it will still be the same big O) by walking the shared search path once. That is, start the binary search until the element you're at is between the two items you are looking for. At that point, spawn a thread to continue the binary search for one element in the range past the pivot element you're at and spawn a thread to continue the binary search for the other element in the range before the pivot element you're at. Return both results. :-)
EDIT:
As Oli Charlesworth mentioned in his comment, you did ask for an arbitrary number of elements. This same logic can be extended to an arbitrary number of search keys though. Here is an example:
You have an array of search keys like so:
searchKeys = ['findme1', 'findme2', ...]
You have a key-value data structure that maps a search key to the value found:
keyToValue = {'findme1': 'foundme1', 'findme2': 'foundme2', 'findme3': 'NOT_FOUND_VALUE'}
Now, following the same logic as before this EDIT, you can pass a "pruned" searchKeys array on each thread spawn where the keys diverge at the pivot. Each time you find a value for the given key, you update the keyToValue map. When there are no more ranges to search but still values in the searchKeys array, you can assume those keys are not to be found and you can update the mapping to signify that in some way (some null-like value perhaps?). When all threads have been joined (or by use of a counter), you return the mapping. The big win here is that you did not have to repeat the initial search logic that any two keys may share.
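Here is a single-threaded sketch of that idea, with plain recursion standing in for the thread spawns (an assumption made to keep the example short; the splitting logic is the same). The name `multi_search` is invented, the mapping stores the index where each key was found, and `None` plays the role of the "not found" value.

```python
def multi_search(arr, keys):
    """Map each key to its index in the sorted list arr, or None if absent."""
    found = {key: None for key in keys}

    def recurse(lo, hi, pending):
        # Nothing left to look for, or no range left to look in.
        if not pending or lo >= hi:
            return
        mid = (lo + hi) // 2
        pivot = arr[mid]
        for key in pending:
            if key == pivot:
                found[key] = mid
        # The keys diverge at the pivot: each side gets a pruned key list.
        recurse(lo, mid, [k for k in pending if k < pivot])
        recurse(mid + 1, hi, [k for k in pending if k > pivot])

    recurse(0, len(arr), list(keys))
    return found
```

Each array element on a shared search path is visited once for all the keys that pass through it, which is the saving described above.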
Second EDIT:
As Mark has added in his answer, sorting the search keys allows you to only have to look at the first item in the key range.
You can find academic articles calculating the complexity of different schemes for the general case, which is merging two sorted sequences of possibly very different lengths using the minimum number of comparisons. The paper at http://www.math.cmu.edu/~af1p/Texfiles/HL.pdf analyses one of the best known schemes, by Hwang and Lin, and has references to other schemes, and to the original paper by Hwang and Lin.
It looks a lot like a merge which steps through each item of the smaller list, skipping along the larger list with a stepsize that is the ratio of the sizes of the two lists. If it finds out that it has stepped too far along the large list it can use binary search to find a match amongst the values it has stepped over. If it has not stepped far enough, it takes another step.
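To make the description concrete, here is a simplified sketch of the stepping idea. It is not the Hwang-Lin algorithm itself, only an illustration of "step by the size ratio, then binary search what was stepped over". The name `stride_search` and the -1 sentinel are made up, and both inputs are assumed to be sorted.

```python
from bisect import bisect_left


def stride_search(big, small):
    """For each key in sorted `small`, return its index in sorted `big` or -1."""
    n = len(big)
    step = max(1, n // max(1, len(small)))  # ratio of the two sizes
    positions = []
    lo = 0  # every element before lo is smaller than the current key
    for key in small:
        probe = lo
        while lo < n:
            probe = min(lo + step - 1, n - 1)
            if big[probe] < key:
                lo = probe + 1  # not far enough: take another step
            else:
                break  # stepped far enough (possibly too far)
        if lo >= n:
            positions.append(-1)  # key is larger than everything in big
            continue
        # Binary search among the values that were stepped over.
        pos = bisect_left(big, key, lo, probe + 1)
        positions.append(pos if big[pos] == key else -1)
        lo = pos
    return positions
```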
Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair:
1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
I would like the ability to run some computation that would allow me to determine for a new row, what is the row that is "most similar" to this row?
The most direct way I could think of finding the "most similar" row for any particular row is to directly compare said row against all other rows. This is obviously computationally very expensive.
I am looking for a solution of the following form.
A function that can take a row, and generate some derivative integer for that row. This returned integer would be a sort of "signature" of the row. The important property of this signature is that if two rows are very "similar" they would generate very close integers, if rows are very "different", they would generate distant integers. Obviously, if they are identical rows they would generate the same signature.
I could then take these generated signatures, with the index of the row they point to, and sort them all by their signatures. This data structure I would keep so that I can do fast lookups. Call it database B.
When I have a new row and wish to know which existing row in database B is most similar, I would:
Generate a signature for the new row
Binary search through the sorted list of (signature,index) in database B for the closest match
Return the closest matching (could be a perfect match) row in database B.
I know there is a lot of hand waving in this question. My problem is that I do not actually know what the function would be that would generate this signature. I see Levenshtein distances, but those represent the transformation cost, not so much the signature. I see that I could try lossy compression: two things might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.
Thank you.
EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys
You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.
Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.
For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values; any two rows that match on that hash are an exact match and represent the same "spot", zero distance.
If N is the number of key/value pairs:
The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.
The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate N(N-1)/2 more hashes, one for each pair of values you could leave out. Any two rows that match on those hashes have N-2 out of N values in common.
So you can see where this is going. At the logical extreme you end up with 2^N hashes, not very savory, but I'm assuming you would not go that far because you reach a point where too few matching values would be considered too "far" to be worth considering.
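A sketch of the drop-out hashing for Case 1, assuming every row carries the same keys. Rather than computing a numeric hash, it uses the tuple of kept (key, value) pairs directly as a dictionary key, which amounts to the same lookup. The function names and the sample rows in the usage note are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def partial_signatures(row, max_dropped=1):
    """All signatures of a row with up to `max_dropped` pairs left out."""
    items = tuple(sorted(row))
    sigs = set()
    for dropped in range(max_dropped + 1):
        for kept in combinations(items, len(items) - dropped):
            sigs.add(kept)
    return sigs


def build_index(rows, max_dropped=1):
    """Map every signature to the ids of the rows that produce it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for sig in partial_signatures(row, max_dropped):
            index[sig].add(row_id)
    return index


def near_matches(index, row, max_dropped=1):
    """Ids of indexed rows that differ from `row` in at most `max_dropped` values."""
    hits = set()
    for sig in partial_signatures(row, max_dropped):
        hits |= index.get(sig, set())
    return hits
```

For instance, with rows (eagle, cod, coke), (lark, bass, pepsi) and (eagle, cod, fanta) indexed, a query for (eagle, cod, pepsi) with `max_dropped=1` returns the first and third rows, since each shares two of three values with it.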
EDIT: To see how we cannot escape dimensionality, consider just two keys, with values 1-9. Plot all possible values on a graph. We see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm, we say, Aha! I'll calculate each point's distance from the origin using the Pythagorean theorem! This will make both {1,1} and {2,2} easy to detect. But then the two points {1,9} and {9,1} will get the same number, even though they are as far apart as they can be on the graph. So we say, ok, I need to add the angle for each. Two points at the same distance are distinguished by their angle, two points at the same angle are distinguished by their distance. But of course now we've plotted them on two dimensions.
EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.
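Case 2 can be sketched like this, assuming the precedence order of the keys is known up front. The signature is a tuple of values in precedence order (the equivalent of stringing them together), and "closest" is taken to mean the sorted neighbour that shares the longest leading run of values. Both function names are hypothetical.

```python
from bisect import bisect_left


def sortable_signature(row, key_order):
    """Values of the row arranged from most to least significant key."""
    values = dict(row)
    return tuple(values[key] for key in key_order)


def nearest_by_precedence(sorted_sigs, sig):
    """Return the neighbour of `sig` in `sorted_sigs` sharing the longest prefix."""

    def common_prefix(a, b):
        count = 0
        for x, y in zip(a, b):
            if x != y:
                break
            count += 1
        return count

    pos = bisect_left(sorted_sigs, sig)
    # Only the entries just before and at the insertion point can be nearest.
    neighbours = sorted_sigs[max(0, pos - 1):pos + 1]
    if not neighbours:
        return None
    return max(neighbours, key=lambda other: common_prefix(sig, other))
```

With signatures ABC and BBD stored, a lookup for ABD lands between them and returns ABC.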
If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.
A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.
I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.
I want to find the longest possible sequence of words that match the following rules:
Each word can be used at most once
All words are Strings
Two strings sa and sb can be concatenated if the LAST two characters of sa match the first two characters of sb.
In the case of concatenation, it is performed by overlapping those characters. For example:
sa = "torino"
sb = "novara"
sa concat sb = "torinovara"
For example, I have the following input file, "input.txt":
novara
torino
vercelli
ravenna
napoli
livorno
messania
noviligure
roma
And, the output of the above file according to the above rules should be:
torino
novara
ravenna
napoli
livorno
noviligure
since the longest possible concatenation is:
torinovaravennapolivornoviligure
Can anyone please help me out with this? What would be the best data structure for this?
This can be presented as a directed graph problem -- the nodes are words, and they are connected by an edge if they overlap (and the smallest overlap is chosen to get the longest length), and then find the highest weight non-intersecting path.
(Well, actually, you want to expand the graph a bit to handle beginning and ending at a word. Adjoin a "starting node" with an edge to each word of weight (length of word) / 2.
Between words, use weight (length of first word + length of second word) / 2 - overlap, and between each word and an "ending node" use weight (length of word) / 2. That way the weights along a path add up to the length of the concatenated string. Might be easier to double everything.)
https://cstheory.stackexchange.com/questions/3684/max-non-overlapping-path-in-weighted-graph
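For small inputs the graph formulation can be brute-forced. The sketch below builds the "can follow" edges for the two-character overlap rule and runs a backtracking depth-first search over simple paths, keeping the one whose joined string is longest. It assumes distinct words of at least two characters, and it is exponential in the worst case, so it is only meant to show the structure of the problem. All function names are made up.

```python
def joined_length(path):
    """Length of the string obtained by overlapping consecutive words by two characters."""
    return sum(len(w) for w in path) - 2 * max(0, len(path) - 1)


def join(path):
    """Concatenate the words of a path, overlapping two characters at each joint."""
    if not path:
        return ""
    out = path[0]
    for word in path[1:]:
        out += word[2:]
    return out


def longest_chain(words):
    """Exhaustively search for the chain with the longest joined string."""
    # Directed edge w -> v when the last two letters of w start v.
    followers = {w: [v for v in words if v != w and w[-2:] == v[:2]] for w in words}
    best = []

    def extend(path, used):
        nonlocal best
        if joined_length(path) > joined_length(best):
            best = list(path)
        for nxt in followers[path[-1]]:
            if nxt not in used:
                used.add(nxt)
                path.append(nxt)
                extend(path, used)
                path.pop()
                used.remove(nxt)

    for word in words:
        extend([word], {word})
    return best
```

On the nine words from the question (with "livorno" spelled as in the expected output) this finds torino, novara, ravenna, napoli, livorno, noviligure.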
I would start really simple. Make 2 vectors of strings, one sorted normally, one sorted by the last 2 letters. Create an index (vector of ints) for the second vector that points out its position in the first.
To find the longest, first remove the orphans: words that don't match anything at either end. Then you want to build a neighbor joining tree; this is where you determine which words can possibly reach each other. If you have 2 or more trees you should try the largest tree first.
Now with a tree your job is to find ends that are rare, bind them to other ends, and repeat. This should get you a pretty nice solution. If it uses all the words you're golden; skip the other trees. If it doesn't, then you're into a whole slew of algorithms to make this efficient.
Some items to consider:
If you have 3+ unique ends you are guaranteed to drop 1+ words.
This can be used to prune your trees down while hunting for an answer.
Recalculate unique ends often.
Odd numbers of a given end ensure that one must be dropped (you get 2 freebies at the ends).
Segregate words that can self match; you can always throw them in last, and they muck up the math otherwise.
You may be able to create small self matching rings, you can treat these like self matching words, as long as you don't orphan them when you create them. This can make the performance fantastic, but no guarantees on a perfect solution.
The search space is O(N!), so for a list of millions of elements it may be hard to prove an exact answer. Of course, I could be overlooking something.