Is there any known algorithm to efficiently generate random multiset permutations with additional restrictions?
Example:
I have a multiset of items, for example {1,1,1,2,2,3,3,3}, and a restricting sequence of sets, for example {{3},{1,2},{1,2,3},{1,2,3},{1,2,3},{1,2,3},{2,3},{2,3}}. I am looking for permutations of the items where the first element must be 3, the second must be 1 or 2, and so on.
One such permutation that fits the restrictions is: {3,1,1,1,2,2,3,3}
Yes, there is. I asked in this German forum and got the following answer: The problem can be reduced to finding a maximum matching on a bipartite graph.
In order to do this, introduce a vertex for each element of the multiset. These vertices form one side of the bipartite graph. Then introduce a vertex for each restriction set. These vertices form the other side of the bipartite graph. Now add an edge between a restriction-set vertex and an element vertex if and only if the element is contained in that restriction set.
The bipartite graph for your example would look like this:
Now the matching chooses edges such that no two chosen edges share a vertex. E.g. if the first "1" is matched to the second restriction "{1,2}", it cannot be used for any other restriction any more, since using another edge from this vertex would no longer result in a matching.
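To make the reduction concrete, here is a minimal sketch using Kuhn's augmenting-path algorithm for bipartite matching; the function name and example data are illustrative. To sample different valid permutations you could shuffle the item order before matching, though that does not sample uniformly at random.

```python
def restricted_permutation(items, restrictions):
    """Assign each position an item allowed by its restriction set,
    via Kuhn's augmenting-path bipartite matching (illustrative sketch)."""
    n = len(items)
    assert len(restrictions) == n
    # adj[pos] = indices of items whose value is allowed at that position
    adj = [[i for i, item in enumerate(items) if item in allowed]
           for allowed in restrictions]
    match_item = [-1] * n  # item index -> matched position, or -1 if free

    def try_assign(pos, seen):
        for i in adj[pos]:
            if not seen[i]:
                seen[i] = True
                # Take item i if it is free, or if its current position
                # can be re-assigned to some other item.
                if match_item[i] == -1 or try_assign(match_item[i], seen):
                    match_item[i] = pos
                    return True
        return False

    for pos in range(n):
        if not try_assign(pos, [False] * n):
            return None  # no permutation satisfies all restrictions
    result = [None] * n
    for i, pos in enumerate(match_item):
        result[pos] = items[i]
    return result


print(restricted_permutation(
    [1, 1, 1, 2, 2, 3, 3, 3],
    [{3}, {1, 2}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {2, 3}, {2, 3}]))
# prints one valid permutation, e.g. [3, 1, 1, 1, 2, 2, 3, 3]
```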
Feel free to ask in case you have another question on this.
Related
Given an array of sets find the one that does not belong:
example: [[a,b,c,d], [a,b,f,g], [a,b,h,i], [j,k,l,m]]
output: [j,k,l,m]
We can see above that the first three sets have a common subset [a,b] and the last one does not. Note: There may be a case where the outlier set does have elements contained in the input group. In this case we have to find the set that has the least in common with the other sets.
I have tried iterating over the input list and keeping a count for each character (in a hash).
In a second pass, find which set has the smallest total.
In the example above, the last set would have a sum of counts of 4:
j*1 + k*1 + l*1 + m*1.
I'd like to know if there are better ways to do this.
Your description:
find the set that has the least in common with the other sets
Doing this as a general application would require computing similarity with each individual pair of sets; this does not seem to be what you describe algorithmically. Also, it's an annoying O(n^2) algorithm.
I suggest the following amendment and clarification:
find the set that least conforms to the mean of the entire list of sets.
This matches your description much better, and can be done in two simple passes, O(n*m) where you have n sets of size m.
The approach you outlined does the job quite nicely: count the occurrences of each element across all of the sets, O(nm). Then score each set according to the elements it contains, also O(nm). Keep track of which set has the lowest score.
For additional "accuracy", you could sort the scores and look for gaps in the scoring -- this would point out multiple outliers.
If you do this in Python, use the Counter class for your tally.
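For illustration, a minimal sketch of that two-pass tally-and-score approach with Counter; the example data is taken from the question and the variable names are illustrative.

```python
from collections import Counter

sets = [{'a', 'b', 'c', 'd'}, {'a', 'b', 'f', 'g'},
        {'a', 'b', 'h', 'i'}, {'j', 'k', 'l', 'm'}]

# Pass 1: count in how many sets each element occurs.
counts = Counter(elem for s in sets for elem in s)

# Pass 2: score each set by the total counts of its elements;
# the set with the lowest score is the outlier.
outlier = min(sets, key=lambda s: sum(counts[elem] for elem in s))
print(outlier)  # {'j', 'k', 'l', 'm'}
```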
You shouldn't be looking at the smallest sum of element counts on its own, because it depends on the size of the set. But if you subtract the size of the set from the sum, the result is 0 only if the set is disjoint from all the others. Another option is to look at the maximum of the counts of its elements: if that maximum is 1 for some set, then its elements belong only to that set.
There are many functions you can use. As the note states:
Note: There may be a case where the outlier set does have elements contained in the input group. In this case we have to find the set that has the least in common with the other sets.
The previous functions are not optimal. A better function would count the number of shared elements: score an element as 1 if it appears in multiple sets and 0 if it appears in only one.
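A small sketch of that 0/1 shared-element scoring, under the same illustrative data as above:

```python
from collections import Counter

sets = [{'a', 'b', 'c', 'd'}, {'a', 'b', 'f', 'g'},
        {'a', 'b', 'h', 'i'}, {'j', 'k', 'l', 'm'}]
counts = Counter(elem for s in sets for elem in s)

# Score each element 1 if it is shared with at least one other set, 0 otherwise,
# then pick the set with the fewest shared elements.
outlier = min(sets, key=lambda s: sum(1 for elem in s if counts[elem] > 1))
print(outlier)  # {'j', 'k', 'l', 'm'}
```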
What would be an appropriate algorithm or strategy to cluster the patterns in a multidimensional array of numbers in which the elements have different lengths?
An example would be an array with these elements:
0: [4,2,8,5,3,2,8]
1: [1,3,6,2]
2: [8,3,8]
3: [3,2,5,2,1,8]
The goal is to find and cluster the patterns inside those lists of numbers. For instance, element "3" contains the pattern "2,5,2,8" (not contiguous), which can also be found in element "0".
The numbers of the pattern are not contiguous in either element "0" or element "3", but they appear in the same order.
Note: the example uses integers for clarity, but the real data will use floats; instead of having to be exactly equal, two numbers will be taken as a "match" when they are within a given threshold of each other.
Edit 2:
Although Abhishek Bansai's approach is helpful, if we choose only the longest common subsequence we may miss other important patterns. For instance, take these two sequences:
0: [4,5,2,1,3,6,8,9]
1: [2,1,3,4,5,6,7,8]
The longest common subsequence would be [2,1,3,6,8] but there is another important subsequence [4,5,6,8] that we would be missing.
Edit 1:
The answer from Abhishek Bansai seems a very good way to go about this.
It's the Longest Common Subsequence algorithm:
Comparing each element with each of the other elements using this algorithm will return all the patterns, and the next step would be to generate clusters out of those patterns.
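For reference, a minimal dynamic-programming LCS sketch that returns the subsequence itself (the inputs below are the two sequences from Edit 2; the function name is illustrative). With floats, the equality test would be replaced by an abs(x - y) < threshold check, per the note above.

```python
def lcs(a, b):
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk back through the table to recover one longest common subsequence.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


print(lcs([4, 5, 2, 1, 3, 6, 8, 9], [2, 1, 3, 4, 5, 6, 7, 8]))  # [2, 1, 3, 6, 8]
```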
Since you seem to be more interested in finding a "likeness" between sequences by looking for all the matches (per edits 1,2), you'll find that there is a vast body of research in the field of Sequence Alignment. From the wiki:
In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns.
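As a rough illustration of that direction, here is a minimal Needleman-Wunsch-style global alignment score with a float tolerance. The match/mismatch/gap weights and the tolerance are illustrative assumptions; real work would likely use an existing alignment library rather than this sketch.

```python
def alignment_score(a, b, match=1.0, mismatch=-1.0, gap=-0.5, tol=1e-6):
    # Needleman-Wunsch global alignment score; two numbers are treated as a
    # "match" when they differ by less than tol (the float-threshold idea above).
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [i * gap]
        for j, y in enumerate(b, 1):
            sub = match if abs(x - y) < tol else mismatch
            cur.append(max(prev[j - 1] + sub,   # align x with y
                           prev[j] + gap,       # x aligned to a gap
                           cur[j - 1] + gap))   # y aligned to a gap
        prev = cur
    return prev[-1]


# Higher scores mean more similar sequences.
print(alignment_score([4, 5, 2, 1, 3, 6, 8, 9], [2, 1, 3, 4, 5, 6, 7, 8]))
```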
I have a 2-dimensional array of bool like this
The shape won't have any holes, and even if it does, I'll ignore them. Now I want to find the polygon enclosing my shape:
Is there any algorithm ready to use for this case? I couldn't find any, but I'm not sure whether I know the correct search-term for this task.
You can use a Delaunay triangulation and then remove the longest edges. As the length threshold, I use the average of all edge lengths multiplied by a constant.
After thinking about it a little more I figured it out, and there is an O(n) way to do it: search row-wise for the first true field that has at least one adjacent field set true. From there you can definitely take the first step to the right. From then on, just walk around the shape, deciding which direction to walk next based on the four adjacent fields.
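Not the exact walk described above, but a related sketch of the same idea: collect every exposed side of the true cells as a directed edge and chain the edges head-to-tail into the outline polygon. It assumes a single 4-connected region with no holes and no diagonal-only touches; the function name and example grid are illustrative.

```python
def outline_polygon(grid):
    # grid: 2-D list of booleans.  Every side of a True cell whose neighbour is
    # False (or out of bounds) becomes a directed edge; the edges are then
    # chained into one loop of corner coordinates.
    rows, cols = len(grid), len(grid[0])

    def filled(r, c):
        return 0 <= r < rows and 0 <= c < cols and grid[r][c]

    edges = {}  # start corner (x, y) -> end corner
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue
            if not filled(r - 1, c): edges[(c, r)] = (c + 1, r)          # top
            if not filled(r, c + 1): edges[(c + 1, r)] = (c + 1, r + 1)  # right
            if not filled(r + 1, c): edges[(c + 1, r + 1)] = (c, r + 1)  # bottom
            if not filled(r, c - 1): edges[(c, r + 1)] = (c, r)          # left

    start = next(iter(edges))
    polygon, current = [start], edges[start]
    while current != start:
        polygon.append(current)
        current = edges[current]
    return polygon  # corner coordinates, x = column, y = row


shape = [[False, True,  True,  False],
         [True,  True,  True,  True ],
         [False, True,  True,  False]]
print(outline_polygon(shape))
```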
Let us say I have some large collection of rows of data, where each element in the row is a (key, value) pair:
1) [(bird, "eagle"), (fish, "cod"), ... , (soda, "coke")]
2) [(bird, "lark"), (fish, "bass"), ..., (soda, "pepsi")]
n) ....
n+1) [(bird, "robin"), (fish, "flounder"), ..., (soda, "fanta")]
I would like the ability to run some computation that, given a new row, determines which existing row is "most similar" to it.
The most direct way I could think of to find the "most similar" row for any particular row is to directly compare that row against all other rows. This is obviously computationally very expensive.
I am looking for a solution of the following form.
A function that can take a row and generate some derived integer for that row. This returned integer would be a sort of "signature" of the row. The important property of this signature is that if two rows are very "similar" they would generate very close integers; if rows are very "different", they would generate distant integers. Obviously, identical rows would generate the same signature.
I could then take these generated signatures, together with the index of the row each one points to, and sort them all by signature. I would keep this data structure so that I can do fast lookups. Call it database B.
When I have a new row and wish to know which existing row in database B is most similar, I would:
Generate a signature for the new row
Binary search through the sorted list of (signature, index) in database B for the closest match
Return the closest matching (could be a perfect match) row in database B.
I know there is a lot of hand waving in this question. My problem is that I do not actually know what function would generate this signature. I have looked at Levenshtein distances, but those represent a transformation cost, not so much a signature. I have also considered lossy compression: two things might be "bucketable" if they compress to the same thing. I am looking for other ideas on how to do this.
Thank you.
EDIT: This is my original answer, which we will call Case 1, where there is no precedence to the keys.
You cannot do it as a sorted integer because that is one dimensional and your data is multi-dimensional. So "nearness" in that sense cannot be established on a line.
Your example shows bird, fish and soda for all 3 lines. Are the keys fixed and known? If they are not, then your first step is to hash the keys of a row to establish rows that have the same keys.
For the values, consider this as a poor man's Saturday Night similarity trick. Hash the values, any two rows that match on that hash are an exact match and represent the same "spot", zero distance.
If N is the number of key/value pairs:
The closest non-exact "nearness" would mean matching N-1 out of N values. So you generate N more hashes, each one dropping out one of the values. Any two rows that match on those hashes have N-1 out of N values in common.
The next closest non-exact "nearness" would mean matching N-2 out of N values. So you generate C(N,2) = N(N-1)/2 more hashes, this time each hash leaving out a combination of two values. Any two rows that match on those hashes have N-2 out of N values in common.
So you can see where this is going. At the logical extreme you end up with 2^N hashes, which is not very savory, but I'm assuming you would not go that far, because you reach a point where rows with too few matching values are too "far" apart to be worth considering.
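A small sketch of that idea, dropping whole (key, value) pairs (which, with a fixed key set, amounts to dropping values); the frozensets stand in for the hashing and the example rows are illustrative.

```python
from itertools import combinations

def near_hashes(row, max_dropped=1):
    # row: list of (key, value) pairs.  Emit one hash for the full row plus one
    # hash per combination of dropped pairs, up to max_dropped.  Two rows that
    # share a k-dropped hash agree on at least N - k of their pairs.
    pairs = frozenset(row)
    hashes = {hash(pairs)}
    for k in range(1, max_dropped + 1):
        for dropped in combinations(row, k):
            hashes.add(hash(pairs - frozenset(dropped)))
    return hashes


a = [("bird", "eagle"), ("fish", "cod"), ("soda", "coke")]
b = [("bird", "eagle"), ("fish", "cod"), ("soda", "pepsi")]
# The rows differ in one value, so their drop-one hash sets overlap.
print(bool(near_hashes(a) & near_hashes(b)))  # True
```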
EDIT: To see how we cannot escape dimensionality, consider just two keys, with values 1-10. Plot all possible values on a graph. We see that {1,1} is close to {2,2}, but also that {5,6} is close to {6,7}. So we get a brainstorm and say, Aha! I'll calculate each point's distance from the origin using the Pythagorean theorem! This will make both {1,1} and {2,2} easy to detect. But then the two points {1,10} and {10,1} will get the same number, even though they are as far apart as they can be on the graph. So we say, OK, I need to add the angle for each. Two points at the same distance are distinguished by their angle, and two points at the same angle are distinguished by their distance. But of course now we've plotted them in two dimensions.
EDIT: Case 2 would be when there is precedence to the keys, when key 1 is more significant than key 2, which is more significant than key 3, etc. In this case, if the allowed values were A-Z, you would string the values together as if they were digits to get a sortable value. ABC is very close to ABD, but very far from BBD.
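A tiny sketch of that Case 2 idea, assuming a fixed, known key precedence and single-character values (both illustrative):

```python
def sort_key(row, key_order=("bird", "fish", "soda")):
    # Concatenate the values in order of key precedence to get a sortable string.
    values = dict(row)
    return "".join(values[k] for k in key_order)


print(sort_key([("fish", "B"), ("soda", "C"), ("bird", "A")]))  # ABC
# "ABC" sorts right next to "ABD", far from "BBD".
print(sorted(["BBD", "ABD", "ABC"]))  # ['ABC', 'ABD', 'BBD']
```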
If you had a lot of data, and wanted to do this hardcore, I would suggest a statistical method like PLSA or PSVM, which can extract identifying topics from text and identify documents with similar topic probabilities.
A simpler, but less accurate way of doing it is using Soundex, which is available for many languages. You can store the soundex (which will be a short string, not an integer I'm afraid), and look for exact matches to the soundex, which should point to similar rows.
I think it's unrealistic to expect a function to turn a series of strings into an integer such that integers near each other map to similar strings. The closest you might come is doing a checksum on each individual tuple, and comparing the checksums for the new row to the checksums of existing rows, but I'm guessing you're trying to come up with a single number you can index on.
I have my graph implemented with linked lists, for both vertices and edges, and that is becoming an issue for Dijkstra's algorithm. As I said in a previous question, I'm converting code that uses an adjacency matrix to work with my graph implementation.
The problem is that when I find the minimum value I get an array index. This index would match the vertex index if the graph vertices were stored in an array instead, and access to the vertex would be constant.
I don't have time to change my graph implementation, but I do have a hash table, indexed by a unique number (one that does not start at 0; it's something like 100090000), which is the problem I'm having. Whenever I need to, I use the modulo operator to get a number between 0 and the total number of vertices.
This works fine for when I need an array index from the number, but when I need the number from the array index (to access the calculated minimum distance vertex in constant time), not so much.
I tried to search for how to invert the modulo operation, i.e. given 100090000 mod 18000 = 10000, recover 100090000 from 10000, but couldn't find a way to do it.
My next alternative is to build some sort of reference array where, in the example above, arr[10000] = 100090000. That would fix the problem, but it would require looping over the whole graph one more time.
Do I have any better/easier solution with my current graph implementation?
In your array, instead of just storing the count (or whatever you're storing there), store a structure which contains the count as well as the vertex index number.
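A tiny sketch of that suggestion: store the distance together with the original vertex key, so taking the minimum yields the key needed for the hash-table lookup directly. The keys and distances below are illustrative.

```python
# Each slot stores (distance, original_vertex_key) instead of just the distance,
# so finding the minimum also gives the key to look the vertex up in the hash table.
dist = [
    (float("inf"), 100090000),
    (7.0,          100090001),
    (3.0,          100090002),
]

best_distance, best_key = min(dist)  # tuples compare by distance first
print(best_distance, best_key)       # 3.0 100090002
```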