Given a 2D array of Boolean values, I want to find all patterns that consist of at least 2 columns and at least 2 rows. The problem is somewhat close to finding cliques in a graph.
In the example below, green cells represent "true" bits and grey cells are "false". Pattern 1 contains columns 1, 3, 4 and 5 and rows 1 and 2. Pattern 2 contains only columns 2 and 4, and rows 2, 3, 4.
The business idea behind this is finding similarity patterns among various groups of social-network users. In the real world, the number of rows can go up to 3e7, and the number of columns up to 300.
I can't really figure out a solution other than brute-force matching.
Please advise the proper name of this problem, so I can read more about it, or suggest an elegant solution.
This is (equivalent to) asking for all bicliques (complete bipartite subgraphs) larger than a certain size in a bipartite graph. Here the rows are the vertices of one part A of the graph, and the columns are the vertices of the other part B, and there is an edge between u ∈ A and v ∈ B whenever the cell at row u, column v is green.
Although you say that you want to find all patterns, you probably want to find only maximal ones -- that is, patterns that cannot be extended to become larger patterns by adding more rows or columns. (Otherwise, for any pattern with c >= 2 columns and r >= 3 rows, you will also get back the more than 2^(c-2)*2^(r-3) non-maximal patterns that can be formed by deleting some of the rows or columns.)
But even listing just the maximal patterns can take time exponential in the number of rows and columns, assuming that P != NP. That's because the problem of finding a maximum (i.e. largest-possible) pattern, in terms of the total number of green cells, has been proven to be NP-complete: if it were possible to list all maximal patterns in polynomial time, then we could simply do so, and pick the largest, thereby solving this NP-complete problem in polynomial time.
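That said, for small grids (like the toy example above) you can still enumerate the maximal patterns by brute force over column subsets. Here's a rough Python sketch of that idea; maximal_patterns is just an illustrative name, and this will not scale to 300 columns and 3e7 rows:

```python
from itertools import combinations

def maximal_patterns(grid):
    # grid[r][c] is True for a green cell. Yields (rows, cols) for every
    # maximal pattern with at least 2 rows and at least 2 columns.
    n_rows, n_cols = len(grid), len(grid[0])
    for k in range(2, n_cols + 1):
        for cols in combinations(range(n_cols), k):
            # Rows that are green in every chosen column.
            rows = [r for r in range(n_rows) if all(grid[r][c] for c in cols)]
            if len(rows) < 2:
                continue
            # Closure: every column that is green on all of those rows.
            closed = tuple(c for c in range(n_cols)
                           if all(grid[r][c] for r in rows))
            if closed == cols:  # no column can be added, so the pattern is maximal
                yield rows, list(cols)
```

Each maximal pattern corresponds to a "closed" column set together with all rows green on it, so the check closed == cols both guarantees maximality and avoids duplicates.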
I'm reading a paper called Unsupervised Rank Aggregation with Domain-Specific Expertise, and under section 2.1 they talk about the distance between two permutations of a list. Some examples of this distance metric are the Kendall Tau Distance and the Spearman Footrule Distance. A property this distance metric could have is right-invariance. In the paper, if a metric has this property it means that it does not depend on how the objects are indexed.
This part confuses me, as I don't really understand the difference between an object's rank and an object's index. If an object is in a ranked list, wouldn't its index directly correlate to its rank? Additionally, they mention that the Kendall Tau Distance is right-invariant, yet its formula shows that it is dependent on the indices of objects i and j. So, what exactly is right-invariance in the context of rank aggregation?
The objects you're ranking arrive to the algorithm in a list, and the rankings you're aggregating arrive to the algorithm as permutations that act on the list. The order of the list (that is, the indices of the objects in the input list) is not supposed to matter: the algorithm is supposed to rank the objects in the same way (assign the same new indices) no matter the original ordering (ignoring the original indices).
The new indices correspond to the rank and are important. The old indices (both in the list of objects and in the input rankings) are an artifact of the input representation, and care must be taken to make sure they're ignored. Saying that the indices of the objects in the input list don't matter is the same as saying that shuffling the input list doesn't change the output of the algorithm.
Since the rankings you're aggregating are represented by permutations of the input list, shuffling the input list by some permutation requires you to right-multiply all the ranking permutations by the inverse of the shuffling permutation in order to get the same actual rankings of the objects. Since all of these new shuffled ranking permutations still represent the same rankings, the distance metric used to compare the rankings had better be insensitive to the change: such insensitivity is called right-invariance.
As to the right-invariance of the Kendall tau distance: consider the formula 2(x - 3) + 6 - 2x. This looks like it depends on the number you choose for x, but actually it's always zero, so it doesn't depend on x at all. Similarly for the Kendall tau: it might not be immediately obvious that it is right-invariant (certainly not obvious to me); it's something that you'd probably have to sit down and prove to yourself mathematically. (If they don't even reference a proof, I'd assume it's actually quite trivial if you think about it, but this isn't my field and I'm not going to get it without some pencil and paper.)
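If you'd rather check it numerically before reaching for pencil and paper, here's a small Python sanity check, assuming permutations are represented as rank vectors and compose relabels the objects (i.e. multiplies on the right):

```python
from itertools import combinations
import random

def kendall_tau(sigma, tau):
    # sigma[i], tau[i] = rank assigned to object i by each ranking.
    # Count the pairs of objects on which the two rankings disagree.
    n = len(sigma)
    return sum(
        1
        for i, j in combinations(range(n), 2)
        if (sigma[i] < sigma[j]) != (tau[i] < tau[j])
    )

def compose(perm, pi):
    # (perm o pi)[i] = perm[pi[i]]: relabel the objects by pi first.
    return [perm[pi[i]] for i in range(len(perm))]

n = 6
sigma = random.sample(range(n), n)
tau = random.sample(range(n), n)
pi = random.sample(range(n), n)  # an arbitrary relabelling of the objects

# Right-invariance: relabelling the objects the same way in both rankings
# leaves the distance unchanged.
assert kendall_tau(sigma, tau) == kendall_tau(compose(sigma, pi), compose(tau, pi))
```

The reason the assertion always holds is that the relabelling just permutes which pair of objects you look at; the set of discordant pairs is the same, only renamed.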
In a relation R of N attributes, how many functional dependencies are there (including trivial ones)?
I know that trivial dependencies are those where the right-hand side is a subset of the left-hand side, but I'm not sure how to calculate the upper bound on the number of dependencies.
Any information regarding the answer and the approach to it would be greatly appreciated.
The maximum possible number of functional dependencies is
the number of possible left-hand sides * the number of possible right-hand sides
We're including trivial functional dependencies, so the number of possible left-hand sides equals the number of possible right-hand sides. So this simplifies to
(the number of possible left-hand sides)²
Let's say you have R{∅AB}. There are three attributes.¹ The number of possible left-hand sides is
combinations of 3 attributes taken 1 at a time, plus
combinations of 3 attributes taken 2 at a time, plus
combinations of 3 attributes taken 3 at a time
which equals 3+3+1, or 7. So there are at most 7² possible functional dependencies for any R having three attributes: 49. The order of attributes doesn't matter, so we use formulas for combinations, not for permutations.
If you start with R{∅ABC}, you have
combinations of 4 attributes taken 1 at a time, plus
combinations of 4 attributes taken 2 at a time, plus
combinations of 4 attributes taken 3 at a time, plus
combinations of 4 attributes taken 4 at a time
which equals 4+6+4+1, or 15. So there are at most 15² possible functional dependencies for any R having four attributes: 225.
Once you know this formula, these calculations are simple using a spreadsheet. It's also pretty easy to write a program to generate every possible functional dependency using a scripting language like Ruby or Python.
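For example, a minimal Python sketch along those lines, treating ∅ as an attribute exactly as in the example above:

```python
from itertools import combinations

attrs = ['∅', 'A', 'B']     # treating ∅ as an attribute, as above

# All non-empty subsets of the attributes: these are the possible sides.
sides = [set(c) for r in range(1, len(attrs) + 1)
                for c in combinations(attrs, r)]
print(len(sides))           # 7

# Every possible functional dependency is a (left side, right side) pair.
fds = [(lhs, rhs) for lhs in sides for rhs in sides]
print(len(fds))             # 49
```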
The Wikipedia article on combinations has examples of how to count the combinations, with and without using factorials.
All possible combinations from R{∅AB}:
A->A A->B A->∅ A->AB A->A∅ A->B∅ A->AB∅
B->A B->B B->∅ B->AB B->A∅ B->B∅ B->AB∅
∅->A ∅->B ∅->∅ ∅->AB ∅->A∅ ∅->B∅ ∅->AB∅
AB->A AB->B AB->∅ AB->AB AB->A∅ AB->B∅ AB->AB∅
A∅->A A∅->B A∅->∅ A∅->AB A∅->A∅ A∅->B∅ A∅->AB∅
B∅->A B∅->B B∅->∅ B∅->AB B∅->A∅ B∅->B∅ B∅->AB∅
AB∅->A AB∅->B AB∅->∅ AB∅->AB AB∅->A∅ AB∅->B∅ AB∅->AB∅
¹ Most people ignore the empty set. They'd say R{∅AB} has only two attributes, A and B, and they'd write it as R{AB}.
I have an m x n matrix of real numbers. I want to choose a single value from each column such that the sum of my selected values is as close as possible to a pre-specified total.
I am not an experienced programmer (although I have an experienced friend who will help). I would like to achieve this using MATLAB, Mathematica, or C++ (MySQL if necessary).
The code only needs to run a few times, once every few days - it does not necessarily need to be optimised. I will have 16 columns and about 12 rows.
Normally I would suggest dynamic programming, but there are a few features of this situation suggesting an alternative approach. First, the performance demands are light; this program will be run only a couple times, and it doesn't sound as though a running time on the order of hours would be a problem. Second, the matrix is fairly small. Third, the matrix contains real numbers, so it would be necessary to round and then do a somewhat sophisticated search to ensure that the optimal possibility was not missed.
Instead, I'm going to suggest the following semi-brute-force approach. 12**16 ~ 1.8e17, the total number of possible choices, is too many, but 12**9 ~ 5.2e9 is doable with brute force, and 12**7 ~ 3.6e7 fits comfortably in memory. Compute all possible choices for the first seven columns. Sort these possibilities by total. For each possible choice for the last nine columns, use an efficient search algorithm to find the best mate among the first seven. (If you have a lot of memory, you could try eight and eight.)
I would attempt a first implementation in C++, using std::sort and std::lower_bound from the <algorithm> standard header. Measure it; if it's too slow, then try an in-memory B+-tree (does Boost have one?).
I spent some more time thinking about how to implement what I wrote above in the simplest way possible. Here's an approach that will work well for a 12 by 16 matrix on a 64-bit machine with roughly 4 GB of memory.
The number of choices for the first eight columns is 12**8. Each choice is represented by a 4-byte integer between 0 and 12**8 - 1. To decode a choice index i, the row for the first column is given by i % 12; update i /= 12, and the row for the second column is then given by i % 12, et cetera.
A vector holding all choices requires roughly 12**8 * 4 bytes, or about 1.6 GB. Two such vectors require 3.2 GB. Prepare one for the first eight columns and one for the last eight. Sort them by the sum of the entries that they indicate. Use saddleback search to find the best combination. (Initialize an iterator into the first vector and a reverse iterator into the second. While neither iterator is at its end, compare the current combination against the current best and update the current best if necessary. If the current combination sums to less than the target, increment the first iterator. If the sum is greater than the target, increment the second iterator.)
I would estimate that this requires less than 50 lines of C++.
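For what it's worth, here's a rough Python sketch of the same meet-in-the-middle idea, using binary search (as in the first version above) rather than the saddleback walk. Pure Python won't cope with the full 12**8 split, but the structure carries over directly to the C++ version; closest_selection is just an illustrative name:

```python
from bisect import bisect_left
from itertools import product

def closest_selection(matrix, target):
    # matrix: a list of columns, each a list of candidate values.
    # Returns (best_sum, choice), where choice[i] is the row picked in column i.
    half = len(matrix) // 2

    def all_sums(cols):
        # Every way of picking one value per column, sorted by total.
        combos = []
        for choice in product(*(range(len(col)) for col in cols)):
            total = sum(col[i] for col, i in zip(cols, choice))
            combos.append((total, choice))
        combos.sort()
        return combos

    left = all_sums(matrix[:half])
    right = all_sums(matrix[half:])
    right_totals = [t for t, _ in right]

    best = None
    for total, lchoice in left:
        # Binary search for the right-half total closest to what's missing.
        k = bisect_left(right_totals, target - total)
        for j in (k - 1, k):
            if 0 <= j < len(right):
                s = total + right[j][0]
                if best is None or abs(s - target) < abs(best[0] - target):
                    best = (s, lchoice + right[j][1])
    return best
```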
Without knowing the range of values that might fill the arrays, how about something generic like this:
1. Divide the target by the number of remaining columns.
2. Pick the number from that column closest to that value.
3. Repeat from 1 until a number has been picked from every column.
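A minimal sketch of that greedy pass, reading step 1 as dividing the still-unmet portion of the target by the number of columns left; greedy_pick is just an illustrative name, and the matrix is assumed to be given as a list of columns:

```python
def greedy_pick(matrix, target):
    # matrix: a list of columns, each a list of candidate values.
    remaining = target
    picks = []
    for k, column in enumerate(matrix):
        share = remaining / (len(matrix) - k)               # step 1
        value = min(column, key=lambda v: abs(v - share))   # step 2
        picks.append(value)
        remaining -= value                                  # step 3: repeat
    return picks, sum(picks)
```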
I have a simple machine learning question:
I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart. That is, I want to
Maximize:
Choose 10 different elements.
Return min distance over (all pairings within the 10).
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
1. Cluster the n elements into 20 clusters.
2. Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
3. Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756.
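For step 3, the brute force over the surviving candidates is straightforward; a sketch, assuming dist is the full pairwise distance matrix and candidates holds the 20 representative indices:

```python
from itertools import combinations

def brute_force_max_min(candidates, dist, k=10):
    # Try every k-subset of the candidates and keep the one whose
    # minimum pairwise distance is largest.
    best_score, best_subset = float('-inf'), None
    for subset in combinations(candidates, k):
        score = min(dist[i][j] for i, j in combinations(subset, 2))
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score
```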
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
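For concreteness, here's one way the relaxation might be sketched with scipy's SLSQP solver. Note that this is a local solver on a non-convex objective, so unlike a global QP solve it doesn't guarantee the upper-bound property mentioned above; relaxed_selection and the top-k rounding are just illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def relaxed_selection(D, k=10):
    # D: n x n matrix of pairwise distances (upper triangular or full).
    n = D.shape[0]
    D_sym = D + D.T                                   # replace D with D + D', as above
    objective = lambda y: -(y @ D_sym @ y)            # maximize y'*D*y
    constraints = [{"type": "eq", "fun": lambda y: y.sum() - k}]
    bounds = [(0.0, 1.0)] * n                         # 0 <= y_i <= 1
    y0 = np.full(n, k / n)                            # feasible starting point
    res = minimize(objective, y0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    y = res.x
    chosen = np.argsort(y)[-k:]                       # round: keep the k largest entries
    return chosen, y
```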
Let me know if any of this is unclear, notation or otherwise.
Nice question.
I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
Repeat 100 times:
Greedily select the datapoint whose removal decreases the objective function the least and remove it.
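A sketch of that baseline with the min-distance objective (n is only ~110 here, so the quadratic inner loop is tolerable):

```python
def greedy_baseline(dist, k=10):
    # dist: symmetric pairwise distance matrix. Objective = minimum
    # pairwise distance among the remaining points.
    remaining = set(range(len(dist)))

    def objective_without(p):
        rest = remaining - {p}
        return min(dist[i][j] for i in rest for j in rest if i < j)

    while len(remaining) > k:
        # Remove the point whose removal hurts the objective least,
        # i.e. leaves the largest minimum distance behind.
        remaining.remove(max(remaining, key=objective_without))
    return remaining
```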
I have faced the following problem:
there are two disjoint sets, A and B
for each pair of elements (a, b) (a belongs to set A and b belongs to set B), a probability pij is known in advance. It represents the probability (certainty level) that a matches b, or in other words, how closely a matches b (and vice versa, because pij == pji).
I have to find a matching with the highest probability/certainty and find out pairs (a, b) which describe the matching
every element must be matched / paired with another from the other set exactly once (like in the standard bipartite matching problem)
if possible, I would like to compute a number which approximately expresses the uncertainty level for the obtained matching (let's say that 0 represents random guess and 1 represents certainty)
A simple practical example in which such algorithm is required is described below (this is not actually the problem I am solving!):
two people are asked to write the letters a - z on a piece of paper
for each pair of letters (a, b) we run a pattern matcher to determine the probability that letter a written by person A represents letter b written by person B; this gives us the probability matrix which expresses some kind of similarity correlation for each pair of letters (a, b)
for each letter that person A wrote, we need to find the corresponding letter written by person B
Current approach:
I am wondering if I could just assign weights proportional to the logarithm of the certainty level / probability that element a from set A matches element b from set B, and then run maximum weighted bipartite matching to find the maximum sum. The logarithm is there because I want to maximize the total probability of the whole matching: the individual matches (pairs of matched elements a - b) form a chain of events whose joint probability is a product of the individual probabilities, and taking logarithms turns that product into a sum of log-probabilities, which is easily maximized using an algorithm for weighted bipartite matching, such as the Hungarian algorithm. But I somehow doubt this approach would ensure the best matching in terms of the statistically expected maximum.
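That log-weight idea is easy to try with an off-the-shelf assignment solver; here's a sketch using scipy's linear_sum_assignment, where the probability clipping and the returned product are just illustrative choices:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def most_likely_matching(P):
    # P[i, j] = probability that element i of A matches element j of B.
    # Maximizing the product of matched probabilities is the same as
    # maximizing the sum of their logs, i.e. a max-weight bipartite matching.
    log_p = np.log(np.clip(P, 1e-12, 1.0))          # avoid log(0)
    rows, cols = linear_sum_assignment(log_p, maximize=True)
    pairs = list(zip(rows, cols))
    total_prob = float(np.prod(P[rows, cols]))      # product over the matching
    return pairs, total_prob
```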
After searching a bit, the closest problem I found was a two-stage stochastic maximum weighted matching problem, which is NP-hard, but I actually need some kind of "one-stage" stochastic maximum weighted matching problem.
I wonder if you can use MaxFlow/MinCut. I can't prove it's optimal at the moment, but your problem may be NP-hard anyway. You can use MF/MC to find a perfect matching when you have a bipartite graph with V=(A,B) by creating a source connected to all nodes in A with a weight of 1 and a sink connected to all nodes in B with weight 1. I'm proposing you make the weights of edges that cross from A to B be the probabilities you mentioned above. What do you think?