Solving a stochastic maximum bipartite matching problem

I have faced the following problem:
there are two disjoint sets, A and B
for each pair of elements (a, b), where a belongs to set A and b belongs to set B, a probability pij is known in advance. It represents the probability (certainty level) that a matches b, or in other words, how closely a matches b (and vice versa, because pij == pji).
I have to find the matching with the highest overall probability/certainty and report the pairs (a, b) which make up that matching
every element must be matched / paired with another from the other set exactly once (like in the standard bipartite matching problem)
if possible, I would like to compute a number which approximately expresses the uncertainty level for the obtained matching (let's say that 0 represents random guess and 1 represents certainty)
A simple practical example in which such an algorithm is required is described below (this is not actually the problem I am solving!):
two people are asked to write the letters a - z on a piece of paper
for each pair of letters (a, b) we run a pattern matcher to determine the probability that the letter a written by person A represents the letter b written by person B; this gives us the probability matrix which expresses some kind of similarity correlation for each pair of letters (a, b)
for each letter that person A wrote, we need to find the corresponding letter written by person B
Current approach:
I am wondering if I could just assign weights proportional to the logarithm of the certainty level / probability that element a from set A matches element b from set B, and then run maximum weighted bipartite matching to find the maximum sum. The logarithm is because I want to maximize the total probability of the whole matching: the single matches (represented as pairs of matched elements a - b) form a chain of events whose joint probability is a product of probabilities, and taking the logarithm converts this product into a sum of log-probabilities, which is then easily maximized using an algorithm for weighted bipartite matching, such as the Hungarian algorithm. But I somehow doubt this approach would ensure the best matching in terms of the statistically expected maximum.
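For what it's worth, here is a minimal sketch of exactly that idea using SciPy's Hungarian-style solver (the matrix name P and the clipping epsilon are my assumptions; P[i][j] is the probability that element i of A matches element j of B):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def most_likely_matching(P):
    """Maximize the product of matched probabilities by maximizing the sum of logs."""
    eps = 1e-12                                   # avoid log(0)
    log_weights = np.log(np.clip(P, eps, 1.0))
    rows, cols = linear_sum_assignment(log_weights, maximize=True)
    total_prob = float(np.exp(log_weights[rows, cols].sum()))
    return list(zip(rows, cols)), total_prob
```

The returned total_prob is the joint probability of the chosen matching, which could serve as a crude starting point for the certainty number asked about above.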
After searching a bit, the closest problem I found was a two-stage stochastic maximum weighted matching problem, which is NP-hard, but I actually need some kind of "one-stage" stochastic maximum weighted matching problem.

I wonder if you can use MaxFlow/MinCut. I can't prove it's optimal at the moment, but your problem may be NP-hard anyway. You can use MF/MC to find a perfect matching in a bipartite graph with V = (A, B) by creating a source connected to every node in A with capacity 1 and connecting every node in B to a sink with capacity 1. I'm proposing you make the weights of the edges that cross from A to B the probabilities you mentioned above. What do you think?
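To make that concrete, here is a rough sketch of the flow construction using networkx; turning it into a min-cost max-flow with integer-scaled negative log-probabilities as edge costs (so the flow prefers high-probability pairs) is my own addition, not something the construction above requires:

```python
import math
import networkx as nx

def flow_matching(P):
    """Build source -> A (capacity 1), A -> B (capacity 1, cost -log p), B -> sink
    (capacity 1), and extract the pairs carrying one unit of flow."""
    n_a, n_b = len(P), len(P[0])
    G = nx.DiGraph()
    for i in range(n_a):
        G.add_edge('s', ('a', i), capacity=1, weight=0)
    for j in range(n_b):
        G.add_edge(('b', j), 't', capacity=1, weight=0)
    for i in range(n_a):
        for j in range(n_b):
            cost = int(round(-1000 * math.log(max(P[i][j], 1e-12))))  # integer costs
            G.add_edge(('a', i), ('b', j), capacity=1, weight=cost)
    flow = nx.max_flow_min_cost(G, 's', 't')
    return [(i, j) for i in range(n_a) for j in range(n_b)
            if flow[('a', i)][('b', j)] == 1]
```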

Related

What is right-invariance in the context of ranking?

I'm reading a paper called Unsupervised Rank Aggregation with Domain-Specific Experience, and under section 2.1 they talk about the distance between two permutations of a list. Some examples of this distance metric are the Kendall Tau Distance and the Spearman Footrule Distance. A property this distance metric can have is right-invariance. In the paper, if a metric has this property it means that it does not depend on how the objects are indexed.
This part confuses me, as I don't really understand the difference between an object's rank and an object's index. If an object is in a ranked list, wouldn't its index directly correlate to its rank? Additionally, they mention that the Kendall Tau Distance is right-invariant, yet its formula shows that it depends on the indices of objects i and j. So, what exactly is right-invariance in the context of rank aggregation?
The objects you're ranking arrive at the algorithm in a list, and the rankings you're aggregating arrive at the algorithm as permutations that act on the list. The order of the list / the indices of the objects in the input list is not supposed to matter: the algorithm is supposed to rank the objects in the same way (assign the same new indices) no matter the original ordering (ignoring the original indices). The new indices correspond to the ranks and are important. The old indices (both in the list of objects and in the input rankings) are an artifact of the input representation, and care must be taken to make sure they're ignored. Saying that the indices of the objects in the input list don't matter is the same as saying that shuffling the input list doesn't change the output of the algorithm. Since the rankings you're aggregating are represented by permutations of the input list, shuffling the input list by some permutation requires you to right-multiply all the ranking permutations by the inverse of the shuffling permutation in order to get the same actual rankings of the objects. Since all of these new shuffled ranking permutations still represent the same rankings, the distance metric used to compare the rankings had better be insensitive to the change: such insensitivity is called right-invariance.
As to the right-invariance of the Kendall tau distance: consider the formula 2(x - 3) + 6 - 2x. This looks like it depends on the number you choose for x, but it is actually always zero, so it doesn't really depend on x at all. Similarly for the Kendall tau: it might not be immediately obvious that it is right-invariant (it certainly isn't obvious to me); it's something that you'd probably have to sit down and prove to yourself mathematically. (If they don't even reference a proof, I'd assume it's actually quite trivial once you think about it, but this isn't my field and I'm not going to get it without some pencil and paper.)
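As a sanity check (a minimal sketch with a hand-rolled pairwise-disagreement count, not any particular library's implementation), you can relabel the objects with a random shuffle and verify that the Kendall tau distance between two rankings does not change:

```python
import random
from itertools import combinations

def kendall_tau_distance(p, q):
    """Count the pairs of objects that the two rankings order differently.
    p and q are lists mapping object id -> rank."""
    objs = range(len(p))
    return sum(1 for i, j in combinations(objs, 2)
               if (p[i] - p[j]) * (q[i] - q[j]) < 0)

# Two rankings of 6 objects, given as rank-of-object lists.
sigma = [0, 2, 1, 5, 3, 4]
tau = [1, 0, 2, 4, 5, 3]
d = kendall_tau_distance(sigma, tau)

# Relabel (shuffle) the objects; this plays the role of the right multiplication above.
relabel = list(range(6))
random.shuffle(relabel)
sigma_shuffled = [sigma[relabel[i]] for i in range(6)]
tau_shuffled = [tau[relabel[i]] for i in range(6)]
assert kendall_tau_distance(sigma_shuffled, tau_shuffled) == d  # distance is unchanged
```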

How to find pattern groups in boolean array?

Given a 2D array of Boolean values I want to find all patterns that consist of at least 2 columns and at least 2 rows. The problem is somewhat close to finding cliques in a graph.
In the example below green cells represent "true" bits, greys are "false". Pattern 1 contains cols 1,3,4 and 5 and rows 1 and 2. Pattern 2 contains only columns 2 and 4, and rows 2,3,4.
The business idea behind this is finding similarity patterns among various groups of social network users. In the real world the number of rows can go up to 3E7 (30 million), and the number of columns up to 300.
Can't really figure out a solution other than brute force matching.
Please advise the proper name of the problem, so I can read more about it, or suggest an elegant solution.
This is (equivalent to) asking for all bicliques (complete bipartite subgraphs) larger than a certain size in a bipartite graph. Here the rows are the vertices of one part A of the graph, and the columns are the vertices of the other part B, and there is an edge between u \in A and v \in B whenever the cell at row u, column v is green.
Although you say that you want to find all patterns, you probably want to find only maximal ones -- that is, patterns that cannot be extended to become larger patterns by adding more rows or columns. (Otherwise, for any pattern with c >= 2 columns and r >= 3 rows, you will also get back the more than 2^(c-2)*2^(r-3) non-maximal patterns that can be formed by deleting some of the rows or columns.)
But even listing just the maximal patterns can take time exponential in the number of rows and columns, assuming that P != NP. That's because the problem of finding a maximum (i.e. largest-possible) pattern, in terms of the total number of green cells, has been proven to be NP-complete: if it were possible to list all maximal patterns in polynomial time, then we could simply do so, and pick the largest, thereby solving this NP-complete problem in polynomial time.
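To make the reduction concrete, here is a brute-force sketch (exponential in the number of columns, so only usable on toy inputs, nowhere near 300 columns; the grid representation and function name are assumptions). It enumerates every column subset and collects the rows that are green in all of those columns; a post-processing step would still be needed to keep only the maximal patterns:

```python
from itertools import combinations

def row_patterns(grid):
    """grid[r][c] is True if the cell at row r, column c is green.
    Return (rows, cols) pairs with at least 2 rows and 2 columns."""
    n_rows, n_cols = len(grid), len(grid[0])
    patterns = []
    for size in range(2, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            rows = [r for r in range(n_rows) if all(grid[r][c] for c in cols)]
            if len(rows) >= 2:
                patterns.append((rows, list(cols)))
    return patterns
```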

Certainty of data distribution profile when performing a sort operation

Let's assume we have some data structure like an array of n entries, and for argument's sake let's assume that the data has bounded numerical values.
Is there a way to determine the profile of the data, say monotonically ascending, descending, etc., to a reasonable degree, perhaps with a certainty value of z after having checked k entries within the data structure?
Assuming we have an array of size N, this means that we have N-1 comparisons between adjacent elements in the array. Let M = N-1. M represents the number of relations. The probability of the array not being in the correct order is
1/M
If you select a subset of K relations to determine monotonically ascending or descending, the theoretical probability of certainty is
K / M
Since these are two linear equations, it is easy to see that if you want to be .9 sure, then you will need to check about 90% of the entries.
This only takes into account the assumptions in your question. If you are aware of the probability distribution, then using statistics you could randomly check a small percentage of the array.
If you only care about the array being in relative order (for example, on the interval [0, 10], most 1s would be close to the beginning), this is another question altogether. An algorithm that does this, as opposed to just sorting, would have to have a high cost for swapping elements and a cheap cost for comparisons. Otherwise, there would be no performance payoff from writing a complex algorithm to handle the check.
It is important to note that this is theoretically speaking. I am assuming no distribution in the array.
The easier problem is to check the probability of encountering such orderly behavior from random data.
E.g. if numbers are arranged randomly, there is a probability p = 0.5 that the first number is lower than the second number (we will come to the case of repetitions later). Now, if you sample k pairs and in every case the first number is lower than the second number, the probability of observing that by chance is 2^(-k).
Coming back to repetitions, keep track of observed repetitions and factor them in. E.g. if the probability of a repetition is q, the probability of not observing a repetition is (1-q), and the probability of observing an increasing-or-equal pair is q + (1-q)/2, so raise this to the power k to get the probability.
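A tiny sketch of that calculation (the function name and the sampling interpretation are my assumptions):

```python
def chance_of_k_ordered_pairs(k, q=0.0):
    """Probability that k independently sampled adjacent pairs all come out
    non-decreasing if the data were actually random; q is the assumed chance
    that a pair is an exact repetition (tie)."""
    p_pair = q + (1 - q) / 2          # tie, or strictly increasing
    return p_pair ** k

print(chance_of_k_ordered_pairs(10))          # 2**-10, about 0.00098 with no ties
print(chance_of_k_ordered_pairs(10, q=0.1))   # about 0.0025 when 10% of pairs are ties
```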

Graph Theory - Fill nodes with a limited number of routes

First I must say this is not homework or anything like it; it is a problem from a game named freeciv.
OK, in the game we have 'n' cities, usually 8-12; each city can have at most 'k' trade routes, usually 4; and each trade route must span a distance of 'd' or more, usually 8 Manhattan tiles.
The problem consists in finding the k*n trade routes with maximum (or minimum) distances. Obviously this can be solved with a brute-force algorithm, but that is really slow when the player has more than 10 cities, because the program has to make many iterations. I tried to solve it using graph theory, but I am not an expert in it; I even asked some of my teachers and none could explain a smart algorithm to me. So I am not here looking for the exact solution, but for the idea or the steps to analyze this.
The problem has two parts:
Calculate pair-wise distances between the cities
Select which pairs should become trade-route
I don't think the first part can be calculated faster than O(n·t), where t is the number of tiles, as each run of Dijkstra's algorithm will give you the distances from one city to all other cities. However, if I understand correctly, the distance between two cities never changes and is symmetric. So whenever a new city is built, you just need to run Dijkstra's algorithm from it and cache the distances.
For the second part I would expect a greedy algorithm to work: order all pairs of cities by suitability and in each step pick the first one that does not violate the constraint of k routes per city. I am not sure whether its optimality can be proven (the proof, if it exists, should be similar to the one for Kruskal's minimum spanning tree algorithm), but I suspect it will work fine in practice even if you find that it does not work in theory (I haven't tried to either prove or disprove it; that's up to you).
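A minimal sketch of that greedy step, assuming the pairwise distances are already cached in a dict (the names, the dict layout, and the distance-threshold handling are assumptions):

```python
def pick_trade_routes(distances, k, d_min, maximize=True):
    """distances maps (city_x, city_y) with city_x < city_y to the cached distance.
    Take pairs in order of suitability while both endpoints have fewer than k routes."""
    candidates = [(dist, pair) for pair, dist in distances.items() if dist >= d_min]
    candidates.sort(reverse=maximize)          # longest first, or shortest if minimizing
    routes, count = [], {}
    for dist, (x, y) in candidates:
        if count.get(x, 0) < k and count.get(y, 0) < k:
            routes.append((x, y, dist))
            count[x] = count.get(x, 0) + 1
            count[y] = count.get(y, 0) + 1
    return routes
```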
Continuing #Jan Hudec's approach:
Init stage:
Let's say you have N cities (c1, c2, ..., cN). Build a list of connections where each entry has the format (cX, cY, Distance) (with X < Y; that is N^2/2 entries), and order it by distance (descending for max distance or ascending for min distance). You should also have an array/list holding the number of connections per city (cZ = W), initialized to N-1 for each city because at the beginning every city is connected to every other.
Iterations:
Iterate over the list of connections.
For each (cX, cY, D): if the number of connections (in the connection-count array) of cX > k and of cY > k, then delete (cX, cY, D) from the connection list and also decrease the counts of cX and cY in the connection array by one.
In the end, you'll have the connection list with the routes you wished for.
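Here is one possible reading of that procedure as code (a sketch only; in particular I walk the sorted list starting from the least desirable end, so those are the connections that get deleted while both of their cities still have more than k of them):

```python
def prune_connections(distances, k, want_max=True):
    """distances maps (city_x, city_y) to distance. Start from the full connection
    list and drop a connection whenever both of its cities still have more than k."""
    conns = sorted(((d, x, y) for (x, y), d in distances.items()),
                   reverse=not want_max)       # least desirable connections first
    degree = {}
    for _, x, y in conns:
        degree[x] = degree.get(x, 0) + 1
        degree[y] = degree.get(y, 0) + 1
    kept = []
    for d, x, y in conns:
        if degree[x] > k and degree[y] > k:
            degree[x] -= 1                     # delete this connection
            degree[y] -= 1
        else:
            kept.append((x, y, d))
    return kept
```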

Most mutually distant k elements (clustering?)

I have a simple machine learning question:
I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart from each other. That is, I want to maximize the following:
Choose 10 different elements.
Return the min distance over all pairings within the 10.
My distance metric is symmetric and respects the triangle inequality.
What kind of algorithm can I use? My first instinct is to do the following:
Cluster the n elements into 20 clusters.
Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756. (A sketch of this brute-force step appears below.)
Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
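A minimal sketch of the brute-force step mentioned above, assuming D is a full symmetric distance matrix and the objective is the smallest pairwise distance within the chosen 10 (the names are illustrative):

```python
from itertools import combinations

import numpy as np

def best_subset_bruteforce(D, candidates, k=10):
    """Try every k-subset of the shortlisted candidate indices and keep the one
    whose smallest pairwise distance is largest."""
    best_score, best_subset = -np.inf, None
    for subset in combinations(candidates, k):
        idx = np.array(subset)
        sub = D[np.ix_(idx, idx)]
        score = sub[np.triu_indices(len(idx), 1)].min()
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score
```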
Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.
Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)
Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)
The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:
maximize y'*D*y
subject to: 0 <= y_i <= 1 for all i, 1'*y = 10
This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
The y you get out need not be (and probably won't be) a binary vector, but you can convert the scalar values to discrete ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
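Here is a sketch of the relaxation with an off-the-shelf solver, in this case SciPy's SLSQP (my choice, not something prescribed above); note that maximizing a quadratic form is not convex in general, so this only returns a local optimum of the relaxed problem, and the rounding rule at the end is the simple top-10 heuristic mentioned above:

```python
import numpy as np
from scipy.optimize import minimize

def relaxed_selection(D, k=10):
    """Approximately maximize y' D y subject to 0 <= y_i <= 1 and sum(y) = k,
    then round by keeping the k largest entries of y."""
    n = D.shape[0]
    S = D + D.T                              # symmetrize: y' S y == 2 * y' D y
    objective = lambda y: -(y @ S @ y)       # minimize the negative
    gradient = lambda y: -2 * (S @ y)
    constraints = ({'type': 'eq', 'fun': lambda y: y.sum() - k},)
    bounds = [(0.0, 1.0)] * n
    y0 = np.full(n, k / n)
    res = minimize(objective, y0, jac=gradient, bounds=bounds,
                   constraints=constraints, method='SLSQP')
    chosen = np.argsort(res.x)[-k:]
    return sorted(int(i) for i in chosen), res.x
```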
Let me know if any of this is unclear, notation or otherwise.
Nice question.
I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
Here's an obvious baseline I would compare any other solution against:
Repeat 100 times:
Greedily select the datapoint whose removal decreases the objective function the least, and remove it.
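A sketch of that baseline, assuming a full symmetric distance matrix D and the max-min-distance objective from the question (with n ~ 110 and k = 10 the loop runs about 100 times, matching the "repeat 100 times" above):

```python
import numpy as np

def greedy_prune(D, k=10):
    """Repeatedly drop the point whose removal hurts the objective
    (the minimum pairwise distance among the remaining points) the least."""
    remaining = set(range(D.shape[0]))

    def objective(indices):
        idx = sorted(indices)
        sub = D[np.ix_(idx, idx)]
        return sub[np.triu_indices(len(idx), 1)].min()

    while len(remaining) > k:
        # removing a point can only raise (or keep) the min distance;
        # keep the configuration whose min distance ends up largest
        to_drop = max(remaining, key=lambda i: objective(remaining - {i}))
        remaining.remove(to_drop)
    return sorted(remaining)
```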
