Searching for efficient clustering algorithm - c

In a 2D NxN matrix each point represents an area of a map. There are M customers in random areas who need to be served by K customer service centers, also in random areas. Each customer service center can serve up to X jobs. The number of customers must be less than or equal to the total capacity of the customer service centers. Every customer must be assigned to one service center, and the Manhattan distance is the cost (a customer can move only up, left, down and right towards the service center). How do I assign customers to minimise the total cost? I am looking for a direction, if it is a well-known problem, or at least pseudocode.

I think you can handle this problem using a MinCost/MaxFlow algorithm. Create the graph as follows:
Create M + K + 2 nodes; M customer-nodes, K customer-service-center-nodes (csc-nodes), a source and a sink.
Create K edges from the source to the K csc-nodes with cost 0 and capacity equal to the number of customers that each CSC can serve.
Create M edges from the M customer-nodes to the sink, each edge will have capacity 1 and cost 0.
Create K * M edges from the K csc-nodes to the M customer-nodes each one with a capacity equal to 1 and cost equal to the distance between the CSC and the customer.
Run MinCost/MaxFlow algorithm on the network (V = M + K + 2, E = M + K + M*K). If the max-flow value is equal to M, then you can serve all the customers with the resulting (minimum) cost.
The solution for this case is 23.
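The construction above can be sketched in Python as follows. This is a minimal illustration, not a tuned solver: the function name assign_customers, the coordinate-tuple inputs and the single shared capacity are assumptions made for the example, and the min-cost flow is computed with plain successive shortest paths (queue-based Bellman-Ford), which is simple but not the fastest choice.

```python
from collections import deque

def assign_customers(customers, centers, capacity):
    """customers, centers: lists of (row, col); capacity: jobs per center (assumed equal).
    Returns (total_cost, {customer_index: center_index}) or None if someone is unserved."""
    m, k = len(customers), len(centers)
    source, sink = m + k, m + k + 1
    graph = [[] for _ in range(m + k + 2)]   # graph[u] = indices of edges leaving u
    to, cap, cost = [], [], []               # edge e and e ^ 1 are a forward/reverse pair

    def add_edge(u, v, c, w):
        graph[u].append(len(to)); to.append(v); cap.append(c); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)

    pairs = []                               # (edge index, customer, center)
    for j, (cr, cc) in enumerate(centers):
        add_edge(source, m + j, capacity, 0)             # source -> csc-node
        for i, (r, c) in enumerate(customers):
            pairs.append((len(to), i, j))
            add_edge(m + j, i, 1, abs(cr - r) + abs(cc - c))  # csc -> customer, Manhattan cost
    for i in range(m):
        add_edge(i, sink, 1, 0)                          # customer -> sink

    flow = total = 0
    while True:
        # Shortest path from source in the residual graph (handles negative reverse costs).
        dist = [float("inf")] * (m + k + 2)
        prev_edge = [-1] * (m + k + 2)
        in_queue = [False] * (m + k + 2)
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v]:
                    dist[v] = dist[u] + cost[e]
                    prev_edge[v] = e
                    if not in_queue[v]:
                        in_queue[v] = True
                        queue.append(v)
        if prev_edge[sink] == -1:
            break                            # no augmenting path left
        v = sink
        while v != source:                   # push one unit along the path
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]
        flow += 1
        total += dist[sink]

    if flow < m:
        return None                          # max flow below M: not everyone can be served
    return total, {i: j for e, i, j in pairs if cap[e] == 0}
```

For example, assign_customers([(0, 0), (0, 1), (0, 2)], [(0, 0), (9, 9)], 2) sends the first two customers to the first center and the third to the far one.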

The way the problem is formulated, you have a constrained optimization problem, and not a clustering problem. It likely is convex, integer and linear.
Clustering algorithms won't satisfy the capacity constraint.
There is plenty of research on such optimization. There are various highly optimized solvers available.

Related

Probability: Estimating NoSQL Query Size / COUNT Using Random Samples

I have a very large NoSQL database. Each item in the database is assigned a uniformly distributed random value between 0 and 1. This database is so large that performing a COUNT on queries does not yield acceptable performance, but I'd like to use the random values to estimate COUNT.
The idea is this:
Run a query and order the query by the random value. Random values are indexed, so it's fast.
Grab the lowest N values, and see how big the largest value is, say R.
Estimate COUNT as N / R
The question is two-fold:
Is N / R the best way to estimate COUNT? Maybe it should be (N+1)/R? Maybe we could look at the other values (average, variance, etc), and not just the largest value to get a better estimate?
What is the error margin on this estimated value of COUNT?
Note: I thought about posting this in the math stack exchange, but given this is for databases, I thought it would be more appropriate here.
This actually would be better on math or statistics stack exchange.
A reasonable estimate: writing R for the true count and x for your order statistic (the n'th smallest random value), if R is large then R is approximately n / x - 1. About 95% of the time the error will be within 2 R / sqrt(n) of this. So looking at the 100th element will estimate the right answer to within about 20%. Looking at the 10,000th element will estimate it to within about 2%. And the millionth element will get you the right answer to within about 0.2%.
To see this, start with the fact that the n'th order statistic has a Beta distribution with parameters 𝛼 = n and β = R + 1 - n. Which means that the mean value of the n'th smallest value out of R values is n/(R+1). And its variance is 𝛼β / ((𝛼 + β)^2 (𝛼 + β + 1)). If we assume that R is much larger than n, then this is approximately n R / R^3 = n / R^2. Which means that our standard deviation is sqrt(n) / R.
If x is our order statistic, this means that (n / x) - 1 is a reasonable estimate of R. And how much is it off by? Well, we can use the tangent line approximation. The function (n / x) - 1 has a derivative of - n / x^2. At x = n/(R+1) the magnitude of that derivative is therefore (R + 1)^2 / n, which for large R is roughly R^2 / n. Multiply by our standard deviation of sqrt(n) / R and we come up with an error proportional to R / sqrt(n). Since a 95% confidence interval would be 2 standard deviations, you probably will have an error of around 2 R / sqrt(n).
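As a small sketch of the estimate derived above, the following Python function turns the n'th smallest random value x into an estimated count and a rough 95% margin. The function name is made up for the example, and since the true count is unknown, the margin plugs the estimate itself into 2 R / sqrt(n); treat it as an approximation.

```python
from math import sqrt

def estimate_count(n, x):
    """n: rank of the sampled item (the n'th smallest); x: its random value in (0, 1)."""
    estimate = n / x - 1                 # point estimate of the total count
    margin = 2 * estimate / sqrt(n)      # rough 95% half-width, using the estimate for R
    return estimate, margin
```

With n = 100 and x = 0.001 this gives an estimate of about 99,999 with a margin of about 20,000, matching the "within about 20%" figure for the 100th element.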

Finding max of difference + element for an array

We have an array consisting of each entry as a tuple of two integers. Let the array be A = [(a1, b1), (a2, b2), .... , (an, bn)]. Now we have multiple queries where we are given an integer x, we need to find the maximum value of ai + |x - bi| for 1 <= i <= n.
I understand this can be easily achieved in O(n) time complexity for each query but I am looking for something faster than that, probably O(log n) for each query. I can preprocess the array in O(n) time, but the queries should be done faster than O(n).
Any kind of help would be appreciated.
It seems to be way too easy to over-think this.
For n = 1, the function is v-shaped with a minimum of a1 at b1, with slopes of -1 and 1, respectively - let's call these values ac and bc (for combined).
For an additional pair (ai, bi), one of the pairs may dominate the other (|bc - bi| ≤ |ac - ai|), which may then be ignored.
Otherwise, the falling slope of the combination will be from the pair with the larger b, the rising slope from the other.
The minimum will be between the individual b, closer to the b of the pair with the larger a, the distance being half the difference between the (absolute value of the) "coordinate" differences, the minimum value that amount higher.
The main catch is that neither needs to be an integer - the only alternative being exactly in the middle between two integers.
(Ending up with the falling slope from max ai + bi, and the rising slope of max ai - bi.)
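The closing remark can be turned into a short Python sketch. Since ai + |x - bi| is the larger of (ai + bi) - x and (ai - bi) + x, the maximum over all i only needs the two precomputed maxima. The function names preprocess and query are illustrative.

```python
def preprocess(pairs):
    """O(n): keep only the largest a + b (falling slope) and the largest a - b (rising slope)."""
    best_sum = max(a + b for a, b in pairs)
    best_diff = max(a - b for a, b in pairs)
    return best_sum, best_diff

def query(pre, x):
    """O(1): maximum of a_i + |x - b_i| over all pairs."""
    best_sum, best_diff = pre
    return max(best_sum - x, best_diff + x)
```

Each query is then constant time after a single linear pass, which is better than the O(log n) asked for.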

Given two paid roads from point x to point y, find the cheapest road

Given: Two roads, L and R, they both start from point x and end in point y.
They're both paid roads, and they have n paid points in the road l1,l2,..,ln on the L road and r1,r2,...rn on the R road, which you have to pay when you cross that point.
If you want to continue driving to point y on road R instead of L (or on road L instead of R), you have to pay the road-changing fee (but not the paid point). Meaning, if you leave at point i, you don't pay li or ri, just the road-changing fee, and then you continue to point i+1 on the new road (R or L), where you have to pay l(i+1) or r(i+1). The road-changing fees are given by s1,s2,...sn.
EXAMPLE: If I'm in point 3 in road R and I choose to keep driving to point y from road L, I just pay s3 (and not r3), and then I pay l4 too.
Question: Solve this in DP that finds the cheapest way from x to y with the best time complexity.
I came up with a solution which I believe won't work all the time, and I hope you can help me construct a better working one.
Define: Graph G, 2n vertices, edges are (li, li+1), (li, ri), and even 3n vertices because of (ri,ri+1). Array of size n called x, so x[i] will hold the value of the lowest-cost route up to index i. We fill the array from n down to 1, initialising its values with zero. The formula is as follows: A[i] = min { A[i+1] + xi, A[i+1] + si + x_(i+1) }. In the end, we return A[1].
I believe this fails sometimes. If that's true, I would love to get some help on changing it to work properly or finding another algorithm that will get this job done.
Even though there are some unclear points about your formulation, I'll post my solution anyways and add clarification if necessary.
Graph theoretic approach
The transformation you do to construct your 2n-vertex graph is definitely sensible, since your problem becomes a shortest path problem.
If you assign the cost road fee at ri to edges (ri, r(i+1)) or road fee at li to edges (li, l(i+1)) and road switch fee si to edges (ri,l(i+1)) or (li,r(i+1)), a shortest path from x to y is also a cheapest path in your model.
Dynamic programming formulation
Because of the structure of the graph, you don't need a shortest path algorithm like Dijkstra; a simpler solution similar to the one you proposed should suffice. However, when storing only your array x, you have no way of determining whether the optimal solution to get to a certain position i should end on road R or L.
A simple way to resolve this issue would be to store the shortest way for every paid point ri and li, i.e. store two arrays AR and AL where AR[i] is the optimal cost to get to ri (and AL[i] for li)
A simple DP formulation would be
AR[1] = AL[1] = 0 // We don't need to pay at the start
// either stay on the road or switch the road and pay changing fee:
AR[i + 1] = min(AR[i] + fee at ri, AL[i] + changing fee si)
AL[i + 1] = min(AL[i] + fee at li, AR[i] + changing fee si)
and compute the minimal cost by taking the minimum of AR[n + 1] and AL[n + 1]. To get the actual road, simply backtrack which choices were taken at every min computation.
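A minimal Python sketch of this recurrence follows. The function name and the 0-based lists l, r, s (fees at the paid points of L, of R, and the road-changing fees) are assumptions for the example. Only the previous pair of values is kept, so it runs in O(n) time and O(1) extra space; it returns the cost only, without the backtracking step.

```python
def cheapest_route(l, r, s):
    """l[i], r[i]: fee at paid point i on road L / R; s[i]: fee to change road at point i."""
    al = ar = 0                      # AL[1] = AR[1] = 0: nothing is paid at the start
    for li, ri, si in zip(l, r, s):
        # Either stay on the road and pay its fee, or arrive from the other road paying si.
        ar, al = min(ar + ri, al + si), min(al + li, ar + si)
    return min(ar, al)               # min(AR[n + 1], AL[n + 1])
```

For instance, cheapest_route([1, 10, 1], [10, 1, 10], [2, 2, 2]) evaluates to 5: pay l1 = 1, change at point 2 for 2, then change again at point 3 for 2.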

Is it possible to select a number from every given intervals without repetition in selections. Solution in LINEAR TIME

I have been trying this question on hackerearth practice which requires below work done.
PROBLEM
Given an integer n which signifies a sequence of n numbers from {0,1,2,3,4,5.......n-2,n-1}
We are provided m ranges in form of (L,R) such that (0<=L<=n-1)(0<=R<=n-1)
if(L <= R) (L,R) signifies numbers {L,L+1,L+2,L+3.......R-1,R} from above sequence
else (L,R) signifies numbers {L,L+1,L+2,.......n-2,n-1} & {0,1,2,3....R-1,R} ie numbers wrap around
example
n = 5 ie {0,1,2,3,4}
(0,3) signifies {0,1,2,3}
(3,0) signifies {3,4,0}
(3,2) signifies {3,4,0,1,2}
Now we have to select ONE (only one) number from each range without repeating any selection. We have to tell whether it is possible to select one number from each (and every) range without repetition.
Example test case
n = 5// numbers {0,1,2,3,4}
// ranges m in number //
0 0 ie {0}
1 2 ie {1,2}
2 3 ie {2,3}
4 4 ie {4}
4 0 ie {4,0}
Answer is "NO" it's not possible.
Because we cannot select any number from range 4 0: if we select 4 from it, we cannot select from range 4 4, and if we select 0 from it, we cannot select from range 0 0.
My approaches -
1) It can be done in O(N*M) using recursion, checking all possibilities of selection from each range while using a hash map to record our selections.
2) I was trying to do it in O(n) or O(m), i.e. in linear time. The problem lacks an editorial explanation: the editorial only contains code, without comments or explanation. It is a linear solution by someone, which passes all test cases and got accepted.
I am not able to understand the logic/algorithm used in the code, or why it works.
Please suggest ANY linear method and the logic behind it, because the problem has these constraints
1 <= N<= 10^9
1 <= M <= 10^5
0 <= L, R < N
which, I guess, demand a linear or n log n solution.
The code in the editorial can also be seen here http://ideone.com/5Xb6xw
Warning: after looking at the code, I found that it uses n and m interchangeably, so I would like to mention the input format for the problem.
INPUT FORMAT
The first line contains test cases, tc, followed by two integers N,M- the first one depicting the number of countries on the globe, the second one depicting the number of ranges his girlfriend has given him. After which, the next M lines will have two integers describing the range, X and Y. If (X <= Y), then range covers countries [X,X+1... Y] else range covers [X,X+1,.... N-1,0,1..., Y].
Output Format
Print "YES" if it is possible to do so, print "NO", if it is not.
There are two components to the editorial solution.
Linear-time reduction to a problem on ordinary intervals
Assume to avoid trivial cases that the number of input intervals is less than n.
The first is to reduce the problem to one where the intervals don't wrap around as follows. Given an interval [L, R], if L ≤ R, then emit two intervals [L, R] and [L + n, R + n]; if L > R, emit [L, R + n]. The easy direction of the reduction is showing that, if the original problem has a solution, then the reduced problem has a solution. For [L, R] with L ≤ R assigned a number k, assign k to [L, R] and k + n to [L + n, R + n]. For [L, R] with L > R, assign whichever of k, k + n belongs to [L, R + n]. Except for the dual assignment of k and k + n for intervals [L, R] and [L + n, R + n] respectively, each interval gets its own residue class mod n, so the assignments do not conflict.
Conversely, the hard direction of the reduction (if the original problem has no solution, then the reduced problem has no solution) is proved using Hall's marriage theorem. By Hall's criterion, an unsolvable original problem has, for some k, a set of k input intervals whose union has size less than k. We argue first that there exists such a set of input intervals whose union is a (circular) interval (which by assumption isn't all of 0..n-1). Decompose the union into the set of maximal (circular) intervals that comprise it. Each input interval is contained in exactly one of these intervals. By an averaging argument, some maximal (circular) interval contains more input intervals than its size. We finish by "lifting" this counterexample to the reduced problem. Given the maximal (circular) interval [L*, R*], we lift it to the ordinary interval [L*, R*] if L* ≤ R*, or [L*, R* + n] if L* > R*. Do likewise with the circular intervals contained in this interval. It is tedious but straightforward to show that this lifted counterexample satisfies Hall's criterion, which implies that the reduced problem has no solution.
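The reduction itself is mechanical, and a small Python sketch may make it concrete. The function name is illustrative, and the sketch assumes, as above, that the number of input ranges is less than n.

```python
def lift_circular_ranges(n, ranges):
    """Map circular ranges on 0..n-1 to ordinary intervals on 0..2n-1."""
    lifted = []
    for lo, hi in ranges:
        if lo <= hi:
            lifted.append((lo, hi))            # the range itself ...
            lifted.append((lo + n, hi + n))    # ... and its copy shifted by n
        else:
            lifted.append((lo, hi + n))        # a wrapping range becomes one ordinary interval
    return lifted
```

With n = 5, the range (0, 3) becomes (0, 3) and (5, 8), while the wrapping range (3, 0) becomes (3, 5).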
O(m log m)-time solution for ordinary intervals
This is a sweep-line algorithm. Sort the intervals by lower endpoint and scan them in that order. We imagine that the sweep line moves from lower endpoint to lower endpoint. Maintain the set of intervals that intersect the sweep line and have not been assigned a number, sorted by upper endpoint. When the sweep line is about to move, assign the numbers between the old and new positions to the intervals in the set, preferentially to the ones whose upper endpoint is the lowest. The correctness of this strategy should be clear: the intervals that could be assigned a number but are passed over have at least as many options (in the sense of being a superset) as the intervals that are assigned, so we never make a choice that we have cause to regret.
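The sweep can be sketched in Python with a heap keyed by upper endpoint. This is an illustration of the strategy described above, not the editorial's code; the function name is made up. It hands out one number per step and jumps over gaps where no interval is waiting, so its running time depends on the number of intervals and not on the size of the number range.

```python
import heapq

def can_pick_distinct(intervals):
    """intervals: list of ordinary (lo, hi) pairs with lo <= hi.
    True if one distinct integer can be chosen from every interval."""
    intervals = sorted(intervals)          # scan by lower endpoint
    waiting = []                           # upper endpoints of started, unassigned intervals
    i, cur = 0, 0
    while i < len(intervals) or waiting:
        if not waiting:
            cur = max(cur, intervals[i][0])    # nothing waiting: jump to the next lower endpoint
        while i < len(intervals) and intervals[i][0] <= cur:
            heapq.heappush(waiting, intervals[i][1])
            i += 1
        hi = heapq.heappop(waiting)        # the interval that runs out of options first
        if hi < cur:
            return False                   # its last usable number has already been passed
        cur += 1                           # number cur is now taken
    return True
```

Applied to the lifted form of the example test case (n = 5, ranges 0 0, 1 2, 2 3, 4 4, 4 0), i.e. the intervals (0,0), (5,5), (1,2), (6,7), (2,3), (7,8), (4,4), (9,9), (4,5), it returns False, which corresponds to the answer "NO".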

Compact storage coefficients of a multivariate polynomial

The setup
I am writing code for dealing with polynomials of degree n over a d-dimensional variable x and ran into a problem that others have likely faced in the past. Such a polynomial can be characterized by coefficients c(alpha) corresponding to x^alpha, where alpha is a length d multi-index specifying the powers the d variables must be raised to.
The dimension and order are completely general, but known at compile time, and could be easily as high as n = 30 and d = 10, though probably not at the same time. The coefficients are dense, in the sense that most coefficients are non-zero.
The number of coefficients required to specify such a polynomial is n + d choose n, which in high dimensions is much less than n^d coefficients that could fill a cube of side length n. As a result, in my situation I have to store the coefficients rather compactly. This is at a price, because retrieving a coefficient for a given multi-index alpha requires knowing its location.
The question
Is there a (straightforward) function mapping a d-dimensional multi-index alpha to a position in an array of length (n + d) choose n?
Ordering combinations
A well-known way to order combinations can be found on this wikipedia page. Very briefly you order the combinations lexically so you can easily count the number of lower combinations. An explanation can be found in the sections Ordering combinations and Place of a combination in the ordering.
Precomputing the binomial coefficients will speed up the index calculation.
Associating monomials with combinations
If we can now associate each monomial with a combination we can effectively order them with the method above. Since each coefficient corresponds with such a monomial this would provide the answer you're looking for. Luckily if
alpha = (a[1], a[2], ..., a[d])
then the combination you're looking for is
combination = (a[1] + 0, a[1] + a[2] + 1, ..., a[1] + a[2] + ... + a[d] + d - 1)
The index can then readily be calculated with the formula from the wikipedia page.
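As a sketch of the index calculation, the Python function below builds the combination from alpha as described and ranks it with the combinatorial number system, i.e. index = C(c1, 1) + C(c2, 2) + ... + C(cd, d). This is one standard ranking of combinations and may differ in order from the formula on the page referred to; the function name is illustrative, and math.comb requires Python 3.8 or newer.

```python
from math import comb

def monomial_index(alpha):
    """Position of the multi-index alpha in a dense coefficient array.
    For total degree <= n and d variables, the result lies in range(comb(n + d, d))."""
    index = 0
    partial = 0
    for k, a in enumerate(alpha, start=1):
        partial += a                 # a[1] + ... + a[k]
        c_k = partial + k - 1        # k'th element of the strictly increasing combination
        index += comb(c_k, k)
    return index
```

For d = 2 and n = 2 the six multi-indices (0,0), (0,1), (1,0), (0,2), (1,1), (2,0) map to 0, 1, 2, 3, 4, 5. In compiled code the binomial coefficients would be precomputed in a table, as suggested above.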
A better, more object oriented solution, would be to create Monomial and Polynomial classes. The Polynomial class would encapsulate a collection of Monomials. That way you can easily model a pathological case like
y(x) = 1.0 + x^50
using just two terms rather than 51.
Another solution would be a map/dictionary where the key was the exponent and the value is the coefficient. That would only require two entries for my pathological case. You're in business if you have a C/C++ hash map.
Personally, I don't think doing it the naive way with arrays is so terrible, even with a polynomial containing 1000 terms. RAM is cheap; that array won't make or break you.

Resources