Allocate places according to distance - database

As an over-simplified example, I have a list of events, each with a maximum attendance:
event   | places
========|=======
event_A | 1
event_B | 2
event_C | 1
And a list of attendees with the distance to the events:
attendee   | event_A dist | event_B dist | event_C dist
===========|==============|==============|=============
attendee_1 | 12           | 15           | 12
attendee_2 | 11           | 15           | 11
attendee_3 | 10           | 11           | 12
Can anyone suggest a simple method to produce a set of options providing the best-case allocations based on shortest total distance and on shortest mean distance?
I currently have the data held in an Oracle Spatial database, but I'm open to suggestions.

I currently understand your problem as follows:
Each attendee should be assigned to exactly one event
Each event has a limit on how many attendees are assigned to it
Underfull or even empty events are no problem
Each assignment between an event and an attendee corresponds to a given distance
You want to minimize the overall distance for all assignments
You might want to report the result using means rather than sums
Based on this interpretation, I suggest the following algorithm:
Create a complete bipartite graph, with nodes in partition A for attendees and nodes in partition B for places in events. So every attendee corresponds to one node, and every event corresponds to as many nodes as it has places. All attendees are connected to all event nodes, with the distance as the edge cost.
At this point your problem corresponds to a general assignment problem, with "agents" corresponding to your event places and "tasks" corresponding to your attendees. Every attendee must be covered, but not every event place must be used.
Add dummy attendees to allow a perfect matching: sum up all the places and subtract the number of actual attendees; for the difference, create that many dummy attendee nodes, with distance zero to all the event nodes.
By making both partitions equal in size, you are now in the domain of the more common linear assignment problem.
Use the Hungarian algorithm to compute a minimal-cost assignment. Perhaps you can think of some simplifications which make use of the fact that you have many equivalent nodes, i.e. places for the same event and all those dummy attendees.
All of this should probably be done in application code rather than in the database, so I'd call this an algorithm problem rather than a database one. You'll need to pull the full cost matrix from the database to provide the costs for your edges.
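For illustration, here is a minimal sketch of this approach in Python. It uses SciPy's linear_sum_assignment (a Hungarian-method solver) on the example data from the question; using SciPy rather than a hand-rolled Hungarian implementation is my substitution, and since it accepts rectangular cost matrices, the dummy-attendee padding is handled implicitly:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    events = {"event_A": 1, "event_B": 2, "event_C": 1}  # event -> places
    attendees = ["attendee_1", "attendee_2", "attendee_3"]
    dist = {  # attendee -> distance to each event (from the question)
        "attendee_1": {"event_A": 12, "event_B": 15, "event_C": 12},
        "attendee_2": {"event_A": 11, "event_B": 15, "event_C": 11},
        "attendee_3": {"event_A": 10, "event_B": 11, "event_C": 12},
    }

    # Partition B: one node per place, so event_B contributes two columns.
    places = [ev for ev, n in events.items() for _ in range(n)]

    # Complete bipartite graph as a cost matrix: rows = attendees, columns = places.
    cost = np.array([[dist[a][ev] for ev in places] for a in attendees])

    rows, cols = linear_sum_assignment(cost)  # minimal-cost assignment
    total = cost[rows, cols].sum()
    for r, c in zip(rows, cols):
        print(attendees[r], "->", places[c], "distance", cost[r, c])
    print("total:", total, "mean:", total / len(rows))

Note that because every attendee receives exactly one place, minimizing the total distance and minimizing the mean distance pick out the same assignment; only the reported figure differs.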

Related

Tree evaluation in Flink

I have a use case where I want to build a real-time decision tree evaluator using Flink. I have a decision tree something like the one below:
Decision tree example
Root Node(Product A)---- Check if price of Product A increased by $10 in last 10mins
----------------------------
If Yes --> Left Child of A(Product B) ---> check if price of Product B increased by $20 in last 10mins ---> If not output Product B
----------------------------
If No ---> Right Child of A(Product C) ---> Check if price of Product C increased by $20 in last 10mins ---> If not output Product C
Note: This is just an example of one decision tree; I have multiple such decision trees with different product types/numbers of nodes and different conditions, and I want to write a common Flink app to evaluate all of them.
Now as input I am getting a data stream with the prices of all product types (A, B and C) every 1 min. To achieve my use case, one approach that I can think of is as follows:
Filter the input stream by product type
For each product type, use a sliding window over the last X mins (X depending on the product type), triggered every minute
Use a process window function to compute the difference of prices for each product type, and emit the price difference for each product type in the output stream
Now that we have the price difference for each product type/node of the tree, we can evaluate the decision tree logic. To do this, we have to make sure that the price-diff calculations for all product types in a decision tree (Products A, B and C in the example above) have completed before determining the output. One way is to store the outputs for all these products from the output stream in a datastore and keep checking from an EC2 instance every 5 s or so whether all the price computations are complete. Once they are, execute the decision tree logic to determine the output product.
I wanted to understand whether there is any other way this entire computation can be done in Flink itself, without needing any other components (datastore/EC2). I am fairly new to Flink, so any leads would be highly appreciated!
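For concreteness, here is a plain-Python sketch of just the tree-evaluation step, assuming the per-product price diffs for one tree have already been collected into a dict; the Node structure and its field names are hypothetical:

    class Node:
        def __init__(self, product, threshold, if_yes, if_no):
            self.product = product      # product whose price diff this node checks
            self.threshold = threshold  # e.g. 10 for "increased by $10"
            self.if_yes = if_yes        # Node, output string, or None
            self.if_no = if_no          # taken when the condition does not hold

    def evaluate(node, price_diffs):
        """price_diffs maps product -> price change over its window."""
        branch = node.if_yes if price_diffs[node.product] >= node.threshold else node.if_no
        return evaluate(branch, price_diffs) if isinstance(branch, Node) else branch

    # The example tree from the question (the yes-branches of the two child
    # nodes are unspecified there, so they are None here):
    tree = Node("A", 10,
                if_yes=Node("B", 20, if_yes=None, if_no="Product B"),
                if_no=Node("C", 20, if_yes=None, if_no="Product C"))
    print(evaluate(tree, {"A": 12, "B": 5, "C": 0}))  # -> Product B

One possibility for keeping everything in Flink would be to key every price diff by a tree id and run logic like this inside a keyed function once all products of that tree have reported, but that wiring is beyond this sketch.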

Compare rows without nested for-loops

I have a table in MS SQL Server; since I am somewhat new to SQL I have run into a problem. The table consists of the following data.
ID | Long | Lat | TimeStamp
-----------+--------------+--------------+------------------
123 | 54 | 18 | 2012-12-02...
143 | 31 | 35 | 2011-09-14...
322 | 53 | 19 | 2012-11-29...
And so on...
I have written a boolean function which checks a condition for a pair of long and lats. I have also written a function which gives the distance between a pair of long and lats. What I want to do is to add a column with the distance to the row which is closest to the current row, which also passes the boolean function, and which is sufficiently close in time. The table consists of several million rows, so I want to refrain from using nested for-loops. How would you tackle this large dataset? Does MS SQL Server have some smart way of doing this?
All help is welcome <3
You basically need to compare 1 million records to 1 million other records in your function, giving you 1 trillion pairs to work with. In my opinion, SQL is not really going to help you manage that many comparisons, so you'll have to do some work yourself and also only do bits of the work at a time. There may be a better way, but I would do the following:
Add the following columns to the table: calculated (a flag), destination, distance
Run an algorithm periodically, or whenever a new record is added. This algorithm only needs to run for records where calculated is false.
The algorithm can work something like below:
With the current record, loop through each record that has calculated set to true and calculate the distance to that location. If the distance is the shortest so far, store the destination and distance in a temporary variable.
After the loop has finished, you have the shortest destination and distance for the current location. Update that in the database and also set the calculated flag to true.
Now compare this distance to the distance in the record for the destination. If it is shorter, also update that record.
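In Python-like terms, the per-record pass might look like the sketch below; distance, passes_condition and close_in_time stand in for the functions the question already has, and the record fields mirror the proposed columns:

    import math

    def distance(a, b):           # placeholder for the question's distance function
        return math.hypot(a.long - b.long, a.lat - b.lat)

    def passes_condition(a, b):   # placeholder for the question's boolean function
        return True

    def close_in_time(a, b, max_seconds=3600):
        return abs((a.timestamp - b.timestamp).total_seconds()) <= max_seconds

    def process_record(new, calculated_records):
        best, best_dist = None, float("inf")
        for other in calculated_records:
            if close_in_time(new, other) and passes_condition(new, other):
                d = distance(new, other)
                if d < best_dist:
                    best, best_dist = other, d
        # Store the result for the current record and flag it as calculated.
        new.destination, new.distance, new.calculated = best, best_dist, True
        # If the new record is closer to its destination than that record's own
        # stored nearest neighbour, update the destination record as well.
        if best is not None and best_dist < best.distance:
            best.destination, best.distance = new, best_dist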

R Tree 50,000 foot overview?

I'm working on a school project that involves taking a lat/long point and finding the top five closest points in a known list of places. The list is to be stored in memory, with the caveat that we must choose an "appropriate data structure" -- that is, we cannot simply store all the places in an array and compare distances one-by-one in a linear fashion. The teacher suggested grouping the place data by US State to prevent calculating the distance for places that are obviously too far away. I think I can do better.
From my research online it seems like an R-Tree or one of its variants might be a neat solution. Unfortunately, that sentence is as far as I've gotten with understanding the actual technique, as the literature is simply too dense for my non-academic head.
Can somebody give me a really high-level overview of what the process is for populating an R-Tree with lat/long data, and then traversing the tree to find the 5 nearest neighbors of a given point?
Additionally, the project is in C, and I don't have to reinvent the wheel on this, so if you've used an existing open-source C implementation of an R-Tree I'd be interested in your experiences.
UPDATE: This blog post describes a straightforward search algorithm for a regionally partitioned space (like a PR quadtree). Hope that helps a future reader.
Have you considered alternative data structures?
I believe a Point Quadtree would be more effective for your needs than an R-tree. Spatial Index Demos provides demos for a list of possible data structures, including the R-tree and the Point Quadtree. Hope it gives some insight.
Quad Trees
A quad tree takes a square of space and divides it into four children with half the dimensions along the X and Y axis.
+---+---+
| | | Each square is a child
| | | of the parent; when you
+---+---+ get to leaves a node has
| | | a single point or a list
| | | of points.
+---+---+
This data structure is recursive and you search for points by checking which child holds the point until you get to a leaf. A leaf either has a single member (a point with X,Y coords) or a list of members, depending on the implementation. If you fill up a node, you split it into 4 and distribute its points among the children. Essentially, the data structure is a generalisation of a binary tree, so it is not necessarily balanced.
Balancing a quad tree may not be necessary for your purposes and is left as an exercise for the reader - try searching on the web for 'balanced quad tree'
Note that this data structure cannot index items that can overlap, but if you're only storing points this won't be a problem.
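A minimal sketch of that insert-and-split behaviour; the bucket capacity of 4 and the (x, y, w, h) box representation are my own choices:

    CAPACITY = 4  # max points per leaf before it splits

    class QuadTree:
        def __init__(self, x, y, w, h):
            self.x, self.y, self.w, self.h = x, y, w, h  # box origin and size
            self.points = []                             # leaf bucket
            self.children = None                         # four children once split

        def insert(self, px, py):
            if self.children is not None:                # internal node: descend
                self._child_for(px, py).insert(px, py)
                return
            self.points.append((px, py))
            if len(self.points) > CAPACITY:              # full leaf: split into 4
                self._split()

        def _split(self):
            hw, hh = self.w / 2, self.h / 2
            self.children = [QuadTree(self.x,      self.y,      hw, hh),
                             QuadTree(self.x + hw, self.y,      hw, hh),
                             QuadTree(self.x,      self.y + hh, hw, hh),
                             QuadTree(self.x + hw, self.y + hh, hw, hh)]
            pts, self.points = self.points, []
            for px, py in pts:                           # redistribute the bucket
                self._child_for(px, py).insert(px, py)

        def _child_for(self, px, py):
            i = (1 if px >= self.x + self.w / 2 else 0) \
              + (2 if py >= self.y + self.h / 2 else 0)
            return self.children[i]

(A real implementation also needs a guard against splitting forever when many identical points land in one leaf.)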
Finding nearest neighbours in a quad tree
Off the top of my head, here's a quick and dirty algorithm for finding the 'n' nearest neighbours to your point. It's not necessarily optimally efficient, but it will be fairly simple to implement. If someone has a link to a better one, feel free to post it in a comment or answer.
Locate the quad tree node containing your point, keeping a list of its parents.
Push all of the points in the node into a priority queue based on their distance from your base point (i.e. by the length of the hypotenuse per Pythagoras' theorem). Depending on the implementation there may be one or more per node. For a simple implementation of a priority queue data structure, look up 'binary heap'.
If any of the 'n' points are further away than the edges of the bounding box, add the contents of its neighbours. I.e., if your base point is close to the edge of the bounding box, it is possible that neighbouring tree nodes contain points that are closer than the points found within your bounding box. You will need to back up the tree to do this, which is why you need to keep track of your parent nodes.
When all of the 'n' closest points are closer than the edges of your bounding box you know that there could not possibly be neighbours that you have missed. Therefore, the 'n' closest points within this box must be your 'n' closest neighbours.
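Building on the QuadTree sketch above, here is a best-first variant of that search: instead of backing up through the parent list by hand, it pushes tree nodes onto the same priority queue, keyed by the distance from the base point to each node's box. Since that box distance lower-bounds the distance of any point inside, points are popped in true nearest-first order:

    import heapq, itertools, math

    def box_dist(node, qx, qy):
        """Distance from (qx, qy) to node's box; 0 if the point is inside."""
        dx = max(node.x - qx, 0, qx - (node.x + node.w))
        dy = max(node.y - qy, 0, qy - (node.y + node.h))
        return math.hypot(dx, dy)

    def n_nearest(root, qx, qy, n):
        counter = itertools.count()          # tie-breaker so heapq never compares nodes
        heap = [(0.0, next(counter), root)]  # entries: (distance, tie-break, node/point)
        result = []
        while heap and len(result) < n:
            d, _, item = heapq.heappop(heap)
            if isinstance(item, QuadTree):
                if item.children is not None:
                    for child in item.children:
                        heapq.heappush(heap, (box_dist(child, qx, qy), next(counter), child))
                for px, py in item.points:
                    heapq.heappush(heap, (math.hypot(px - qx, py - qy), next(counter), (px, py)))
            else:
                result.append(item)          # a popped point is the next-nearest neighbour
        return result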

Model for a Scheduling System

We have built a scheduling system to control our client's appointments. This system is similar to the one used by Prometric to schedule an exam. The main concerns are: guarantee that there is no overscheduling, support at least one hundred thousand appointments per month, and allow the testing center capacity to be increased or decreased easily.
We devised the following design based on capacity vectors. We assume that each appointment requires at least five minutes. A vector is composed of 288 slots (24 hours * 12 slots per hour), each one represented by one byte. The vector thus allows the system to represent up to 255 appointments in every five-minute slot. The information used consists of two vectors: one to store the testing center capacity (NOMINAL CAPACITY) and another to store the used capacity (USED CAPACITY). To recover the current capacity (CURRENT CAPACITY), the system takes the NOMINAL CAPACITY and subtracts the USED CAPACITY.
The database has the following structure:
NOMINAL CAPACITY
The nominal capacity represents the capacity for work days (Mon-Fri).
| TEST_CENTER_ID | NOMINAL_CAPACITY
| 1 | 0000001212121212....0000
| 2 | 0000005555555555....0000
...
| N | 0000000000010101....0000
USED CAPACITY
This table stores the used capacity for each day / testing center.
| TEST_CENTER_ID | DATE | USED_CAPACITY
| 1 | 01/01/2010 | 0000001010101010...0000
| 1 | 01/02/2010 | 0000000202020202...0000
| 1 | 01/03/2010 | 0000001010101010...0000
...
| 2 | 01/01/2010 | 0000001010101010...0000
...
| N | 01/01/2010 | 0000000000000000...0000
After the client chooses a testing center and a date, the system presents the available slots by doing the following calculation. For example:
TEST_CENTER_ID 1
DATE 01/01/2010
NOMINAL_CAPACITY 0000001212121212...0000
USED_CAPACITY 0000001010101010...0000
AVAILABLE_CAPAC 0000000202020202...0000
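In code terms, with each vector held as 288 raw bytes, the calculation is a slot-wise subtraction; a small Python sketch:

    SLOTS_PER_DAY = 24 * 12  # 288 five-minute slots

    def available_capacity(nominal: bytes, used: bytes) -> bytes:
        """CURRENT CAPACITY = NOMINAL CAPACITY - USED CAPACITY, slot by slot."""
        return bytes(n - u for n, u in zip(nominal, used))

    nominal = bytes([0] * 6 + [12] * 10 + [0] * (SLOTS_PER_DAY - 16))
    used    = bytes([0] * 6 + [10] * 10 + [0] * (SLOTS_PER_DAY - 16))
    print(available_capacity(nominal, used)[6])  # -> 2 places free in that slot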
If the client decides to schedule an appointment, the system locks the chosen day (a row in the USED CAPACITY table) and increases the corresponding byte.
The system is working well now, but we foresee contention problems if the
number of appointments increases too much.
Does anyone have a better or alternative model for this problem, or suggestions to improve it?
We have thought about avoiding the contention by subdividing the representation of a vector by hour and changing to an optimistic locking strategy. For example:
| TEST_CENTER_ID | DATE | USED_CAPACITY_0 | USED_CAPACITY_1 | ... | USED_CAPACITY_23
| 1 | 01/01/2010 | 000000101010 | 1010... | ... | ...0000
This way we don't need to lock a row, and we reduce collision events.
Here's one idea:
You can use optimistic locking with your current design. Instead of using a separate version number or timestamp to check whether the entire row has been modified, save a memento of the used_capacity array when the row is read. You only lock on update, at which time you compare just the modified slot byte to see if it's been updated. If not, you can embed the new value into that one element without modifying the others, thus retaining modifications to other slots performed by other processes since your initial read.
This should work as well on a set of adjacent bytes, for appointments that are longer than 5 minutes.
If you know the slot(s) in question when you initially read, then you can just save the beginning array index and salient values instead of the entire array.
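Here is a sketch of that byte-level check-and-update as a single statement. It assumes a SQL Server-style dialect (STUFF/SUBSTRING) and a DB-API connection with ? placeholders; the table and column names follow the question:

    def book_slot(conn, center_id, day, slot, old_vector):
        """slot is 0-based; old_vector is the memento read earlier."""
        old_byte = old_vector[slot:slot + 1]
        new_byte = bytes([old_vector[slot] + 1])  # one more appointment in this slot
        cur = conn.cursor()
        cur.execute(
            "UPDATE used_capacity "
            "SET used_capacity = STUFF(used_capacity, ?, 1, ?) "  # rewrite one byte only,
            "WHERE test_center_id = ? AND date = ? "              # keeping concurrent writes
            "AND SUBSTRING(used_capacity, ?, 1) = ?",             # to other slots intact
            (slot + 1, new_byte, center_id, day, slot + 1, old_byte),
        )
        conn.commit()
        # False: another process changed this slot first; re-read and retry.
        return cur.rowcount == 1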

Pre RTree step: Divide a set of points into rectangular regions each containing one point

Given my current position (lat, long), I want to quickly find the nearest neighbor in a points-of-interest problem. Thus I intend to use an R-tree database, which allows for quick lookup. However, first the database must be populated, of course. Therefore, I need to determine the rectangular regions that cover the area, where each region contains one point of interest.
My question is: how do I preprocess the data, i.e. how do I subdivide the area into these rectangular sub-regions? (I want rectangular regions because they are easily added to the R-tree - in contrast to more general Voronoi regions.)
/John
Edit: The approach below works, but it ignores a critical feature of R-trees: the splitting behavior of R-tree nodes is well defined and maintains a balanced tree (through B-tree-like properties). So in fact, all you have to do is:
Pick the maximum number of rectangles per page
Create seed rectangles (use points furthest away from each other, centroids, whatever).
For each point, choose a rectangle to put it into
If it already falls into a single rectangle, put it in there
If it does not, extend the rectangle that needs to be extended least (different ways to measure "least" exist - area works; see the sketch after this list)
If multiple rectangles apply -- choose one based on how full it is, or some other heuristic
If overflow occurs -- use the quadratic split to move things around...
And so on, using R-tree algorithms straight out of a text book.
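For example, the "extend the rectangle that needs to be extended least" step, with "least" measured by area enlargement, might look like this sketch (rectangles as (xmin, ymin, xmax, ymax) tuples):

    def enlargement(rect, px, py):
        """Extra area needed for rect to cover the point (px, py)."""
        xmin, ymin, xmax, ymax = rect
        grown = (min(xmin, px), min(ymin, py), max(xmax, px), max(ymax, py))
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return area(grown) - area(rect)

    def choose_rectangle(rects, px, py):
        # Zero enlargement means the point already falls inside a rectangle;
        # ties could be broken by occupancy or another heuristic.
        return min(rects, key=lambda r: enlargement(r, px, py))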
I think the method below is ok for finding your initial seed rectangles; but you don't want to populate your whole R-tree that way. Doing the splits and rebalancing all the time can be a bit expensive, so you will probably want to do a decent chunk of the work with the KD approach below; just not all of the work.
The KD approach: enclose everything in a rectangle.
If the number of points in the rectangle is > threshold, sweep in direction D until you cover half the points.
Divide into rectangles to the left and right of (or above and below) the splitting point.
Call the same procedure recursively on the new rectangles, with the next direction (if you were going left to right, you will now go top to bottom, and vice versa).
The advantage this has over the divide-into-squares approach offered by another poster is that it accommodates skewed point distributions better.
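A sketch of that KD-style subdivision, splitting at the median along alternating axes until each group is small enough; the bounding box of each returned group is one seed rectangle:

    def kd_split(points, threshold=1, axis=0):
        """points is a list of (x, y) tuples; returns one group per rectangle."""
        if len(points) <= threshold:
            return [points]
        pts = sorted(points, key=lambda p: p[axis])  # sweep along the current axis
        mid = len(pts) // 2                          # cover half the points
        return (kd_split(pts[:mid], threshold, 1 - axis)
                + kd_split(pts[mid:], threshold, 1 - axis))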
The Oracle Spatial Cartridge documentation describes a tessellation algorithm that can do what you want. In short:
enclose all your points in a square
if a square contains exactly 1 point, index that square
if a square contains no points, ignore it
if a square contains more than 1 point, split it into 4 equal squares and repeat the analysis for each new square
Result should be something like this:
(Figure: example tessellation, from http://download-uk.oracle.com/docs/cd/A64702_01/doc/cartridg.805/a53264/sdo_ina5.gif)
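A sketch of that tessellation, with squares as (x, y, size) triples; it assumes distinct points (duplicates can never be separated) that all lie inside the half-open starting square:

    def tessellate(points, square, index):
        if len(points) == 0:
            return                             # empty square: ignore it
        if len(points) == 1:
            index.append((square, points[0]))  # exactly one point: index this square
            return
        x, y, s = square
        h = s / 2
        for qx, qy in [(x, y), (x + h, y), (x, y + h), (x + h, y + h)]:
            sub = [(px, py) for (px, py) in points
                   if qx <= px < qx + h and qy <= py < qy + h]
            tessellate(sub, (qx, qy, h), index)

    index = []
    tessellate([(1, 1), (2, 6), (6, 5)], (0.0, 0.0, 8.0), index)
    # index now holds one (square, point) pair per point of interest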
I think you are missing something in the problem statement. Assume you have N points (x, y) such that every point has a unique x- and y-coordinate. You can then divide your area into N rectangles just by dividing the area into N vertical columns. But that does not help you solve the nearest-POI problem easily, does it? So I think you have something in mind about the rectangle structure which you haven't articulated yet.
Illustration:
| | | | |5| | |
|1| | | | |6| |
| | |3| | | | |
| | | | | | | |
| |2| | | | | |
| | | | | | |7|
| | | |4| | | |
The numbers are POIs and the vertical lines show a subdivision into 7 rectangular areas. But clearly there isn't much "interesting" information in the subdivision. Is there some additional criterion on the subdivision which you haven't mentioned?
