Optimal Solution - Programming Theory

I would like someone to explain different approaches to a simple problem; I will then try to implement a solution in PHP for a wider application.
I have five people choosing who gets which of five rooms: Grand, Large, Medium, Medium and Small.
Person 1 orders the rooms Grand, Large
Person 2 orders the rooms Large, Medium
Person 3 orders the rooms Large, Small
Person 4 orders the room Medium
Person 5 orders the rooms Large, Medium
A room missing from someone's list is one they are not interested in.
What is the fairest way to choose who gets each room?

Use a heuristic to compute the matching value of every possible assignment. For example, if a person is left without a room, the value would be low or negative; if every person gets the best room they ordered, the value would be the highest.
Compute this value for every assignment and then take the assignment with the highest value.
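A brute-force sketch of this idea in Python: the rooms and preference lists come from the question, but the scoring weights are arbitrary assumptions, not part of the suggestion itself.

```python
from itertools import permutations

ROOMS = ["Grand", "Large", "Medium", "Medium", "Small"]
# Each person's ordered preferences; a missing room means "not interested".
PREFS = [
    ["Grand", "Large"],   # Person 1
    ["Large", "Medium"],  # Person 2
    ["Large", "Small"],   # Person 3
    ["Medium"],           # Person 4
    ["Large", "Medium"],  # Person 5
]

def score(assignment):
    """Heuristic value of one assignment (higher is better).

    +2 for a first choice, +1 for any other requested room,
    -5 for a room the person never asked for. The exact weights
    are an arbitrary choice.
    """
    total = 0
    for prefs, room in zip(PREFS, assignment):
        if room not in prefs:
            total -= 5
        elif room == prefs[0]:
            total += 2
        else:
            total += 1
    return total

# Enumerate every way to hand the five rooms to the five people.
best = max(permutations(ROOMS), key=score)
print(best, score(best))
```

With five rooms this enumerates at most 5! = 120 assignments, so exhaustive search is entirely practical here.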

Fairness is not always well defined.
However, in this case a person can either get a room he requested or not. One can therefore make a strong argument that all solutions in which the same number of people get a room they want are equally fair, and that solutions where more people get a room they want are fairer than those where fewer do (we are thus not giving any one person preference).
In your example there are exactly two assignments in which everyone gets a room he wants (they differ only in whether Person 2 or Person 5 takes the Large room). Either of those is a 'fairest' solution.
The algorithm to find this is just a depth-first search (or branch-and-bound if you need the speedup) that considers all possible allocations and finds a maximal one.
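A sketch of that search in Python. Here the DFS idea is realized as the standard augmenting-path algorithm for maximum bipartite matching (Kuhn's algorithm); the encoding and names are mine, not from the answer.

```python
def max_matching(prefs, n_rooms):
    """Maximum bipartite matching via DFS over augmenting paths.

    prefs[p] is the list of room indices person p will accept.
    Returns room_owner, where room_owner[r] is the person holding
    room r (or None if unassigned).
    """
    room_owner = [None] * n_rooms

    def try_assign(person, visited):
        # Depth-first: try each acceptable room, displacing the
        # current holder onto one of their alternatives if possible.
        for room in prefs[person]:
            if room in visited:
                continue
            visited.add(room)
            if room_owner[room] is None or try_assign(room_owner[room], visited):
                room_owner[room] = person
                return True
        return False

    for person in range(len(prefs)):
        try_assign(person, set())
    return room_owner

# Rooms: 0 Grand, 1 Large, 2 Medium, 3 Medium, 4 Small.
prefs = [[0, 1], [1, 2, 3], [1, 4], [2, 3], [1, 2, 3]]
print(max_matching(prefs, 5))  # every room ends up with a person who wanted it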

Related

Database Design and normalization in chess

I was wondering which is the better approach for my database/table design. As shown in the picture, I have players who play matches. One player plays multiple matches and one match is played by multiple players, so it is an n:m relation. This could result in three tables: player(id, firstname), player_to_match(playerid, matchid), match(id).
In my case, the number of players never changes; it is always two (n = 2). Which of the following designs is better?
(1)
player_to_match(matchid, playerid)
This gives two rows for each match and one redundant cell (the repeated matchid)
(2)
match(matchid, playerid1, playerid2)
As I said, the number of players per match can never change
Thank you
Lucas
[ER diagram with two entities, Player(ID, Firstname) and Match(ID), and an n:m association from Player to Match titled "plays"]
http://fs1.directupload.net/images/141210/rmeuutpg.png
I'd stick with option (1). It will make it easier to answer such simple questions as "how many matches has player X played?" With option (2), you'd have to query two columns for the value X to answer that question and that starts to get ugly.
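To make that difference concrete, a small sketch using SQLite from Python; the table and column names follow the question (SQLite reserves the keyword MATCH, hence the rename to chess_match), and the data is made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Option (1): junction table -- one simple predicate answers the question.
cur.execute("CREATE TABLE player_to_match (matchid INTEGER, playerid INTEGER)")
cur.executemany("INSERT INTO player_to_match VALUES (?, ?)",
                [(1, 10), (1, 11), (2, 10), (2, 12)])
print(cur.execute(
    "SELECT COUNT(*) FROM player_to_match WHERE playerid = ?", (10,)
).fetchone())  # (2,)

# Option (2): both players on the match row -- the same question now
# has to check the value against two columns.
cur.execute("CREATE TABLE chess_match "
            "(matchid INTEGER, playerid1 INTEGER, playerid2 INTEGER)")
cur.executemany("INSERT INTO chess_match VALUES (?, ?, ?)",
                [(1, 10, 11), (2, 12, 10)])
print(cur.execute(
    "SELECT COUNT(*) FROM chess_match WHERE playerid1 = ? OR playerid2 = ?",
    (10, 10),
).fetchone())  # (2,)
```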
Use the 2-table design. For something like this, you don't need the extra complexity because there is no chance chess is ever going to need 3 players. Unless you watch Big Bang Theory...
I prefer to start with the simpler form, and then modify it later if needed. As developers, we tend to try to come up with a solution that will handle any future possibility, but most of the time it never happens, and we've wasted a lot of time building an elegant solution for a problem that doesn't exist. Go simple first.
If you do need the 3-table option, you have some extra work to do to make sure there are always 2 related records to a match, no more, no less. Make sure you can't delete a user that is attached to an existing match, or you will have a match with only one player. A few things like that you'll have to watch out for.
I would do this:
matchid
black (references PLAYER)
white (references PLAYER)
The number of players in the game is fixed at two, which eliminates the rationale for a 1-to-n child table; moreover, each player has a defined role (white vs. black), and you want to be able to distinguish them in that manner.
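A minimal sketch of that design (SQLite from Python; the table name chess_match and the CHECK constraint are my additions, not part of the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, firstname TEXT)")
con.execute("""
    CREATE TABLE chess_match (
        matchid INTEGER PRIMARY KEY,
        black   INTEGER NOT NULL REFERENCES player(id),
        white   INTEGER NOT NULL REFERENCES player(id),
        CHECK (black <> white)   -- a player cannot face themselves
    )
""")
```

Note that the two NOT NULL columns enforce "exactly two players per match" declaratively, which the junction-table design can only enforce with triggers or application code.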

Stadium problem: provide an algorithm to solve the problem

In a small stadium there are several thousand people in the stands. Devise a distributed algorithm enabling the audience to count itself. Do not assume any particular geometry of the stadium except, if you want, that it is bowl shaped. Explicitly state your assumptions, then present your algorithm and analysis.
I was assuming the audience members form a linked list, and I would walk it, incrementing a counter and calling free(ptr) on each node... I may be wrong. Kindly provide some useful insights.
Thanks in advance.
Assuming everyone can talk to his/her neighbor (possibly over many empty seats) and that fans of team A are willing to speak to fans of team B, the following could work:
Everyone grabs their nearest neighbour who is not already grabbed by someone else, forming groups of at most two people. Everyone now remembers the size of the group they are in (1 or 2). Each group then chooses a leader who is able to communicate with a member of another group. The leaders try to join their group with one other group, and every member of the two (now joined) groups remembers the sum of the two group sizes (this can be done by broadcasting the new value to the group). This process continues until only one group is left. Upon termination, everyone knows the number of people in the stadium.
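A toy simulation of that doubling process (sequentialized here; in the stadium, every merge within a round would happen in parallel):

```python
import random

def count_by_merging(n_people, seed=0):
    """Simulate the pairwise group-merging protocol.

    Each group remembers its size; in every round, groups pair up
    and both sides adopt the summed size, until one group remains.
    """
    rng = random.Random(seed)
    groups = [1] * n_people          # everyone starts alone, knowing "1"
    rounds = 0
    while len(groups) > 1:
        rng.shuffle(groups)          # arbitrary pairing
        merged = [groups[i] + groups[i + 1]
                  for i in range(0, len(groups) - 1, 2)]
        if len(groups) % 2:          # an unpaired group waits for next round
            merged.append(groups[-1])
        groups = merged
        rounds += 1
    return groups[0], rounds

print(count_by_merging(5000))  # (5000, 13) -- roughly log2(n) rounds
```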
Hope this helps.
In a small stadium there are several thousand people in the stands. Devise a distributed algorithm enabling the audience to count itself.
Feynman answer (see round manhole question): Have everyone shout "Several thousand!"
Here is another algorithm:
Let everyone count the others and himself.
Then, at the sound of a horn, each counter shouts his count.
Keep the most shouted.
With this algorithm you can cope with errors.
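"Keep the most shouted" is just a majority vote over everyone's independent counts, which is what makes it error-tolerant. A tiny sketch (the shouted values are made up):

```python
from collections import Counter

def most_shouted(shouts):
    # Majority vote: a minority of miscounts is simply outvoted.
    return Counter(shouts).most_common(1)[0][0]

print(most_shouted([4999, 5000, 5000, 5000, 5001]))  # 5000
```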
For every column, a leader is chosen with the rule "the person in the row closest to the field is the leader" (these seats are usually filled). The leader initiates a count of the people in that column in the following manner (a code sketch follows the steps):
1. Shake hands with the person sitting directly behind you and ask, "you?"
2. If no one is sitting behind that person, the answer is 1; otherwise they repeat step 1 with the person behind them, and the answer is one more than the answer from behind.
3. The leader immediately writes this number on a board and holds it up.
4. Among the leaders, the youngest person starts collecting these boards and adding them up. If she meets a person younger than her collecting boards, the count so far is handed over to that person. If they are the same age, the taller person takes over.
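A sketch of those steps, with made-up column sizes; the recursion is steps 1-2 and the sum over leaders stands in for steps 3-4:

```python
def column_count(column):
    """Steps 1-2: each person asks the person directly behind, 'you?'."""
    def ask(i):
        # The last row answers 1; everyone else adds 1 to the answer from behind.
        return 1 if i == len(column) - 1 else 1 + ask(i + 1)
    return ask(0)  # the leader sits in the front row (index 0)

def stadium_total(columns):
    """Steps 3-4: every leader holds up a count; one collector sums the boards."""
    return sum(column_count(col) for col in columns)

columns = [["fan"] * n for n in (23, 31, 17)]  # three made-up columns
print(stadium_total(columns))                  # 71
```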
Everyone drops their dacks, and the "output" is available in the local paper the next day, à la "N people arrested at world's first mass-streaker incident" (i.e. the application asks the system to do the work).
Each person picks one other person to fight, and first asks how many people they've knocked out. Winner finds another opponent, adds the defeated opponent's count. Last man (or woman) standing has the answer.
Everyone stands up, plucks a hair, then hands it to someone nearby with at least as many hairs before sitting down again. Standing people continue to seek others to give hairs to. The last person counts the hairs.
You invite people to grab a bag of minties from the canteen, then hand them around until they can't find anyone who hasn't had one, then drop the bags at a central point. The count is bags × minties-per-bag − minties-left-in-bags.

Atomicity of field for part numbers

In our internal inventory application, we store three values (in separate fields) that become the printed "part number" in this format: PPP-NNNNN-VVVV (P = Prefix, N = Number, V = version).
So for example, if you have a part 010-00001-01 you know it's version 1 of a part of type "010" (which let's say is a printed circuit board).
So, in the process of creating parts engineering wants to group parts together by keeping the "number" component (the middle 5 digits) the same across multiple prefixes like so:
001-00040-0001 - Overall assembly
010-00040-0001 - PCB
015-00040-0001 - Schematics
This seems problematic and frustrating as it sometimes adds extra meaning to the "number" field (but not consistently since not all parts with the same "number" component are necessarily linked).
Am I being a purist or is this fine? 1NF is awfully vague with regards to atomicity. I think I'm mostly frustrated because of the extra logic to ensure that the next "number" part of the overall part number is valid and available for all prefixes.
There have been a number of enterprises that have foundered, or nearly foundered, on the "part number syndrome". You might be able to find some case studies. DEC part numbers were somewhat mixed up.
The customer is not always right, but the customer is always the customer.
In this case, it sounds to me like engineering is trying to use a single number to model a relationship: the relationship between the overall assembly, the PCB, and the schematics. It's better to model relationships as relations. That gives you more flexibility down the road. You may have a hard time selling engineering on this point.
In my experience, regardless of database normative rules, when the client/customer/user wants something done a certain way, there is most likely a reason for it, and that reason will save them money (in some fashion). Sometimes it will save money by reducing steps, by reducing training costs, or simply because That's The Way It's Always Been. Whatever the reason, eventually you'll end up doing it because they're paying to have it done (unless it violates accounting rules).
In this instance, it sounds like an extra sorting criterion on some report queries, and a new 'allocated number' table with an auto-incrementing key. That doesn't sound too bad to me. Ask me sometime about the database report a client VP commissioned strictly to cast data in such a fashion as to make a different VP look bad in meetings (not that he told me that up front).
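That 'allocated number' table could be as small as this sketch (SQLite from Python; all names and the part-number formatting are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per allocated middle "number"; the auto-incrementing key
# hands out the next free value, shared across all prefixes.
con.execute("""
    CREATE TABLE allocated_number (
        number INTEGER PRIMARY KEY AUTOINCREMENT,
        description TEXT
    )
""")
cur = con.execute("INSERT INTO allocated_number (description) VALUES (?)",
                  ("Widget rev A",))
number = cur.lastrowid               # valid for every prefix of the part
print(f"010-{number:05d}-0001")      # e.g. 010-00001-0001
```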

AI: Determining what tests to run to get the most useful data

This is for http://cssfingerprint.com
I have a system (see about page on site for details) where:
I need to output a ranked list, with confidences, of categories that match a particular feature vector
the binary feature vectors are a list of site IDs & whether this session detected a hit
feature vectors are, for a given categorization, somewhat noisy (sites will decay out of history, and people will visit sites they don't normally visit)
categories are a large, non-closed set (user IDs)
my total feature space is approximately 50 million items (URLs)
for any given test, I can only query approx. 0.2% of that space
I can only make the decision of what to query, based on results so far, ~10-30 times, and must do so in <~100ms (though I can take much longer to do post-processing, relevant aggregation, etc)
getting the AI's probability ranking of categories based on results so far is mildly expensive; ideally the decision will depend mostly on a few cheap sql queries
I have training data that can say authoritatively that any two feature vectors are the same category but not that they are different (people sometimes forget their codes and use new ones, thereby making a new user id)
I need an algorithm to determine what features (sites) are most likely to have a high ROI to query (i.e. to better discriminate between plausible-so-far categories [users], and to increase certainty that it's any given one).
This needs to balance exploitation (testing based on prior test data) against exploration (testing things that haven't been tested enough to find out how they perform).
There's another question that deals with a priori ranking; this one is specifically about a posteriori ranking based on results gathered so far.
Right now, I have little enough data that I can just always test everything that anyone else has ever gotten a hit for, but eventually that won't be the case, at which point this problem will need to be solved.
I imagine that this is a fairly standard problem in AI - having a cheap heuristic for what expensive queries to make - but it wasn't covered in my AI class, so I don't actually know whether there's a standard answer. So, relevant reading that's not too math-heavy would be helpful, as well as suggestions for particular algorithms.
What's a good way to approach this problem?
If you know nothing about the features you have not sampled, then you have little to go on when deciding whether to explore or exploit your data. If you can express your ROI as a single number following every query, then there is an optimal way of making this choice by keeping track of upper confidence bounds. See the paper Finite-time Analysis of the Multiarmed Bandit Problem.
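The UCB1 policy from that paper is only a few lines. A sketch, with abstract "arms" standing in for the sites you might query; the bookkeeping structures are my own:

```python
import math

def ucb1_choice(counts, rewards, t):
    """Pick the arm maximizing mean reward plus an exploration bonus.

    counts[i]  -- how often arm i has been queried
    rewards[i] -- summed payoff observed for arm i
    t          -- total number of queries made so far
    Untried arms are queried first.
    """
    best, best_score = None, float("-inf")
    for i, n in enumerate(counts):
        if n == 0:
            return i                     # explore: never tried
        score = rewards[i] / n + math.sqrt(2 * math.log(t) / n)
        if score > best_score:
            best, best_score = i, score
    return best

counts, rewards = [3, 1, 5], [2.0, 1.0, 2.5]
print(ucb1_choice(counts, rewards, t=9))  # 1: promising and under-sampled
```

The square-root term is what trades exploration against exploitation: rarely sampled arms get a large bonus, heavily sampled arms are judged mostly on their observed mean.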

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It scores users on 1) having rated similar items and 2) having rated those items similarly. The function weighs point 2 higher; if I had to use only one of these factors when generating 'neighbours', it would be the most important.
One idea I had is to just calculate the compatibility of every pair of users and select the highest-scoring users as the neighbours for each user. The downside is that as the number of users goes up, this process could take a very long time. For just 1000 users it needs C(1000, 2) = 0.5 × 1000 × 999 = 499,500 calls to the compatibility function, which could also be very heavy on the server.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2, "Making Recommendations", does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on Google Book Search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll the rating up to the category level. So you know 'User A' likes Honky Tonk and Acid House because they frequently give those items high ratings. Frequency and strength are probably both important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
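A sketch of that rollup, with hypothetical item-to-category and rating data:

```python
from collections import defaultdict

# Hypothetical data: each item's category, and raw per-item ratings.
item_category = {"song1": "Honky Tonk", "song2": "Honky Tonk",
                 "song3": "Acid House"}
ratings = {"User A": {"song1": 5, "song2": 4, "song3": 5}}

def category_profile(user):
    """Roll per-item ratings up to (strength, frequency) per category."""
    profile = defaultdict(lambda: [0, 0])
    for item, rating in ratings[user].items():
        cat = item_category[item]
        profile[cat][0] += rating    # strength: summed rating
        profile[cat][1] += 1         # frequency: number of ratings
    return dict(profile)

print(category_profile("User A"))
# {'Honky Tonk': [9, 2], 'Acid House': [5, 1]}
```

Finding neighbours then means comparing these short category vectors instead of the full rating histories.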
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, each user is represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as k-means.
The question now is how to choose the prototypes, and how many you need. Various heuristics have been tried; however, there is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and choose 200 of them as prototypes, you would need to evaluate your compatibility function 200,000 times, an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
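A sketch of the dissimilarity-space trick; the compatibility function here is a numeric stand-in for the real pairwise function, and the prototype count follows the figures above:

```python
import random

def compatibility(u, v):
    # Stand-in for the real pairwise compatibility function.
    return 1.0 / (1.0 + abs(u - v))

def embed(users, n_prototypes=200, seed=0):
    """Represent each user by distances to randomly chosen prototypes.

    Needs only len(users) * n_prototypes compatibility calls -- O(n)
    rather than O(n^2) -- after which any standard clustering
    algorithm (e.g. k-means) can run on the resulting vectors.
    """
    prototypes = random.Random(seed).sample(users, n_prototypes)
    return [[1.0 / compatibility(u, p) for p in prototypes]  # dissimilarity
            for u in users]

users = list(range(1000))       # stand-in user IDs
vectors = embed(users)
print(len(vectors), len(vectors[0]))  # 1000 200
```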
This looks like a classification problem. Yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although the net is usually displayed as two-dimensional, there is little involved in extending the algorithm into a multiple-dimension hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to finding the variables that define similarity and establishing distances between possible enumerated values; for example, classical and acoustic are close together while death metal and reggae are quite distant (at least in my opinion).
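A minimal one-dimensional Kohonen net (self-organizing map) sketch over user rating vectors; the node count, epochs and decay schedule are arbitrary choices:

```python
import numpy as np

def train_som(data, n_nodes=10, epochs=50, lr=0.5, radius=2.0, seed=0):
    """Train a 1-D self-organizing map: similar inputs end up
    mapped to the same or adjacent nodes."""
    rng = np.random.default_rng(seed)
    weights = rng.random((n_nodes, data.shape[1]))
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # shrink learning over time
        for x in data:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best match
            dist = np.abs(np.arange(n_nodes) - bmu)
            pull = np.exp(-dist**2 / (2 * (radius * decay + 1e-9)**2))
            weights += lr * decay * pull[:, None] * (x - weights)
    return weights

# Users as rating vectors; neighbours share a best-matching node.
users = np.random.default_rng(1).random((100, 8))
weights = train_som(users)
nodes = [int(np.argmin(np.linalg.norm(weights - u, axis=1))) for u in users]
print(nodes[:10])
```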
By the way, in order to find good dividing variables, the best algorithm is a decision tree. The nodes closest to the root will be the most important variables for establishing 'closeness'.
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. Then the neighborhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.
The graph can be computed offline and refreshed periodically (e.g. hourly or daily) to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
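A sketch of such a batch job; the compatibility function is again a stand-in, and in practice the table would be persisted and refreshed on a schedule rather than held in memory:

```python
import heapq

def compatibility(u, v):
    # Stand-in for the real pairwise compatibility function.
    return 1.0 / (1.0 + abs(u - v))

def build_neighbour_table(users, k=10):
    """Offline batch job: precompute each user's top-k neighbours
    so the runtime query is a single key lookup."""
    table = {}
    for u in users:
        scored = ((compatibility(u, v), v) for v in users if v != u)
        table[u] = [v for _, v in heapq.nlargest(k, scored)]
    return table

neighbours = build_neighbour_table(list(range(100)))
print(neighbours[50])  # the ten users closest to user 50
```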
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.
