Algorithm to find the smallest number of items that satisfy a given requirement - sql-server

I want to implement an algorithm for the following problem; it later needs to be implemented in T-SQL:
I have a set of providers - let's say shops. Each shop has its set of items it offers. Some items overlap between shops; some are only present in one shop.
I have a list of items - let's say a shopping list with a set of items I want.
I now have to find the combination of shops that offers ALL the items while requiring the fewest shops.
I am pretty sure this problem is frequently solved and the algorithm has its own name, but I was not able to find it via search.

Sorry, I think I misinterpreted the question the first time. Your problem is essentially the Set Cover Problem, which is NP-hard. There are good heuristic approaches, but no known efficient algorithm that guarantees an optimal solution.
The Knapsack Problem is similar but not quite your problem; it is worth looking at, though.

Not sure about any particular algorithm, but given your requirements I would:
Start with the shop that has the largest number of matching items from your list.
Remove those items from your list and iterate until the list is empty (see the sketch below).
If you want to really optimize this, you can then look for any redundant shops. A redundant shop is one whose every needed item is also provided by the other shops in your selection.
On second thought, this could be solved using binary linear programming, where each shop is a binary variable and the constraints require that each item be served by at least one shop; the objective is to minimize the number of selected shops. Not sure how you would solve this inside T-SQL, though.
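A minimal sketch of the greedy heuristic described above, in Python for illustration rather than T-SQL (the shop names and the data layout are made up):

```python
def greedy_cover(shops, wanted):
    # Greedy set-cover heuristic: repeatedly pick the shop that covers
    # the most still-missing items. Not guaranteed optimal, but a
    # well-known approximation for Set Cover.
    remaining = set(wanted)
    chosen = []
    while remaining:
        best = max(shops, key=lambda s: len(shops[s] & remaining))
        covered = shops[best] & remaining
        if not covered:
            raise ValueError("no shop offers: %s" % remaining)
        chosen.append(best)
        remaining -= covered
    return chosen

# Hypothetical data: shop -> set of offered items.
shops = {
    "A": {"milk", "bread", "eggs"},
    "B": {"eggs", "butter"},
    "C": {"bread", "butter", "jam"},
}
print(greedy_cover(shops, ["milk", "bread", "butter", "jam"]))  # ['C', 'A']
```

In T-SQL the same loop could presumably be driven by repeatedly selecting the shop with the highest count of still-uncovered items, but the binary-program formulation would need an external solver.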

Related

Common ways for pair matching of users in a database?

Hope you are safe and well!
I have a question about regular or common ways of pair matching when there is a database of users: say there are a few properties for each user, and when matching, each user can change the filtering options to only match those who fit their own requirements (so there is mutual selection between users), and we want to efficiently match 1000 users as precisely as possible.
For example, let's say there are 3 properties for every user: gender (female/male/other), study level (elementary/intermediate/advanced), and grade (freshman/sophomore/junior/senior), and when matching, each user can choose to only match with people of their selected gender, study level and grade.
When focusing on 1 user, I guess that, from the database's perspective, we could use filter conditions in queries to get a list of those who satisfy both "my requirements" and "I fit their requirements"? However, I think this would be slow and would run into concurrency problems when there are 1000+ users in the matching phase at the same time.
I saw another post here discussing the blossom algorithm and the greedy algorithm, which seem promising if you look at this as a graph problem. Are they doable in this case? I guess if two users mutually fit each other's requirements, there would be an edge between the two nodes, and the edge weight could be a combined matching score over all 3 properties?
Anyway, I'm wondering whether there is a common way to do this pair matching precisely with 1000+ users at the same time.
Thank you so much!
If the requirement is that each match must have exactly the same properties, then the solution is fairly simple: just do a multiple-criteria sort (e.g. first sort by gender, then within each gender category sort by study level, etc.) and pair the identical users.
However, in a random dataset you're very unlikely to have perfect matches for all users. In that case you would want to score pairs by how closely each category matches and use a more complex algorithm to maximize your overall matches. What you would do depends heavily on your use case and userbase size. Honestly, 1000 users is a very small number for modern computers; pretty much any polynomial-time method (including the blossom algorithm you mentioned) would work fine.
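A minimal sketch of the exact-match case in Python (the property encoding is made up):

```python
from itertools import groupby

# Hypothetical user records: (id, gender, study_level, grade).
users = [
    (1, "female", "advanced", "junior"),
    (2, "male", "elementary", "freshman"),
    (3, "female", "advanced", "junior"),
    (4, "male", "elementary", "freshman"),
]

# Multiple-criteria sort, then pair identical users within each group.
users.sort(key=lambda u: u[1:])
pairs = []
for _, group in groupby(users, key=lambda u: u[1:]):
    group = list(group)
    # Pair users off two at a time; an odd one out stays unmatched.
    pairs += [(a[0], b[0]) for a, b in zip(group[::2], group[1::2])]
print(pairs)  # [(1, 3), (2, 4)]
```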

How to find similarity for a large number of features

I'm not sure if I am asking the question in the right place as I'm new to Stack Overflow; please move it if required.
I'm trying to solve a link prediction problem for the Flickr dataset. My dataset has 5K nodes, and each node has around 27K features; the data is sparse.
I want to find the similarity between nodes so that I can predict a link between them if the similarity value is greater than some threshold that I decide. The problem is the number of features: I cannot load the file in Weka (to try to reduce the features by information gain or something similar and then try clustering, or to check a cosine similarity measure).
One more problem is: how do I define this as a classification problem? I wanted to find overlapping tags for two nodes, so the table contains the nodes and some of their features (there will be thousands), and all of them will be in the positive class only, as I know that there is a link between them.
I want to create a test dataset with some of the nodes, create a similar table, and label the rows as positive or negative. But my problem is that all the data I have is positive, so I think a classifier would never be able to label anything as negative. How do I change this into a classification problem correctly?
Any pointers or help is very much appreciated.
Weka can deal with 27K features; it shouldn't be a problem. However, I would approach this not as a plain classification problem but as a link-discovery one, which in this case can be seen as a matching problem.
My approach would be:
1. A new node appears.
2. Search for the most similar existing elements.
3. Assume they are related (there is a link) if the similarity is greater than your threshold.
The main problem would be tuning the threshold based on some quality measure.
For this approach, Lucene would probably be the best option.
I hope this helps.
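As a rough illustration of the thresholding idea (not the Lucene route), here is a sketch using scikit-learn's cosine similarity, which accepts sparse input; the matrix contents and the threshold value are made up:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sparse node-feature matrix (demo size; the real one
# would be 5K nodes x 27K binary tag features).
rng = np.random.default_rng(0)
X = csr_matrix(rng.random((100, 27000)) < 0.001, dtype=float)

THRESHOLD = 0.5  # would need tuning against a quality measure

sims = cosine_similarity(X)     # dense (n x n) similarity matrix
np.fill_diagonal(sims, 0.0)     # ignore self-similarity
links = np.argwhere(sims > THRESHOLD)  # predicted links (i, j)
print(links[:10])  # random demo data may yield few or no pairs
```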

What are the most efficient algorithms to recommend items to groups of users?

Collaborative filtering usually applies to predicting ratings for an individual user, but how would these algorithms change when an item (or items) needs to be recommended to multiple people (for example: friends wanting to watch a movie or wanting to choose a holiday together)?
Since this question is at a very general level, I will answer it at that level.
The key change is that the loss function typically minimized (or objective function maximized) for an individual would instead be minimized for a set of users. Unless you have training data for sets, this tends to be very difficult. What's more, the set could change depending on the recommendation.
Nonetheless, a naive approach would be to suggest a least-common-denominator item: one that maximizes the objective function averaged over the group's members.
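A minimal sketch of that least-common-denominator approach, assuming you already have per-user predicted ratings from an individual recommender (the numbers are made up):

```python
import numpy as np

# Hypothetical predicted ratings: rows are group members,
# columns are candidate items.
predicted = np.array([
    [4.5, 2.0, 3.5],   # user A
    [3.0, 4.0, 4.0],   # user B
    [4.0, 1.5, 4.5],   # user C
])

# Pick the item with the best average score across the group.
best_item = int(np.argmax(predicted.mean(axis=0)))
print(best_item)  # 2 in this toy example
```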

Examples of good UI for selecting multiple records

I'm currently revisiting an area of my Windows-based software and looking at changing a relationship from 1->M to M->M. As a result, I need to adjust the UI to accommodate selecting multiple related records.
There are a lot of common ways to handle this, but they are usually pretty clunky. Examples include the two-pane approach (a list of all items plus a list of selected items), or a list of all records with a checkbox beside each one that applies.
In my case, there may be an awful lot (in the tens of thousands) of records that could be associated, so I'll probably need to include some kind of search mechanism.
I'm not looking for a hard and fast answer -- I can easily implement something that's functional; I'm looking to see if anyone here has come up with (or seen) any great UIs for doing this kind of thing, whether web-based, Windows, Mac, Unix, whatever.
Images or links to them would be appreciated!
Edit: here's an example of what I'm considering:
I like the way StackOverflow relates many tags with many questions:
Items are displayed as the user types
You start, obviously, with the record you want to associate multiple items with.
As you type, the search displays the matches (no need to press "Search").
The user selects the desired record. (Sorting would be nice. SO uses "tag relevance": for instance, typing 'a' brings up Java rather than asp because Java has more questions than asp; in your case, relevance may be the user name.)
The system creates the relationship (in memory).
If a number of records (5+) are filling the input field, they are moved into a semi-rigid area. (This is not a problem for SO because a single question has at most 5 tags, but in your case something like the "interesting tags" feature would be needed.)
Associated items are moved to a "rigid" area
Of course in an ordered manner (using a table).
Finally, when the user is done with the association, they click the SAVE or CANCEL button.
This approach is more efficient because the user doesn't need to press "Search" or "add other", which distracts them from what they're doing; it is said that this interrupts their train of thought.
Also, if you make the user grab the mouse to click on something while they are typing, the UI is less efficient. (I think there is something called Hick's law about that, but quite frankly I may be wrong.)
As you see, this approach is pretty much what you already have in mind, but with some facilities added to make the user happier. (The danger would be if the user loves this approach and wants it in other parts of the system.)
It's an interesting and fairly common UI problem: how to efficiently select items. I'm assuming that you intend to have the user first select a single item, and that the mechanism you are interested in is how to choose the other items that get related to this first single item.
There are various selection methods. From a usability standpoint, it would be preferable to use just ONE method for each scenario. Then, when the user sees it, they will know what to do.
Various selection techniques:
1. Dropdown list - obvious for single selects.
2. Open list multi-select - e.g. a multiline textbox that shows 10 or 20 lines and has a scroll bar.
3. Dropdown list where you select, then hit an 'add' link or button, building up multiple selections.
4. List moving - where you have two open lists, with all the choices available in the left list; you select a few, then click a button to move your selection to the right list.
5. Checkboxes - good for just a few choices among multiple selection possibilities.
6. List of items, each with an 'add' button next to it - good for short lists.
You've said that you'll have thousands of possible choices, so that eliminates 1 and 5. Really, thousands will eliminate all of them, as the usability doesn't scale well with more than a few hundred in the list.
If you can count on the user to filter the list, as in your example, then 6 may be suitable. Think of how Facebook picture tagging works; I find it fairly efficient for long lists. Background: Facebook picture tagging is a mechanism that allows you to assign one or more people to portions of an image - i.e. 'tag' them.
When you select an image to tag (i.e. the 'single item') and wish to relate other items (people) to it, a dialog box pops up. It contains the top 6 or so names that you've used in the past, and a textbox where you can start to type the name of the person you wish to use. As you type, the list dynamically shrinks to only those people whose names contain the letter sequence you've typed. This works very well for large lists, but it does rely on the user typing to filter. It also relies on scripting to intelligently reduce the list based on the user's input.
For your application, it would rely on the user performing this step once for each association, as I'm assuming that the other items won't all have similar names!
Here's an image of the Facebook tagging application: http://screencast.com/t/9MPlpJQJzBQ
A search feature that filters records in real time as the user types would probably be a good thing to include. Another would be the ability to sort the records.
Since there may be a lot of records, the best choice in this case is probably to have a separate area which displays what you have already chosen, so that the user won't have to scroll around the selection areas to find what they already have.
self-explanatory GUI http://img25.imageshack.us/img25/8568/28666917.png
Another thing: in my opinion, your problem is not about selecting multiple records, but about filtering those tens of thousands of records. An M->M association can be implemented in a variety of ways, but the tricky part is to provide a convenient and logical way to browse/search a huge amount of data.
I'd suggest not requiring a click on "add more" before being able to search. The warning at the right is nice, but IMHO it should only say that the search displays results as the user types.
Sorting a column (maybe combined with the search) would also be nice functionality. I'd suggest doing it by clicking on the header of the table, with an icon indicating whether the sort is ascending or descending.
I'd also suggest having the search do approximate string matching when there are no or few results (see the sketch below). It is so annoying not being able to find something you don't remember exactly.
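A minimal sketch of that fallback using Python's standard difflib (the record names are invented):

```python
import difflib

records = ["Jonathan Smith", "Joanna Smyth", "John Smitt", "Mary Jones"]

def search(query, records):
    """Exact substring matches first; fall back to fuzzy matching."""
    exact = [r for r in records if query.lower() in r.lower()]
    if exact:
        return exact
    # Approximate matching when the user misremembers the spelling.
    return difflib.get_close_matches(query, records, n=5, cutoff=0.6)

print(search("Jon Smith", records))  # fuzzy hits like ['John Smitt', ...]
```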
Finally, for testing the first impression (though not the functionality itself), I'd suggest uploading it to the 5 second test and seeing what you get.
I think that what you have mocked up is a pretty good way to do it. When you think about the tags-to-posts relationship on a blog (or on SO even), that is many-to-many and it is usually implemented very similarly: for one post, you search for (or, since they are simple strings, directly enter) as many tags as you want to associate with it. I can't really think of any many-to-many relationships I encounter often, although I know there are probably many...
There are a number of important questions to consider - how many records will typically be used (as opposed to available for association)? Will there be a large number of records on one side of the association (given the switch from 1->M, this seems likely)?
If one of the quantities of records is usually very small (<10, I'd say) - call this the LHS (because it usually is) - then the best way to associate may be to allow searches for LHS and RHS items, then drag-and-drop them onto a list: LHS items onto the list proper, RHS items onto the existing LHS items. That way, it's intuitive to specify a relation between two items. You could also add other options like "associate with all", or a grouping pen so you can assign several records to several other records - nothing is as tedious as having to do 15 drag-and-drops of the same record.
In fact, I think that's the most crucial bit of any M->M UI design: minimize repetition. Doing the same thing over and over for hundreds of records (remember that if "nobody will ever...", they will) is no fun, especially if it's complex. I know this seems to contradict my earlier advice, but I don't think it does - design for the typical use case, but make sure that the atypical ones do not make the program unusable.

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weights point 2 higher; that would be the most important factor if I had to use only one of them when generating 'neighbours'.
One idea I had would be to just calculate the compatibility of every combination of users and select the highest-rated users to be the neighbours for each user. The downside of this is that as the number of users goes up, this process could take a very long time. For just 1000 users it needs C(1000, 2) = 0.5 * 1000 * 999 = 499,500 calls to the compatibility function, which could also be very heavy on the server.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2, "Making Recommendations", does a really good job of outlining methods of recommending items to people based on similarities between users. You could use those similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on Google Book Search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links from their leaderboard; a few of the competitors have shared their solutions.
As for a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: grindcore, math rock, riot grrrl...
Now, every time a user rates an item, roll their rating up to the category level. So you know 'User A' likes honky tonk and acid house because they frequently give those items high ratings. Frequency and strength are probably both important for the category aggregate score.
When it's time to find neighbours, instead of cruising through all the ratings, just look for similar scores in the categories (see the sketch below).
This method wouldn't be as accurate, but it's fast.
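A rough sketch of that category roll-up in Python (the categories and ratings are made up):

```python
from collections import defaultdict
import math

# Hypothetical ratings: user -> list of (category, rating 1..5).
ratings = {
    "A": [("honky tonk", 5), ("honky tonk", 4), ("acid house", 5)],
    "B": [("honky tonk", 4), ("acid house", 4), ("jazz", 2)],
    "C": [("jazz", 5), ("classical", 4)],
}

def category_profile(user_ratings):
    """Roll ratings up to the category level: frequency * strength."""
    profile = defaultdict(float)
    for category, rating in user_ratings:
        profile[category] += rating  # repeated high ratings add up
    return profile

def similarity(p, q):
    """Cosine similarity between two category profiles."""
    dot = sum(p[c] * q[c] for c in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

profiles = {u: category_profile(r) for u, r in ratings.items()}
print(similarity(profiles["A"], profiles["B"]))  # high: shared categories
print(similarity(profiles["A"], profiles["C"]))  # zero: no shared categories
```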
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty you face is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points; instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple: you invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity, or distance. Then you compare every item (user, in your case) to a set of prototype items and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user is represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as k-means.
The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, which is an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users, 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000 - an improvement of a factor of 2,500. What you get is an O(n) algorithm, which is better than O(n^2) despite a potentially large constant factor.
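A minimal sketch of the dissimilarity-space approach, assuming scikit-learn's KMeans; the user representation and the compatibility function here are stand-ins for the asker's real ones:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n_users, n_prototypes = 1000, 200

def compatibility(a, b):
    """Stand-in for the real compatibility function (made up here)."""
    return 1.0 / (1.0 + abs(a - b))

users = rng.random(n_users)  # placeholder user representations
prototypes = rng.choice(n_users, size=n_prototypes, replace=False)

# Each user becomes a point in a 200-dimensional "dissimilarity space":
# coordinate j is the inverted compatibility to prototype j.
coords = np.array([
    [1.0 / compatibility(users[i], users[p]) for p in prototypes]
    for i in range(n_users)
])  # n_users x n_prototypes evaluations, i.e. O(n) rather than O(n^2)

clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(coords)
# Neighbours of a user = the other users in the same cluster.
print(np.bincount(clusters))
```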
This seems to be a classification problem. Yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although most sites, like the one I linked to, display the net as two-dimensional, there is little involved in extending the algorithm into a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to one of finding the variables that define similarity and establishing distances between possible enumerated values; for example, classical and acoustic are close together, while death metal and reggae are quite distant (at least in my opinion).
By the way, for finding good dividing variables, the best algorithm is a decision tree. The nodes closer to the root will be the most important variables for establishing 'closeness'.
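For illustration, a minimal self-organizing map sketch using the third-party minisom package (an assumption; the taste vectors are made up):

```python
import numpy as np
from minisom import MiniSom  # third-party package, assumed installed

rng = np.random.default_rng(1)
# Hypothetical taste vectors: each user rates 10 genres in [0, 1].
users = rng.random((500, 10))

# An 8x8 self-organizing map over the 10-dimensional taste space.
som = MiniSom(8, 8, 10, sigma=1.5, learning_rate=0.5, random_seed=1)
som.train_random(users, 5000)

# Similar users land on the same (or adjacent) map cells, so a
# user's winning cell works almost like a reverse hash code.
cells = [som.winner(u) for u in users]
print(cells[:5])
```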
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. Then the neighbourhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's lecture series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you treat this as a build/batch problem rather than a real-time query.
The graph can be computed offline and then periodically updated (e.g. hourly, daily, etc.) to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
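A minimal sketch of such a batch job in Python (the compatibility function and user data are placeholders; in practice the results would be persisted for runtime lookup):

```python
import heapq
from itertools import combinations

def compatibility(a, b):
    """Placeholder for the real compatibility function."""
    return -abs(a - b)

users = {i: float(i % 37) for i in range(200)}  # made-up user data

# Offline batch job (run e.g. hourly/daily): precompute top-10 neighbours.
top10 = {u: [] for u in users}
for a, b in combinations(users, 2):
    score = compatibility(users[a], users[b])
    for u, v in ((a, b), (b, a)):
        heapq.heappush(top10[u], (score, v))
        if len(top10[u]) > 10:
            heapq.heappop(top10[u])  # drop the weakest neighbour

# At query time this is just a key lookup - no pairwise work.
print(sorted(top10[0], reverse=True))
```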
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.
