How much data do I need for a recommender system? - dataset

I have to develop a personality/job suitability online test for an HR department. Basically, users will answer questions on a scale of 0-10, for example, and after say 50 questions, I want to translate that to a rating in 5 different personality/job suitability characteristics.
I don't have any real data to start with, so first, is it even worth using a recommendation engine like MyMediaLite (github)? How many samples would I need to train it to decent performance?
I previously built a training course recommender by simply doing a hand-weighted sum where each question increased the weight of several courses that were related to that question. It was an expert system, built like a feed-forward neural network, where I personally tuned all the weights based on my knowledge of the questions and the courses' content.
This time around I would like to use a recommender system, but I'm wondering how many times I would have to take the 50-question test and then assign the results manually. Would 100 examples do? That could be possible. 1000 would take too long. How can I know ahead of time?

Though it may not be the answer you want, it is not possible to give a definite number ahead of time. You should focus on the learning curve as you add new samples.
You can process the samples by hand and with the engine in parallel, and compare the results given by both. Once the measurements of the engine's results, e.g. recall and precision, meet your expectations, you have enough samples.
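To make the learning-curve idea concrete, here is a minimal sketch using scikit-learn's learning_curve. The Ridge regressor, the scoring metric, and the random placeholder data are my assumptions; substitute your real question answers and hand-assigned ratings (one characteristic at a time).

    # Learning curve: validation error as a function of sample count.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import learning_curve

    X = np.random.randint(0, 11, size=(100, 50))  # 100 tests, 50 answers each
    y = np.random.uniform(0, 10, size=100)        # one hand-assigned rating

    sizes, train_scores, val_scores = learning_curve(
        Ridge(), X, y,
        train_sizes=np.linspace(0.2, 1.0, 5),
        cv=5, scoring="neg_mean_absolute_error",
    )
    for n, s in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:4d} samples -> validation MAE {-s:.2f}")
    # When the curve flattens, additional samples are buying you little.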
Hope this helps!

Related

Data set for Doc2Vec general sentiment analysis

I am trying to build a doc2vec model, using gensim + sklearn, to perform sentiment analysis on short sentences like comments, tweets, reviews, etc.
I downloaded an Amazon product review data set, a Twitter sentiment analysis data set, and an IMDB movie review data set.
Then I combined these into 3 categories: positive, negative, and neutral.
Next I trained a gensim doc2vec model on the above data so I could obtain the input vectors for the classifying neural net.
Then I used a sklearn LinearRegression model to predict on my test data, which is about 10% from each of the above three data sets.
Unfortunately the results were not as good as I expected. Most of the tutorials out there seem to focus only on one specific task, 'classify amazon reviews only' or 'twitter sentiments only'; I couldn't manage to find anything more general-purpose.
Can someone share their thoughts on this?
How good a result did you expect, and how good a result did you achieve?
Combining the three datasets may not improve overall sentiment-detection ability, if the signifiers of sentiment vary in those different domains. (Maybe, 'positive' tweets are very different in wording than product-reviews or movie-reviews. Tweets of just a few to a few dozen words are often quite different than reviews of hundreds of words.) Have you tried each separately to ensure the combination is helping?
Is your performance in line with other online reports of using roughly the same pipeline (Doc2Vec + LinearRegression) on roughly the same dataset(s), or wildly different? That will be a clue as to whether you're doing something wrong, or just have too-high expectations.
For example, the doc2vec-IMDB.ipynb notebook bundled with gensim tries to replicate an experiment from the original 'Paragraph Vector' paper, doing sentiment-detection on an IMDB dataset. (I'm not sure if that's the same dataset as you're using.) Are your results in the same general range as that notebook achieves?
Without seeing your code, and details of your corpus-handling & parameter choices, there could be all sorts of things wrong. Many online examples have nonsense choices. But maybe your expectations are just off.
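If it helps, here is a rough sketch of testing each domain separately, as suggested above, using gensim's Doc2Vec plus a classifier. Note I've swapped in LogisticRegression since the labels are categories; the parameter values are illustrative assumptions, and loading of the actual datasets is elided.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def score_domain(docs):
        # docs: [(list_of_tokens, sentiment_label), ...] for one domain
        tagged = [TaggedDocument(words=w, tags=[i])
                  for i, (w, _) in enumerate(docs)]
        model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
        model.build_vocab(tagged)
        model.train(tagged, total_examples=model.corpus_count,
                    epochs=model.epochs)

        X = [model.infer_vector(w) for w, _ in docs]
        y = [label for _, label in docs]
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return clf.score(X_te, y_te)

    # Compare score_domain(amazon), score_domain(tweets), score_domain(imdb)
    # against a model trained on the combined corpus.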

Is it better to use a pattern recognition tool to predict traffic conditions using previous data?

I wish to use an artificial neural network pattern recognition tool to predict the traffic flow of an urban area using previous traffic count data.
I want to know whether it is a good technique for predicting traffic conditions.
Probably should be posted on CrossValidated.
The exact effectiveness depends on what features you are looking at when predicting traffic conditions. The question of "whether it's a good technique" is too vague: neural networks might work pretty well under certain circumstances, and really badly in other situations. Without a specific context it's hard to tell.
Typically neural networks work pretty well at predicting patterns. If you can frame your problem as a specific pattern recognition task, then it's possible that neural networks will work pretty well.
-- Update --
Based on the following comment:
What I need to predict is the vehicle count of a given road, according to a given time and day, with the use of the previous data set. As an example, when I enter the name of the road I need to travel, the time I wish to travel, and the day, I need to get the vehicle count of that road at that time and day.
I would say be very cautious with using neural networks, because depending on your data source, your data may get really sparse. Let's say you have 10,000 roads; then for a month-long period you are dividing your data set by 30 days, then 24 hours, then 10,000 roads.
If you want your neural network to work, you need at least some data for each partition of your data set. If you divide your data set in the way described above, you already have 7,200,000 partitions (10,000 roads × 30 days × 24 hours). Just think about how much data you would need in total. With a small dataset, most of your 7 million partitions will have no data in them, which implies that your neural network predictions will fail most of the time, since there is no data to learn from.
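A quick back-of-the-envelope check makes the sparsity point vivid; the observation count below is a made-up assumption.

    roads, days, hours = 10_000, 30, 24
    partitions = roads * days * hours             # 7,200,000 cells
    observations = 500_000                        # hypothetical readings
    print(f"{partitions:,} cells, best-case coverage "
          f"{min(1.0, observations / partitions):.1%}")
    # Even if every reading landed in a distinct cell, ~93% stay empty.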
This is part of the reason why big companies are sort of crazy about big data, because you just never get enough of it.
But anyway, do ask on CrossValidated since people there are more statistician-y and can provide better explanations.
And please note, there might be other ways to split your data (or not split it at all) to make it work. The above is just an example of pitfalls you might encounter.

Developing an AI system to pick a fantasy football team

I'm looking to build an AI system to "pick" a fantasy football team. I have only basic knowledge of AI techniques (especially when it comes to game theory), so I am looking for advice on what techniques could be used to accomplish this and pointers to some reading materials.
I am aware that this may be a very difficult or maybe even impossible task for AI to complete accurately; however, I am not too concerned about accuracy. Rather, I am interested in learning some AI, and this seems like a fun way to apply it.
Some basic facts about the game:
A team of 14 players must be picked
There is a limit on the total cost of players picked
The players picked must adhere to a certain configuration (there must always be one goalkeeper, at least two defenders, one midfielder and one forward)
The team may be altered on a weekly basis, but removing/adding more than one player a week will incur a penalty
P.S. I have stats on every match played last season; could this be used to train the AI system?
This is interesting.
So if you didn't really care about accuracy at all, you could just come up with some heuristic for the quality of a team. For instance, assign a point value to each player and then try to maximize it using dynamic programming. Something like: http://www.cse.unl.edu/~goddard/Courses/CSCE310J/Lectures/Lecture8-DynamicProgramming.pdf
This would be similar to the knapsack problem.
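As a hedged illustration, here is a minimal 0/1-knapsack dynamic program that handles the budget constraint only; the positional rules (one goalkeeper, at least two defenders, and so on) would need extra dimensions in the table. The player costs and values are made up.

    def best_team_value(players, budget):
        # players: list of (cost, value) pairs; budget: int, in whole units
        best = [0] * (budget + 1)
        for cost, value in players:
            for b in range(budget, cost - 1, -1):  # go down: pick each once
                best[b] = max(best[b], best[b - cost] + value)
        return best[budget]

    players = [(9, 12), (5, 7), (7, 9), (4, 5), (6, 8)]
    print(best_team_value(players, 20))  # -> 27 (picking costs 9 + 5 + 6)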
Technically this is AI, since a computer is deciding something, but maybe not what you had in mind.
You sound like you want a learning AI (http://en.wikipedia.org/wiki/Machine_learning) which is an interesting field. Here's how you can approach the problem.
Define your inputs. Right now you have last year's data; you'll probably want data from many years. Also, you might be able to include pundits' rankings: maybe a bunch of magazines rank players or something; that seems useful as well.
Take your inputs and feed them into some machine learning algorithm for each season. Wikipedia will help you out there.
Essentially, for each season you'll want to feed in your data, have your AI pick a team, and then rate the team's performance based on that season's results.
Keep doing this and maybe your bot will get better at picking teams, and then you can apply it to this year's data.
(If you only have last year's data, it's okay to train the algorithm with just that, but your AI will probably be overfit to that one set and won't be as accurate.)
This was just a sketch of how it might look. For a romp into AI, this problem is probably pretty hard so don't feel disheartened if it seems overwhelming at first.
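For what it's worth, a toy version of the learn-then-pick loop might look like the sketch below; the feature set and fake data are my inventions, not real fantasy stats.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    # rows: players; columns: e.g. goals, assists, minutes from past seasons
    X_past = rng.random((200, 3))
    y_points = X_past @ np.array([60.0, 30.0, 10.0]) + rng.normal(0, 5, 200)

    model = GradientBoostingRegressor().fit(X_past, y_points)
    predicted_points = model.predict(X_past)  # with real data: next season
    # These predictions can serve as per-player 'values' for a
    # knapsack-style picker like the one in the other answer.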

Generating 'neighbours' for users based on rating

I'm looking for techniques to generate 'neighbours' (people with similar taste) for users on a site I am working on; something similar to the way last.fm works.
Currently, I have a compatibility function for users which could come into play. It ranks users on having 1) rated similar items and 2) rated those items similarly. The function weighs point 2 higher; it would be the more important factor if I had to use only one of the two when generating 'neighbours'.
One idea I had was to just calculate the compatibility of every combination of users and select the highest-rated users to be the neighbours for each user. The downside is that as the number of users goes up, this process could take a very long time. For just 1000 users, it needs 1000C2 (0.5 × 1000 × 999 = 499,500) calls to the compatibility function, which could also be very heavy on the server.
So I am looking for any advice, links to articles etc on how best to achieve a system like this.
In the book Programming Collective Intelligence
http://oreilly.com/catalog/9780596529321
Chapter 2 "Making Recommendations" does a really good job of outlining methods of recommending items to people based on similarities between users. You could use the similarity algorithms to find the 'neighbours' you are looking for. The chapter is available on google book search here:
http://books.google.com/books?id=fEsZ3Ey-Hq4C&printsec=frontcover
Be sure to look at Collaborative Filtering. Many recommendation systems use collaborative filtering to suggest items to users. They do it by finding 'neighbors' and then suggesting items your neighbors rated highly but you haven't rated. You could go as far as finding neighbors, and who knows, maybe you'll want recommendations in the future.
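To sketch what neighbour-based collaborative filtering looks like in practice (all data below is invented, and a real system would use sparse matrices):

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # users x items; 0 means unrated
    ratings = np.array([[5, 3, 0, 1],
                        [4, 0, 0, 1],
                        [1, 1, 0, 5],
                        [0, 1, 5, 4]], dtype=float)

    sim = cosine_similarity(ratings)                # user-user similarity
    user = 1
    neighbours = np.argsort(sim[user])[::-1][1:3]   # top 2, skipping self
    unrated = np.where(ratings[user] == 0)[0]
    scores = sim[user, neighbours] @ ratings[np.ix_(neighbours, unrated)]
    print(unrated[np.argsort(scores)[::-1]])        # best suggestions first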
GroupLens is a research lab at the University of Minnesota that studies collaborative filtering techniques. They have a ton of published research as well as a few sample datasets.
The Netflix Prize is a competition to determine who can most effectively solve this sort of problem. Follow the links off their LeaderBoard. A few of the competitors share their solutions.
As far as a computationally inexpensive solution, you could try this:
Create categories for your items. If we're talking about music, they might be classical, rock, jazz, hip-hop... or go further: Grindcore, Math Rock, Riot Grrrl...
Now, every time a user rates an item, roll up their ratings at the category level. So you know 'User A' likes Honky Tonk and Acid House because they frequently give those items high ratings. Frequency and strength are probably both important for your category aggregate score.
When it's time to find neighbors, instead of cruising through all ratings, just look for similar scores in the categories.
This method wouldn't be as accurate but it's fast.
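A sketch of the roll-up, with a frequency-times-strength weighting that is just one plausible choice:

    from collections import defaultdict

    def category_profile(user_ratings, item_category):
        # user_ratings: {item_id: rating}; item_category: {item_id: genre}
        totals, counts = defaultdict(float), defaultdict(int)
        for item, rating in user_ratings.items():
            cat = item_category[item]
            totals[cat] += rating
            counts[cat] += 1
        # mean rating in a category, weighted by how often they rate there
        return {cat: (totals[cat] / counts[cat]) * counts[cat] ** 0.5
                for cat in totals}

    # Neighbour search then compares these small profiles instead of
    # thousands of raw item ratings.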
Cheers.
What you need is a clustering algorithm, which would automatically group similar users together. The first difficulty that you are facing is that most clustering algorithms expect the items they cluster to be represented as points in a Euclidean space. In your case, you don't have the coordinates of the points. Instead, you can compute the value of the "similarity" function between pairs of them.
One good possibility here is to use spectral clustering, which needs precisely what you have: a similarity matrix. The downside is that you still need to compute your compatibility function for every pair of points, i.e. the algorithm is O(n^2).
If you absolutely need an algorithm faster than O(n^2), then you can try an approach called dissimilarity spaces. The idea is very simple. You invert your compatibility function (e.g. by taking its reciprocal) to turn it into a measure of dissimilarity or distance. Then you compare every item (user, in your case) to a set of prototype items, and treat the resulting distances as coordinates in a space. For instance, if you have 100 prototypes, then each user would be represented by a vector of 100 elements, i.e. by a point in 100-dimensional space. Then you can use any standard clustering algorithm, such as K-means.
The question now is how you choose the prototypes, and how many you need. Various heuristics have been tried; however, here is a dissertation which argues that choosing prototypes randomly may be sufficient. It shows experiments in which using 100 or 200 randomly selected prototypes produced good results. In your case, if you have 1000 users and you choose 200 of them to be prototypes, then you would need to evaluate your compatibility function 200,000 times, an improvement of a factor of 2.5 over comparing every pair. The real advantage, though, is that for 1,000,000 users, 200 prototypes would still be sufficient, and you would need to make 200,000,000 comparisons rather than 500,000,000,000, an improvement of a factor of 2500. What you get is an O(n) algorithm, which is better than O(n^2), despite a potentially large constant factor.
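A sketch of the prototype trick, assuming `users` is a list and `compatibility` is your pairwise function; I use 1/(1+c) rather than a bare reciprocal to avoid division by zero:

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_users(users, compatibility, n_prototypes=200,
                      n_clusters=20, seed=0):
        rng = np.random.default_rng(seed)
        proto_idx = rng.choice(len(users), size=n_prototypes, replace=False)
        # O(n * n_prototypes) evaluations instead of O(n^2)
        coords = np.array([[1.0 / (1.0 + compatibility(u, users[p]))
                            for p in proto_idx] for u in users])
        return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(coords)

    # Each user's cluster-mates become candidate neighbours.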
This seems like a classification problem, and yes, there are many solutions and approaches.
To start exploring, check this:
http://en.wikipedia.org/wiki/Statistical_classification
Have you heard of Kohonen networks?
It's a self-organizing learning algorithm that clusters similar variables into similar slots. Although most sites (like the one I linked to) display the net as two-dimensional, there is little involved in extending the algorithm into a multi-dimensional hypercube.
With such a data structure, finding and storing neighbours with similar tastes is trivial, as similar users should be stored in similar locations (almost like a reverse hash code).
This reduces your problem to finding the variables that define similarity and establishing distances between possible enumerated values; for example, classical and acoustic are close together while death metal and reggae are quite distant (at least in my opinion).
By the way, in order to find good dividing variables, the best algorithm is a decision tree. The nodes closer to the root will be the most important variables for establishing 'closeness'.
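If you want to experiment, the minisom package offers a small SOM implementation; the grid size, training length, and random taste vectors below are all assumptions:

    import numpy as np
    from minisom import MiniSom   # pip install minisom

    taste_vectors = np.random.rand(1000, 10)   # placeholder user features
    som = MiniSom(10, 10, input_len=10, sigma=1.0, learning_rate=0.5)
    som.train_random(taste_vectors, 5000)

    # Users mapping to the same (or adjacent) grid cell are 'neighbours'.
    cell_of_user = [som.winner(v) for v in taste_vectors]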
It looks like you need to read about clustering algorithms. The general idea is that instead of comparing every point with every other point each time, you divide them into clusters of similar points. Then the neighbourhood may be all the points in the same cluster. The number/size of the clusters is usually a parameter of the clustering algorithm.
You can find a video about clustering in Google's series about cluster computing and MapReduce.
Concerns over performance can be greatly mitigated if you consider this as a build/batch problem rather than a realtime query.
The graph can be computed statically and then updated periodically, e.g. hourly or daily, to generate edges and storage optimized for runtime queries, e.g. the top 10 similar users for each user.
+1 for Programming Collective Intelligence too - it is very informative - wish it wasn't (or I was!) as Python-oriented, but still good.

Measuring the performance of a classification algorithm

I've got a classification problem on my hands which I'd like to address with a machine learning algorithm (Bayesian, or Markovian probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier, taking the overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and use these very same samples to measure fitness, it might run into an overfitting problem: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples; I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of a classification algorithm.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
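In scikit-learn this is a few lines; the Naive Bayes classifier and the iris dataset below are just placeholders for your own:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(GaussianNB(), X, y, cv=10)  # 10-fold CV
    print(f"mean accuracy {scores.mean():.3f} +/- {scores.std():.3f}")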
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and for this I used Weka, which has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.
