How to come up with questions on supervised and unsupervised learning?

I am new to data mining and trying to learn the differences between supervised and unsupervised learning. So far, what I know is that supervised learning means learning from labeled datasets, while unsupervised learning means clustering the data without any labels given.
I roughly understand what they are, but I can't really apply the concepts to pose real questions. I found the following example question on a machine-learning web forum and was wondering if someone could help me with it, so I can use it as an example to understand the concepts a little better. The question is:
Given the following dataset on different cars, make up 2 questions based on supervised and unsupervised learning.
Any kind of help is appreciated.
Thanks :)

Supervised Learning:
The above dataset has 11 attributes. Among them, the am column classifies each car's transmission: manual if it is 0 and automatic if it is 1.
In supervised learning we are usually given training data together with a response variable. So if you treat this as data for supervised learning, you can train a model using an appropriate algorithm and then, for any test data, predict the corresponding am (manual or automatic).
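For instance, a minimal sketch of this supervised workflow in Python with scikit-learn might look like the following; the file name cars.csv and the exact column layout are assumptions for illustration, not part of the original question.

```python
# Minimal supervised-learning sketch: predict the "am" column from the other
# attributes. "cars.csv" and its columns are assumed, not given in the question.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

cars = pd.read_csv("cars.csv")
X = cars.drop(columns=["am"]).select_dtypes("number")  # remaining numeric attributes
y = cars["am"]                                          # transmission label (response)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)                               # learn from labeled examples
print(accuracy_score(y_test, clf.predict(X_test)))      # predict am for unseen cars
```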
Unsupervised Learning:
Assume that the transmission column is not given, and try to group the cars into clusters using any unsupervised learning algorithm. See whether you end up with two clusters, one consisting of manual-transmission cars and one consisting of automatic-transmission cars.
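A minimal sketch of the unsupervised variant, under the same assumptions about the data file: drop the am column, ask KMeans for two clusters, and check how well the clusters line up with the transmission types you held back.

```python
# Minimal unsupervised sketch: cluster the cars without the "am" label, then
# compare the two clusters against the label we kept aside.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

cars = pd.read_csv("cars.csv")
hidden_labels = cars["am"]                               # kept aside only for the comparison
X = StandardScaler().fit_transform(cars.drop(columns=["am"]).select_dtypes("number"))

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(pd.crosstab(clusters, hidden_labels))              # do clusters match manual/automatic?
```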
You may also check the links below, two short videos from Andrew Ng's lectures; they will help you get a better understanding.
https://www.youtube.com/watch?v=ls7Ke48jCt8&index=3&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW
https://www.youtube.com/watch?v=qHfUlFHGG08&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=4

Related

How much data do I need for a recommender system?

I have to develop a personality/job-suitability online test for an HR department. Basically, users will answer questions on a scale of 0-10, and after, say, 50 questions, I want to translate that into a rating on 5 different personality/job-suitability characteristics.
I don't have any real data to start with, so first: is it even worth using a recommendation engine like MyMediaLite (github)? How many samples will I need to train it to decent performance?
I previously built a training-course recommender by simply doing a hand-weighted sum, where each question increased the weight of several courses related to that question. It was an expert system, built like a feed-forward neural network, where I personally tuned all the weights based on my knowledge of the questions and the courses' content.
This time around I would like to use a recommender system, but I'm wondering how many times I would have to take the 50-question test and then assign the results manually. Would 100 examples do? That could be feasible. 1000 would take too long. How can I know ahead of time?
Though it may not be the answer you want, it is not possible to give a definite number. You should instead watch the learning curve as you add new samples.
You can score the samples by hand and with the engine in parallel, and compare the results from both. Once the measurements (e.g. recall and precision) of the engine's results meet your expectations, you have enough samples.
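As a rough sketch of that idea, assuming the hand-scored tests are stored as a feature matrix X (one row per completed 50-question test) and a target y (the manually assigned score for one characteristic), you could watch a learning curve like this; the Ridge model and the synthetic data below are placeholders only.

```python
# Learning-curve sketch: how does validation performance change as more
# hand-labeled tests are added? X and y are synthetic placeholders here.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.integers(0, 11, size=(100, 50)).astype(float)          # 100 tests, 50 answers each
y = X[:, :10].mean(axis=1) + rng.normal(scale=0.5, size=100)   # one suitability score per test

sizes, train_scores, val_scores = learning_curve(
    Ridge(), X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:3d} labeled tests -> validation R^2 = {score:.2f}")
# When the validation score stops improving as labeled tests are added,
# collecting more hand-scored examples is unlikely to help much.
```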
Hope this helps!

Classification of reviews from customers into Good, Bad and Neutral

I have a typical AI problem to solve. Customers will submit comments about a product, and I have to create a program that classifies these comments as either good, bad or neutral.
Surely a neural network could play a great role in it.
I also think fuzzy logic could play some role, such as expressing how strongly a comment is good, bad or neutral.
Any more ideas about how to solve it?
This problem is usually referred to as Sentiment Analysis. You can check out the wikipedia entry about Sentiment Analysis for a brief review, or Liu Bing's page on sentiment analysis for more detailed resources and tutorials.
You can use some form of supervised learning.
The most important thing for classification is then choosing the right features. "Features" means you extract some values from the review that still capture the essence with respect to the classification task. Things that come to my mind are
number of words
average number of words per sentence
number of words from some set like {crap, shit, damn, viagra, ...}
Then you can use any available machine learning algorithm (neural networks, SVM) and train a classifier, provided you have enough reviews labeled with good/neutral/bad.
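A minimal sketch of those hand-crafted features plus a linear SVM might look like this; the example reviews and labels are made up purely for illustration.

```python
# Hand-crafted review features (word count, words per sentence, swear-word count)
# fed into a linear SVM. The tiny training set below is invented.
import numpy as np
from sklearn.svm import LinearSVC

BAD_WORDS = {"crap", "shit", "damn", "viagra"}

def features(review):
    words = review.lower().split()
    sentences = max(review.count("."), 1)
    return [
        len(words),                                         # number of words
        len(words) / sentences,                             # average words per sentence
        sum(w.strip(".,!?") in BAD_WORDS for w in words),   # swear-word count
    ]

reviews = ["Great product, works as advertised.",
           "Total crap, broke after one day. Damn.",
           "It is okay. Nothing special."]
labels = ["good", "bad", "neutral"]   # hand-labeled training data

clf = LinearSVC().fit(np.array([features(r) for r in reviews]), labels)
print(clf.predict([features("What a piece of crap.")]))
```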
Neural networks would certainly work for it; however, I would be suspicious about introducing new words and languages. I would go for a Bayes net approach for determining the probability of being in a good/neutral/bad state. You should consider cleaning the data (stemming, etc.) before putting it through the Bayes net.
Additionally: The meta attributes [what ziggy mentioned] are more of an indicator to boost the performance of the approach you take.
EDIT: Bayes-Nets are a form of supervised learning.
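If you go the Bayesian route, a rough sketch using scikit-learn's MultinomialNB (a Naive Bayes classifier rather than a full Bayesian network) over a cleaned bag-of-words could look like this; the training reviews are invented for illustration.

```python
# Naive Bayes over a bag-of-words with basic cleaning (lowercasing, stop words).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = ["love it, works perfectly",
           "terrible, stopped working after a week",
           "does the job, nothing more"]
labels = ["good", "bad", "neutral"]

model = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english"),  # simple text cleaning
    MultinomialNB(),
)
model.fit(reviews, labels)
print(model.predict(["stopped working, terrible purchase"]))
```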

Neural networks by function examples -- how to get a feeling for it?

I am looking for recommended books (or other materials, such as web pages) that demonstrate such examples -- the structure of an (artificial) neural network for a given function.
I.e. what is the best (in the sense of being minimalistic yet correct) network structure for the function min with N arguments? Or for the function abs? And so on.
The reason for my question (what books do you recommend?) is that I would like to get a proper "feeling" for how to shape a network to achieve the desired effect, without the overkill of a dense network that computes correctly but very inefficiently.
There is no such thing as "the best NN structure". If you are lucky, you will find a structure that does the job, but that doesn't mean it is the "best".
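As a concrete illustration that compact exact structures do exist for some functions (without any claim of optimality), abs and a two-argument min can each be written with a couple of ReLU units:

```python
# Tiny numpy sketch: abs and two-argument min written as small ReLU "networks".
# This only shows that compact exact structures exist for some functions;
# it says nothing about what is optimal in general.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def abs_net(x):
    # abs(x) = relu(x) + relu(-x): two hidden units summed at the output
    return relu(x) + relu(-x)

def min_net(a, b):
    # min(a, b) = b - relu(b - a): one hidden unit plus a skip connection
    return b - relu(b - a)

print(abs_net(np.array([-3.0, 2.0])))                        # [3. 2.]
print(min_net(np.array([1.0, 5.0]), np.array([4.0, 2.0])))   # [1. 2.]
```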
I highly recommend reading Programming Collective Intelligence by Toby Segaran. It has a chapter on neural networks and explains many other artificial intelligence algorithms in a clear and concise way.
You may also find broader overviews of neural networks here and here.
There is a lecture course on iTunes-U called "Informatics for Nursing" that contains several lectures dedicated to ANN's
UPDATE June 2019: the iTunes-U course is no longer available and I couldn't find it elsewhere
Good luck

Developing an AI system to pick a fantasy football team

I'm looking to build an AI system to "pick" a fantasy football team. I have only basic knowledge of AI techniques (especially when it comes to game theory), so I am looking for advice on what techniques could be used to accomplish this and pointers to some reading materials.
I am aware that this may be a very difficult, or maybe even impossible, task for AI to complete accurately; however, I am not too concerned about accuracy. Rather, I am interested in learning some AI, and this seems like a fun way to apply it.
Some basic facts about the game:
A team of 14 players must be picked
There is a limit on the total cost of players picked
The players picked must adhere to a certain configuration (there must always be one goalkeeper, at least two defenders, one midfielder and one forward)
The team may be altered on a weekly basis, but removing/adding more than one player a week will incur a penalty
P.S. I have stats on every match played last season; could this be used to train the AI system?
This is interesting.
So if you didn't really care about accuracy at all, you could just come up with some heuristic for the quality of a team. For instance, assign a point value to each player and then try to maximize it using dynamic programming. Something like: http://www.cse.unl.edu/~goddard/Courses/CSCE310J/Lectures/Lecture8-DynamicProgramming.pdf
This would be similar to the knapsack problem.
Technically this is AI since a computer is deciding something but maybe not what you had in mind.
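As a sketch of that heuristic, here is a simplified knapsack-style dynamic program that maximizes projected points under a cost cap; the position quotas and the 14-player limit are left out to keep it short, and the player data is invented.

```python
# Simplified 0/1 knapsack over players: maximize projected points within a budget.
def best_squad(players, budget):
    # players: list of (name, cost, projected_points); costs are integers
    # dp[c] = (best points achievable with total cost <= c, chosen names)
    dp = [(0.0, [])] * (budget + 1)
    for name, cost, points in players:
        new_dp = dp[:]
        for c in range(cost, budget + 1):
            prev_points, prev_names = dp[c - cost]       # use old dp: each player at most once
            if prev_points + points > new_dp[c][0]:
                new_dp[c] = (prev_points + points, prev_names + [name])
        dp = new_dp
    return dp[budget]

players = [("Keeper A", 4, 120.0), ("Defender B", 5, 150.0),
           ("Midfielder C", 9, 210.0), ("Forward D", 11, 260.0)]
print(best_squad(players, budget=20))
```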
You sound like you want a learning AI (http://en.wikipedia.org/wiki/Machine_learning) which is an interesting field. Here's how you can approach the problem.
Define your inputs. Right now you have last year's data; you'll probably want data from many years. You might also be able to include pundit rankings (a number of magazines rank players, for instance), which seems useful as well.
Take your inputs and feed them into some machine learning algorithm for each season. Wikipedia will help you out there.
Essentially, for each season you'll want to feed in your data, have your AI pick a team, and then rate the performance of the team based on the seasons results.
Keep doing this and maybe your bot will get better at picking teams, and then you can apply it to this year's data.
(If you only have last year's data, it's okay to train the algorithm with just that, but your AI will probably be overtrained on that one set and won't be as accurate.)
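A very rough sketch of that loop, assuming you have per-player rows of past-season stats plus the points each player actually scored the following season; the file name and column names here are hypothetical.

```python
# Hypothetical setup: "player_seasons.csv" has one row per player per season with
# stat columns and the points scored in the following season.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

history = pd.read_csv("player_seasons.csv")
feature_cols = ["minutes", "goals", "assists", "cost"]   # hypothetical stat columns

# Train on older seasons, evaluate the projections on the most recent one.
train = history[history.season < history.season.max()]
test = history[history.season == history.season.max()]

model = GradientBoostingRegressor().fit(train[feature_cols], train["next_season_points"])
test = test.assign(projected=model.predict(test[feature_cols]))

# Feed the projections into a team-selection step (e.g. the knapsack sketch above)
# and compare the picked team's projected points with what actually happened.
print(test.sort_values("projected", ascending=False).head(14)[["player", "projected"]])
```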
This was just a sketch of how it might look. For a romp into AI, this problem is probably pretty hard so don't feel disheartened if it seems overwhelming at first.

Measuring the performance of classification algorithm

I've got a classification problem at hand, which I'd like to address with a machine learning algorithm (Bayesian or Markovian, probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the problem of overfitting into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and use these very same samples to measure fitness, the process might get stuck in overfitting: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples; I'd like to learn about methods for selecting a statistically significant set of samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold Cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of classification algorithms.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
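A minimal sketch of 10-fold cross-validation with scikit-learn; the classifier and dataset here are just placeholders.

```python
# 10-fold cross-validation: train on 9 folds, test on the held-out fold, repeat.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(GaussianNB(), X, y, cv=10)   # 10 train/test splits
print(scores.mean(), scores.std())                    # average performance and spread
```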
As Mr. Brownstone said, 10-fold Cross-Validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and for this I used Weka, which has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.
