I have a typical AI problem to solve. Customers will submit comments about a product, and I have to create a program that classifies each comment as either good, bad, or neutral.
Surely a neural network could play a great role in it.
I also think fuzzy logic could play some role, for example in expressing how strongly a comment is good, bad, or neutral.
Any more ideas about how to solve it?
This problem is usually referred to as Sentiment Analysis. You can check out the wikipedia entry about Sentiment Analysis for a brief review, or Liu Bing's page on sentiment analysis for more detailed resources and tutorials.
You can use some form of supervised learning.
The most important thing for classification is then choosing the right features. "Features" here means values you extract from the review that still capture its essence with respect to the classification task. Things that come to mind are:
number of words
average number of words per sentence
number of words from some set like {crap, shit, damn, viagra, ...}
Then you can use any available machine learning algorithm (neural networks, SVM) and train a classifier, provided you have enough reviews labeled as good/neutral/bad.
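As a minimal sketch of that pipeline (the toy reviews, the swear-word set, and the choice of a linear SVM are all illustrative assumptions, not part of the answer above):

```python
# Sketch: hand-crafted review features (word count, sentence length, swear words)
# fed to a linear SVM for good/neutral/bad classification.
import re
from sklearn.svm import LinearSVC

SWEAR_WORDS = {"crap", "damn"}                     # stand-in for the word set above

def features(review):
    words = review.lower().split()
    sentences = [s for s in re.split(r"[.!?]+", review) if s.strip()]
    return [
        len(words),                                # number of words
        len(words) / max(len(sentences), 1),       # average words per sentence
        sum(w.strip(",.!?") in SWEAR_WORDS for w in words),  # swear-word count
    ]

reviews = ["Great product, works well.", "Total crap. Damn waste of money!", "It arrived on time."]
labels = ["good", "bad", "neutral"]                # toy labeled reviews

clf = LinearSVC().fit([features(r) for r in reviews], labels)
print(clf.predict([features("What a crap purchase, damn it.")]))
```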
Neural networks would certainly work for it; however, I would be wary of how they handle new words and new languages. I would go for a Bayes net approach for determining the probability of being in a "good/neutral/bad" state. You should consider cleaning the data [stemming, etc.] before putting it through the Bayes net.
Additionally: the meta attributes [what ziggy mentioned] are more of an indicator that can boost the performance of whichever approach you take.
EDIT: Bayes-Nets are a form of supervised learning.
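A hedged sketch of that idea, using NLTK's Porter stemmer for the cleaning step and scikit-learn's multinomial Naive Bayes as a simple stand-in for a full Bayes net (the toy reviews are made up):

```python
# Sketch: stem the reviews, then feed bag-of-words counts to Naive Bayes.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stem_text(text):
    return " ".join(stemmer.stem(w) for w in text.lower().split())

reviews = ["Loved it, works perfectly", "Broke after two days, awful", "It is a phone case"]
labels = ["good", "bad", "neutral"]                # toy labeled data

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit([stem_text(r) for r in reviews], labels)
print(model.predict([stem_text("Awful, it broke immediately")]))
```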
I am new to data mining concepts and am trying to learn the differences between supervised and unsupervised learning. So far what I know is that supervised learning means learning from labeled datasets, and unsupervised learning means clustering the data without any labels given.
I roughly understand what they are but can't really apply them to real-life questions. I found the following example question on one of the machine-learning web forums and was wondering if someone could help me with it, so I can use it as an example to understand the concepts a little better. The question is:
Given the following dataset on different cars, make up 2 questions based on supervised and unsupervised learning.
Any kind of help is appreciated.
Thanks :)
Supervised Learning:
So the above dataset has 11 attributes. From those 11 attributes we can see that the am column classifies a car as having a manual transmission if it is 0 and an automatic transmission if it is 1.
In supervised learning we are usually given training data with a response variable. So, if you treat this as data for supervised learning, you can train your model using an appropriate algorithm and then, for any test data, predict the corresponding am value (manual or automatic).
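As a minimal sketch (the column names and values below are made-up stand-ins, since the actual dataset isn't shown here), you could train a classifier to predict am with scikit-learn:

```python
# Minimal supervised-learning sketch: predict the `am` column from a few
# numeric car attributes. The tiny data frame is a made-up stand-in.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

cars = pd.DataFrame({                       # stand-in for the car dataset
    "mpg": [21.0, 22.8, 18.7, 32.4, 15.2, 30.4],
    "hp":  [110, 93, 175, 66, 180, 52],
    "wt":  [2.62, 2.32, 3.44, 2.20, 3.78, 1.61],
    "am":  [1, 1, 0, 1, 0, 1],              # response variable: transmission
})

X, y = cars[["mpg", "hp", "wt"]], cars["am"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```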
Unsupervised Learning:
Assume that the transmission column is not given, and try to group the cars into clusters using any unsupervised learning algorithm. See if you can come up with two clusters: one consisting of the manual-transmission cars and one consisting of the automatic-transmission cars.
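A matching unsupervised sketch, clustering the same made-up cars without looking at am and only afterwards comparing the clusters to the true transmission labels:

```python
# Minimal unsupervised-learning sketch: cluster cars without the `am` column,
# then check how well the two clusters line up with the transmission labels.
import pandas as pd
from sklearn.cluster import KMeans

cars = pd.DataFrame({                       # same made-up stand-in data
    "mpg": [21.0, 22.8, 18.7, 32.4, 15.2, 30.4],
    "hp":  [110, 93, 175, 66, 180, 52],
    "wt":  [2.62, 2.32, 3.44, 2.20, 3.78, 1.61],
    "am":  [1, 1, 0, 1, 0, 1],
})

X = cars[["mpg", "hp", "wt"]]                              # no response variable used
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(pd.crosstab(clusters, cars["am"]))                   # clusters vs. transmission
```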
You may check the links below; they are two short videos from Andrew Ng's lectures and will help you get a better understanding.
https://www.youtube.com/watch?v=ls7Ke48jCt8&index=3&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW
https://www.youtube.com/watch?v=qHfUlFHGG08&list=PLZ9qNFMHZ-A4rycgrgOYma6zxF4BZGGPW&index=4
My professor asked my class to make a neural network to try to predict whether a breast cancer is benign or malignant. To do this I'm using the Breast Cancer Wisconsin (Diagnostic) Data Set.
As a tip, my professor said that not all 30 attributes need to be used as inputs (there are 32, but the first 2 are the ID and Diagnosis). What I want to ask is: how am I supposed to take those 30 inputs (which would create 100+ weights depending on how many neurons I use) and reduce them to a smaller number?
I've already found how to "prune" a neural net, but I don't think that's what I want. I'm not trying to eliminate unnecessary neurons, but to shrink the input itself.
PS: Sorry for any English errors; it's not my native language.
That is an active research question. It is called feature selection, and there are already several techniques for it. One is Principal Component Analysis (PCA), which reduces the dimensionality of your dataset by keeping the directions that retain the most variance. Another thing you can do is check for highly correlated variables: if two inputs are highly correlated, they may carry almost the same information, so one of them can be removed without hurting the performance of your classifier much. A third technique you could use is deep learning, which tries to learn the features that will later be fed to your classifier. More info about deep learning and PCA can be found here: http://deeplearning.stanford.edu/wiki/index.php/Main_Page
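For example, a quick sketch of both ideas on scikit-learn's bundled copy of the Wisconsin dataset (the 95% variance and 0.95 correlation cut-offs are arbitrary choices, not recommendations):

```python
# Sketch: shrink the 30 Wisconsin inputs with PCA and spot highly
# correlated columns. Uses scikit-learn's bundled copy of the dataset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)     # scale before PCA

pca = PCA(n_components=0.95)                      # keep 95% of the variance
X_reduced = pca.fit_transform(X)
print("reduced from", X.shape[1], "to", X_reduced.shape[1], "inputs")

# Highly correlated pairs are candidates for dropping one of the two.
corr = np.corrcoef(X, rowvar=False)
pairs = [(i, j) for i in range(corr.shape[0])
         for j in range(i + 1, corr.shape[1]) if abs(corr[i, j]) > 0.95]
print("highly correlated feature pairs:", pairs)
```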
This problem is called feature selection. It is mostly the same for neural networks as for other classifiers. You could prune your dataset while retaining the most variance using PCA. To go further, you could use a greedy approach and evaluate your features one by one by training and testing your network with each feature excluded in turn.
There is a technique for feature selection that uses just neural networks:
Split your dataset into three groups:
Training data used for supervised training
Validation data used to verify that the neural network is able to generalize
Accuracy testing used to test which of the features are required
The steps:
Train a network on your training and validation set, just like you would normally do.
Test the accuracy of the network with the third dataset.
Locate the variable that yields the smallest drop in the accuracy test above when dropped ("dropped" meaning always feeding a zero as that input signal).
Retrain your network with the new selection of features
Keep doing this until either the network can no longer be trained or there is just one variable left.
Here is a paper on the technique
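A rough sketch of this loop, using scikit-learn's MLPClassifier as the network and the Wisconsin data from the question; the single train/test split and the 0.01 stopping threshold are my simplifications of the three-way split described above:

```python
# Sketch of feature selection by zeroing inputs: repeatedly drop the
# feature whose removal hurts accuracy the least, then retrain.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

active = list(range(X.shape[1]))                  # indices of features still in use
while len(active) > 1:
    net = MLPClassifier(max_iter=2000, random_state=0).fit(X_train[:, active], y_train)
    base = accuracy_score(y_test, net.predict(X_test[:, active]))

    drops = []
    for pos in range(len(active)):                # "drop" = feed zero for one input
        X_zero = X_test[:, active].copy()
        X_zero[:, pos] = 0.0
        drops.append(base - accuracy_score(y_test, net.predict(X_zero)))

    weakest = int(np.argmin(drops))               # smallest drop => least useful
    if drops[weakest] > 0.01:                     # assumed stopping threshold
        break
    del active[weakest]                           # remove it and retrain next loop

print("kept features:", active)
```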
I tried Google and found little that I could understand.
I understand Markov chains at a very basic level: it's a mathematical model whose state changes depend only on the previous input, so it's sort of an FSM with weighted random transitions instead of fixed criteria?
I've heard that you can use them to generate semi-intelligent nonsense, given sentences of existing words to use as a dictionary of kinds.
I can't think of search terms to find this, so can anyone link me to something or explain how I could produce something that gives a semi-intelligent answer? (If you asked it about pie, it would not start going on about the Vietnam War it had heard about.)
I plan on:
Having this bot idle in IRC channels for a bit
Strip any usernames out of the string and store as sentences or whatever
Over time, use this as the basis for the above.
Yes, a Markov chain is a finite-state machine with probabilistic state transitions. To generate random text with a simple, first-order Markov chain:
Collect bigram (adjacent word pair) statistics from a corpus (collection of text).
Make a Markov chain with one state per word. Reserve a special state for end-of-text.
The probability of jumping from state/word x to y is the probability of word y immediately following x, estimated from relative bigram frequencies in the training corpus.
Start with a random word x (perhaps determined by how often that word occurs as the first word of a sentence in the corpus). Then pick a state/word y to jump to randomly, taking into account the probability of y following x (the state transition probability). Repeat until you hit end-of-text.
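A minimal sketch of this procedure in Python (first-order, whitespace tokenization, with a toy corpus standing in for your IRC logs):

```python
# First-order Markov chain text generator built from bigram counts.
import random
from collections import defaultdict

END = "<END>"                                      # special end-of-text state

def build_chain(sentences):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    starts = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        starts.append(words[0])
        for x, y in zip(words, words[1:] + [END]):
            chain[x].append(y)                     # duplicates encode frequency
    return chain, starts

def generate(chain, starts, max_words=30):
    word = random.choice(starts)                   # start word weighted by frequency
    out = [word]
    while len(out) < max_words:
        word = random.choice(chain[word])          # transition ~ bigram frequency
        if word == END:
            break
        out.append(word)
    return " ".join(out)

corpus = ["the pie was great", "the war was long", "pie is great"]  # toy corpus
chain, starts = build_chain(corpus)
print(generate(chain, starts))
```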
If you want to get something semi-intelligent out of this, then your best shot is to train it on lots of carefully collected texts. The "lots" part makes it produce proper sentences (or plausible IRC speak) with high probability; the "carefully collected" part means you control what it talks about. Introducing higher-order Markov chains also helps in both areas, but takes more storage to store the necessary statistics. You may also look into things like statistical smoothing.
However, having your IRC bot actually respond to what is said to it takes a lot more than Markov chains. It may be done by doing text categorization (aka topic spotting) on what is said, then picking a domain-specific Markov chain for text generation. Naïve Bayes is a popular model for topic spotting.
Kernighan and Pike in The Practice of Programming explore various implementation strategies for Markov chain algorithms. These, and natural language generation in general, are covered in great depth by Jurafsky and Martin, Speech and Language Processing.
You may want to look for Ian Barber's article on text generation (phpir.com). Unfortunately the site is down or offline; I have a copy of his text and can send it to you.
It seems to me you are trying multiple things at the same time:
extracting words/sentences by idling in IRC
building a knowledge base
listening to some chat, parsing keywords
generate some sentence regarding keywords
Those are basically very different tasks. Markov models are often used for machine learning. I don't see much learning in your tasks though.
larsmans' answer shows how you generate sentences from word-based Markov models. You can also train the weights to favor the word pairs that other IRC users used. Nonetheless, this will not generate keyword-related sentences, because building/refining a Markov model is not the same as "driving" it.
You might try hidden Markov models (HMMs), where the visible output is the keywords and the hidden states are built from those word pairs. You could then dynamically favor sentences more appropriate to specific keywords.
Questions
I want to classify/categorize/cluster/group together a set of several thousand websites. There's data that we can train on, so we can do supervised learning, but it's not data that we've gathered and we're not adamant about using it -- so we're also considering unsupervised learning.
What features can I use in a machine learning algorithm to deal with multilingual data? Note that some of these languages might not have been dealt with in the Natural Language Processing field.
If I were to use an unsupervised learning algorithm, should I just partition the data by language and deal with each language differently? Different languages might have different relevant categories (or not, depending on your psycholinguistic theoretical tendencies), which might affect the decision to partition.
I was thinking of using decision trees, or maybe Support Vector Machines (SVMs) to allow for more features (from my understanding of them). This post suggests random forests instead of SVMs. Any thoughts?
Pragmatic approaches are welcome! (Theoretical ones, too, but those might be saved for later fun.)
Some context
We are trying to classify a corpus of many thousands of websites in 3 to 5 languages (maybe up to 10, but we're not sure).
We have training data in the form of hundreds of websites already classified. However, we may choose to use that data set or not -- if other categories make more sense, we're open to not using the training data that we have, since it is not something we gathered in the first place. We are in the final stages of scraping data/text from websites.
Now we must decide on the issues above. I have done some work with the Brown Corpus and the Brill tagger, but this will not work because of the multiple-languages issue.
We intend to use the Orange machine learning package.
According to the context you have provided, this is a supervised learning problem.
Therefore, you are doing classification, not clustering. If I misunderstood, please update your question to say so.
I would start with the simplest features: tokenize the Unicode text of the pages, use a dictionary to translate every new token to a number, and simply treat the presence of a token as a feature.
Next, I would use the simplest algorithm I can - I tend to go with Naive Bayes, but if you have an easy way to run SVM this is also nice.
Compare your results with some baseline - say assigning the most frequent class to all the pages.
Is the simplest approach good enough? If not, start iterating over algorithms and features.
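A minimal sketch of that pipeline with scikit-learn (the toy pages and labels are placeholders for your scraped data; Bernoulli Naive Bayes is used because the features are token presence/absence):

```python
# Baseline comparison: binary bag-of-words + Naive Bayes vs. "most frequent class".
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

pages = ["cheap flights and hotels", "buy shoes online", "latest football scores",
         "hotel booking deals", "sports news today", "discount sneakers shop"]
labels = ["travel", "shopping", "sports", "travel", "sports", "shopping"]  # toy data

X_train, X_test, y_train, y_test = train_test_split(pages, labels, test_size=0.3, random_state=0)

vec = CountVectorizer(binary=True)                # token presence/absence features
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

nb = BernoulliNB().fit(Xtr, y_train)
base = DummyClassifier(strategy="most_frequent").fit(Xtr, y_train)

print("Naive Bayes:", accuracy_score(y_test, nb.predict(Xte)))
print("most-frequent baseline:", accuracy_score(y_test, base.predict(Xte)))
```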
If you go the supervised route, then the fact that the web pages are in multiple languages shouldn't make a difference. If you go with, say lexical features (bag-o'-words style) then each language will end up yielding disjoint sets of features, but that's okay. All of the standard algorithms will likely give comparable results, so just pick one and go with it. I agree with Yuval that Naive Bayes is a good place to start, and only if that doesn't meet your needs that try something like SVMs or random forests.
If you go the unsupervised route, though, the fact that the texts aren't all in the same language might be a big problem. Any reasonable clustering algorithm will first group the texts by language, and then within each language cluster by something like topic (if you're using content words as features). Whether that's a bug or a feature will depend entirely on why you want to classify these texts. If the point is to group documents by topic, irrespective of language, then it's no good. But if you're okay with having different categories for each language, then yeah, you've just got as many separate classification problems as you have languages.
If you do want a unified set of classes, then you'll need some way to link similar documents across languages. Are there any documents in more than one language? If so, you could use them as a kind of statistical Rosetta Stone, to link words in different languages. Then, using something like Latent Semantic Analysis, you could extend that to second-order relations: words in different languages that don't ever occur in the same document, but which tend to co-occur with words which do. Or maybe you could use something like anchor text or properties of the URLs to assign a rough classification to documents in a language-independent manner and use that as a way to get started.
But, honestly, it seems strange to go into a classification problem without a clear idea of what the classes are (or at least what would count as a good classification). Coming up with the classes is the hard part, and it's the part that'll determine whether the project is a success or failure. The actual algorithmic part is fairly rote.
Main answer is: try different approaches. Without actual testing it's very hard to predict what method will give best results. So, I'll just suggest some methods that I would try first and describe their pros and cons.
First of all, I would recommend supervised learning. Even if the existing classification of the data is not very accurate, it may still give better results than unsupervised clustering. One reason is the number of random factors involved in clustering. For example, the k-means algorithm relies on randomly selected starting points, which can lead to very different results across runs (though the x-means modification seems to normalize this behavior). Clustering gives good results only if the underlying elements form well-separated regions in the feature space.
One approach to handling multilingual data is to use multilingual resources as support points. For example, you can index some Wikipedia articles and create "bridges" between the same topics in different languages. Alternatively, you can build a multilingual association dictionary like the one this paper describes.
As for methods, the first thing that comes to mind is instance-based semantic methods like LSI. These use the vector space model to calculate distances between words and/or documents. In contrast to other methods, LSI can efficiently handle synonymy and polysemy. Its disadvantages are computational inefficiency and a lack of implementations. One of the phases of LSI uses a very big co-occurrence matrix, which for a large corpus of documents will require distributed computing and other special treatment. There is a modification of LSA called Random Indexing that does not construct the full co-occurrence matrix, but you'll hardly find a suitable implementation for it. Some time ago I created a library in Clojure for this method, but it is pre-alpha now, so I can't recommend using it. Nevertheless, if you decide to give it a try, you can find the project 'Clinch' of a user 'faithlessfriend' on GitHub (I'll not post a direct link to avoid unnecessary advertisement).
Beyond special semantic methods, the rule "simplicity first" applies. From this point of view, Naive Bayes is the right place to start. The only note here is that the multinomial version of Naive Bayes is preferable: in my experience, the count of words really does matter.
SVM is a technique for classifying linearly separable data, and text data is almost never linearly separable (at least several common words appear in any pair of documents). That doesn't mean SVM cannot be used for text classification - you should still try it, but the results may be much worse than for other machine learning tasks.
I don't have enough experience with decision trees, but using them for efficient text classification seems strange to me. I have seen some examples where they gave excellent results, but when I tried the C4.5 algorithm for this task, the results were terrible. I believe you should get some software where decision trees are implemented and test them yourself. It is always better to know than to guess.
There's much more to say on every topic, so feel free to ask more questions on specific topic.
I am new to machine learning. I am familiar with SVMs, neural networks, and GAs. I'd like to know the best technique for classifying pictures and audio. SVM does a decent job but takes a lot of time. Does anyone know of a faster and better one? Also, I'd like to know the fastest library for SVM.
Your question is a good one and has to do with the state of the art of classification algorithms. As you say, the choice of classifier depends on your data. In the case of images, I can tell you that there is a method called AdaBoost; read this and this to learn more about it. On the other hand, you can find a lot of people doing research on this; for example, in Gender Classification of Faces Using Adaboost [Rodrigo Verschae, Javier Ruiz-del-Solar and Mauricio Correa] they say:
"Adaboost-mLBP outperforms all other Adaboost-based methods, as well as baseline methods (SVM, PCA and PCA+SVM)"
Take a look at it.
If your main concern is speed, you should probably take a look at VW and generally at stochastic gradient descent based algorithms for training SVMs.
If the number of features is large in comparison to the number of training examples,
then you should go for logistic regression or SVM without a kernel.
If the number of features is small and the number of training examples is intermediate,
then you should use SVM with a Gaussian kernel.
If the number of features is small and the number of training examples is large,
use logistic regression or SVM without a kernel.
That's according to the Stanford ML class.
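A toy sketch that encodes those rules of thumb with scikit-learn estimators; the numeric thresholds are my own placeholders, not part of the class:

```python
# Rough encoding of the Stanford ML-class heuristics for picking a model.
# The numeric thresholds below are illustrative placeholders only.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

def pick_classifier(n_features, n_examples):
    if n_features >= n_examples:                    # many features, few examples
        return LogisticRegression(max_iter=1000)    # or a linear SVM
    if n_examples <= 10_000:                        # small/intermediate training set
        return SVC(kernel="rbf")                    # SVM with Gaussian kernel
    return LinearSVC()                              # large training set: linear model

print(pick_classifier(n_features=5000, n_examples=2000))   # -> LogisticRegression
print(pick_classifier(n_features=50, n_examples=3000))     # -> SVC (RBF kernel)
```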
For such a task you may need to extract features first. Only after that is classification feasible.
I think feature extraction and selection is important.
For image classification, there are a lot of possible features, such as raw pixels, SIFT features, color, texture, etc. It would be better to choose ones suitable for your task.
I'm not familiar with audio classification, but there may be spectrum features, like the Fourier transform of the signal, or MFCCs.
The method used to classify is also important. Besides the methods in the question, KNN is a reasonable choice, too.
Actually, which features and which method to use is closely related to the task.
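As an illustration of the audio side, MFCC features could be fed to a KNN classifier roughly like this (the librosa package and the labeled .wav files are assumptions on my part):

```python
# Sketch: MFCC features from audio clips fed to a KNN classifier.
# Assumes the `librosa` package and a list of (path, label) pairs.
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path):
    signal, sr = librosa.load(path, sr=None)               # load audio clip
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)                               # average over time frames

clips = [("dog1.wav", "dog"), ("cat1.wav", "cat")]         # placeholder file names
X = np.array([mfcc_features(path) for path, _ in clips])
y = [label for _, label in clips]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(knn.predict([mfcc_features("unknown.wav")]))         # placeholder query clip
```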
The method mostly depends on the problem at hand. No single method is always the fastest for every problem. Having said that, you should also keep in mind that once you choose an algorithm for speed, you will start compromising on accuracy.
For example, since you're trying to classify images, there may be a lot of features compared to the number of training samples at hand. In such cases, if you go for SVM with kernels, you could end up overfitting, with the variance being too high.
So you would want to choose a method with higher bias and lower variance. Using logistic regression or a linear SVM are some ways to do that.
You could also use different types of regularization, or techniques such as SVD, to remove the features that do not contribute much to your output prediction and keep only the most important ones. In other words, choose features that have little or no correlation between them. Once you do this, you would be able to speed up your SVM algorithms without sacrificing accuracy.
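A small sketch of that idea with scikit-learn: reduce the feature matrix with truncated SVD before fitting a linear SVM (the digits dataset and the component count are stand-ins):

```python
# Sketch: truncated SVD to shrink a wide feature matrix, then a linear SVM.
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)                 # small image dataset as a stand-in
model = make_pipeline(
    TruncatedSVD(n_components=20, random_state=0),  # keep 20 components (placeholder)
    LinearSVC(max_iter=5000),
)
print(cross_val_score(model, X, y, cv=5).mean())
```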
Hope it helps.
There are some good techniques in machine learning, such as boosting and AdaBoost.
One classification method is boosting. This method iteratively re-weights the data, which is then classified by a base classifier on each iteration; together the iterations build a classification model. Boosting assigns a weight to each data point on every iteration, and the weight changes according to how difficult that point is to classify.
AdaBoost is an ensemble technique that uses an exponential loss function to improve the accuracy of the predictions.
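For instance, scikit-learn ships an AdaBoost implementation whose default base classifier is a depth-1 decision tree; a minimal sketch on a stand-in dataset:

```python
# Sketch: AdaBoost with its default weak learner (depth-1 decision stumps).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)          # stand-in dataset
model = AdaBoostClassifier(n_estimators=100)         # 100 boosting rounds
print(cross_val_score(model, X, y, cv=5).mean())
```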
I think your question is very open-ended, and the "best classifier for images" will largely depend on the type of image you want to classify. But in general, I suggest you study convolutional neural networks (CNNs) and transfer learning; these are currently the state-of-the-art techniques for the problem.
Check out pre-trained CNN models from PyTorch or TensorFlow.
Related to images, I also suggest you study image pre-processing; pre-processing techniques are very important for highlighting features of the image and improving the generalization of the classifier.
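A minimal transfer-learning sketch with torchvision (the weight-loading call follows recent torchvision versions, and the two-class head and random batch are placeholders):

```python
# Sketch: transfer learning with a pre-trained ResNet-18 from torchvision.
import torch
import torch.nn as nn
from torchvision import models

# Load ImageNet-pretrained weights (API of recent torchvision versions).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the convolutional backbone; only the new head will be trained.
for param in model.parameters():
    param.requires_grad = False

num_classes = 2                                   # placeholder: two image classes
model.fc = nn.Linear(model.fc.in_features, num_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch (replace with a real DataLoader).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print("loss:", loss.item())
```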