Outlier test in discrete survival analysis - logistic-regression

I am trying to figure out how to do a proper test for outliers in a discrete survival analysis (I use a logistic regression). I can find several suggestions for continuous survival analysis but nothing for the discrete case.
Suggestions for methods, literature, or the like are highly appreciated!

I think you can use dfbetas to do it. I suggest that you read Therneau's book.
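If your discrete-time model is fit as a logistic regression on person-period data, here is a minimal sketch of the dfbetas idea using statsmodels (the file name, column names, and cutoff are my own assumptions, not from any particular reference):

```python
# Sketch: dfbetas for a discrete-time survival model fit as a logistic
# regression on person-period data. File and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("person_period.csv")       # one row per subject-interval
X = sm.add_constant(df[["period", "age", "treatment"]])  # hypothetical covariates
y = df["event"]                             # 1 if the event occurred in this interval

fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
dfb = fit.get_influence().dfbetas           # one row per observation, one column per coefficient

# A common rule of thumb flags observations with |dfbeta| > 2/sqrt(n).
cutoff = 2 / np.sqrt(len(df))
suspects = np.where(np.abs(dfb).max(axis=1) > cutoff)[0]
print(df.iloc[suspects])
```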

Related

Threshold values for Viola-Jones object detection

I am trying to perform the AdaBoost training described by Viola and Jones in their paper on rapid object detection. However, I do not understand how to get the threshold values that will separate the faces from non-faces for each of the 160k features. Is this a threshold you set manually, or is it based on some kind of maths?
Can someone please explain the maths to me? Thanks a lot.
IMO, the best way to describe what happens during threshold assignment of the weak classifiers in every boosting round is a ROC analysis of the weak classifiers' performance. A great introduction to ROC analysis was written by Tom Fawcett. The full algorithm that does what you want is described in Schapire and Freund's book, section 3.4.2.
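To make that concrete, here is a rough numpy sketch of how a decision-stump threshold can be chosen by minimizing weighted error over the sorted feature values (the function and variable names are mine, not from the paper):

```python
import numpy as np

def best_stump_threshold(feature_values, labels, weights):
    """Pick the threshold minimizing weighted classification error for a
    single feature (a one-dimensional decision stump).
    labels are +1 (face) / -1 (non-face); weights sum to 1."""
    order = np.argsort(feature_values)
    f, y, w = feature_values[order], labels[order], weights[order]

    total_pos = w[y == 1].sum()
    total_neg = w[y == -1].sum()

    # Cumulative weight of positives/negatives at or below each candidate split.
    pos_below = np.cumsum(w * (y == 1))
    neg_below = np.cumsum(w * (y == -1))

    # Error if we predict "face" below the threshold, or "face" above it.
    err_face_below = neg_below + (total_pos - pos_below)
    err_face_above = pos_below + (total_neg - neg_below)

    err = np.minimum(err_face_below, err_face_above)
    i = int(np.argmin(err))
    polarity = 1 if err_face_below[i] <= err_face_above[i] else -1
    return f[i], polarity, err[i]
```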

Classification of review from customers into Good,Bad and Neutral

I have a typical AI problem to solve: customers will submit comments about a product, and I have to create a program that classifies each comment as good, bad, or neutral.
Surely a neural network could play a great role in it.
I also think fuzzy logic could play some role, for example expressing how far a comment is good, bad, or neutral.
Any more ideas about how to solve it?
This problem is usually referred to as Sentiment Analysis. You can check out the wikipedia entry about Sentiment Analysis for a brief review, or Liu Bing's page on sentiment analysis for more detailed resources and tutorials.
You can use some form of supervised learning.
The most important thing for classification is then choosing the right features. "Features" means you extract some values from the review that still capture its essence with respect to the classification task. Things that come to mind:
number of words
average number of words per sentence
number of words from some set like {crap, shit, damn, viagra, ...}
Then you can use any available machine learning algorithm (neural networks, SVMs) and train a classifier, given that you have enough reviews labeled good/neutral/bad.
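As a rough illustration of those features feeding a classifier (the reviews, labels, and word list below are placeholders):

```python
# Sketch: hand-crafted review features + a linear SVM, per the suggestions above.
import numpy as np
from sklearn.svm import LinearSVC

BAD_WORDS = {"crap", "shit", "damn", "viagra"}

def extract_features(review):
    words = review.lower().split()
    sentences = [s for s in review.split(".") if s.strip()]
    return [
        len(words),                                        # number of words
        len(words) / max(len(sentences), 1),               # avg words per sentence
        sum(w.strip(".,!?") in BAD_WORDS for w in words),  # bad-word count
    ]

reviews = ["Great product, works well.", "Total crap, broke in a day."]  # toy data
labels = ["good", "bad"]

X = np.array([extract_features(r) for r in reviews])
clf = LinearSVC().fit(X, labels)
print(clf.predict(X))
```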
Neural networks would certainly work for it; however, I would be suspicious about how they handle new words and new languages. I would go for a Bayes net approach to determine the probability of a comment being in a "good/neutral/bad" state. You should consider cleaning the data [stemming, etc.] before putting it through the Bayes net.
Additionally: the meta attributes [what ziggy mentioned] are more of a way to boost the performance of whichever approach you take.
EDIT: Bayes-Nets are a form of supervised learning.
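A minimal sketch of that idea, with NLTK's PorterStemmer standing in for the cleaning step and a naive Bayes classifier standing in for the Bayes net (the reviews and labels are toy data):

```python
# Sketch: stem the text, then train a naive Bayes classifier on word counts.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stem_text(text):
    return " ".join(stemmer.stem(w) for w in text.lower().split())

reviews = ["Loved it, works perfectly", "Awful, totally broken", "It is okay I guess"]
labels = ["good", "bad", "neutral"]          # toy data

model = make_pipeline(CountVectorizer(preprocessor=stem_text), MultinomialNB())
model.fit(reviews, labels)
print(model.predict(["works great, loved it"]))
```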

Finding correlated or comoving stocks

I have a table of daily closing stock prices and commodity prices such as Gold, Oil, etc. I want to find what stocks move closely with another stock or a commodity.
Where do I start to do this type of analysis? I know Java, SQL, Python, Perl, and a little bit of R.
Willing to buy and learn new tools like Matlab if necessary.
Any guidance will be highly appreciated.
This is not a homework question.
Thanks.
The technique you are looking for is called cointegration. Language is not important at all when computing cointegration of two time series so use whatever you are comfortable with.
I disagree with other responses that computation is not a problem. Computing potentially billions of cointegration coefficients between different time series is a huge problem, so using a highly optimized library is critical. However, this article on cointegration testing in R should get you started.
Also checkout quant.stackexchange.com for more info on quantitative finance.
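If you end up working in Python rather than R, here is a minimal sketch using statsmodels' Engle-Granger cointegration test (the file and ticker columns are hypothetical):

```python
# Sketch: Engle-Granger cointegration test between two price series.
import pandas as pd
from statsmodels.tsa.stattools import coint

prices = pd.read_csv("daily_closes.csv", index_col="date")  # assumed file/columns
t_stat, p_value, crit = coint(prices["STOCK_A"], prices["GOLD"])

# A small p-value is evidence the two series are cointegrated: they move
# together in the long run, with a stationary spread.
print(f"t={t_stat:.2f}, p={p_value:.3f}")
```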
Try this:
http://www.sectorspdr.com/correlation/
http://www.etfscreen.com/corr.php
http://correlate.googlelabs.com/faq
https://quant.stackexchange.com/questions/1027/correlation-and-cointegration-similarities-differences-relationships/1038#1038
Where do I start to do this type of analysis
If I were you, I'd start by searching Google Scholar for the word "comovement". Not everything that turns up is directly relevant, but there's quite a lot of stuff that is relevant.
By looking through the papers and googling some more, you should get a clearer picture of what types of statistical methods to learn.
I agree with Ben Bolker that computational tools are not the main issue at this point.
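As a concrete first step, here is a simple pandas sketch that ranks instruments by the correlation of their daily returns with a target series (the file and column names are hypothetical):

```python
# Sketch: rank instruments by correlation of daily returns with a target.
import pandas as pd

prices = pd.read_csv("daily_closes.csv", index_col="date", parse_dates=True)
returns = prices.pct_change().dropna()

target = "GOLD"                                    # hypothetical column
corr = returns.corrwith(returns[target]).drop(target)
print(corr.sort_values(ascending=False).head(10))  # most comoving instruments
```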

Machine learning, best technique

I am new to machine learning. I am familiar with SVMs, neural networks, and GAs. I'd like to know the best technique to learn for classifying pictures and audio. An SVM does a decent job but takes a lot of time. Does anyone know a faster and better one? Also, I'd like to know the fastest library for SVMs.
Your question is a good one, and has to do with the state of the art of classification algorithms. As you say, the choice of classifier depends on your data. In the case of images, I can tell you that there is a method called AdaBoost (read this and this to know more about it). On the other hand, lots of people are doing research in this area; for example, in Gender Classification of Faces Using Adaboost [Rodrigo Verschae, Javier Ruiz-del-Solar and Mauricio Correa] they say:
"Adaboost-mLBP outperforms all other Adaboost-based methods, as well as baseline methods (SVM, PCA and PCA+SVM)"
Take a look at it.
If your main concern is speed, you should probably take a look at VW and generally at stochastic gradient descent based algorithms for training SVMs.
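For example, scikit-learn's SGDClassifier with hinge loss trains a linear SVM by stochastic gradient descent (VW itself is a command-line tool; this is just the same idea in Python):

```python
# Sketch: a linear SVM trained by stochastic gradient descent,
# typically much faster than a kernel SVM on large data.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100_000, n_features=200, random_state=0)
clf = SGDClassifier(loss="hinge", alpha=1e-4)   # hinge loss == linear SVM
clf.fit(X, y)
print(clf.score(X, y))
```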
According to the Stanford ML class:
If the number of features is large in comparison to the number of training examples, go for logistic regression or SVM without a kernel.
If the number of features is small and the number of training examples is intermediate, use SVM with a Gaussian kernel.
If the number of features is small and the number of training examples is large, use logistic regression or SVM without a kernel.
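In scikit-learn terms, those three cases map roughly to the following model choices (the sample and feature counts are placeholders for your own data):

```python
# Sketch: the three rules above, expressed as scikit-learn model choices.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

n_samples, n_features = 1_000, 50_000   # placeholders; measure your own data

if n_features >= n_samples:
    model = LogisticRegression()        # or LinearSVC(): linear decision boundary
elif n_samples <= 50_000:
    model = SVC(kernel="rbf")           # Gaussian (RBF) kernel is affordable here
else:
    model = LinearSVC()                 # many samples: back to a linear model
```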
For such tasks you may need to extract features first. Only after that is classification feasible.
I think feature extraction and selection is important.
For image classification, there are a lot of possible features: raw pixels, SIFT features, color, texture, etc. It would be best to choose those suitable for your task.
I'm not familiar with audio classification, but there may be useful spectrum features, like the Fourier transform of the signal, or MFCCs.
The method used to classify is also important. Besides the methods in the question, KNN is a reasonable choice, too.
In practice, which features and methods to use is closely tied to the task.
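For the audio side, here is a minimal sketch of the MFCC-plus-KNN idea, using librosa for feature extraction (the clip names and labels are hypothetical):

```python
# Sketch: MFCC features per clip, averaged over time, fed to a KNN classifier.
import librosa
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(path):
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)            # one 13-dim vector per clip

files = ["clip1.wav", "clip2.wav"]      # hypothetical labeled clips
labels = ["speech", "music"]

X = np.array([mfcc_features(f) for f in files])
clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
```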
The method mostly depends on the problem at hand. There is no method that is always the fastest for every problem. Having said that, you should also keep in mind that once you choose an algorithm for speed, you will start compromising on accuracy.
For example, since you're trying to classify images, there might be a lot of features compared to the number of training samples at hand. In such cases, if you go for SVM with kernels, you could end up overfitting, with the variance being too high.
So you would want to choose a method that has high bias and low variance. Using logistic regression or a linear SVM are some ways to do it.
You could also use different types of regularization, or techniques such as SVD, to remove the features that do not contribute much to your output prediction and keep only the most important ones. In other words, choose features that have little or no correlation between them. Once you do this, you will be able to speed up your SVM algorithms without sacrificing accuracy.
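A minimal sketch of that dimensionality-reduction idea, with TruncatedSVD as the SVD step in front of a linear SVM (all the sizes are placeholders):

```python
# Sketch: reduce many raw features with SVD, then fit a fast linear SVM.
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2_000, n_features=5_000, random_state=0)

model = make_pipeline(
    TruncatedSVD(n_components=100),   # keep the 100 strongest components
    LinearSVC(C=1.0),                 # regularized linear SVM on the reduced space
)
model.fit(X, y)
```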
Hope it helps.
There are some good techniques in machine learning, such as boosting and AdaBoost.
Boosting iteratively reweights the data, which is then classified by a base classifier in each iteration, gradually building up a classification model. The weight of each data point changes in each iteration according to how difficult that point is to classify.
AdaBoost in particular is an ensemble technique that uses an exponential loss function to improve the accuracy of the predictions made.
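A minimal scikit-learn sketch of that reweighting loop (the default base classifier here is a depth-1 decision tree, i.e. a stump):

```python
# Sketch: AdaBoost; each round reweights the data so the next base
# classifier focuses on previously misclassified points.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
clf = AdaBoostClassifier(n_estimators=100)   # default base estimator is a stump
clf.fit(X, y)
print(clf.score(X, y))
```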
I think your question is very open ended, and the "best classifier for images" will largely depend on the type of image you want to classify. But in general, I suggest you study convolutional neural networks (CNNs) and transfer learning; currently these are the state-of-the-art techniques for this problem.
Check out pre-trained CNN models from PyTorch or TensorFlow.
For images, I also suggest studying pre-processing techniques; they are very important for highlighting features of the image and improving the generalization of the classifier.
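A minimal transfer-learning sketch with torchvision, using a pre-trained ResNet-18 as the backbone (the 3-class head is a placeholder; this assumes a reasonably recent torchvision with the weights API):

```python
# Sketch: transfer learning — freeze a pre-trained CNN, retrain only the head.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # ImageNet weights
for param in model.parameters():
    param.requires_grad = False           # freeze the convolutional backbone

model.fc = nn.Linear(model.fc.in_features, 3)  # new head for 3 classes (placeholder)
# Train only model.fc on your labeled images with a standard training loop.
```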

Measuring the performance of classification algorithm

I've got a classification problem on my hands, which I'd like to address with a machine learning algorithm (Bayes, or Markovian probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier, taking the overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and then use those very same samples to measure fitness, I might run into an overfitting problem: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be separating the hand-tagged samples into training and test samples; and I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of a classification algorithm.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
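In scikit-learn this is a single call; the naive Bayes classifier and iris data here are just stand-ins for your own classifier and samples:

```python
# Sketch: 10-fold cross-validation; mean accuracy over the 10 held-out folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(scores.mean(), scores.std())
```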
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers; for this I used Weka, which has an API and a load of tools that let you easily test the performance of lots of different classifiers.
