MNIST dataset: almost the same accuracy with and without grid search

I am taking an AI class and working with the MNIST dataset. When I classify the data with SVM and logistic regression, with and without grid search, I get almost the same accuracy. Is that possible? If the results are right, I want to know why they are almost the same. And if I have to choose one, which is the correct option for me? Please explain. Thanks.

Which hyperparameters are you optimizing with the grid search? What range are you using? And what do you consider "almost the same" accuracy? Show us some code, please.
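For reference, here is a minimal sketch of such a comparison in scikit-learn, using the small digits dataset as a stand-in for MNIST; the parameter grid is an illustrative assumption. If the default hyperparameters already sit near the best grid point, the two accuracies will indeed be almost the same:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # small stand-in for MNIST
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: default hyperparameters, no grid search.
baseline = SVC().fit(X_train, y_train)

# Grid search over C and gamma (a small illustrative grid).
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.001, 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

print("default params:", baseline.score(X_test, y_test))
print("grid searched :", search.score(X_test, y_test))
```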

Related

Multiple exponential and logarithmic regression (Python)

I want to do some data analysis, so I searched for and chose a dataset about automotive data. The dataset includes 15 columns and 7,500 rows. I used a multiple linear regression model, but now I want to try other regression models, like exponential and logarithmic regression. But I don't know how to apply them across 15 columns.
Can you guide me on that? Does anyone have any ideas? First, I'll focus on exponential regression.
Maybe you can suggest a link, book, or article about nonlinear regression models (I searched but did not find exactly what I wanted).
Thank you for your interest.
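For what it's worth, a common way to fit an exponential model with multiple features is to regress the logarithm of the target linearly on the features. A minimal sketch under that assumption (the file and column names are hypothetical, the feature columns are assumed numeric, and the target must be strictly positive):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("automotive.csv")   # hypothetical file name
X = df.drop(columns=["price"])       # hypothetical target column name
y = df["price"]                      # must be strictly positive

# y = exp(b0 + b1*x1 + ... + bk*xk) becomes linear after log(y), so an
# ordinary multiple linear regression handles all feature columns at once.
model = LinearRegression().fit(X, np.log(y))

# Map predictions back through exp() to the original scale.
y_pred = np.exp(model.predict(X))
```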

How to find similarity for large number of features

I'm not sure if I am asking this question in the right place, as I'm new to Stack Overflow; please move it if required.
I'm trying to solve a link prediction problem on a Flickr dataset. My dataset has 5K nodes, and each node has around 27K features; the data is sparse.
I want to compute the similarity between nodes so that I can predict a link between two nodes if their similarity value is greater than some threshold that I decide. The problem is the number of features: I cannot load the file into Weka (which I wanted to use to reduce the features by information gain or similar, and then try clustering or a cosine similarity measure).
Another problem is how to define this as a classification problem. I wanted to find overlapping tags for pairs of nodes, so the table contains the node pairs and some of their features (which will number in the thousands), and all of them belong to the positive class only, since I know a link exists between them.
I want to create a test set with some of the nodes, build a similar table, and label the pairs as positive or negative class. But my problem is that all the data I have is positive, so I think the classifier would never be able to label anything as negative. How do I turn this into a classification problem correctly?
Any pointers or help is very much appreciated.
Weka can deal with 27K features; it shouldn't be a problem. However, I would approach this not as a generic classification problem but as a link-discovery one, which in this case can be seen as a matching problem.
My approach would be:
1. new node appears
2. search for the most similar elements
3. assume they are related (there is a link) if the similarity is greater than your threshold.
The main problem would be to tune the threshold based on some quality measure.
For this approach, Lucene would probably be the best option.
I hope this helps.
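As a concrete starting point for step 2 above, cosine similarity between sparse feature vectors can be computed directly with scikit-learn. A minimal sketch, assuming the node features are already stored as a SciPy sparse matrix (the file name and threshold are placeholders):

```python
import numpy as np
from scipy.sparse import load_npz
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input: a 5000 x 27000 sparse matrix, one row per node.
features = load_npz("flickr_features.npz")

# Pairwise cosine similarity between all nodes (a 5000 x 5000 result).
sim = cosine_similarity(features)

# Predict a link wherever similarity exceeds the chosen threshold;
# np.triu keeps each unordered node pair only once.
threshold = 0.8  # placeholder; tune against a quality measure
predicted_links = np.argwhere(np.triu(sim, k=1) > threshold)
```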

Machine learning, best technique

I am new to machine learning. I am familiar with SVMs, neural networks, and GAs. I'd like to know the best technique for classifying pictures and audio. SVM does a decent job but takes a lot of time. Does anyone know a faster and better one? I'd also like to know the fastest library for SVMs.
Your question is a good one, and it has to do with the state of the art of classification algorithms. As you say, the choice of classifier depends on your data. In the case of images, I can tell you that there is a method called AdaBoost; read this and this to learn more about it. On the other hand, you can find lots of people doing research in this area; for example, in Gender Classification of Faces Using Adaboost [Rodrigo Verschae, Javier Ruiz-del-Solar and Mauricio Correa] the authors say:
"Adaboost-mLBP outperforms all other Adaboost-based methods, as well as baseline methods (SVM, PCA and PCA+SVM)"
Take a look at it.
If your main concern is speed, you should probably take a look at VW (Vowpal Wabbit) and, more generally, at stochastic gradient descent based algorithms for training SVMs.
According to the Stanford ML class:
1. If the number of features is large compared to the number of training examples, use logistic regression or an SVM without a kernel.
2. If the number of features is small and the number of training examples is intermediate, use an SVM with a Gaussian kernel.
3. If the number of features is small and the number of training examples is large, use logistic regression or an SVM without a kernel.
A rough sketch of this decision rule in code follows this list.
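Here is an illustrative sketch of those rules of thumb in scikit-learn; the numeric cutoffs are assumptions of mine, since the course gives no exact numbers:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

def choose_classifier(n_features, n_examples):
    """Heuristic from the Stanford ML class; the cutoffs are assumed."""
    if n_features >= n_examples:
        # Many features, few examples: a linear model is less likely to overfit.
        return LogisticRegression(max_iter=1000)
    if n_examples <= 10000:
        # Intermediate training set: an RBF (Gaussian) kernel SVM is affordable.
        return SVC(kernel="rbf")
    # Large training set: kernel SVMs get slow; fall back to a linear model.
    return LinearSVC()
```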
For such a task you may need to extract features first; only after that is classification feasible.
I think feature extraction and selection are important.
For image classification there are a lot of features to choose from, such as raw pixels, SIFT features, color, and texture. It is best to choose ones suitable for your task.
I'm not familiar with audio classification, but there may be some spectral features, like the Fourier transform of the signal or MFCCs.
The method used to classify is also important. Besides the methods in the question, KNN is a reasonable choice, too.
Actually, which features and which method to use is closely tied to the task.
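As a small illustration of the audio side, MFCC features can be extracted with the librosa library. A minimal sketch, assuming a mono audio file at a hypothetical path:

```python
import librosa

# Load the waveform (librosa resamples to 22,050 Hz by default).
y, sr = librosa.load("clip.wav")  # hypothetical file path

# 13 MFCCs per frame is a common choice for classification features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Average over time to get one fixed-length feature vector per clip.
features = mfcc.mean(axis=1)
```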
The method mostly depends on the problem at hand. There is no method that is always the fastest for every problem. Having said that, you should also keep in mind that once you choose an algorithm for speed, you will start compromising on accuracy.
For example, since you're trying to classify images, there might be a lot of features compared to the number of training samples at hand. In such cases, if you go for an SVM with kernels, you could end up overfitting, with the variance being too high.
So you would want to choose a method with high bias and low variance. Using logistic regression or a linear SVM are some ways to do it.
You could also use different types of regularization, or techniques such as SVD, to remove the features that do not contribute much to your output prediction and keep only the most important ones. In other words, choose features that have little or no correlation with one another. Once you do this, you will be able to speed up your SVM algorithms without sacrificing accuracy.
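A minimal sketch of that idea in scikit-learn, reducing dimensionality with truncated SVD before a fast linear SVM (the component count is an assumed value to tune):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Compress the features to 100 SVD components (assumed value; tune
# via validation), then train a linear SVM on the compressed data.
model = make_pipeline(TruncatedSVD(n_components=100), LinearSVC())
# model.fit(X_train, y_train)  # X_train: samples x features
```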
Hope it helps.
There are some good techniques in machine learning, such as boosting and AdaBoost.
Boosting is one classification method. It iteratively reweights the data, which is then classified by a particular base classifier on each iteration, building up a classification model. Boosting assigns a weight to each data point on each iteration, and the weight changes according to how difficult that point is to classify.
AdaBoost is an ensemble technique that uses an exponential loss function to improve the accuracy of the predictions made.
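A minimal AdaBoost sketch in scikit-learn, with toy data standing in for a real task (the estimator count is an assumed value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Toy data standing in for a real classification task.
X, y = make_classification(n_samples=500, random_state=0)

# AdaBoost reweights misclassified samples after each round, so later
# weak learners (decision stumps by default) focus on the hard examples.
clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
print(clf.score(X, y))
```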
I think your question is very open-ended, and the "best classifier for images" will largely depend on the type of image you want to classify. But in general, I suggest you study convolutional neural networks (CNNs) and transfer learning; these are currently the state-of-the-art techniques for this problem.
Check out pre-trained CNN models from PyTorch or TensorFlow.
For images, I also suggest you study image pre-processing; pre-processing techniques are very important for highlighting features of the image and improving the generalization of the classifier.
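A minimal transfer-learning sketch with a pre-trained torchvision model, replacing the final layer for a hypothetical 10-class task:

```python
import torch.nn as nn
from torchvision import models

# Load a ResNet-18 pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for our own task
# (10 classes is an assumed value; use your real class count).
model.fc = nn.Linear(model.fc.in_features, 10)
```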

In what sequence is cluster analysis done?

First, find the minimum frequent patterns from the database.
Then divide them into various data types, like interval-based, binary, and ordinal variables, and define distance measures for all the variables.
Finally, apply a cluster analysis method.
Is this sequence right, or am I missing something?
Whether you're right or not depends on what you want to do. The general approach that you describe seems to go in the right direction, but you'll never know if you're on target until you answer the following questions:
What is your data?
What are you trying to find, and which clustering method do you want to use?
From what you describe, it seems to me that you want to do "preprocessing" steps like feature selection and vectorization. Unfortunately, this by itself can be quite challenging. For example, one of the biggest practical problems is the design of a distance function (there's a tremendous amount of research available on this).
So, please give us more information on your specific target application.
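To make the sequence concrete, here is a minimal sketch of preprocessing followed by clustering in scikit-learn, with toy data standing in for the patterns mined from the database:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the mined patterns.
X, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)

# Preprocess: scale the variables so the Euclidean distance used
# below treats all of them comparably.
X_scaled = StandardScaler().fit_transform(X)

# Apply a cluster analysis method (the cluster count is an assumed value).
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
```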

Measuring the performance of classification algorithm

I have a classification problem that I'd like to address with a machine learning algorithm (Bayesian or Markovian, probably; the question is independent of the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classifier while taking the data overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples and then use those very same samples to measure fitness, I might run into a data overfitting problem: the classifier will know the exact answers for the training instances without having much predictive power, rendering the fitness results useless.
An obvious solution would be to separate the hand-tagged samples into training and test samples, and I'd like to learn about methods for selecting statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!
You could use 10-fold cross-validation for this. I believe it's a pretty standard approach for evaluating the performance of a classification algorithm.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset as test data and the others as training data. Repeat this for each subset and calculate the average performance at the end.
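A minimal sketch of 10-fold cross-validation in scikit-learn; the Gaussian naive Bayes classifier and the toy dataset are placeholders for your own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # toy data standing in for your samples

# Train on 9 folds, test on the held-out fold, repeated 10 times.
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```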
As Mr. Brownstone said, 10-fold cross-validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers, and for this I used Weka, which has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.
