SVM and Neural Network - artificial-intelligence

SVM and Neural Network - artificial-intelligence

What is difference between SVM and Neural Network?
Is it true that linear svm is same NN, and for non-linear separable problems, NN uses adding hidden layers and SVM uses changing space dimensions?

There are two parts to this question. The first part is "what is the form of function learned by these methods?" For NN and SVM this is typically the same. For example, a single hidden layer neural network uses exactly the same form of model as an SVM. That is:
Given an input vector x, the output is:
output(x) = sum_over_all_i weight_i * nonlinear_function_i(x)
Generally the nonlinear functions will also have some parameters. So these methods need to learn how many nonlinear functions should be used, what their parameters are, and what the value of all the weight_i weights should be.
Therefore, the difference between a SVM and a NN is in how they decide what these parameters should be set to. Usually when someone says they are using a neural network they mean they are trying to find the parameters which minimize the mean squared prediction error with respect to a set of training examples. They will also almost always be using the stochastic gradient descent optimization algorithm to do this. SVM's on the other hand try to minimize both training error and some measure of "hypothesis complexity". So they will find a set of parameters that fits the data but also is "simple" in some sense. You can think of it like Occam's razor for machine learning. The most common optimization algorithm used with SVMs is sequential minimal optimization.
Another big difference between the two methods is that stochastic gradient descent isn't guaranteed to find the optimal set of parameters when used the way NN implementations employ it. However, any decent SVM implementation is going to find the optimal set of parameters. People like to say that neural networks get stuck in a local minima while SVMs don't.

NNs are heuristic, while SVMs are theoretically founded. A SVM is guaranteed to converge towards the best solution in the PAC (probably approximately correct) sense. For example, for two linearly separable classes SVM will draw the separating hyperplane directly halfway between the nearest points of the two classes (these become support vectors). A neural network would draw any line which separates the samples, which is correct for the training set, but might not have the best generalization properties.
So no, even for linearly separable problems NNs and SVMs are not same.
In case of linearly non-separable classes, both SVMs and NNs apply non-linear projection into higher-dimensional space. In the case of NNs this is achieved by introducing additional neurons in the hidden layer(s). For SVMs, a kernel function is used to the same effect. A neat property of the kernel function is that the computational complexity doesn't rise with the number of dimensions, while for NNs it obviously rises with the number of neurons.

Running a simple out-of-the-box comparison between support vector machines and neural networks (WITHOUT any parameter-selection) on several popular regression and classification datasets demonstrates the practical differences: an SVM becomes a very slow predictor if many support vectors are being created while a neural network's prediction speed is much higher and model-size much smaller. On the other hand, the training time is much shorter for SVMs. Concerning the accuracy/loss - despite the aforementioned theoretical drawbacks of neural networks - both methods are on par - especially for regression problems, neural networks often outperform support vector machines. Depending on your specific problem, this might help to choose the right model.

Both Support Vector Machines (SVMs) and Artificial Neural Networks (ANNs) are supervised machine learning classifiers. An ANN is a parametric classifier that uses hyper-parameters tuning during the training phase. An SVM is a non-parametric classifier that finds a linear vector (if a linear kernel is used) to separate classes. Actually, in terms of the model performance, SVMs are sometimes equivalent to a shallow neural network architecture. Generally, an ANN will outperform an SVM when there is a large number of training instances, however, neither outperforms the other over the full range of problems.
We can summarize the advantages of the ANN over the SVM as follows:
ANNs can handle multi-class problems by producing probabilities for each class. In contrast, SVMs handle these problems using independent one-versus-all classifiers where each produces a single binary output. For example, a single ANN can be trained to solve the hand-written digits problem while 10 SVMs (one for each digit) are required.
Another advantage of ANNs, from the perspective of model size, is that the model is fixed in terms of its inputs nodes, hidden layers, and output nodes; in an SVM, however, the number of support vector lines could reach the number of instances in the worst case.
The SVM does not perform well when the number of features is greater than the number of samples. More work in feature engineering is required for an SVM than that needed for a multi-layer Neural Network.
On the other hand, SVMs are better than ANNs in certain respects:
In comparison to SVMs, ANNs are more prone to becoming trapped in local minima, meaning that they sometime miss the global picture.
While most machine learning algorithms can overfit if they don’t have enough training samples, ANNs can also overfit if training goes on for too long - a problem that SVMs do not have.
SVM models are easier to understand. There are different kernels that provide a different level of flexibilities beyond the classical linear kernel, such as the Radial Basis Function kernel (RBF). Unlike the linear kernel, the RBF can handle the case when the relation between class labels and attributes is nonlinear.

SVMs and NNs have the same building block as perceptrons, but SVMs also uses a kernel trick to raise dimension from say 2 to 3d by translation such as Y = (x1,2,..^2, y1,2...^2) which can separate linearly inseparable plains using a straight line. Want a demo like it and ask me :)

Actually, they are exactly equivalent to each other. The only difference is in their standard implementations with selections of activation function and regularization etc, which obviously differ from each other. Also, I have yet not seen a dual formulation for neural networks, but SVMs are moving toward the primal anyway.

Practically, most of your assumption are often quite true. I'll elaborate: for linear separable classes Linear SVM works quite good and and it's much faster to train. For non linear classes there is the kernel trick, which is sending your data to a higher dimension space. This trick however has two disadvantages compared to NN. First - your have to search for the right parameters , because the classifier will only work if in the higher dimension the two sets will be linearly separable. Now - testing parameters is often done by grid search which is CPU-time consuming. The other problem is that this whole technique is not as general as NN (for example, for NLP if often results in poor classifier).

Related

What is the purpose of using genetic algorithm in learning an ANN

I studied the basics of learning ANNs with a genetic algorithm. I found out that there are basically 2 things you can do:
Use GA to design the structure of the net (determine whether there should be an edge between two neurons or not). I guess we assume we can only use a certain amount of neuron-to-neuron connections.
Use GA to calculate optimal weights.
I also learned that GA makes sense only in case of irregular networks. If the net consists of layers it's suggested to use back propagation as it's faster.
If back propagation is faster and requires a network made of layers, why would I bother to choose GA for learning or designing the network?

Use GA to design the structure of the net (determine whether there
should be an edge between two neurons or not).
In general you seem to be talking about feed forward networks, presumably MLPs.
With these structure of the network is concerned with the number of neurons and layers as well as the connections between neurons. Usually these are set out fully connected, so that every neuron in layer n is connected to every one in layer n+1. The training method will sort out partial connectivity by training some weights down to zero, or very small numbers.
There are some rules for setting up ANNs bases on the data complexity and what you want them to do. These can give you a good starting point. The training algorithms will sort out neuron to neuron connection but will not affect the number of neurons or layers.
So a GA can be used to experiment with parameters that affect the network size.
Use GA to calculate optimal weights.
GA's are not guaranteed to do this. "optimal weights" don't really exist. A trained network is going to give you a balance between recognition and error. You could say "optimal weights" to obtain a target error.
For a feed forward MLP a GA is going to take more processing time than Back Propagation.
I have also experienced that the GA does not fine tune as well so you may have a network that is less noise tolerant using a GA than an BP.
Neither approach is guaranteed to fine the absolute minimum or even an acceptable minimum. Both can get stuck in local minima. There are techniques for re-starting both GA and BP if this happens. But remember that your network's architecture may not allow it to achieve a an acceptable error on your data. There is a limited amount of memory/space in the weights and a solution my just not be there. So when you think you are in a local minima you could actually be at the absolute minim but above an acceptable error.
I also learned that GA makes sense only in case of irregular networks.
If the net consists of layers it's suggested to use back propagation
as it's faster.
You are right here, and not just with BP. Most network architectures that have a dedicated training algorithm are better suited to this than a GA.
But for irregular networks there may be no dedicated training algorithm. For these a GA allows you to experiment and train. Testing architectures to see if a solution is possible before you try to write a dedicated training algorithm.
Remember before BP was invented there was a ten year hiatus on ANNs because there was no method of training a MLP!
If back propagation is faster and requires a network made of layers,
why would I bother to choose GA for learning or designing the network?
If you are using a FF network BP is usually the best LEARNING choice. However learning only involved manipulating the weights. A GA can be used to design the structure, and modify other things like Bias, squashing function,....
A point to note, which is often overlooked is that Back Propagation trains a single network by adjusting the weights. A GA is a population of many networks with fixed weights, it evolves a solution which is a network that is "born" with fixed weights. There is no actual network training/learning.
While the initial parameters of a single network; number of neurons, layers, bias, initial weights, may require attention and experimentation. The GA parameters of; population size, initial values, mutation rate, crossover,... all affect evolution time and the outcome of a likely or possible solution.

The structure of a network is not necessarily easy to choose (even for a layered one). The accuracy of the network will vary depending on how many neurons are used, how they are organized and connected and many other aspects. Using a GA algorithm to choose the setup might just produce better results than a human guess.
The same goes for the weights. Backpropagation does not necessarily produce a perfect result. It might only come up with a local optimum, which is performing worse than the network could with a different set of weights. A genetic algorithm might find better results here as well.
In the end, it is a different approach to solving the difficult optimization problems that ANNs pose.

Because you have no idea just how to organize your network into layers; in fact, you might use a GA in order to come up with a way to organize it into layers, and then use BP to calculate the weights in said network.

Machine learning, best technique

I am new to machine learning. I am familiar with SVM , Neural networks and GA. I'd like to know the best technique to learn for classifying pictures and audio. SVM does a decent job but takes a lot of time. Anyone know a faster and better one? Also I'd like to know the fastest library for SVM.

Your question is a good one, and has to do with the state of the art of classification algorithms, as you say, the election of the classifier depends on your data, in the case of images, I can tell you that there is one method called Ada-Boost, read this and this to know more about it, in the other hand, you can find lots of people are doing some researh, for example in Gender Classification of Faces Using Adaboost [Rodrigo Verschae,Javier Ruiz-del-Solar and Mauricio Correa] they say:
"Adaboost-mLBP outperforms all other Adaboost-based methods, as well as baseline methods (SVM, PCA and PCA+SVM)"
Take a look at it.

If your main concern is speed, you should probably take a look at VW and generally at stochastic gradient descent based algorithms for training SVMs.

if the number of features is large in comparison to the number of the trainning examples
then you should go for logistic regression or SVM without kernel
if the number of features is small and the number of training examples is intermediate
then you should use SVN with gaussian kernel
is the number of features is small and the number of training examples is large
use logistic regression or SVM without kernels .
that's according to the stanford ML-class .

For such task you may need to extract features first. Only after that classification is feasible.

I think feature extraction and selection is important.
For image classification, there are a lot of features such as raw pixels, SIFT feature, color, texture,etc. It would be better choose some suitable for your task.
I'm not familiar with audio classication, but there may be some specturm features, like the fourier transform of the signal, MFCC.
The methods used to classify is also important. Besides the methods in the question, KNN is a reasonable choice, too.
Actually, using what feature and method is closely related to the task.

The method mostly depends on problem at hand. There is no method that is always the fastest for any problem. Having said that, you should also keep in mind that once you choose an algorithm for speed, you will start compromising on the accuracy.
For example- since your trying to classify images, there might a lot of features compared to the number of training samples at hand. In such cases, if you go for SVM with kernels, you could end up over fitting with the variance being too high.
So you would want to choose a method that has a high bias and low variance. Using logistic regression or linear SVM are some ways to do it.
You could also use different types of regularizations or techniques such as SVD to remove the features that do not contribute much to your output prediction and have only the most important ones. In other words, choose the features that have little or no correlation between them. Once you do this, you would be able to speed yup your SVM algorithms without sacrificing the accuracy.
Hope it helps.

there are some good techniques in learning machines such as, boosting and adaboost.
One method of classification is the boosting method. This method will iteratively manipulate data which will then be classified by a particular base classifier on each iteration, which in turn will build a classification model. Boosting uses weighting of each data in each iteration where its weight value will change according to the difficulty level of the data to be classified.
While the method adaBoost is one ensamble technique by using loss function exponential function to improve the accuracy of the prediction made.

I think your question is very open ended, and "best classifier for images" will largely depend on the type of image you want to classify. But in general, I suggest you study convulutional neural networks ( CNN ) and transfer learning, currently these are the state of the art techniques for the problem.
check out pre-trained models of cnn based neural networks from pytorch or tensorflow
Related to images I suggest you also study pre-processing of images, pre-processing techniques are very important to highlight some feature of the image and improve the generalization of the classifier.

AI Techniques for Face Detection

Can anyone all the different techniques used in face detection? Techniques like neural networks, support vector machines, eigenfaces, etc.
What others are there?

the technique I'm going to talk about is more a machine learning oriented approach; in my opinion is quite fascinating, though not very recent: it was described in the article "Robust Real-Time Face Detection" by Viola and Jones. I used the OpenCV implementation for an university project.
It is based on haar-like features, which consists in additions and subtractions of pixel intensities within rectangular regions of the image. This can be done very fast using a procedure called integral image, for which also GPGPU implementations exist (sometimes are called "prefix scan"). After computing integral image in linear time, any haar-like feature can be evaluated in constant time. A feature is basically a function that takes a 24x24 sub-window of the image S and computes a value feature(S); a triplet (feature, threshold, polarity) is called a weak classifier, because
polarity * feature(S) < polarity * threshold
holds true on certain images and false on others; a weak classifier is expected to perform just a little better than random guess (for instance, it should have an accuracy of at least 51-52%).
Polarity is either -1 or +1.
Feature space is big (~160'000 features), but finite.
Despite threshold could in principle be any number, from simple considerations on the training set it turns out that if there are N examples, only N + 1 threshold for each polarity and for each feature have to be examined in order to find the one that holds the best accuracy. The best weak classifier can thus be found by exhaustively searching the triplets space.
Basically, a strong classifier can be assembled by iteratively choosing the best possible weak classifier, using an algorithm called "adaptive boosting", or AdaBoost; at each iteration, examples which were misclassified in the previous iteration are weighed more. The strong classifier is characterized by its own global threshold, computed by AdaBoost.
Several strong classifiers are combined as stages in an attentional cascade; the idea behind the attentional cascade is that 24x24 sub-windows that are obviously not faces are discarded in the first stages; a strong classifier usually contains only a few weak classifiers (like 30 or 40), hence is very fast to compute. Each stage should have a very high recall, while false positive rate is not very important. if there are 10 stages each with 0.99 recall and 0.3 false positive rate, the final cascade will have 0.9 recall and extremely low false positive rate. For this reason, strong classifier are usually tuned in order to increase recall and false positive rate. Tuning basically involves reducing the global threshold computed by AdaBoost.
A sub-window that makes it way to the end of the cascade is considered a face.
Several sub-window in the initial image, eventually overlapping, eventually after rescaling the image, must be tested.

An emerging but rather effective approach to the broad class of vision problems, including face detection, is the use of Hierarchical Temporal Memory (HTM), a concept/technology developed by Numenta.
Very loosely speaking, this is a neuralnetwork-like approach. This type of network has a tree shape where the number of nodes decreases significantly at each level. HTM models some of the structural and algorithmic properties of the neocortex. In [possible] departure with the neocortex the classification algorithm implemented at the level of each node uses a Bayesian algorithm. HTM model is based on the memory-prediction theory of brain function and relies heavily on the the temporal nature of inputs; this may explain its ability to deal with vision problem, as these are typically temporal (or can be made so) and also require tolerance for noise and "fuzziness".
While Numemta has produced vision kits and demo applications for some time, Vitamin D recently produced -I think- the first commercial application of HTM technology at least in the domain of vision applications.

If you need it not just as theoretical stuff but you really want to do face detection then I recommend you to find already implemented solutions.
There are plenty tested libraries for different languages and they are widely used for this purpose. Look at this SO thread for more information: Face recognition library.

Support Vector Machine or Artificial Neural Network for text processing? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Improve this question
We need to decide between Support Vector Machines and Fast Artificial Neural Network for some text processing project.
It includes Contextual Spelling Correction and then tagging the text to certain phrases and their synonyms.
Which will be the right approach? Or is there an alternate to both of these... Something more appropriate than FANN as well as SVM?

I think you'll get a competitive results from both of the algorithms, so you should aggregate the results... think about ensemble learning.
Update:
I don't know if this is specific enough: use Bayes Optimal Classifier to combine the prediction from each algorithm. You have to train both of your algorithms, then you have to train the Bayes Optimal Classifier to use your algorithms and make optimal predictions based on the input of the algorithms.
Separate your training data in 3:
1st data set will be used to train the (Artificial) Neural Network and the Support Vector Machines.
2nd data set will be used to train the Bayes Optimal Classifier by taking the raw predictions from the ANN and SVM.
3rd data set will be your qualification data set where you will test your trained Bayes Optimal Classifier.
Update 2.0:
Another way to create an ensemble of the algorithms is to use 10-fold (or more generally, k-fold) cross-validation:
Break data into 10 sets of size n/10.
Train on 9 datasets and test on 1.
Repeat 10 times and take a mean accuracy.
Remember that you can generally combine many the classifiers and validation methods in order to produce better results. It's just a matter of finding what works best for your domain.

You might want to also take a look at maxent classifiers (/log linear models).
They're really popular for NLP problems. Modern implementations, which use quasi-newton methods for optimization rather than the slower iterative scaling algorithms, train more quickly than SVMs. They also seem to be less sensitive to the exact value of the regularization hyperparameter. You should probably only prefer SVMs over maxent, if you'd like to use a kernel to get feature conjunctions for free.
As for SVMs vs. neural networks, using SVMs would probably be better than using ANNs. Like maxent models, training SVMs is a convex optimization problem. This means, given a data set and a particular classifier configuration, SVMs will consistently find the same solution. When training multilayer neural networks, the system can converge to various local minima. So, you'll get better or worse solutions depending on what weights you use to initialize the model. With ANNs, you'll need to perform multiple training runs in order to evaluate how good or bad a given model configuration is.

This question is very old. Lot of developments were happened in NLP area in last 7 years.
Convolutional_neural_network and Recurrent_neural_network evolved during this time.
Word Embeddings: Words appearing within similar context possess similar meaning. Word embeddings are pre-trained on a task where the objective is to predict a word based on its context.
CNN for NLP:
Sentences are first tokenized into words, which are further transformed into a word embedding matrix (i.e., input embedding layer) of d dimension.
Convolutional filters are applied on this input embedding layer to produce a feature map.
A max-pooling operation on each filter obtain a fixed length output and reduce the dimensionality of the output.
Since CNN had a short-coming of not preserving long-distance contextual information, RNNs have been introduced.
RNNs are specialized neural-based approaches that are effective at processing sequential information.
RNN memorizes the result of previous computations and use it in current computation.
There are few variations in RNN - Long Short Term Memory Unit (LSTM) and Gated recurrent units (GRUs)
Have a look at below resources:
deep-learning-for-nlp
Recent trends in deep learning paper

You can use Convolution Neural Network (CNN) or Recurrent Neural Network (RNN) to train NLP. I think CNN has achieved state-of-the-art now.

Measuring the performance of classification algorithm

I've got a classification problem in my hand, which I'd like to address with a machine learning algorithm ( Bayes, or Markovian probably, the question is independent on the classifier to be used). Given a number of training instances, I'm looking for a way to measure the performance of an implemented classificator, with taking data overfitting problem into account.
That is: given N[1..100] training samples, if I run the training algorithm on every one of the samples, and use this very same samples to measure fitness, it might stuck into a data overfitting problem -the classifier will know the exact answers for the training instances, without having much predictive power, rendering the fitness results useless.
An obvious solution would be seperating the hand-tagged samples into training, and test samples; and I'd like to learn about methods selecting the statistically significant samples for training.
White papers, book pointers, and PDFs much appreciated!

You could use 10-fold Cross-validation for this. I believe it's pretty standard approach for classification algorithm performance evaluation.
The basic idea is to divide your learning samples into 10 subsets. Then use one subset for test data and others for train data. Repeat this for each subset and calculate average performance at the end.

As Mr. Brownstone said 10-fold Cross-Validation is probably the best way to go. I recently had to evaluate the performance of a number of different classifiers for this I used Weka. Which has an API and a load of tools that allow you to easily test the performance of lots of different classifiers.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight