Add conditions to SVM

I am working on a project that uses a multi-class SVM classifier. I checked some of the tools that perform multi-class classification using SVM; in these tools I need to supply training data and RBF parameters. Can I add constraints to the SVM, such as requiring all members of a class to meet some criterion? E.g., if I want to classify cars, I want the price of all cars in class x to be < 500000. Is this possible?
And if you know of any place to start with adding conditions to an SVM, I would appreciate it.

SVM models do not allow constraints like the ones you describe. Kernel methods are based entirely on a measure of distance, without rules or input constraints.
If you need such constraints, you should consider decision trees/random forests and similar approaches.
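One workaround worth noting: even though the SVM itself cannot encode a rule, you can apply the rule as a post-processing filter on its predictions. A minimal sketch, where the feature index for price, the threshold, and the toy data are all assumptions:

```python
# Sketch: enforce a hard rule (price < 500000 for class 1) on top of
# an SVM's predictions via post-filtering. The SVM cannot learn the
# rule as a constraint, but the filter guarantees it in the output.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(0, 1_000_000, size=(200, 2))  # [price, other_feature]
y = (X[:, 0] < 500_000).astype(int)           # toy labels

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
pred = clf.predict(X)

# Hard constraint: anything with price >= 500000 cannot be in class 1.
constrained = np.where(X[:, 0] >= 500_000, 0, pred)
```

This guarantees the criterion holds in the final labels, at the cost of overriding the classifier wherever the rule fires.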

Related

Comparing the two feature sets

I am working on classifying two feature sets derived from a dataset. We first obtain two feature matrices using two feature-extraction methods. Now I need to compare them. However, the two feature sets reach almost the same recognition accuracy (using 10-fold cross-validation with an SVM). My question is:
Is there a way to design a meaningful experiment that shows the difference between the two methods? What are your suggestions?
Note: I have already seen the similar questions on Stack Overflow; however, I am looking for another approach.
You can:
Perform dimensionality reduction on the features in their respective spaces. This lets you see differences in the distribution of the data points. Ideally, apply the same kernel the SVM uses (a linear kernel otherwise).
Run distribution tests on the features to see whether they differ much.
Map the predictions into a common output space and measure the distance between the vectors.
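The first suggestion can be sketched as follows, assuming two placeholder feature matrices X1 and X2 and an RBF kernel mirroring the SVM's:

```python
# Sketch: project each feature matrix to 2-D with kernel PCA so the
# two representations can be compared visually. X1 and X2 stand in
# for the two feature-extraction outputs.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 20))   # feature set from method 1
X2 = rng.normal(size=(100, 30))   # feature set from method 2

proj1 = KernelPCA(n_components=2, kernel="rbf").fit_transform(X1)
proj2 = KernelPCA(n_components=2, kernel="rbf").fit_transform(X2)
# Plot proj1 and proj2 side by side (colored by class label) to see
# whether the two methods spread the classes differently.
```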

Fuzzy logic application in recommender system

I was wondering how I can gain some kind of advantage by using fuzzy logic in my recommender system.
My system basically calculates similarity between users by:
Tanimoto coefficient
cosine distance
discrete distance
All the similarities are then combined into one score that ranges from 0 to 1.
So we can find similar users for user 1 and then recommend goods that were bought by users who are similar to him.
I understand the basics of fuzzy theory; I just can't think of any use for it here, but I want to try.
I would like to hear any thoughts on this.
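The three measures and their combination can be sketched as follows, for binary purchase vectors; the equal-weight average and the reading of "discrete distance" as fraction-of-agreement are assumptions:

```python
# Sketch of the three similarity measures named above, applied to two
# binary purchase vectors.
import numpy as np

def tanimoto(a, b):
    # |A intersect B| / |A union B| for binary vectors
    inter = np.sum(a & b)
    union = np.sum(a | b)
    return inter / union if union else 0.0

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def discrete(a, b):
    # fraction of positions where the two users agree (an assumption)
    return float(np.mean(a == b))

u1 = np.array([1, 0, 1, 1, 0])
u2 = np.array([1, 1, 1, 0, 0])

# Combine into a single score in [0, 1] by a simple average.
combined = (tanimoto(u1, u2) + cosine(u1, u2) + discrete(u1, u2)) / 3
```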
I have not seen many successful applications of fuzzy logic in real life, so I would not expect too much from it.
Why do you want to try it if you cannot think of any use for it?
If your similarity value ranges from 0 to 1, you can use fuzzy logic to formalize your system. It is like having a system that returns true/false and formalizing it with two-valued logic: you just get the formalization.
The only advantage may be defuzzifying the number (using fuzzy terms like "very similar", "not very similar", ...), but you can do that without fuzzy logic too.
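That defuzzification step can be sketched with triangular membership functions; the label names and breakpoints are illustrative assumptions:

```python
# Sketch: triangular membership functions mapping a similarity score
# in [0, 1] to fuzzy labels. Breakpoints (0, 0.5, 1) are assumptions.
def memberships(s):
    not_sim = max(0.0, 1.0 - s / 0.5)              # peaks at 0, zero at 0.5
    somewhat = max(0.0, 1.0 - abs(s - 0.5) / 0.5)  # peaks at 0.5
    very = max(0.0, (s - 0.5) / 0.5)               # zero below 0.5, peaks at 1
    return {"not similar": not_sim,
            "somewhat similar": somewhat,
            "very similar": very}

m = memberships(0.8)
label = max(m, key=m.get)   # pick the strongest membership
```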

Use case for incremental supervised learning using Apache Mahout

Business case:
Forecasting fuel consumption at a site.
Say fuel consumption C depends on various factors x1, x2, ..., xn. Mathematically speaking, C = F(x1, x2, ..., xn). I do not have an equation for this.
I do have a historical dataset from which I can derive the correlation of C with x1, x2, etc. C, x1, x2, ... are all quantitative. Finding the correlation for an n-variable equation seems tough for someone like me with limited statistical knowledge.
So I was thinking of employing supervised machine-learning techniques for this. I will train a model on the historical data to predict the next consumption.
Question: Am I thinking in the right way?
Question: If this is correct, my system should be an evolving one. The more real data I feed into the system, the more the model should evolve to make better predictions the next time. Is this a correct understanding?
If the above statements are true, will the AdaptiveLogisticRegression algorithm, as present in Mahout, be of help to me?
Requesting advice from the experts here!
Thanks in advance.
OK, correlation is not a forecasting model. Correlation simply describes a relationship between datasets based on covariance.
In order to develop a forecasting model, what you need to perform is regression.
The simplest form of regression is univariate linear regression, where C = F(x1). This can easily be done in Excel. However, you state that C is a function of several variables. For this, you can employ multivariate linear regression. There are standard packages that can perform this (within Excel, for example), or you can use Matlab, etc.
Now, we are assuming that there is a "linear" relationship between C and the components of X (the input vector). If the relationship is not linear, then you need more sophisticated methods (nonlinear regression), which may very well employ machine-learning methods.
Finally, some series exhibit autocorrelation. If this is the case, then it may be possible for you to ignore the C = F(x1, x2, ..., xn) relationship and instead model C directly using time-series techniques such as ARMA and its more complex variants.
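As a concrete illustration of the multivariate case, here is a sketch in Python rather than Mahout, on made-up data; the incremental variant at the end addresses the "evolving model" requirement from the question:

```python
# Sketch: multivariate linear regression C = F(x1..xn) on synthetic
# data, plus an incremental learner that updates as new batches arrive.
# All data and coefficients here are made up.
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # x1..x4
C = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 10   # true linear relation

# Batch fit on all historical data.
model = LinearRegression().fit(X, C)
pred = model.predict(X)

# Incremental alternative: feed new data in batches as it arrives.
inc = SGDRegressor(random_state=0)
for batch in np.array_split(np.arange(500), 10):
    inc.partial_fit(X[batch], C[batch])
```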
I hope this helps,
Srikant Krishna

Datasets to test Nonlinear SVM

I'm implementing a nonlinear SVM and I want to test my implementation on simple data that is not linearly separable. Google didn't help me find what I want. Can you please advise me where I can find such data? Or at least, how can I generate such data manually?
Thanks,
Well, SVMs are two-class classifiers, i.e., these classifiers place data on either side of a single decision boundary.
Therefore, I would suggest a dataset comprised of just two classes. (That's not strictly necessary, because an SVM can of course separate more than two classes by passing the classifier multiple times, in series, over the data, but it's cumbersome to do this during initial testing.)
So, for instance, you can use the iris dataset, linked to in Scott's answer; it comprises three classes: Class I is linearly separable from Classes II and III, while Classes II and III are not linearly separable from each other. If you want to use this dataset, for convenience's sake you might prefer to remove Class I (approximately the first 50 rows), so that what remains is a two-class problem in which the two remaining classes are not linearly separable.
The iris dataset is quite small (150 x 4, or 50 rows per class by four features). Depending on where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger dataset.
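The Class I removal described above can be sketched in a few lines with scikit-learn's bundled copy of the iris data:

```python
# Sketch: load iris and drop Class I (label 0, setosa) so the two
# remaining classes (versicolor, virginica) form a two-class problem
# that is not linearly separable.
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
mask = y != 0                 # remove the linearly separable class
X2, y2 = X[mask], y[mask]     # 100 rows, 4 features, labels {1, 2}
```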
An interesting family of datasets comprised of just two classes that are definitely not linearly separable are the anonymized datasets supplied by the dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these datasets for SVM prototype evaluation because they are large, with quite a few features, yet still comprise just two non-linearly-separable classes.
I am aware of two places from which you can retrieve this data. The first site has a single dataset (PCI Code downloads, chapter9, matchmaker.csv) comprising 500 data points (rows) and six features (columns). Although this set is simpler to work with, the data is more or less in raw form and will require some processing before you can use it.
The second source contains two eHarmony datasets, one of which comprises over half a million rows and 59 features. In addition, these two datasets have undergone substantial processing, so the only task required before feeding them to your SVM is routine rescaling of the features.
The particular dataset you need will depend heavily on your choice of kernel function, so it seems the easiest approach is simply to create a toy dataset yourself.
Some helpful ideas:
Concentric circles
Spiral-shaped classes
Nested banana-shaped classes
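The first two shapes above can be generated by hand with NumPy; the sample sizes, radii, and noise levels are arbitrary choices:

```python
# Sketch: hand-generate two of the toy shapes listed above.
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Concentric circles: inner ring = class 0, outer ring = class 1.
theta = rng.uniform(0, 2 * np.pi, n)
r = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0, 0.1, n)
X_circles = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y_circles = (np.arange(n) >= n // 2).astype(int)

# Two interleaved spirals: the second arm is the first rotated 180 degrees.
t = np.linspace(0.5, 4 * np.pi, n // 2)
spiral_a = np.column_stack([t * np.cos(t), t * np.sin(t)])
spiral_b = -spiral_a
X_spiral = np.vstack([spiral_a, spiral_b])
y_spiral = np.array([0] * (n // 2) + [1] * (n // 2))
```

Either of these makes a linear kernel fail cleanly while an RBF kernel should separate them easily.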
If you just want a dataset that is not linearly separable, may I suggest the Iris dataset? It is a multivariate dataset in which at least a couple of the classes are not linearly separable.
Hope this helps!
You can start with simple datasets like Iris or two-moons, both of which are not linearly separable. Once you are satisfied, you can move on to bigger classification datasets from the UCI ML repository.
Be sure to compare and benchmark against standard SVM solvers like libSVM and SVM-light.
If you program in Python, you can use functions in the sklearn.datasets.samples_generator package to generate nested moon-shaped datasets, concentric-circle datasets, etc. manually. There is a page of plots of these datasets.
And if you don't want to generate a dataset manually, you can refer to this website, where, in the section "Shape sets", you can download these datasets and test on them directly.
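A minimal sketch using those generators (note that in current scikit-learn releases they are exposed directly under sklearn.datasets rather than the old samples_generator module):

```python
# Sketch: scikit-learn's built-in generators for non-linearly-
# separable two-class data.
from sklearn.datasets import make_moons, make_circles

X_moons, y_moons = make_moons(n_samples=200, noise=0.1, random_state=0)
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.5,
                              random_state=0)
```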

In what sequence is cluster analysis done?

First, find the minimum frequent patterns from the database.
Then divide them into various data types (interval-based, binary, ordinal variables, etc.) and define distance measures for all the variables.
Finally, apply a cluster analysis method.
Is this sequence right, or am I missing something?
Whether you're right or not depends on what you want to do. The general approach that you describe seems to go in the right direction, but you'll never know whether you're on target until you answer the following questions:
What is your data?
What are you trying to find? Which clustering method do you want to use?
From what you describe, it seems to me that you want to do 'preprocessing' steps like feature selection and vectorization. Unfortunately, this by itself can be quite challenging. For example, one of the biggest practical problems is the design of a distance function (there is a tremendous amount of research available).
So, please give us more information on your specific target application.
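As a starting point for that distance-function design, here is a sketch of a Gower-style distance over the mixed variable types the question mentions; the column roles and ranges are assumptions:

```python
# Sketch of a Gower-style distance over mixed variable types
# (interval, binary, ordinal). Each component is normalized to [0, 1]
# and the components are averaged with equal weights (an assumption).
def mixed_distance(a, b, ranges):
    """a, b: records as (interval, binary, ordinal);
    ranges: per-column range used to normalize non-binary components."""
    d_interval = abs(a[0] - b[0]) / ranges[0]
    d_binary = 0.0 if a[1] == b[1] else 1.0
    d_ordinal = abs(a[2] - b[2]) / ranges[2]
    return (d_interval + d_binary + d_ordinal) / 3

rec1 = (20.0, 1, 3)   # e.g. age, owns_car, satisfaction rank (hypothetical)
rec2 = (30.0, 0, 1)
d = mixed_distance(rec1, rec2, ranges=(50.0, 1, 4))
```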
