Datasets to test Nonlinear SVM - dataset

I'm implementing a nonlinear SVM and I want to test my implementation on a simple not linearly separable data. Google didn't help me find what I want. Can you please advise me where I can find such data. Or at least, how can I generate such data manually ?
Thanks,

Well, SVMs are two-class classifiers--i.e., these classifiers place data on either side of a single decision boundary.
Therefore, i would suggest a data set comprised of just two classes (that's not strictly necessary because of course an SVM can separate more than two classes by passing the Classifier multiple times (in series) over the data, it's cumbersome to do this during initial testing).
So for instance, you can use the iris data set, linked to in Scott's answer; it's comprised of three classes, Class I is linear separable from Class II and III; Class II and III are not linear separable. If you want to use this data set, for convenience-sake you might prefer to remove Class I (approx. the first 50 data rows), so what remains is a two-class system, in which the two remaining classes are not linearly separable.
The iris data set is quite small (150 x 4, or 50 rows/class x four features)--depending where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
An interesting family of data sets that are comprised of just two classes and that are definitely non-linearly separable are the the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large data sets with quite a few features yet still comprised of just two non-linearly separable classes.
I am aware of two places from which you can retrieve this data. The first Site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (row) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.
The second source for this data, contains two eHarmony data sets, one of them is comprised of over half million rows and 59 features. In addition, these two data sets have undergone substantial processing such that the only task required before feeding them to your SVM is routine rescaling of the features.

The particular data set you need will depend highly on your choice of kernel function, so It seems the easiest method is simply creating a toy data set yourself.
Some helpful ideas:
Concentric circles
Spiral-shaped classes
Nested banana-shaped classes
If you just want a random data set which is not linearly separable, may I suggest the Iris dataset? It is a multivariate data set where at least a couple of the classes in question are not linearly separable.
Hope this helps!

You can start with simple datasets like Iris or two-moons both of which are linearly non-separable. Once you are satisfied, you can move on to bigger datasets from the UCI ML repository, classification datasets.
Be sure to compare and benchmark against standard SVM solvers like libSVM and SVM-light.

If you program in Python, you can use a few functions in the package of sklearn.datasets.samples_generator to manully generate nested moon-shape data set, concentric circle data set etc. Here is a page of plots of these data sets.
And if you don't want to generate data set manually, you can refer to this website, where in the seciton of "shape sets", you can download these data set and test on them directly.

Related

Do you keep the true distribution between classes when you manually create your own dataset, or do you make it balanced?

(I struggled a bit to phrase the title - please feel free to suggest another title).
I have a text-dataset which I need to classify, say there's three classes. I need to create the targets by manually setting the labels based on the text (say the three classes are dog,cat,bird).
When I do so I notice we have, say, 70% dog, 20% cat and 10% bird.
Since a lot of machine learning models struggle with imbalanced data, my first thought would be to force the dataset being balanced simply to ignore some of the dog and cat text (i.e "undersampling") thus ending up with (almost) a balanced dataset, making it more easy to train the model.
My concern is though that if we want to train e.g a neural network and get the probability for each class, not training over the correct distribution of the data would result in over/under-confident predictions?
Indeed if your dataset is imbalanced, there is a risk of affecting the performance of your classifier.
You'll find plenty of libraries to help you deal with this problem (see below) and the bottom line is if classes are equally represented in your dataset, it can only help prevent biais of your classifier:
https://imbalanced-learn.org/stable/auto_examples/index.html#general-examples
https://github.com/ufoym/imbalanced-dataset-sampler
https://github.com/MaxHalford/pytorch-resample etc...
(but you can also do that sampling yourself, shouldn't be too difficult, eg libraries like pandas have such functionality)
As a safeguard, split your dataset into 3:
Training (eg 70% of our data): the bulk of the data used for learning
Validation (eg 20%): what your classifier uses for regularization (ie to prevent over fitting)
Test (eg 10%): this data is NEVER exposed to your classifier for learning purposes, you keep it separate and just use it at the end on your model to evaluate its true performance (you call predict and compare with expected classes).
This should be a good starting point.

Deep Learning: Dataset containing images of varying dimensions and orientations

This is a repost of the question asked in ai.stackexchange. Since there is not much traction in that forum, I thought I might try my chances here.
I have a dataset of images of varying dimensions of a certain object. A few images of the object are also in varying orientations. The objective is to learn the features of the object (using Autoencoders).
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image, or should I strictly consider a dataset containing images of uniform dimensions? What is the necessary criteria of an eligible dataset to be used for training a Deep Network in general.
The idea is, I want to avoid pre-processing my dataset by normalizing it via scaling, re-orienting operations etc. I would like my network to account for the variability in dimensions and orientations. Please point me to resources for the same.
EDIT:
As an example, consider a dataset consisting of images of bananas. They are of varying sizes, say, 265x525 px, 1200x1200 px, 165x520 px etc. 90% of the images display the banana in one orthogonal orientation (say, front view) and the rest display the banana in varying orientations (say, isometric views).
Almost always people will resize all their images to the same size before sending them to the CNN. Unless you're up for a real challenge this is probably what you should do.
That said, it is possible to build a single CNN that takes input of images as varying dimensions. There are a number of ways you might try to do this, and I'm not aware of any published science analyzing these different choices. The key is that the set of learned parameters needs to be shared between the different inputs sizes. While convolutions can be applied at different images sizes, ultimately they always get converted to a single vector to make predictions with, and the size of that vector will depend on the geometries of the inputs, convolutions and pooling layers. You'd probably want to dynamically change the pooling layers based on the input geometry and leave the convolutions the same, since the convolutional layers have parameters and pooling usually doesn't. So on bigger images you pool more aggressively.
Practically you'd want to group together similarly (identically) sized images together into minibatches for efficient processing. This is common for LSTM type models. This technique is commonly called "bucketing". See for example http://mxnet.io/how_to/bucketing.html for a description of how to do this efficiently.
Is it possible to create a network with layers that account for varying dimensions and orientations of the input image
The usual way to deal with different images is the following:
You take one or multiple crops of the image to make width = height. If you take multiple crops, you pass all of them through the network and average the results.
You scale the crop(s) to the size which is necessary for the network.
However, there is also Global Average Pooling (e.g. Keras docs).
What is the necessary criteria of an eligible dataset to be used for training a Deep Network in general.
That is a difficult question to answer as (1) there are many different approaches in deep learning and the field is quite young (2) I'm pretty sure there is no quantitative answer right now.
Here are two rules of thumb:
You should have at least 50 examples per class
The more parameters your model has, the more data you need
Learning curves and validation curves help to estimate the effect of more training data.

How to save R list object to a database?

Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, model which fits data and some attributes for identifying data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - ARIMA orders found out by auto.arima in suitable format, this again may be a list.
This is just an example. As I said suppose I have a number of such objects combined into a list. I would like to save it into some suitable format. The obvious solution is simply to use save, but this does not scale very well for large number of objects. For example if I only wanted to inspect a subset of objects, I would need to load all of the objects into memory.
If my data is a data.frame I could save it to database. If I wanted to work with particular subset of data I would use SELECT and rely on database to deliver the required subset. SQLite served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list to several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce some report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but this is a another problem.
I think I explained the basics of this somewhere once before---the gist of it is that
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package use that to turn the serialization into hash using different functions
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Beside the option to use a relational database, one can also use the HDF5 file format which is designed to store a large amount of possible large objects. The choice depends on the type of data and the intended way to access it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
there is no anticipation in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogenous, it is not possible to combine them into a table like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchial relationships, where the latter is contained in the former. Within a HDF5 file, the information chunks can be arranged in a hierarchial way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
Not sure if it is the same, but I had some good experience with time series objects with:
str()
Maybe you can look into that.

Regression Model for categorical data

I have very large dataset in csv file (1,700,000 raws and 300 sparse features).
- It has a lot of missing values.
- the data varies between numeric and categoral values.
- the dependant variable (the class) is binary (either 1 or 0).
- the data is highly skewed, the number of positive response is low.
Now what is required from me is to apply regression model and any other machine learning algorithm on this data.
I'm new on this and I need help..
-how to deal with categoral data in case of regression model? and does the missing values affects too much on it?
- what is the best prediction model i can try for large, sparse, skewed data like this?
- what program u advice me to work with? I tried Weka but it can't even open that much of data (memory failure). I know that matlab can open either numeric csv or categories csv not mixed, beside the missing values has to be imputed to allow it to open the file. I know a little bit of R.
I'm trying to manipulate the data using excel, access and perl script. and that's really hard with that amount of data. excel can't open more than almost 1M record and access can't open more than 255 columns. any suggestion.
Thank you for help in advance
First of all, you are talking about classification, not regression - classification allows to predict value from the fixed set (e.g. 0 or 1) while regression produces real numeric output (e.g. 0, 0.5, 10.1543, etc.). Also don't be confused with so called logistic regression - it is classifier too, and its name just shows that it is based on linear regression.
To process such a large amount of data you need inductive (updatable) model. In particular, in Weka there's a number of such algorithms under classification section (e.g. Naive Bayes Updatable, Neutral Networks Updatable and others). With inductive model you will be able to load data portion by portion and update model in appropriate way (for Weka see Knowledge Flow interface for details of how to use it easier).
Some classifiers may work with categorical data, but I can't remember any updatable from them, so most probably you still need to transform categorical data to numeric. Standard solution here is to use indicator attributes, i.e. substitute every categorical attribute with several binary indicator. E.g. if you have attribute day-of-week with 7 possible values you may substitute it with 7 binary attributes - Sunday, Monday, etc. Of course, in each particular instance only one of 7 attributes may hold value 1 and all others have to be 0.
Importance of missing values depend on the nature of your data. Sometimes it worth to replace them with some neutral value beforehand, sometimes classifier implementation does it itself (check manuals for an algorithm for details).
And, finally, for highly skewed data use F1 (or just Precision / Recall) measure instead of accuracy.

Get info from map data file by coordinate

Imagine I have a map shape file (.shp) or osm xml, I'm able to see different kind of data from different layers in GIS oriented programs, e.g. ArcGIS, QGIS etc. But how can I get this info programmatically? Is there a specific library for that?
What I'm really looking for is a some kind of method getMapData(longitude, latitude) to get landscape/terrain info (e.g. forest, river, city, highway) in specified location
Thanks in advance for your answers!
It still depends what you want to achieve whether you are better off using raster or vector data.
If your are using your grid to subdivide an area as an array of containers for geographic features, then stick with vector data. To do this, I would create a polygon grid file and intersect it with each of your data layers. You can then add an ID field that represents the cell's location in the array (and hence it's relative position to a known lat/long coordinate - let's say lower left). Alternatively you can use spatial queries to access your data by selecting a polygon in your vector grid file and then finding all the features in your other file that are contained by it.
OTOH, if you want to do some multi-feature analysis based on presence/abscence then you may be better going down the route of raster analysis. My gut feeling from what you have said is that this is what you are trying to achieve but I am still not 100% sure. You would handle this by creating a set of boolean rasters of a suitable resolution and then performing maths operations on the set (add, subtract, average etc - depending on what questions your are asking).
Let's say you are looking at animal migration. Let's say your model assumes that streams, hedges and towns are all obstacles to migration but roads only reduce the chance of an area being crossed. So you convert your obstacles to a value of '1' and NoData to '0' in each case, except roads where you decide to set the value to 0.5. You can then add all your rasters together in one big stack and predict migration routes.
Ok that's a simplistic example but perhaps you can see why we need EVEN more information on what you are wanting to do.
Shapefiles or an osm xml file are just containers that hold geometric shapes. There are plenty of software libraries out there that let you read these files and extract the data. I would recommend looking at GDAL/OGR as a starting point.
A method like getMapData(longitude, latitude) is essentially a search/query function. You need to be a little more specific too, do you want geometries that contain the point, are within a distance of a point, etc?
You could find the map data using a brute force algorithm
for shape in shapefile:
if shape.contains(query_point):
return shape
Or you can use more advanced algorithms/data structures such as RTrees, KDTrees, QuadTrees, etc. The easiest way to get start with querying map data is to load it into a spatial database. I would recommending investigating PostgreSQL+PostGIS and SpatiaLite
You may also like to look at Spatialite and/or PostGIS which are two spatial enabled databses that you could use separately or in conjunction with GDAL/OGR.
I must echo Charles' request that you explain your use-case in more detail because the actual implementation will depend greatly on exactly what you are wanting to achieve. My reading of this is that you may want to convert your data into a series of aligned rasters which you can overlay and treat as a 3 dimensional array.

Resources