Regression Model for categorical data - database

I have very large dataset in csv file (1,700,000 raws and 300 sparse features).
- It has a lot of missing values.
- the data varies between numeric and categoral values.
- the dependant variable (the class) is binary (either 1 or 0).
- the data is highly skewed, the number of positive response is low.
Now what is required from me is to apply regression model and any other machine learning algorithm on this data.
I'm new on this and I need help..
-how to deal with categoral data in case of regression model? and does the missing values affects too much on it?
- what is the best prediction model i can try for large, sparse, skewed data like this?
- what program u advice me to work with? I tried Weka but it can't even open that much of data (memory failure). I know that matlab can open either numeric csv or categories csv not mixed, beside the missing values has to be imputed to allow it to open the file. I know a little bit of R.
I'm trying to manipulate the data using excel, access and perl script. and that's really hard with that amount of data. excel can't open more than almost 1M record and access can't open more than 255 columns. any suggestion.
Thank you for help in advance

First of all, you are talking about classification, not regression - classification allows to predict value from the fixed set (e.g. 0 or 1) while regression produces real numeric output (e.g. 0, 0.5, 10.1543, etc.). Also don't be confused with so called logistic regression - it is classifier too, and its name just shows that it is based on linear regression.
To process such a large amount of data you need inductive (updatable) model. In particular, in Weka there's a number of such algorithms under classification section (e.g. Naive Bayes Updatable, Neutral Networks Updatable and others). With inductive model you will be able to load data portion by portion and update model in appropriate way (for Weka see Knowledge Flow interface for details of how to use it easier).
Some classifiers may work with categorical data, but I can't remember any updatable from them, so most probably you still need to transform categorical data to numeric. Standard solution here is to use indicator attributes, i.e. substitute every categorical attribute with several binary indicator. E.g. if you have attribute day-of-week with 7 possible values you may substitute it with 7 binary attributes - Sunday, Monday, etc. Of course, in each particular instance only one of 7 attributes may hold value 1 and all others have to be 0.
Importance of missing values depend on the nature of your data. Sometimes it worth to replace them with some neutral value beforehand, sometimes classifier implementation does it itself (check manuals for an algorithm for details).
And, finally, for highly skewed data use F1 (or just Precision / Recall) measure instead of accuracy.

Related

Do you keep the true distribution between classes when you manually create your own dataset, or do you make it balanced?

(I struggled a bit to phrase the title - please feel free to suggest another title).
I have a text-dataset which I need to classify, say there's three classes. I need to create the targets by manually setting the labels based on the text (say the three classes are dog,cat,bird).
When I do so I notice we have, say, 70% dog, 20% cat and 10% bird.
Since a lot of machine learning models struggle with imbalanced data, my first thought would be to force the dataset being balanced simply to ignore some of the dog and cat text (i.e "undersampling") thus ending up with (almost) a balanced dataset, making it more easy to train the model.
My concern is though that if we want to train e.g a neural network and get the probability for each class, not training over the correct distribution of the data would result in over/under-confident predictions?
Indeed if your dataset is imbalanced, there is a risk of affecting the performance of your classifier.
You'll find plenty of libraries to help you deal with this problem (see below) and the bottom line is if classes are equally represented in your dataset, it can only help prevent biais of your classifier:
https://imbalanced-learn.org/stable/auto_examples/index.html#general-examples
https://github.com/ufoym/imbalanced-dataset-sampler
https://github.com/MaxHalford/pytorch-resample etc...
(but you can also do that sampling yourself, shouldn't be too difficult, eg libraries like pandas have such functionality)
As a safeguard, split your dataset into 3:
Training (eg 70% of our data): the bulk of the data used for learning
Validation (eg 20%): what your classifier uses for regularization (ie to prevent over fitting)
Test (eg 10%): this data is NEVER exposed to your classifier for learning purposes, you keep it separate and just use it at the end on your model to evaluate its true performance (you call predict and compare with expected classes).
This should be a good starting point.

Tool for manipulating results of large set of computational experiments

I am a researcher and my primary interest is improving sparse kernels for high performance computing. I investigate large number of parameters on many sparse matrices. I wonder whether there is a tool to manage these results. The problems that I encounter are:
Combine results of several experiments for each matrix
Version the results
Taking average, finding minimum/maximum/standard deviation of results
There are hundreds of metrics that describe the performance improvement. I want to select a couple of the metrics easily and try to find which metric correlates with the performance improvement.
Here I gave a sample small instance of my huge problem. There are three types of parameters and two values for each parameter: Row/Column, Cyclic/Block, HeuristicA/HeuristicB. So there must 8 files for the combination of these parameters. Contents of two of them:
Contents of the file RowCyclicHeuristicA.txt
a.mtx#3#5.1#10#2%#row#cyclic#heuristicA#1
a.mtx#7#4.1#10#4%#row#cyclic#heuristicA#2
b.mtx#4#6.1#10#3%#row#cyclic#heuristicA#1
b.mtx#12#5.7#10#7%#row#cyclic#heuristicA#2
b.mtx#9#3.1#10#10%#row#cyclic#heuristicA#3
Contents of the file ColumnCyclicHeuristicA.txt
a.mtx#3#5.1#10#5%#column#cyclic#heuristicA#1
a.mtx#1#5.3#10#6%#column#cyclic#heuristicA#2
b.mtx#4#7.1#10#5%#column#cyclic#heuristicA#1
b.mtx#3#5.7#10#9%#column#cyclic#heuristicA#2
b.mtx#5#4.1#10#3%#column#cyclic#heuristicA#3
I have a scheme file to describe the contents of these files. This file has a line describing type and meaning of each column in the result files:
str MatrixName
int Speedup
double Time
int RepetationCount
double Imbalance
str Parameter1
str Parameter2
str Parameter3
int ExperimentId
I need to display average Time and two types of parameters as follows: (numbers in the following table are random)
Parameter1 Parameter2
Matrix row col cyclic block
a.mtx 4.3 5.2 4.2 5.4
b.mtx 2.1 6.3 8.4 3.3
Is there an advanced and sophisticated tool that gets the scheme of the table above and generates this table automatically? Currently I have a tool written in Java to process raw files and Latex code to manipulate and display the table using pgfplotstable. However, I need one tool that is more professional. I do not want pivot tables of MS Excel.
A similar question is here.
Manipulating large amounts of data in an unknown format is...challenging for a generic program. Your best bet is probably similar to what you're doing already. Use a custom program to reformat your results into something easier to handle (backend), and a visualisation program of your choice to let you view and play around with the data (frontend).
Backend
For your problem I'd suggest a relational database (e.g.Mysql). Has a longer setup time than other options, but if this is an ongoing problem it should be worthwhile, as it allows you to easily pull fields of interest.
SELECT AVG(Speedup) FROM results WHERE Parameter1="column" AND Parameter2="cyclic" for example. You'll then still need a simple script to insert your data in the first place, and then to pull the results of interest in a useful format you can stick into your viewer. Or if you so desire you can just run queries directly against the db.
Alternatively, what I usually use is just Python or Perl. Read in your data files, strip the data you don't want, rearrange into the desire structure, and write out to some standard format your frontend and use. Replace Python/Perl with the language of your choice.
Frontend
Personally, I almost always use Excel. The backend does most of the heavy lifting, so I get a csv file with the results I care about all nicely ordered already. Excel then lets me play around with the data, doing stuff like taking averages, plotting, reordering, etc fairly simply.
Other tools I use to display stuff which are probably not useful for you, but included for completeness include:
Weka - Mostly machine learning targetted, but provides tools for searching for trends or correlations. Useful to play around with data looking for things of interest.
Python/IDL/etc - For when I need data that can't be represented by a spreadsheet. These programs can, in addition to doing the backend's job of extracting and bulk manipulations, generate difference images, complicated graphs, or whatever else I need.

How to save R list object to a database?

Suppose I have a list of R objects which are themselves lists. Each list has a defined structure: data, model which fits data and some attributes for identifying data. One example would be time series of certain economic indicators in particular countries. So my list object has the following elements:
data - the historical time series for economic indicator
country - the name of the country, USA for example
name - the indicator name, GDP for example
model - ARIMA orders found out by auto.arima in suitable format, this again may be a list.
This is just an example. As I said suppose I have a number of such objects combined into a list. I would like to save it into some suitable format. The obvious solution is simply to use save, but this does not scale very well for large number of objects. For example if I only wanted to inspect a subset of objects, I would need to load all of the objects into memory.
If my data is a data.frame I could save it to database. If I wanted to work with particular subset of data I would use SELECT and rely on database to deliver the required subset. SQLite served me well in this regard. Is it possible to replicate this for my described list object with some fancy database like MongoDB? Or should I simply think about how to convert my list to several related tables?
My motivation for this is to be able to easily generate various reports on the fitted models. I can write a bunch of functions which produce some report on a given object and then just use lapply on my list of objects. Ideally I would like to parallelise this process, but this is a another problem.
I think I explained the basics of this somewhere once before---the gist of it is that
R has complete serialization and deserialization support built in, so you can in fact take any existing R object and turn it into either a binary or textual serialization. My digest package use that to turn the serialization into hash using different functions
R has all the db connectivity you need.
Now, what a suitable format and db schema is ... will depend on your specifics. But there is (as usual) nothing in R stopping you :)
This question has been inactive for a long time. Since I had a similar concern recently, I want to add the pieces of information that I've found out. I recognise these three demands in the question:
to have the data stored in a suitable structure
scalability in terms of size and access time
the possibility to efficiently read only subsets of the data
Beside the option to use a relational database, one can also use the HDF5 file format which is designed to store a large amount of possible large objects. The choice depends on the type of data and the intended way to access it.
Relational databases should be favoured if:
the atomic data items are small-sized
the different data items possess the same structure
there is no anticipation in which subsets the data will be read out
convenient transfer of the data from one computer to another is not an issue or the computers where the data is needed have access to the database.
The HDF5 format should be preferred if:
the atomic data items are themselves large objects (e.g. matrices)
the data items are heterogenous, it is not possible to combine them into a table like representation
most of the time the data is read out in groups which are known in advance
moving the data from one computer to another should not require much effort
Furthermore, one can distinguish between relational and hierarchial relationships, where the latter is contained in the former. Within a HDF5 file, the information chunks can be arranged in a hierarchial way, e.g.:
/Germany/GDP/model/...
/Germany/GNP/data
/Austria/GNP/model/...
/Austria/GDP/data
The rhdf5 package for handling HDF5 files is available on Bioconductor. General information on the HDF5 format is available here.
Not sure if it is the same, but I had some good experience with time series objects with:
str()
Maybe you can look into that.

Datasets to test Nonlinear SVM

I'm implementing a nonlinear SVM and I want to test my implementation on a simple not linearly separable data. Google didn't help me find what I want. Can you please advise me where I can find such data. Or at least, how can I generate such data manually ?
Thanks,
Well, SVMs are two-class classifiers--i.e., these classifiers place data on either side of a single decision boundary.
Therefore, i would suggest a data set comprised of just two classes (that's not strictly necessary because of course an SVM can separate more than two classes by passing the Classifier multiple times (in series) over the data, it's cumbersome to do this during initial testing).
So for instance, you can use the iris data set, linked to in Scott's answer; it's comprised of three classes, Class I is linear separable from Class II and III; Class II and III are not linear separable. If you want to use this data set, for convenience-sake you might prefer to remove Class I (approx. the first 50 data rows), so what remains is a two-class system, in which the two remaining classes are not linearly separable.
The iris data set is quite small (150 x 4, or 50 rows/class x four features)--depending where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
An interesting family of data sets that are comprised of just two classes and that are definitely non-linearly separable are the the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large data sets with quite a few features yet still comprised of just two non-linearly separable classes.
I am aware of two places from which you can retrieve this data. The first Site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (row) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.
The second source for this data, contains two eHarmony data sets, one of them is comprised of over half million rows and 59 features. In addition, these two data sets have undergone substantial processing such that the only task required before feeding them to your SVM is routine rescaling of the features.
The particular data set you need will depend highly on your choice of kernel function, so It seems the easiest method is simply creating a toy data set yourself.
Some helpful ideas:
Concentric circles
Spiral-shaped classes
Nested banana-shaped classes
If you just want a random data set which is not linearly separable, may I suggest the Iris dataset? It is a multivariate data set where at least a couple of the classes in question are not linearly separable.
Hope this helps!
You can start with simple datasets like Iris or two-moons both of which are linearly non-separable. Once you are satisfied, you can move on to bigger datasets from the UCI ML repository, classification datasets.
Be sure to compare and benchmark against standard SVM solvers like libSVM and SVM-light.
If you program in Python, you can use a few functions in the package of sklearn.datasets.samples_generator to manully generate nested moon-shape data set, concentric circle data set etc. Here is a page of plots of these data sets.
And if you don't want to generate data set manually, you can refer to this website, where in the seciton of "shape sets", you can download these data set and test on them directly.

Is this data appropriate for keeping in a database?

In relation to my previous question where I was asking for some database suggestions; it just occured to me that I don't even know if what I'm trying to store there is appropriate for a database. Or should some other data storage method be used.
I have some physical models testing (let's say wind tunnel data; something similar) where for every model (M-1234) I have:
name (M-1234)
length L
breadth B
height H
L/B ratio
L/H ratio
...
lot of other ratios and dimensions ...
force versus speed curve given in the form of a lot of points for x-y plotting
...
few other similar curves (all of them of type x-y).
Now, what I'm trying to accomplish is store that in some reasonable way, so that the user who will be using the database can come and see what are the closest ten models to L/B=2.5 (or some similar demand). Then for that, somehow get all the data of those models, including the curve data (in a plain text file format).
Is a sql database (or any other, for that matter) an appropriate way of handling something like this ? Or should I take some other approach ?
I have about a month to finish this, and in that time I have to learn enough about databases as well, so ... give your suggestions, please, bearing that in mind. Assume no previous knowledge on the subject, whatsoever.
I think what you're looking for is possible. I'm using Postgresql here, but any database should work. This is my test database
CREATE TABLE test (
id serial primary key,
ratio double precision
);
COPY test (id, ratio) FROM stdin;
1 0.29999999999999999
2 0.40000000000000002
3 0.59999999999999998
4 0.69999999999999996
.
Then, to find the nearest values to a particular ratio
select id,ratio,abs(ratio-0.5) as score from test order by score asc limit 2;
In this case, I'm looking for the 2 nearest to 0.5
I'd probably do a datamodel where you have one table for the main data, the ratios and so on, and then a second table which holds the curve points, as I'm assuming that the curves aren't always the same size.
Yes, a database is probably the best approach for this.
A relational database (which usually uses SQL for data access) is suitable for data that is more or less structured as tables.
To give you an idea:
You could have a main table model with fields name, width etc. . Then subtable(s) for any values which can appear more than once, which refers back to model (look up "foreign key").
Then a subtable for your actual curves, again refering back to model.
How to actually model the curves in the DB I don't know, as I don't know how you model them. But if its lots of numbers, it can go into the DB.
It seems you know little about relational DBMS. Consider reading something on WIkipedia, or doing a few simple DBMS tutorials (PostgreSQL has some: http://www.postgresql.org/docs/8.4/interactive/tutorial.html , but there are many others). Then pick a DBMS for trying out (PostgreSQL is probably not a bad choice, but again there are many others).
Then try implementing a simple table schema, and get back to us with any detail questions (which you'll probably have).
One more thing: Those questions are probably more appropriate to serverfault.com.
This is arguably scientific data: you might find libraries/formats intended for arbitrary scientific data useful: HDF5 http://www.hdfgroup.org/ (note I am not an expert)

Resources