Does logistic regression work with a categorical dependent variable and a mix of continuous and categorical independent variables?
Does it matter whether the covariates are continuous or categorical? I keep getting a warning about redundancies; even after I checked for multicollinearity and removed the highly correlated variables, the warnings persist.
[screenshot: the asker's SPSS Logistic Regression dialog]
The procedure works with binary response (dependent) variables, and independent variables can be scale ("continuous") or categorical. String independent variables are automatically treated as categorical, while numeric ones have to be specified as categorical by clicking the Categorical button and moving them from the source list to the target list in that sub-dialog box.
Redundancies are not indicative of merely highly correlated variables, but of perfect linear dependencies, typically among categorical variables. The procedure creates k-1 dummy or contrast-coded variables to represent a k-level categorical independent variable, with the specific codings determined by your choice of contrast type(s). The codings are displayed in the Categorical Variables Codings table. If the status of every case on any of these variables can be determined from the values of the others (or of the continuous variables), that's a redundancy (a linear dependency). The same is true for continuous variables, though this is less common, usually occurring only if something like a copy of a variable gets included by mistake, or one or more variables are computed as linear combinations of others and all are included as independent variables.
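To see what such a dependency looks like concretely, here is a minimal sketch, in Python rather than SPSS (purely illustrative; the factor and its levels are made up): with an intercept in the model, including all k dummies for a k-level factor makes the design matrix rank-deficient, which is exactly the kind of redundancy being flagged.

```python
import numpy as np

# Hypothetical 3-level factor observed for six cases
levels = np.array(["a", "b", "c", "a", "b", "c"])

intercept = np.ones(6)
d_a = (levels == "a").astype(float)  # dummy for level a
d_b = (levels == "b").astype(float)  # dummy for level b
d_c = (levels == "c").astype(float)  # dummy for level c

# k-1 dummies plus intercept: full column rank, no redundancy
X_ok = np.column_stack([intercept, d_a, d_b])
print(np.linalg.matrix_rank(X_ok))   # 3 == number of columns

# all k dummies plus intercept: d_a + d_b + d_c equals the intercept,
# so one column is a linear combination of the others
X_bad = np.column_stack([intercept, d_a, d_b, d_c])
print(np.linalg.matrix_rank(X_bad))  # 3 < 4 columns -> redundancy
```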
Your screen shot doesn't show any independent variables specified as categorical, but I can't tell if there are others also used that might be. If not, then you could try running a linear regression model to diagnose the issue(s). Use the same dependent variable and independent variables and try the forced entry method (the default), and if there are linear dependencies among the predictors, at least one won't be included in the linear model.
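If you'd rather hunt down the offending column programmatically, a generic approach (sketched here in Python with a hypothetical helper; any matrix language works the same way) is to add predictors one at a time and flag any column that fails to raise the rank of the design matrix:

```python
import numpy as np

def find_redundant_columns(X, names):
    """Return names of columns that are linear combinations of earlier ones."""
    redundant = []
    kept = np.empty((X.shape[0], 0))
    for j, name in enumerate(names):
        candidate = np.column_stack([kept, X[:, j]])
        if np.linalg.matrix_rank(candidate) == kept.shape[1]:
            redundant.append(name)  # rank did not increase -> dependent column
        else:
            kept = candidate
    return redundant
```

Run on the X_bad matrix above with names ["intercept", "a", "b", "c"], this flags "c" as the dependent column.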
One way dependencies can arise is having more predictors than cases, which can happen unexpectedly if you have a lot of missing data, since logistic regression drops any case with a missing value on the dependent variable or any independent variable.
I'm currently writing my master's thesis (I'm using SPSS for statistical analyses). One of my calculations is a logistic regression. These are the variables:
dependent variable: occupation (dichotomous, 1 = person has a job, 0 = person is unemployed)
independent variable 1: self-stigmatization (mean score from a questionnaire, between 1 and 4, continuous)
Now my problem is that apparently there is no linear relationship between my independent variable and the logit (log-odds) of my dependent variable (checked using the Box-Tidwell method). Obviously it's possible that there is simply no relationship between the two constructs in my data, but I've been asking myself whether there is another way to model the relationship between these two variables if the assumptions of logistic regression are not met. I just don't want to miss a valid (well, actually better-fitting) option that I didn't know of yet...
Does anyone know a method or have any literature tips? Thanks for the help!
We need more information on the distribution of each of these variables and the number of cases. One thought is whether transforming your independent variable might yield better results. If the mean score is roughly normally distributed, could you split it into quartiles (a quick sketch follows below) and see whether you get a different/significant result? Additionally, you could group your sample by another variable in your dataset and see whether relationships emerge.
I would make this a comment but still need only one more point to do so!
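In case it helps, here is what the quartile idea looks like in practice; a minimal Python/pandas sketch (purely illustrative: the column name and values are made up, and in SPSS itself the Visual Binning dialog does the same job):

```python
import pandas as pd

# Hypothetical self-stigmatization mean scores (scale 1-4)
df = pd.DataFrame({"self_stigma": [1.2, 2.8, 3.5, 1.9, 2.2, 3.9, 1.1, 2.6]})

# Split into quartiles: labels 0 (lowest quartile) through 3 (highest)
df["self_stigma_q"] = pd.qcut(df["self_stigma"], q=4, labels=False)
print(df)
```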
I have what sounds like a typical bin-packing problem: x products of differing sizes need to be packed into y containers of differing capacities, minimizing the number of containers used, as well as minimizing the wasted space.
I can simplify the problem in that product sizes and container capacities can be reduced to standard one-dimensional units, e.g. this product is 1 unit big while that one is 3 units; this box holds 6 units, that one 12. Think of eggs and cartons, or cases of beer.
But there's an additional constraint: each container has a particular attribute (we'll call it colour), and each product has a set of colours it is compatible with. There is no correlation between colour and product/container sizing; one product may be colour-compatible with the entire palette, while another may only be compatible with red containers.
Is this problem variant already described in literature? If so, what is its name?
I think there is no special name for this variant. Although the colouring constraint at first gives the impression that this is related to graph coloring, it's not. It's simply a limitation on the values a variable can take.
In a typical solver implementation, each product (= item) has a variable indicating which container it's assigned to. The colour constraint just reduces the value range of a specific variable. So instead of specifying that all variables share the same value range, make it variable-specific. (For example, in OptaPlanner this is the difference between a value range provided by the solution generally or by the entity specifically.) So the colouring constraint doesn't even need to be a constraint: in most solvers it can be part of the model.
Any solver that can handle bin packing should be able to handle this variant. Your problem is actually a relaxation of the ROADEF 2012 Machine Reassignment problem, which is about assigning processes to computers. Simply drop all the constraints except for one resource-usage constraint and the constraint that excludes certain processes from certain machines. That use case is implemented in many solvers. (Although, in practice, it is probably easier to start from a basic bin-packing example such as Cloud Balancing.)
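To make the "colour is just a restricted value range" point concrete, here is a tiny first-fit-decreasing heuristic in Python that simply skips incompatible containers. A toy sketch with made-up data, not an exact solver; a real solver would explore the assignment space rather than commit greedily:

```python
# First-fit decreasing with a colour-compatibility filter (toy sketch).
# Each product: (size, set of compatible colours); each container: capacity + colour.
products = [(3, {"red", "blue"}), (1, {"red"}), (2, {"blue"}), (2, {"red", "blue"})]
containers = [{"capacity": 6, "colour": "red"}, {"capacity": 4, "colour": "blue"}]

for c in containers:
    c["used"] = 0
    c["items"] = []

# Pack big items first; place each in the first compatible container that fits.
for size, colours in sorted(products, key=lambda p: -p[0]):
    for c in containers:
        if c["colour"] in colours and c["used"] + size <= c["capacity"]:
            c["used"] += size
            c["items"].append(size)
            break
    else:
        print(f"no container found for item of size {size}")

for c in containers:
    print(c["colour"], c["items"], f"waste={c['capacity'] - c['used']}")
```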
Most likely 2d bin-packing or classic knapsack problem.
Is a java.util.UUID unique across files?
For example:
We have a file IDFile1.java that generates a random ID using UUID.randomUUID().toString().
And we have a second file, IDFile2.java, which generates another ID.
Can the IDs from these two files collide with each other?
Is there any way to "give back" a used ID generated by java.util.UUID, meaning that this ID could be used again?
The purpose of functions like randomUUID is to munge together various pieces of information which are likely to have some amount of randomness such that two UUIDs generated at different times or in different places would have an extremely small likelihood of matching unless all sources of potential randomness happened to yield identical results. No effort is made to keep track of which UUIDs have or have not been issued; instead, the goal is to have enough randomness that the probability of an unintentional match will be small relative to e.g. the probability of a computer being smashed to a million pieces by a meteor strike.
Note that the system may use "number of UUIDs issued" as part of its UUID calculation, but that would only be one of many factors that go into it. The purpose of such a counter isn't to allow one to "go back" to a previous UUID, but rather to ensure that if e.g. two UUIDs are requested nearly simultaneously without any source of randomness having become available between them, the two requests will yield different values.
UUIDs are unrelated to the file that generates them. It doesn't matter which file generates them; they almost certainly will not be the same.
For question 2: the way UUIDs are generated doesn't really allow them to be reclaimed in any meaningful way. They are usually generated from some information about your computer, the current time, and other sources. The Java algorithm uses a cryptographically secure random number generator, and the results are known as type 4 UUIDs.
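To put a number on "extremely small likelihood": a type 4 UUID contains 122 random bits (6 of the 128 bits are fixed by the version and variant fields), so by the standard birthday approximation the chance of at least one collision among $n$ generated UUIDs is roughly

$$p \approx \frac{n(n-1)}{2^{123}}$$

so even after generating a billion UUIDs, $p$ is on the order of $10^{-19}$.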
I have a very large dataset in a csv file (1,700,000 rows and 300 sparse features).
- It has a lot of missing values.
- The data mixes numeric and categorical values.
- The dependent variable (the class) is binary (either 1 or 0).
- The data is highly skewed; the number of positive responses is low.
Now what is required of me is to apply a regression model and other machine learning algorithms to this data.
I'm new to this and I need help.
- How do I deal with categorical data in a regression model? And do the missing values affect it much?
- What is the best prediction model I can try for large, sparse, skewed data like this?
- What program would you advise me to work with? I tried Weka but it can't even open that much data (memory failure). I know that Matlab can open either a numeric csv or a categorical csv but not a mixed one, and the missing values have to be imputed before it will open the file. I know a little bit of R.
I'm trying to manipulate the data using Excel, Access, and Perl scripts, and that's really hard with this amount of data. Excel can't open more than about 1M records and Access can't open more than 255 columns. Any suggestions?
Thank you in advance for your help.
First of all, you are talking about classification, not regression: classification predicts a value from a fixed set (e.g. 0 or 1), while regression produces a real-valued output (e.g. 0, 0.5, 10.1543, etc.). Also, don't be confused by so-called logistic regression: it is a classifier too, and its name just reflects that it is built on linear regression.
To process such a large amount of data you need an inductive (updatable) model. In particular, Weka offers a number of such algorithms in its classification section (e.g. Naive Bayes Updateable and others implementing the UpdateableClassifier interface). With an inductive model you can load the data portion by portion and update the model accordingly (for Weka, see the Knowledge Flow interface for an easier way to set this up).
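The same incremental idea exists outside Weka. For instance (purely an illustration, since you mention R and Matlab rather than Python), scikit-learn exposes it via partial_fit, so you can stream the csv in chunks; the file name and label column below are made up:

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic-regression-style linear model trained incrementally.
# (loss name per recent scikit-learn versions; older ones used loss="log")
clf = SGDClassifier(loss="log_loss")

# Assumes features are already numeric; see the encoding note below.
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    y = chunk.pop("label")
    clf.partial_fit(chunk.values, y.values, classes=[0, 1])
```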
Some classifiers can work with categorical data directly, but I can't recall any updatable ones among them, so most probably you will still need to transform categorical data to numeric. The standard solution here is to use indicator attributes, i.e. substitute every categorical attribute with several binary indicators. E.g. if you have an attribute day-of-week with 7 possible values, you can substitute it with 7 binary attributes: Sunday, Monday, etc. Of course, in each particular instance only one of the 7 attributes may hold the value 1; all others must be 0.
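For illustration, the day-of-week example looks like this in Python/pandas (a sketch; the column name is made up):

```python
import pandas as pd

df = pd.DataFrame({"day_of_week": ["Sun", "Mon", "Tue", "Sun"]})

# One binary indicator column per category; exactly one is 1 in each row
indicators = pd.get_dummies(df["day_of_week"], prefix="dow")
print(indicators)
```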
The importance of missing values depends on the nature of your data. Sometimes it is worth replacing them with some neutral value beforehand; sometimes the classifier implementation handles them itself (check the documentation of the algorithm for details).
And finally, for highly skewed data, use the F1 measure (or just precision/recall) instead of accuracy.
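For reference, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R); in Python, for example, scikit-learn computes all three directly (toy labels below):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 0, 1]   # skewed: few positives
y_pred = [0, 0, 0, 1, 1, 0, 0, 1]

print(precision_score(y_true, y_pred))  # 2/3: of 3 predicted positives, 2 correct
print(recall_score(y_true, y_pred))     # 2/3: of 3 true positives, 2 found
print(f1_score(y_true, y_pred))         # 2/3: harmonic mean of the two
```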
I'm implementing a nonlinear SVM and I want to test my implementation on simple data that is not linearly separable. Google didn't help me find what I want. Can you please advise me where I can find such data? Or at least, how can I generate such data manually?
Thanks,
Well, SVMs are two-class classifiers--i.e., these classifiers place data on either side of a single decision boundary.
Therefore, I would suggest a data set comprised of just two classes. (That's not strictly necessary, because an SVM can of course separate more than two classes by passing the classifier multiple times, in series, over the data, but it's cumbersome to do this during initial testing.)
So for instance, you can use the iris data set, linked to in Scott's answer; it's comprised of three classes. Class I is linearly separable from Classes II and III; Classes II and III are not linearly separable from each other. If you want to use this data set, for convenience's sake you might prefer to remove Class I (approximately the first 50 data rows), so what remains is a two-class system in which the two remaining classes are not linearly separable.
The iris data set is quite small (150 x 4, or 50 rows/class x four features); depending on where you are with your SVM prototype testing, this might be exactly what you want, or you might want a larger data set.
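If you happen to be working in Python, dropping Class I takes a couple of lines with scikit-learn (an illustration only; any environment that can slice the csv works just as well):

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Keep only classes 1 and 2 (versicolor vs. virginica),
# which are not linearly separable from each other
mask = y != 0
X2, y2 = X[mask], y[mask]
print(X2.shape)  # (100, 4)
```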
An interesting family of data sets that are comprised of just two classes and that are definitely not linearly separable are the anonymized data sets supplied by the mega-dating site eHarmony (no affiliation of any kind). In addition to the iris data, I like to use these data sets for SVM prototype evaluation because they are large, with quite a few features, yet still comprised of just two non-linearly-separable classes.
I am aware of two places from which you can retrieve this data. The first site has a single data set (PCI Code downloads, chapter9, matchmaker.csv) comprised of 500 data points (rows) and six features (columns). Although this set is simpler to work with, the data is more or less in a 'raw' form and will require some processing before you can use it.
The second source for this data contains two eHarmony data sets, one of which is comprised of over half a million rows and 59 features. In addition, these two data sets have undergone substantial processing, such that the only task required before feeding them to your SVM is routine rescaling of the features.
The particular data set you need will depend highly on your choice of kernel function, so it seems the easiest approach is simply creating a toy data set yourself.
Some helpful ideas (a quick generation sketch follows the list):
Concentric circles
Spiral-shaped classes
Nested banana-shaped classes
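A minimal sketch of generating two of these shapes with scikit-learn in Python (assuming you have it installed; otherwise they are easy to produce with plain trigonometry plus noise):

```python
from sklearn.datasets import make_circles, make_moons

# Concentric circles: two classes, one ring inside the other
X_circ, y_circ = make_circles(n_samples=200, noise=0.05, factor=0.5)

# Interleaved half-moons: a classic "banana-shaped" two-class set
X_moon, y_moon = make_moons(n_samples=200, noise=0.1)

print(X_circ.shape, X_moon.shape)  # (200, 2) (200, 2)
```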
If you just want a random data set which is not linearly separable, may I suggest the Iris dataset? It is a multivariate data set where at least a couple of the classes in question are not linearly separable.
Hope this helps!
You can start with simple datasets like Iris or two-moons both of which are linearly non-separable. Once you are satisfied, you can move on to bigger datasets from the UCI ML repository, classification datasets.
Be sure to compare and benchmark against standard SVM solvers like libSVM and SVM-light.
If you program in Python, you can use a few functions in the sklearn.datasets.samples_generator package to manually generate a nested moon-shaped data set, a concentric-circle data set, etc. Here is a page of plots of these data sets.
And if you don't want to generate a data set manually, you can refer to this website, where in the section "shape sets" you can download these data sets and test on them directly.