Binomial logistic regression or binomial GAM? - logistic-regression

I have presence/absence data of a species as my dependent variable and climatic data for my independent variables.
The presence samples fall within a particular climatic range, with the absences generally lying on either side of it.
For the regression, is it better to use a binomial logistic regression or a binomial generalized additive model (GAM)?
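For intuition on the choice (this is not from the original question): when presence is concentrated in the middle of a climatic gradient, a logit that is linear in the predictor can only produce a monotone response, so it cannot capture the rise-and-fall pattern; a binomial GAM, or even just a quadratic term in an ordinary logistic regression, can. Below is a minimal Python sketch of that comparison; the variable names and the simulated data are made up, and statsmodels is just one possible tool.

    # Illustrative sketch: a linear logit vs. a logit with a quadratic term for a species
    # whose presence peaks inside a climatic range. Data and names are simulated/made up.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    temperature = rng.uniform(0, 30, 500)                      # hypothetical climate variable
    p_true = 1 / (1 + np.exp(-(-8 + 1.2 * temperature - 0.04 * temperature**2)))
    presence = rng.binomial(1, p_true)                         # simulated presence/absence

    # Linear logit: monotone in temperature, so it cannot fit a unimodal range well.
    fit_lin = sm.Logit(presence, sm.add_constant(temperature)).fit(disp=0)

    # A quadratic term lets the logit peak inside the range; a binomial GAM (e.g. mgcv's
    # gam() in R or pygam's LogisticGAM in Python) generalises this to a data-driven smooth.
    X_quad = sm.add_constant(np.column_stack([temperature, temperature**2]))
    fit_quad = sm.Logit(presence, X_quad).fit(disp=0)

    print(fit_lin.aic, fit_quad.aic)   # the quadratic model should fit much better here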

Related

How do I modify the coefficients of a logistic regression model in Weka?

I have previously trained a logistic regression classifier on the Iris data set, and saved the resulting model to a file named iris.model.
I now load the model into the Weka Explorer.
How do I edit the coefficients of this model? For example, I want to change Iris-setosa's sepallength coefficient from 21.8065 to 19.
You can't. Weka's classifiers are data driven and don't offer post-build fine-tuning or manual modifications.
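If the goal is only to obtain scores under hand-edited coefficients, one workaround outside Weka (and not Weka's API) is to apply the logistic formula yourself. A simplified sketch follows; the coefficient values, feature names, and two-class setup are placeholders, whereas Weka's Iris model is actually multinomial over three classes.

    # Simplified sketch (not Weka's API): scoring with hand-edited logistic-regression
    # coefficients. Values and feature names are placeholders; Weka's Iris model is
    # really a multi-class (multinomial) logistic regression.
    import math

    coef = {"intercept": -30.0, "sepallength": 19.0, "sepalwidth": 5.2}  # edited by hand

    def prob_positive(sepallength, sepalwidth):
        z = (coef["intercept"]
             + coef["sepallength"] * sepallength
             + coef["sepalwidth"] * sepalwidth)
        return 1.0 / (1.0 + math.exp(-z))    # logistic (sigmoid) link

    print(prob_positive(5.1, 3.5))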

logistic regression assumption of linearity of logit not met (SPSS)

I'm currently writing my master's thesis (I'm using SPSS for statistical analyses). One of my calculations is a logistic regression. These are the variables:
dependent variable: occupation (dichotomous, 1=yes, person has a job, 0= person is unemployed)
independent variable 1: self-stigmatization (mean score of a questionnaire, between 1 and 4, continuous).
Now my problem is that apparently there is no linear relationship between my independent variable and the logit of my dependent variable (checked using the Box-Tidwell method). Obviously it's possible that there is simply no relationship between the two constructs in my data, but I've been asking myself whether there is another way to fit a regression between these two variables when the assumptions of logistic regression are not met. I just don't want to miss a valid (well, actually better fitting) option that I didn't know of yet...
Does anyone know a method or have any literature tips? Thanks for the help!
We need more info on the distributions of each of these variables and number of cases. One thought is whether transforming your independent variable might yield better results. If the mean value is normal, could you transform it into quartiles and see if you get a different/significant result? Additionally, you could group your sample by another variable in your dataset and see if relationships arise.
I would make this a comment but still need only one more point to do so!
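To make the quartile suggestion above concrete (sketched in Python rather than SPSS, with made-up data and placeholder column names): binning the continuous predictor into quartiles and entering it as a categorical variable removes the linearity-of-the-logit assumption entirely.

    # Sketch of the quartile idea: bin the continuous predictor and fit a logistic
    # regression on the bins, so no linearity of the logit is assumed.
    # The data and the column names (self_stigma, employed) are made up.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    df = pd.DataFrame({"self_stigma": rng.uniform(1, 4, 200)})
    p = 1 / (1 + np.exp(-(2.5 - 1.0 * df["self_stigma"])))      # made-up true relationship
    df["employed"] = rng.binomial(1, p)

    df["stigma_q"] = pd.qcut(df["self_stigma"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
    fit = smf.logit("employed ~ C(stigma_q)", data=df).fit(disp=0)
    print(fit.summary())                           # one coefficient per quartile vs. Q1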

Decision Tree Categorical and Continuous Variable

I'm new to data science and currently trying to learn and understand the decision tree algorithm. I have a question about how the algorithm works when we have continuous variables in a classification problem and categorical variables in a regression problem.
Usually the algorithm selects splits using the Gini index in classification problems and variance reduction in regression problems.
But when it comes to a continuous variable in a classification problem, how does the algorithm handle it when choosing the best split (the one that most reduces Gini impurity)? And vice versa for a categorical variable in a regression problem.
Thanks in advance :)
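As a rough illustration of what most CART-style implementations do with a continuous feature in classification (a sketch, not taken from any particular library): sort the values, consider a threshold between each pair of distinct consecutive values, and keep the threshold whose split has the lowest weighted Gini impurity, i.e. the largest impurity reduction. For regression the same threshold search is used, but candidates are scored by the reduction in variance of the target instead.

    # Sketch of a threshold search for one continuous feature under the Gini criterion.
    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def best_threshold(x, y):
        order = np.argsort(x)
        x, y = x[order], y[order]
        best = (None, np.inf)
        for i in range(1, len(x)):
            if x[i] == x[i - 1]:
                continue                              # only split between distinct values
            thr = (x[i] + x[i - 1]) / 2.0             # candidate threshold
            left, right = y[:i], y[i:]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[1]:
                best = (thr, score)
        return best                                   # (threshold, weighted Gini impurity)

    x = np.array([2.5, 1.0, 3.7, 0.5, 4.2, 2.9])
    y = np.array([0, 0, 1, 0, 1, 1])
    print(best_threshold(x, y))                       # -> (2.7, 0.0): a perfect split here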

Vowpal Wabbit: unbalanced classes

I would like to perform Logistic Regression using Vowpal Wabbit. How can I handle imbalanced classes (e.g. 1000/50000)? I know that I can use importance weighting but I'm not sure this is the best option in this case. There also exist some algorithms like SMOTE but I don't know how to use them in Vowpal Wabbit.
Yes, importance weighting is the solution for imbalanced classes in Vowpal Wabbit. The most important question is what your final evaluation criterion is. Is it the area under the ROC curve (AUC)? See Calculating AUC when using Vowpal Wabbit and How to perform logistic regression using vowpal wabbit on very imbalanced dataset (there, see both answers).
SMOTE seems to be a combination of over-sampling the minority class and under-sampling the majority class, where the over-sampling is done by generating synthetic examples interpolated from, e.g., the 5 nearest neighbors of a minority example. This method is not implemented in Vowpal Wabbit and it is not compatible with online learning (because of the nearest-neighbor search), although it could probably be approximated in an online fashion somehow.
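To make the importance-weighting option concrete: Vowpal Wabbit's input format accepts an optional importance weight right after the label ("label weight | features"), so one common recipe is to up-weight the minority class. Below is a small sketch that writes such a file; the feature names are made up, and the weight of 50 simply mirrors the 1000 vs. 50000 ratio in the question.

    # Sketch: writing a VW training file with importance weights for the rare class.
    # Feature names are placeholders; the 50x weight mirrors the 1000/50000 imbalance.
    def to_vw_line(label, features, weight=1.0):
        feats = " ".join(f"{name}:{value}" for name, value in features.items())
        return f"{label} {weight} | {feats}"

    examples = [
        (1,  {"x1": 0.3, "x2": 1.7}),    # rare positive example
        (-1, {"x1": 0.9, "x2": 0.2}),    # common negative example
    ]

    with open("train.vw", "w") as f:
        for label, feats in examples:
            weight = 50.0 if label == 1 else 1.0
            f.write(to_vw_line(label, feats, weight) + "\n")

    # Then train with the logistic loss, e.g.:
    #   vw -d train.vw --loss_function logistic -f model.vw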

Use case for incremental supervised learning using apache mahout

Business case:
Forecasting fuel consumption at site.
Say fuel consumption C depends on various factors x1, x2, ..., xn; mathematically speaking, C = F(x1, x2, ..., xn). I do not have an explicit equation for F.
I do have a historical dataset from which I can get a correlation of C to x1, x2, etc. C, x1, x2, ... are all quantitative. Finding such a correlation for an n-variable equation seems tough for a person like me with limited statistical knowledge.
So, I was thinking of employing some supervised machine learning techniques for this. I will train a classifier on the historical data to get a prediction for the next consumption.
Question: Am I thinking in the right way?
Question: If this is correct, my system should be an evolving one. So the more real data I feed to the system, the more my model should evolve to make better predictions the next time. Is this a correct understanding?
If the above statements are true, will the AdaptiveLogisticRegression algorithm, as present in Mahout, be of help to me?
Requesting advice from the experts here!
Thanks in advance.
OK, correlation is not a forecasting model. Correlation simply describes some relationship between the variables based on their covariance.
In order to develop a forecasting model, what you need to perform is regression.
The simplest form of regression is linear univariate, where C = F(x1). This can easily be done in Excel. However, you state that C is a function of several variables. For this, you can employ linear multivariate regression. There are standard packages that can perform this (within Excel, for example), or you can use Matlab, etc.
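As one concrete (non-Excel) way to do the multivariate case just described, here is a minimal ordinary-least-squares sketch in Python with made-up numbers for C and three predictors:

    # Minimal sketch of linear multivariate regression, with made-up data for
    # fuel consumption C and predictors x1..x3.
    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 3))                                    # observed x1, x2, x3
    C = 5.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.3, 200)

    # Ordinary least squares: intercept plus one coefficient per factor.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, C, rcond=None)
    print(coef)                                   # roughly [5.0, 2.0, -1.5, 0.5]

    # Forecast consumption for a new set of factor values.
    x_new = np.array([1.0, 0.2, -0.4])
    print(coef[0] + coef[1:] @ x_new)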
Now, we are assuming that there is a "linear" relationship between C and the components of X (the input vector). If the relationship were not linear, then you would need more sophisticated methods (nonlinear regression), which may very well employ machine learning methods.
Finally, some series exhibit auto-correlation. If this is the case, then it may be possible for you to ignore the C = F(x1, x2, x3...xn) relationships, and instead directly model the C function itself using time-series techniques such as ARMA and more complex variants.
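And if the consumption history does turn out to be autocorrelated, the time-series route just mentioned can be sketched as follows; the simulated series and the (1, 0, 1) order are arbitrary placeholders, not a recommendation.

    # Sketch of the time-series route: fit an ARMA-type model directly to the history.
    # The simulated series and the (1, 0, 1) order are arbitrary examples.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(7)
    noise = rng.normal(0, 1, 120)
    consumption = np.empty(120)
    consumption[0] = 100.0
    for t in range(1, 120):                        # simple AR(1) process around 100
        consumption[t] = 100 + 0.7 * (consumption[t - 1] - 100) + noise[t]

    result = ARIMA(consumption, order=(1, 0, 1)).fit()   # ARMA(1, 1) with a constant
    print(result.forecast(steps=3))                      # forecast the next three periods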
I hope this helps,
Srikant Krishna
