statsmodels logit results very different from R and Stata

I am fitting a logistic regression using SAT scores to predict a binary outcome; the bivariate correlation is 0.17. Stata and R (aod package) both give a logit coefficient of 0.004, but statsmodels (Python) gives -0.0013 (I have tried both MLE and IRLS). There is no missing data, and the number of observations is exactly the same across all three platforms: the same .csv file is used in each case.
R:
Call:
glm(formula = df$outcome ~ df$sat, family = "binomial", data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7527 -0.5911 -0.4778 -0.3406 3.0509
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.758e+00 6.274e-02 -123.7 <2e-16 ***
df$sat 4.151e-03 4.351e-05 95.4 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 257024 on 334357 degrees of freedom
Residual deviance: 245878 on 334356 degrees of freedom
AIC: 245882
Number of Fisher Scoring iterations: 5
Stata:
. logit outcome sat
Iteration 0: log likelihood = -128512.03
Iteration 1: log likelihood = -123233.13
Iteration 2: log likelihood = -122939.88
Iteration 3: log likelihood = -122939.1
Iteration 4: log likelihood = -122939.1
Logistic regression Number of obs = 334,358
LR chi2(1) = 11145.86
Prob > chi2 = 0.0000
Log likelihood = -122939.1 Pseudo R2 = 0.0434
------------------------------------------------------------------------------
outcome | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
sat | .0041509 .0000435 95.40 0.000 .0040656 .0042362
_cons | -7.75775 .0627402 -123.65 0.000 -7.880719 -7.634782
statsmodels:
Optimization terminated successfully.
Current function value: 0.399258
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: outcome No. Observations: 334358
Model: Logit Df Residuals: 334357
Method: MLE Df Model: 0
Date: Wed, 15 Jul 2015 Pseudo R-squ.: -0.03878
Time: 13:09:47 Log-Likelihood: -1.3350e+05
converged: True LL-Null: -1.2851e+05
LLR p-value: 1.000
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
sat -0.0013 3.69e-06 -363.460 0.000 -0.001 -0.001
==============================================================================
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: outcome No. Observations: 334358
Model: GLM Df Residuals: 334357
Model Family: Binomial Df Model: 0
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -1.3350e+05
Date: Wed, 15 Jul 2015 Deviance: 2.6699e+05
Time: 13:09:48 Pearson chi2: 3.50e+05
No. Iterations: 7
==============================================================================
coef std err z P>|z| [95.0% Conf. Int.]
------------------------------------------------------------------------------
sat -0.0013 3.69e-06 -363.460 0.000 -0.001 -0.001
==============================================================================
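For reference, here is a minimal sketch of how such a fit is typically written in statsmodels (the actual Python call is not shown above, so the file and column names are assumptions). One detail worth noting: unlike R's glm and Stata's logit, statsmodels' Logit and GLM do not add an intercept automatically, and the output above reports Df Model: 0 with no intercept row, so the constant has to be added explicitly:
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data.csv")       # hypothetical file name
y = df["outcome"]
X = sm.add_constant(df["sat"])     # statsmodels does not add an intercept by itself

logit_res = sm.Logit(y, X).fit()                               # MLE
glm_res = sm.GLM(y, X, family=sm.families.Binomial()).fit()    # IRLS
print(logit_res.summary())
print(glm_res.summary())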

Related

Drop columns from a data frame but I keep getting this error below

No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that to drop columns, the selector goes inside [ on the right side of the comma, not on the left side: [, your_code], not [your_code, ]. Your two attempts also fail because the unary operators - and ! cannot be applied to a character vector of column names; they only work with numeric or logical vectors.
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
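As a side note, base R's subset() can also drop columns through its select argument; a small equivalent sketch:
# same columns as above, but given as unquoted names rather than a character vector
subset(mtcars, select = -c(cyl, hp, wt))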
Edit to Reproduce the Error
The error message you got indicates that there is a column containing only one identical value in all rows.
To show this, let's try a logistic regression on a subset of the mtcars data in which the cyl column contains only one identical value, and then use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now, compare it with the same logistic regression using the full mtcars data, in which the cyl column takes various values.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped the three columns that each contain a single identical value in all rows, there is another column in Trainingmodel1 with only one value. That single-valued column probably resulted from filtering the data frame and splitting the data into training and test groups. It is better to check with summary(Trainingmodel1).
Further edit
I have checked the summary(Trainingmodel1) result, and it is clear that EmployeeNumber has one identical value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model or, if EmployeeNumber has other levels that you want to include, make sure the training data contains at least two of them. You can achieve that during splitting by repeating the random sampling until the sampled EmployeeNumber values contain at least two levels, looping with for, while, or repeat, as in the sketch below. Whether such repeated sampling is appropriate for your study is a separate question.
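A minimal sketch of that resampling idea, assuming a data frame df with an EmployeeNumber column and a 70/30 split (the names and the split ratio are illustrative, not taken from your code):
repeat {
  idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
  Trainingmodel1 <- df[idx, ]
  # stop once the training split contains at least two EmployeeNumber levels
  if (length(unique(Trainingmodel1$EmployeeNumber)) >= 2) break
}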
As for your question about subsetting on more than one variable, you can use subset with logical conditions. For example, to get the subset of mtcars with cyl == 4 and mpg > 20:
mtcars |> subset(cyl == 4 & mpg > 20)
If you want a subset with cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20)
You can also combine conditions on more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl < 8) | (mpg > 20 & gear > 4))

Alternative for PSM package

Could anyone suggest an alternative to the PSM package in R for parametric survival models, since this package has been removed?
psm() is a function within the rms package; can you clarify which PSM package you mean?
the PSM package is here: https://rdrr.io/cran/PSM/
You can reproduce the results of the following paper with the code below:
Zhang Z. Parametric regression model for survival data: Weibull regression model as an example. Ann Transl Med 2016;4(24):484. doi: 10.21037/atm.2016.08.45
> install.packages("rms")
> library(rms)
> library(survival)
> data(lung)
> psm.lung<-psm(Surv(time, status)~ph.ecog+sex*age+
+ ph.karno+pat.karno+meal.cal+
+ wt.loss,lung, dist='weibull')
> anova(psm.lung)
Wald Statistics Response: Surv(time, status)
Factor Chi-Square d.f. P
ph.ecog 13.86 1 0.0002
sex (Factor+Higher Order Factors) 10.24 2 0.0060
All Interactions 3.22 1 0.0728
age (Factor+Higher Order Factors) 3.75 2 0.1532
All Interactions 3.22 1 0.0728
ph.karno 5.86 1 0.0155
pat.karno 3.54 1 0.0601
meal.cal 0.00 1 0.9439
wt.loss 3.85 1 0.0498
sex * age (Factor+Higher Order Factors) 3.22 1 0.0728
TOTAL 33.18 8 0.0001

Optimized method to partition numpy 2D array

I am trying to partition a 2D numpy array into 2 separate numpy arrays based on the contents of a particular column. This is my code:
import numpy as np
import pandas as pd

@profile  # kernprof line-profiler decorator
def partition_data(arr, target_colm):
    total_colms = arr.shape[1]
    target_data = arr[:, target_colm]
    type1_data = []
    type2_data = []
    for i in range(arr.shape[0]):
        if target_data[i] == 0:  # if value == 0, put the row in the first array
            type1_data = np.append(type1_data, arr[i])
        else:
            type2_data = np.append(type2_data, arr[i])
    type1_data = np.array(type1_data).reshape(int(len(type1_data) / total_colms), total_colms)
    type2_data = np.array(type2_data).reshape(int(len(type2_data) / total_colms), total_colms)
    return type1_data, type2_data

d = pd.read_csv('data.csv').values
x, y = partition_data(d, 7)  # partition on the values of the 7th column
Note: For my experiment, I used an array of shape (14359, 42).
Now, when I profile this function using the kernprof line profiler, I get the following results:
Wrote profile results to code.py.lprof
Timer unit: 1e-06 s
Total time: 7.3484 s
File: code2.py
Function: part_data at line 8
Line # Hits Time Per Hit % Time Line Contents
==============================================================
8 #profile
9 def part_data(arr,target_col):
10 1 7.0 7.0 0.0 total_colms = arr.shape[1]
11 1 14.0 14.0 0.0 target_data = arr[:,target_col]
12 1 2.0 2.0 0.0 type1_data = []
13 1 1.0 1.0 0.0 type2_data = []
14 5161 40173.0 7.8 0.5 for i in range(arr.shape[0]):
15 5160 39225.0 7.6 0.5 if target_data[i]==6:
16 4882 7231260.0 1481.2 98.4 type1_data = np.append(type1_data,arr[i])
17 else:
18 278 33915.0 122.0 0.5 type2_data = np.append(type2_data,arr[i])
19 1 3610.0 3610.0 0.0 type1_data = np.array(type1_data).reshape(int(len(type1_data)/total_colms),total_colms)
20 1 187.0 187.0 0.0 type2_data = np.array(type2_data).reshape(int(len(type2_data)/total_colms),total_colms)
21 1 3.0 3.0 0.0 return type1_data, type2_data
Here, line 16 alone accounts for about 98% of the runtime: np.append copies the entire accumulated array on every call, so the total cost of the loop grows quadratically. The real data I will work with in the future will be much bigger.
Can anyone please suggest a faster method of partitioning a numpy array?
This should make it a lot faster:
def partition_data_vectorized(arr, target_colm):
    total_colms = arr.shape[1]
    target_data = arr[:, target_colm]
    mask = target_data == 0
    type1_data = arr[mask, :]
    type2_data = arr[~mask, :]
    return (
        type1_data.reshape(int(type1_data.size / total_colms), total_colms),
        type2_data.reshape(int(type2_data.size / total_colms), total_colms))
Some timings:
# Generate some sample inputs:
arr = np.random.rand(10000, 42)
arr[:, 7] = np.random.randint(0, 10, 10000)
%timeit c, d = partition_data_vectorized(arr, 7)
# 2.09 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit a, b = partition_data(arr, 7)
# 4.07 s ± 102 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is 2000 times faster than the non-vectorized calculation!
Comparing the results:
np.all(b == d)
# Out: True
np.all(a == c)
# Out: True
So the results are identical, and the function is 2000 times faster, just from replacing the for-loop and the repeated array copies made by np.append with vectorized boolean-mask indexing.

End loop when significant value found: Stata?

Could you help me figure out how to tell Stata to end a loop over iterations when it finds the first positive and significant value of a particular coefficient in a regression?
Here is a small example using a publicly available dataset that shows what I am trying to do. In the following case, I want Stata to stop looping when it finds the year coefficient to be positive and significant.
set more off
clear all
clear matrix
use http://www.stata-press.com/data/r13/abdata
forvalues i=1/8 {
    xtabond n w k ys year, lags(`i') noconstant
    matrix b = e(b)'
    if `i'==1 matrix byear = b["year",1]
    else matrix byear = (byear \ b["year",1])
}
Could you please help me figure out how to tell Stata to stop looping once this condition is met. Thank you.
Here is some code that seems to do what you want. I had to set the confidence level to 80 (from the default of 95) so it would terminate before it exceeded the maximum number of lags.
set more off
clear all
clear matrix
set level 80
use http://www.stata-press.com/data/r13/abdata
forvalues i=1/8 {
    quietly xtabond n w k ys year, lags(`i') noconstant
    matrix t = r(table)
    scalar b = t[rownumb(t,"b"),colnumb(t,"year")]
    scalar p = t[rownumb(t,"pvalue"),colnumb(t,"year")]
    scalar r = 1-r(level)/100
    scalar q = (b>0) & (p<=r)
    if q {
        display "success with `i' lags"
        display "b: " b " p: " p " r: " r " q: " q
        xtabond
        continue, break
    }
    else {
        display "no luck with `i' lags"
    }
}
which yields
no luck with 1 lags
success with 2 lags
b: .00759529 p: .18035747 r: .2 q: 1
Arellano-Bond dynamic panel-data estimation Number of obs = 611
Group variable: id Number of groups = 140
Time variable: year
Obs per group:
min = 4
avg = 4.364286
max = 6
Number of instruments = 31 Wald chi2(6) = 1819.55
Prob > chi2 = 0.0000
One-step results
------------------------------------------------------------------------------
n | Coef. Std. Err. z P>|z| [80% Conf. Interval]
-------------+----------------------------------------------------------------
n |
L1. | .3244849 .0774312 4.19 0.000 .1727225 .4762474
L2. | -.0266879 .0363611 -0.73 0.463 -.0979544 .0445785
|
w | -.5464779 .0562155 -9.72 0.000 -.6566582 -.4362975
k | .360622 .0330634 10.91 0.000 .2958189 .4254252
ys | .5948084 .0818672 7.27 0.000 .4343516 .7552652
year | .0075953 .0056696 1.34 0.180 -.0035169 .0187075
------------------------------------------------------------------------------
Instruments for differenced equation
GMM-type: L(2/.).n
Standard: D.w D.k D.ys D.year
.
end of do-file

Strange pattern of non-convergence when flattening in mixed effects logistic regression

I'm running a simulation study on the effects of adding fractional numbers of successes and failures, which I'll call C, to mixed effects logistic regressions. I've simulated 2000 datasets and modeled each with 5 logistic regressions (adding a C of either 1, .5, .25, .1, or .05). The models converge on the majority of the datasets, but ~200 fail to converge when I add a C of .25 and ~50 fail to converge when I add a C of .5 (sometimes I get a warning message and sometimes I get implausible standard errors). I very rarely see any evidence of non-convergence with the other values (I've looked at warning messages, standard errors, and the ratio of the highest to the lowest eigenvalue of the random effects matrix). Even in the datasets that fail to converge when C = .25, slightly changing C often solves the problem, as in this example (data sets available here: https://www.dropbox.com/sh/ro92mtjkpqwlnws/AADSVzcNvl0nnnzCEF5QGM6qa?oref=e&n=19939135):
m7 <- glmer(cbind(Data + .25, (10+.5- (Data + .25))) ~ Group*Condition + (1 + Condition |ID), family="binomial", data=df2)
Warning messages:
1: In eval(expr, envir, enclos) : non-integer counts in a binomial glm!
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
summary(m7)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(Data + 0.25, (10 + 0.5 - (Data + 0.25))) ~ Group * Condition + (1 + Condition | ID)
Data: df2
AIC BIC logLik deviance df.resid
7001.1 7040.0 -3493.6 6987.1 1913
Scaled residuals:
Min 1Q Median 3Q Max
-3.5444 -0.6387 0.0143 0.6945 2.9802
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 0.26598 0.5157
Condition 0.06413 0.2532 0.66
Number of obs: 1920, groups: ID, 120
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.760461 0.001226 1436.5 <2e-16 ***
Group -1.816952 0.001225 -1483.0 <2e-16 ***
Condition -0.383383 0.001226 -312.7 <2e-16 ***
Group:Condition -0.567517 0.001225 -463.2 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Group Condtn
Group 0.000
Condition 0.000 0.000
Group:Cndtn 0.000 0.000 0.000
m8 <- glmer(cbind(Data + .2, (10+.4- (Data + .2))) ~ Group*Condition + (1 + Condition |ID), family="binomial", data=df2)
Warning message:
In eval(expr, envir, enclos) : non-integer counts in a binomial glm!
summary(m8)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: cbind(Data + 0.2, (10 + 0.4 - (Data + 0.2))) ~ Group * Condition + (1 + Condition | ID)
Data: df2
AIC BIC logLik deviance df.resid
6929.3 6968.2 -3457.6 6915.3 1913
Scaled residuals:
Min 1Q Median 3Q Max
-3.5724 -0.6329 0.0158 0.6945 2.9976
Random effects:
Groups Name Variance Std.Dev. Corr
ID (Intercept) 0.2698 0.5194
Condition 0.0652 0.2553 0.66
Number of obs: 1920, groups: ID, 120
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.76065 0.07850 22.429 < 2e-16 ***
Group -1.81762 0.10734 -16.933 < 2e-16 ***
Condition -0.38111 0.06377 -5.977 2.28e-09 ***
Group:Condition -0.57033 0.08523 -6.692 2.21e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) Group Condtn
Group -0.732
Condition -0.033 0.025
Group:Cndtn 0.029 0.045 -0.758
As this is a simulation study, I'm not especially interested in making those models converge, but I'd like to understand why they're not converging. Does anybody have any ideas?
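For concreteness, a minimal sketch of the convergence checks described above (warning messages, standard errors, and the eigenvalue ratio of the random-effects covariance), assuming a fitted glmerMod object m7 as in the models shown; this is illustrative, not output from the actual simulation:
library(lme4)
vc <- VarCorr(m7)$ID                       # random-effects covariance matrix for ID
ev <- eigen(vc, only.values = TRUE)$values
max(ev) / min(ev)                          # a very large ratio suggests a near-degenerate fit
m7@optinfo$conv$lme4$messages              # stored convergence messages, if any
summary(m7)$coefficients[, "Std. Error"]   # implausibly small SEs, as in m7 above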
