P-values for glmer mixed effects logistic regression in Python

I have a dataset for one year for all employees, with individual-level data (e.g. age, gender, promotions). Each employee belongs to the team of a certain manager. I also have some variables at the team and manager level (e.g. manager's tenure, team diversity). I want to explain the termination of employees (binary: left the company or not). I am running a multilevel logistic regression in which employees are grouped by their managers, so that they share the same team- and manager-level characteristics.
So, my model looks like:
Termination ~ Age + TimeinCompany + Promotions + Manager_Tenure + PercentCompletedTrainingTeam, data, groups=data['Manager_ID']
Dataset example:
import pandas as pd

data = {'Employee': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7', 'ID8'],
        'Manager_ID': ['MID1', 'MID2', 'MID2', 'MID1', 'MID3', 'MID3', 'MID3', 'MID1'],
        'Termination': ['0', '0', '0', '0', '1', '1', '1', '0'],
        'Age': ['35', '40', '50', '24', '33', '46', '44', '31'],
        'TimeinCompany': ['1', '3', '10', '20', '4', '0', '4', '9'],
        'Promotions': ['1', '0', '0', '0', '1', '1', '1', '0'],
        'Manager_Tenure': ['10', '5', '5', '10', '8', '8', '8', '10'],
        'PercentCompletedTrainingTeam': ['40', '20', '20', '40', '49', '49', '49', '40']}
columns = ['Employee', 'Manager_ID', 'Termination', 'Age', 'TimeinCompany', 'Promotions',
           'Manager_Tenure', 'PercentCompletedTrainingTeam']
data = pd.DataFrame(data, columns=columns)
I managed to run the mixed effects logistic regression using the lme4 package from R, called from Python via rpy2.
from rpy2.robjects import r, Formula, pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()  # allow the pandas DataFrame to be passed to R
importr('lme4')
model1 = r.glmer(formula=Formula('Termination ~ Age + TimeinCompany + Promotions + Manager_Tenure + PercentCompletedTrainingTeam + (1 | Manager_ID)'),
                 data=data)
print(r.summary(model1))
I receive the following output for the full sample:
REML criterion at convergence: 54867.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.9075 -0.3502 -0.2172 -0.0929 3.9378
Random effects:
Groups Name Variance Std.Dev.
Manager_ID (Intercept) 0.005033 0.07094
Residual 0.072541 0.26933
Number of obs: 211974, groups: Manager_ID, 24316
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.14635573 0.00299341 48.893
Age -0.00112153 0.00008079 -13.882
TimeinCompany -0.00238352 0.00010314 -23.110
Promotions -0.01754085 0.00491545 -3.569
Manager_Tenure -0.00044373 0.00010834 -4.096
PercentCompletedTrainingTeam -0.00014393 0.00002598 -5.540
Correlation of Fixed Effects:
(Intr) Age TmnCmpny Promotions Mngr_Tenure
Age -0.817
TmnCmpny 0.370 -0.616
Promotions -0.011 -0.009 -0.033
Mngr_Tenure -0.279 0.013 -0.076 0.035
PrcntCmpltT -0.309 -0.077 -0.021 -0.042 0.052
But there are no p-values displayed. I have read in many places that lme4 does not provide p-values for a number of reasons, but I need them for a work presentation.
I tried several possible solutions that I found, but none of them worked:
importr('lmerTest')
importr('afex')
print(r.anova(model1))
Does not display any output.
print(r.anova(model1, ddf="Kenward-Roger"))
Only displays npar, Sum Sq, Mean Sq, and F value.
print(r.summary(model1, ddf="merModLmerTest"))
Provides the same output as with just summary.
print(r.anova(model1, "merModLmerTest"))
Only displays npar, Sum Sq, Mean Sq, and F value.
Any ideas on how to get p-values are much appreciated.
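One workaround that is often suggested is to compute two-sided Wald p-values from the t/z statistics in the coefficient table (a normal approximation, which is reasonable with this many observations). Note also that the REML criterion in the output suggests the call above actually fitted a linear mixed model: glmer without family= falls back to lmer, whereas a logistic fit needs family=binomial, in which case summary() already reports Pr(>|z|). A minimal sketch, reusing the model object from the question:

from rpy2.robjects import r, globalenv

# For an actual logistic fit the call would include family='binomial'
# (summary() then already shows p-values); here they are computed by hand instead.
globalenv['model1'] = model1
print(r('''
  coefs <- coef(summary(model1))   # Estimate, Std. Error, t/z value
  cbind(coefs, p.value = 2 * pnorm(-abs(coefs[, 3])))
'''))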

Related

Looping in R with tidyverse

I am trying to read multiple csv files with dates in the format dd-mm-yyyy. I want to convert the months into seasons, for which I have used the following code (for one csv file):
x <- data %>%
  dplyr::mutate(year = lubridate::year(UDATE),
                month = lubridate::month(UDATE),
                day = lubridate::day(UDATE))
x %>%
  mutate(season = case_when(
    month %in% c('3', '4', '5', '6') ~ 'Summer',
    month %in% c('7', '8', '9', '10') ~ 'Monsoon',
    month %in% c('11', '12', '1', '2') ~ 'Winter'
  ))
Now I want to run this for multiple csv files simultaneously and export the converted data frames, so that the months are converted into seasons in every file.
Can someone please suggest how to put this into a loop or function over multiple csv files?
Thank you.
Without data to reproduce the problem it is a little hard to help, so I wrote some generic code that might need a few tweaks to solve your problem.
First, I would create a single function that imports, transforms, and exports the data.
prep_data <- function(file_name){
  data <-
    read.csv2(file_name) %>%
    dplyr::mutate(
      year = lubridate::year(UDATE),
      month = lubridate::month(UDATE),
      day = lubridate::day(UDATE),
      season = dplyr::case_when(
        month %in% c(3, 4, 5, 6) ~ 'Summer',
        month %in% c(7, 8, 9, 10) ~ 'Monsoon',
        month %in% c(11, 12, 1, 2) ~ 'Winter'
      )
    )
  # "new_" is used as a prefix so the original .csv is not overwritten
  write.csv2(data, file = paste0("new_", file_name))
}
Then I would create a vector with all my file names.
all_files <- list.files(pattern = "\\.csv$", path = "your path with the csv files", full.names = TRUE)
Lastly, apply the function to all files.
purrr::map(.x = all_files, .f = prep_data)

I have the following code to print my outputs from SQL Server using Python; the output is '1', '2', '3', but I want 1 2 3

import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=DESKTOP-IINBCRC\SQLEXPRESS;'
                      'Database=employees;'
                      'Trusted_Connection=yes;')
cur = conn.cursor()
cur.execute("select * from Login")
for row in cur:
    print(row)
The output is '1', '2', '3', but I want the output to be 1 2 3
So, given
txt = "'1', '2', '3'"
lst = [x.strip()[1:-1] for x in txt.split(",")]
" ".join(lst)
this produces (tested):
'1 2 3'
Or, to do it a bit more simply:
txt = txt.replace(",", "")
txt = txt.replace("'", "")

Snowflake EXPLAIN query support with Snowflake JDBC Driver

Is there a way to run an EXPLAIN Snowflake query through the JDBC driver with the Snowflake extension? I am running net.snowflake snowflake-jdbc 3.12.8 and it throws an error: net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: syntax error line 1 at position 15 unexpected 'EXPLAIN'. I see there are more recent versions, up to 3.12.16, but nothing in the release notes mentions this capability being added. The exact same query runs successfully in the Snowflake UI.
I had no problem using EXPLAIN and the Snowflake JDBC driver 3.12.8:
print(sc._jvm.net.snowflake.spark.snowflake.Utils.getClientInfoString())
x = sc._jvm.net.snowflake.spark.snowflake.Utils.runQuery(sfOptions, 'explain select * from numbers limit 10')
cols = x.getMetaData().getColumnNames()
print(cols)
while x.next():
    print([x.getString(i) for i in range(1, 1 + cols.size())])
The results show that I'm using the specified JDBC version (through PySpark) and the results of the EXPLAIN query:
{
"spark.version" : "2.4.4",
"spark.snowflakedb.version" : "2.8.1",
"spark.app.name" : "Simple App",
"scala.version" : "2.11.12",
"java.version" : "1.8.0_242",
"snowflakedb.jdbc.version" : "3.12.8"
}
['step', 'id', 'parent', 'operation', 'objects', 'alias', 'expressions', 'partitionsTotal', 'partitionsAssigned', 'bytesAssigned']
[None, None, None, 'GlobalStats', None, None, None, '1', '1', '512']
['1', '0', None, 'Result', None, None, 'NUMBERS.X', None, None, None]
['1', '1', '0', 'Limit', None, None, 'rowCount: 10', None, None, None]
['1', '2', '1', 'TableScan', 'TEMP.PUBLIC.NUMBERS', None, 'X', '1', '1', '512']
For further community debugging, you'll need to paste your code to check what's happening.
The EXPLAIN query can be executed via the Snowflake JDBC connector:
Example:
ResultSet rs = stmt.executeQuery("explain SELECT top 5 * from SNOWFLAKE_SAMPLE_DATA.TPCH_SF10.ORDERS where O_ORDERDATE between '1992-01-01' and '1992-12-31'");
ResultSetMetaData rsmd = rs.getMetaData();
int numberOfColumns = rsmd.getColumnCount();
for (int i = 1; i <= numberOfColumns; i++) {
    String name = rsmd.getColumnName(i);
    System.out.println("name :" + name + " size :" + rsmd.getColumnDisplaySize(i));
}
Output:
name :id size :10
name :parent size :10
name :operation size :16777216
name :objects size :16777216
name :alias size :16777216
name :expressions size :16777216
name :partitionsTotal size :39
name :partitionsAssigned size :39
name :bytesAssigned size :39

Can we delete elements in a list based on a certain condition?

I have a Python list:
test_list = ['LeBron James', 'and', 'Nina Mata', 'Dan Brown', 'Derrick Barnes', 'and',
             'Gordon C. James', 'J. Kenji López-Alt', 'and', 'Gianna Ruggiero']
I want output like this:
final_list = ['LeBron James and Nina Mata', 'Dan Brown', 'Derrick Barnes and Gordon C. James',
              'J. Kenji López-Alt and Gianna Ruggiero']
In short, I want the item before an 'and' and the item after it to be combined into one entry. Names that are not joined by 'and' should be left as they are. How can I do this in Python?
Here's a solution that is perhaps not too elegant but simple and functional: join all items with a glue symbol that does not occur in any of them (e.g. "~"), replace the resulting "~and~"s with " and "s, and split by the glue symbol again:
"~".join(test_list).replace("~and~", " and ").split("~")
#['LeBron James and Nina Mata', 'Dan Brown',
# 'Derrick Barnes and Gordon C. James',
# 'J. Kenji López-Alt and Gianna Ruggiero']
This should work for groups with more than one "and," too.
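If you would rather not rely on a glue symbol that must never occur in a name, a plain loop gives the same result; this is just a sketch using the list from the question:

def merge_on_and(names):
    # Whenever an 'and' is seen, append " and <next name>" to the previous entry.
    result = []
    i = 0
    while i < len(names):
        if names[i] == 'and' and result and i + 1 < len(names):
            result[-1] += " and " + names[i + 1]
            i += 2
        else:
            result.append(names[i])
            i += 1
    return result

final_list = merge_on_and(test_list)
# ['LeBron James and Nina Mata', 'Dan Brown',
#  'Derrick Barnes and Gordon C. James',
#  'J. Kenji López-Alt and Gianna Ruggiero']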

How do I normalize temporal data to their initial values?

I have data that are acquired every 3 seconds. Initially, they always begin within a narrow baseline range (e.g. 100±10), but after ~30 seconds they begin to increase in value.
Here's an example.
The issue is that for every experiment, the initial baseline value may start at a different point on the y-axis (e.g. 100, 250, 35) due to variations in equipment calibration.
Although the relative signal enhancement at ~30 seconds behaves the same across different experiments, there may be an offset along the y-axis.
My intention is to measure the AUC of these curves. Because of the offset between experiments, they are not comparable, although they could potentially be identical in shape and enhancement ratio.
Therefore I need to normalize the data so that regardless of offset they all have comparable baseline initial values. This could be set to 0.
Can you give me any suggestions on how to accomplish this normalization in MATLAB?
Ideally the output data should be of relative signal enhancement (in percent relative to baseline).
For example, the baseline values above would hover around 0±10 (instead of the raw original value of ~139) and with enhancement they would build up to ~65% (instead of the original raw value of ~230).
Sample data:
index SQMean
_____ ____________
'0' '139.428574'
'1' '133.298706'
'2' '135.961044'
'3' '143.688309'
'4' '133.298706'
'5' '133.181824'
'6' '134.896103'
'7' '146.415588'
'8' '142.324677'
'9' '128.168839'
'10' '146.116882'
'11' '146.766235'
'12' '134.675323'
'13' '138.610382'
'14' '140.558441'
'15' '128.662338'
'16' '138.480515'
'17' '153.610382'
'18' '156.207794'
'19' '183.428574'
'20' '220.324677'
'21' '224.324677'
'22' '230.415588'
'23' '226.766235'
'24' '223.935059'
'25' '229.922073'
'26' '234.389618'
'27' '235.493500'
'28' '225.727280'
'29' '241.623383'
'30' '225.805191'
'31' '240.896103'
'32' '224.090912'
'33' '230.467529'
'34' '248.285721'
'35' '233.779221'
'36' '225.532471'
'37' '247.337662'
'38' '233.000000'
'39' '241.740265'
'40' '235.688309'
'41' '238.662338'
'42' '236.636368'
'43' '236.025970'
'44' '234.818176'
'45' '240.974030'
'46' '251.350647'
'47' '241.857147'
'48' '242.623383'
'49' '245.714279'
'50' '250.701294'
'51' '229.415588'
'52' '236.909088'
'53' '243.779221'
'54' '244.532471'
'55' '241.493500'
'56' '245.480515'
'57' '244.324677'
'58' '244.025970'
'59' '231.987015'
'60' '238.740265'
'61' '239.532471'
'62' '232.363632'
'63' '242.454544'
'64' '243.831161'
'65' '229.688309'
'66' '239.493500'
'67' '247.324677'
'68' '245.324677'
'69' '244.662338'
'70' '238.610382'
'71' '243.324677'
'72' '234.584412'
'73' '235.181824'
'74' '228.974030'
'75' '228.246750'
'76' '230.519485'
'77' '231.441559'
'78' '236.324677'
'79' '229.935059'
'80' '238.701294'
'81' '236.441559'
'82' '244.350647'
'83' '233.714279'
'84' '243.753250'
Close to what was mentioned by Shai:
blwindow = 1:nrSamp;
DataNorm = 100 * (Data / mean(Data(blwindow)) - 1);
Set the baseline window to the right size; how you determine it depends on your data. The output DataNorm is in %.
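As a quick worked example with the sample data: the baseline mean over roughly the first 18 samples is about 139, so a raw value of 230 becomes 100*(230/139 - 1) ≈ 65%, which matches the enhancement ratio described in the question.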
Usually this kind of problem requires some more specific knowledge about the data you are measuring (range, noise level, whether you know when the actual data starts, etc.) and the results you are trying to achieve. However, based on your question alone and by looking at your example graph, I'd do something like this (assuming your data is in two arrays, time and data):
initialTimeMax = 25; % take first 25 s
baseSample = data(time <= initialTimeMax); % take part of the data corresponding to the first 25 s
baseSampleAverage = mean(baseSample); % take average to deal with noise
data = data - baseSampleAverage;
If you don't know when your data starts, you can apply a smoothing filter, then take a derivative, find the x-position of its maximum, and set initialTimeMax to this x-position.
