How can I obtain overall p-value in feglm model? - logistic-regression

I am working on a project to find factors associated with a certain blood test being performed; let's say a diabetes blood test for this post. The variables that I have are 1) year (2018, 2019, 2020), 2) gender (male, female, other), 3) clinic location of individual clinics (metropolitan, regional, rural), and 4) age group (20-29, 30-39, 40-49, 50-59, 60-89 years old). The data is a clustered sample by medical clinic (clinic_id).
I tried the survey, srvyr and fixest packages (cluster sample) and found that the results of feglm from the fixest package were very similar to those of Stata.
I fit this model with the fixest package using the following script:
model_tested <- feglm(tested ~ year + gender + clinic_location + age_group, data = tested_proportion, family = "binomial", se = "cluster", cluster = ~clinic_id)
I was able to obtain individual p-values like the following:
            Pr(>|t|)
year2019     0.71101
year2020     0.00973
female       0.00000
other        0.08090
age20-29     0.00000
age30-39     0.00000
age40-49     0.39693
age50-59     0.00000
age60-80     0.00000
In glm, I can run an anova() (or aov) test to obtain the overall p-value of each variable, such as the p-value of year, gender and age group.
However, when I run anova(model_tested) I get an error message saying that the anova test is not supported for feglm models.
I tried the following script to obtain the overall p-value of each variable, using wald.test:
p_overall_year <- aod::wald.test(Sigma = vcov(model_tested), b = coef(model_tested), Terms = 2:3)
p_overall_gender <- aod::wald.test(Sigma = vcov(model_tested), b = coef(model_tested), Terms = 4:5)
p_overall_agegroup <- aod::wald.test(Sigma = vcov(model_tested), b = coef(model_tested), Terms = 6:10)
My question is: is there a better way to obtain the overall p-values of each variable?
Also, these calls gave an overall p-value for each group, but it was somewhat different from the one I obtained in Stata using the script testparm i(2018/2020).year, which reports the results of an adjusted Wald test. For example, the overall p-value of year in R was 0.0013 whereas in Stata it was 0.0891.
Are there any other methods I can try in R to achieve an overall p-value similar to Stata's?
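One option worth trying (a sketch, not a definitive answer: it relies on fixest's wald() helper, which jointly tests the coefficients selected by a regular expression against zero using the model's clustered vcov; the patterns below assume fixest's default factor naming, so check coef(model_tested) and adjust them if needed):
library(fixest)
wald(model_tested, keep = "^year")       # joint test for the year dummies
wald(model_tested, keep = "^gender")     # joint test for the gender dummies
wald(model_tested, keep = "^age_group")  # joint test for the age group dummies
Part of the gap to Stata may also come from the test statistic itself: testparm reports an F test with cluster-adjusted denominator degrees of freedom, whereas aod::wald.test reports a chi-squared statistic, so the p-values need not match exactly even with the same clustered variance matrix.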

Related

Extraction of matched dataset from MatchThem

I have browsed almost all possible pages on the subject and I still can't find a way to extract a matched dataset with the MatchThem package.
By analogy, MatchIt allows, via the match.data() function, extraction of the dataset of matched data, for example 3:1. Although MatchThem's complete() function is the equivalent, it apparently does not allow extracting exclusively the imputed AND matched dataset.
Here is an example of multiple imputation with 3:1 matching from which I am trying to extract multiple matched datasets:
library(mice)
library(MatchThem)
#Multiple imputations
mids_object <- mice(data, maxit = 5, m = 3, seed = 20211022, printFlag = F) # m = 3 is deliberately low for this example.
#Matching
mimids_object <- matchthem(primary_subtype ~ age + bmi + ps, data = mids_object, approach = "within" ,ratio= 3, method = "optimal")
#Details of matched data
print(mimids_object)
Printing | dataset: #1
A matchit object
method: Variable ratio 3:1 optimal pair matching
distance: Propensity score
- estimated with logistic regression
number of obs: 761 (original), 177 (matched)
target estimand: ATT
covariates: age, bmi, ps
#Extracting matched dataset
complete(mimids_object, action = "long") -> complete_mi_matched
#Summary of extracted dataset to check the correct number of matches
summary(complete_mi_matched$primary_subtype)
classic ADK         SRC
        702          59
It should show the 3:1 matched proportion, with 177 matched (177 classic ADK and 59 SRC).
I am missing something. Thanks in advance for your help or suggestions.
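One thing that may be worth checking (a sketch under an assumption, not a documented MatchThem guarantee: it assumes the long dataset returned by complete() carries over MatchIt-style matching columns such as weights, where unmatched units get a weight of 0, so inspect names(complete_mi_matched) first):
library(dplyr)
# keep only units that are both imputed and matched (hypothetical 'weights' column)
complete_mi_matched_only <- complete_mi_matched %>%
  filter(weights > 0)
summary(complete_mi_matched_only$primary_subtype)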

Replace Null Values of a Column with mean of another Categorical Column in Spark Dataframe

I have a dataset like this
id  category  value
1   A         NaN
2   B         NaN
3   A         10.5
5   A         2.0
6   B         1.0
I want to fill the NaN values with the mean of their respective category, as shown below:
id  category  value
1   A         4.16
2   B         0.5
3   A         10.5
5   A         2.0
6   B         1.0
I first tried to calculate the mean value of each category using groupBy:
val df2 = dataFrame.groupBy("category").agg(mean("value")).rdd.map {
  case r: Row => (r.getAs[String]("category"), r.get(1))
}.collect().toMap
println(df2)
I got a map of each category and its respective mean value. Output: Map(A -> 4.16, B -> 0.5)
Then I tried an update query in Spark SQL to fill the column, but it seems Spark SQL doesn't support update queries. I also tried to fill the null values within the dataframe directly, but failed to do so.
What can I do? The same thing can be done in pandas, as shown in Pandas: How to fill null values with mean of a groupby?
But how can I do it with a Spark dataframe?
The simplest solution would be to use groupby and join:
val df2 = df.filter(!(isnan($"value"))).groupBy("category").agg(avg($"value").as("avg"))
df.join(df2, "category").withColumn("value", when(col("value").isNaN, $"avg").otherwise($"value")).drop("avg")
Note that if there is a category where all values are NaN, it will be removed from the result.
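If you need to keep those categories, a left join variant of the same idea should work (a sketch, assuming the same imports and df/$ syntax used above; rows whose category has no non-NaN value simply keep their NaN):
// same aggregation as above, but joined back with a left join so
// categories consisting only of NaN values are not dropped
val stats = df.filter(!isnan($"value")).groupBy("category").agg(avg($"value").as("avg"))
df.join(stats, Seq("category"), "left")
  .withColumn("value", when(col("value").isNaN && $"avg".isNotNull, $"avg").otherwise($"value"))
  .drop("avg")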
Indeed, you cannot update DataFrames, but you can transform them using functions like select and join. In this case, you can keep the grouping result as a DataFrame and join it (on category column) to the original one, then perform the mapping that would replace NaNs with the mean values:
import org.apache.spark.sql.functions._
import spark.implicits._
// calculate mean per category:
val meanPerCategory = dataFrame.groupBy("category").agg(mean("value") as "mean")
// use join, select and "nanvl" function to replace NaNs with the mean values:
val result = dataFrame
  .join(meanPerCategory, "category")
  .select($"category", $"id", nanvl($"value", $"mean"))
result.show()
I stumbled upon the same problem and came across this post, but tried a different solution, i.e. using window functions. The code below is tested on PySpark 2.4.3 (window functions are available from Spark 1.4). I believe this is a slightly cleaner solution.
This post is quite old, but I hope this answer will be helpful for others.
from pyspark.sql import Window
from pyspark.sql.functions import *
df = spark.createDataFrame([(1,"A", None), (2,"B", None), (3,"A",10.5), (5,"A",2.0), (6,"B",1.0)], ['id', 'category', 'value'])
category_window = Window.partitionBy("category")
value_mean = mean("value0").over(category_window)
result = df\
.withColumn("value0", coalesce("value", lit(0)))\
.withColumn("value_mean", value_mean)\
.withColumn("new_value", coalesce("value", "value_mean"))\
.select("id", "category", "new_value")
result.show()
Output will be as expected (in question):
id  category  new_value
1   A         4.166666666666667
2   B         0.5
3   A         10.5
5   A         2
6   B         1

Using multiple aggregate functions - sum and count

I've tried several of the solutions to my question on the site but could not find one that worked. Please help!
Other than taking some liberties with the report_names, the data is realistic of what I am trying to accomplish and is just a small portion of what I am up against: roughly 97K rows of data with the same kind of repetition of branch, file_count, report_name... The file numbers are unique and insignificant; they are shown for informational purposes and explain why the amounts are unique - they are tied to the file_name.
I am looking for one row per report_name with the sum of the two amounts.
Here are the current results of my query:
branch     file_count  file_volume  net_profit   report_name       file_number
Northeast  1           $200,000.00  $200,000.00  bogart.hump.new   12345
Northeast  1           $195,000.00  $197,837.00  bogart.hump.new   23456
Northeast  1           $111,500.00  $113,172.00  bogart.hump.new   34567
Northwest  1           $66,000.00   -$1,500.18   jolie.angela.new  45678
Northwest  1           $159,856.00  -$2,745.58   jolie.angela.new  56789
Northwest  1           $140,998.00  -$2,421.69   jolie.angela.new  67890
Southwest  1           $74,000.00   $73,904.00   Man.bat.net       78901
Southwest  1           $186,245.00  -$4,231.25   Man.bat.net       89012
Southwest  1           $72,375.00   $73,641.00   Man.bat.net       90123
Southeast  1           $79,575.00   -$1,821.76   zep.led.new       1234A
Southeast  1           $268,600.00  $268,600.00  zep.led.new       2345A
Southeast  1           $77,103.00   -$1,751.68   zep.led.new       3456A
This is what I am looking for:
branch     file_count  file_volume  net_profit   report_name       file_number
Northeast  3           $506,500.00  $511,009.00  bogart.hump.new
Northwest  3           $366,854.00  -$6,667.45   jolie.angela.new
Southwest  3           $332,620.00  $143,313.75  Man.bat.net
Southeast  3           $425,278.00  $265,026.56  zep.led.new
My query:
SELECT
branch,
count(filenumber) AS file_count,
sum(fileAmount) AS file_amount,
sum(netprofit*-1) AS net_profit,
concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
FROM user.summary u
inner join user.db1 d1 ON d1.loaname = u.loaname
inner join user.db2 d2 ON d2.cn = u.loaname
WHERE d2.filedate = '2015-09-01'
AND filenumber is not null
GROUP BY branch,concat(d2.lastname,'.',d2.firstname,'.','new')
The only issue I see with your current query is that you have a comma at the end of this line, which would give you a syntax error:
concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
If you want the blank file_number field as shown in your desired result set, though, you could keep the comma and follow it with the blank field by adding to it:
concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
'' file_number
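Putting that together, the full query would look roughly like this (a sketch based on the query in the question, with the blank file_number field added after report_name; table and column names are taken as given):
SELECT
    branch,
    count(filenumber) AS file_count,
    sum(fileAmount) AS file_amount,
    sum(netprofit*-1) AS net_profit,
    concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
    '' AS file_number
FROM user.summary u
INNER JOIN user.db1 d1 ON d1.loaname = u.loaname
INNER JOIN user.db2 d2 ON d2.cn = u.loaname
WHERE d2.filedate = '2015-09-01'
  AND filenumber IS NOT NULL
GROUP BY branch, concat(d2.lastname,'.',d2.firstname,'.','new')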
I figured it out, but could not have done it without airing it out in this forum. In my actual query, I included the "file_name" column, so I had both the "count(file_name)" and "file_name" columns, whereas in my query example I only had the "count(file_name)" column. When I removed the "file_name" column from my actual query, it worked. Side note: it was obvious that I excluded a key component in my query. On any future query questions, I will include the complete query but substitute the actual column names with col1, col2, db1, db2, etc. Thanks very much for responding to my question.

Link two tables based on conditions in matlab

I am using MATLAB to prepare my dataset in order to run it through certain data mining models, and I am facing an issue with linking the data between two of my tables.
So, I have two tables, A and B, which contain sequential recordings of certain values at certain timestamps, and I want to create a third table, C, in which I will add columns of both A and B in the same rows according to some conditions.
Tables A and B don't have the same number of rows (A has more measurements), but they both have two columns:
1st column: time of the recording (hh:mm:ss) and
2nd column: recorded value in that time
Columns of A and B are to be added to table C when all of the following conditions hold:
The time difference between A and B is more than 3 sec but less than 5 sec.
The recorded value of A is 40%-50% of the recorded value of B.
Any help would be greatly appreciated.
For the first condition you need something like [row,col,val] = find((A(:,1)-B(:,1)) > 3sec & (A(:,1)-B(:,1)) < 5sec), where you do need to use datenum or an equivalent to transform your timestamps (3sec and 5sec stand for those thresholds in the same units; note the element-wise & rather than &&, since these are vector comparisons). For the second condition this works the same way: use [row,col,val] = find(A(:,2) > 0.4*B(:,2) & A(:,2) < 0.5*B(:,2)).
datenum allows you to transform your arrays, so do that first:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
you might need to check the documentation on datenum, regarding the format your string is in.
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
creates the datenums for 3 and 5 seconds. All combined:
A(:,1) = datenum(A(:,1));
B(:,1) = datenum(B(:,1));
time1 = [datenum([0 0 0 0 0 3]) datenum([0 0 0 0 0 5])];
[row1,col1,val1] = find((A(:,1)-B(:,1)) > time1(1) & (A(:,1)-B(:,1)) < time1(2));
[row2,col2,val2] = find(A(:,2) > 0.4*B(:,2) & A(:,2) < 0.5*B(:,2));
The variables of row and col you might not need when you want only the values though. val1 contains the values of condition 1, val2 of condition 2. If you want both conditions to be valid at the same time, use both in the find command:
[row3,col3,val3] = find((A(:,1)-B(:,1)) > time1(1) & ...
    (A(:,1)-B(:,1)) < time1(2) & A(:,2) > 0.4*B(:,2) ...
    & A(:,2) < 0.5*B(:,2));
The actual adding of your two arrays based on the conditions:
C = A(row3,2)+B(row3,2);
Thank you for your response and help! However, for the time I followed a different approach, converting hh:mm:ss to seconds, which will make the comparison easier later on:
dv1 = datevec(A, 'dd.mm.yyyy HH:MM:SS.FFF');
secs = [3600,60,1];
dv1(:,6) = floor(dv1(:,6));
timestamp = dv1(:,4:6)*secs.';
Now I am working on combining both time and weight conditions in a piece of code that will run. Should I use an if condition inside a for loop or is a for loop not necessary?
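For what it's worth, a for loop (and the if inside it) may not be needed at all; the conditions can be combined in a vectorized way. A sketch only, following the seconds-based timestamps above and assuming, like the earlier answer, that A and B have been aligned to the same number of rows, with column 1 holding the time in seconds and column 2 the recorded value:
% dt is the row-wise time difference in seconds
dt   = A(:,1) - B(:,1);
% condition 1: more than 3 s but less than 5 s apart
% condition 2: the value in A is 40%-50% of the value in B
keep = dt > 3 & dt < 5 & A(:,2) >= 0.4*B(:,2) & A(:,2) <= 0.5*B(:,2);
% rows of A and B that satisfy both conditions, side by side in C
C = [A(keep,:) B(keep,:)];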

Split array into chunks based on timestamp in Haskell

I have an array of records (a custom data type) in Haskell which I want to aggregate based on each record's timestamp. In very general terms, each record looks like this:
data Record = Record { event :: String,
                       time  :: Double,
                       from  :: Int,
                       to    :: Int
                     } deriving (Show, Eq)
I used a Double for the timestamp since that is the same format used in the tracefile.
And I parse them from a CSV file into an array of records: [Record]
Now I'm looking to get an approximation of instantaneous events / time. So I want to split the array into several arrays based on the timestamp (say, every 1 second) and then fold across each smaller array.
The problem is I can't figure out how to split an array based on the value of a record. Looking on Hoogle I found several functions like splitEvery and splitWhen, but I'm lost. I considered using splitWhen to break up the list when, say, (mod time 0.1) == 0, but even if that worked it would remove the elements it's splitting on (which I don't want to do).
I should note that the records are NOT evenly spaced in time. E.g. the timestamp on sequential records is not going to differ by a fixed amount.
I am more than willing to store the data in a different format if you can suggest one that would make this sort of work easier.
A quick sample of the data I'm parsing (from a ns2 simulation):
r 0.114 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.240 1 2 tcp 1000 ________ 2 1.0 5.0 0 2
r 0.914 2 1 tcp 1000 ________ 2 5.0 1.0 0 3
If you have [Record] and you want to group them by a specific condition, you can use Data.List.groupBy. I'm assuming that for your time :: Double, 1 second is the base unit, so time = 1 is 1 second, time = 100 is 100 seconds, etc, so adjust this to whatever system you're actually using:
import Data.List
import Data.Function (on)
isInSameClockSecond :: Record -> Record -> Bool
isInSameClockSecond = (==) `on` (floor . time :: Record -> Integer)
-- The type signature is given for floor . time to remove any ambiguity
-- due to floor's polymorphic type signature.
groupBySameClockSecond :: [Record] -> [[Record]]
groupBySameClockSecond = groupBy isInSameClockSecond
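From there, getting the approximation of events per time is a matter of folding (or simply counting) within each group. A small sketch building on the definitions above; eventsPerSecond is a name introduced here for illustration:
-- Count how many records fall into each whole clock second.
-- groupBy never produces empty groups, so head is safe here.
eventsPerSecond :: [Record] -> [(Integer, Int)]
eventsPerSecond rs =
  [ (floor (time (head grp)), length grp)
  | grp <- groupBySameClockSecond rs ]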
