Can anyone suggest an alternative to the PSM package in R for parametric survival models, since that package has been removed?
psm() is a function within the rms package; can you clarify which PSM package you mean?
the PSM package is here: https://rdrr.io/cran/PSM/
You can reproduce the results of the following paper with this code:
Zhang Z. Parametric regression model for survival data: Weibull regression model as an example. Ann Transl Med 2016;4(24):484. doi: 10.21037/atm.2016.08.45
> install.packages("rms")
> library(rms)
> library(survival)
> data(lung)
> psm.lung <- psm(Surv(time, status) ~ ph.ecog + sex*age +
+                 ph.karno + pat.karno + meal.cal + wt.loss,
+                 data = lung, dist = 'weibull')
> anova(psm.lung)
                Wald Statistics          Response: Surv(time, status)

Factor                                    Chi-Square d.f. P
ph.ecog                                   13.86      1    0.0002
sex  (Factor+Higher Order Factors)        10.24      2    0.0060
 All Interactions                          3.22      1    0.0728
age  (Factor+Higher Order Factors)         3.75      2    0.1532
 All Interactions                          3.22      1    0.0728
ph.karno                                   5.86      1    0.0155
pat.karno                                  3.54      1    0.0601
meal.cal                                   0.00      1    0.9439
wt.loss                                    3.85      1    0.0498
sex * age  (Factor+Higher Order Factors)   3.22      1    0.0728
TOTAL                                     33.18      8    0.0001
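If you just need a parametric (Weibull) survival model and would rather avoid the rms machinery, note that psm() is a front end to survreg() in the survival package, so survreg() alone fits the same model. A minimal sketch:
library(survival)
data(lung, package = "survival")

# Same Weibull accelerated failure time model as above, fit with survreg()
weib.lung <- survreg(Surv(time, status) ~ ph.ecog + sex * age +
                       ph.karno + pat.karno + meal.cal + wt.loss,
                     data = lung, dist = "weibull")
summary(weib.lung)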
I would like to use the LEAD function to get the closest value for a group
Below is some sample data from flx_alps_boundaries
Subject code   Grade   Score
20-BD-AC-AL    1       1.12
20-BD-AC-AL    2       1.03
20-BD-AC-AL    3       0.97
20-BD-AC-AL    4       0.92
20-BD-AC-AL    5       0.86
20-BD-AC-AL    6       0.84
20-BD-AH-AL    1       1.15
20-BD-AH-AL    2       1.10
20-BD-AH-AL    3       1.05
20-BD-AH-AL    4       1.00
20-BD-AH-AL    5       0.98
20-BD-AH-AL    6       0.96
I am calculating the score for a subject using a formula and getting the grade for the nearest matching score from the above table. E.g. if the score is 0.95 for subject 20-BD-AC-AL, the grade should be 4.
This is my current SQL:
select top 1
ab.alps_grade as alps_grade,
round( sum (actual_alps_points - expected_alps_points)
/ (count(reference) * 100) + 1,2 ) as alps_score
from alps_cte
inner join [flx_alps_boundaries] ab
on alps_cte.course = ab.course_code
where ab.course_code in ('20-BD-AC-AL','20-BD-AH-AL')
group by course,ab.alps_grade,ab.alps_score
order by abs(round(sum(actual_alps_points
- expected_alps_points)
/ (count(reference)*100) + 1, 2)
- ab.alps_score)
This query only returns one row. How do I use LEAD to get the appropriate grade for each
subject's score?
No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried to run it two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, the selection should go inside [ ] on the right side of the comma, not on the left side.
So [, your_code], not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
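As a side note, if you prefer dplyr, the same columns can be dropped with select() (a sketch, assuming the dplyr package is installed):
library(dplyr)

cols <- c("cyl", "hp", "wt")
# all_of() takes the column names stored in `cols` and select(-...) drops them
mtcars |> select(-all_of(cols))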
Edit to Reproduce the Error
The error message you got indicates that there is a column that has only one identical value in all rows.
To show this, let's try a logistic regression using a subset of the mtcars data that has only one identical value in its cyl column, and use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now, compare it with the same logistic regression using the full mtcars data, which has various values in the cyl column.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped the three columns that hold a single identical value in all rows, there is another column in Trainingmodel1 that has only one value. That constant column probably resulted from filtering the data frame and splitting the data into training and test groups. It is better to check with summary(Trainingmodel1) (a programmatic check is sketched below).
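A quick way to list columns that contain only a single value, shown here on mtcars_cyl4 from above (where only cyl is constant); you can run the same check on Trainingmodel1:
# columns whose values are all identical (one unique value)
constant_cols <- names(mtcars_cyl4)[sapply(mtcars_cyl4, function(x) length(unique(x)) <= 1)]
constant_cols
# [1] "cyl"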
Further edit
I have checked the summary(Trainingmodel1) result, and it is clear that EmployeeNumber has only one value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model, or, if EmployeeNumber has other levels and you want to include it, make sure the training data contains at least two of them. One way to achieve that during splitting is to repeat the random sampling until the selected EmployeeNumber values contain at least two levels, looping with for, while, or repeat (a sketch follows below). It is possible, but I don't know how appropriate repeated sampling is for your study.
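A minimal sketch of that repeat idea, using placeholder names (the toy data frame dat stands in for your data set, and 0.7 for a 70% training split; adjust both to your setup):
set.seed(123)
dat <- data.frame(EmployeeNumber = factor(c(rep("A", 9), "B")),  # toy stand-in for your data
                  y = rbinom(10, 1, 0.5))
repeat {
  train_idx <- sample(nrow(dat), size = 0.7 * nrow(dat))
  train     <- dat[train_idx, ]
  # keep this split only once the factor has at least two levels; otherwise re-sample
  if (length(unique(train$EmployeeNumber)) >= 2) break
}
table(train$EmployeeNumber)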
As for your question about subsetting on more than one variable, you can use subset with conditionals. For example, to get a subset of mtcars that has cyl == 4 and mpg > 20:
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl < 8) | (mpg > 20 & gear > 4))
I have the following SAS data set:
Subject  AETERM1  AETERM2  TREATMENT
001      Illness  Fever    0
001      Illness  Cold     0
002      Cardiac  AFIB     1
003      Cardiac  AFLUT    1
I would like to create a table like this in SAS:
___________________________________________________________________________
AETERM1
  AETERM2       TREATMENT = 0 (N = 1)  TREATMENT = 1 (N = 2)  OVERALL (N = 3)
___________________________________________________________________________
Any Event       1 (100%)               2 (100%)               3 (100%)
Illness         1 (100%)                                      1 (33%)
  Fever         1 (100%)                                      1 (33%)
  Cold          1 (100%)                                      1 (33%)
Cardiac                                2 (100%)               2 (67%)
  AFIB                                 1 (50%)                1 (33%)
  AFLUT                                1 (50%)                1 (33%)
I'm able to generate something close with the following PROC FREQ statement:
proc freq data = have order = freq;
table aeterm1 * aeterm2 / missing;
run;
You could actually use
proc freq data = have order = freq;
table aeterm1 * aeterm2 * treatment / out = results;
run;
and then process the results dataset to get the view you want.
I have a situation where I get trip data from another company. The other company measures fuel with a precision of ⅛ gallon.
I get data from the other company and store it in my SQL Server table. The aggregated fuel amounts aren't right. I discovered that while the other company stores fuel in 1/8 gallons, it was sending me only one decimal place.
Furthermore, thanks to this post, I've determined that the company isn't rounding the values to the nearest tenth but is instead truncating them.
Query:
/** Fuel Fractions **/
SELECT DISTINCT ([TotalFuelUsed] % 1) AS [TotalFuelUsedDecimals]
FROM [Raw]
ORDER BY [TotalFuelUsedDecimals]
Results:
TotalFuelUsedDecimals
0.00
0.10
0.20
0.30
0.50
0.60
0.70
0.80
What I'd like is an efficient way to add a corrected fuel column to my views which would map as follows:
0.00 → 0.000
0.10 → 0.125
0.20 → 0.250
0.30 → 0.375
0.50 → 0.500
0.60 → 0.625
0.70 → 0.750
0.80 → 0.875
1.80 → 1.875
and so on
I'm new to SQL so please be kind.
The server is running Microsoft SQL Server 2008, but if you know a better approach that is only supported by newer versions of SQL Server, please post it too; we may upgrade soon and it may help others.
Also, if it makes any difference, there are several different fuel columns in the table that I'll be correcting.
While writing up the question, I tried the following method using a temp table and multiple joins, which seems to work. I expect there are better solutions out there.
CREATE TABLE #TempMap
([from] decimal(18,2), [to] decimal(18,3))
;
INSERT INTO #TempMap
([from], [to])
VALUES
(0.0, 0.000),
(0.1, 0.125),
(0.2, 0.250),
(0.3, 0.375),
(0.5, 0.500),
(0.6, 0.625),
(0.7, 0.750),
(0.8, 0.875)
;
SELECT [TotalFuelUsed]
,[TotalFuelCorrect].[to] + ROUND([TotalFuelUsed], 0, 1) AS [TotalFuelUsedCorrected]
,[IdleFuelUsed]
,[IdleFuelCorrect].[to] + ROUND([IdleFuelUsed], 0, 1) AS [IdleFuelUsedCorrected]
FROM [Raw]
JOIN [#TempMap] AS [TotalFuelCorrect] ON [TotalFuelUsed] % 1 = [TotalFuelCorrect].[from]
JOIN [#TempMap] AS [IdleFuelCorrect] ON [IdleFuelUsed] % 1 = [IdleFuelCorrect].[from]
ORDER BY [TotalFuelUsed] DESC
DROP TABLE #TempMap;
Try adding a column as:
select ....
, case when right(cast([TotalFuelUsed] as decimal(12,1)), 1) = 1 then [TotalFuelUsed] + 0.025
when right(cast([TotalFuelUsed] as decimal(12,1)), 1) = 2 then [TotalFuelUsed] + 0.05
when right(cast([TotalFuelUsed] as decimal(12,1)), 1) = 3 then [TotalFuelUsed] + 0.075
when right(cast([TotalFuelUsed] as decimal(12,1)), 1) = 6 then [TotalFuelUsed] + 0.025
when right(cast([TotalFuelUsed] as decimal(12,1)), 1) = 7 then [TotalFuelUsed] + 0.05
when right(cast([TotalFuelUsed] as decimal(12,1)), 1) = 8 then [TotalFuelUsed] + 0.075
else [TotalFuelUsed] end as updatedTotalFuelUsed
I am trying to extract the LRR and BAF values from an Affymetrix SNP chip, so far without success, using Linux-based tools. I tried a small subset in a Windows-only program called Axiom™ CNV Summary Tools Software and it works perfectly. The problem is that I have a huge dataset, and it would be impossible to run it on a Windows machine powerful enough.
Let me lay out my steps up to this point. First, I obtained five tab-delimited files which are required by the Linux and/or Windows pipelines (files 1-3 were obtained with the Affymetrix APT software).
1 - The Axiom calls.txt or genotype file:
calls <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 2 2 2
AX-100010999 1 0 1
AX-100011005 0 1 2
AX-100011007 2 2 1
AX-100011008 1 1 2
AX-100011010 2 2 2
AX-100011011 0 1 0
AX-100011012 0 1 0
AX-100011016 0 0 1
AX-100011017 0 0 2'
calls <- read.table(text=calls, header=T)
2 - The confidences.txt file:
conf<- 'probeset_id sample_1 sample_2 sample_3
AX-100010998 0.00001 0.0002 0.00006
AX-100010999 0.00001 0.00001 0.00001
AX-100011005 0.00007 0.00017 0.00052
AX-100011007 0.00001 0.00001 0.00001
AX-100011008 0.001 0.00152 0.00001
AX-100011010 0.00001 0.00001 0.00002
AX-100011011 0.00004 0.00307 0.00002
AX-100011012 0.00001 0.00001 0.00001
AX-100011016 0.00003 0.00001 0.00001
AX-100011017 0.00003 0.01938 0.00032'
conf <- read.table(text=conf, header=T)
3 - The summary.txt file:
summ <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 740.33229 655.41465 811.98053
AX-100010998-B 1139.25679 1659.55079 917.7128
AX-100010999-A 1285.67306 1739.03296 1083.48455
AX-100010999-B 1403.51265 341.85893 1237.48577
AX-100011005-A 1650.03408 1274.57594 485.5324
AX-100011005-B 430.3122 2674.70182 4070.90727
AX-100011007-A 411.28952 449.76345 2060.7136
AX-100011007-B 4506.77692 4107.12982 2065.58516
AX-100011008-A 427.78263 439.63541 333.86312
AX-100011008-B 1033.41335 1075.31617 1623.69271
AX-100011010-A 390.12996 350.54456 356.63156
AX-100011010-B 1183.29912 1256.01391 1650.82396
AX-100011011-A 3593.93578 2902.34079 2776.2503
AX-100011011-B 867.33447 2252.54552 961.31596
AX-100011012-A 2250.44699 1192.46116 1927.70581
AX-100011012-B 740.31957 1721.70283 662.1414
AX-100011016-A 1287.9221 1367.95468 1037.98191
AX-100011016-B 554.8795 666.93132 1487.2143
AX-100011017-A 2002.40468 1787.42982 490.28802
AX-100011017-B 849.92775 1025.44417 1429.96567'
summ <- read.table(text=summ, header=T)
4 - The gender.txt:
gender <- 'cel_files gender
sample_1 female
sample_2 female
sample_3 female'
gender <- read.table(text=gender, header=T)
And finally, the map file: map.db on Windows (not human-readable) or map.txt on Linux, as follows:
map <- 'Name Chr Position
AX-100010998 Z 70667736
AX-100010999 4 36427048
AX-100011005 26 4016045
AX-100011007 6 25439800
AX-100011008 2 147800617
AX-100011010 1 98919397
AX-100011011 Z 66652642
AX-100011012 7 28180218
AX-100011016 1A 33254907
AX-100011017 5 1918020'
map <- read.table(text=map, header=T)
This is my Windows-based result for sample_1:
Name Chr Position sample_1.GType sample_1.Log R Ratio sample_1.B Allele Freq
AX-100010998 Z 70667736 BB Infinity 0.675637419295063
AX-100010999 4 36427048 AB 0.101639462657534 0.531373516807123
AX-100011005 26 4016045 AA -0.111910305454305 0
AX-100011007 6 25439800 BB 0.148781943283483 1
AX-100011008 2 147800617 AB -0.293273363654622 0.609503132331127
AX-100011010 1 98919397 BB -0.283993308525307 0.960031843823016
AX-100011011 Z 66652642 AA Infinity 0.00579049667757003
AX-100011012 7 28180218 AA 0.0245684274744242 0.032174599843476
AX-100011016 1A 33254907 AA -0.265925457515035 0
AX-100011017 5 1918020 AA -0.0091211520536838 0
The values from the Windows-based tool seem to be correct, but that is not the case for the Linux output. I am following the steps described in the PennCNV documentation (http://penncnv.openbioinformatics.org/en/latest/user-guide/input/): I log2-transformed my summary.txt and did quantile normalization with the limma package using normalizeBetweenArrays(x), finishing with corrsummary.txt (a sketch of this step is shown after the data below):
corrsum <- 'probeset_id sample_1 sample_2 sample_3
AX-100010998-A 9.804932 9.285738 9.530882
AX-100010998-B 10.249239 10.528922 9.804932
AX-100010999-A 10.528922 10.641862 10.134816
AX-100010999-B 10.641862 8.472829 10.249239
AX-100011005-A 10.804446 10.249239 8.816931
AX-100011005-B 8.835381 11.186266 12.045852
AX-100011007-A 8.542343 8.835381 11.039756
AX-100011007-B 12.045852 12.045852 11.186266
AX-100011008-A 8.816931 8.816931 8.472829
AX-100011008-B 10.134816 9.910173 10.592867
AX-100011010-A 8.472829 8.542343 8.542343
AX-100011010-B 10.374032 10.134816 10.641862
AX-100011011-A 11.593784 11.593784 11.593784
AX-100011011-B 10.012055 11.039756 9.910173
AX-100011012-A 11.186266 10.012055 10.804446
AX-100011012-B 9.530882 10.592867 9.285738
AX-100011016-A 10.592867 10.374032 10.012055
AX-100011016-B 9.285738 9.530882 10.528922
AX-100011017-A 11.039756 10.804446 8.835381
AX-100011017-B 9.910173 9.804932 10.374032'
corrsum <- read.table(text=corrsum, header=T)
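For completeness, the log2 transform and quantile normalization described above can be sketched like this (assuming the limma package is installed; values may differ slightly from the table above because of rounding):
library(limma)

intensities <- as.matrix(summ[, -1])                    # drop the probeset_id column
logged      <- log2(intensities)                        # log2-transform the intensities
normalized  <- normalizeBetweenArrays(logged, method = "quantile")  # quantile normalization across samples
corrsum     <- data.frame(probeset_id = summ$probeset_id, round(normalized, 6))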
Thus I applied:
./generate_affy_geno_cluster.pl calls.txt confidences.txt corrsummary.txt --locfile map.txt --sexfile gender.txt --output gencluster
and
./normalize_affy_geno_cluster.pl --locfile map.txt gencluster calls.txt --output lrrbaf.txt
And my Linux-based result (lrrbaf.txt), which should contain the LRR and BAF information, looks like this:
output <- 'Name Chr Position sample_1.LogRRatio sample_1.BAlleleFreq sample_2.LogRRatio sample_2.BAlleleFreq sample_3.LogRRatio sample_3.BAlleleFreq
AX-100010999 4 36427048 -1952.0739 2 -1953.0739 2 -1952.0739 2
AX-100011005 26 4016045 -2245.1784 2 -2244.1784 2 -2243.1784 2
AX-100011007 6 25439800 -4433.4661 2 -4433.4661 2 -4434.4661 2
AX-100011008 2 147800617 -1493.2287 2 -1493.2287 2 -1492.2287 2
AX-100011011 Z 66652642 -4088.2311 2 -4087.2311 2 -4088.2311 2
AX-100011012 7 28180218 -2741.2623 2 -2740.2623 2 -2741.2623 2
AX-100011016 1A 33254907 -2117.7005 2 -2117.7005 2 -2116.7005 2
AX-100011017 5 1918020 -3067.4077 2 -3067.4077 2 -3065.4077 2'
output <- read.table(text=output, header=T)
As shown above, the Linux result is completely different from the Windows-based result (and makes much less sense), and additionally does not contain the GType column in the output. Sorry for composing such a long question, but my intention was to make it as reproducible as possible. I would be grateful for any light on this problem, as well as any important remarks about this kind of data that I may have forgotten.