I have the following SAS data set:
Subject AETERM1 AETERM2 TREATMENT
001 Illness Fever 0
001 Illness Cold 0
002 Cardiac AFIB 1
003 Cardiac AFLUT 1
I would like to create a table like this in SAS:
_____________________________________________________________________________________
AETERM1
  AETERM2      TREATMENT = 0 (N = 1)   TREATMENT = 1 (N = 2)   OVERALL (N = 3)
_____________________________________________________________________________________
Any Event      1 (100%)                2 (100%)                3 (100%)
Illness        1 (100%)                                        1 (33%)
  Fever        1 (100%)                                        1 (33%)
  Cold         1 (100%)                                        1 (33%)
Cardiac                                2 (100%)                2 (67%)
  AFIB                                 1 (50%)                 1 (33%)
  AFLUT                                1 (50%)                 1 (33%)
I'm able to generate something close with the following PROC FREQ statement:
proc freq data = have order = freq;
table aeterm1 * aeterm2 / missing;
run;
You could actually use
proc freq data = have order = freq;
table aeterm1 * aeterm2 * treatment / out = results;
run;
and then process the results data set to get the layout you want.
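The post-processing step is not spelled out above, so here is a minimal sketch of the counting logic in Python (used only as a neutral illustration, not SAS). It assumes that each cell counts distinct subjects rather than rows, and that every subject in a treatment arm appears in the data set; the names `cell`, `arm_subjects` and `term_subjects` are hypothetical.

```python
from collections import defaultdict

# The four records from the question: (Subject, AETERM1, AETERM2, TREATMENT)
rows = [
    ("001", "Illness", "Fever", 0),
    ("001", "Illness", "Cold", 0),
    ("002", "Cardiac", "AFIB", 1),
    ("003", "Cardiac", "AFLUT", 1),
]

# Denominators: distinct subjects per treatment arm
arm_subjects = defaultdict(set)
# Numerators: distinct subjects per (row label, treatment arm)
term_subjects = defaultdict(set)
for subj, term1, term2, trt in rows:
    arm_subjects[trt].add(subj)
    # Each record contributes to "Any Event", its AETERM1 and its AETERM1/AETERM2 pair
    for label in ("Any Event", term1, (term1, term2)):
        term_subjects[(label, trt)].add(subj)


def cell(label, arms):
    """Format 'n (pct%)' for a row label over one or more treatment arms."""
    subjects = set().union(*(term_subjects.get((label, a), set()) for a in arms))
    denom = len(set().union(*(arm_subjects[a] for a in arms)))
    n = len(subjects)
    return f"{n} ({round(100 * n / denom)}%)" if n else ""


for label in ["Any Event", "Illness", ("Illness", "Fever"), ("Illness", "Cold"),
              "Cardiac", ("Cardiac", "AFIB"), ("Cardiac", "AFLUT")]:
    print(label, "|", cell(label, [0]), "|", cell(label, [1]), "|", cell(label, [0, 1]))
```

Counting distinct subjects is what makes Illness show 1 (100%) for treatment 0 even though subject 001 has two Illness records.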
No matter how I try to code this in R, I still cannot drop my columns so that I can build my logistic regression model. I tried two different ways:
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[-cols,]
Error in -cols : invalid argument to unary operator
cols<-c("EmployeeCount","Over18","StandardHours")
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[!cols,]
Error in !cols : invalid argument type
This may solve your problem:
Trainingmodel1 <- DAT_690_Attrition_Proj1EmpAttrTrain[ , !colnames(DAT_690_Attrition_Proj1EmpAttrTrain) %in% cols]
Please note that if you want to drop columns, you should put your code inside [ on the right side of the comma, not on the left side.
So [, your_code] not [your_code, ].
Here is an example of dropping columns using the code above.
cols <- c("cyl", "hp", "wt")
mtcars[, !colnames(mtcars) %in% cols]
# mpg disp drat qsec vs am gear carb
# Mazda RX4 21.0 160.0 3.90 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 160.0 3.90 17.02 0 1 4 4
# Datsun 710 22.8 108.0 3.85 18.61 1 1 4 1
# Hornet 4 Drive 21.4 258.0 3.08 19.44 1 0 3 1
# Hornet Sportabout 18.7 360.0 3.15 17.02 0 0 3 2
# Valiant 18.1 225.0 2.76 20.22 1 0 3 1
#...
Edit to Reproduce the Error
The error message you got indicates that some column holds a single identical value in every row.
To show this, let's fit a logistic regression on a subset of the mtcars data in which the cyl column contains only one value, and use that column as a predictor.
mtcars_cyl4 <- mtcars |> subset(cyl == 4)
mtcars_cyl4
# mpg cyl disp hp drat wt qsec vs am gear carb
# Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
# Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
# Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
# Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
# Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
# Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
# Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
# Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
# Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
# Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars_cyl4, family = "binomial")
#Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
# contrasts can be applied only to factors with 2 or more levels
Now compare it with the same logistic regression on the full mtcars data, where the cyl column takes several values.
glm(am ~ as.factor(cyl) + mpg + disp, data = mtcars, family = "binomial")
# Call: glm(formula = am ~ as.factor(cyl) + mpg + disp, family = "binomial",
# data = mtcars)
#
# Coefficients:
# (Intercept) as.factor(cyl)6 as.factor(cyl)8 mpg disp
# -5.08552 2.40868 6.41638 0.37957 -0.02864
#
# Degrees of Freedom: 31 Total (i.e. Null); 27 Residual
# Null Deviance: 43.23
# Residual Deviance: 25.28 AIC: 35.28
It is likely that, even though you have dropped three columns that each hold a single identical value in every row, another column in Trainingmodel1 also has only one distinct value. Such a column can arise when the data frame is filtered or split into training and test groups. It is worth checking with summary(Trainingmodel1).
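The check itself is simple to state: a column is a problem when it has exactly one distinct value. Here is a minimal sketch of that test in Python (a neutral illustration, not R). The records are made up for the example; `Over18` and `StandardHours` come from the question, while `Age`, `Attrition` and their values are hypothetical.

```python
def constant_columns(records):
    """Return the names of columns whose value is identical in every record."""
    if not records:
        return []
    return [name for name in records[0]
            if len({rec[name] for rec in records}) == 1]


# Hypothetical training rows, for illustration only
training = [
    {"Age": 41, "Over18": "Y", "StandardHours": 80, "Attrition": "Yes"},
    {"Age": 49, "Over18": "Y", "StandardHours": 80, "Attrition": "No"},
    {"Age": 37, "Over18": "Y", "StandardHours": 80, "Attrition": "Yes"},
]

print(constant_columns(training))
```

Any column reported this way would trigger the "2 or more levels" error if it entered the model as a factor.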
Further edit
I have checked the summary(Trainingmodel1) result, and it is clear that EmployeeNumber has a single value (called a "level" for a factor) in all rows. To run your regression properly, either drop it from your model or, if EmployeeNumber has other levels and you want to include it, make sure it contains at least two levels in the training data. You can achieve that during splitting by repeating the random sampling until the sampled EmployeeNumber values contain at least two levels, using a for, while, or repeat loop. It is possible, but I don't know how appropriate repeated sampling is for your study.
As for your question about subsetting more than one variable, you can use subset and conditionals. For example, you want to get a subset of mtcars that has cyl == 4 and mpg > 20 :
mtcars |> subset(cyl == 4 & mpg > 20 )
If you want a subset that has cyl == 4 or mpg > 20:
mtcars |> subset(cyl == 4 | mpg > 20 )
You can also subset by using more columns as subset criteria:
mtcars |> subset((cyl > 4 & cyl <8) | (mpg > 20 & gear > 4 ))
I want to loop first through columns and then through rows.
data test ;
input cat $ cat3 cat4 cat5 cat6 cat7 cat8 num_rat ;
cards;
cat3 0 -1.78 -2.68 -3.06 -3.4 -3.83 1885
cat4 0 2.12 -2.15 -2.63 -2.94 -3.34 3151
cat5 0 2.45 1.16 -1.39 -1.99 -2.54 246
cat6 0 2.48 1.92 1.19 -1.13 -2.39 80
cat7 0 2.68 2.32 1.82 1.52 -1.56 89
;
run;
DATA TEST1;
SET test;
Array Cat_C(6) cat3-cat8;
Array Cat_g(5) catg3-catg7;
do i = 1 to 5;
cat_g(i)= num_rat * ((((CDF('Normal', Cat_C(i+1))-CDF('NORMAL',Cat_C(i))-
((CDF('NORMAL', ((Cat_C(i+1) - (sqrt(&rho)*&z))/sqrt(1-&rho))))-(CDF
('NORMAL', ((Cat_C(i) - (sqrt(&rho)*&z))/sqrt(1-&rho))))))**2))/
(((CDF('NORMAL', ((Cat_C(i+1) - (sqrt(&rho)*&z))/sqrt(1-&rho))))-(CDF
('NORMAL', ((Cat_C(i) - (sqrt(&rho)*&z))/sqrt(1-&rho)))))*(1- /**/((CDF('NORMAL', ((Cat_C(i+1) - (sqrt(&rho)*&z))/sqrt(1-&rho))))-(CDF
('NORMAL', ((Cat_C(i) - (sqrt(&rho)*&z))/sqrt(1-&rho))))))));
end;
run;
For the cat3 row I need to compute the sum (0 + -1.78)*1885 + (-1.78 + -2.68)*1885 + (-2.68 + -3.06)*1885 + (-3.06 + -3.4)*1885 + (-3.4 + -3.83)*1885,
then do the same for all the other rows (cat4, cat5, cat6, cat7), and finally add all the row sums together and minimize the total, i.e. min(sum_rows).
The minimization equation can be found at this paper:
https://www.z-riskengine.com/media/1032/a-one-parameter-representation-of-credit-risk-and-transition-matrices.pdf
where x(g+1) is the movement from cat3 to cat4, rho is any fixed number (say 0.8), and z is the value to be found by the minimization, starting from any initial number (say 0.89).
Depending on the data you might try working with a 2D array.
data test ;
input cat $ cat3 cat4 cat5 cat6 cat7 cat8 num_rat ;
cards;
cat3 0 -1.78 -2.68 -3.06 -3.4 -3.83 1885
cat4 0 2.12 -2.15 -2.63 -2.94 -3.34 3151
cat5 0 2.45 1.16 -1.39 -1.99 -2.54 246
cat6 0 2.48 1.92 1.19 -1.13 -2.39 80
cat7 0 2.68 2.32 1.82 1.52 -1.56 89
;
data _null_;
array X (3:7,3:9); * column index 9 is num_rat;
* load 2D array;
do row = 3 to 7;
set test;
array cols cat3-cat8 num_rat;
do col = 3 to 9;
X(row,col) = cols(col-2);
end;
end;
do index1 = 3 to 7;
do index2 = 3 to 9;
putlog @(8*(index2-3)+1) X(index1,index2) @;
end;
putlog;
end;
run;
Another alternative is to do the row-wise computations in a DATA step and the column-wise computations with Proc MEANS.
You can also look into Proc IML, SAS/OR, Proc FCMP and the SOLVE function, or Proc DS2 and its matrix functions.
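The "row-wise first, then across rows" order can be sketched as follows in Python (a neutral illustration, not SAS). The per-pair term here is just the plain sum from the question's worked cat3 example; `pair_term`, `row_total` and the `term` parameter are hypothetical names, and the CDF-based expression would be swapped in for `pair_term` in the real calculation.

```python
# Rows from the question: thresholds cat3..cat8 and the num_rat weight
rows = {
    "cat3": ([0, -1.78, -2.68, -3.06, -3.4, -3.83], 1885),
    "cat4": ([0, 2.12, -2.15, -2.63, -2.94, -3.34], 3151),
    "cat5": ([0, 2.45, 1.16, -1.39, -1.99, -2.54], 246),
    "cat6": ([0, 2.48, 1.92, 1.19, -1.13, -2.39], 80),
    "cat7": ([0, 2.68, 2.32, 1.82, 1.52, -1.56], 89),
}


def pair_term(x_g, x_next):
    # Placeholder for the per-cell expression; here the simple sum
    # used in the question's worked example
    return x_g + x_next


def row_total(thresholds, weight, term=pair_term):
    # Combine each threshold with its right-hand neighbour, then weight by num_rat
    return weight * sum(term(a, b) for a, b in zip(thresholds, thresholds[1:]))


# Step 1: one total per row; step 2: add the row totals together
row_totals = {name: row_total(t, w) for name, (t, w) in rows.items()}
grand_total = sum(row_totals.values())
print(row_totals["cat3"], grand_total)
```

Keeping the per-cell term as a separate function means the outer loops do not change when the term starts to depend on rho and z; the grand total then becomes the quantity handed to a minimizer.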
Rows are Temperature and columns are Pressure.
Temp   750   755   760   765   (pressure)
0      1.1   2     1     4
1      3     4     2     1     (factors)
2      4     5     5     9
I need help building this table in code so that I can look up the factor value for a given temperature and pressure.
For example, if temp is 0 and pressure is 750 the factor value is 1.1; if temp is 1 and pressure is 750 the factor value is 3.
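The question does not name a language, so here is a minimal sketch in Python of one way to store the table and look up a factor. The names `pressures`, `factors` and `factor` are hypothetical; the values are the ones from the table above.

```python
# Column headers (pressure) and one list of factors per temperature row
pressures = [750, 755, 760, 765]
factors = {
    0: [1.1, 2, 1, 4],
    1: [3, 4, 2, 1],
    2: [4, 5, 5, 9],
}


def factor(temp, pressure):
    """Look up the factor for an exact temperature and pressure in the table."""
    return factors[temp][pressures.index(pressure)]


print(factor(0, 750))
print(factor(1, 750))
```

This only handles values that appear exactly in the table: an unknown temperature raises KeyError and an unknown pressure raises ValueError. Interpolating between entries would need extra logic.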
I am new to R and I've been stuck on this. I have the data set below, in which I created a new list variable called 'amountOfTxn_array' that contains three numeric values in sequential order. These are transaction amounts from Jan to Mar. My objective is to create new variables from this list by iterating over each element of 'amountOfTxn_array'.
> head(myData_05_Array)
Index accountID amountOfTxn_array
1:00 8887 c(36.44, 75.00,185.24)
2:00 13462 c(639.45,656.10,237.00)
3:00 47249 c(0, 24, 2012)
4:00 49528 c(1189.20,2326.26,1695.89)
5:00 57201 c(24.67, 0.00, 0.00)
6:00 57206 c(0.00, 661.98,2957.68)
str(myData_05_Array)
Classes ‘data.table’ and 'data.frame': 3176 obs. of 4 variables:
$ accountID : int 8887 13462 47249 49528 57201 57206 58522 79073 80465 81032 ...
$ amountOfTxn_200501: num 36.4 639.5 0 1189.2 24.7 ...
$ amountOfTxn_200502: num 75 656 24 2326 0 ...
$ amountOfTxn_200503: num 185 237 2012 1696 0 ...
$ amountOfTxn_array :List of 3176
Below is example code for creating a new variable, in which I would like to tag 1 if a value in the array is greater than 100 and 0 otherwise. When I run it, I get the error "Error: (list) object cannot be coerced to type 'double'". Is there a solution for this? I would highly appreciate any response.
Thanks!
> for(i in 1:3)
+ {
+ if(myData_05_Array$amountOfTxn_array[i] > 100){
+ myData_05_Array$testArray[i] <- 1
+ }
+ else{
+ myData_05_Array$testArray[i] <- 0
+ }
+ }
Error: (list) object cannot be coerced to type 'double'
What I am expecting as the output is as follows:
amountOfTxn_testArray
c(0, 0, 1)
c(1, 1, 1)
c(0, 0, 1)
c(1, 1, 1)
c(0, 0, 0)
c(0, 1, 1)
"Doing calculations for 24 columns is quite cumbersome"
Aha! Welcome to the dplyr world:
library(dplyr)
#generate dummy data
dummyDf <-read.table(text='Index accountID Jan Feb March
1:00 8887 36.44 75.00 185.24
2:00 13462 639.45 656.10 237.00
3:00 47249 0 24 2012
4:00 49528 1189.20 2326.26 1695.89
5:00 57201 24.67 0.00 0.00
6:00 57206 0.00 661.98 2957.68', header=TRUE, stringsAsFactors=FALSE)
mutate column by column index
#the dot (.) argument refers to the focal column
dummyDf %>% mutate_at(3:5, ~ as.numeric(. > 100))
mutate columns by predefined names
changeVars =c("Jan","Feb","March")
dummyDf %>% mutate_at(.vars = changeVars, ~ as.numeric(. > 100))
mutate columns if some condition is met
dummyDf %>% mutate_if(is.double, ~ as.numeric(. > 100))
output:
Index accountID Jan Feb March
1 1:00 8887 0 0 1
2 2:00 13462 1 1 1
3 3:00 47249 0 0 1
4 4:00 49528 1 1 1
5 5:00 57201 0 0 0
6 6:00 57206 0 1 1
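If the values have to stay in a list column rather than separate columns, the tagging is an element-wise comparison inside each list. The original loop fails because single-bracket indexing of an R list returns a list, which cannot be compared with a number. Here is a minimal sketch of the element-wise logic in Python (a neutral illustration, not R); `amounts` and `flags` are hypothetical names and the data are the six rows shown in the question.

```python
# accountID -> the three monthly amounts from the question
amounts = {
    8887: [36.44, 75.00, 185.24],
    13462: [639.45, 656.10, 237.00],
    47249: [0, 24, 2012],
    49528: [1189.20, 2326.26, 1695.89],
    57201: [24.67, 0.00, 0.00],
    57206: [0.00, 661.98, 2957.68],
}

# Tag each element 1 if it is greater than 100, else 0, keeping the list shape
flags = {acct: [int(v > 100) for v in vals] for acct, vals in amounts.items()}

for acct, tagged in flags.items():
    print(acct, tagged)
```

The result has one 0/1 list per account, the same shape as the amountOfTxn_testArray column the question asks for.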
I have an array of durations (in seconds), and I'd like to split it in MATLAB into groups whose accumulated duration does not exceed 3600 seconds. The durations are in order.
Input:
Duration(s) | 2010 1000 500 1030 80 2030 1090
With an:
------------- ------------ ----
Accumulated duration (s) | 3510 3140 1090
------------- ------------ ----
1st group 2nd group 3rd
Output:
Groups index | 1 1 1 2 2 2 3
I've tried some scripts, but they take too long, and I have a lot of data to process.
Here is a vectorized way using bsxfun and cumsum:
durations = [2010 1000 500 1030 80 2030 1090]
stepsize = 3600;
idx = sum(bsxfun(@ge, cumsum(durations), (0:stepsize:sum(durations)).'),1)
idx =
1 1 1 2 2 2 3
The accumulated durations you can then get with:
accDuration = accumarray(idx(:),durations(:),[],@sum).'
accDuration =
3510 3140 1090
Explanation:
%// cumulative sum of all durations
csum = cumsum(durations);
%// thresholds
threshs = 0:stepsize:sum(durations);
%// comparison
comp = bsxfun(@ge, csum(:).',threshs(:)) %'
comp =
1 1 1 1 1 1 1
0 0 0 1 1 1 1
0 0 0 0 0 0 1
%// get index
idx = sum(comp,1)
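The same threshold-counting idea can be sketched in plain Python (a neutral illustration, not MATLAB; `threshold_groups` is a hypothetical name). One caveat is worth stating: because the thresholds are fixed multiples of the step applied to the running total, the result matches the requested grouping for this input, but in general a group can add up to more than the step. For example, [3000 1000 3000] gives groups 1 2 2, and the second group totals 4000.

```python
from itertools import accumulate


def threshold_groups(durations, step=3600):
    """Group index = number of thresholds 0, step, 2*step, ... reached by the cumulative sum."""
    csum = list(accumulate(durations))
    thresholds = range(0, csum[-1] + 1, step)
    # Same comparison as bsxfun(@ge, csum, thresholds) followed by a column sum
    return [sum(c >= t for t in thresholds) for c in csum]


print(threshold_groups([2010, 1000, 500, 1030, 80, 2030, 1090]))
print(threshold_groups([3000, 1000, 3000]))
```

If every group must stay at or below the limit, the accumulator has to restart at each new group rather than use the global running total.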
This will get you close . . .
durs = [2010 1000 500 1030 80 2030 1090];
cums = cumsum(durs);
t = 3600;
idx = zeros(size(durs));
while ~all(idx)
idx = idx + (cums <= t);
cums = cums - max(cums(cums <= t));
end
You can then get the output into your preferred format with a simple . .
idx = -(idx-max(idx)-1)
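The loop above restarts the running total each time a group is closed. That greedy rule can be written as a single pass, sketched here in Python (a neutral illustration, not MATLAB; `greedy_groups` is a hypothetical name). It assumes that a single duration larger than the limit simply gets a group of its own.

```python
def greedy_groups(durations, limit=3600):
    """Assign consecutive durations to groups whose totals stay within the limit."""
    groups, totals = [], []
    for d in durations:
        if totals and totals[-1] + d <= limit:
            totals[-1] += d    # still fits in the current group
        else:
            totals.append(d)   # start a new group with this duration
        groups.append(len(totals))
    return groups, totals


groups, totals = greedy_groups([2010, 1000, 500, 1030, 80, 2030, 1090])
print(groups)
print(totals)
```

This visits each element once, so it scales linearly with the number of durations.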
And in case you don't have enough options yet, here is another way to do it:
durations = [2010 1000 500 1030 80 2030 1090] ;
stepsize = 3600;
cs = cumsum(durations) ;
idxbeg = [1 find(sign([1 diff(mod(cs,stepsize))])==-1)] ; %// first index of each group
idxend = [idxbeg(2:end)-1 numel(durations)] ; %// last index of each group
groupDuration = [cs(idxend(1)) diff(cs(idxend))]
groupIndex = cell2mat( arrayfun(@(x,y) repmat(x,1,y), 1:numel(idxbeg) , idxend-idxbeg+1 , 'uni',0) )
groupDuration =
3510 3140 1090
groupIndex =
1 1 1 2 2 2 3
Although, if you ask me, I find the bsxfun solution more elegant.