SAS Newcomer: Data arrangement for a ProcFreq - database

Looking at running a ProcFreq on the following snippet of data
SampleData
Looking to find out the proportion of MutYes to MutNo by gene, comparing/controlling cancer.
Here's the code that I've got so far:
Proc Freq data=polysorted;
by Gene;
weight Status;
table MutYes*MutNo /chisq ;
run;
My question is how do I need to rearrange the data to make this work correctly. Right now, it's giving me:
ERROR: Variable Status in list does not match type prescribed for this list.
Trying to get a layout like this:
Layout
clearly outlining the proportion of MutYes to MutNo by control and cancer for each gene

You need a variable (resp) with the values Yes/No and a WEIGHT variable (y) that holds the counts:
data mut;
do gene = 'ATPhase6','ATPhase8';
do status = 'Control','PC';
do resp = 'Yes','No';
input Y @@;
output;
end;
end;
end;
cards;
29 236
21 169
6 259
13 177
;;;;
run;
proc print;
run;
proc freq data=mut order=data;
by gene;
table status*resp / cmh;
weight y;
run;
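If the counts already sit in a dataset like the question's polysorted, the same long layout can be built from it instead of retyping the numbers. This is only a sketch: it assumes polysorted has one row per Gene and Status, with numeric count columns MutYes and MutNo and a character Status (which is what the WEIGHT error suggests), and that it is sorted by Gene. The names mut_long, resp and y are introduced here for illustration.

```sas
/* Assumption: polysorted has Gene, Status (character), MutYes, MutNo (numeric counts) */
data mut_long;
  set polysorted;
  length resp $3;
  /* write one row per response level, carrying the count in y */
  resp = 'Yes'; y = MutYes; output;
  resp = 'No';  y = MutNo;  output;
  keep Gene Status resp y;
run;

/* Same analysis as above, now on the reshaped data */
proc freq data=mut_long order=data;
  by Gene;
  table Status*resp / chisq;
  weight y;
run;
```

The WEIGHT statement needs a numeric variable, which is why the counts go in y and Status moves into the TABLE statement.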

Related

Can I use array based processing to add additional column(s)? SAS

I have a dataset (a) that looks like this:
Name Value
Cost_1 28
Cost_2 22
Unit_1 Fixed
Unit_2 C
Is it possible to use an array to have a dataset that looks like this:
Name Cat_1 Cat_2
Cost 28 22
Unit Fixed C
%let Cat_Count = 2;
data b;
set a;
array category [&Cat_Count] cat_1-cat_&Cat_count;
.
.
.
run;
Not sure how to execute this...the macro variable cat_count will be dynamic.
You can use arrays, but a transpose is more efficient.
First create a new column that separates name into the name and count and then use a proc transpose.
data have;
input Name $ Value $;
cards;
Cost_1 28
Cost_2 22
Unit_1 Fixed
Unit_2 C
;;;;
run;
data have_cat;
set have;
cat = input(scan(name, 2, "_"), 8.); *numeric conversion not required for this approach but for array approach;
name = scan(name, 1, "_");
run;
proc sort data=have_cat;
by name cat value;
run;
proc transpose data=have_cat out=want prefix=cat_;
by name;
id cat;
var value;
run;
Array method - requires everything before PROC TRANSPOSE, plus the Cat_Count macro variable.
%let Cat_Count = 2;
data want_array;
set have_cat;
by name;
array category(&cat_count) $ cat_1-cat_&cat_count;
retain cat_1-cat_&cat_count;
if first.name then
call missing(of category (*));
category(cat) = value;
if last.name then output;
run;
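Since the question says Cat_Count will be dynamic, one option is to derive it from the data rather than hard-coding it. A minimal sketch, assuming have_cat from the steps above already exists and that cat holds the numeric suffix:

```sas
/* Store the largest category number in the macro variable Cat_Count */
proc sql noprint;
  select max(cat) into :Cat_Count trimmed
  from have_cat;
quit;

/* Write the value to the log as a quick check */
%put NOTE: Cat_Count=&Cat_Count;
```

Run this before the array data step, in place of the %let statement. The TRIMMED keyword strips the padding blanks that INTO would otherwise keep.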

SAS array unable to process long list of variables

I am trying to log, square, cubic and log-odds transform my input data to provide an exhaustive overview of the best performing transformation in univariate regression
I have tried the following code on a dataset with 1,000 variables - It returns an error / runs out of memory or simply cannot execute. Are there any limitations with transforming variables en-masse in this way using arrays?
/*Create a table for reference*/
DATA input_data;
ARRAY var_[*] var_1-var_1000;
DO i = 1 to 1000;
DO i = 1 to 1000;
var_(i)= i*j;
output;
END;
END;
RUN;
/*Log, square, cubic, logit transform all variables*/
DATA input_transform;
SET input_data;
ARRAY var[*] var_1-var_1000;
ARRAY log[*] log_1-log_1000;
ARRAY logit[*] logit_1-logit_1000;
ARRAY sq[*] sq_1-sq_1000;
ARRAY cubic[*] cubic_1-cubic_1000;
DO i = 1 to 1000;
log(i) = log(var(i));
logit(i) = log((var(i))/(1-var(i)));
sq(i) = var(i)**2;
cubic(i) = var(i)**3;
END;
RUN;
A new dataset with 5000 variables each with the respective transformation
You are using I as the index variable for both of your two nested DO loops. That is probably messing them up.
Also your first data step is writing 1,000,000 observations of 1,002 variables with only the lower left triangle of the "array" filled in. Do you really want the OUTPUT statement inside the loop?
Hypothetically there are no issues with this, as long as your code is correct. Here's an example and the log.
option notes;
%let size=1000;
/*Create a table for reference*/
DATA input_data;
ARRAY var_[*] var_1-var_&size.;
DO i = 1 to &size.;
DO j = 1 to &size.;
var_(j)= i*j;
END;
output;
END;
RUN;
/*Log, square, cubic, logit transform all variables*/
DATA input_transform;
SET input_data;
ARRAY _var[*] var_1-var_&size.;
ARRAY _log[*] log_1-log_&size.;
ARRAY _logit[*] logit_1-logit_&size.;
ARRAY _sq[*] sq_1-sq_&size.;
ARRAY _cubic[*] cubic_1-cubic_&size.;
DO i = 1 to &size.;
_log(i) = log(_var(i));
_logit(i) = sqrt(_var(i));
_sq(i) = _var(i)**2;
_cubic(i) = _var(i)**3;
END;
RUN;
and the log:
1576 option notes;
1577 %let size=1000;
1578
1579 /*Create a table for reference*/
1580 DATA input_data;
1581 ARRAY var_[*] var_1-var_&size.;
1582
1583 DO i = 1 to &size.;
1584 DO j = 1 to &size.;
1585 var_(j)= i*j;
1586 END;
1587 output;
1588 END;
1589 RUN;
NOTE: The data set WORK.INPUT_DATA has 1000 observations and 1002
variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
1590
1591 /*Log, square, cubic, logit transform all variables*/
1592 DATA input_transform;
1593 SET input_data;
1594 ARRAY _var[*] var_1-var_&size.;
1595 ARRAY _log[*] log_1-log_&size.;
1596 ARRAY _logit[*] logit_1-logit_&size.;
1597 ARRAY _sq[*] sq_1-sq_&size.;
1598 ARRAY _cubic[*] cubic_1-cubic_&size.;
1599
1600 DO i = 1 to &size.;
1601 _log(i) = log(_var(i));
1602 _logit(i) = sqrt(_var(i));
1603 _sq(i) = _var(i)**2;
1604 _cubic(i) = _var(i)**3;
1605 END;
1606 RUN;
NOTE: There were 1000 observations read from the data set
WORK.INPUT_DATA.
NOTE: The data set WORK.INPUT_TRANSFORM has 1000 observations and 5002
variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.10 seconds

Macro loop in SAS - passing value to condition

I'm a beginner in SAS and I am struggling a bit with the macro loop in SAS. The problem is illustrated by the code below. The task here is to create separate subsets and save them as libraries for later post-processing. Additionally I added graphs for visualization. I am operating on a huge database but for this post I create a sample at the beginning of the code for simplification.
However, it seems that the internal condition (IF ID = i) doesn't filter the data. Instead the internal loop creates empty tables (but with correct names: "SUB1", "SUB2", "SUB3") with a column (variable) called "i".
DATA EXAMPLE;
INPUT ID DATE DDMMYY8. VALUE;
FORMAT DATE DDMMYY8.;
DATALINES;
1 01012011 100
1 01022011 400
1 01032011 678
2 01012011 678
2 01022011 333
2 01032011 333
3 01012011 733
3 01022011 899
3 01032011 999
;
%MACRO filter(number);
%DO i=1 %TO &number;
DATA SUB&i;
SET WORK.EXAMPLE;
IF ID = i;
PROC SGPLOT DATA=SUB&i;
reg x=DATE y=VALUE;
RUN;
%END;
%mend filter;
%filter(3);
If I manually copy and paste the part inside macro and manually change i to numbers 1 to 3 it creates correct graphs. What is wrong in this code? How can I pass the value from the DO statement inside the code?
I am using SAS Studio.
The macro is creating empty data sets because the code that the macro eventually writes contains the subsetting if statement
if ID = i;
Because the data set does not contain a variable i a new variable named i is added to the PDV and the output data sets SUB1, SUB2, SUB3. The default value for i is missing and no ID value is missing, thus no rows pass the test and you get empty data sets. The log will also provide clues to the situation:
NOTE: Variable i is uninitialized.
When abstracting a code segment for 'macroization' be sure to use & in front of the macro variables. Thus, when the macro contains
if ID = &i;
The eventual code written by the macro system will have your 3 similar code operations with the different values of the macro variable.
...
if ID = 1;
...
...
if ID = 2;
...
...
if ID = 3;
...
Right now you are producing the same graph three times because the datasets SUB1, SUB2, SUB3 all use the same set of data. That is because the only thing in your data step that depends on the value of the macro variable I is the name.
You are currently selecting the observations where the variable ID matches the variable I. Perhaps you meant to select the observations where the variable ID matches the macro variable used in the %DO loop?
IF ID = &i;
One tip for debugging your macro code is to add the statement
options mprint;
This will show the code that SAS is actually using.
For example in the log:
70 options mprint;
71 %MACRO filter(number);
72 %DO i=1 %TO &number;
73 DATA SUB&i;
74 SET WORK.EXAMPLE;
75 IF ID = &i;
76 PROC SGPLOT DATA=SUB&i;
77 reg x=DATE y=VALUE;
78 RUN;
79 %END;
80 %mend filter;
81
82 %filter(2);
MPRINT(FILTER): DATA SUB1;
MPRINT(FILTER): SET WORK.EXAMPLE;
MPRINT(FILTER): IF ID = 1;
NOTE: There were 9 observations read from the data set WORK.EXAMPLE.
NOTE: The data set WORK.SUB1 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
MPRINT(FILTER): PROC SGPLOT DATA=SUB1;
MPRINT(FILTER): reg x=DATE y=VALUE;
MPRINT(FILTER): RUN;

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
set class_ranks nobs=nobs;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
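A caveat on the second step above: class_ranks was already filtered by the WHERE= option, so NOBS= there counts the surviving rows rather than the original ones, while the ranks still run up to the original row count. A minimal sketch that avoids this, assuming no missing values in height or weight (class_ranks_all, class_trimmed and n_total are names introduced here for illustration):

```sas
/* Rank without filtering, so NOBS= below is the full row count */
proc rank data=sashelp.class out=class_ranks_all;
  var height weight;
  ranks height_r weight_r;
run;

data class_trimmed;
  set class_ranks_all nobs=n_total;
  /* keep rows whose ranks fall outside the bottom 4 and the top 4 for both variables */
  if 4 < height_r <= n_total - 4
     and 4 < weight_r <= n_total - 4;
  drop height_r weight_r;
run;
```

With the default TIES=MEAN, tied values share a fractional rank, so a tie straddling the cutoff can remove slightly more or fewer than four rows per tail.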
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro variable (counting null as a distinct value).
In your data step you can use first.foo and the macro variable to output only the rows that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

SAS: sum all values except one

I'm working in SAS and I'm trying to sum all observations, leaving out one each time.
For example, if I have:
Count Name Grade
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
I want to output a value for Sam that is the sum of all grades but his own, and a value for Adam that is a sum of all grades but his own - etc.
Any ideas? Thanks!
You can do it in a single proc sql instead, using the keyword calculated:
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
proc sql;
create table want as
select *, sum(grade) as all_grades, calculated all_grades-grade as minus_grade
from have;
quit;
Here's a nearly one pass solution (it will be about the same speed as a one pass solution if the dataset fits in the read buffer). I actually calculate the mean here instead of just the sum, as I feel that's a more interesting result (and the sum is of course the mean without the division).
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
data want;
retain grademean;
if _n_=1 then do;
do _n_ = 1 to nobs_have;
set have(keep=grade) point=_n_ nobs=nobs_have;
gradesum+grade;
end;
grademean=gradesum/nobs_have;
end;
set have;
grade_noti = ((grademean*nobs_have)-grade)/(nobs_have-1);
run;
Calculate the mean, then for each record subtract the portion that record contributed to the mean. This is a super useful technique for stat testing when you want to compare a record to the rest of the population, and you have a complicated class combination where you'd rather do the mean first. In those cases you use PROC MEANS first and then merge it on, then do this subtraction.
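The PROC MEANS variant mentioned above could look roughly like this for the ungrouped case. It is a sketch rather than the answerer's code: grade_stats, grade_sum, grade_n, sum_others and mean_others are names made up here, and a grouped version would add a CLASS or BY statement and merge on the group keys instead.

```sas
/* Step 1: overall sum and count of grade in a one-row dataset */
proc means data=have noprint;
  var grade;
  output out=grade_stats(keep=grade_sum grade_n) sum=grade_sum n=grade_n;
run;

/* Step 2: attach the totals to every row, then remove each row's own contribution */
data want_means;
  if _n_ = 1 then set grade_stats;   /* values read by SET are retained across rows */
  set have;
  sum_others = grade_sum - grade;
  if grade_n > 1 then mean_others = sum_others / (grade_n - 1);
run;
```

For the sample data the grades total 400, so Sam's sum_others is 400 - 90 = 310.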
proc sql;
create table temp as select
sum(grade) as all_grades
from orig_data;
quit;
proc sql;
create table temp2 as select
a.count,
a.name,
a.grade,
(b.all_grades-a.grade) as sum_other_grades
from orig_data a
cross join temp b;
quit;
Haven't tested it but the above should work. It creates a new dataset temp that has the sum of all grades and merges that back to create a new table with the sum of all grades less the current students grade as sum_other_grades.
This solution takes each observation of your starting dataset, and then loops through the same dataset summing up grade values for any records with different names. So beginning with 'Sam', we only add to the oth_g variable when we find names that are NOT 'Sam':
data want;
set have;
oth_g=0;
do i=1 to n;
set have
(keep=name grade rename=(name=name_loop grade=grade_loop))
nobs=n point=i;
if name^=name_loop then oth_g+grade_loop;
end;
drop grade_loop name_loop i n;
run;
This is a slight modification to the answer @Reese provided above.
proc sql;
create table want as
select *,
(select sum(grade) from have) as all_grades,
calculated all_grades - grade as minus_grade
from have;
quit;
I've rearranged it this way to avoid the below message being printed to the log:
NOTE: The query requires remerging summary statistics back with the original data.
If you see the above message, it almost always means that you have made a mistake. If you actually did mean to remerge summary stats back with the original data, you should do so explicitly (as I have done above by refactoring @Reese's query).
Personally I think the refactored version is also easier to understand.
