Making an (iterative?) Do loop in SAS - arrays

I know this is simple but I can't seem to figure it out.
I have a dataset with 50 students, and one of the columns, test score, holds a score for each student. I need to find the difference between every pair of students: student 1 score - student 2 score, student 1 score - student 3 score, ... up to student 50, then student 2 score - student 3 score, ... student 2 score - student 50 score, and so on.
I essentially need to end up with a matrix of differences for the test scores.
I have to use an array- so it would be something like
Data:
Student Score
Alejandro 91
Atkinsin 87
Beal 72
Butler 94
Coleman 91
data array;
set testscores;
array score(50) Score1-Score 50; ?I dont think this is correct
do i=1 to 50;
difference= score(i) -score(i+1)?? I really have no idea everything I try isn't working
end;
run;
I need to end up with something that has the difference between every students scores

The first data step generates uniformly distributed random integer scores for 50 students, using the bounds a = 1 and b = 100.
PROC DISTANCE then computes the distance between each student's score and every other student's score. With METHOD=EUCLID on a single variable, that distance is the absolute difference between the two scores. For this to work, "Student" should be a character variable.
data scores;
a = 1;
b = 100;
do Student_temp = 1 to 50;
Student = compress(put(student_temp, 8.));
u = ranuni(12345);
Score = floor(a + (b-a)*u);
output;
end;
drop Student_temp a b u;
run;
proc distance data=scores out=Diff method=Euclid;
var interval(score);
id Student;
run;
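Since the question asks for an array and a matrix of differences, here is a minimal sketch of a DATA step alternative that keeps the sign of each difference. It assumes the SCORES dataset created above has exactly 50 rows. The names WIDE, DIFF_MATRIX, Score1-Score50 and Diff1-Diff50 are made up for illustration.

```sas
/* Put all 50 scores on one row: Score1-Score50 (assumes exactly 50 students) */
proc transpose data=scores out=wide(drop=_name_) prefix=Score;
  var Score;
run;

data diff_matrix;
  if _n_ = 1 then set wide;      /* Score1-Score50 are retained on every row */
  set scores;                    /* one row per student: Student, Score */
  array s{50} Score1-Score50;
  array d{50} Diff1-Diff50;
  do j = 1 to 50;
    d{j} = Score - s{j};         /* this student's score minus student j's score */
  end;
  drop j Score1-Score50;
run;
```

Each output row holds one student's differences against all 50 students, so the diagonal entries (a student against themselves) are 0.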

Related

Find corresponding variable to a certain value through array

So if I have identified a max value regarding a test result (Highest variable listed below), which occurred during one of the three dates that are being tested (testtime variables listed below), what I want to do is to create a new variable called Highesttime identifying the date when the test was given.
However, I am stuck on the array loop. SAS reports "ERROR: Array subscript out of range at line x", so I guess there is something wrong with the logic. See the code below:
Example:
ID time1_a time_b time_c result_a result_b result_c Highest
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
...
data want;
set origin;
array testtime{3} time1_a time_b time_c;
array maxvalue{1} Highest;
array corr_time{1} Highesttime;
do i=1 to dim(testttime);
corr_time{i}=testttime{i=maxvalue{i}};
end;
run;
There is no need to make an array for HIGHEST since there is only one variable that you would put into that array. In that case just use the variable directly instead of trying to access it indirectly via an array reference.
First let's make an actual SAS dataset out of the listing you provided.
data have;
input ID (time_a time_b time_c) (:mmddyy.) result_a result_b result_c Highest ;
format time_a time_b time_c yymmdd10.;
cards;
001 1/1/22 1/2/22 1/3/22 3 2 4 4
002 12/1/21 12/23/21 1/5/22 6 1 2 6
003 12/22/21 1/6/22 2/2/22 5 5 7 7
;
If you want to loop then you need two arrays. One for times and the other for the values. Then you can loop until you find which index points to the highest value and use the same index into the other array.
data want ;
set have;
array times time_a time_b time_c ;
array results result_a result_b result_c;
do which_one=1 to dim(results) until (not missing(highest_time));
if results[which_one] = highest then highest_time=times[which_one];
end;
format highest_time yymmdd10.;
run;
Or you can avoid the looping by using the WHICHN() function to figure out which of three result variables is the first one that has that HIGHEST value. Then you can use that value as the index into the array of the TIME variables (which in your case have DATE instead of TIME or DATETIME values).
data want ;
set have;
which_one = whichn(highest, of result_a result_b result_c);
array times time_a time_b time_c ;
highest_time = times[which_one];
format highest_time yymmdd10.;
run;
Your code from this question was close, you just had the assignment backwards.
Note that an array method will assign the last date in the case of duplicate high results and WHICHN will report the first date so the answers are not identical unless you modify the loop to exit after the first maximum value is found.
With the changes suggested in the answer proposed:
data temp2_lead_f2022;
set temp_lead_f2022;
array _day {3} daybld_a daybld_b daybld_c;
array _month {3} mthbld_a mthbld_b mthbld_c;
array _dates {3} date1_a date2_b date3_c;
array _pblev{3} pblev_a pblev_b pblev_c;
do i = 1 to 3;
_dates{i} = mdy(_month{i}, _day{i}, 1990);
end;
maxlead= max(of _pblev(*));
do i=1 to 3;
if _pblev{i} = maxlead then max_date=_dates(i);
end;
*Using WHICHN to identify the maximum occurrence;
max_first_index=whichn(maxlead, of _pblev(*));
max_date2 = _dates(max_first_index);
drop i;
format date1_a date2_b date3_c dob mmddyy8. ;
run;
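To make the tie behaviour described above concrete, here is a small sketch in which the first and third results tie for the maximum. The dataset name TIE_DEMO and the variables first_time and first_time2 are hypothetical. The LEAVE statement exits the loop at the first match, so the loop agrees with WHICHN.

```sas
data tie_demo;
  input (time_a time_b time_c) (:mmddyy.) result_a result_b result_c;
  array times   {3} time_a time_b time_c;
  array results {3} result_a result_b result_c;
  highest = max(of results{*});
  /* Loop version: LEAVE stops at the first column holding the maximum */
  do i = 1 to dim(results);
    if results{i} = highest then do;
      first_time = times{i};
      leave;
    end;
  end;
  /* WHICHN version: returns the position of the first matching value */
  first_time2 = times{whichn(highest, of results{*})};
  format time_a time_b time_c first_time first_time2 yymmdd10.;
  drop i;
  cards;
12/22/21 1/6/22 2/2/22 7 5 7
;
```

Without the LEAVE, the loop would keep going and overwrite first_time with the date from the third column.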

SAS Newcomer: Data arrangement for a ProcFreq

Looking at running a ProcFreq on the following snippet of data
SampleData
Looking to find out the proportion of MutYes to MutNo by gene, comparing/controlling cancer.
Here's the code that I've got so far:
Proc Freq data=polysorted;
by Gene;
weight Status;
table MutYes*MutNo /chisq ;
run;
My question is how do I need to rearrange the data to make this work correctly. Right now, it's giving me:
ERROR: Variable Status in list does not match type prescribed for this list.
Trying to get a layout like this:
Layout
clearly outlining the proportion of MutYes to MutNo by control and cancer for each gene
You need a variable (resp) with the values Yes/No and a WEIGHT variable (y) for the counts:
data mut;
do gene = 'ATPhase6','ATPhase8';
do status = 'Control','PC';
do resp = 'Yes','No';
input Y @@; /* double trailing @ holds the line so each iteration reads the next value */
output;
end;
end;
end;
cards;
29 236
21 169
6 259
13 177
;;;;
run;
proc print;
run;
proc freq data=mut order=data;
by gene;
table status*resp / cmh;
weight y;
run;
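The original error arises because WEIGHT expects a numeric count variable, while Status is a character variable. The sample data is not visible here, so the following is only a sketch of the rearrangement. It assumes POLYSORTED has one row per Gene and Status, with MutYes and MutNo holding numeric counts. The names MUT_LONG, resp and y are made up for illustration.

```sas
/* Reshape from one row with two count columns to two rows with one count each */
data mut_long;
  set polysorted;               /* assumed columns: Gene, Status, MutYes, MutNo */
  length resp $3;
  resp = 'Yes'; y = MutYes; output;
  resp = 'No';  y = MutNo;  output;
  keep Gene Status resp y;
run;

proc freq data=mut_long order=data;
  by Gene;                      /* requires the data to be sorted by Gene */
  weight y;                     /* numeric counts, not the Status variable */
  table Status*resp / chisq;
run;
```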

SAS Array Calculations Row Operations

I have a dataset that has a list of contributions of members of a sales organization by day. What I want to ultimately end up with is the following information:
For each day:
How much the entire team sold. ($200 for day one, $350 for day two..)
How much a designated subset ("Joe"...for example) of that team sold (Joe sold $100 day one, $200 day two...)
the difference in the above two calculations ($200-$100 for day one, $350-$200 for day two....)
how many total people contributed that day (2 in day 1, 3 in day two, 5 in day 3)
how many of my designated subset contributed that day (1 every day in this case, since Joe was there every day)
In the example below, Joe is my designated subset. The problem I am having is directing SAS to only sum up Joe's contributions. The method I have below works, but only if Joe is the only contributor AND if he contributes every day. I basically force him to be the first entry, then point to him. This fails if he is not there one day, or if my subset has multiple people.
Below is the attempt I've been working on, but I think I'm going down the wrong path, since it will not be dynamic enough when I add more people. For example, if the subset becomes Joe and Sue, the calculation will still just point to Joe. If I point it to the first two obs, it may accidentally select hal from day one. Is there a way to specify, by row, "only add the Amount column if the name next to it is either Joe or Sue"? Help!
*declare team;
/*%let team=('joe','sue');*/
%let team=('joe');
*input data;
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*getting my team to float to top of order list;
data have;
set have;
if name in &team. then order=1;
else order=2;
run;
*order;
proc sort data=have;
by day order name;
run;
*add running count by day;
data have;
set have;
by day;
x+1;
if first.day then x=1;
run;
*get number of people on team;
proc sql noprint;
select count(distinct name) into :count
from have
where name in &team.;
quit;
*get max of people per day;
proc sql noprint;
select max(x) into :max_freq from have;
quit;
*pre transpose...set labels;
data have;
set have;
varname=cats('Name_',x);
value=name;
output;
varname=cats('Amount_',x);
value=amount;
output;
keep day value varname;
run;
*transpose;
proc transpose data=have out=have_transp(drop=_NAME_);
by day;
id varname;
var value;
run;
data want;
set have_transp;
array Amount {*} Amount:;
TOT_Amount=0;
NUM_TOTAL_PEOPLE=0;
do i=1 to dim(Amount);
if Amount[i]>0
then
do;
TOT_Amount+Amount[i];
NUM_TOTAL_PEOPLE+1;
end;
end;
TEAM_CONTRIB=Amount_1;
NON_TEAM_CONTRIB=TOT_Amount-TEAM_CONTRIB;
run;
A few other things:
Every member of the team will not always be present every day
There are very many possibilities for how many people might be on the total team and/or subset
Here's a way using proc means that doesn't use arrays. Proc means will calculate data at different levels by default when using the CLASS and TYPES statements. The data can then be merged into the appropriate level. In this solution it doesn't matter how many people are in the group/subset or that everyone is present for every day.
/*Subset group*/
data subteam;
input name $;
cards;
joe
sue
;
run;
/*Sample data*/
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*Set group variable for subset team;
data have;
set have;
group=0;
run;
*Set group variable=1 to subset;
proc sql;
update have
set group=1
where name in (select name from subteam);
quit;
*Calculate sums;
proc means data=have;
class day group;
types day day*group;
var amount;
output out=want1 sum=total n=count;
run;
*Reformat into desired format;
data want2;
merge want1 (where=(group=.) rename=(total=total_overall count=count_overall))
want1 (where=(group=1) rename=(total=total_group count=count_group));
by day;
run;
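The question also asks for the difference between the team total and the subset total. A minimal follow-up sketch, assuming WANT2 from the step above, could look like this. The names WANT3 and total_other are hypothetical. COALESCE turns the missing subset values into 0 on days when nobody from the subset contributed.

```sas
data want3;
  set want2;
  total_group = coalesce(total_group, 0);   /* no subset rows that day -> 0 */
  count_group = coalesce(count_group, 0);
  total_other = total_overall - total_group;
  keep day total_overall total_group total_other
       count_overall count_group;
run;
```

With the sample data and a subset of joe and sue, day 1 has total_overall = 200, total_group = 100 and total_other = 100.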

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest values of foo before proceeding (later I would do the same with bar), but I'm unsure of the most effective way to do this. I know there are functions smallest and largest, but I don't understand how to use them to get the smallest 4 or largest 4 from an existing dataset. Alternatively, I could remove the min and max 4 times, but that sounds needlessly tedious and time-consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
if 0 then set sashelp.class nobs=nobs; /* row count of the original data; class_ranks is already filtered */
set class_ranks;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro variable (null values are counted as a distinct value).
In your data step you can use first.foo and the macro variable to output only the observations that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!
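The same idea of sorting and then dropping rows by position can be sketched in a DATA step without hard-coding the row numbers. It uses the HAVE dataset and foo variable from the question. The names SORTED, TRIMMED and n are made up for illustration. Note that this works by row position, so ties are not treated specially and missing values of foo sort first and count among the "smallest".

```sas
proc sort data=have out=sorted;
  by foo;
run;

data trimmed;
  set sorted nobs=n;          /* n = total number of rows in SORTED */
  if 4 < _n_ <= n - 4;        /* keep rows 5 through n-4 only */
run;
```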

SAS: sum all values except one

I'm working in SAS and I'm trying to sum all observations, leaving out one each time.
For example, if I have:
Count Name Grade
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
I want to output a value for Sam that is the sum of all grades but his own, and a value for Adam that is a sum of all grades but his own - etc.
Any ideas? Thanks!
You can do it in a single proc sql instead, using key word calculated:
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
proc sql;
create table want as
select *, sum(grade) as all_grades, calculated all_grades-grade as minus_grade
from have;
quit;
Here's a nearly one pass solution (it will be about the same speed as a one pass solution if the dataset fits in the read buffer). I actually calculate the mean here instead of just the sum, as I feel that's a more interesting result (and the sum is of course the mean without the division).
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
data want;
retain grademean;
if _n_=1 then do;
do _n_ = 1 to nobs_have;
set have(keep=grade) point=_n_ nobs=nobs_have;
gradesum+grade;
end;
grademean=gradesum/nobs_have;
end;
set have;
grade_noti = ((grademean*nobs_have)-grade)/(nobs_have-1);
run;
Calculate the mean, then for each record subtract the portion that record contributed to the mean. This is a super useful technique for stat testing when you want to compare a record to the rest of the population, and you have a complicated class combination where you'd rather do the mean first. In those cases you use PROC MEANS first and then merge it on, then do this subtraction.
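A minimal sketch of the PROC MEANS-then-merge variant just described, for the simple case with no class variables. It uses the HAVE dataset from above. The names TOTALS, WANT_MEANS, all_grades, n_grades, sum_other and mean_other are hypothetical.

```sas
/* One-row summary: total and count of grades */
proc means data=have noprint;
  var grade;
  output out=totals(keep=all_grades n_grades) sum=all_grades n=n_grades;
run;

data want_means;
  if _n_ = 1 then set totals;   /* summary values are retained on every row */
  set have;
  sum_other  = all_grades - grade;           /* sum of everyone else's grades */
  mean_other = sum_other / (n_grades - 1);   /* assumes at least 2 rows */
run;
```

For the sample data the total is 400, so Sam's sum_other is 310 and his mean_other is 77.5.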
proc sql;
create table temp as select
sum(grade) as all_grades
from orig_data;
quit;
proc sql;
create table temp2 as select
a.count,
a.name,
a.grade,
(b.all_grades-a.grade) as sum_other_grades
from orig_data a
cross join temp b;
quit;
Haven't tested it but the above should work. It creates a new dataset temp that has the sum of all grades and merges that back to create a new table with the sum of all grades less the current student's grade as sum_other_grades.
This solution takes each observation of your starting dataset, and then loops through the same dataset summing up grade values for any records with different names. So beginning with 'Sam', we only add to the oth_g variable when we find names that are NOT 'Sam':
data want;
set have;
oth_g=0;
do i=1 to n;
set have
(keep=name grade rename=(name=name_loop grade=grade_loop))
nobs=n point=i;
if name^=name_loop then oth_g+grade_loop;
end;
drop grade_loop name_loop i n;
run;
This is a slight modification to the answer @Reese provided above.
proc sql;
create table want as
select *,
(select sum(grade) from have) as all_grades,
calculated all_grades - grade as minus_grade
from have;
quit;
I've rearranged it this way to avoid the below message being printed to the log:
NOTE: The query requires remerging summary statistics back with the original data.
If you see the above message, it almost always means that you have made a mistake. If you actually did mean to remerge summary stats back with the original data, you should do so explicitly (like I have done above by refactoring @Reese's query).
Personally I think the refactored version is also easier to understand.
