SAS: sum all values except one - loops

I'm working in SAS and I'm trying to sum all observations, leaving out one each time.
For example, if I have:
Count Name Grade
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
I want to output a value for Sam that is the sum of all grades but his own, and a value for Adam that is a sum of all grades but his own - etc.
Any ideas? Thanks!

You can do it in a single PROC SQL step instead, using the calculated keyword:
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
proc sql;
create table want as
select *, sum(grade) as all_grades, calculated all_grades-grade as minus_grade
from have;
quit;

Here's a nearly one pass solution (it will be about the same speed as a one pass solution if the dataset fits in the read buffer). I actually calculate the mean here instead of just the sum, as I feel that's a more interesting result (and the sum is of course the mean without the division).
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
data want;
retain grademean;
if _n_=1 then do;
do _n_ = 1 to nobs_have;
set have(keep=grade) point=_n_ nobs=nobs_have;
gradesum+grade;
end;
grademean=gradesum/nobs_have;
end;
set have;
grade_noti = ((grademean*nobs_have)-grade)/(nobs_have-1);
run;
Calculate the mean, then for each record subtract the portion that record contributed to the mean. This is a super useful technique for stat testing when you want to compare a record to the rest of the population, and you have a complicated class combination where you'd rather do the mean first. In those cases you use PROC MEANS first and then merge it on, then do this subtraction.
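As a rough sketch of the PROC MEANS-then-merge variant described above, assuming a hypothetical class variable named section (not in the original data) and made-up dataset names have_cls, cls_stats and want_cls:

```sas
/* Hypothetical data: the same grades, split into two made-up sections */
data have_cls;
  input section $ name $ grade;
  datalines;
A Sam 90
A Adam 100
A John 80
B Max 60
B Andrea 70
;
run;

/* One row per section with the class mean and count */
proc means data=have_cls noprint nway;
  class section;
  var grade;
  output out=cls_stats(drop=_type_ _freq_) mean=grademean n=graden;
run;

proc sort data=have_cls;
  by section;
run;

/* Merge the class statistics back on and remove each record's own contribution */
data want_cls;
  merge have_cls cls_stats;
  by section;
  /* left missing when the class has only one member */
  if graden > 1 then grade_noti = (grademean*graden - grade)/(graden - 1);
run;
```

For Sam, for example, this gives the mean of the other two grades in section A, (100+80)/2 = 90.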

proc sql;
create table temp as select
sum(grade) as all_grades
from orig_data;
quit;
proc sql;
create table temp2 as select
a.count,
a.name,
a.grade,
(b.all_grades-a.grade) as sum_other_grades
/* temp has exactly one row, so a plain cross join attaches all_grades to every row */
from orig_data a,
temp b;
quit;
Haven't tested it, but the above should work. It creates a new dataset temp that holds the sum of all grades, then joins that single-row table back to the original data to produce sum_other_grades: the sum of all grades less the current student's grade.

This solution takes each observation of your starting dataset and then loops through the same dataset, summing up grade values for any records with a different name. So beginning with 'Sam', we only add to the oth_g variable when we find names that are NOT 'Sam' (this assumes names are unique):
data want;
set have;
oth_g=0;
do i=1 to n;
set have
(keep=name grade rename=(name=name_loop grade=grade_loop))
nobs=n point=i;
if name^=name_loop then oth_g+grade_loop;
end;
drop grade_loop name_loop i n;
run;

This is a slight modification to the answer @Reese provided above.
proc sql;
create table want as
select *,
(select sum(grade) from have) as all_grades,
calculated all_grades - grade as minus_grade
from have;
quit;
I've rearranged it this way to avoid the below message being printed to the log:
NOTE: The query requires remerging summary statistics back with the original data.
If you see the above message, it almost always means that you have made a mistake. If you actually did mean to remerge summary stats back with the original data, you should do so explicitly (as I have done above by refactoring @Reese's query).
Personally I think the refactored version is also easier to understand.

Related

sas loop over month from variable

I am trying to loop over a series of dates in order to create the dates in between. This is to be done in steps of one month, always displaying the last day of the respective month. The start and end dates are given (first_date and last_date), while the last generated date should always be the end of the month before last_date.
The original dataset looks like the following:
customer id first_date last_date
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
...
My goal is to have a dataset which looks like this:
customer id date
xy 135 31.01.2000
xy 135 28.02.2000
...
xy 135 28.02.2005
xy 247 31.03.2003
xy 247 30.04.2003
...
xy 247 28.02.2005
I found the solution to iterate over days quite straightforward (see below), but I am struggling to implement the monthly steps and the end of month dates.
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Thanks for your help!!
First, let's get some data:
data have;
attrib customer length=$10 informat=$10.
id informat=best.
first_date informat=ddmmyy10. format=ddmmyy10.
last_date informat=ddmmyy10. format=ddmmyy10.
;
input customer $
id
first_date
last_date
;
datalines;
xy 135 01.01.2000 25.03.2005
xy 247 19.03.2003 25.03.2005
ab 387 01.06.2010 30.12.2012
ab 128 01.05.2010 28.02.2011
;
run;
The intnx() function will come to the rescue here. We are going to create a new variable called date, and then use the intnx function to return the end of the month for that date. As long as that date is on or before last_date, we will continue to output it to a dataset and then increment to the end of the following month.
data want;
format date ddmmyy10.;
set have;
date = intnx('month',first_date,0,'end');
do while (date le last_date);
output;
date = intnx('month',date,1,'end');
end;
drop first_date last_date;
run;
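To make the two intnx() calls in that step concrete, here is a minimal sketch using a single hard-coded date (the variable names this_month_end and next_month_end are made up for illustration):

```sas
data _null_;
  d = '19MAR2003'd;
  /* an increment of 0 with 'end' alignment snaps to the last day of the same month */
  this_month_end = intnx('month', d, 0, 'end');
  /* an increment of 1 with 'end' alignment gives the last day of the following month */
  next_month_end = intnx('month', d, 1, 'end');
  put this_month_end= date9. next_month_end= date9.;
run;
```

The log should show this_month_end=31MAR2003 next_month_end=30APR2003.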
While I think Rob's answer is the right way to do this, it's probably helpful to see how to do it the way you were trying to.
Starting with this:
data want;
set have;
by customer id;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
This gives you too many rows, right? So what you need to do is identify where in the month you are. There are a bunch of ways to do this. Several date functions (like INTNX and INTCK) could be used to tell you where you are; but the easiest is just to compare month(day) with month(day+1). When they're different, you're on the last day of a month!
data want;
set have;
by customer id notsorted;
do day = first_date to last_date;
if month(day) ne month(day+1) then output;
end;
format day date9.;
run;
(I added notsorted since Rob's example data was not sorted, and I'm lazy. Probably not needed in your real case.)
I would note that this probably isn't your ideal solution in terms of speed; among data step approaches, Rob's probably is. This one iterates through every day rather than just once per month.
Another option if you have the dataset you created above - with one row per day - is to use PROC EXPAND, if you have the ETS module. It's very handy for things like this.
data intermediate;
set have;
by customer id notsorted;
do day = first_date to last_date;
output;
end;
format day date9.;
run;
Here's your day-level data. Then below is the PROC EXPAND statement, asking for monthly data, aligned at the end. id day; identifies the time series variable, and by customer id notsorted; is the normal by statement (what variables identify the observations), with notsorted so they don't have to be in order relative to each other.
proc expand data=intermediate out=want from=day to=month align=end;
id day;
by customer id notsorted;
run;
This gives a slightly different solution than Rob's and my other solution, because it does give you the final row for each if it's not at the end of a month (and does set that final row to the end of the month). If that's desired, great, and our solutions can easily be adapted to give that; if it's not desired, you'll have to remove it afterwards.
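One way the INTNX loop could be adapted to include that final, partial month is to compare against the month-end of last_date rather than last_date itself. This is only a sketch; the dataset names have_demo and want_incl and the single sample row are invented for illustration:

```sas
/* Hypothetical one-row input whose last_date falls mid-month */
data have_demo;
  input customer $ id first_date :ddmmyy10. last_date :ddmmyy10.;
  format first_date last_date ddmmyy10.;
  datalines;
xy 247 19.03.2003 25.06.2003
;
run;

data want_incl;
  set have_demo;
  date = intnx('month', first_date, 0, 'end');
  /* loop up to the month-end of last_date, so the partial last month is kept */
  do while (date le intnx('month', last_date, 0, 'end'));
    output;
    date = intnx('month', date, 1, 'end');
  end;
  format date ddmmyy10.;
  drop first_date last_date;
run;
```

For the sample row this outputs the month-ends from March through June 2003, with the last row set to 30.06.2003 even though last_date is 25.06.2003.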
You can do this with a simple iterative DO loop by using the date interval functions. Subtract one from the number of intervals to make it end at the last day of the previous month.
data want ;
set have ;
do offset=0 to intck('month',first_date,last_date)-1;
date=intnx('month',first_date,offset,'e');
output;
end;
format date yymmdd10.;
run;
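The "subtract one" step relies on how intck() counts. With its default method it counts month boundaries crossed, not elapsed full months, which a short sketch can show (n1 and n2 are just illustrative names):

```sas
data _null_;
  /* one day apart, but a month boundary is crossed, so the count is 1 */
  n1 = intck('month', '31JAN2000'd, '01FEB2000'd);
  /* first customer in the example: January 2000 through March 2005 */
  n2 = intck('month', '01JAN2000'd, '25MAR2005'd);
  put n1= n2=;
run;
```

The log should show n1=1 n2=62, so offsets 0 to 61 run from 31.01.2000 through 28.02.2005, the end of the month before last_date.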

SAS Array Calculations Row Operations

I have a dataset that has a list of contributions of members of a sales organization by day. What I want to ultimately end up with is the following information:
For each day:
How much the entire team sold. ($200 for day one, $350 for day two..)
How much a designated subset ("Joe"...for example) of that team sold (Joe sold $100 day one, $200 day two...)
the difference in the above two calculations ($200-$100 for day one, $350-$200 for day two....)
how many total people contributed that day (2 in day 1, 3 in day two, 5 in day 3)
how many of my designated subset contributed that day (1 every day in this case, since Joe was there every day)
In the example below, Joe is my designated subset. The problem I am having is directing SAS to only sum up Joe's contributions. The method I have below works, but only if Joe is the only contributor AND if he contributes every day. I basically force him to be the first entry, then point to him. This fails if he is not there one day, or if my subset has multiple people.
Below is the attempt I've been working on, but I think I'm going down the wrong path, since this will not be dynamic enough when I add more people. For example, if the subset now becomes Joe and Sue, the calculation will still just point to Joe. If I point it to the first two obs, it may select hal accidentally from day one. Is there a way to specify by row "Only add the Amount column if the name next to it is either Joe or Sue"? Help!
*declare team;
/*%let team=('joe','sue');*/
%let team=('joe');
*input data;
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*getting my team to float to top of order list;
data have;
set have;
if name in &team. then order=1;
else order=2;
run;
*order;
proc sort data=have;
by day order name;
run;
*add running count by day;
data have;
set have;
by day;
x+1;
if first.day then x=1;
run;
*get number of people on team;
proc sql noprint;
select count(distinct name) into :count
from have
where name in &team.;
quit;
*get max of people per day;
proc sql noprint;
select max(x) into :max_freq from have;
quit;
*pre transpose...set labels;
data have;
set have;
varname=cats('Name_',x);
value=name;
output;
varname=cats('Amount_',x);
value=amount;
output;
keep day value varname;
run;
*transpose;
proc transpose data=have out=have_transp(drop=_NAME_);
by day;
id varname;
var value;
run;
data want;
set have_transp;
array Amount {*} Amount:;
TOT_Amount=0;
NUM_TOTAL_PEOPLE=0;
do i=1 to dim(Amount);
if Amount[i]>0
then
do;
TOT_Amount+Amount[i];
NUM_TOTAL_PEOPLE+1;
end;
end;
TEAM_CONTRIB=Amount_1;
NON_TEAM_CONTRIB=TOT_Amount-TEAM_CONTRIB;
run;
A few other things:
Every member of the team will not always be present every day
There are very many possibilities for how many people might be on the total team and/or subset
Here's a way using proc means that doesn't use arrays. Proc means will calculate data at different levels when using the CLASS and TYPES statements. The data can then be merged into the appropriate level. In this solution it doesn't matter how many people are in the group/subset or whether everyone is present every day.
/*Subset group*/
data subteam;
input name $;
cards;
joe
sue
;
run;
/*Sample data*/
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*Set group variable for subset team;
data have;
set have;
group=0;
run;
*Set group variable=1 to subset;
proc sql;
update have
set group=1
where name in (select name from subteam);
quit;
*Calculate sums;
proc means data=have;
class day group;
types day day*group;
var amount;
output out=want1 sum=total n=count;
run;
*Reformat into desired format;
data want2;
merge want1 (where=(group=.) rename=(total=total_overall count=count_overall))
want1 (where=(group=1) rename=(total=total_group count=count_group));
by day;
run;
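The question also asked for the difference between the team total and the subset total. As an alternative sketch (not the approach above), a single PROC SQL query with CASE expressions can produce all five figures per day. It reuses the &team macro variable from the question; the names sales_demo and want_sql are made up here:

```sas
%let team=('joe','sue');

/* Same sample data as in the question, under a hypothetical name */
data sales_demo;
  input day name $ amount;
  cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;

proc sql;
  create table want_sql as
  select day,
         sum(amount) as total_overall,
         /* only rows whose name is in the subset contribute to these two sums */
         sum(case when name in &team. then amount else 0 end) as total_group,
         calculated total_overall - calculated total_group as non_team_contrib,
         count(*) as count_overall,
         sum(case when name in &team. then 1 else 0 end) as count_group
  from sales_demo
  group by day;
quit;
```

Because absent subset members simply contribute nothing to the CASE sums, a day with no subset members gets 0 rather than a missing value.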

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest of foo (later then I would do a similar thing with bar) basically to proceed but I'm unsure the most effective way to do this. I know there are functions smallest and largest but I don't understand how I can use them to get the smallest 4 or largest 4 from an already made dataset. I guess alternatively I could just remove the min and max 4 times but that sounds needlessly tedious/time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
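/* take the count from the unfiltered data; class_ranks has already lost rows, so its own nobs is too small */
if 0 then set sashelp.class nobs=nobs;
set class_ranks;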
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
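A minimal sketch of that "set to missing instead of deleting" variant, assuming no missing values in the ranked variables (true for sashelp.class) and using a made-up output name class_trimmed:

```sas
/* Rank without filtering, so nobs below is the full observation count */
proc rank data=sashelp.class out=class_ranks;
  var height weight;
  ranks height_r weight_r;
run;

data class_trimmed;
  set class_ranks nobs=nobs;
  /* blank out the value, but keep the row, when it is among the 4 lowest or 4 highest ranks */
  if height_r le 4 or height_r ge (nobs-3) then call missing(height);
  if weight_r le 4 or weight_r ge (nobs-3) then call missing(weight);
  drop height_r weight_r;
run;
```

PROC RANK assigns tied values their mean rank by default, so when ties straddle the cutoff this may blank slightly more or fewer than exactly four values at each end.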
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In your data step you can use first.foo and the macro var to selectively output only those that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

How can I "define" SAS data sets using macro variable and write to them using an array

My source data contains 200,000+ observations, one of the many variables in the data set is "county." My goal is to write a macro that will take this one data set as an input, and split them into 58 different temporary data sets for each of the California counties.
First question is if it is possible to specify the 58 counties on the data statement using something like a global reference array defined beforehand.
Second question is, assuming the output data sets have been properly specified on the data statement, is it possible to use a do loop to choose the right data set to write to?
I can get the comparison to work properly, but cannot seem to use an array reference to specify an output data set. This is most likely because I need more experience with the macro environment!
Please see below for the simplistic skeleton framework I have written so far. c_long array contains the names of each of the counties, c_short array contains a 3 letter abbreviation for each of the counties. Thanks in advance!
data splitraw;
length county_name $15;
infile "&path/random.csv" dsd firstobs=2;
input county_name $ number;
run;
%macro _58countysplit(dxtosplit,countycol);
data <need to specify 58 data sets here named something like &dxtosplit_ALA, &dxtosplit_ALP, etc..>;
set &dxtosplit;
do i=1 to 58;
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
end;
run;
%mend _58countysplit;
%_58countysplit(splitraw,county_name);
The code you provided will need to run through the large dataset 58 times, each time writing a small one. I have done it a bit differently.
First I create a sample dataset with a variable "county" this will contain ten different values:
data large;
attrib county length=$12;
do i=1 to 10000;
county=put(mod(i,10)+1,ROMAN.);
output;
end;
run;
First, I start with finding all the unique values and constructing the names of all the different tables I would like to create:
proc sql noprint;
select distinct compbl("large_"!!county) into :counties separated by " "
from large;
quit;
Now I have a macro variable "counties" that contains the names of all the different datasets I want to create.
Here I am writing the IF-statements to a file:
filename x temp;
data _null_;
attrib county length=$12 ds length=$18;
file x;
i=1;
do while(scan("&counties",i," ") ne "");
ds=scan("&counties",i," ");
county=scan(ds,-1,"_");
put "if county=""" county +(-1) """ then output " ds ";";
i+1;
end;
run;
Now I have what I need to create the small datasets:
data &counties;
set large;
%inc x;
run;
I agree with user667489, there is almost always a better way than splitting one large data set into many small data sets. However, if you want to proceed along these lines there is a table in sashelp called vcolumn which lists all your libraries, their tables, and each column (in each table) that should help you. Also if you want
if c_long{i}=&countycol then output &dxtosplit._&c_short{i};
to resolve you might mean:
if c_long{i}=&countycol then output &&dxtosplit._&c_short{i};
It's likely, depending upon what you're actually trying to do, that BY processing is all you need. Nevertheless, here is a simple solution:
%macro split_by(data=, splitvar=);
%local dslist iflist;
proc sql noprint;
select distinct cats("&splitvar._", &splitvar)
into :dslist separated by ' '
from &data;
select distinct
catt("if &splitvar='", &splitvar, "' then output &splitvar._", &splitvar, ";", '0A'x)
into :iflist separated by "else "
from &data;
quit;
data &dslist;
set &data;
&iflist
run;
%mend split_by;
Here is some test data to illustrate:
options mprint;
data test;
length county $1 val $1;
input county val;
infile cards;
datalines;
A 2
B 3
A 5
C 8
C 9
D 10
;
run;
%split_by(data=test, splitvar=county)
And you can view the log to see how the macro generates the DATA step you want:
MPRINT(SPLIT_BY): proc sql noprint;
MPRINT(SPLIT_BY): select distinct cats("county_", county) into :dslist separated by ' ' from test;
MPRINT(SPLIT_BY): select distinct catt("if county='", county, "' then output county_", county, ";", '0A'x) into :iflist separated
by "else " from test;
MPRINT(SPLIT_BY): quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
MPRINT(SPLIT_BY): data county_A county_B county_C county_D;
MPRINT(SPLIT_BY): set test;
MPRINT(SPLIT_BY): if county='A' then output county_A;
MPRINT(SPLIT_BY): else if county='B' then output county_B;
MPRINT(SPLIT_BY): else if county='C' then output county_C;
MPRINT(SPLIT_BY): else if county='D' then output county_D;
MPRINT(SPLIT_BY): run;
NOTE: There were 6 observations read from the data set WORK.TEST.
NOTE: The data set WORK.COUNTY_A has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_B has 1 observations and 2 variables.
NOTE: The data set WORK.COUNTY_C has 2 observations and 2 variables.
NOTE: The data set WORK.COUNTY_D has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.05 seconds
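For comparison, the BY-processing alternative mentioned at the start of this answer keeps everything in one dataset and lets each procedure handle the groups. A rough sketch using sashelp.class, with sex standing in for county (class_sorted is a made-up name):

```sas
/* BY-group processing needs the data sorted (or indexed) by the BY variable */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

/* One set of statistics per BY group, with no need to split the dataset */
proc means data=class_sorted n mean;
  by sex;
  var height;
run;
```

Most procedures and the DATA step accept a BY statement in the same way, which is often enough to avoid creating one dataset per county.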

SAS Looping: Summing observations vertically based on conditionally

I have a dataset that looks like:
Zip Codes Total Cars
11111 3
11111 4
23232 1
44331 0
44331 10
18860 6
18860 6
18860 6
18860 8
There are 3 million+ rows just like this, with varying zips. I need to sum total cars for each zip code, such that the resulting table looks like
Zip Codes Total Cars
11111 7
23232 1
44331 10
18860 26
.
.
.
Manually inputting zips into the code is not an option considering the size of the dataset. Thoughts?
Both answers so far are OK, but here is a more detailed explanation of both possible methods:
PROC SQL METHOD
PROC SQL;
CREATE TABLE output_table AS
SELECT ZipCodes,
SUM(Total_Cars) as Total_Cars
FROM input_table
GROUP BY ZipCodes;
QUIT;
The GROUP BY clause can also be written GROUP BY 1, omitting ZipCodes, as this refers to the 1st column in the SELECT clause.
PROC SUMMARY METHOD
PROC SUMMARY DATA=input_table NWAY;
CLASS ZipCodes;
VAR Total_Cars;
OUTPUT OUT=output_table (DROP=_TYPE_ _FREQ_) SUM()=;
RUN;
The method is similar to another answer to this question, but I've added:
NWAY - gives only the maximum level of summarisation, here it's not as important because you have only one CLASS variable, meaning there is only one level of summarisation. However, without NWAY you get an extra row showing the total value of Total_Cars across the whole dataset, which is not something you asked for in your question.
DROP=_TYPE_ _FREQ_ - This removes the automatic variables:
_TYPE_ - which shows the level of summarisation (see comment above), which would just be a column containing the value 1.
_FREQ_ - gives a frequency count of the ZipCodes, which although useful, isn't something you wanted in your question.
DATA STEP METHOD
PROC SORT DATA=input_table (RENAME=(Total_Cars = tc)) OUT=_temp;
BY ZipCodes;
RUN;
DATA output_table (DROP=TC);
SET _temp;
BY ZipCodes;
IF first.ZipCodes THEN Total_Cars = 0;
Total_Cars+tc;
IF last.ZipCodes THEN OUTPUT;
RUN;
This is just included for completeness; it's not as efficient, as it requires pre-sorting.
To supplement @mjsqu's answer, for (more) completeness:
data testin;
input Zip Cars;
datalines;
11111 3
11111 4
23232 1
44331 0
44331 10
18860 6
18860 6
18860 6
18860 8
;
PROC TABULATE METHOD
proc tabulate data=testin out=testout
/*drop extra created vars and rename as needed*/
(drop=_type_ _page_ _table_ rename=(Zip='Zip Codes'n Cars_Sum='Total Cars'n));
/*grouping variable, also used to sort output in ascending order*/
class Zip;
/* variable to be analyzed*/
var Cars;
/*sum cars by zip code*/
table Zip, Cars*(sum);
run;
If using Enterprise Guide, this produces a dataset and a results table. To suppress the results and only output a dataset, include this line before "proc tabulate":
ods select none; /*suppress ods output*/
and this after "run":
ods select all; /*restore ods output*/
The variable by which you want to group the sums is "ZipCodes", so that will go into the CLASS statement.
You want to sum Total_cars, so that will go into the VAR statement.
Input_table and Output_table are self-explanatory.
/*Code*/
proc summary data=Input_table;
class ZipCodes;
var Total_cars;
output out=Output_table
sum()=;
run;
You can use proc sql. This is a very simple step:
proc sql;
create table new as
select Zipcodes, sum(Total_Cars) as total_cars from table_have group by Zipcodes
;
quit;

Resources