Creating a summary column in SAS from multiple columns - arrays

I want to create a single column in SAS that summarizes several columns for each individual in a data set. The data looks like the following:
Subject VisitNumber Exam Result Comments
001 1 Blood Negative Will return for more testing
001 1 BP 100 Score is in normal range
001 1 Vision 20/20 No issues with eyesight
002 5 BMI 19 Within healthy range
002 5 Hearing Good Patient hears well
002 5 Drug Negative Subject passed drug test
The information for each subject and their subsequent visit number should be summarized like this:
Subject VisitNumber Summary
001 1 Exam: Blood, Result: Negative, Comments: Will return for more testing; Exam: BP, Result: 100, Comments: Score is normal range; Exam: Vision, Result: 20/20, Comments: No issues with eyesight
002 5 Exam: BMI, Result: 19, Comments: Within healthy range; Exam: Hearing, Result: Good, Comments: Patient hears well; Exam: Drug, Result: Negative, Comments: Subject passed drug test
I can do this in R the following way:
for (i in seq_len(nrow(data)))
{
data$Summary[i] = paste0('Exam: ', data$Exam[i], ', Result: ', data$Result[i], ', Comments: ', data$Comments[i], '; ')
}
The data could then be collapsed row-wise on the Summary column. Any insight as to how this can be done via the DATA step or PROC SQL in SAS would be much appreciated.

Use SAS concatenate functions.
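data want;
set have;
by subject visitnumber notsorted;
length summary $500;
retain summary;
/* reset the accumulator at the start of each subject/visit group */
if first.visitnumber then summary=' ';
summary=catx('; ',summary, catx(', ', catx(' ','Exam:',Exam),catx(' ','Result:',Result),catx(' ','Comments:',Comments)));
if last.visitnumber then output;
keep Subject VisitNumber summary;
run;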

For reporting purposes, PROC PRINT has a special output layout when the BY and ID statements list the same variable names. The groups are separated, and a group's values are not repeated when the group has more than one row.
data have;
/* the & modifier lets Comments contain single embedded blanks */
input Subject $ VisitNumber Exam $ Result $ Comments & $200.;
datalines;
001 1 Blood Negative Will return for more testing
001 1 BP 100 Score is in normal range
001 1 Vision 20/20 No issues with eyesight
002 5 BMI 19 Within healthy range
002 5 Hearing Good Patient hears well
002 5 Drug Negative Subject passed drug test
;
run;
ods html style=Journal;
title "Subject visit examinations";
proc print data=have;
by subject visitnumber;
id subject visitnumber;
run;

Related

How to combine group by, join, COUNT, SUM and subquery clauses in sql

I am not sure how to write the SQL query for the following problem:
There are two tables, Worker and Product (one worker can make many products) which I describe in this link:https://docs.google.com/spreadsheets/d/1Yk2vKKmUEyuN-QfgTEbmF4suHFtuDkkrsUf-wqvOoKQ/edit?fbclid=IwAR3ipjwNrfhGXg3fCyAri4tD1Q4WqWuKVAqagvbsZg9Sn1myDwkWbWcl_6E#gid=0
The calculation of the total salary of a worker at month x is as follows
totalSalary = salaryPerMonth + SUM(salaryPerProduct * COUNT(pid))
I want to use a join (whether INNER, LEFT, or RIGHT JOIN) combined with a GROUP BY clause to solve this problem, but my statements are wrong.
I am hoping for a specific SQL statement for this case.
I hope to be able to express my ideas in this photo.
UPDATE: my picture quality is not good, so I will repost my picture at this link.
@phi nguyễn quốc - Welcome to Stack Overflow. What you posted has the makings of a good question. It contains:
Brief summary of the issue
Table structure, sample data
Explanation of expected results
Code you've tried
It just needs a few modifications to conform to the guidelines and avoid being closed. A few tips on posting:
Help others to help you by including a Minimal, Reproducible Example. (With SQL questions include table definitions and sample data). That way folks who want to help can spend their time answering your question, instead of on writing set-up code to replicate your tables, environment, etc..
Make it easy for others to be able to test your code. Always post code as text, not as an image.
Use collaborative tools like db<>fiddle for sharing
One example of how you might improve the question and avoid it being closed:
Issue:
I am trying to write a SQL query to calculate the total salary for workers for a given month X. There are two tables: [Worker] and [Product]. One worker can make many products.
Worker:
wid   wname   salaryPerMonth   salaryPerProduct   phoneNumber
1     Mr A    500              5
2     Mr B    100              30
3     Mr C    200              20
Product:
pid   pname       manufacturedDate   wid
1     Product A   2013-12-01         1
2     Product B   2013-12-09         1
3     Product C   2013-09-08         1
4     Product D   2013-01-30         2
5     Product E   2013-09-20         2
6     Product F   2013-12-23         3
The "Total Salary" of a worker for month X is calculated as follows:
SalaryPerMonth + ( SalaryPerProduct * Number of Products for Month )
Expected Results: (December 2013)
wid   wname   salaryPerMonth   salaryPerProduct   totalSalary   Formula
1     Mr A    500              5                  510           = 500 + (5*2)
2     Mr B    100              30                 100           = 100 + (30*0)
3     Mr C    200              20                 220           = 200 + (20*1)
Actual Results
I've tried this query
SELECT W.wid, W.wname, W.phoneNumber, W.salaryPerMonth, W.salaryPerProduct, (W.salaryPerMonth - SUM(W.salaryPerMonth*COUNT(p.pid))) AS Total
FROM Worker W INNER JOIN Product P ON p.Wid = W.wid
WHERE MONTH(P.manufacturedDate) = 12
GROUP BY W.wid, W.wname, W.phoneNumber, W.salaryPerMonth, W.salaryPerProduct
.. but am getting the error below:
Msg 130 Level 15 State 1 Line 1
Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
Here is my db<>fiddle
CREATE TABLE Product (
pid int
, pname varchar(40)
, manufacturedDate date
, wid int
);
CREATE TABLE Worker (
wid int
, wname varchar(40)
, salaryPerMonth int
, salaryPerProduct int
, phoneNumber varchar(20)
)
INSERT INTO Product(pid, pname, manufacturedDate, wid)
VALUES
(1,'Product A','2013-12-01',1)
,(2,'Product B','2013-12-09',1)
,(3,'Product C','2013-09-08',1)
,(4,'Product D','2013-01-30',2)
,(5,'Product E','2013-09-20',2)
,(6,'Product F','2013-12-23',3)
;
INSERT INTO Worker (wid, wname, salaryPerMonth,salaryPerProduct)
VALUES
(1,'Mr A', 500, 5)
,(2, 'Mr B', 100, 30)
,(3,'Mr C', 200, 20)
;

Advice on how best to manage this dataset?

I'm new to SAS and would appreciate advice and help on how best to handle this data management situation.
I have a dataset in which each observation represents a client. Each client has a "description" variable which could include either a comprehensive assessment, treatment or discharge. I have created 3 new variables to flag each observation if they contain one of these.
So for example:
treat_yes = 1 if description contains "tx", "treatment"
dc_yes = 1 if description contains "dc", "d/c" or "discharge"
ca_yes = 1 if description contains "comprehensive assessment" or "ca" or "comprehensive ax"
My end goal is to have a new dataset of clients that have gone through a Comprehensive Assessment, Treatment and Discharge.
I'm a little stumped as to what my next move should be here. I have all my variables flagged for clients. But there could be duplicate observations just because a client could have come in many times. So for example:
Client_id treatment_yes ca_yes dc_yes
1234 0 1 1
1234 1 0 0
1234 1 0 1
All I really care about is if for a particular client the variables treatment_yes, ca_yes and dc_yes DO NOT equal 0 (i.e., they each have at least one "1". They could have more than one "1" but as long as they are flagged at least once).
I was thinking my next step might be to collapse the data (how do you do this?) for each unique client ID and sum treatment_yes, dc_yes and ca_yes for each client.
Does that work?
If so, how the heck do I accomplish this? Where do I start?
thanks everyone!
I think the easiest thing to do at this point is to use a proc sql step to find the max value of each of your three variables, aggregated by client_id:
data temp;
input Client_id $ treatment_yes ca_yes dc_yes;
datalines;
1234 0 1 1
1234 1 0 0
1234 1 0 1
;
run;
proc sql;
create table temp_collapse as select distinct
client_id, max(treatment_yes) as treatment_yes,
max(ca_yes) as ca_yes, max(dc_yes) as dc_yes
from temp
group by client_id;
quit;
A better overall approach would be to use the dataset you used to create the _yes variables and do something like max(case when desc = "tx" then 1 else 0 end) as treatment_yes etc., but since you're still new to SAS and understand what you've done so far, I think the above approach is totally sufficient.
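To get from the collapsed table to the stated end goal (clients flagged at least once for all three stages), one possible next step is a simple filter. This is only a sketch: it assumes the temp_collapse table created above, and complete_clients is a hypothetical output name.

```sas
/* Keep only clients whose collapsed flags are all 1,       */
/* i.e. at least one CA, one treatment and one discharge.   */
/* Assumes temp_collapse from the PROC SQL step above;      */
/* complete_clients is a made-up name for illustration.     */
data complete_clients;
  set temp_collapse;
  where treatment_yes = 1 and ca_yes = 1 and dc_yes = 1;
run;
```

With the three sample rows for client 1234, every flag has a maximum of 1, so that client would be kept.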
The following code allows you to preserve other variables from your original dataset. I have added two variables (var1 and var2) for illustrative purposes:
data temp;
input Client_id $ treatment_yes ca_yes dc_yes var1 var2 $;
datalines;
1234 0 1 1 10 A
1234 1 0 0 11 B
1234 1 0 1 12 C
;
run;
Join the dataset with itself so that each row of a client_id in the original dataset is merged with its corresponding row in an aggregated dataset constructed in a subquery.
proc sql;
create table want as
/* list b's columns explicitly so client_id is not selected twice */
select a.*, b.max_treat, b.max_ca, b.max_dc
from temp as a
left join (select client_id,
max(treatment_yes) as max_treat,
max(ca_yes) as max_ca,
max(dc_yes) as max_dc
from temp
group by client_id) as b
on a.client_id=b.client_id;
quit;

SAS Array Calculations Row Operations

I have a dataset that has a list of contributions of members of a sales organization by day. What I want to ultimately end up with is the following information:
For each day:
How much the entire team sold. ($200 for day one, $350 for day two..)
How much a designated subset ("Joe"...for example) of that team sold (Joe sold $100 day one, $200 day two...)
the difference in the above two calculations ($200-$100 for day one, $350-$200 for day two....)
how many total people contributed that day (2 in day 1, 3 in day two, 5 in day 3)
how many of my designated subset contributed that day (1 every day in this case, since Joe was there every day)
In the example below, Joe is my designated subset. The problem I am having is directing SAS to only sum up Joe's contributions. The method I have below works, but only if Joe is the only contributor AND if he contributes every day. I basically force him to be the first entry, then point to him. This fails if he is not there one day, or if my subset has multiple people.
Below is the attempt I've been working on, but I think I'm going down the wrong path, since it will not be dynamic enough when I add more people. For example, if the subset becomes Joe and Sue, the calculation will still just point to Joe. If I point it to the first two obs, it may accidentally select hal from day one. Is there a way to specify by row "Only add the Amount column if the name next to it is either Joe or Sue"? Help!
*declare team;
/*%let team=('joe','sue');*/
%let team=('joe');
*input data;
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*getting my team to float to top of order list;
data have;
set have;
if name in &team. then order=1;
else order=2;
run;
*order;
proc sort data=have;
by day order name;
run;
*add running count by day;
data have;
set have;
by day;
x+1;
if first.day then x=1;
run;
*get number of people on team;
proc sql noprint;
select count(distinct name) into :count
from have
where name in &team.;
quit;
*get max of people per day;
proc sql noprint;
select max(x) into :max_freq from have;
quit;
*pre transpose...set labels;
data have;
set have;
varname=cats('Name_',x);
value=name;
output;
varname=cats('Amount_',x);
value=amount;
output;
keep day value varname;
run;
*transpose;
proc transpose data=have out=have_transp(drop=_NAME_);
by day;
id varname;
var value;
run;
data want;
set have_transp;
array Amount {*} Amount:;
TOT_Amount=0;
NUM_TOTAL_PEOPLE=0;
do i=1 to dim(Amount);
if Amount[i]>0
then
do;
TOT_Amount+Amount[i];
NUM_TOTAL_PEOPLE+1;
end;
end;
TEAM_CONTRIB=Amount_1;
NON_TEAM_CONTRIB=TOT_Amount-TEAM_CONTRIB;
run;
A few other things:
Every member of the team will not always be present every day
There are very many possibilities for how many people might be on the total team and/or subset
Here's a way using proc means that doesn't use arrays. Proc means calculates statistics at different levels when you use the CLASS and TYPES statements. The results can then be merged at the appropriate level. In this solution it doesn't matter how many people are in the group/subset or whether everyone is present every day.
/*Subset group*/
data subteam;
input name $;
cards;
joe
sue
;
run;
/*Sample data*/
data have;
input day name $ amount;
cards;
1 hal 100
1 joe 100
2 joe 80
2 sue 70
2 jim 200
3 joe 50
3 sue 100
3 ted 200
3 tim 100
3 wen 5000
;
run;
*Set group variable for subset team;
data have;
set have;
group=0;
run;
*Set group variable=1 to subset;
proc sql;
update have
set group=1
where name in (select name from subteam);
quit;
*Calculate sums;
proc means data=have;
class day group;
types day day*group;
var amount;
output out=want1 sum=total n=count;
run;
*Reformat into desired format;
data want2;
merge want1 (where=(group=.) rename=(total=total_overall count=count_overall))
want1 (where=(group=1) rename=(total=total_group count=count_group));
by day;
run;
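The question also asks for the difference between the team total and the subset total. A minimal sketch of that last step, assuming the want2 data set built above (want3 and total_other are hypothetical names introduced here):

```sas
/* Assumes want2 from the merge step above.                      */
/* want3 and total_other are illustrative names, not from the    */
/* original answer.                                              */
data want3;
  set want2;
  /* Days with no subset member have no group=1 row, so the      */
  /* group columns are missing after the merge; treat them as 0. */
  total_group = coalesce(total_group, 0);
  count_group = coalesce(count_group, 0);
  total_other = total_overall - total_group;
  keep day total_overall total_group total_other
       count_overall count_group;
run;
```

With joe and sue as the subset, day 1 of the sample data has an overall total of 200 and a subset total of 100, leaving 100 for everyone else.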

SAS: How can I filter for (multiple) entries which are closest to the last day of month (for each month)

I have a large Dataset and want to filter it for all rows with date entry closest to the last day of the month, for each month. So there could be multiple entries for the day closest to the last day of month.
So for instance:
original Dataset
date price name
05-01-1995 1,2 abc
06-01-1995 1,5 def
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
04-02-1995 1,9 jkl
27-02-1995 2,1 mno
goal:
date price name
07-01-1995 1,8 ghi
07-01-1995 1,7 mmm
27-02-1995 2,1 mno
I had 2 ideas, but I am failing with implementing it within a loop (for traversing the months) in SAS.
1.idea: create new column which indicates last day of the current month (intnx() function); then filter for all entries that are closest to the last day of its month:
date price name last_day_of_month
05-01-1995 1,2 abc 31-01-1995
06-01-1995 1,5 def 31-01-1995
07-01-1995 1,8 ghi 31-01-1995
04-02-1995 1,9 jkl 28-02-1995
27-02-1995 2,1 mno 28-02-1995
2.idea: simply filter for each month the entries with highest date (using maybe max function?!)
I would be very glad if you were able to help me, as I am used to ordinary programming languages and just started with SAS for research purposes.
proc sql is one way to solve this kind of situation. I'll break down your original requirements with explanations in how to interpret them in sql.
Since you want to group your observations on date, you can use the having clause to filter on the max date per month.
data work.have;
input date DDMMYY10. price name $;
format date date9.;
datalines;
05-01-1995 1.2 abc
07-01-1995 1.8 ghi
06-01-1995 1.5 def
07-01-1995 1.7 mmm
04-02-1995 1.9 jkl
27-02-1995 2.1 mno
;
data work.want;
input date DDMMYY10. price name $;
format date date9.;
datalines;
07-01-1995 1.8 ghi
07-01-1995 1.7 mmm
27-02-1995 2.1 mno
;
proc sql ;
create table work.solution as
select *
/*, max(date) as max_date format=date9.*/
/*, intnx('month',date,0,'end') as monthend format=date9.*/
from work.have
group by intnx('month',date,0,'end')
having max(date) = date
order by date, name
;
quit;
If you uncomment the comments, the actual filters used are shown in the output table.
Comparing the requirements against the solution:
proc compare base=work.want compare=work.solution;
run;
results in
NOTE: No unequal values were found. All values compared are exactly equal.
1) create a new variable periode = put(date,yymmn6.) /* gives you yyyymm*/
2) sort the table on periode and date
3) now last.periode logic will select the record you need per periode.
Something like...
data tab2;
set your_table;
periode = put(date,yymmn6.);
run;
proc sort data= tab2;
by periode date;
run;
data tab3;
set tab2;
by periode;
if last.periode then output;
run;
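Note that last.periode keeps a single row per month, while the question allows several rows on the same closing date. A sketch of one way to keep the ties, assuming the tab2 data set from the steps above (tab2_desc, tab3_ties and top_date are hypothetical names):

```sas
/* Sort so the latest date of each periode comes first.          */
/* tab2_desc, tab3_ties and top_date are illustrative names.     */
proc sort data=tab2 out=tab2_desc;
  by periode descending date;
run;

data tab3_ties;
  set tab2_desc;
  by periode;
  retain top_date;
  /* Remember the latest date seen for this periode */
  if first.periode then top_date = date;
  /* Output every row that shares that latest date */
  if date = top_date then output;
  drop top_date;
run;
```

For the sample data this would keep both 07-01-1995 rows (ghi and mmm) as well as the 27-02-1995 row.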
You can use two SAS functions called intnx and intck to do this with proc sql:
proc sql ;
create table want as
select *, put(date,yymmn6.) as month, intck('day',date,intnx('month',date,0,'end')) as DaysToEnd
from have
group by calculated month
having calculated DaysToEnd = min(calculated DaysToEnd)
;quit ;
Intnx() adjusts dates by intervals. In the above case, the four parameters used are:
The interval size of each 'step' you want to add/subtract (here 'month')
The date that is being referenced
How many interval steps to make
How to align the result within the interval (e.g. to the start/middle/end of the resulting week/month/year)
Intck() simply counts the interval steps between two dates.
This will give you all records which fall on the day closest to the end of each month.
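As a quick illustration of the two functions, the following sketch evaluates them for one date from the sample data. The variable names d, month_end and days_to_end are made up for this example.

```sas
/* Worked example for a single date; nothing is written to disk. */
data _null_;
  d = '27FEB1995'd;
  /* 0 steps, aligned to the end: last day of d's own month */
  month_end = intnx('month', d, 0, 'end');
  /* number of day boundaries between d and the month end */
  days_to_end = intck('day', d, month_end);
  /* log shows: month_end=28FEB1995 days_to_end=1 */
  put month_end= date9. days_to_end=;
run;
```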
Another approach is by using proc rank;
data mid;
retain yrmth date;
set have;
format date yymmddn8.;
yrmth = put(date,yymmn6.);
run;
proc sort data = mid;
by yrmth descending date;
run;
proc rank data = mid out = want descending ties=low;
by yrmth;
var date;
ranks rankdt;
run;
data want1;
set want;
where rankdt = 1;
run;
HTH

SAS: sum all values except one

I'm working in SAS and I'm trying to sum all observations, leaving out one each time.
For example, if I have:
Count Name Grade
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
I want to output a value for Sam that is the sum of all grades but his own, and a value for Adam that is a sum of all grades but his own - etc.
Any ideas? Thanks!
You can do it in a single proc sql instead, using the keyword calculated:
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
proc sql;
create table want as
select *, sum(grade) as all_grades, calculated all_grades-grade as minus_grade
from have;
quit;
Here's a nearly one pass solution (it will be about the same speed as a one pass solution if the dataset fits in the read buffer). I actually calculate the mean here instead of just the sum, as I feel that's a more interesting result (and the sum is of course the mean without the division).
data have;
input Count Name $ Grade;
datalines;
1 Sam 90
2 Adam 100
3 John 80
4 Max 60
5 Andrea 70
;;;;
run;
data want;
retain grademean;
if _n_=1 then do;
do _n_ = 1 to nobs_have;
set have(keep=grade) point=_n_ nobs=nobs_have;
gradesum+grade;
end;
grademean=gradesum/nobs_have;
end;
set have;
grade_noti = ((grademean*nobs_have)-grade)/(nobs_have-1);
run;
Calculate the mean, then for each record subtract the portion that record contributed to the mean. This is a super useful technique for stat testing when you want to compare a record to the rest of the population, and you have a complicated class combination where you'd rather do the mean first. In those cases you use PROC MEANS first and then merge it on, then do this subtraction.
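The PROC MEANS variant mentioned above could look roughly like the sketch below. It assumes the have data set from this answer; grade_stats, want_means, sum_noti and mean_noti are hypothetical names, and a real case with class variables would merge by those variables instead of attaching a single row.

```sas
/* Step 1: compute the overall sum and count once.               */
/* All output names here are illustrative assumptions.           */
proc means data=have noprint;
  var grade;
  output out=grade_stats(keep=grade_sum grade_n)
         sum=grade_sum n=grade_n;
run;

/* Step 2: attach the single summary row to every observation    */
/* and subtract each record's own contribution.                  */
data want_means;
  if _n_ = 1 then set grade_stats;
  set have;
  sum_noti = grade_sum - grade;
  /* guard against a one-row data set, where no "others" exist */
  if grade_n > 1 then mean_noti = sum_noti / (grade_n - 1);
run;
```

For the sample data the overall sum is 400, so Sam (grade 90) would get sum_noti = 310 and mean_noti = 77.5.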
proc sql;
create table temp as select
sum(grade) as all_grades
from orig_data;
quit;
proc sql;
create table temp2 as select
a.count,
a.name,
a.grade,
(b.all_grades-a.grade) as sum_other_grades
/* temp has a single row, so a plain cross join attaches it to every row */
from orig_data a, temp b;
quit;
Haven't tested it but the above should work. It creates a new dataset temp that has the sum of all grades and merges that back to create a new table with the sum of all grades less the current student's grade as sum_other_grades.
This solution takes each observation of your starting dataset and then loops through the same dataset, summing up grade values for any records with different names. So beginning with 'Sam', we only add to the oth_g variable when we find names that are NOT 'Sam':
data want;
set have;
oth_g=0;
do i=1 to n;
set have
(keep=name grade rename=(name=name_loop grade=grade_loop))
nobs=n point=i;
if name^=name_loop then oth_g+grade_loop;
end;
drop grade_loop name_loop i n;
run;
This is a slight modification to the answer @Reese provided above.
proc sql;
create table want as
select *,
(select sum(grade) from have) as all_grades,
calculated all_grades - grade as minus_grade
from have;
quit;
I've rearranged it this way to avoid the below message being printed to the log:
NOTE: The query requires remerging summary statistics back with the original data.
If you see the above message, it almost always means that you have made a mistake. If you actually did mean to remerge summary stats back with the original data, you should do so explicitly (like I have done above by refactoring @Reese's query).
Personally I think the refactored version is also easier to understand.
