Advice on how best to manage this dataset?

New to SAS and would appreciate advice and help on how best to handle this data management situation.
I have a dataset in which each observation represents a client. Each client has a "description" variable which could include either a comprehensive assessment, treatment or discharge. I have created 3 new variables to flag each observation if they contain one of these.
So for example:
treatment_yes = 1 if description contains "tx" or "treatment"
dc_yes = 1 if description contains "dc", "d/c" or "discharge"
ca_yes = 1 if description contains "comprehensive assessment", "ca" or "comprehensive ax"
My end goal is to have a new dataset of clients that have gone through a Comprehensive Assessment, Treatment and Discharge.
I'm a little stumped as to what my next move should be here. I have all my variables flagged for clients. But there could be duplicate observations just because a client could have come in many times. So for example:
Client_id treatment_yes ca_yes dc_yes
1234 0 1 1
1234 1 0 0
1234 1 0 1
All I really care about is whether, for a particular client, the variables treatment_yes, ca_yes and dc_yes are each flagged at least once (i.e., each has at least one "1"; more than one "1" is fine).
I was thinking my next step might be to collapse the data (how do you do this?) for each unique client ID and sum treatment_yes, dc_yes and ca_yes for each client.
Does that work?
If so, how the heck do I accomplish this? Where do I start?
thanks everyone!

I think the easiest thing to do at this point is to use a proc sql step to find the max value of each of your three variables, aggregated by client_id:
data temp;
input Client_id $ treatment_yes ca_yes dc_yes;
datalines;
1234 0 1 1
1234 1 0 0
1234 1 0 1
;
run;
proc sql;
create table temp_collapse as select distinct
client_id, max(treatment_yes) as treatment_yes,
max(ca_yes) as ca_yes, max(dc_yes) as dc_yes
from temp
group by client_id;
quit;
A better overall approach would be to use the dataset you used to create the _yes variables and do something like max(case when desc = "tx" then 1 else 0 end) as treatment_yes etc., but since you're still new to SAS and understand what you've done so far, I think the above approach is totally sufficient.
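For anyone who wants to sanity-check the collapse logic outside SAS, here is a rough Python sketch of the same idea: take the max of each flag per client, then keep only the clients where all three are 1. The second client (5678) is made up for illustration and is not in the original data.

```python
from collections import defaultdict

# Rows mirror the example: (client_id, treatment_yes, ca_yes, dc_yes).
rows = [
    ("1234", 0, 1, 1),
    ("1234", 1, 0, 0),
    ("1234", 1, 0, 1),
    ("5678", 1, 0, 1),  # hypothetical client with no comprehensive assessment
]


def collapse(rows):
    """Return {client_id: [max treatment_yes, max ca_yes, max dc_yes]}."""
    flags = defaultdict(lambda: [0, 0, 0])
    for client_id, *values in rows:
        # Element-wise max of the flags seen so far and the current row.
        flags[client_id] = [max(a, b) for a, b in zip(flags[client_id], values)]
    return dict(flags)


collapsed = collapse(rows)
# Clients flagged at least once for treatment, assessment and discharge.
complete = sorted(cid for cid, f in collapsed.items() if all(f))
print(collapsed)
print(complete)
```

In the proc sql version, the equivalent final filter would be a having clause (or a where on temp_collapse) requiring all three max values to equal 1.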

The following code allows you to preserve other variables from your original dataset. I have added two variables (var1 and var2) for illustrative purposes:
data temp;
input Client_id $ treatment_yes ca_yes dc_yes var1 var2 $;
datalines;
1234 0 1 1 10 A
1234 1 0 0 11 B
1234 1 0 1 12 C
;
run;
Join the dataset with itself so that each row of a client_id in the original dataset is merged with its corresponding row in an aggregated dataset constructed in a subquery.
proc sql;
create table want as
select *
from temp as a
left join (select client_id,
max(treatment_yes) as max_treat,
max(ca_yes) as max_ca,
max(dc_yes) as max_dc
from temp
group by client_id) as b
on a.client_id=b.client_id;
quit;

Related

How can I identify three highest values in a column by ID and take their squares and then add them in SAS?

I am working on injury severity scores (ISS) and my dataset has these four columns: ID, High_AIS, Dxcode (diagnosis code), ISS_bodyregion. Each ID/case has several values for "dxcode" and respective High_AIS and ISS_bodyregion, which means each ID/case has multiple injuries in different body regions. The rule to calculate ISS specifies that we have to select the AIS values of three different ISS body regions.
For some IDs, we have only one value (of course when a person only has single injury and one associated dxcode and AIS). My goal is to calculate ISS (ranges from 0-75) and in order to do this, I want to tell SAS the following things:
Select three largest AIS values by ID (of course when ID has more than 3 values for AIS), take their squares and add them to get ISS.
If ID has only one injury and that has the AIS = 6, the ISS will automatically be equal to 75 (regardless of the injuries elsewhere).
If ID has fewer than 3 AIS values (for example, 5th ID has only two AIS values: 0 and 1), then consider only two, square them and add them, as we do not have a third ISS body region for this ID.
If ID has only 3 AIS (for example, 1,0,0) then consider only three, square them and add them even if it is ISS=1.
If all of an ID's injuries have AIS values equal to 0 (for example: 0,0) then ISS will equal 0.
If ID has multiple injuries, and AIS values are: 2,2,1,1,1 and ISS_bodyregion = 5,5,6,6,6. Then we see that ISS_bodyregion repeats itself, the instructions suggest that we only select highest AIS value of ISS body region only once, because it has to be from DIFFERENT ISS body regions. So, in such situation, I want to tell SAS that if ISS_bodyregion repeats itself, only select the one with highest AIS value and leave the rest.
I am confused because I need SAS to take all of the above considerations into account and I cannot seem to put them all in a single program. Thank you so much in advance. I have already sorted my data by ID and descending high_AIS.

So if you are trying to implement this algorithm https://aci.health.nsw.gov.au/networks/institute-of-trauma-and-injury-management/data/injury-scoring/injury_severity_score then you need data like this:
data have;
input id region :$20. ais ;
cards;
1 HEAD/NECK 4
1 HEAD/NECK 3
1 FACE 1
1 CHEST 2
1 ABDOMEN 2
1 EXTREMITIES 3
1 EXTERNAL 1
2 ABDOMEN 3
3 FACE 1
3 CHEST 2
4 HEAD/NECK 6
;
So first find the max per id per region. For example by using PROC SUMMARY.
proc summary data=have nway;
class id region;
var ais;
output out=bodysys max=ais;
run;
Now order by ID and AIS
proc sort data=bodysys ;
by id ais ;
run;
Now you can process by ID and accumulate the AIS scores into an array. You can use MOD() function to cycle through the array so that the last three observations per ID will be the values left in the array (skips the need to first subset to three observations per ID).
data want;
do count=0 by 1 until(last.id);
set bodysys;
by id;
array x[3] ais1-ais3 ;
x[1+mod(count,3)] = ais;
end;
iss=0;
if ais>5 then iss=75;
else do count=1 to 3 ;
iss + x[count]**2;
end;
keep id ais1-ais3 iss ;
run;
Result:
Obs id ais1 ais2 ais3 iss
1 1 2 3 4 29
2 2 3 . . 9
3 3 1 2 . 5
4 4 6 . . 75
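As a cross-check on the numbers above, here is a minimal Python sketch of the same steps: max AIS per ID and region, then the three largest of those, squared and summed, with 75 forced when any AIS is above 5. It reuses the sample rows from the data step; the function name is my own.

```python
from collections import defaultdict

# (id, region, ais) rows copied from the sample data above.
have = [
    (1, "HEAD/NECK", 4), (1, "HEAD/NECK", 3), (1, "FACE", 1),
    (1, "CHEST", 2), (1, "ABDOMEN", 2), (1, "EXTREMITIES", 3),
    (1, "EXTERNAL", 1), (2, "ABDOMEN", 3), (3, "FACE", 1),
    (3, "CHEST", 2), (4, "HEAD/NECK", 6),
]


def iss_by_id(records):
    """Return {id: ISS}, assuming AIS values are non-negative integers."""
    worst = defaultdict(dict)  # id -> region -> highest AIS in that region
    for pid, region, ais in records:
        worst[pid][region] = max(ais, worst[pid].get(region, 0))
    scores = {}
    for pid, regions in worst.items():
        # One value per region, so the top three come from different regions.
        top3 = sorted(regions.values(), reverse=True)[:3]
        scores[pid] = 75 if top3[0] > 5 else sum(a * a for a in top3)
    return scores


scores = iss_by_id(have)
print(scores)
```

IDs with fewer than three regions simply contribute fewer terms to the sum, which matches the missing ais2/ais3 values in the SAS output.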

Can I set rules for string comparison in SQL? (or do I need to hardcode using CASE WHEN)

I need to compare ratings at two points in time and indicate whether the change was upwards, downwards or stayed the same.
For example:
This would be a table with four columns:
ID T0 T0+1 Status
1 AAA AA Lower
2 BB A Higher
3 C C Same
However, this does not work when applying regular string comparison, because in SQL
A<B
B<BBB
I need
A>B
B<BBB
So my order(highest to lowest): AAA,AA,A,BBB,BB,B
SQL order(highest to lowest): BBB,BB,B,AAA,AA,A
Now I have 2 options in mind, but I wonder if someone know a better one:
1) Use CASE WHEN statements for all the possibilities of ratings going up and down (I have more values than indicated above)
CASE WHEN T0=T0+1 then 'Same'
WHEN T0='AAA' and T0+1<>'AAA' then 'Lower'
....address all other options for rating going down
ELSE 'Higher'
However, this generates a very large number of CASE WHEN statements.
2) My other option requires generating 2 tables. In table 1 I use case when statements to assign values/rank to the ratings.
For example:
CASE WHEN T0='AAA' then 6
WHEN T0='AA' then 5
WHEN T0='A' then 4
WHEN T0='BBB' then 3
WHEN T0='BB' then 2
WHEN T0='B' then 1
END
The same for T0+1.
Then in table 2 I use a regular comparison between column T0 and column T0+1 on the numeric values.
However, I am looking for a solution where I can do it in one table (with as few lines as possible), and ideally never really show the ranking column.
I think a nested statement would be the best option, but it did not work for me.
Does anybody have suggestions?
I use SQL Server 2008.

If you are working with credit ratings, it is very likely that this is not just about AAA > AA or BBB > BB.
Depending on the agency, it could also be AA+ or Aa1 for long term, F1+ for short term, or something else in different contexts or with other agencies.
It is also often required to convert ratings from one agency's scale to another's.
Therefore it is better to use a mapping table such as:
Id | Rating
0 | AAA
1 | AA+
2 | AA
3 | AA-
4 | A+
5 | A
6 | A-
7 | BBB+
Using this table, you only have to join the rating in your data table with the rating in the mapping table:
SELECT d.Rating_T0, d.Rating_T1,
CASE WHEN d.Rating_T0 = d.Rating_T1 THEN '='
WHEN m0.id < m1.id THEN '<'
WHEN m0.id > m1.id THEN '>'
END
FROM yourData d
INNER JOIN RatingMapping m0
ON m0.Rating= d.Rating_T0
INNER JOIN RatingMapping m1
ON m1.Rating= d.Rating_T1
If you only store the Rating id in your data table, you will not only save space (1 byte for tinyint versus up to 4 chars) but will also be able to compare without the JOIN to the mapping table.
SELECT d.Rating_Id0, d.Rating_Id1,
CASE WHEN d.Rating_Id0 = d.Rating_Id1 THEN '='
WHEN d.Rating_Id0 < d.Rating_Id1 THEN '<'
WHEN d.Rating_Id0 > d.Rating_Id1 THEN '>'
END
FROM yourData d
The JOIN would only be required when you want to display the actual Rating value such as AAA for Rating_ID = 0.
You could also add an agency_Id to the mapping table. This way, you can easily choose which agency's notation you want to display and easily convert between Agency 1, Agency 2 and Agency 3 (i.e., Id 1 => S&P, Id 2 => Fitch, Id 3 => ...)
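To make the mapping idea concrete, here is a small Python sketch of the same lookup-then-compare logic. The dictionary copies the Id/Rating rows from the mapping table above (a lower id means a better rating); the function name and the 'Same'/'Lower'/'Higher' labels are borrowed from the question and are otherwise my own choice.

```python
# Rank table copied from the mapping above: a lower id means a better rating.
RATING_ID = {
    "AAA": 0, "AA+": 1, "AA": 2, "AA-": 3,
    "A+": 4, "A": 5, "A-": 6, "BBB+": 7,
}


def rating_change(t0, t1):
    """Compare two ratings by rank instead of by string order."""
    before, after = RATING_ID[t0], RATING_ID[t1]
    if after == before:
        return "Same"
    # A larger id at the later date means the rating got worse.
    return "Lower" if after > before else "Higher"


print(rating_change("AAA", "AA"))
print(rating_change("BBB+", "A"))
print(rating_change("A-", "A-"))
```

The string values are never compared directly, which is exactly what the JOIN to the mapping table achieves in SQL.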

Using multiple aggregate functions - sum and count

I've tried several of the solutions to my question on the site but could not find one that worked. Please help!
Other than taking some liberties with the report_names, the data is representative of what I am trying to accomplish and is just a small portion of what I am up against: roughly 97K rows of data with the same type of repetition of branch, file_count and report_name. The file numbers are unique and are insignificant; they are included for informational purposes and explain why the amounts are unique (they are tied to the file_name).
I am looking for one report_name with the sum of the two amounts.
Here are the current results to my query:
branch file_count file_volume net_profit report_name file_number
Northeast 1 $200,000.00 $200,000.00 bogart.hump.new 12345
Northeast 1 $195,000.00 $197,837.00 bogart.hump.new 23456
Northeast 1 $111,500.00 $113,172.00 bogart.hump.new 34567
Northwest 1 $66,000.00 -$1,500.18 jolie.angela.new 45678
Northwest 1 $159,856.00 -$2,745.58 jolie.angela.new 56789
Northwest 1 $140,998.00 -$2,421.69 jolie.angela.new 67890
Southwest 1 $74,000.00 $73,904.00 Man.bat.net 78901
Southwest 1 $186,245.00 -$4,231.25 Man.bat.net 89012
Southwest 1 $72,375.00 $73,641.00 Man.bat.net 90123
Southeast 1 $79,575.00 -$1,821.76 zep.led.new 1234A
Southeast 1 $268,600.00 $268,600.00 zep.led.new 2345A
Southeast 1 $77,103.00 -$1,751.68 zep.led.new 3456A
This is what I am looking for:
branch file_count file_volume net_profit report_name file_number
Northeast 3 $506,500.00 $511,009.00 bogart.hump.new
Northwest 3 $366,854.00 -$6,667.45 jolie.angela.new
Southwest 3 $332,620.00 $143,313.75 Man.bat.net
Southeast 3 $425,278.00 $265,026.56 zep.led.new
My query:
SELECT
branch,
count(filenumber) AS file_count,
sum(fileAmount) AS file_amount,
sum(netprofit*-1) AS net_profit,
concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
FROM user.summary u
inner join user.db1 d1 ON d1.loaname = u.loaname
inner join user.db2 d2 ON d2.cn = u.loaname
WHERE d2.filedate = '2015-09-01'
AND filenumber is not null
GROUP BY branch,concat(d2.lastname,'.',d2.firstname,'.','new')

The only issue I see with your current query is that you have a comma at the end of this line that would give you a syntax error:
concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
If you want the blank field file_number as shown in your desired result set though, you could leave the comma and follow it with the blank field by adding to it:
concat(d2.lastname,'.',d2.firstname,'.','new') AS report_name,
'' file_number

I figured it out but could not have done it without airing it out in this forum. In my actual query, I included the "file_name" column, so I had both the "count(file_name)" and "file_name" columns, but in my query example I only had the "count(file_name)" column. When I removed the "file_name" column from my actual query, it worked. Side note: it was obvious that I excluded a key component in my query. On any future query questions, I will include the complete query but substitute actual column names with col1, col2, db1, db2, etc. Thanks very much for responding to my question.
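To illustrate why the extra column mattered: once a unique column such as the file number is part of the grouping key, every group holds exactly one row, so nothing gets summed. A rough Python sketch of grouping on branch and report_name only, using the first six rows from the question (amounts as strings so Decimal keeps the cents exact):

```python
from collections import defaultdict
from decimal import Decimal

# (branch, report_name, file_volume, net_profit, file_number) from the question.
rows = [
    ("Northeast", "bogart.hump.new", "200000.00", "200000.00", "12345"),
    ("Northeast", "bogart.hump.new", "195000.00", "197837.00", "23456"),
    ("Northeast", "bogart.hump.new", "111500.00", "113172.00", "34567"),
    ("Northwest", "jolie.angela.new", "66000.00", "-1500.18", "45678"),
    ("Northwest", "jolie.angela.new", "159856.00", "-2745.58", "56789"),
    ("Northwest", "jolie.angela.new", "140998.00", "-2421.69", "67890"),
]


def summarize(rows):
    """Group by (branch, report_name) and return (count, volume, profit)."""
    totals = defaultdict(lambda: [0, Decimal("0"), Decimal("0")])
    for branch, report_name, volume, profit, _file_number in rows:
        entry = totals[(branch, report_name)]  # file number is not in the key
        entry[0] += 1
        entry[1] += Decimal(volume)
        entry[2] += Decimal(profit)
    return {key: tuple(value) for key, value in totals.items()}


summary = summarize(rows)
for key, value in summary.items():
    print(key, value)
```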

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest values of foo before proceeding (later I would do a similar thing with bar), but I'm unsure of the most effective way to do this. I know there are functions smallest and largest, but I don't understand how I can use them to get the smallest 4 or largest 4 from an existing dataset. Alternatively I could just remove the min and max 4 times, but that sounds needlessly tedious and time consuming. Is there a better way?

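PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks;
ranks height_r weight_r;
var height weight;
run;
That adds a rank column for each variable, with 1 for the smallest value. Dropping the 4 smallest is then a matter of keeping ranks above 4. The largest 4 require knowing the maximum rank, which equals the observation count when there are no missing values, so apply both filters in a second step where NOBS= is available. (Filtering inside the PROC RANK output with a where= option would shrink the dataset first and make NOBS= too small.)
data class_final;
set class_ranks nobs=nobs;
* keep ranks 5 through nobs-4 for both variables;
if 4 < height_r < (nobs-3) and 4 < weight_r < (nobs-3);
run;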
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.

You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
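The threshold idea, including the tie behaviour just mentioned, can be sketched in a few lines of Python. This is only an illustration of the logic, not a translation of the SAS code; the function name and sample values are made up.

```python
def trim_extremes(values, k=4):
    """Drop everything at or beyond the k-th smallest and k-th largest value."""
    if len(values) < 2 * k:
        return []  # not enough data to trim k values from each end
    ordered = sorted(values)            # first pass: find the cut-off values
    low, high = ordered[k - 1], ordered[-k]
    # Second pass: strict inequalities, so ties with a cut-off are removed too.
    return [v for v in values if low < v < high]


foo = [111, 9, 5, 7, 3, 1, 50, 60, 70, 80, 90, 20, 30]
print(trim_extremes(foo))

# Five 1s: all of them go, because they tie with the 4th smallest value.
tied = [1, 1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(trim_extremes(tied))
```

If ties should be kept so that exactly 4 values are dropped from each end, slicing the sorted list (`ordered[k:-k]`) would be the alternative, at the cost of losing the original row order.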

You can use proc sql to place the number of distinct values of foo into a macro variable (counting null as a distinct value).
In your data step you can use first.foo and the macro variable to selectively output only those observations that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;

I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

Ensure data integrity in SQL Server

I have to make some changes in a small system that stores data in one table as following:
TransId TermId StartDate EndDate IsActiveTerm
------- ------ ---------- ---------- ------------
1 1 2007-01-01 2007-12-31 0
1 2 2008-01-01 2008-12-31 0
1 3 2009-01-01 2009-12-31 1
1 4 2010-01-01 2010-12-31 0
2 1 2008-08-05 2009-08-04 0
2 2 2009-08-05 2010-08-04 1
3 1 2009-07-31 2010-07-30 1
3 2 2010-07-31 2011-07-30 0
where the rules are:
StartDate must be the previous term's EndDate + 1 day (terms cannot overlap)
there are many terms per transaction
term length is from 1 to n days (I made it 1 year to keep this example simple)
NOTE: IsActiveTerm is a computed column which depends on the current date, so it is not deterministic
I need to ensure that terms do not overlap. In other words, I want to enforce this condition even when inserting or updating multiple rows at once.
What I am thinking of is adding "INSTEAD OF" triggers (for both Insert and Update), but this requires using cursors as I need to cope with multiple rows.
Does anyone have a better idea?

You can find pretty much everything about temporal databases in: Richard T. Snodgrass, "Developing Time-Oriented Database Applications in SQL", Morgan Kaufmann (2000), which I believe is out of print but can be downloaded via the link on his publication list.

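I've got a working solution:
CREATE TRIGGER TransTerms_EnsureCon ON TransTerms
FOR INSERT, UPDATE, DELETE AS
BEGIN
-- Look for consecutive terms of the same transaction where the later
-- term does not start the day after the earlier one ends.
IF (EXISTS (SELECT *
FROM TransTerms pT
INNER JOIN TransTerms nT
ON pT.TransId = nT.TransId
AND nT.TermId = pT.TermId + 1
WHERE nT.StartDate != DATEADD(d, 1, pT.EndDate)
AND pT.EndDate > pT.StartDate
AND nT.EndDate > nT.StartDate
)
)
-- BEGIN...END keeps the rollback conditional on the check failing;
-- severity 16 reports a real error rather than an informational message.
BEGIN
RAISERROR('Transaction violates sequenced CONSTRAINT', 16, 1)
ROLLBACK TRANSACTION
END
END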
P.S. Many thanks wallenborn!
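For readers who want to see the rule the trigger enforces in isolation, here is a rough Python sketch of the same check: within each transaction, every term must start the day after the previous term ends. Transaction 2 comes from the question's table; transaction 9 is a made-up bad case, and the function name is my own.

```python
from datetime import date, timedelta
from itertools import groupby


def find_broken_terms(terms):
    """Return (trans_id, term_id) for each term not starting the day after
    the previous term of the same transaction ends.

    terms: iterable of (trans_id, term_id, start_date, end_date).
    """
    broken = []
    ordered = sorted(terms, key=lambda t: (t[0], t[1]))
    for trans_id, group in groupby(ordered, key=lambda t: t[0]):
        previous_end = None
        for _, term_id, start, end in group:
            if previous_end is not None and start != previous_end + timedelta(days=1):
                broken.append((trans_id, term_id))
            previous_end = end
    return broken


terms = [
    (2, 1, date(2008, 8, 5), date(2009, 8, 4)),
    (2, 2, date(2009, 8, 5), date(2010, 8, 4)),
    (9, 1, date(2009, 1, 1), date(2009, 12, 31)),    # hypothetical
    (9, 2, date(2009, 12, 15), date(2010, 12, 14)),  # overlaps term 1
]
broken = find_broken_terms(terms)
print(broken)
```

Like the self-join in the trigger, this compares each term only with its immediate predecessor, which is enough when term ids are consecutive within a transaction.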