Using group by in Proc SQL for SAS - sql-server

I am trying to summarize my data set using PROC SQL, but I get repeated values in the output. A simplified version of my code is:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc,fill_mon,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc,fill_mon;
QUIT;
Some sample output is:
Obs Patid Ndc FILL_mon n_dea DEDUCT
3815 33003605204 00054465029 2000-05 2 0
3816 33003605204 00054465029 2000-05 2 0
12257 33004361450 00406035701 2000-06 2 0
16564 33004744098 00603128458 2000-05 2 0
16565 33004744098 00603128458 2000-05 2 0
16566 33004744098 00603128458 2000-06 2 0
16567 33004744098 00603128458 2000-06 2 0
46380 33008165116 00406035705 2000-06 2 0
85179 33013674758 00406035801 2000-05 2 0
89248 33014228307 00054465029 2000-05 2 0
107514 33016949900 00406035805 2000-06 2 0
135047 33056226897 63481062370 2000-05 2 0
213691 33065594501 00472141916 2000-05 2 0
215192 33065657835 63481062370 2000-06 2 0
242848 33066899581 60432024516 2000-06 2 0
As you can see there is repeated output, for example obs 3815 and 3816. I have seen other people with a similar problem, but the answers didn't work for me.
The PROC CONTENTS output for the dataset is:
The CONTENTS Procedure
Engine/Host Dependent Information
Data Set Page Size 65536
Number of Data Set Pages 210
First Data Page 1
Max Obs per Page 1360
Obs in First Data Page 1310
Number of Data Set Repairs 0
Filename /home/zahram/optum/rx_4.sas7bdat
Release Created 9.0401M2
Host Created Linux
Inode Number 424673574
Access Permission rw-r-----
Owner Name zahram
File Size (bytes) 13828096
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat Label
3 FILL_mon Num 8 YYMMD. Fill month
2 Ndc Char 11 $11. $20. Ndc
1 Patid Num 8 19. Patid
4 n_dea Num 8
5 tot_DEDUCT Num 8
Sort Information
Sortedby Patid Ndc FILL_mon
Validated YES
Character Set ASCII
Sort Option NODUPKEY

I'll guess that you have a format on a variable, most likely the date. PROC SQL does not aggregate over formatted values; it groups on the underlying values but still displays them with the format applied, so distinct underlying dates within the same month show up as apparently duplicate rows. Your PROC CONTENTS confirms this (FILL_mon carries the YYMMD. format). You can get around it by converting the variable to a character value:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid,ndc, put(fill_mon, yymmd.) as fill_month,
COUNT(dea) AS n_dea,
sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid,ndc, calculated fill_month;
QUIT;
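If you would rather keep fill_mon as a numeric SAS date (so date formats and date arithmetic still work downstream), a hedged alternative is to collapse each date to the first day of its month with INTNX and group on that calculated value. This is only a sketch reusing the names from the question; fill_month is an illustrative column name:
PROC SQL;
CREATE TABLE perm.rx_4 AS
SELECT patid, ndc,
       intnx('month', fill_mon, 0, 'b') AS fill_month FORMAT=yymmd7., /* first day of the month */
       COUNT(dea) AS n_dea,
       sum(DEDUCT) AS tot_DEDUCT
FROM perm.rx
GROUP BY patid, ndc, calculated fill_month;
QUIT;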

Related

sum values across any 365 day period

I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset, and so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars="";
end;
do over m_dates;
m_dates="";
end;
end;
if first.ppt then i=1;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats. dates_1 - &cnts. 0;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me anything close to what I need, and I'm not sure how to structure the code to make it correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Any day that was filled in by the expansion has a missing value for claim. Now we can calculate the moving sum over the expanded data; the filled-in days add nothing to the sum, and the WHERE= below drops them from the output:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
The reason we use two PROC EXPAND steps is that, within a single step, the rolling sum is calculated before the days are expanded, and we need the rolling sum to happen after the expansion. You can see the difference by running it all as a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use an SQL self-join where the second copy's date falls within 365 days of the first. This is time/resource intensive if you have a very large data set.
Assuming you have an actual date variable, INTNX is probably a better way to calculate the interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than the positional GROUP BY used in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
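If the goal is only to report the windows where the total goes over a limit (75 across a seven-day window in the mock data), a HAVING clause on the aggregate is one way to filter. This is a sketch against the mock have table above; the 7-day window and the 75 threshold are taken from the question and would change for real dates and the real limit:
proc sql;
create table exceeded as
select a.ppt, a.day, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt = b.ppt
and b.day between a.day and a.day + 6   /* 7-day window in the mock data */
group by a.ppt, a.day
having calculated total_claim > 75;     /* keep only windows over the limit */
quit;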
Assuming you have only one claim per day, you could just use a circular array to keep track of the previous N days of claims and generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end; you can use the MOD() function to convert any integer into an index into the array.
Then, to get the running sum, just add all of the elements in the array.
An extra DO loop zeroes out the days that were skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
OBS ppt day claim running_sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
Working in a known domain of date integers, you can use a single large array to store the claim at each date and slice out the 365 days to be summed, which avoids the bookkeeping that the circular-array approach needs.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233

Transpose a row of 1's and 0's into separate rows in Snowflake

I'm trying to load a file and transpose a row of values into separate rows.
The Days column holds a string like 11010011 that needs to be transposed into a vertical format.
Below is the sample input.
I'm trying to get the expected output like below.
Can you please help me with this in Snowflake? I appreciate your help.
Replace '1' with '1,' and '0' with '0,', then trim the trailing comma. You can then use SPLIT_TO_TABLE to turn that into rows:
with SOURCE_DATA as
(
select COLUMN1::int as FACTORY
,COLUMN2::int as YEAR
,COLUMN3::string as DAYS
from (values
(01,2021,'01010100100101010001'),
(99,2021,'00100111010101011010')
)
)
select FACTORY, YEAR, SEQ as SOURCE_ROW, INDEX as POSITION_IN_STRING, VALUE as WORKING_DAY
from SOURCE_DATA, table(split_to_table(trim(replace(replace(DAYS,'1','1,'),'0','0,'),','),',')) D
;
Abbreviated output:
FACTORY  YEAR  SOURCE_ROW  POSITION_IN_STRING  WORKING_DAY
1        2021  1           1                   0
1        2021  1           2                   1
1        2021  1           3                   0
1        2021  1           4                   1
1        2021  1           5                   0
The SPLIT_TO_TABLE function gives you some metadata columns with information about the split. You can change the sample to select * to see them; they may be useful for your requirements.

expanding a dataset with blanks

I have a dataset as follows:
data have;
input;
ID Base Adverse Fixed$ Date RepricingFrequency
1 38 50 FIXED 2016 2
2 40 60 FLOATING 2017 3
3 20 20 FIXED 2016 2
4 ...
5
6
I am looking to build an array such that each ID has four vintage years, 2017-2020, where the subsequent years are to be filled out with a piece of array code I have that works, like so:
ID Vintage Base Adverse Fixed$ Date RepricingFrequency
1 2017 38 50 FIXED 2016 2
1 2018
1 2019
1 2020
To begin, I just need to duplicate the dataset with the blanks.
The code I've tried so far is
data want;
set have;
do I=1 to 4;
output;
end;
drop I;
run;
but of course that just repeats all the observations with their values filled in. So I tried an array.
data want;
set have;
array Base(2017:2020) Base2017 - Base2020;
array Vintage(2017:2020) Vintage2017 - Vintage2020;
But I'm not sure where to go from here on either approach.
The question is how do I expand my dataset for IDs 1-8 into a dataset where each ID is repeated 4 times, with the added rows blank.
Make a dummy dataset with all of the observations
data frame ;
set have(keep=id);
by id ;
if first.id then do date=2017 to 2020 ;
output;
end;
run;
and merge it back with the original.
data want ;
merge have frame ;
by id date ;
run;
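One small usage note: MERGE with a BY statement requires both inputs to be sorted by the BY variables. The frame step above already produces id/date order, so at most you may need to sort the original first; a minimal sketch, assuming have is not already sorted by id and date:
proc sort data=have;
by id date;
run;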

How to make a table and access table parameters using array of structures in C?

Rows are Temperature and columns are Pressure.
Temp   750   755   760   765   (columns: pressure)
0      1.1   2     1     4
1      3     4     2     1     (cell values: factors)
2      4     5     5     9
I need help making this table in code so that I can access the factor value for a given temperature and pressure.
For example, if temp is 0 and pressure is 750 the factor value is 1.1; if temp is 1 and pressure is 750 the factor value is 3.

Matlab Dataset Array Calculating delta t

I have a very large dataset array with over a million values that looks like this:
Month Day Year Hour Min Second Line1 Line2 Power Dt
7 8 2013 0 1 54 1.91 4.98 826.8 0
7 8 2013 0 0 9 1.93 3.71 676.8 0
7 8 2013 0 1 15 1.92 5.02 832.8 0
7 8 2013 0 1 21 1.91 5.01 830.4 0
and so on.
When the measurement of seconds reached 60 it would start over again at 0, hence why the first number is bigger. I need to fill the delta-t column (Dt) by taking the current row's seconds column, subtracting the previous row's seconds column, and correcting for negative values. This cannot be performed in a loop, as that would take ages to complete; it needs to be done in a simple, one-shot, vectorized subtraction operation.
You can try the diff command to generate such results. It's very fast and should work without any for loop.
HTH
Dt=diff(datenum(A(:,1:6)))*60*60*24;
This gives the delta in seconds, but I'm not sure what you want your correction for negative differences to be. Could you give an example of the expected output?
Note that Dt will be one entry shorter than A, so you may have to pad it.
You can remove the negative values (I think) with the command
Dt(Dt<0)=Dt(Dt<0)+60;
If you need to pad the Dt vector so that it is the same length as the data set, try
Dt=[Dt;0];
