I'm having trouble finding an answer to what I think is a simple query, but I'm very green with SQL. Sample data:
YR MO ID FLAG RETURN
2001 01 1 1 3.00
2001 02 1 2 4.00
2001 03 1 3 -1.00
2001 04 1 4 1.00
2001 05 1 5 1.00
2001 06 1 6 1.00
2001 07 1 7 1.00
2001 08 1 8 1.00
2001 09 1 9 1.00
2001 10 1 10 1.00
2001 11 1 11 2.00
2001 12 1 12 1.00
2002 12 2 3 1.00
2002 04 2 0 0.05
I'd like a new column that holds the sum of the previous 12 RETURN values on the row WHERE FLAG = 12. Any help is greatly appreciated!
The data will be sorted by ID, then year, then month, so it should be ordered sequentially.
The output would be (3+4+-1+1+1+1+1+1+1+1+2+1) = 16
I'd like the output (16) in the FLAG=12 row
Maybe a Windowed Function would fit the bill here:
SELECT *,
       CASE WHEN FLAG = 12
            THEN SUM([RETURN]) OVER (PARTITION BY ID ORDER BY YR, MO
                                     ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) -- current row plus the 11 before it = 12 rows
            ELSE NULL
       END AS RETURN_12
FROM SomeTable
ORDER BY ID, YR, MO
So, there are a couple of issues with what you are attempting. First, you will need to create the new column yourself, either programmatically or administratively (through the UI); a SELECT will not do this for you. Next, be sure you want that data in your schema, as it would be odd to store a column that only holds sums for flagged rows. It seems you want to know the result but don't necessarily need to store it. If that is true (or can be made true), I would suggest a SELECT that combines SUM with ORDER BY ... DESC (so you need to know the ordering) and LIMIT 12: pick the 12 most recent rows in a subquery and sum them in the outer query. Given any row where FLAG is 12, you should be able to get the result you want with a single call.
Just another note, since you've mentioned two different DBMSs, make sure you validate the SQL against both; I'm fairly certain you can find a generic request that will work in both systems. Good luck.
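As a rough illustration of the SUM / ORDER BY ... DESC / LIMIT 12 idea, here is a minimal sketch in Python using the standard-library sqlite3 module. The table name monthly_returns, the column name ret, and the helper trailing_sum are assumptions made for this example; the LIMIT syntax is SQLite's and would need adapting for other DBMSs.

```python
import sqlite3

# Sample rows from the question: (YR, MO, ID, FLAG, RETURN).
rows = [
    (2001, 1, 1, 1, 3.00), (2001, 2, 1, 2, 4.00), (2001, 3, 1, 3, -1.00),
    (2001, 4, 1, 4, 1.00), (2001, 5, 1, 5, 1.00), (2001, 6, 1, 6, 1.00),
    (2001, 7, 1, 7, 1.00), (2001, 8, 1, 8, 1.00), (2001, 9, 1, 9, 1.00),
    (2001, 10, 1, 10, 1.00), (2001, 11, 1, 11, 2.00), (2001, 12, 1, 12, 1.00),
    (2002, 12, 2, 3, 1.00), (2002, 4, 2, 0, 0.05),
]

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names chosen for this sketch.
conn.execute(
    "CREATE TABLE monthly_returns (yr INT, mo INT, id INT, flag INT, ret REAL)"
)
conn.executemany("INSERT INTO monthly_returns VALUES (?, ?, ?, ?, ?)", rows)


def trailing_sum(conn, id_, yr, mo, n=12):
    """Sum the n most recent returns for one ID, up to and including (yr, mo)."""
    sql = """
        SELECT SUM(ret) FROM (
            SELECT ret FROM monthly_returns
            WHERE id = ? AND yr * 100 + mo <= ?
            ORDER BY yr DESC, mo DESC
            LIMIT ?
        )
    """
    return conn.execute(sql, (id_, yr * 100 + mo, n)).fetchone()[0]


# Only rows with FLAG = 12 get a trailing sum.
flagged = conn.execute(
    "SELECT id, yr, mo FROM monthly_returns WHERE flag = 12 ORDER BY id, yr, mo"
).fetchall()
results = {(i, y, m): trailing_sum(conn, i, y, m) for i, y, m in flagged}
print(results)
```

For the sample rows, the only FLAG = 12 row is ID 1, December 2001, and its trailing sum is 16.0, matching the expected output in the question.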
I've got a dataset that has id, start date and a claim value (in dollars) in each row - most ids have more than one row - some span over 50 rows. The earliest date for each ID/claim varies, and the claim values are mostly different.
I'd like to do a rolling sum of the value of IDs that have claims within 365 days of each other, to report each ID that has claims that have exceeded a limiting value across each period. So for an ID that had a claim date on 1 January, I'd sum all claims to 31 December (inclusive). Most IDs have several years of data so for the example above, I'd also need to check that if they had a claim on 1 May that they hadn't exceeded the limit by 30 April the following year and so on. I normally see this referred to as a 'rolling sum'. My site has many SAS products including base, stat, ets, and others.
I'm currently testing code on a small mock dataset, and so far I've converted a thin file to a fat file with one column for each claim value and each date of the claim. The mock dataset is similar to the client dataset that I'll be using. Here's what I've done so far (noting that the mock data uses days rather than dates - I'm not at the stage where I want to test on real data yet).
data original_data;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table ppt_counts as
select ppt, count(*) as ppts
from work.original_data
group by ppt;
select cats('value_', max(ppts) ) into :cats
from work.ppt_counts;
select cats('dates_',max(ppts)) into :cnts
from work.ppt_counts;
quit;
%put &cats;
%put &cnts;
data flipped;
set original_data;
by ppt;
array vars(*) value_1 -&cats.;
array dates(*) dates_1 - &cnts.;
array m_vars value_1 - &cats.;
array m_dates dates_1 - &cnts.;
if first.ppt then do;
i=1;
do over m_vars;
m_vars=.;
end;
do over m_dates;
m_dates=.;
end;
end;
vars(i) = claim;
dates(i)=day;
if last.ppt then output;
i+1;
retain value_1 - &cats dates_1 - &cnts. 0.;
run;
data output;
set work.flipped;
max_date =max(of dates_1 - &cnts.);
max_value =max(of value_1 - &cats.);
run;
This doesn't give me even close to what I need - not sure how to structure code to make this correct.
What I need to end up with is one row per time that an ID exceeds the yearly limit of claim value (say in the mock data if a claim exceeds 75 across a seven day period), and to include the sum of the claims. So it's likely that there may be multiple lines per ID and the claims from one row may also be included in the claims for the same ID on another row.
type of output:
ID sum of claims
a $85
a $90
b $80
On separate rows.
Any help appreciated.
Thanks
If you need to perform a rolling sum, you can do this with proc expand. The code below will perform a rolling sum of 5 days for each group. First, expand your data to fill in any missing gaps:
proc expand data = original_data
out = original_data_expanded
from = day;
by ppt;
id day;
convert claim / method=none;
run;
Days that fall in a gap will have a missing value for claim. Now we can calculate a moving sum over the expanded data, ignoring those missing days:
proc expand data = original_data_expanded
out = want(where=(NOT missing(claim)));
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Output:
ppt day rolling_sum claim
a 1 7 7
a 2 19 12
a 4 31 12
a 6 42 18
a 7 41 11
...
b 9 53 14
b 10 70 17
c 4 2 2
c 6 6 4
c 8 14 8
We use two proc expand steps because, within a single step, the rolling sum is calculated before the days are expanded, and we need it to occur after the expansion. You can test this by running everything in a single step:
/* Performs moving sum, then expands */
proc expand data = original_data
out = test
from = day;
by ppt;
id day;
convert claim = rolling_sum / transform=(movsum 5) method=none;
run;
Use a SQL self join that matches each row to the rows for the same ID whose dates fall within 365 days of it. This is time- and resource-intensive if you have a very large data set.
Assuming you have a date variable, intnx is probably a better way to calculate the date interval than adding 365, depending on how you want to account for leap years.
If you have a claim id to group on, that would also be better than using the group by clause in this example.
data have;
input ppt $1. day claim;
datalines;
a 1 7
a 2 12
a 4 12
a 6 18
a 7 11
a 8 10
a 9 14
a 10 17
b 1 27
b 2 12
b 3 14
b 4 12
b 6 18
b 7 11
b 8 10
b 9 14
b 10 17
c 4 2
c 6 4
c 8 8
;
run;
proc sql;
create table want as
select a.*, sum(b.claim) as total_claim
from have as a
left join have as b
on a.ppt=b.ppt and
b.day between a.day and a.day+365
group by 1, 2, 3;
/*b.day between a.day and intnx('year', a.day, 1, 's')*/;
quit;
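The self-join logic above can be sketched in plain Python to make the windowing explicit. This is only an illustration under stated assumptions: the function name lookahead_sums is made up, the window is half-open (the claim day plus the following window_days - 1 days), and the 7-day window with a limit of 75 comes from the mock scenario in the question rather than from the SQL above.

```python
from collections import defaultdict

# Mock data from the question: (ppt, day, claim).
rows = [
    ("a", 1, 7), ("a", 2, 12), ("a", 4, 12), ("a", 6, 18), ("a", 7, 11),
    ("a", 8, 10), ("a", 9, 14), ("a", 10, 17),
    ("b", 1, 27), ("b", 2, 12), ("b", 3, 14), ("b", 4, 12), ("b", 6, 18),
    ("b", 7, 11), ("b", 8, 10), ("b", 9, 14), ("b", 10, 17),
    ("c", 4, 2), ("c", 6, 4), ("c", 8, 8),
]


def lookahead_sums(rows, window_days):
    """For each row, sum the claims of the same ppt whose day falls in
    [day, day + window_days). Quadratic per ppt, like the self join."""
    by_ppt = defaultdict(list)
    for ppt, day, claim in rows:
        by_ppt[ppt].append((day, claim))
    out = []
    for ppt, day, claim in rows:
        total = sum(c for d, c in by_ppt[ppt] if day <= d < day + window_days)
        out.append((ppt, day, claim, total))
    return out


# Keep one row per window start whose total exceeds the limit.
over_limit = [(p, d, t) for p, d, _, t in lookahead_sums(rows, 7) if t > 75]
for row in over_limit:
    print(row)
```

Each reported row is a window start, so the same claim can contribute to more than one row for the same ID, which is the behaviour the question asks for.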
Assuming that you have only one claim per day, you could just use a circular array to keep track of the previous N days of claims to generate the rolling sum. By circular array I mean one where the indexes wrap around back to the beginning when you increment past the end. You can use the MOD() function to convert any integer into an index into the array.
Then to get the running sum just add all of the elements in the array.
Add an extra DO loop to zero out the days skipped when there are days with no claims.
%let N=5;
data want;
set original_data;
by ppt ;
array claims[0:%eval(&n-1)] _temporary_;
lagday=lag(day);
if first.ppt then call missing(of lagday claims[*]);
do index=max(sum(lagday,1),day-&n+1) to day-1;
claims[mod(index,&n)]=0;
end;
claims[mod(day,&n)]=claim;
running_sum=sum(of claims[*]);
drop index lagday ;
run;
Results:
running_
OBS ppt day claim sum
1 a 1 7 7
2 a 2 12 19
3 a 4 12 31
4 a 6 18 42
5 a 7 11 41
6 a 8 10 51
7 a 9 14 53
8 a 10 17 70
9 b 1 27 27
10 b 2 12 39
11 b 3 14 53
12 b 4 12 65
13 b 6 18 56
14 b 7 11 55
15 b 8 10 51
16 b 9 14 53
17 b 10 17 70
18 c 4 2 2
19 c 6 4 6
20 c 8 8 14
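For readers who do not use SAS, the same circular-array idea can be sketched in Python. The function name rolling_sums is an assumption for this example; like the data step above, it assumes at most one claim per day and input sorted by ppt and day.

```python
# Mock data from the question: (ppt, day, claim), sorted by ppt then day.
rows = [
    ("a", 1, 7), ("a", 2, 12), ("a", 4, 12), ("a", 6, 18), ("a", 7, 11),
    ("a", 8, 10), ("a", 9, 14), ("a", 10, 17),
    ("b", 1, 27), ("b", 2, 12), ("b", 3, 14), ("b", 4, 12), ("b", 6, 18),
    ("b", 7, 11), ("b", 8, 10), ("b", 9, 14), ("b", 10, 17),
    ("c", 4, 2), ("c", 6, 4), ("c", 8, 8),
]


def rolling_sums(rows, n=5):
    """Rolling n-day sum per ppt using a circular buffer indexed by day % n."""
    out = []
    buf = [0] * n
    prev_ppt, prev_day = None, None
    for ppt, day, claim in rows:
        if ppt != prev_ppt:
            # New group: start with an empty window.
            buf = [0] * n
        else:
            # Zero the slots of days skipped since the previous claim
            # (at most n - 1 slots need clearing).
            for d in range(max(prev_day + 1, day - n + 1), day):
                buf[d % n] = 0
        buf[day % n] = claim  # overwrite the slot that is now n days old
        out.append((ppt, day, claim, sum(buf)))
        prev_ppt, prev_day = ppt, day
    return out


result = rolling_sums(rows, n=5)
for row in result:
    print(row)
```

Run on the mock data with n=5, this reproduces the running_sum column shown in the results above.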
Working in a known domain of date integers, you can use a single large array to store the claims at each date and slice out the 365 days to be summed. The bookkeeping needed for the modular approach is not needed.
Example:
data have;
call streaminit(20230202);
do id = 1 to 10;
do date = '01jan2012'd to '02feb2023'd;
date + rand('integer', 25);
claim = rand('integer', 5, 100);
output;
end;
end;
format date yymmdd10.;
run;
options fullstimer;
data want;
set have;
by id;
array claims(100000) _temporary_;
array slice (365) _temporary_;
if first.id then call missing(of claims(*));
claims(date) = claim;
call pokelong(
peekclong(
addrlong (claims(date-365))
, 8*365)
,
addrlong(slice(1))
);
rolling_sum_365 = sum(of slice(*));
if dif1(claim) < 365 then
claims_out_365 = lag(claim) - dif1(rolling_sum_365);
if first.id then claims_out_365 = .;
run;
Note: SAS Date 100,000 is 16OCT2233
I have a dataset at the firm-product-year level. I want to identify which firms have gaps in their reporting years between 1994 and 2004. Consider the example below:
clear
input id year sales product
14 1994 28.9 2
14 1994 67.9 3
14 1994 12.5 9
14 1994 451.8 34
14 1994 27.5 44
14 1994 647.6 45
14 1995 9.7 2
14 1995 33.5 3
14 1995 112.4 9
14 1995 712.2 15
14 1995 902.3 41
14 1995 67.3 45
14 1995 15.1 50
14 1996 6.5 2
14 1996 24.6 3
14 1996 1009.4 5
14 1996 77.1 9
14 1996 76.9 17
14 1996 12.4 45
14 1996 946.3 88
14 1996 15.4 92
14 1997 .7 2
14 1997 63.2 2
14 1997 91.7 3
14 1997 860.8 9
14 1997 12.4 21
14 1997 800.8 32
14 1997 33.7 45
14 1997 41 95
15 1999 .1 44
15 2000 .1 58
15 2001 .4 27
15 2001 .1 95
15 2002 .5 5
15 2002 .1 58
15 2003 .1 17
15 2004 3.5 28
15 2004 .1 39
16 2000 .8 2
16 2001 .6 2
16 2003 .2 2
16 2004 .1 2
16 2004 .1 8
16 2004 2.5 8
end
Firm 14 produced 6 products in 1994. It produced every year consecutively until 1997. Because there are no missing years in between, I keep this firm. But firm 16 reports in 2000, 2001 and then in 2003. I assume that the firm still operated in 2002 but doesn't report in the data. How can I create a dummy variable that flags such firms?
tsfill doesn't help because I have repeated values within id-year.
In the first step, you delete the companies that do not produce any products in a year by creating a dummy variable "firm_any_production" that indicates whether a company has produced at least one product in a given year. Then the maximum of this dummy variable is calculated for each firm and the firms for which the maximum is 0 are deleted.
gen firm_any_production = sum(sales) > 0
bysort id (year): egen firm_missing_year = max(firm_any_production)
drop if firm_missing_year == 0
In step 2 you calculate whether the newly added products of a company have higher sales than the core product. This is calculated by creating a dummy variable "is_new_product", which indicates whether a product is a new product. Then the sales of these new products are calculated and compared to the sales of the core product. If the sum of the turnover of the new products is greater than the turnover of the core product, another dummy variable "greater_than_core" is created and set to 1.
bysort id year: egen core_product_sales = max(sales)
gen is_new_product = sales != core_product_sales
gen new_product_sales = sales * is_new_product
gen greater_than_core = sum(new_product_sales) > core_product_sales
Added:
The code is creating a firm_missing_year variable that takes the value of 1 if a firm doesn't report any product in the current year. The is_core_product variable indicates which product has the highest sales in a given year for each firm. The is_new_product variable takes the value of 1 if the product wasn't produced in the previous year. Finally, the higher_new_sales variable takes the value of 1 if the sum of sales of new products is greater than the sales of the core product.
use "your_data_file.dta", clear
* egen functions take no unique() option; the by-prefix already defines the groups
gen firm_missing_year = 0
bysort id (year): egen last_year = max(year)
replace firm_missing_year = 1 if year > last_year[1]
gen is_core_product = 0
bysort id year: egen max_sales = max(sales)
replace is_core_product = 1 if sales == max_sales
gen is_new_product = 0
bysort id year: gen lagged_product = product[_n-1]
replace is_new_product = 1 if product != lagged_product & sales != max_sales
bysort id year: egen sum_new_sales = total(sales * is_new_product)
gen higher_new_sales = 0
replace higher_new_sales = 1 if sum_new_sales > max_sales
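The gap check described in the question (firm 16 reports 2000, 2001, then 2003) can be sketched independently of Stata. The following Python example is only an illustration: the function name firms_with_gaps is made up, and it treats a firm as having a gap when any year between its first and last reported year is absent. Repeated id-year rows are harmless because years are collected into a set.

```python
from collections import defaultdict


def firms_with_gaps(rows):
    """rows: iterable of (firm_id, year) pairs, duplicates allowed.
    Returns {firm_id: sorted list of missing years} for firms with gaps."""
    years = defaultdict(set)
    for firm_id, year in rows:
        years[firm_id].add(year)
    gaps = {}
    for firm_id, ys in years.items():
        # Every year between the first and last report should be present.
        missing = sorted(set(range(min(ys), max(ys) + 1)) - ys)
        if missing:
            gaps[firm_id] = missing
    return gaps


# Reporting years taken from the example data (one entry per id-year is enough).
sample = (
    [(14, y) for y in (1994, 1994, 1995, 1996, 1997)]
    + [(15, y) for y in (1999, 2000, 2001, 2001, 2002, 2003, 2004)]
    + [(16, y) for y in (2000, 2001, 2003, 2004, 2004)]
)

gaps = firms_with_gaps(sample)
print(gaps)
```

A dummy variable then follows directly: it is 1 for every row whose firm id appears in gaps, and 0 otherwise. For the example data only firm 16 is flagged, with 2002 missing.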
1) Data: I have the following dataset in Google Sheets (link). The sample sheet has only 4 months of data, but there will be many more in the future.
MONTH  DATE        KATEGORIE     DOWNTIME  TIME (min)
9      01/09/2021  01 DURCHLAUF  0         50
9      02/09/2021  01 DURCHLAUF  0         65
9      03/09/2021  01 DURCHLAUF  0         91
9      04/09/2021  01 DURCHLAUF  0         52
9      05/09/2021  01 DURCHLAUF  0         72
9      06/09/2021  01 DURCHLAUF  0         44
9      07/09/2021  01 DURCHLAUF  0         55
9      08/09/2021  01 DURCHLAUF  0         30
9      09/09/2021  01 DURCHLAUF  0         42
2) Expected output table and desired output
I want to create a scorecard for 02 Downtime to show total time for a given month.
If I filter for November, I would like the scorecard to compare vs October. (abs=1180, %=45)
Similarly, if I select December, I want to see the amount vs November (abs=940, %=25)
As a safety measure, if someone selects 2 months simultaneously, then perhaps it should not show any comparison. (unless it's possible to even do 2 vs 2 months, but it's not a necessity.)
3) Chart: Configuration + Setup
I have created a simple scorecard and a pivot table. I filtered out only Downtime.
4) Issue: Attempt at solving + Output and 5) Report: Publicly editable Looker Studio with 1-4.
In my file link you can see the scorecard mentioned above, but I have failed to add any comparison that is "dynamic", i.e. that follows the selected month.
Added a solution to page 2 of your dashboard.
Cross join the data source with itself.
Use the following calculated fields for the expected output.
Daet01:
CAST(CONCAT(DATE(EXTRACT(YEAR FROM DATE123),EXTRACT(MONTH FROM DATE123),01)) AS DATE)
Previous Month:
TIME (min) (Table 1)-CASE WHEN Daet01 (Table 2) = DATETIME_SUB(Daet01 (Table 1), INTERVAL 1 MONTH) THEN TIME (min) (Table 2) ELSE 0 END
% Difference:
(TIME (min) (Table 1)-CASE WHEN Daet01 (Table 2) = DATETIME_SUB(Daet01 (Table 1), INTERVAL 1 MONTH) THEN TIME (min) (Table 2) ELSE 0 END)/TIME (min) (Table 2)
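To make the arithmetic behind these calculated fields concrete, here is a small Python sketch of a month-over-month comparison. It is not Looker Studio syntax; the function name month_over_month and the monthly totals are hypothetical values chosen for illustration, not figures from the linked sheet.

```python
def month_over_month(totals, month):
    """totals: {(year, month): total minutes}. Returns (absolute difference,
    percent difference) against the previous calendar month, or None when
    either month is missing or the previous total is zero."""
    year, mon = month
    prev = (year - 1, 12) if mon == 1 else (year, mon - 1)
    if month not in totals or prev not in totals or totals[prev] == 0:
        return None
    diff = totals[month] - totals[prev]
    return diff, 100.0 * diff / totals[prev]


# Hypothetical downtime totals per month, for illustration only.
totals = {(2021, 10): 2000, (2021, 11): 2500, (2022, 1): 900}

print(month_over_month(totals, (2021, 11)))  # compares November with October
print(month_over_month(totals, (2022, 1)))   # December is missing, so no comparison
```

Returning None when the previous month is unavailable mirrors the safety measure requested in the question: show no comparison rather than a misleading one.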
I am trying to group my dataset into three-month groups, or quarters, but as I'm starting from an arbitrary date, I cannot use the quarter function in SAS.
Example data below of what I have and quarter is the column I need to create in SAS.
The start date is always the same, so my initial quarter will be 3rd Sep 2018 - 3rd Dec 2018 and any active date falling in that quarter will be 1, then quarter 2 will be 3rd Dec 2018 - 3rd Mar 2019 and so on. This cannot be coded manually as the start date will change depending on the data, and the number of quarters could be up to 20+.
The code I have attempted so far is below
data test_Data_op;
set test_data end=eof;
%let j = 0;
%let start_date = start_Date;
if &start_Date. <= effective_dt < (&start_date. + 90) then quarter = &j.+1;
run;
This works and gives the first quarter correctly, but I can't figure out how to loop this for every following quarter? Any help will be greatly appreciated!
No need for a DO loop if you already have the start_date and actual event dates. Just count the number of months and divide by three. Use the continuous method of the INTCK() function to handle start dates that are not the first day of a month.
month_number=intck('month',&start_date,mydate,'cont')+1;
qtr_number=floor((month_number-1)/3)+1;
Based on the comment by @Lee. Edited to match the data from the screenshot.
The example shows that May 11 would be in the 3rd quarter since the seed date is September 3.
data have;
input mydate :yymmdd10.;
format mydate yymmddd10.;
datalines;
2018-09-13
2018-12-12
2019-05-11
;
run;
%let start_date='03sep2018'd;
data want;
set have;
quarter=floor(mod((yrdif(&start_date,mydate)*4),4))+1;
run;
If you want the number of quarters to extend beyond 4 (e.g. September 4, 2019 would be in quarter 5 rather than cycle back to 1), then remove the "mod" from the function:
quarter=floor(yrdif(&start_date,mydate)*4)+1;
The traditional use of quarter means a 3 month time period relative to Jan 1. Make sure your audience understands the phrase quarter in your data presentation actually means 3 months relative to some arbitrary starting point.
This nonstandard quarter can be computed directly from the number of months between the two dates: use INTCK for the baseline month count and a logical expression that subtracts one month when the effective date's day of month falls before the start date's. No loops required.
For example:
data have;
do startDate = '11feb2019'd ;
do effectiveDate = startDate to startDate + 21*90;
output;
end;
end;
format startDate effectiveDate yymmdd10.;
run;
data want;
set have;
qtr = 1
+ floor(
( intck ('month', startDate, effectiveDate)
-
(day(effectiveDate) < day(startDate))
)
/ 3
);
format qtr 4.;
run;
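The same month-count-and-adjust formula can be sketched in Python for comparison. The function name quarter_from_start is an assumption for this example; the logic mirrors the data step above: count calendar-month boundaries, subtract one when the effective day of month is earlier than the start day, then divide by three.

```python
from datetime import date


def quarter_from_start(start, effective):
    """Quarter number (1-based) of `effective`, counting 3-month periods
    from an arbitrary `start` date."""
    # Calendar-month boundaries crossed, like INTCK('month', ...).
    months = (effective.year - start.year) * 12 + (effective.month - start.month)
    # The last month is not complete until the start's day of month is reached.
    if effective.day < start.day:
        months -= 1
    return 1 + months // 3


start = date(2018, 9, 3)
for d in (date(2018, 9, 13), date(2018, 12, 12), date(2019, 5, 11), date(2019, 9, 4)):
    print(d, quarter_from_start(start, d))
```

With a start date of 3 September 2018, the three sample dates from the earlier answer fall in quarters 1, 2 and 3, and 4 September 2019 falls in quarter 5.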
Extra
Comparing my method (qtr) to @Tom's (qtr_number) for a range of startDates:
data have;
retain seq 0;
do startDate = '01jan1999'd to '15jan2001'd;
seq + 1;
do effectiveDate = startDate to startDate + 21*90;
output;
end;
end;
format startDate effectiveDate yymmdd10.;
run;
data want;
set have;
qtr = 1
+ floor( ( intck ('month', startDate, effectiveDate)
- (day(effectiveDate) < day(startDate))
) / 3 );
month_number=intck('month',startDate,effectiveDate,'cont')+1;
qtr_number=floor((month_number-1)/3)+1;
format qtr: month: 4.;
run;
options nocenter nodate nonumber;title;
ods listing;
proc print data=want;
where qtr ne qtr_number;
run;
dm 'output';
-------- OUTPUT ---------
effective month_ qtr_
Obs seq startDate Date qtr number number
56820 31 1999-01-31 1999-04-30 1 4 2
57186 31 1999-01-31 2000-04-30 5 16 6
57551 31 1999-01-31 2001-04-30 9 28 10
57916 31 1999-01-31 2002-04-30 13 40 14
58281 31 1999-01-31 2003-04-30 17 52 18
168391 90 1999-03-31 1999-06-30 1 4 2
168483 90 1999-03-31 1999-09-30 2 7 3
168757 90 1999-03-31 2000-06-30 5 16 6
168849 90 1999-03-31 2000-09-30 6 19 7
169122 90 1999-03-31 2001-06-30 9 28 10
169214 90 1999-03-31 2001-09-30 10 31 11
169487 90 1999-03-31 2002-06-30 13 40 14
169579 90 1999-03-31 2002-09-30 14 43 15
169852 90 1999-03-31 2003-06-30 17 52 18
169944 90 1999-03-31 2003-09-30 18 55 19
280510 149 1999-05-29 2001-02-28 7 22 8
280875 149 1999-05-29 2002-02-28 11 34 12
281240 149 1999-05-29 2003-02-28 15 46 16
282035 150 1999-05-30 2000-02-29 3 10 4
282400 150 1999-05-30 2001-02-28 7 22 8
282765 150 1999-05-30 2002-02-28 11 34 12
Please advise on a SQL command.
I have a table with 3 columns: Date, Quantity, Price.
The table has about a thousand rows.
I want to pick up a fixed number of rows (for example, only 5) from this table (see below).
So I want to collect data starting from "06.02.2013" (if this date is not in the table, take the nearest date after it, which would be 11.02.2013),
and collect 5 rows from that date onward (result see below).
table_Prices:
Date Qty Price
-----------------------
01.02.2013 24 1025
06.02.2013 26 1150
11.02.2013 47 2014
16.02.2013 5 1025
21.02.2013 7 1023
26.02.2013 8 1025
03.03.2013 95 1203
08.03.2013 63 1203
13.03.2013 25 2012
18.03.2013 48 1032
23.03.2013 105 1253
28.03.2013 48 1452
Desired result:
06.02.2013 26 1150
11.02.2013 47 2014
16.02.2013 5 1025
21.02.2013 7 1023
26.02.2013 8 1025
select top 5 *
from table_Prices
where Date >= cast('20130206' as datetime) -- >= keeps 06.02.2013 itself; yyyymmdd avoids day/month ambiguity
order by Date asc
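To try the selection logic without a database server, here is a minimal sketch using Python's standard-library sqlite3 module. It is an illustration under assumptions: dates are stored as ISO yyyy-mm-dd text so that string order equals date order, SQLite's LIMIT stands in for TOP, and the helper name next_n_from is made up for this example.

```python
import sqlite3

# Rows from the question, with dates rewritten as ISO text (an assumption).
rows = [
    ("2013-02-01", 24, 1025), ("2013-02-06", 26, 1150), ("2013-02-11", 47, 2014),
    ("2013-02-16", 5, 1025), ("2013-02-21", 7, 1023), ("2013-02-26", 8, 1025),
    ("2013-03-03", 95, 1203), ("2013-03-08", 63, 1203), ("2013-03-13", 25, 2012),
    ("2013-03-18", 48, 1032), ("2013-03-23", 105, 1253), ("2013-03-28", 48, 1452),
]

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE table_Prices ("Date" TEXT, Qty INT, Price INT)')
conn.executemany("INSERT INTO table_Prices VALUES (?, ?, ?)", rows)


def next_n_from(conn, start_iso, n=5):
    """Return the first n rows dated on or after start_iso, oldest first."""
    sql = """
        SELECT "Date", Qty, Price
        FROM table_Prices
        WHERE "Date" >= ?
        ORDER BY "Date" ASC
        LIMIT ?
    """
    return conn.execute(sql, (start_iso, n)).fetchall()


for row in next_n_from(conn, "2013-02-06"):
    print(row)
```

Because the filter is "on or after", a start date that is not in the table (for example 2013-02-07) simply begins at the next available date, 2013-02-11.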