I am trying to find a way to create new observations based on existing ones in SAS.
For example, I have the following table:
+------+--------------------+-------------------+
| ID | START_DATE | END_DATE |
+------+--------------------+-------------------+
| ABC1 | 01FEB201500:00:00 | 30NOV201600:00:00 |
| ABC1 | 01JAN201700:00:00 | 30NOV201800:00:00 |
+------+--------------------+-------------------+
And I would like to create a table where all the timestamps for the period 01JAN2014 to 31DEC2020 are covered. In other words, this consists of adding 2 more observations to the dataset so that it looks like this:
+------+--------------------+-------------------+
| ID | START_DATE | END_DATE |
+------+--------------------+-------------------+
| ABC1 | 01JAN201400:00:00 | 31JAN201500:00:00 |
| ABC1 | 01FEB201500:00:00 | 30NOV201600:00:00 |
| ABC1 | 01DEC201600:00:00 | 30NOV201800:00:00 |
| ABC1 | 01DEC201800:00:00 | 31DEC202000:00:00 |
+------+--------------------+-------------------+
The SAS code to re-create this example is:
DATA test;
    INPUT ID :$4. START_DATE :datetime18. END_DATE :datetime18.;
    FORMAT START_DATE datetime20. END_DATE datetime20.;
    CARDS;
ABC1 01FEB201500:00:00 30NOV201600:00:00
ABC1 01JAN201700:00:00 30NOV201800:00:00
;
RUN;
I don't see a way to do this in SAS.
You can fill in (i.e., compute) intra-range gaps using basic comparisons, a few holding variables, and a retained variable.
Example:
Assume the ranges do not overlap and are sorted by ascending start value within each id.
data have;
  input id x1 x2;
  datalines;
1 3 7
1 11 14
2 4 9
2 15 18
3 1 11
4 11 20
5 1 2
5 3 4
5 5 9
5 10 20
;
data want;
  set have;
  by id;
  length type $6;
  retain bot;
  * fill in ranges for every integer 1 through 20;
  if first.id then bot = 1;
  if bot < x1 then do;
    * gap before the current range: output it, then the range itself;
    hold1 = x1;
    hold2 = x2;
    x1 = bot;
    x2 = hold1 - 1;
    type = 'gap -';
    output;
    x1 = hold1;
    x2 = hold2;
    type = 'have';
    bot = x2 + 1;
    output;
  end;
  else if x1 <= bot <= x2 then do;
    * no gap: the retained bottom falls inside the current range;
    bot = x2 + 1;
    type = 'have';
    output;
  end;
  if last.id and 20 >= bot > x2 then do;
    * trailing gap after the last range for this id;
    type = 'gap +';
    x1 = bot;
    x2 = 20;
    output;
  end;
  keep type id x1 x2 bot;
run;
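The same retained-variable technique carries over to the datetime question above. Here is an untested sketch against the test dataset from the question: it assumes the coverage window 01JAN2014 to 31DEC2020, uses intnx() with day granularity for the boundaries, and emits each gap as its own row rather than merging it into the following range as the asker's desired output does.
data want;
  set test;
  by id;
  retain bot;
  if first.id then bot = '01JAN2014:00:00:00'dt;
  if bot < start_date then do;
    * gap before the current range, ending the day before it starts;
    hold1 = start_date;
    hold2 = end_date;
    start_date = bot;
    end_date = intnx('dtday', hold1, -1);
    output;
    start_date = hold1;
    end_date = hold2;
  end;
  * output the original range and advance the retained bottom past it;
  bot = intnx('dtday', end_date, 1);
  output;
  if last.id and bot <= '31DEC2020:00:00:00'dt then do;
    * trailing gap up to the end of the window;
    start_date = bot;
    end_date = '31DEC2020:00:00:00'dt;
    output;
  end;
  drop bot hold1 hold2;
run;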
I am wondering if it is possible to reshape the following have table in SAS not using SAS/IML to produce the want table.
Have:
+--------+------+------+------+
| NAME | var1 | var2 | var3 |
+--------+------+------+------+
| Q1_ID1 | 1 | 2 | 3 |
| Q1_ID2 | 4 | 5 | 6 |
| Q2_ID1 | 7 | 8 | 9 |
| Q2_ID2 | 10 | 11 | 12 |
| Q3_ID1 | 13 | 14 | 15 |
| Q3_ID2 | 16 | 17 | 18 |
+--------+------+------+------+
Want:
+----------+----+----+----+
| NAME | Q1 | Q2 | Q3 |
+----------+----+----+----+
| var1_ID1 | 1 | 7 | 13 |
| var1_ID2 | 4 | 10 | 16 |
| var2_ID1 | 2 | 8 | 14 |
| var2_ID2 | 5 | 11 | 17 |
| var3_ID1 | 3 | 9 | 15 |
| var3_ID2 | 6 | 12 | 18 |
+----------+----+----+----+
The code to reproduce the have table is the following:
data have;
  infile datalines delimiter=",";
  input NAME :$8. var1 :8. var2 :8. var3 :8.;
  datalines;
Q1_ID1,1,2,3
Q1_ID2,4,5,6
Q2_ID1,7,8,9
Q2_ID2,10,11,12
Q3_ID1,13,14,15
Q3_ID2,16,17,18
;
run;
Two transposes are needed, with some tearing apart and putting together in between; the have data step from the question provides the input.
proc transpose data=have out=stage;
  by name;
  var var:;
run;

data stage2(keep=name col1 qtr);
  set stage;
  qtr = scan(name,1,'_');        * tear apart;
  id = scan(name,2,'_');
  name = catx('_', _name_, id);  * put together;
run;

proc sort data=stage2;
  by name qtr;
run;

proc transpose data=stage2 out=want;
  by name;
  id qtr;
run;
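One small cleanup, if desired: the final want dataset will also carry the automatic _NAME_ variable (holding COL1) from the second transpose. It can be dropped with a dataset option:
proc transpose data=stage2 out=want(drop=_name_);
  by name;
  id qtr;
run;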
Say that I gain +5 coins from every room I complete. What I'm trying to do is to make a formula in Excel that gets the total coins I've gotten from the first room to the 100th room.
With C++, I guess it would be something like:
int totalCoins = 0;
while (lastRoom > 0)
{
    totalCoins += lastRoom * 5;  // room n pays 5*n coins
    lastRoom--;
}
with totalCoins accumulating the running total, so you can just output it at the end.
So how would you put this logic into Excel and get it to work? Or is there any other way to get the total coins?
There are infinitely many solutions.
One is to build a table like this:
+---+----------+---------------+
| | A | B |
+---+----------+---------------+
| 1 | UserID | RoomCompleted |
| 2 | User 001 | Room 1 |
| 3 | User 002 | Room 1 |
| 4 | User 002 | Room 2 |
| 5 | User 002 | Room 3 |
+---+----------+---------------+
then pivot the spreadsheet to get the following:
+---+----------+-----------------------+
| | A | B |
+---+----------+-----------------------+
| 1 | User | Total Rooms completed |
| 2 | User 001 | 1 |
| 3 | User 002 | 3 |
+---+----------+-----------------------+
where you have the number of completed rooms for each user. You can now multiply that count by 5 with a simple formula or (better) with a calculated field in the pivot.
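For example, assuming the pivot totals land in column B as laid out above (the cell references are assumptions about your layout), a plain helper formula in C2 would be:
=B2*5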
If I understand you correctly you shouldn't need any special code, just a formula:
=(C2-A2+1)*B2
where C2 = last room, A2 = first room, and B2 = coin reward per room. With A2 = 1, C2 = 100, and B2 = 5 this gives (100 - 1 + 1) * 5 = 500. You can change A2, B2, or C2 and the formula in D2 will output the result.
You can use the closed-form formula for the sum of the integers 1 through n, n*(n+1)/2, and multiply it by the coin count, so the total for rooms 1 through n is 5 * n * (n + 1) / 2. Then you just hook it up to your table.
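For example, with the last room number in cell A1 (a hypothetical layout), the total for rooms 1 through A1 would be:
=5*A1*(A1+1)/2
For A1 = 100 this returns 25250.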
Hope it helps
So here is my question. Brace yourself as it takes some thinking just to wrap your head around what I am trying to do.
I'm working with Quarterly Census of Employment and Wages (QCEW) data. QCEW data has something called suppression codes. If a data denomination (the data comes as overall, location quotient, and over-the-year figures, for each year and each quarter) is suppressed, then all the data for that denomination is zero. I have my table set up in the following way (only showing you columns that are relevant for the question):
A County_Id column,
Industry_ID column,
Year column,
Qtr column,
Suppressed column (0 for not suppressed and 1 for suppressed),
Data_Category column (1 for overall, 2 for lq, and 3 for over the year),
Data_Denomination column (goes 1-8 for the specific data being looked at in that category, e.g. monthly employment, taxable wages, etc.),
and a value column (which will be zero if the Data_Category is suppressed - since all the data denomination values will be zero).
Now, if overall data (cat 1) for, say, 1991 quarter 1 is suppressed, but the same quarter of the next year has both overall and over-the-year (cats 1 and 3) NOT suppressed, then we can infer what the value would be for that first year's suppressed data, since Overall1990q1 = (OTY1991q1 - Overall1991q1). So to find that suppressed data we would just subtract the cat 1 (denom 1-8) values from the cat 3 (denom 1-8) values to replace the zeroes that are in the suppressed values from the year before. It's fairly easy to grasp mathematically; the difficulty is that there are millions of rows to check against these criteria. I'm trying to write some kind of SQL query that would do this for me: check that overall year-n qtr-n is suppressed, then look to see whether the next year is unsuppressed for both overall and oty (in maybe some sort of complicated case statement?), and if those criteria are met, perform the arithmetic for the two Data_Cat/Data_Denom combinations and replace the zero in the respective cat/denom values.
Below is a simple sample (non-relevant data_cats removed) that I hope will help get what I'm trying to do across.
+----------+------------+------+-----+------------+----------+------------+-------+
| CountyID | IndustryID | Year | Qtr | Suppressed | Data_Cat | Data_Denom | Value |
+----------+------------+------+-----+------------+----------+------------+-------+
|        5 |         10 | 1990 |   1 |          1 |        1 |          1 |     0 |
|        5 |         10 | 1990 |   1 |          1 |        1 |          2 |     0 |
|        5 |         10 | 1990 |   1 |          1 |        1 |          3 |     0 |
|        5 |         10 | 1991 |   1 |          0 |        1 |          1 |     5 |
|        5 |         10 | 1991 |   1 |          0 |        1 |          2 |    15 |
|        5 |         10 | 1991 |   1 |          0 |        1 |          3 |    25 |
|        5 |         10 | 1991 |   1 |          0 |        3 |          1 |    20 |
|        5 |         10 | 1991 |   1 |          0 |        3 |          2 |    20 |
|        5 |         10 | 1991 |   1 |          0 |        3 |          3 |    35 |
+----------+------------+------+-----+------------+----------+------------+-------+
So basically what we're trying to do here is take the over-the-year 1991 value for each Data_Denom (I removed lq, data_cat 2, because it isn't relevant, and narrowed Data_Denom down from 8 to 3 for simplicity), subtract the overall 1991 value from it, and that gives you the applicable value for the previous year's suppressed 1990 cat 1 row. So here data_cat 1 Data_Denom 1 would be 15 (20 - 5), Denom 2 would be 5 (20 - 15), and Denom 3 would be 10 (35 - 25). (OTY 1991q1 - Overall 1991q1) = Overall 1990q1. I hope this helps. Like I said, the problem isn't the math, it's formulating a query that will check these criteria millions and millions of times.
If you want to find suppressed data that has 2 rows of unsuppressed data for the next year and quarter, we could use cross apply() to do something like this:
test setup: http://rextester.com/ORNCFR23551
using cross apply() to return rows with a valid derived value:
select t.*
, NewValue = cat3.value - cat1.value
from t
cross apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 1
) cat1
cross apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 3
) cat3
where t.Suppressed = 1
and t.Data_Cat = 1
returns:
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| CountyID | IndustryID | Year | Qtr | Suppressed | Data_Cat | Data_Denom | Value | NewValue |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| 5 | 10 | 1990 | 1 | 1 | 1 | 1 | 0 | 15 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 2 | 0 | 5 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 3 | 0 | 10 |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
Using outer apply() to return all rows
select t.*
, NewValue = coalesce(nullif(t.value,0),cat3.value - cat1.value,0)
from t
outer apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 1
) cat1
outer apply (
select i.value
from t as i
where i.CountyID = t.CountyID
and i.IndustryID = t.IndustryID
and i.Data_Denom = t.Data_Denom
and i.Year = t.Year +1
and i.Qtr = t.Qtr
and i.Suppressed = 0
and i.Data_Cat = 3
) cat3
returns:
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| CountyID | IndustryID | Year | Qtr | Suppressed | Data_Cat | Data_Denom | Value | NewValue |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
| 5 | 10 | 1990 | 1 | 1 | 1 | 1 | 0 | 15 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 2 | 0 | 5 |
| 5 | 10 | 1990 | 1 | 1 | 1 | 3 | 0 | 10 |
| 5 | 10 | 1991 | 1 | 0 | 1 | 1 | 5 | 5 |
| 5 | 10 | 1991 | 1 | 0 | 1 | 2 | 15 | 15 |
| 5 | 10 | 1991 | 1 | 0 | 1 | 3 | 25 | 25 |
| 5 | 10 | 1991 | 1 | 0 | 3 | 1 | 20 | 20 |
| 5 | 10 | 1991 | 1 | 0 | 3 | 2 | 20 | 20 |
| 5 | 10 | 1991 | 1 | 0 | 3 | 3 | 35 | 35 |
+----------+------------+------+-----+------------+----------+------------+-------+----------+
Ok, I think I get it.
If you're just wanting to make that one inference, then the following may help. (If this is just the first of many inferences you want to make in filling data gaps, you may find that a different method leads to a more efficient solution for doing both/all of them, but I guess cross that bridge when you get there...)
While much of the basic logic stays the same, how you'd tweak it depends on whether you want a query just to provide the values you would infer (e.g. to drive an UPDATE statement), or whether you want to use this logic inline in a bigger query. For performance reasons, I suspect the former makes more sense (especially if you can do the update once and then read the resulting dataset many times), so I'll start by framing things that way and come back to the other in a moment...
It sounds like you have a single table (I'll call it QCEW) with all these columns. In that case, use joins to associate each suppressed overall datapoint (c_oa in the following code) with the corresponding overall and oty datapoints from a year later:
SELECT c_oa.*, n_oa.value - n_oty.value inferred_value
FROM QCEW c_oa --current yr/qtr overall
inner join QCEW n_oa --next yr (same qtr) overall
on c_oa.countyId = n_oa.countyId
and c_oa.industryId = n_oa.industryId
and c_oa.year = n_oa.year - 1
and c_oa.qtr = n_oa.qtr
and c_oa.data_denom = n_oa.data_denom
inner join QCEW n_oty --next yr (same qtr) over-the-year
on c_oa.countyId = n_oty.countyId
and c_oa.industryId = n_oty.industryId
and c_oa.year = n_oty.year - 1
and c_oa.qtr = n_oty.qtr
and c_oa.data_denom = n_oty.data_denom
WHERE c_oa.SUPPRESSED = 1
AND c_oa.DATA_CAT = 1
AND n_oa.SUPPRESSED = 0
AND n_oa.DATA_CAT = 1
AND n_oty.SUPPRESSED = 0
AND n_oty.DATA_CAT = 3
Now it sounds like the table is big, and we've just joined 3 instances of it; so for this to work you'll need good physical design (appropriate indexes/stats for join columns, etc.). And that's why I'd suggest doing an update based on the above query once; sure, it may run long, but then you can read the inferred values in no time.
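For example, that one-time update might look like the following sketch (this assumes SQL Server's UPDATE ... FROM syntax; it overwrites value in place and does not record that the value was inferred):
UPDATE c_oa
SET value = n_oa.value - n_oty.value
FROM QCEW c_oa --current yr/qtr overall (suppressed)
inner join QCEW n_oa --next yr (same qtr) overall
   on c_oa.countyId = n_oa.countyId
  and c_oa.industryId = n_oa.industryId
  and c_oa.year = n_oa.year - 1
  and c_oa.qtr = n_oa.qtr
  and c_oa.data_denom = n_oa.data_denom
  and n_oa.suppressed = 0
  and n_oa.data_cat = 1
inner join QCEW n_oty --next yr (same qtr) over-the-year
   on c_oa.countyId = n_oty.countyId
  and c_oa.industryId = n_oty.industryId
  and c_oa.year = n_oty.year - 1
  and c_oa.qtr = n_oty.qtr
  and c_oa.data_denom = n_oty.data_denom
  and n_oty.suppressed = 0
  and n_oty.data_cat = 3
WHERE c_oa.suppressed = 1
  AND c_oa.data_cat = 1;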
But if you really want to merge this directly into a query of the data you could modify it some to show all values, with inferred values mixed in. We need to switch to outer joins to do this, and I'm going to do some slightly weird things with join conditions to make it fit together:
SELECT src.COUNTYID
, src.INDUSTRYID
, src.YEAR
, src.QTR
, case when (n_oa.value - n_oty.value) is null
then src.suppressed
else 2
end as SUPPRESSED_CODE -- 0=NOT SUPPRESSED, 1=SUPPRESSED, 2=INFERRED
, src.DATA_CAT
, src.DATA_DENOM
, coalesce(n_oa.value - n_oty.value, src.value) as VALUE
FROM QCEW src --a source row from which we'll generate a record
left join QCEW n_oa --next yr (same qtr) overall (if src is suppressed/overall)
on src.countyId = n_oa.countyId
and src.industryId = n_oa.industryId
and src.year = n_oa.year - 1
and src.qtr = n_oa.qtr
and src.data_denom = n_oa.data_denom
and src.SUPPRESSED = 1 and n_oa.SUPPRESSED = 0
and src.DATA_CAT = 1 and n_oa.DATA_CAT = 1
left join QCEW n_oty --next yr (same qtr) over-the-year (if src is suppressed/overall)
on src.countyId = n_oty.countyId
and src.industryId = n_oty.industryId
and src.year = n_oty.year - 1
and src.qtr = n_oty.qtr
and src.data_denom = n_oty.data_denom
and src.SUPPRESSED = 1 and n_oty.SUPPRESSED = 0
and src.DATA_CAT = 1 and n_oty.DATA_CAT = 3
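On the physical-design note above: a covering index on the join columns is the kind of support these self-joins need. A sketch in SQL Server syntax (the index name and column order are assumptions to be tuned against your actual workload):
CREATE NONCLUSTERED INDEX ix_qcew_infer
  ON QCEW (CountyID, IndustryID, Data_Denom, Year, Qtr, Data_Cat, Suppressed)
  INCLUDE (Value);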
I'm designing a database to store information about events that are dynamic in nature. What I mean by this is that, each type of event will have some variables attached to them that changes on each occurrence based on some rules defined by the user.
Let's say we have Event Type A with variable X and Y. In this event type, the user can define some rules that determines the value of X and Y on each occurrence of the event.
An example of a set of rules a user might define:
On first occurrence, X = 0; Y = 0;
On each occurrence, X = X + 1;
On each occurrence, if X == 100 then { X = 0; Y = Y + 1 }
By defining these rules, the values of X and Y change dynamically on each occurrence of the event as follows:
1st occurrence: X = 1, Y = 0
2nd occurrence: X = 2, Y = 0
...
100th occurrence: X = 0, Y = 1
Now, I'm not sure how to store the "user-defined rules" in a database and later query them in my code. Can anyone point me in the right direction? Here's a start:
EVENTS
id;
name;
description;
event_type;
EVENT_TYPE_A_OCCURRENCES
id;
event_id;
X;
Y;
EVENT_RULES
id;
event_id;
frequency; // the frequency in which this rule applies
at_occurrence; // apply this rule at a specific occurrence
condition; // stores the code for the condition
statements; // stores the code for the statements
I'm no expert, please help me solve this problem. Thank you.
Assume the following user-defined rules are stored in a table:
------------------------------------------------------------------
|eventid|occurrence|keep-old-x|keep-old-y|x-frequency|y-frequency|
------------------------------------------------------------------
|   A   |    1     |    T     |    F     |     1     |    100    |
------------------------------------------------------------------
|   B   |    2     |    F     |    T     |    -2     |     0     |
------------------------------------------------------------------
|   C   |    5     |    T     |    T     |    100    |    -3     |
------------------------------------------------------------------
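If it helps, here is a minimal DDL sketch of that rules table (the table and column names and the types are assumptions; adapt to your DBMS):
CREATE TABLE event_rules (
    event_id     VARCHAR(10) NOT NULL,  -- the eventid column above
    occurrence   INT NOT NULL,          -- occurrence at which the rule applies
    keep_old_x   CHAR(1),               -- 'T': X = X + x_frequency; 'F': X = x_frequency
    keep_old_y   CHAR(1),               -- 'T': Y = Y + y_frequency; 'F': Y = y_frequency
    x_frequency  INT,
    y_frequency  INT,
    PRIMARY KEY (event_id, occurrence)
);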
Let's say that before the event, X = 10 and Y = 12.
Event = A, occurrence = 1, keep-old-x = T, keep-old-y = F, x-frequency = 1, y-frequency = 100:
if keep-old-x is T then
X = X + x-frequency
else
X = x-frequency
endif
if keep-old-y is T then
Y = Y + y-frequency
else
Y = y-frequency
endif
Now, X = 11, Y = 100
You may need to add two more columns to reset the value of X when it reaches a specific value, as:
--------------------------
|if-x-value| x-new-value |
--------------------------
| 100 | 0 |
--------------------------
| 125 | 5 |
--------------------------
| 150 | 10 |
--------------------------
I hope this helps.
I want to calculate the maximum of a list of SAS variables, where the list is determined by another variable present in the dataset. That is,
| var_1 | var_2 | var_3 | var_4 | maximum till | formula used | var_output |
|-------|-------|-------|-------|--------------|----------------------|------------|
| 3 | 6 | 9 | 12 | 4 | =max(of var_1-var_4) | 12 |
| 1 | 10 | 100 | 1000 | 2 | =max(of var_1-var_2) | 10 |
| 5 | 15 | 25 | 35 | 3 | =max(of var_1-var_3) | 25 |
Appreciate any help. Thanks :)
Use a do loop and a rolling maximum:
data want;
  set have;
  array vars{4} var1-var4;
  * max_till holds how many of var1-var4 to consider;
  do i = 1 to max_till;
    var_out = max(vars{i}, var_out);  * max() ignores the initial missing value;
  end;
  drop i;
run;
The FCMP solution. This is similar to user667489's solution, but implemented as a function. This will only work in 9.4, and possibly only 9.4 TS1M0+.
data have; *some data;
input var1-var4 varmax;
datalines;
3 6 9 12 4
1 10 100 1000 2
5 15 25 35 3
;;;;
run;
proc fcmp outlib=work.funcs.func; *store functions here;
function maxof(mxarr[*],maxlim); *returns numeric;
do _i = 1 to maxlim;
_max = max(mxarr[_i],_max);
end;
return(_max);
endsub;
run;
options cmplib=work.funcs; *define where functions come from;
data want;
set have;
array vars var1-var4;
varout = maxof(vars,varmax); *use function (pass array by reference);
run;
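For reference, both approaches should reproduce the var_output column from the question's table: 12 for the first row (max of var1-var4), 10 for the second (max of var1-var2 only), and 25 for the third (max of var1-var3).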