Do Loops (with multiple rows of id's) with conditional statements? - arrays

Please see my data below:
data finance;
input id loan1 loan2 loan3 assets home$ type;
datalines;
1 93000 98000 45666 . new 1
1 98000 45678 98765 67 old 2
1 55000 56764 435371 54 new 1
2 7000 6000 7547 57 new 1
4 67333 87444 98666 34 old 1
4 98000 68777 986465 23 new 1
5 4555 334 652 12 new 1
5 78999 98999 80000 34 new 1
5 889 989 676 3 new 1
;
data finance1;
set finance;
if loan1<80000 then conc='level1';
if loan2 <80000 and home='new' then borrowcap = 'high';
run;
I would like the following dataset. Although there are initially multiple rows for each ID, if any of those rows produces a level1 or a high, I would like to capture that in a single row per ID.
data finance;
input id conc$ borrowcap$;
datalines;
1 level1 high
2 level1 high
4 level1
5 level1 high
;
Any help is appreciated!

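Use a RETAIN statement to carry a value forward across the rows of each ID, so a flag set on any row is still there on the last one. Combine a BY statement (the data must be sorted by id) with a subsetting "if last.id;" to keep only one row per ID.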
data finance;
input id loan1 loan2 loan3 assets home$ type;
datalines;
1 93000 98000 45666 . new 1
1 98000 45678 98765 67 old 2
1 55000 56764 435371 54 new 1
2 7000 6000 7547 57 new 1
4 67333 87444 98666 34 old 1
4 98000 68777 986465 23 new 1
5 4555 334 652 12 new 1
5 78999 98999 80000 34 new 1
5 889 989 676 3 new 1
;
data finance1;
set finance;
by id;
retain conc borrowcap;
length conc borrowcap $8.;
if first.id then call missing(conc,borrowcap);
if loan1<80000 then conc='level1';
if loan2<80000 and home='new' then borrowcap = 'high';
if last.id;
run;
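For readers who do not use SAS, the same retain / first.id / last.id logic can be sketched in Python. This is only an illustration under assumptions: the names finance, collapse_by_id and finance1 are made up here, only the columns the conditions use are kept, and the rows are assumed to be sorted by id already (as the BY statement requires).

```python
from itertools import groupby
from operator import itemgetter

# Rows mirror the finance dataset as (id, loan1, loan2, home); other columns are omitted.
finance = [
    (1, 93000, 98000, "new"),
    (1, 98000, 45678, "old"),
    (1, 55000, 56764, "new"),
    (2, 7000, 6000, "new"),
    (4, 67333, 87444, "old"),
    (4, 98000, 68777, "new"),
    (5, 4555, 334, "new"),
    (5, 78999, 98999, "new"),
    (5, 889, 989, "new"),
]


def collapse_by_id(rows):
    """Return one (id, conc, borrowcap) tuple per id; rows must be sorted by id."""
    result = []
    for key, group in groupby(rows, key=itemgetter(0)):
        conc = borrowcap = None  # plays the role of call missing() at first.id
        for _, loan1, loan2, home in group:
            if loan1 < 80000:
                conc = "level1"  # kept for the rest of the group, like RETAIN
            if loan2 < 80000 and home == "new":
                borrowcap = "high"
        result.append((key, conc, borrowcap))  # one row per id, like if last.id
    return result


finance1 = collapse_by_id(finance)
```

With the conditions exactly as posted, id 4 also ends up with "high", because its second row has loan2 = 68777 and home = 'new'.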

Related

First-In-First-Out Stock trading - calculate cumulative P/L

I want a SQL Server query to calculate the cumulative P/L on stock trades (FIFO-based calculation).
Input table:
EXECTIME          share_name  Quantity  Price  Buy/Sell
2013-01-01 12:25  abc         100       100    B
2013-01-01 12:26  abc         10        102    S
2013-01-01 12:27  abc         10        102    S
2013-01-01 12:28  abc         10        95     S
2013-01-01 12:29  abc         10        99     S
2013-01-01 12:30  abc         10        105    S
2013-01-01 12:31  abc         100       102    B
2013-01-01 12:32  abc         150       101    S
Output:
EXECTIME          Cumulative P/L  Winning Streak  Losing Streak
2013-01-01 12:26  20              1               0
2013-01-01 12:27  40              1               0
2013-01-01 12:28  -10             0               1
2013-01-01 12:29  -20             0               2
2013-01-01 12:30  30              1               0
2013-01-01 12:32  -20             0               1
Explanation:
Output row 1: 10 shares sold at 102 that were purchased at 100, so profit = (102 - 100) * 10 = 20.
Output row 6: 150 shares sold at 101:
50 were purchased at 100 (input row 1; the other 50 shares of that lot were already sold above)
100 were purchased at 102 (input row 7)
150 * 101 - [(50 * 100) + (100 * 102)] = -50
Cumulative P/L = 30 + (-50) = -20
Winning streak: 1 for a positive P/L.
Losing streak: 1, 2, ... for consecutive losses; it resets after a profit.
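The FIFO arithmetic in the explanation can be checked with a short Python sketch. It is not the SQL Server query the question asks for, only a reference for the expected numbers. The function name fifo_pnl is made up, and the sketch assumes trades arrive in time order, that nothing is sold short, and that a sale with exactly zero P/L leaves both streaks unchanged (the question does not say how to treat it).

```python
from collections import deque


def fifo_pnl(trades):
    """trades: (exectime, quantity, price, side) tuples in time order, side 'B' or 'S'."""
    lots = deque()   # open buy lots as [remaining_quantity, price], oldest first
    cumulative = 0
    win = loss = 0
    output = []
    for exectime, quantity, price, side in trades:
        if side == "B":
            lots.append([quantity, price])
            continue
        pnl = 0
        remaining = quantity
        while remaining > 0:          # assumes enough shares were bought earlier
            lot = lots[0]
            matched = min(remaining, lot[0])
            pnl += matched * (price - lot[1])
            lot[0] -= matched
            remaining -= matched
            if lot[0] == 0:
                lots.popleft()        # oldest lot fully consumed
        cumulative += pnl
        if pnl > 0:
            win, loss = 1, 0          # the sample output shows 1, not a running count
        elif pnl < 0:
            win, loss = 0, loss + 1   # consecutive losses count up
        output.append((exectime, cumulative, win, loss))
    return output


trades = [
    ("2013-01-01 12:25", 100, 100, "B"),
    ("2013-01-01 12:26", 10, 102, "S"),
    ("2013-01-01 12:27", 10, 102, "S"),
    ("2013-01-01 12:28", 10, 95, "S"),
    ("2013-01-01 12:29", 10, 99, "S"),
    ("2013-01-01 12:30", 10, 105, "S"),
    ("2013-01-01 12:31", 100, 102, "B"),
    ("2013-01-01 12:32", 150, 101, "S"),
]
result = fifo_pnl(trades)
```

On the sample trades this reproduces the six output rows above, including the -50 for the last sale that is split across two buy lots.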

Aggregate function with window function filtered by time

I have a table with data about buses while making their routes. There are columns for:
bus trip id (different each time a bus starts the route from the first stop)
bus stop id
datetime column that indicates the moment that the bus leaves each bus stop
integer that indicates how many passengers entered the bus in that stop
There is no information about how many passengers get off the bus on each stop, so I have to make an estimation supposing that once they get on the bus, they stay on it for 30 minutes. The trip lasts about 70 minutes from the first to the last stop.
I am trying to aggregate results on each stop using
SUM(iPassengersIn) OVER (
PARTITION BY tripDate, tripId
ORDER BY busStopOrder
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) total_passengers
The problem is that this sums passengers since the beginning of the trip, not since "30 minutes ago" at each stop. How can I limit the aggregation to "the last 30 minutes" on each row in order to estimate the occupancy between stops?
This is a subset of my data:
trip_date trip_id bus_stop_order minutes_since_trip_start passengers_in trip_total_passengers
2020-06-08 374910 0 0 0 0
2020-06-08 374910 1 3 0 0
2020-06-08 374910 2 5 1 1
2020-06-08 374910 3 8 0 1
2020-06-08 374910 4 9 0 1
2020-06-08 374910 5 12 0 1
2020-06-08 374910 6 13 0 1
2020-06-08 374910 7 13 0 1
2020-06-08 374910 8 15 0 1
2020-06-08 374910 9 16 0 1
2020-06-08 374910 10 16 0 1
2020-06-08 374910 11 17 0 1
2020-06-08 374910 12 18 2 3
2020-06-08 374910 13 20 0 3
2020-06-08 374910 14 22 0 3
2020-06-08 374910 15 24 0 3
2020-06-08 374910 16 25 0 3
2020-06-08 374910 17 28 2 5
2020-06-08 374910 18 30 1 6
2020-06-08 374910 19 31 0 6
2020-06-08 374910 20 33 0 6
2020-06-08 374910 21 41 3 9
2020-06-08 374910 22 44 3 12
2020-06-08 374910 23 45 4 16
2020-06-08 374910 24 48 2 18
2020-06-08 374910 25 48 2 20
2020-06-08 374910 26 50 0 20
2020-06-08 374910 27 51 0 20
2020-06-08 374910 28 51 0 20
2020-06-08 374910 29 53 0 20
2020-06-08 374910 30 55 0 20
2020-06-08 374910 31 58 0 20
For the row with bus_stop_order 21 (41 minutes into the bus trip), where 3 passengers enter the bus, I have to sum only the passengers that entered the bus between minute 11 and 41. Thus, the passenger that entered the bus in the 2nd bus stop (5 minutes into the trip) should be excluded.
That should be applied for every row.
The only thing I can think of is:
select
    trip_date,
    trip_id,
    minutes_since_trip_start,
    v.total_passengers
from
    #t t1
    outer apply (
        select sum(passengers_in)
        from #t t2
        where
            t1.trip_date = t2.trip_date
            and t1.trip_id = t2.trip_id
            and t2.bus_stop_order <= t1.bus_stop_order
            and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
    ) v(total_passengers)
order by
    trip_date,
    trip_id,
    minutes_since_trip_start
;
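To check what the 30-minute window should return, here is a small Python sketch of the same rule for a single trip. The function name rolling_passengers is made up, and the input is assumed to be (minutes_since_trip_start, passengers_in) pairs already in bus_stop_order. It keeps a sliding window instead of re-scanning earlier stops for every row.

```python
from collections import deque


def rolling_passengers(stops, window=30):
    """Sum of passengers_in over the last `window` minutes at each stop of one trip."""
    recent = deque()   # (minute, boarded) pairs still inside the window
    running = 0
    result = []
    for minute, boarded in stops:
        recent.append((minute, boarded))
        running += boarded
        # Drop stops that are more than `window` minutes before the current one.
        while recent[0][0] < minute - window:
            running -= recent.popleft()[1]
        result.append(running)
    return result


# A few rows taken from the sample data: (minutes_since_trip_start, passengers_in)
sample = [(0, 0), (3, 0), (5, 1), (18, 2), (28, 2), (30, 1), (41, 3), (44, 3)]
estimates = rolling_passengers(sample)
```

For the stop at minute 41 the window covers minutes 11 to 41, so the passenger who boarded at minute 5 is no longer counted and the estimate is 8 instead of the running total of 9.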

Issues Regarding SAS

I was working on a homework problem that involves using arrays and looping to create a new variable identifying the date on which the maximum blood lead value was obtained, but I got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled so that the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject; it could be the first, second or third. In addition, some of the months and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to be set equal to 15, and any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tips regarding this issue? What did I do wrong? Was I supposed to create an additional array for the date on which the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from
Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte itself. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax). Instead, choose an element of the array and assign the value to that element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.
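To make the intended logic of steps 2 to 6 concrete, here is a rough Python sketch for a single child. It is not SAS and not a solution to the assignment as worded; the function name summarize_child is made up, ties for the maximum are resolved by taking the first matching sample, and a missing date of birth (as for ID 10) is not handled. The point is that the loop picks the sample date at the index where the lead value equals the maximum, rather than storing the index itself.

```python
from datetime import date


def summarize_child(dob, days, months, levels, year=1990):
    """Return (max_level, date_of_max, age_in_years, agecat) for one child."""
    # Recode missing values as decided by the investigators.
    days = [15 if d == -9 else d for d in days]
    months = [6 if m == -9 else m for m in months]
    sample_dates = [date(year, m, d) for m, d in zip(months, days)]

    max_level = max(levels)
    max_date = None
    for level, sampled in zip(levels, sample_dates):
        if level == max_level:
            max_date = sampled   # the date at the matching index, not the index
            break

    age_years = round((max_date - dob).days / 365.25, 2)
    if age_years < 4:
        agecat = 1
    elif age_years < 8:
        agecat = 2
    else:
        agecat = 3
    return max_level, max_date, age_years, agecat


# Child 5 from the data: the highest value is sample 3, whose day was coded -9.
child5 = summarize_child(date(1980, 12, 26), [21, 3, -9], [5, 7, 12], [4.35, 4.79, 5.14])
```

Note that the age category is derived from the age in years, not from the raw difference in days.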
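You are not handling the ID field on lines #2-4 properly. Each of those lines also starts with the ID, so your INPUT statement reads the ID into the first variable listed for that line: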
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example, you need to skip field 1 on lines 2-4, or read the IDs into separate variables so that you can check they are all the same.
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example shows how to check that you have 4 lines with the same ID and, if you do, read the rest of the variables; otherwise it executes LOSTCARD. In the sample below, ID 3 has a missing record:
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;
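The resynchronisation that LOSTCARD performs in this example can be sketched in Python as follows. This is an approximation for illustration only: read_records is a made-up name, the lines are split on whitespace, and leftover lines at the end that do not fill a whole group are silently ignored.

```python
def read_records(lines, per_record=4):
    """Group lines into records of `per_record` lines that share the same leading ID."""
    rows = [line.split() for line in lines if line.strip()]
    records, lost = [], []
    i = 0
    while i + per_record <= len(rows):
        group = rows[i:i + per_record]
        if len({row[0] for row in group}) == 1:
            records.append(group)   # all IDs agree: accept the record
            i += per_record
        else:
            lost.append(rows[i])    # drop the first line and retry from the next one
            i += 1
    return records, lost


lines = [
    "1 04/30/78 6 10", "1 -9 7", "1 14 1", "1 1.62 1.35 1.47 F",
    "2 05/19/79 27 11", "2 20 -9", "2 5 6", "2 1.71 1.31 1.76 F",
    "3 01/03/80 11 7", "3 27 2", "3 3.24 3.4 3.83 M",
    "4 08/01/80 5 12", "4 28 -9", "4 3 4", "4 3.1 3.69 3.27 M",
]
records, lost = read_records(lines)
```

On these 15 lines the three lines for ID 3 are discarded one at a time and three complete records (IDs 1, 2 and 4) remain, which matches the three observations reported in the log.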

Subsetting Last N Values From a Data Frame, R

I have a data frame of all the results of a football season, in a data frame called new. I want to extract the last 5 games of all teams home and away. The home variable is column 1 and away variable is column 2.
Say there are 20 teams in a character vector called teams, each with a unique name. If it was just a single team it would be easy to subset - say if team1 was "Arsenal", using something like
Arsenal <- "Arsenal"
head(new[new[,1] == Arsenal | new[,2] == Arsenal,], 5)
But I want to loop through the character vector teams to obtain the last 5 results of all teams, 20 in total. Can somebody help me please?
Edit: Here is some sample data. As an example, I would like to obtain the last two games of all teams- it would be easy to subset a single team but I'm not sure how to subset multiple teams.
V1 V2 V3 V4 V5
1 Chelsea Everton 2 1 19/05/2013
2 Liverpool QPR 1 0 19/05/2013
3 Man City Norwich 2 3 19/05/2013
4 Newcastle Arsenal 0 1 19/05/2013
5 Southampton Stoke 1 1 19/05/2013
6 Swansea Fulham 0 3 19/05/2013
7 Tottenham Sunderland 1 0 19/05/2013
8 West Brom Man United 5 5 19/05/2013
9 West Ham Reading 4 2 19/05/2013
10 Wigan Aston Villa 2 2 19/05/2013
11 Arsenal Wigan 4 1 14/05/2013
12 Reading Man City 0 2 14/05/2013
13 Everton West Ham 2 0 12/05/2013
14 Fulham Liverpool 1 3 12/05/2013
15 Man United Swansea 2 1 12/05/2013
16 Norwich West Brom 4 0 12/05/2013
17 QPR Newcastle 1 2 12/05/2013
18 Stoke Tottenham 1 2 12/05/2013
19 Sunderland Southampton 1 1 12/05/2013
20 Aston Villa Chelsea 1 2 11/05/2013
21 Chelsea Tottenham 2 2 08/05/2013
22 Man City West Brom 1 0 07/05/2013
23 Wigan Swansea 2 3 07/05/2013
24 Sunderland Stoke 1 1 06/05/2013
25 Liverpool Everton 0 0 05/05/2013
26 Man United Chelsea 0 1 05/05/2013
27 Fulham Reading 2 4 04/05/2013
28 Norwich Aston Villa 1 2 04/05/2013
29 QPR Arsenal 0 1 04/05/2013
30 Swansea Man City 0 0 04/05/2013
31 Tottenham Southampton 1 0 04/05/2013
32 West Brom Wigan 2 3 04/05/2013
33 West Ham Newcastle 0 0 04/05/2013
34 Aston Villa Sunderland 6 1 29/04/2013
35 Arsenal Man United 1 1 28/04/2013
36 Chelsea Swansea 2 0 28/04/2013
37 Reading QPR 0 0 28/04/2013
38 Everton Fulham 1 0 27/04/2013
39 Man City West Ham 2 1 27/04/2013
40 Newcastle Liverpool 0 6 27/04/2013
41 Southampton West Brom 0 3 27/04/2013
42 Stoke Norwich 1 0 27/04/2013
43 Wigan Tottenham 2 2 27/04/2013
Where df is your data.frame, this will create a list of 20 data.frames with each element being the dataset for one team. This also assumes that the dataset is already ordered, since you mentioned it.
library(data.table)  # setnames() comes from the data.table package
setnames(df, c('hometeam','awayteam','homegoals','awaygoals','fixturedate'))
allteams <- sort(unique(df$hometeam))
eachteamlastfive <- vector(mode = "list", length = length(allteams))
for (i in seq_along(allteams))
{
  # rows where the team plays at home or away; head() keeps the first five
  eachteamlastfive[[i]] <- head(df[df$hometeam == allteams[i] | df$awayteam == allteams[i], ], 5)
}
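Take a look at sapply. Pass simplify = FALSE (or use lapply) so that you get a list of data frames back rather than a simplified matrix:
sapply(unique(new[,1]), function(team) head(new[new[,1] == team | new[,2] == team,], 5), simplify = FALSE)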
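The per-team filtering that both answers rely on can be sketched in Python as well. The function name last_n_games is made up, and the sketch assumes, as the answers do, that the rows are already ordered from most recent to oldest, so the first n matching rows are the last n games.

```python
def last_n_games(matches, n=5):
    """matches: (home, away, home_goals, away_goals, date) tuples, most recent first."""
    teams = sorted({team for match in matches for team in match[:2]})
    # For each team keep the first n rows in which it appears as home or away side.
    return {
        team: [match for match in matches if team in match[:2]][:n]
        for team in teams
    }


# A handful of rows from the sample data, in their original order.
matches = [
    ("Chelsea", "Everton", 2, 1, "19/05/2013"),
    ("Newcastle", "Arsenal", 0, 1, "19/05/2013"),
    ("Arsenal", "Wigan", 4, 1, "14/05/2013"),
    ("Aston Villa", "Chelsea", 1, 2, "11/05/2013"),
    ("Chelsea", "Tottenham", 2, 2, "08/05/2013"),
    ("QPR", "Arsenal", 0, 1, "04/05/2013"),
]
last_two = last_n_games(matches, n=2)
```

Unlike unique(new[,1]), this collects team names from both the home and the away column, so a team that only appears as the away side in the data is still included.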

Average of Counts

I have a table called totals, and the data looks like this:
ACC_ID Data_ID Mon Weeks Total_AR_Count Total_FR_Count Total_OP_Count
23 9 01/2011 4 172 251 194
42 9 01/2011 4 2 16 28
75 9 01/2011 4 33 316 346
75 9 07/2011 5 1 12 20
42 9 09/2011 5 25 758 25
I want the output to be the average of each count, grouped by ACC_ID and Data_ID:
ACC_ID Data_ID Avg_AR_Count Avg_FR_Count Avg_OP_Count
23 9 172 251 194
42 9 13.5 387 26.5
75 9 17 164 183
How can I do this?
Your description of what you want just about writes the SQL:
SELECT ACC_ID, Data_ID, AVG(Total_AR_Count) AS Avg_AR_Count, AVG(Total_FR_Count) AS Avg_FR_Count...
FROM totals
GROUP BY ACC_ID, Data_ID
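As a quick check of the expected averages, the grouping can be reproduced in Python on the sample rows. The names totals, average_counts and averages are only illustrative, and the Mon and Weeks columns are left out because they do not take part in the grouping.

```python
from collections import defaultdict

totals = [
    # (ACC_ID, Data_ID, Total_AR_Count, Total_FR_Count, Total_OP_Count)
    (23, 9, 172, 251, 194),
    (42, 9, 2, 16, 28),
    (75, 9, 33, 316, 346),
    (75, 9, 1, 12, 20),
    (42, 9, 25, 758, 25),
]


def average_counts(rows):
    """Average each count column per (ACC_ID, Data_ID) pair."""
    groups = defaultdict(list)
    for acc_id, data_id, *counts in rows:
        groups[(acc_id, data_id)].append(counts)
    return {
        key: tuple(sum(column) / len(column) for column in zip(*members))
        for key, members in sorted(groups.items())
    }


averages = average_counts(totals)
```

The result matches the desired output table, for example (13.5, 387.0, 26.5) for ACC_ID 42.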

Resources