Subsetting Last N Values From a Data Frame, R - arrays

I have a data frame, new, containing all the results of a football season. I want to extract the last 5 games, home and away, of every team. The home team is in column 1 and the away team in column 2.
Say there are 20 teams, each with a unique name, in a character vector called teams. If it were just a single team it would be easy to subset. Say team1 was "Arsenal"; I would use something like
Arsenal <- "Arsenal"
head(new[new[,1] == Arsenal | new[,2] == Arsenal,], 5)
But I want to loop through the character vector teams to obtain the last 5 results of all teams, 20 in total. Can somebody help me please?
Edit: Here is some sample data. As an example, I would like to obtain the last two games of all teams. It would be easy to subset a single team, but I'm not sure how to subset multiple teams.
V1 V2 V3 V4 V5
1 Chelsea Everton 2 1 19/05/2013
2 Liverpool QPR 1 0 19/05/2013
3 Man City Norwich 2 3 19/05/2013
4 Newcastle Arsenal 0 1 19/05/2013
5 Southampton Stoke 1 1 19/05/2013
6 Swansea Fulham 0 3 19/05/2013
7 Tottenham Sunderland 1 0 19/05/2013
8 West Brom Man United 5 5 19/05/2013
9 West Ham Reading 4 2 19/05/2013
10 Wigan Aston Villa 2 2 19/05/2013
11 Arsenal Wigan 4 1 14/05/2013
12 Reading Man City 0 2 14/05/2013
13 Everton West Ham 2 0 12/05/2013
14 Fulham Liverpool 1 3 12/05/2013
15 Man United Swansea 2 1 12/05/2013
16 Norwich West Brom 4 0 12/05/2013
17 QPR Newcastle 1 2 12/05/2013
18 Stoke Tottenham 1 2 12/05/2013
19 Sunderland Southampton 1 1 12/05/2013
20 Aston Villa Chelsea 1 2 11/05/2013
21 Chelsea Tottenham 2 2 08/05/2013
22 Man City West Brom 1 0 07/05/2013
23 Wigan Swansea 2 3 07/05/2013
24 Sunderland Stoke 1 1 06/05/2013
25 Liverpool Everton 0 0 05/05/2013
26 Man United Chelsea 0 1 05/05/2013
27 Fulham Reading 2 4 04/05/2013
28 Norwich Aston Villa 1 2 04/05/2013
29 QPR Arsenal 0 1 04/05/2013
30 Swansea Man City 0 0 04/05/2013
31 Tottenham Southampton 1 0 04/05/2013
32 West Brom Wigan 2 3 04/05/2013
33 West Ham Newcastle 0 0 04/05/2013
34 Aston Villa Sunderland 6 1 29/04/2013
35 Arsenal Man United 1 1 28/04/2013
36 Chelsea Swansea 2 0 28/04/2013
37 Reading QPR 0 0 28/04/2013
38 Everton Fulham 1 0 27/04/2013
39 Man City West Ham 2 1 27/04/2013
40 Newcastle Liverpool 0 6 27/04/2013
41 Southampton West Brom 0 3 27/04/2013
42 Stoke Norwich 1 0 27/04/2013
43 Wigan Tottenham 2 2 27/04/2013

Where df is your data.frame, this will create a list of 20 data.frames, with each element holding the last five games for one team. It assumes the data is already ordered with the most recent fixture first, as you mentioned. (Note the original used setnames, which is a data.table function; base R's names<- works on a plain data.frame.)
names(df) <- c('hometeam','awayteam','homegoals','awaygoals','fixturedate')
allteams <- sort(unique(df$hometeam))
eachteamlastfive <- vector(mode = "list", length = length(allteams))
for (i in seq_along(allteams)) {
  eachteamlastfive[[i]] <- head(df[df$hometeam == allteams[i] | df$awayteam == allteams[i], ], 5)
}

Take a look at sapply:
sapply(unique(new[,1]), function(team) head(new[new[,1] == team | new[,2] == team, ], 5))
Note that lapply would return the per-team results as a plain list of data frames, which is usually easier to work with than the simplified structure sapply tries to produce.
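For anyone approaching the same problem from Python, the filter-then-head idea translates directly to pandas. This is only a sketch; the toy fixtures and the home/away column names are invented stand-ins for the example data:

```python
import pandas as pd

# Toy fixtures, most recent first (mirrors the ordering assumption in the R answer)
new = pd.DataFrame({
    "home": ["Chelsea", "Liverpool", "Arsenal", "Everton", "QPR"],
    "away": ["Everton", "QPR", "Wigan", "West Ham", "Arsenal"],
})

def last_n_games(df, team, n=2):
    """Rows where `team` played home or away, keeping the first n (most recent)."""
    mask = (df["home"] == team) | (df["away"] == team)
    return df[mask].head(n)

# One result per team, as a dict of small data frames
teams = pd.unique(pd.concat([new["home"], new["away"]]))
last_games = {team: last_n_games(new, team) for team in teams}
```

The dict comprehension plays the role of the R loop over allteams.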

Related

Do Loops (with multiple rows of id's) with conditional statements?

Please see my data below;
data finance;
input id loan1 loan2 loan3 assets home$ type;
datalines;
1 93000 98000 45666 . new 1
1 98000 45678 98765 67 old 2
1 55000 56764 435371 54 new 1
2 7000 6000 7547 57 new 1
4 67333 87444 98666 34 old 1
4 98000 68777 986465 23 new 1
5 4555 334 652 12 new 1
5 78999 98999 80000 34 new 1
5 889 989 676 3 new 1
;
data finance1;
set finance;
if loan1<80000 then conc='level1';
if loan2 <80000 and home='new' then borrowcap = 'high';
run;
I would like the following dataset. As you can see, although there are multiple rows for each ID initially, if there was a 'level1' or 'high' in any of those rows, I would like to capture that in a single row per ID.
data finance;
input id conc$ borrowcap$;
datalines;
1 level1 high
2 level1 high
4 level1
5 level1 high
;
Any help is appreciated!
Use a RETAIN statement to carry a value forward across the rows for each ID. Combined with a BY statement and a subsetting `if last.id;`, you keep only one row per ID.
data finance;
input id loan1 loan2 loan3 assets home$ type;
datalines;
1 93000 98000 45666 . new 1
1 98000 45678 98765 67 old 2
1 55000 56764 435371 54 new 1
2 7000 6000 7547 57 new 1
4 67333 87444 98666 34 old 1
4 98000 68777 986465 23 new 1
5 4555 334 652 12 new 1
5 78999 98999 80000 34 new 1
5 889 989 676 3 new 1
;
data finance1;
set finance;
by id;
retain conc borrowcap;
length conc borrowcap $ 8;
if first.id then call missing(conc,borrowcap);
if loan1<80000 then conc='level1';
if loan2<80000 and home='new' then borrowcap = 'high';
if last.id;
run;
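For comparison, the same collapse-to-one-row-per-ID pattern can be sketched in pandas (a sketch only; the column names mirror the SAS example, and a flag-then-groupby-max trick stands in for RETAIN plus last.id):

```python
import pandas as pd

finance = pd.DataFrame({
    "id":    [1, 1, 2, 4, 4],
    "loan1": [93000, 55000, 7000, 67333, 98000],
    "loan2": [98000, 56764, 6000, 87444, 68777],
    "home":  ["new", "new", "new", "old", "new"],
})

# Flag each row, then take the max within each id: any qualifying row
# sets the flag for the whole id (the RETAIN + first./last. pattern).
finance["conc"] = (finance["loan1"] < 80000).map({True: "level1", False: ""})
finance["borrowcap"] = (
    (finance["loan2"] < 80000) & (finance["home"] == "new")
).map({True: "high", False: ""})
finance1 = finance.groupby("id")[["conc", "borrowcap"]].max().reset_index()
```

max works here because a non-empty string sorts after the empty string, so the flag "sticks" once any row sets it.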

Aggregate function with window function filtered by time

I have a table with data about buses while making their routes. There are columns for:
bus trip id (different each time a bus starts the route from the first stop)
bus stop id
datetime column that indicates the moment that the bus leaves each bus stop
integer that indicates how many passengers entered the bus in that stop
There is no information about how many passengers get off the bus on each stop, so I have to make an estimation supposing that once they get on the bus, they stay on it for 30 minutes. The trip lasts about 70 minutes from the first to the last stop.
I am trying to aggregate results on each stop using
SUM(iPassengersIn) OVER (
PARTITION BY tripDate, tripId
ORDER BY busStopOrder
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) total_passengers
The problem is that I can add passengers since the beginning of the trip, but not since "30 minutes ago" on each stop. How could I limit the aggregation to "the last 30 minutes" on each row in order to estimate the occupation between stops?
This is a subset of my data:
trip_date trip_id bus_stop_order minutes_since_trip_start passengers_in trip_total_passengers
2020-06-08 374910 0 0 0 0
2020-06-08 374910 1 3 0 0
2020-06-08 374910 2 5 1 1
2020-06-08 374910 3 8 0 1
2020-06-08 374910 4 9 0 1
2020-06-08 374910 5 12 0 1
2020-06-08 374910 6 13 0 1
2020-06-08 374910 7 13 0 1
2020-06-08 374910 8 15 0 1
2020-06-08 374910 9 16 0 1
2020-06-08 374910 10 16 0 1
2020-06-08 374910 11 17 0 1
2020-06-08 374910 12 18 2 3
2020-06-08 374910 13 20 0 3
2020-06-08 374910 14 22 0 3
2020-06-08 374910 15 24 0 3
2020-06-08 374910 16 25 0 3
2020-06-08 374910 17 28 2 5
2020-06-08 374910 18 30 1 6
2020-06-08 374910 19 31 0 6
2020-06-08 374910 20 33 0 6
2020-06-08 374910 21 41 3 9
2020-06-08 374910 22 44 3 12
2020-06-08 374910 23 45 4 16
2020-06-08 374910 24 48 2 18
2020-06-08 374910 25 48 2 20
2020-06-08 374910 26 50 0 20
2020-06-08 374910 27 51 0 20
2020-06-08 374910 28 51 0 20
2020-06-08 374910 29 53 0 20
2020-06-08 374910 30 55 0 20
2020-06-08 374910 31 58 0 20
For the row with bus_stop_order 21 (41 minutes into the bus trip), where 3 passengers enter the bus, I have to sum only the passengers that entered the bus between minute 11 and 41. Thus, the passenger that entered the bus in the 2nd bus stop (5 minutes into the trip) should be excluded.
That should be applied for every row.
The only thing I can think of is a correlated OUTER APPLY. SQL Server's RANGE window frames only support UNBOUNDED PRECEDING and CURRENT ROW, not time-interval offsets, so a windowed SUM alone cannot express "the last 30 minutes":
select
trip_date,
trip_id,
minutes_since_trip_start,
v.total_passengers
from
#t t1
outer apply (
select sum(passengers_in)
from #t t2
where
t1.trip_date = t2.trip_date
and t1.trip_id = t2.trip_id
and t2.bus_stop_order <= t1.bus_stop_order
and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
) v(total_passengers)
order by
trip_date,
trip_id,
minutes_since_trip_start
;
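The OUTER APPLY logic can be sanity-checked outside the database. Here is a pandas sketch of the same "passengers boarding in the last 30 minutes" sum (the toy rows condense the sample data, and the column names follow it):

```python
import pandas as pd

t = pd.DataFrame({
    "bus_stop_order":            [2, 12, 17, 18, 21],
    "minutes_since_trip_start":  [5, 18, 28, 30, 41],
    "passengers_in":             [1,  2,  2,  1,  3],
})

def window_sum(row, df, window=30):
    """Passengers boarding at or before this stop, within the last `window` minutes
    (mirrors the two correlated predicates of the OUTER APPLY)."""
    mask = (
        (df["bus_stop_order"] <= row["bus_stop_order"])
        & (df["minutes_since_trip_start"] >= row["minutes_since_trip_start"] - window)
    )
    return df.loc[mask, "passengers_in"].sum()

t["total_passengers"] = t.apply(window_sum, axis=1, df=t)
```

At stop 21 (minute 41) this sums only boardings from minute 11 onward, so the passenger from stop 2 (minute 5) drops out, exactly as required.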

Create a column that shows the day of the week based on a date column

I am attempting to return day of the week (i.e. Monday = 1, Tuesday = 2, etc) based on a date column ("Posting_date"). I tried a for loop but got it wrong:
#First date of table was a Sunday (31 March 2019)
posting_df3['Day'] = (posting_df3['Posting_date'] - dt.datetime(2019,3,31)).dt.days.astype('int16')
# Start counter on the right date (31 March 2019 is a Sunday)
count = 7
for x in posting_df3['Day']:
if count != 7:
count = 1
else:
count = count + 1
posting_df3['Day'] = count
Not sure if there are other ways of doing this. Attached is an image of my database structure:
level_0 Posting_date Reservation date Book_window ADR Day
0 9 2019-03-31 2019-04-01 -1 156.00 0
1 25 2019-04-01 2019-04-01 0 152.15 1
2 11 2019-04-01 2019-04-01 0 149.40 1
3 42 2019-04-01 2019-04-01 0 141.33 1
4 45 2019-04-01 2019-04-01 0 159.36 1
... ... ... ... ... ... ...
4278 739 2020-02-21 2019-04-17 310 253.44 327
4279 739 2020-02-22 2019-04-17 310 253.44 328
4280 31 2020-03-11 2019-04-01 345 260.00 346
Final output should be 2019-03-31 Day column should return 7 since it is a Sunday
and 2019-04-01 Day column should return 1 since its Monday etc
You can do it this way:
df['weekday'] = pd.to_datetime(df['Posting_date']).dt.weekday + 1
pandas' dt.weekday numbers Monday as 0 through Sunday as 6, so adding 1 gives Monday = 1 through Sunday = 7, matching your desired output.
Input
level_0 Posting_date Reservation_date Book_window ADR Day
0 9 3/31/2019 4/1/2019 -1 156.00 0
1 25 4/1/2019 4/1/2019 0 152.15 1
2 11 4/1/2019 4/1/2019 0 149.40 1
3 42 4/1/2019 4/1/2019 0 141.33 1
4 45 4/1/2019 4/1/2019 0 159.36 1
Output
level_0 Posting_date Reservation_date Book_window ADR Day weekday
0 9 3/31/2019 4/1/2019 -1 156.00 0 7
1 25 4/1/2019 4/1/2019 0 152.15 1 1
2 11 4/1/2019 4/1/2019 0 149.40 1 1
3 42 4/1/2019 4/1/2019 0 141.33 1 1
4 45 4/1/2019 4/1/2019 0 159.36 1 1
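As a quick, runnable check of the Monday = 0 convention (a minimal sketch using a few of the sample dates):

```python
import pandas as pd

df = pd.DataFrame({"Posting_date": ["3/31/2019", "4/1/2019", "4/2/2019"]})
# dt.weekday is Monday=0 .. Sunday=6; adding 1 shifts to Monday=1 .. Sunday=7
df["weekday"] = pd.to_datetime(df["Posting_date"]).dt.weekday + 1
# 2019-03-31 was a Sunday -> 7; 2019-04-01 a Monday -> 1; 2019-04-02 a Tuesday -> 2
```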

Index Rebuild and Reorganize

How can we identify when indexes need to be rebuilt or reorganized in SQL Server?
That is, what fragmentation percentage is the accepted threshold for rebuilding an index?
For example, given the status report below:
index_id avg_page_space_used_in_percent avg_fragmentation_in_percent index_level record_count page_count fragment_count avg_record_size_in_bytes
1 99.47111441 0 0 300000 2231 2 57.888
1 89.55707932 0 1 2231 4 2 11
1 0.617741537 0 2 4 1 1 11
4 99.72704472 0.113895216 0 300000 878 4 21.629
4 80.40214974 0 1 878 4 2 27.657
4 1.383741043 0 2 4 1 1 26.5
5 99.71136644 0 0 300000 1236 4 31.259
5 85.67899679 0 1 1236 7 2 37.286
5 3.261675315 0 2 7 1 1 36
Please let me know what the criteria are and when each action is required.
See this link; it explains how and when. (The commonly cited guideline from Microsoft's documentation: REORGANIZE when avg_fragmentation_in_percent is between 5% and 30%, REBUILD when it exceeds 30%, and do nothing below 5%.)

reading and printing a .csv file like a 2D matrix with both integer and float values in c

I am reading a file in C with a .csv extension. The file contains both integer and float values. Is there any way to read the CSV file? Any help is appreciated.
The data is as follows:
Application_No. Actual_Effort (in PM) No of Processes No of Tasks No of partnerLinks Task Variables Element Variables Event Variables Script Developer's Skills Developer's Confidence TPSS TS TCC
1 918.28 1 3 5 33 7 2 3 3.5 1 8 135 143
2 8891.513 3 9 3 100 15 6 12 3 1 36 1197 1233
3 22479.261 5 15 23 125 25 10 20 3 1 190 2700 2890
4 2961.131 2 4 9 70 13 4 17 2 0 72 416 488
5 19650.198 7 14 19 130 28 12 5 2.5 0 231 2450 2681
6 377.75 1 2 4 22 8 2 2 3 1 6 68 74
7 2671.93 1 5 12 55 12 6 4 2 0 17 385 402
8 966.15 3 3 6 31 8 5 7 2.5 0 27 153 180
9 3765.81 2 6 17 73 14 2 3 3.5 1 46 552 590
10 7467.11 4 8 21 87 19 13 1 2 0 116 960 1076
