Aggregate function with window function filtered by time - sql-server

I have a table with data about buses while making their routes. There are columns for:
bus trip id (different each time a bus starts the route from the first stop)
bus stop id
datetime column that indicates the moment that the bus leaves each bus stop
integer that indicates how many passengers entered the bus in that stop
There is no information about how many passengers get off the bus on each stop, so I have to make an estimation supposing that once they get on the bus, they stay on it for 30 minutes. The trip lasts about 70 minutes from the first to the last stop.
I am trying to aggregate results on each stop using
SUM(iPassengersIn) OVER (
PARTITION BY tripDate, tripId
ORDER BY busStopOrder
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) total_passengers
The problem is that I can add passengers since the beginning of the trip, but not since "30 minutes ago" on each stop. How could I limit the aggregation to "the last 30 minutes" on each row in order to estimate the occupation between stops?
This is a subset of my data:
trip_date trip_id bus_stop_order minutes_since_trip_start passengers_in trip_total_passengers
2020-06-08 374910 0 0 0 0
2020-06-08 374910 1 3 0 0
2020-06-08 374910 2 5 1 1
2020-06-08 374910 3 8 0 1
2020-06-08 374910 4 9 0 1
2020-06-08 374910 5 12 0 1
2020-06-08 374910 6 13 0 1
2020-06-08 374910 7 13 0 1
2020-06-08 374910 8 15 0 1
2020-06-08 374910 9 16 0 1
2020-06-08 374910 10 16 0 1
2020-06-08 374910 11 17 0 1
2020-06-08 374910 12 18 2 3
2020-06-08 374910 13 20 0 3
2020-06-08 374910 14 22 0 3
2020-06-08 374910 15 24 0 3
2020-06-08 374910 16 25 0 3
2020-06-08 374910 17 28 2 5
2020-06-08 374910 18 30 1 6
2020-06-08 374910 19 31 0 6
2020-06-08 374910 20 33 0 6
2020-06-08 374910 21 41 3 9
2020-06-08 374910 22 44 3 12
2020-06-08 374910 23 45 4 16
2020-06-08 374910 24 48 2 18
2020-06-08 374910 25 48 2 20
2020-06-08 374910 26 50 0 20
2020-06-08 374910 27 51 0 20
2020-06-08 374910 28 51 0 20
2020-06-08 374910 29 53 0 20
2020-06-08 374910 30 55 0 20
2020-06-08 374910 31 58 0 20
For the row with bus_stop_order 21 (41 minutes into the bus trip), where 3 passengers enter the bus, I have to sum only the passengers that entered the bus between minute 11 and 41. Thus, the passenger that entered the bus in the 2nd bus stop (5 minutes into the trip) should be excluded.
That should be applied for every row.

The only thing I can think of is:
select
trip_date,
trip_id,
minutes_since_trip_start,
v.total_passengers
from
#t t1
outer apply (
select sum(passengers_in)
from #t t2
where
t1.trip_date = t2.trip_date
and t1.trip_id = t2.trip_id
and t2.bus_stop_order <= t1.bus_stop_order
and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
) v(total_passengers)
order by
trip_date,
trip_id,
minutes_since_trip_start
;

Related

Update previous rows based on another row

I have a question and i need to update the empty rows based on the rows with value
In this case, i need to update the hours, mins and secs based on every 4th row
For ex: rownum 4 has 8hours, 1mins, 9 sec.
So, my update to previous row should be 8hrs, 1min, 6 sec from rownum 1, then, for rownum 5 it should continue the same procedure
See rownum 8 has 8hours, 1mins, 13 sec.
The previous 3 rows should be 8hrs, 1min, 10 sec from rownum 5
How to have this in a loop or with partition by or any suggestion in SQL server.
You can do this with window functions and converting your hours, minutes and seconds to a time value. Converting to a tim is important to make sure you wrap around the appropriate time boundaries and don't end up with 61 seconds in a minute etc.
Depending on the data and your real world environment you will probably need to add the Flight and maybe some other columns into the partition bys to ensure you are working correctly scoped windows of data.
Query
declare #t table(rn int,timeframe int,h int,m int,s int);
insert into #t values
(1,1,null,null,null)
,(2,1,null,null,null)
,(3,1,null,null,null)
,(4,1,23,59,45)
,(5,2,null,null,null)
,(6,2,null,null,null)
,(7,2,null,null,null)
,(8,2,23,59,49)
,(9,3,null,null,null)
,(10,3,null,null,null)
,(11,3,null,null,null)
,(12,3,23,59,53)
,(13,4,null,null,null)
,(14,4,null,null,null)
,(15,4,null,null,null)
,(16,4,23,59,57)
,(17,5,null,null,null)
,(18,5,null,null,null)
,(19,5,null,null,null)
,(20,5,0,0,1)
,(21,6,null,null,null)
,(22,6,null,null,null)
,(23,6,null,null,null)
,(24,6,0,0,5)
;
with d as
(
select rn
,timeframe
,dateadd(second
,rn - max(rn) over (partition by timeframe)
,max(timefromparts(h,m,s,0,0)) over (partition by timeframe)
) as t
from #t
)
select rn
,timeframe
,datepart(hour,t) as h
,datepart(minute,t) as m
,datepart(second,t) as s
from d
order by rn;
Output
rn
timeframe
h
m
s
1
1
23
59
42
2
1
23
59
43
3
1
23
59
44
4
1
23
59
45
5
2
23
59
46
6
2
23
59
47
7
2
23
59
48
8
2
23
59
49
9
3
23
59
50
10
3
23
59
51
11
3
23
59
52
12
3
23
59
53
13
4
23
59
54
14
4
23
59
55
15
4
23
59
56
16
4
23
59
57
17
5
23
59
58
18
5
23
59
59
19
5
0
0
0
20
5
0
0
1
21
6
0
0
2
22
6
0
0
3
23
6
0
0
4
24
6
0
0
5

Find first and last of a unique element in a column

In a data table such as with the following format:
id Time 1 Time2 V1 V2
1 1 10 30 40
1 2 20 31 41
1 3 30 32 42
1 4 40 33 43
2 1 10 40 50
2 2 20 41 51
2 3 30 42 52
2 4 40 43 53
3 1 10 50 60
3 2 20 51 61
3 3 30 52 62
3 4 40 53 63
I want to select the two smallest and two largest variable readings of time 1 and time 2
I want to do a regression and correlation analysis of v1 and v2 using the first two and last two time readings for each unique ID
Thanks

reading and printing a .csv file like a 2D matrix with both integer and float values in c

Reading a file in c with .csv as extension. The file consisting of both integer and float type data values. Is there any way to read the csv file. Any help is appreciated.
The data is as follows:
Application_No. Actual_Effort (in PM) No of Processes No of Tasks No of partnerLinks Task Variables Element Variables Event Variables Script Developer's Skills Developer's Confidence TPSS TS TCC
1 918.28 1 3 5 33 7 2 3 3.5 1 8 135 143
2 8891.513 3 9 3 100 15 6 12 3 1 36 1197 1233
3 22479.261 5 15 23 125 25 10 20 3 1 190 2700 2890
4 2961.131 2 4 9 70 13 4 17 2 0 72 416 488
5 19650.198 7 14 19 130 28 12 5 2.5 0 231 2450 2681
6 377.75 1 2 4 22 8 2 2 3 1 6 68 74
7 2671.93 1 5 12 55 12 6 4 2 0 17 385 402
8 966.15 3 3 6 31 8 5 7 2.5 0 27 153 180
9 3765.81 2 6 17 73 14 2 3 3.5 1 46 552 590
10 7467.11 4 8 21 87 19 13 1 2 0 116 960 1076

From matrix to array [J]

I'm working on J.
How can I convert this matrix:
(i.10)*/(i.10)
0 0 0 0 0 0 0 0 0 0
0 1 2 3 4 5 6 7 8 9
0 2 4 6 8 10 12 14 16 18
0 3 6 9 12 15 18 21 24 27
0 4 8 12 16 20 24 28 32 36
0 5 10 15 20 25 30 35 40 45
0 6 12 18 24 30 36 42 48 54
0 7 14 21 28 35 42 49 56 63
0 8 16 24 32 40 48 56 64 72
0 9 18 27 36 45 54 63 72 81
in array?
0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 . . .
I tried
(i.10)*/(i.10)"0
and then I've added
~.(i.10)*/(i.10)"0
to eliminate doubles, but it doesn't work.
If you want to turn a 2-dimensional table (matrix) into a 1-dimensional list (vector or "array", though in the J world "array" usually means "rectangle with any number [N] of dimensions"), you can use ravel (,):
matrix =: (i.10)*/(i.10)
list =: , matrix
list
0 0 0 0 0 0 0 0 0 0 0 1 2 3 4 5 6 ...
Now using nub (~.) to remove duplicates should work:
~. list
0 1 2 3 4 5 6 7 8 9 10 12 ...
Note that, in J, the shape of an array usually carries important information, so flattening a matrix like this would be fairly unusual. Still, nothing stopping you.
BTW, you can save yourself some keystrokes by using the adverb ~, which will copy the left argument of a dyad to the right side as well, so you could just say:
matrix =: */~ i. 10
and get the same result as (i.10) */ (i.10).

SAS, assigning the same numbers to specific observations

I want to assign the same id number to every four observations. For example, if I have the following data
age marital gender id
45 1 0 1
33 1 1 1
68 0 1 1
27 1 0 1
43 0 0 2
37 0 1 2
19 1 1 2
40 1 1 2
25 1 0 3
38 1 1 3
57 0 0 3
50 1 0 3
51 1 1 4
44 0 1 4
69 1 0 4
39 0 1 4
The last column id is something I want to produce.
Plus, the dataset have 500,000+ observations.
Thanks in advance.
Slightly more compact:
id = ceil(_n_/4);
Use the integer function and the built-in _n_ variable (which increments for each observation):
id = int( (_n_-4)/4 )+1;

Resources