Find first and last of a unique element in a column - arrays

I have a data table in the following format:
id Time1 Time2 V1 V2
1 1 10 30 40
1 2 20 31 41
1 3 30 32 42
1 4 40 33 43
2 1 10 40 50
2 2 20 41 51
2 3 30 42 52
2 4 40 43 53
3 1 10 50 60
3 2 20 51 61
3 3 30 52 62
3 4 40 53 63
I want to select the two smallest and two largest time readings of Time1 and Time2 for each unique id, and then run a regression and correlation analysis of V1 against V2 using only those first two and last two readings.
Thanks
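One way to get those rows is to rank them per id and keep the top and bottom two. Below is a minimal T-SQL sketch, assuming the data sits in a table named #readings (a hypothetical name) with the columns shown above, and that ranking by Time1 identifies the first and last readings; the correlation and regression slope come from the usual population moments (covariance over the product of standard deviations, and covariance over variance):

with ranked as
(
select id, Time1, Time2, V1, V2
,row_number() over (partition by id order by Time1 asc)  as rn_first
,row_number() over (partition by id order by Time1 desc) as rn_last
from #readings -- hypothetical table holding the sample data
)
select id
-- Pearson correlation of V1 and V2: cov(V1,V2) / (stdev(V1) * stdev(V2))
,(avg(1.0 * V1 * V2) - avg(1.0 * V1) * avg(1.0 * V2))
 / nullif(stdevp(V1) * stdevp(V2), 0) as corr_v1_v2
-- slope of the regression of V2 on V1: cov(V1,V2) / var(V1)
,(avg(1.0 * V1 * V2) - avg(1.0 * V1) * avg(1.0 * V2))
 / nullif(varp(V1), 0) as slope_v2_on_v1
from ranked
where rn_first <= 2 or rn_last <= 2 -- keep the first two and last two time readings
group by id;

If you need more than the slope and the correlation coefficient, it is probably easier to use the WHERE clause above just to extract the rows and run the actual regression in your statistics tool.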

Related

Update previous rows based on another row

I have a question: I need to update the empty rows based on the rows that have values.
In this case, I need to update the hours, mins and secs based on every 4th row.
For example: rownum 4 has 8 hours, 1 min, 9 sec.
So the update should fill the previous rows starting at 8 hrs, 1 min, 6 sec for rownum 1 (counting up one second per row); then, from rownum 5, the same procedure continues.
See rownum 8, which has 8 hours, 1 min, 13 sec.
The previous 3 rows should then run from 8 hrs, 1 min, 10 sec at rownum 5.
How can I do this with a loop, with PARTITION BY, or with any other approach in SQL Server?
You can do this with window functions and by converting your hours, minutes and seconds to a time value. Converting to a time is important to make sure you wrap around the appropriate time boundaries and don't end up with 61 seconds in a minute, etc.
Depending on the data and your real-world environment you will probably need to add the Flight column, and maybe some others, into the partition by clauses to ensure you are working with correctly scoped windows of data.
Query
declare @t table(rn int,timeframe int,h int,m int,s int);
insert into @t values
(1,1,null,null,null)
,(2,1,null,null,null)
,(3,1,null,null,null)
,(4,1,23,59,45)
,(5,2,null,null,null)
,(6,2,null,null,null)
,(7,2,null,null,null)
,(8,2,23,59,49)
,(9,3,null,null,null)
,(10,3,null,null,null)
,(11,3,null,null,null)
,(12,3,23,59,53)
,(13,4,null,null,null)
,(14,4,null,null,null)
,(15,4,null,null,null)
,(16,4,23,59,57)
,(17,5,null,null,null)
,(18,5,null,null,null)
,(19,5,null,null,null)
,(20,5,0,0,1)
,(21,6,null,null,null)
,(22,6,null,null,null)
,(23,6,null,null,null)
,(24,6,0,0,5)
;
with d as
(
select rn
,timeframe
-- anchor every row to the known time at the end of its timeframe,
-- stepping back one second per row; doing the arithmetic on a time
-- value handles the wrap across midnight
,dateadd(second
,rn - max(rn) over (partition by timeframe)
,max(timefromparts(h,m,s,0,0)) over (partition by timeframe)
) as t
from @t
)
select rn
,timeframe
,datepart(hour,t) as h
,datepart(minute,t) as m
,datepart(second,t) as s
from d
order by rn;
Output
rn timeframe h  m  s
1  1         23 59 42
2  1         23 59 43
3  1         23 59 44
4  1         23 59 45
5  2         23 59 46
6  2         23 59 47
7  2         23 59 48
8  2         23 59 49
9  3         23 59 50
10 3         23 59 51
11 3         23 59 52
12 3         23 59 53
13 4         23 59 54
14 4         23 59 55
15 4         23 59 56
16 4         23 59 57
17 5         23 59 58
18 5         23 59 59
19 5         0  0  0
20 5         0  0  1
21 6         0  0  2
22 6         0  0  3
23 6         0  0  4
24 6         0  0  5
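To see the wrap-around behaviour in isolation, here is a one-line check (plain dateadd/timefromparts, nothing specific to the table above):

-- subtracting 3 seconds from 00:00:01 on a time value wraps across midnight
select dateadd(second, -3, timefromparts(0, 0, 1, 0, 0)) as wrapped; -- 23:59:58

If you did the same arithmetic on raw integer columns you would have to handle the second, minute and hour boundaries yourself.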

Aggregate function with window function filtered by time

I have a table with data about buses while making their routes. There are columns for:
bus trip id (different each time a bus starts the route from the first stop)
bus stop id
datetime column that indicates the moment that the bus leaves each bus stop
integer that indicates how many passengers entered the bus at that stop
There is no information about how many passengers get off the bus at each stop, so I have to make an estimate by supposing that once they get on the bus, they stay on it for 30 minutes. The trip lasts about 70 minutes from the first to the last stop.
I am trying to aggregate results at each stop using
SUM(iPassengersIn) OVER (
PARTITION BY tripDate, tripId
ORDER BY busStopOrder
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) total_passengers
The problem is that I can sum passengers from the beginning of the trip, but not from "30 minutes ago" at each stop. How could I limit the aggregation to "the last 30 minutes" on each row in order to estimate the occupancy between stops?
This is a subset of my data:
trip_date trip_id bus_stop_order minutes_since_trip_start passengers_in trip_total_passengers
2020-06-08 374910 0 0 0 0
2020-06-08 374910 1 3 0 0
2020-06-08 374910 2 5 1 1
2020-06-08 374910 3 8 0 1
2020-06-08 374910 4 9 0 1
2020-06-08 374910 5 12 0 1
2020-06-08 374910 6 13 0 1
2020-06-08 374910 7 13 0 1
2020-06-08 374910 8 15 0 1
2020-06-08 374910 9 16 0 1
2020-06-08 374910 10 16 0 1
2020-06-08 374910 11 17 0 1
2020-06-08 374910 12 18 2 3
2020-06-08 374910 13 20 0 3
2020-06-08 374910 14 22 0 3
2020-06-08 374910 15 24 0 3
2020-06-08 374910 16 25 0 3
2020-06-08 374910 17 28 2 5
2020-06-08 374910 18 30 1 6
2020-06-08 374910 19 31 0 6
2020-06-08 374910 20 33 0 6
2020-06-08 374910 21 41 3 9
2020-06-08 374910 22 44 3 12
2020-06-08 374910 23 45 4 16
2020-06-08 374910 24 48 2 18
2020-06-08 374910 25 48 2 20
2020-06-08 374910 26 50 0 20
2020-06-08 374910 27 51 0 20
2020-06-08 374910 28 51 0 20
2020-06-08 374910 29 53 0 20
2020-06-08 374910 30 55 0 20
2020-06-08 374910 31 58 0 20
For the row with bus_stop_order 21 (41 minutes into the trip), where 3 passengers enter the bus, I have to sum only the passengers that entered between minute 11 and minute 41: the 2 at stop 12, the 2 at stop 17, the 1 at stop 18 and the 3 at stop 21, giving 8. The passenger that entered at the 2nd bus stop (5 minutes into the trip) is excluded.
The same rule should apply to every row.
The only thing I can think of is:
select
trip_date,
trip_id,
minutes_since_trip_start,
v.total_passengers
from
#t t1
outer apply (
-- all stops on the same trip, up to and including this one,
-- that were reached within the last 30 minutes
select sum(passengers_in)
from #t t2
where
t1.trip_date = t2.trip_date
and t1.trip_id = t2.trip_id
and t2.bus_stop_order <= t1.bus_stop_order
and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
) v(total_passengers)
order by
trip_date,
trip_id,
minutes_since_trip_start
;
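That APPLY pattern is a reasonable way to express it: SQL Server only supports RANGE window frames with UNBOUNDED PRECEDING / CURRENT ROW, not value-based offsets such as "30 minutes preceding", so a correlated aggregate is the usual workaround. For reference, an equivalent sketch written as a correlated subquery (same #t table and column names as in your query):

select t1.trip_date
,t1.trip_id
,t1.bus_stop_order
,(select sum(t2.passengers_in)
  from #t t2
  where t2.trip_date = t1.trip_date
  and t2.trip_id = t1.trip_id
  and t2.bus_stop_order <= t1.bus_stop_order
  and t2.minutes_since_trip_start >= t1.minutes_since_trip_start - 30
 ) as est_passengers_on_board -- boardings in the last 30 minutes
from #t t1
order by t1.trip_date, t1.trip_id, t1.bus_stop_order;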

Data field shifting through a vector of data in MATLAB

I need a fixed-length window ("data field") that moves through a data vector, shifting by the window length each time (i.e. non-overlapping blocks). For each window I need the mean value over the A vector together with the mean value over the corresponding window of the B vector.
Example:
A=[1 5 7 8 9 10 11 13 15 18 19 25 28 30 35 40 45 48 50 51];
B=[2 4 8 9 12 15 16 18 19 20 25 27 30 35 39 40 45 48 50 55];
I want to do the following:
A=[{1 5 7 8 9} 10 11 13 15 18 19 25 28 30 35 40 45 48 50 51];
B=[{2 4 8 9 12} 15 16 18 19 20 25 27 30 35 39 40 45 48 50 55];
I want to take a window of 5 data points and get its mean value, and then shift the whole window by the window length.
A=[1 5 7 8 9 {10 11 13 15 18} 19 25 28 30 35 40 45 48 50 51];
B=[2 4 8 9 12 {15 16 18 19 20} 25 27 30 35 39 40 45 48 50 55];
I need two vectors, C and D, containing the mean values produced this way:
C=[6 13.4 27.4 46.8];
D=[7 17.6 31.2 47.6];
I started something with
n = length(A);
for k = 1:n
....
but nothing I tried worked.
Reshape the vector into a 5-row matrix and then compute the mean of each column. Note that reshape requires the vector length to be a multiple of the window length (here 20 = 5*4); otherwise you would need to pad or truncate first:
C = mean(reshape(A,5,[]),1); % each column of the 5-by-4 matrix is one window
D = mean(reshape(B,5,[]),1);

Sum of multiple variables by group

I have a dataset with over 900 observations; each observation represents the population of a sub-geographical area for a given year, by gender (male, female, all) and 20 different age groups.
I have dropped the variable for the sub-geographical area and I want to collapse into the greater geographical area (called Geo).
I am having a difficult time doing a SUM or PROC MEANS because I have so many age groups to sum up, and I am trying to avoid writing them all out. I want to collapse across the grouping variables year, geo and sex so that I only have 3 observations per Geo (my raw data can have as many as 54 observations).
This is an example of what a tiny section of the raw data looks like:
Year Geo Sex Age0005 Age0610 Age1115 (etc)
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
2010 2 1 71 66 68
2010 2 2 63 64 70
2010 2 3 134 130 138
2010 2 1 32 35 34
2010 2 2 29 31 36
2010 2 3 61 66 70
This is how I want it to look:
Year Group Sex Age0005 Age0610 Age1115 (etc)
2010 1 1 133 111 118
2010 1 2 109 122 108
2010 1 3 252 233 226
2010 2 1 103 101 102
2010 2 2 92 95 106
2010 2 3 195 196 208
Any ideas? Please help!
You don't have to write out each variable name individually - there are ways of getting around that. E.g. if all of the age group variables that need to be summed up start with age then you can use a : wildcard to match them:
proc summary nway data = have;
var age:;
class year geo sex;
output out = want sum=;
run;
If your variables don't have a common prefix, but are all next to each other in one big horizontal group in your dataset, you can use a double dash list instead:
proc summary nway data = have;
var age0005--age1115; /*Includes all variables between these two*/
class year geo sex;
output out = want sum=;
run;
Note also the use of sum= - this means that each summarised variable is reproduced with its original name in the output dataset.
I personally like to use proc sql for this, since it makes it very clear what you're summing and grouping by.
data old ;
input Year Geo Sex Age0005 Age0610 Age1115 ;
datalines;
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
2010 2 1 71 66 68
2010 2 2 63 64 70
2010 2 3 134 130 138
2010 2 1 32 35 34
2010 2 2 29 31 36
2010 2 3 61 66 70
;
run;
proc sql ;
create table new as select
year
, geo label = 'Group'
, sex
, sum(age0005) as age0005
, sum(age0610) as age0610
, sum(age1115) as age1115
from old
group by geo, year, sex ;
quit;

How do I sum up my data in 4 rows?

Select
AvHours.LineNumber,
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) AS Throughput,
SUM (ProdDtime.DownTimeHRS) AS [Lost Time],
(SUM(cast(ProdDtime.DownTimeHRS AS decimal(10,1))) * 100) / (cast(AvHours.AvailableHRS AS decimal(10,1))) AS [%DownTime],
SUM(SProd.PoundsMade) AS [Pounds Made],
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) * SUM (ProdDtime.DownTimeHRS) AS [Pounds Lost]
FROM rpt_Line_Shift_AvailableHrs AvHours
inner join rpt_Line_Shift_Prod SProd on
AvHours.LineNumber=SProd.LineNumber AND AvHours.Shiftnumber=SProd.Shiftnumber
inner join rpt_Line_Shift_ProdDownTime ProdDtime on
(AvHours.LineNumber=ProdDtime.LineNumber AND AvHours.Shiftnumber=ProdDtime.Shiftnumber)
GROUP BY AvHours.LineNumber,SProd.PoundsMade,AvHours.AvailableHRS
ORDER BY AvHours.LineNumber
The query above gives the following result set:
Line#, Throughput, Lost Time, %DownTime, Pounds Made, Pounds Lost
1 53 49 27.222222 97538 2597
1 44 39 20.312500 116229 1716
1 47 40 22.222222 92190 1880
1 55 31 16.145833 133215 1705
1 111 49 27.222222 204442 5439
1 13 31 16.145833 33540 403
1 86 49 27.222222 159432 4214
1 81 31 16.145833 197145 2511
1 74 40 22.222222 146202 2960
1 63 49 27.222222 115920 3087
1 76 39 20.312500 199172 2964
2 64 40 22.222222 126028 2560
2 149 49 27.222222 273966 7301
2 35 39 20.312500 92616 1365
3 49 39 20.312500 129591 1911
3 65 40 22.222222 129248 2600
3 84 39 20.312500 219997 3276
4 95 31 16.145833 229485 2945
4 76 40 22.222222 149996 3040
4 94 31 16.145833 228375 2914
4 99 39 20.312500 259794 3861
What I actually want is just 4 rows (Line# = 1, 2, 3 or 4), with all the other fields summed.
I'm not sure how to do it. Can anybody help?
Get rid of PoundsMade and AvailableHrs in your GROUP BY. It sounds like you only want to group by the LineNumber.
You can use your SQL as a nested (derived) table and then group by the line number, like the SQL below.
Select LineNumber, Sum(Throughput) AS Throughput, Sum([Lost Time]) AS [Lost Time], Sum([%DownTime]) AS [%DownTime], Sum([Pounds Made]) AS [Pounds Made], Sum([Pounds Lost]) AS [Pounds Lost]
From
(Select
AvHours.LineNumber,
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) AS Throughput,
SUM (ProdDtime.DownTimeHRS) AS [Lost Time],
(SUM(cast(ProdDtime.DownTimeHRS AS decimal(10,1))) * 100) / (cast(AvHours.AvailableHRS AS decimal(10,1))) AS [%DownTime],
SUM(SProd.PoundsMade) AS [Pounds Made],
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) * SUM (ProdDtime.DownTimeHRS) AS [Pounds Lost]
FROM rpt_Line_Shift_AvailableHrs AvHours
inner join rpt_Line_Shift_Prod SProd on
AvHours.LineNumber=SProd.LineNumber AND AvHours.Shiftnumber=SProd.Shiftnumber
inner join rpt_Line_Shift_ProdDownTime ProdDtime on
(AvHours.LineNumber=ProdDtime.LineNumber AND AvHours.Shiftnumber=ProdDtime.Shiftnumber)
GROUP BY AvHours.LineNumber,SProd.PoundsMade,AvHours.AvailableHRS
) A
Group BY LineNumber
ORDER BY LineNumber
I don't have a SQL Server instance right now to test this out, but let me know if you encounter any issues.
Please mark this as the answer if it helped resolve your issue.
