Condensing similar rows occurring in groups and keeping order - sql-server

I have a SQL Server table containing the GPS coordinates of a device, updated every n minutes (the device is installed in a vehicle). Given the nature of GPS, many of the entries are very similar but, as far as the server is concerned, entirely different. I can approximately match things (within ~3.6' or maybe 36') easily enough with CAST(lat as decimal(7,4)).
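For example (a minimal illustration with made-up literal values, not data from the real table), two readings that differ only in the fifth decimal place collapse to the same value once cast:
SELECT CAST(31.12345 AS decimal(7,4)) AS a,  -- 31.1235
       CAST(31.12346 AS decimal(7,4)) AS b;  -- 31.1235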
I'd like to be able to take a result set and condense the approximately duplicate entries while still maintaining the time-based order. Here's an example:
Row Lat Lng vel Hdg Time
01 31.12345 -88.12345 00 00 12-4-21 01:45:00
02 31.12346 -88.12345 00 00 12-4-21 01:46:00
03 31.12455 -88.12410 10 01 12-4-21 01:47:00
04 31.12495 -88.12480 17 01 12-4-21 01:48:00
05 31.12532 -88.12560 22 01 12-4-21 01:49:00
06 31.12567 -88.12608 25 02 12-4-21 01:50:00
07 31.12638 -88.12672 24 02 12-4-21 01:51:00
08 31.12689 -88.12722 19 02 12-4-21 01:52:00
09 31.12345 -88.12345 00 00 12-4-21 01:53:00
10 31.12346 -88.12346 00 00 12-4-21 01:54:00
11 31.12347 -88.12345 00 00 12-4-21 01:55:00
12 31.12346 -88.12346 00 00 12-4-21 01:56:00
13 31.12689 -88.12788 10 40 12-4-21 01:57:00
14 31.12604 -88.12691 13 39 12-4-21 01:58:00
15 31.12572 -88.12603 15 39 12-4-21 01:59:00
My desired end result would be for rows 1 and 2 to be condensed into a single row, and rows 9 through 12 to be condensed into a single row, each containing AVG(Lat), AVG(Lng), and MIN(Time).
This is the result set I would like to receive, given the above data:
Row Lat Lng vel Hdg Time
01 31.123455 -88.12345 00 00 12-4-21 01:45:00
02 31.12455 -88.12410 10 01 12-4-21 01:47:00
03 31.12495 -88.12480 17 01 12-4-21 01:48:00
04 31.12532 -88.12560 22 01 12-4-21 01:49:00
05 31.12567 -88.12608 25 02 12-4-21 01:50:00
06 31.12638 -88.12672 24 02 12-4-21 01:51:00
07 31.12689 -88.12722 19 02 12-4-21 01:52:00
08 31.12346 -88.123455 00 00 12-4-21 01:53:00
09 31.12689 -88.12788 10 40 12-4-21 01:57:00
10 31.12604 -88.12691 13 39 12-4-21 01:58:00
11 31.12572 -88.12603 15 39 12-4-21 01:59:00
The boundaries between groupings would be movement: velocity being > 0, or the GPS coordinate changing by more than some amount x; in this case, x is .0001. The problem, as described below, is that multiple stops (AT DIFFERENT TIMES) at a given coordinate are lumped into a single stop. If I visit coordinate x today at 4 pm, tomorrow at 8 am, and then again at 6 pm, the only one I see is the tomorrow-at-6-pm visit (in the case of MAX(Time)) or the today-at-4-pm visit (in the case of MIN(Time)).
It's a given that if velocity is 0, heading is also 0. It is, however, important that rows 1 and 2 and rows 9 through 12 NOT be grouped together, even though their coordinates are similar enough to count as the same place (i.e. when rounded to 4 decimal places).
I have a query that does the condensing, but it lumps those separate stops together:
SELECT Geography::Point(AVG(dbo.GPSEntries.Latitude),
                        AVG(dbo.GPSEntries.Longitude),
                        4326) as Location,
       dbo.GPSEntries.Velocity,
       dbo.GPSEntries.Heading,
       MAX(dbo.GPSEntries.Time) as maxTime,
       MIN(dbo.GPSEntries.Time) as minTime,
       AVG(dbo.RFDatas.RSSI) as avgRSSI,
       COUNT(1) as samples
FROM dbo.GPSEntries
INNER JOIN dbo.Reports
        ON dbo.GPSEntries.Report_Id = dbo.Reports.Id
INNER JOIN dbo.RFDatas
        ON dbo.GPSEntries.Report_Id = dbo.RFDatas.Report_Id
GROUP BY CAST(Latitude as Decimal(7,4)),
         CAST(Longitude as Decimal(7,4)),
         Velocity,
         Heading
ORDER BY MAX(Time)
In other words, if I travel from point A to point B, stay for 30 minutes (and 30 reports at 1 per minute), then travel to point C, stay for 20 minutes, then travel back to point B and stay for 20 more minutes before heading to point D, I would like to be able to see both separate stops at point B.
Here's some actual data from my db, sanitized to protect the innocent, or to blame someone in northeast Alabama.
Latitude Longitude Vel Hdg MAX(Time) MIN(Time) sig RowCount
34.747420 -86.302580 68 157 2012-06-13 01:31:37.000 2012-06-13 01:31:37.000 -91 1
34.759140 -86.307620 61 134 2012-06-13 01:33:06.000 2012-06-13 01:33:06.000 -91 2
34.763237 -86.307264 0 0 2012-06-13 01:34:36.000 2012-06-12 01:27:21.000 -97 7
34.763288 -86.307280 0 0 2012-06-13 14:30:44.000 2012-06-12 01:30:21.000 -98 527
34.760220 -86.308200 38 110 2012-06-13 14:33:44.000 2012-06-13 14:33:44.000 -98 1
34.750350 -86.305750 5 90 2012-06-13 14:35:13.000 2012-06-13 14:35:13.000 -83 2
34.737160 -86.298040 70 88 2012-06-13 14:36:43.000 2012-06-13 14:36:43.000 -80 1
34.736420 -86.277270 120 33 2012-06-13 14:38:13.000 2012-06-13 14:38:13.000 -87 2
34.747090 -86.248370 120 37 2012-06-13 14:39:43.000 2012-06-13 14:39:43.000 -93 2
34.755620 -86.240640 70 179 2012-06-13 14:41:13.000 2012-06-13 14:41:13.000 -81 1
34.771240 -86.242760 70 0 2012-06-13 14:42:42.000 2012-06-13 14:42:42.000 -88 2
34.785510 -86.245710 70 6 2012-06-13 14:44:12.000 2012-06-13 14:44:12.000 -99 2
34.800220 -86.239400 70 1 2012-06-13 14:45:42.000 2012-06-13 14:45:42.000 -86 1
34.815070 -86.232180 70 16 2012-06-13 14:47:12.000 2012-06-13 14:47:12.000 -98 2
34.824540 -86.226198 0 0 2012-06-13 14:51:41.000 2012-06-13 00:13:48.000 -101 9
34.824579 -86.226171 0 0 2012-06-14 00:26:19.000 2012-06-12 00:46:57.000 -99 168
You'll note the 4th and last rows have 527 and 168 entries, respectively, and that they span two days. Those entries are from one device only, and come from the device being stopped for several hours in the same place on multiple occasions.
Here's some zipped CSV data: sample
What I Finally Done Did
I made some minor modifications to Aaron Bertrand's supplied query, shown below:
WITH d AS
(
    SELECT Time
          ,Latitude
          ,Longitude
          ,Velocity
          ,Heading
          ,TimeRN = ROW_NUMBER() OVER (ORDER BY [Time])
    FROM dbo.GPSEntries
    GROUP BY Time, Latitude, Longitude, Velocity, Heading
),
y AS
(
    SELECT BeginTime = MIN(Time)
          ,EndTime   = MAX(Time)
          ,Latitude  = AVG(Latitude)
          ,Longitude = AVG(Longitude)
       -- ,[RowCount] = COUNT(*)
          ,GroupNumber
    FROM
    (
        SELECT Time
              ,Latitude
              ,Longitude
              ,GroupNumber =
               (
                   SELECT MIN(d2.TimeRN)
                   FROM d AS d2
                   WHERE d2.TimeRN >= d.TimeRN
                     AND NOT EXISTS
                         (
                             SELECT 1
                             FROM d AS d3 -- Between 250 and 337 feet
                             WHERE ABS(d2.Latitude - d.Latitude) <= .0007
                               AND ABS(d2.Longitude - d.Longitude) <= .0007
                               AND d2.Velocity = d.Velocity
                         )
               )
        FROM d
    ) AS x
    GROUP BY GroupNumber
)
SELECT y.Latitude
      ,y.Longitude
      ,d.Velocity
      ,d.Heading
      ,y.BeginTime
   -- ,y.EndTime
   -- ,y.[RowCount]
   -- ,Duration = CONVERT(time(0), DATEADD(SS, DATEDIFF(SS, y.BeginTime, y.EndTime), '0:00:00'), 108)
FROM y
INNER JOIN d ON y.BeginTime = d.[Time]
-- FOR STOPS (5 minute):
-- WHERE DATEDIFF(MI, y.BeginTime, y.EndTime) + 1 > 5
ORDER BY y.BeginTime;

Here is some sample data in tempdb:
USE tempdb;
GO
CREATE TABLE dbo.GPSEntries
(
Latitude DECIMAL(8,5),
Longitude DECIMAL(8,5),
Velocity TINYINT,
Heading TINYINT,
[Time] SMALLDATETIME
);
INSERT dbo.GPSEntries VALUES
(31.12345,-88.12345,00,00,'2012-04-21 01:45:00'),
(31.12346,-88.12345,00,00,'2012-04-21 01:46:00'),
(31.12455,-88.12410,10,01,'2012-04-21 01:47:00'),
(31.12495,-88.12480,17,01,'2012-04-21 01:48:00'),
(31.12532,-88.12560,22,01,'2012-04-21 01:49:00'),
(31.12567,-88.12608,25,02,'2012-04-21 01:50:00'),
(31.12638,-88.12672,24,02,'2012-04-21 01:51:00'),
(31.12689,-88.12722,19,02,'2012-04-21 01:52:00'),
(31.12345,-88.12345,00,00,'2012-04-21 01:53:00'),
(31.12346,-88.12346,00,00,'2012-04-21 01:54:00'),
(31.12347,-88.12345,00,00,'2012-04-21 01:55:00'),
(31.12346,-88.12346,00,00,'2012-04-21 01:56:00'),
(31.12689,-88.12788,10,40,'2012-04-21 01:57:00'),
(31.12604,-88.12691,13,39,'2012-04-21 01:58:00'),
(31.12572,-88.12603,15,39,'2012-04-21 01:59:00');
And my attempt at satisfying the query:
;WITH d AS
(
    SELECT Time, Latitude, Longitude, Velocity, Heading,
           NormLat  = CONVERT(DECIMAL(7,4), Latitude),
           NormLong = CONVERT(DECIMAL(7,4), Longitude),
           TimeRN   = ROW_NUMBER() OVER (ORDER BY [Time])
    FROM dbo.GPSEntries
    -- you probably want filters:
    -- WHERE DeviceID = @SomeDeviceID
    --   AND [Time] >= @SomeStartDate
    --   AND [Time] < DATEADD(DAY, 1, @SomeEndDate)
    -- also, your sample CSV file had lots of duplicates, so:
    GROUP BY Time, Latitude, Longitude, Velocity, Heading
),
y AS
(
    SELECT MinTime = MIN(Time), MaxTime = MAX(Time),
           Latitude = AVG(Latitude), Longitude = AVG(Longitude),
           [RowCount] = COUNT(*)
    FROM
    (
        SELECT Time, Latitude, Longitude,
               GroupNumber =
               (
                   SELECT MIN(d2.TimeRN)
                   FROM d AS d2
                   WHERE d2.TimeRN >= d.TimeRN
                     AND NOT EXISTS
                         (
                             SELECT 1
                             FROM d AS d3
                             WHERE d2.NormLat = d.NormLat
                               AND d2.NormLong = d.NormLong
                         )
               )
        FROM d
    ) AS x
    GROUP BY GroupNumber
)
SELECT [Row] = ROW_NUMBER() OVER (ORDER BY y.MinTime),
       y.Latitude, y.Longitude, d.Velocity, d.Heading,
       y.MinTime, y.MaxTime, y.[RowCount]
FROM y
INNER JOIN d ON y.MinTime = d.[Time]
ORDER BY y.MinTime;
Results:
Row Latitude Longitude Velocity Heading MinTime MaxTime RowCount
---|---------|----------|--------|-------|----------------|----------------|--------
1 31.123455 -88.123450 0 0 2012-04-21 01:45 2012-04-21 01:46 2
2 31.124550 -88.124100 10 1 2012-04-21 01:47 2012-04-21 01:47 1
3 31.124950 -88.124800 17 1 2012-04-21 01:48 2012-04-21 01:48 1
4 31.125320 -88.125600 22 1 2012-04-21 01:49 2012-04-21 01:49 1
5 31.125670 -88.126080 25 2 2012-04-21 01:50 2012-04-21 01:50 1
6 31.126380 -88.126720 24 2 2012-04-21 01:51 2012-04-21 01:51 1
7 31.126890 -88.127220 19 2 2012-04-21 01:52 2012-04-21 01:52 1
8 31.123460 -88.123455 0 0 2012-04-21 01:53 2012-04-21 01:56 4
9 31.126890 -88.127880 10 40 2012-04-21 01:57 2012-04-21 01:57 1
10 31.126040 -88.126910 13 39 2012-04-21 01:58 2012-04-21 01:58 1
11 31.125720 -88.126030 15 39 2012-04-21 01:59 2012-04-21 01:59 1
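As an aside, on SQL Server 2012 and later the same movement boundaries described above (velocity > 0, or the coordinate shifting by more than .0001 from the previous reading) could also be detected with LAG and a running SUM instead of the correlated subqueries. This is only a rough, untested sketch against the dbo.GPSEntries sample table above, not the query that was actually used:
;WITH flagged AS
(
    SELECT [Time], Latitude, Longitude, Velocity, Heading,
           -- a new group starts whenever the device is moving, or the position has
           -- shifted more than .0001 degrees since the previous reading
           IsNewGroup = CASE WHEN Velocity > 0
                               OR ABS(Latitude  - LAG(Latitude)  OVER (ORDER BY [Time])) > 0.0001
                               OR ABS(Longitude - LAG(Longitude) OVER (ORDER BY [Time])) > 0.0001
                             THEN 1 ELSE 0 END
    FROM dbo.GPSEntries
),
grouped AS
(
    SELECT *,
           -- running count of boundaries gives the group number
           GroupNumber = SUM(IsNewGroup) OVER (ORDER BY [Time] ROWS UNBOUNDED PRECEDING)
    FROM flagged
)
SELECT Latitude   = AVG(Latitude),
       Longitude  = AVG(Longitude),
       Velocity   = MIN(Velocity),
       Heading    = MIN(Heading),
       MinTime    = MIN([Time]),
       MaxTime    = MAX([Time]),
       [RowCount] = COUNT(*)
FROM grouped
GROUP BY GroupNumber
ORDER BY MinTime;
Because each group is keyed by where it starts in time rather than by its rounded coordinates, two stops at the same spot on different days land in different groups and are never merged.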

Related

Update previous rows based on another row

I need to update the empty rows based on the rows that have values.
In this case, I need to update the hours, mins and secs based on every 4th row.
For example: rownum 4 has 8 hours, 1 min, 9 sec, so the previous rows should be filled in starting from 8 hrs, 1 min, 6 sec at rownum 1; then, from rownum 5 on, the same procedure should continue.
Likewise, rownum 8 has 8 hours, 1 min, 13 sec, so the previous 3 rows should be filled in starting from 8 hrs, 1 min, 10 sec at rownum 5.
How can I do this with a loop, with PARTITION BY, or any other approach in SQL Server?
You can do this with window functions by converting your hours, minutes and seconds to a time value. Converting to a time is important to make sure you wrap around the appropriate time boundaries and don't end up with 61 seconds in a minute, etc.
Depending on the data and your real-world environment, you will probably need to add the Flight column, and maybe some others, into the PARTITION BY clauses to ensure you are working with correctly scoped windows of data.
Query
declare @t table(rn int, timeframe int, h int, m int, s int);
insert into @t values
(1,1,null,null,null)
,(2,1,null,null,null)
,(3,1,null,null,null)
,(4,1,23,59,45)
,(5,2,null,null,null)
,(6,2,null,null,null)
,(7,2,null,null,null)
,(8,2,23,59,49)
,(9,3,null,null,null)
,(10,3,null,null,null)
,(11,3,null,null,null)
,(12,3,23,59,53)
,(13,4,null,null,null)
,(14,4,null,null,null)
,(15,4,null,null,null)
,(16,4,23,59,57)
,(17,5,null,null,null)
,(18,5,null,null,null)
,(19,5,null,null,null)
,(20,5,0,0,1)
,(21,6,null,null,null)
,(22,6,null,null,null)
,(23,6,null,null,null)
,(24,6,0,0,5)
;
with d as
(
    select rn
          ,timeframe
          -- take the known time on the last row of each timeframe and shift it
          -- back by this row's distance from that last row (rn - max(rn) is <= 0)
          ,dateadd(second
                  ,rn - max(rn) over (partition by timeframe)
                  ,max(timefromparts(h,m,s,0,0)) over (partition by timeframe)
                  ) as t
    from @t
)
select rn
      ,timeframe
      ,datepart(hour,t)   as h
      ,datepart(minute,t) as m
      ,datepart(second,t) as s
from d
order by rn;
Output
rn  timeframe  h   m   s
--  ---------  --  --  --
1   1          23  59  42
2   1          23  59  43
3   1          23  59  44
4   1          23  59  45
5   2          23  59  46
6   2          23  59  47
7   2          23  59  48
8   2          23  59  49
9   3          23  59  50
10  3          23  59  51
11  3          23  59  52
12  3          23  59  53
13  4          23  59  54
14  4          23  59  55
15  4          23  59  56
16  4          23  59  57
17  5          23  59  58
18  5          23  59  59
19  5          0   0   0
20  5          0   0   1
21  6          0   0   2
22  6          0   0   3
23  6          0   0   4
24  6          0   0   5

Simple way of converting list from character to numeric in R?

I have looked into other threads on this problem and could not find an easy solution. I have imported data from Excel tables and joined them into lists, which generally look like this:
> Hemo
[[1]]
V1 V2 V3 V4 V5 V6 V7
1 0d 3d 6d 9d 12d 15d 18d
2 10 40 20 60 50 30 40
3 20 30 30 30 30 30 30
4 20 20 30 20 40 20 50
[[2]]
V1 V2 V3 V4 V5 V6 V7
1 0d 3d 6d 9d 12d 15d 18d
2 0 10 10 0 0 0 0
3 0 10 20 20 20 0 0
4 0 0 10 20 20 0 0
However I'd like them to look like this (which is an array):
, , 1
0d 3d 6d 9d 12d 15d 18d
V2 10 40 20 60 50 30 40
V3 20 30 30 30 30 30 30
V4 20 20 30 20 40 20 50
, , 2
0d 3d 6d 9d 12d 15d 18d
V2 0 10 10 0 0 0 0
V3 0 10 20 20 20 0 0
V4 0 0 10 20 20 0 0
In the first case all elements are characters, and I am not able to coerce them to numbers. Ultimately I'd like to convert the first list into the second array, where the first imported line serves as the column names. There must be some package enabling this? Please help me find a simple workaround, as I am a newbie. Thanks
It appears as though you imported the data from Excel, but the column names were interpreted as data. You didn't specify which function you used to do the importing, but with most of them you can specify that the first row of data contains the column names.
library(readxl)
data <- read_excel(filename, col_names = TRUE)
When you import your data properly, the column names won't be mixed in with the actual data, and the values should automatically be read as numerics. This way you won't have to convert them yourself.

Sum of multiple variables by group

I have a dataset with over 900 observations, each observation represents the population of a sub-geographical area for a given year by gender (male, female, all) and 20 different age groups.
I have dropped the variable for the sub-geographical area and I want to collapse the data into the greater geographical area (called Geo).
I am having a difficult time doing a SUM or PROC MEANS because I have so many age groups to sum up, and I am trying to avoid writing them all out. I want to collapse across the grouping variables year, geo, sex so that I only have 3 observations per Geo (my raw data could have as many as 54 observations).
This is an example of what a tiny section of the raw data looks like:
Year Geo Sex Age0005 Age0610 Age1115 (etc)
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
2010 2 1 71 66 68
2010 2 2 63 64 70
2010 2 3 134 130 138
2010 2 1 32 35 34
2010 2 2 29 31 36
2010 2 3 61 66 70
This is how I want it to look:
Year Group Sex Age0005 Age0610 Age1115 (etc)
2010 1 1 133 111 118
2010 1 2 109 122 108
2010 1 3 252 233 226
2010 2 1 103 101 102
2010 2 2 92 95 106
2010 2 3 195 196 208
Any ideas? Please help!
You don't have to write out each variable name individually - there are ways of getting around that. E.g. if all of the age group variables that need to be summed up start with age then you can use a : wildcard to match them:
proc summary nway data = have;
var age:;
class year geo sex;
output out = want sum=;
run;
If your variables don't have a common prefix, but are all next to each other in one big horizontal group in your dataset, you can use a double dash list instead:
proc summary nway data = have;
var age0005--age1115; /* Includes all variables between these two */
class year geo sex;
output out = want sum=;
run;
Note also the use of sum= - this means that each summarised variable is reproduced with its original name in the output dataset.
I personally like to use proc sql for this, since it makes it very clear what you're summing and grouping by.
data old ;
input Year Geo Sex Age0005 Age0610 Age1115 ;
datalines;
2010 1 1 92 73 75
2010 1 2 57 81 69
2010 1 3 159 154 144
2010 1 1 41 38 43
2010 1 2 52 41 39
2010 1 3 93 79 82
2010 2 1 71 66 68
2010 2 2 63 64 70
2010 2 3 134 130 138
2010 2 1 32 35 34
2010 2 2 29 31 36
2010 2 3 61 66 70
;
run;
proc sql ;
create table new as select
year
, geo label = 'Group'
, sex
, sum(age0005) as age0005
, sum(age0610) as age0610
, sum(age1115) as age1115
from old
group by geo, year, sex ;
quit;

How do I sum up my data in 4 rows?

Select
AvHours.LineNumber,
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) AS Throughput,
SUM (ProdDtime.DownTimeHRS) AS [Lost Time],
(SUM(cast(ProdDtime.DownTimeHRS AS decimal(10,1))) * 100) / (cast(AvHours.AvailableHRS AS decimal(10,1))) AS [%DownTime],
SUM(SProd.PoundsMade) AS [Pounds Made],
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) * SUM (ProdDtime.DownTimeHRS) AS [Pounds Lost]
FROM rpt_Line_Shift_AvailableHrs AvHours
inner join rpt_Line_Shift_Prod SProd on
AvHours.LineNumber=SProd.LineNumber AND AvHours.Shiftnumber=SProd.Shiftnumber
inner join rpt_Line_Shift_ProdDownTime ProdDtime on
(AvHours.LineNumber=ProdDtime.LineNumber AND AvHours.Shiftnumber=ProdDtime.Shiftnumber)
GROUP BY AvHours.LineNumber,SProd.PoundsMade,AvHours.AvailableHRS
ORDER BY AvHours.LineNumber
The query above gives the following result set:
Line#,Throughput,Lost Time, %downtime,Pounds Made,Pounds Lost
1 53 49 27.222222 97538 2597
1 44 39 20.312500 116229 1716
1 47 40 22.222222 92190 1880
1 55 31 16.145833 133215 1705
1 111 49 27.222222 204442 5439
1 13 31 16.145833 33540 403
1 86 49 27.222222 159432 4214
1 81 31 16.145833 197145 2511
1 74 40 22.222222 146202 2960
1 63 49 27.222222 115920 3087
1 76 39 20.312500 199172 2964
2 64 40 22.222222 126028 2560
2 149 49 27.222222 273966 7301
2 35 39 20.312500 92616 1365
3 49 39 20.312500 129591 1911
3 65 40 22.222222 129248 2600
3 84 39 20.312500 219997 3276
4 95 31 16.145833 229485 2945
4 76 40 22.222222 149996 3040
4 94 31 16.145833 228375 2914
4 99 39 20.312500 259794 3861
What I actually want is just 4 lines (Line# = 1,2,3 or 4) and all the other fields summed.
I'm not sure how to do it. Can anybody help?
Get rid of PoundsMade and AvailableHrs in your group by. It sounds like you only want to group by the Linenumber.
You can use your SQL as a nested (derived) table and then group by the line number, like the SQL below.
Select LineNumber, Sum(Throughput), Sum([Lost Time]), Sum([%DownTime]), Sum([Pounds Made]), Sum([Pounds Lost])
From
(Select
AvHours.LineNumber,
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) AS Throughput,
SUM (ProdDtime.DownTimeHRS) AS [Lost Time],
(SUM(cast(ProdDtime.DownTimeHRS AS decimal(10,1))) * 100) / (cast(AvHours.AvailableHRS AS decimal(10,1))) AS [%DownTime],
SUM(SProd.PoundsMade) AS [Pounds Made],
(SProd.PoundsMade / (AvHours.AvailableHRS - SUM (ProdDtime.DownTimeHRS))) * SUM (ProdDtime.DownTimeHRS) AS [Pounds Lost]
FROM rpt_Line_Shift_AvailableHrs AvHours
inner join rpt_Line_Shift_Prod SProd on
AvHours.LineNumber=SProd.LineNumber AND AvHours.Shiftnumber=SProd.Shiftnumber
inner join rpt_Line_Shift_ProdDownTime ProdDtime on
(AvHours.LineNumber=ProdDtime.LineNumber AND AvHours.Shiftnumber=ProdDtime.Shiftnumber)
GROUP BY AvHours.LineNumber,SProd.PoundsMade,AvHours.AvailableHRS
) A
Group BY LineNumber
ORDER BY LineNumber
I don't have a SQL Server available right now to test this out, but let me know if you encounter any issues.
Please mark this as the answer if it helped resolve your issue.

SAS_ data repeat read

I have raw data like this
time ID01 ID02 ID03 ~ IDxx
0 10 11 xx
0.5 20 12 xx
1 29 25 xx
1.5 41 30 xx
2 50 40 xx
3 30 50 xx
4 40 42 xx
. . .
. . .
. . .
I want to reshape it into this form:
x time temp.
01 0 10
01 0.5 20
01 1 29
01 1.5 41
01 2 50
01 3 30
01 4 40
02 0 11
02 0.5 12
02 1 25
02 1.5 30
02 2 40
02 3 50
02 4 42
I used an array statement and proc transpose, but I can't get the time variable repeated beside temp.
It works using arrays. Just write an output statement within the loop and time will be written to your output dataset; then sort.
data output;
set input;
array ID(*) ID01-ID03;
do i=1 to 3;
X=put(i,z2.);
temp=ID(i);
output;
end;
keep time X temp;
run;
proc sort data=output;
by X time;
run;
