Strava - Group Route Proximity with Latitude, Longitude & Time

Question: What is the most computationally efficient way to determine if two bike riders rode together given a stream of data with time, latitude, and longitude?
Background: I'm an avid cyclist and want to reverse engineer how Strava groups bike riders together. Here is their method to determine if cyclists are riding together (they use time and lat/lon of a ride): https://support.strava.com/hc/en-us/articles/216919497-Why-don-t-I-get-grouped-in-Activities-when-I-rode-ran-with-others-
After a bike ride is complete I have a file of latitude and longitude every second.
[Route maps for Rider 1 and Rider 2 omitted.]
You can see Riders 1 and 2 rode together, but Rider 2 started from a different spot and joined Rider 1 later.
I want to come up with the least computationally intensive way of determining that these two riders rode together, despite starting from different locations.
I think Strava's approach is good - basically, establish a proximity zone (150 meters) around each point on the route and compare the riders' routes to see if they spent 70% of their time within 150 meters of each other.
Rider 1 - Locations:
2016-03-27T11:47:45Z 42.113059 -87.736485
2016-03-27T11:47:46Z 42.113081 -87.736511
2016-03-27T11:47:47Z 42.113105 -87.736538
2016-03-27T11:47:48Z 42.113142 -87.736564
2016-03-27T11:47:49Z 42.113175 -87.736587
Rider 2 - Locations:
2016-03-27T11:47:45Z 42.113049 -87.736394 <= find the same timestamp for Rider 1 and determine if the positions are within 150 meters. If < 150 meters, assign 1; if > 150, assign 0.
I would iterate over every point of Rider 2 against every point of Rider 1, then sum up the 1s and 0s. If the sum divided by the total number of points is greater than 70%, the riders are grouped together.
I think this method would generally work, but it seems very computationally intensive, especially if there are thousands of riders to evaluate. Also, the data does not always have a latitude and longitude for every second. One mitigation would be to average the location over each minute and compare the per-minute averages, which would at least cut the number of iterations by a factor of 60.
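For the minute-averaging idea, a minimal T-SQL sketch, assuming positions are stored in a CyclistPosition(CyclistId, SamplingTime, Lat, Long) table like the one used in the answer below:

-- Average each rider's position per minute; comparing per-minute
-- averages instead of per-second points cuts the work roughly 60x.
SELECT
    CyclistId,
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, SamplingTime), 0) AS MinuteBucket,
    AVG(Lat) AS AvgLat,
    AVG(Long) AS AvgLong
FROM CyclistPosition
GROUP BY CyclistId, DATEADD(MINUTE, DATEDIFF(MINUTE, 0, SamplingTime), 0)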
I was hoping there was some statistical or GIS method to establish the "signature" of a route and compare signatures rather than a point by point comparison.
Any thoughts on how to compute the route comparison in the most efficient way?
Note: I posted a similar question on the GIS forum, but no one has responded yet, although I do think the question as written here is clearer. https://gis.stackexchange.com/questions/187019/strava-activity-route-grouping

I'm going to assume the following is true:
for each cyclist C, there is a data stream of time T, longitude X, and latitude Y (we're using projected X and Y for simplicity, not caring about the projection; however, we should)
the data stream can be written into a database or another kind of persistent data storage
the data stream for C is sampled at a rate of 1 s; given that there is no guarantee that every sample is taken, we have to assume a sample is present in more than 50% of cases (preferably > 95%; 99.7% would be perfect)
In that case, one table in the database contains all of the data needed for analytics. Let's see what it looks like for two cyclists, C1 and C2, compared to each other.
╔════╦════╦════╦════╦════╦═══════╗
║ T  ║ X1 ║ Y1 ║ X2 ║ Y2 ║ D     ║
╠════╬════╬════╬════╬════╬═══════╣
║ 1  ║ 10 ║ 15 ║ -  ║ -  ║ -     ║
║ 2  ║ 11 ║ 16 ║ -  ║ -  ║ -     ║
║ 3  ║ 11 ║ 17 ║ 19 ║ 11 ║ 10.00 ║
║ 4  ║ 12 ║ 18 ║ 18 ║ 11 ║ 9.22  ║
║ 5  ║ 12 ║ 17 ║ 17 ║ 12 ║ 7.07  ║
║ 6  ║ -  ║ -  ║ 15 ║ 12 ║ -     ║
║ 7  ║ 13 ║ 16 ║ 14 ║ 13 ║ 3.16  ║
║ 8  ║ 13 ║ 15 ║ 13 ║ 14 ║ 1.00  ║
║ 9  ║ 14 ║ 14 ║ 13 ║ 14 ║ 1.00  ║
║ 10 ║ 14 ║ 13 ║ 14 ║ 13 ║ 0.00  ║
║ 11 ║ 14 ║ 14 ║ 14 ║ 14 ║ 0.00  ║
║ 12 ║ 14 ║ 15 ║ 14 ║ 14 ║ 1.00  ║
║ 13 ║ 15 ║ 15 ║ 15 ║ 15 ║ 0.00  ║
║ 14 ║ 15 ║ 16 ║ 15 ║ 16 ║ 0.00  ║
║ 15 ║ 16 ║ 16 ║ 16 ║ 17 ║ 1.00  ║
║ 16 ║ 17 ║ 18 ║ 16 ║ 16 ║ 2.24  ║
╚════╩════╩════╩════╩════╩═══════╝
This comparison can easily be done using, e.g., a SELECT in the database, self-joining the table for the two cyclists. For a reasonable number of rows (e.g. < 10E5 or 10E6) and correctly set indexes, this computation is not resource intensive at all, especially since the query can be written so that the value D is not output for every position, but calculated just in order to aggregate (count). In that case, all you need is the ratio of the count of rows where D is less than or equal to your preferred threshold D0 vs. the total count of rows. If that ratio is equal to or greater than your limit (say, 70%), the cyclists went on a ride together.
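As a sketch of what "correctly set indexes" might mean here: a composite index that lets the time-matched self-join seek on rider and time while covering the coordinate columns (the index name is illustrative; the table is defined just below).

-- Seeks on (CyclistId, SamplingTime); Lat/Long are read from the leaf pages.
CREATE INDEX IX_CyclistPosition_Rider_Time
    ON CyclistPosition (CyclistId, SamplingTime)
    INCLUDE (Lat, Long)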
Let's see an example. Suppose there is a table in the database named CyclistPosition:
CyclistId - identifier of the cyclist
SamplingTime - UTC time of the sample (position) taken
Lat - latitude
Long - longitude
...with the following data:
╔═══════════╦═══════════════════════╦═══════════╦════════════╗
║ CyclistId ║ SamplingTime          ║ Lat       ║ Long       ║
╠═══════════╬═══════════════════════╬═══════════╬════════════╣
║ 1         ║ 2016-03-27T11:47:45Z  ║ 42.113059 ║ -87.736485 ║
║ 1         ║ 2016-03-27T11:47:46Z  ║ 42.113081 ║ -87.736511 ║
║ 1         ║ 2016-03-27T11:47:47Z  ║ 42.113105 ║ -87.736538 ║
║ 1         ║ 2016-03-27T11:47:48Z  ║ 42.113142 ║ -87.736564 ║
║ 1         ║ 2016-03-27T11:47:49Z  ║ 42.113175 ║ -87.736587 ║
║ 2         ║ 2016-03-27T11:47:45Z  ║ 42.113059 ║ -87.736394 ║
║ 2         ║ 2016-03-27T11:47:46Z  ║ 42.113085 ║ -87.736481 ║
║ 2         ║ 2016-03-27T11:47:47Z  ║ 42.113103 ║ -87.736531 ║
║ 2         ║ 2016-03-27T11:47:48Z  ║ 42.113139 ║ -87.736572 ║
║ 2         ║ 2016-03-27T11:47:49Z  ║ 42.113147 ║ -87.736595 ║
╚═══════════╩═══════════════════════╩═══════════╩════════════╝
...then we can extract the data for cyclists 1 and 2 using:
SELECT SamplingTime, Lat, Long FROM CyclistPosition WHERE CyclistId = 1
SELECT SamplingTime, Lat, Long FROM CyclistPosition WHERE CyclistId = 2
...and cross-reference that data using this query:
SELECT
    cp1.SamplingTime,
    Lat1 = cp1.Lat,
    Long1 = cp1.Long,
    Lat2 = cp2.Lat,
    Long2 = cp2.Long
FROM
    CyclistPosition cp1
    JOIN CyclistPosition cp2
        ON cp2.SamplingTime = cp1.SamplingTime
WHERE
    cp1.CyclistId = 1
    AND cp2.CyclistId = 2
We now have this kind of output, and if we add a roughly calculated distance in meters, Dm (using Mercator), we get:
╔═══════════════════════╦═══════════╦════════════╦═══════════╦════════════╦═══════════╗
║ SamplingTime          ║ Lat1      ║ Long1      ║ Lat2      ║ Long2      ║ Dm        ║
╠═══════════════════════╬═══════════╬════════════╬═══════════╬════════════╬═══════════╣
║ 2016-03-27T11:47:45Z  ║ 42.113059 ║ -87.736485 ║ 42.113059 ║ -87.736394 ║ 10.118517 ║
║ 2016-03-27T11:47:46Z  ║ 42.113081 ║ -87.736511 ║ 42.113085 ║ -87.736481 ║ 3.334919  ║
║ 2016-03-27T11:47:47Z  ║ 42.113105 ║ -87.736538 ║ 42.113103 ║ -87.736531 ║ 0.777079  ║
║ 2016-03-27T11:47:48Z  ║ 42.113142 ║ -87.736564 ║ 42.113139 ║ -87.736572 ║ 0.890572  ║
║ 2016-03-27T11:47:49Z  ║ 42.113175 ║ -87.736587 ║ 42.113147 ║ -87.736595 ║ 0.900635  ║
╚═══════════════════════╩═══════════╩════════════╩═══════════╩════════════╩═══════════╝
Note that for a rough calculation of the distance in meters you have to find a formula; I used the one from here:
http://bluemm.blogspot.hr/2007/01/excel-formula-to-calculate-distance.html
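That Excel formula appears to be the spherical law of cosines; as a sketch, the same calculation in T-SQL, assuming @Lat1/@Long1/@Lat2/@Long2 hold the two positions in degrees:

-- Great-circle distance via the spherical law of cosines.
-- 6371000 m is the mean Earth radius; RADIANS() converts degrees to radians.
SELECT ACOS(SIN(RADIANS(@Lat1)) * SIN(RADIANS(@Lat2))
          + COS(RADIANS(@Lat1)) * COS(RADIANS(@Lat2))
          * COS(RADIANS(@Long2 - @Long1))) * 6371000 AS Dm

For points only a few meters apart, floating-point rounding can push the ACOS argument slightly above 1.0, so production code would clamp it first.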
Now we have to aggregate and count the data. We have to limit the data to a start and end time (T1 and T2) and establish the maximum distance (D0) at which we still say the cyclists are riding together. A simple way to do that in SQL would be:
DECLARE @togetherPositions int
DECLARE @allPositions int
DECLARE @ratio decimal(18,2)

-- @T1/@T2 (time window) and @D0 (max distance in meters) are assumed
-- to be declared and assigned elsewhere.
SELECT @togetherPositions = COUNT(*)
FROM
    CyclistPosition cp1
    JOIN CyclistPosition cp2
        ON cp2.SamplingTime = cp1.SamplingTime
WHERE
    cp1.CyclistId = 1          -- the two riders being compared
    AND cp2.CyclistId = 2
    AND cp1.SamplingTime BETWEEN @T1 AND @T2
    AND {formula to get distance in meters} <= @D0

SELECT @allPositions = COUNT(*)
FROM
    CyclistPosition cp1
    JOIN CyclistPosition cp2
        ON cp2.SamplingTime = cp1.SamplingTime
WHERE
    cp1.CyclistId = 1
    AND cp2.CyclistId = 2
    AND cp1.SamplingTime BETWEEN @T1 AND @T2

-- Multiply by 1.0 before dividing: two ints would use integer division
-- and truncate the ratio to 0.
SET @ratio = @togetherPositions * 1.0 / @allPositions
Now you just have to decide on the threshold for the ratio: 0.7, 0.8, 0.85...
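The two counting queries can also be collapsed into a single pass with conditional aggregation; a sketch under the same assumptions (same placeholder for the distance formula):

-- One scan of the join: count "together" samples and all samples at once.
SELECT @ratio = SUM(CASE WHEN {formula to get distance in meters} <= @D0
                         THEN 1.0 ELSE 0.0 END) / COUNT(*)
FROM
    CyclistPosition cp1
    JOIN CyclistPosition cp2
        ON cp2.SamplingTime = cp1.SamplingTime
WHERE
    cp1.CyclistId = 1
    AND cp2.CyclistId = 2
    AND cp1.SamplingTime BETWEEN @T1 AND @T2

This scans the join once instead of twice, which matters when thousands of rider pairs have to be evaluated.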
HTH

Related

Random dates without duplicates based on the value of other columns

I have a temp table called #RandomDates that looks like this in SQL Server:
╔════╦═════════════╦══════════╦══════════════════╦════════════════════════════════╦═══════════════════════╗
║ ID ║ Description ║ RaceType ║ RaceStartTime    ║ AverageCompletionTimeInMinutes ║ PredictCompletionTime ║
╠════╬═════════════╬══════════╬══════════════════╬════════════════════════════════╬═══════════════════════╣
║ 1  ║ Player1     ║ RaceA    ║ 2025-05-10 10:00 ║ 120                            ║ NULL                  ║
║ 2  ║ Player2     ║ RaceA    ║ 2025-05-12 17:00 ║ 120                            ║ NULL                  ║
║ 3  ║ Player3     ║ RaceC    ║ 2025-08-12 08:15 ║ 60                             ║ NULL                  ║
║ 5  ║ Player4     ║ RaceY    ║ 2025-08-29 16:00 ║ 10                             ║ NULL                  ║
║ 6  ║ Player4     ║ RaceY    ║ 2025-08-30 21:00 ║ 10                             ║ NULL                  ║
╚════╩═════════════╩══════════╩══════════════════╩════════════════════════════════╩═══════════════════════╝
I want to update the column "PredictCompletionTime" with random dates; however, I need them to be based on the values of the columns "RaceStartTime" and "AverageCompletionTimeInMinutes".
Example for ID = 1
RaceA takes place on 2025-05-10 10:00
RaceA takes an average of 120 minutes to complete
I want my randomized "PredictCompletionTime" column to be somewhere between:
RaceStartTime + AverageCompletionTimeInMinutes, then RANDOMLY add OR deduct a small amount of minutes and seconds (let's say between 5 and 10 minutes)
So valid dates for this example could be:
2025-05-10 12:07:20
2025-05-10 11:59:40
I have tried doing this with RAND(), but for some reason my "PredictCompletionTime" column keeps getting updated with duplicate values for each RaceType.
Thanks in advance,
Here is an example: for each row, the expression generates a random number of seconds between @MinTime and AverageCompletionTimeInMinutes and adds it to RaceStartTime:

DECLARE @MinTime int = 300 -- in seconds

-- RAND(CHECKSUM(NEWID())) is reseeded per row; a bare RAND() is evaluated
-- only once per statement, which is what produced the duplicate values.
UPDATE #RandomDates
SET PredictCompletionTime =
    DATEADD(SECOND,
            ROUND(@MinTime + RAND(CHECKSUM(NEWID()))
                  * (AverageCompletionTimeInMinutes * 60 - @MinTime), 0),
            RaceStartTime)
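Strictly, the question asks for RaceStartTime + AverageCompletionTimeInMinutes plus or minus 5 to 10 minutes; a sketch of that exact spec, using the same per-row reseeding trick:

-- Offset magnitude 300-600 s (5 to 10 min), sign chosen at random per row.
UPDATE #RandomDates
SET PredictCompletionTime =
    DATEADD(SECOND,
            (300 + ROUND(RAND(CHECKSUM(NEWID())) * 300, 0))
            * CASE WHEN RAND(CHECKSUM(NEWID())) < 0.5 THEN -1 ELSE 1 END,
            DATEADD(MINUTE, AverageCompletionTimeInMinutes, RaceStartTime))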

T-SQL: I'm trying to get the latest row of a column but also the sum of another column

I have the following table; it displays the SalesQty and the StockQty grouped by Article, Supplier, Branch and Month.
╔════════╦════════╦══════════╦═════════╦══════════╦══════════╗
║ Month ║ Branch ║ Supplier ║ Article ║ SalesQty ║ StockQty ║
╠════════╬════════╬══════════╬═════════╬══════════╬══════════╣
║ 201811 ║ 333 ║ 2 ║ 3122 ║ 4 ║ 11 ║
║ 201811 ║ 345 ║ 1 ║ 1234 ║ 2 ║ 10 ║
║ 201811 ║ 345 ║ 1 ║ 4321 ║ 3 ║ 11 ║
║ 201812 ║ 333 ║ 2 ║ 3122 ║ 2 ║ 4 ║
║ 201812 ║ 345 ║ 1 ║ 1234 ║ 3 ║ 12 ║
║ 201812 ║ 345 ║ 1 ║ 4321 ║ 4 ║ 5 ║
║ 201901 ║ 333 ║ 2 ║ 3122 ║ 1 ║ 8 ║
║ 201901 ║ 345 ║ 1 ║ 1234 ║ 6 ║ 9 ║
║ 201901 ║ 345 ║ 1 ║ 4321 ║ 2 ║ 8 ║
║ 201902 ║ 333 ║ 2 ║ 3122 ║ 7 ║ NULL ║
║ 201902 ║ 345 ║ 1 ║ 1234 ║ 4 ║ 13 ║
║ 201902 ║ 345 ║ 1 ║ 4321 ║ 1 ║ 10 ║
╚════════╩════════╩══════════╩═════════╩══════════╩══════════╝
Now I want to sum the SalesQty and get the latest StockQty and group them by Article, Supplier, Branch.
The final result should look like this:
╔════════╦══════════╦═════════╦═════════════╦════════════════╗
║ Branch ║ Supplier ║ Article ║ SumSalesQty ║ LatestStockQty ║
╠════════╬══════════╬═════════╬═════════════╬════════════════╣
║ 333 ║ 2 ║ 3122 ║ 14 ║ NULL ║
║ 345 ║ 1 ║ 1234 ║ 15 ║ 13 ║
║ 345 ║ 1 ║ 4321 ║ 10 ║ 10 ║
╚════════╩══════════╩═════════╩═════════════╩════════════════╝
I already tried this, but it gives me an error, and I have no idea what to do in this case.
I've made this example so you can try it yourself: db<>fiddle
SELECT
    Branch,
    Supplier,
    Article,
    SumSalesQty = SUM(SalesQty),
    -- my attempt
    LatestStockQty = (SELECT StockQty FROM TestTable i
                      WHERE MAX(Month) = Month
                        AND TT.Branch = i.Branch
                        AND TT.Supplier = i.Branch
                        AND TT.Article = i.Branch)
FROM
    TestTable TT
GROUP BY
    Branch, Supplier, Article
Thank you for your help!
We can try using ROW_NUMBER here, to isolate the latest record for each group:
WITH cte AS (
    SELECT t.*,
           ROW_NUMBER() OVER (PARTITION BY Branch, Supplier, Article
                              ORDER BY Month DESC) rn,
           SUM(SalesQty) OVER (PARTITION BY Branch, Supplier, Article) SumSalesQty
    FROM TestTable t
)
SELECT
    Month,
    Branch,
    Supplier,
    Article,
    SumSalesQty,
    StockQty
FROM cte
WHERE rn = 1;
Inside the CTE we compute, for each Branch/Supplier/Article group, a row number, starting with 1 for the most recent month. We also compute the sum of the sales quantity over the same partition. Then we only need to select the rows from the CTE where the row number is equal to 1.
Demo
A similar approach, but without the CTE:
SELECT TOP 1 WITH TIES
    Branch
    , Supplier
    , Article
    , SUM(SalesQty) OVER (PARTITION BY Branch, Supplier, Article) SumSalesQty
    , TT.StockQty AS LatestStockQty
FROM TestTable TT
ORDER BY ROW_NUMBER() OVER (PARTITION BY Branch, Supplier, Article ORDER BY Month DESC)

Non-Aggregate Pivot [duplicate]

This question already has an answer here:
SQL transpose full table
I have a table like this:
╔════════╦═══╦═══╦═══╦═══╦═══╗
║ row_id ║ 1 ║ 2 ║ 3 ║ 4 ║ 5 ║
╠════════╬═══╬═══╬═══╬═══╬═══╣
║ 1 ║ T ║ E ║ S ║ N ║ U ║
║ 2 ║ M ║ B ║ R ║ H ║ A ║
║ 3 ║ C ║ D ║ F ║ G ║ I ║
║ 4 ║ J ║ K ║ L ║ O ║ P ║
║ 5 ║ V ║ W ║ X ║ Y ║ Z ║
╚════════╩═══╩═══╩═══╩═══╩═══╝
I want to "pivot" the table to get an outcome where the row_id column is the first row, the 1 column the second etc.
The results should look like this:
╔════════╦═══╦═══╦═══╦═══╦═══╗
║ row_id ║ 1 ║ 2 ║ 3 ║ 4 ║ 5 ║
╠════════╬═══╬═══╬═══╬═══╬═══╣
║ 1 ║ T ║ M ║ C ║ J ║ V ║
║ 2 ║ E ║ B ║ D ║ K ║ W ║
║ 3 ║ S ║ R ║ F ║ L ║ X ║
║ 4 ║ N ║ H ║ G ║ O ║ Y ║
║ 5 ║ U ║ A ║ I ║ P ║ Z ║
╚════════╩═══╩═══╩═══╩═══╩═══╝
I've looked for ideas about pivoting without aggregates, but without much luck, mainly since the data I want to pivot is non-numeric.
I've set up the sample data in SQL Fiddle.
Thanks!
What you need is called "matrix transposition". The optimal SQL query will depend very much on how you actually store the data, so it wouldn't hurt if you provided a more realistic example of your table's structure. Are you sure all the matrices you'll ever need to work with will be exactly 5×5? :)
UPD: Oh, I see you've found it.
I realized my mistake was looking for pivot and not for transpose.
I found an answer here and solved the problem with the following query:
SELECT *
FROM (SELECT row_id,
             col,
             value
      FROM table1
      UNPIVOT ( value
                FOR col IN ([1], [2], [3], [4], [5]) ) unpiv) src
PIVOT ( MAX(value)
        FOR row_id IN ([1], [2], [3], [4], [5]) ) piv
The results are on SQL Fiddle.

UNPIVOT Data with over Forty Columns

Test Data
DECLARE @T table
(
    ClientID int, Dated datetime,
    Value1 varchar(10), Value2 varchar(10), Value3 varchar(10),
    Value4 varchar(10), Value5 varchar(10), Value6 varchar(10),
    Value7 varchar(10), Value8 varchar(10), Value9 varchar(10)
)

INSERT INTO @T VALUES
(1, '2014-01-06 16:27:47.440', 'High', 'Low', 'Medium', 'High', 'Medium', 'Low', 'Medium', 'High', 'Low'),
(2, '2014-01-06 16:27:47.440', 'Medium', 'High', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Low', 'Medium'),
(1, '2014-01-01 16:27:47.440', 'Medium', 'Low', 'High', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium')

SELECT * FROM @T
╔══════════╦═════════════════════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╦════════╗
║ ClientID ║ Dated ║ Value1 ║ Value2 ║ Value3 ║ Value4 ║ Value5 ║ Value6 ║ Value7 ║ Value8 ║ Value9 ║
╠══════════╬═════════════════════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╬════════╣
║ 1 ║ 2014-06-01 16:27:47.440 ║ High ║ Low ║ Medium ║ High ║ Medium ║ Low ║ Medium ║ High ║ Low ║
║ 2 ║ 2014-06-01 16:27:47.440 ║ Medium ║ High ║ Low ║ Medium ║ High ║ Low ║ Medium ║ Low ║ Medium ║
║ 1 ║ 2014-01-01 16:27:47.440 ║ Medium ║ Low ║ High ║ Medium ║ Low ║ Medium ║ High ║ Low ║ Medium ║
╚══════════╩═════════════════════════╩════════╩════════╩════════╩════════╩════════╩════════╩════════╩════════╩════════╝
My Query
SELECT TOP 1
B.Value1 AS Historical_Value1, A.Value1 AS Recent_Value1
, B.Value2 AS Historical_Value2, A.Value2 AS Recent_Value2
, B.Value3 AS Historical_Value3, A.Value3 AS Recent_Value3
, B.Value4 AS Historical_Value4, A.Value4 AS Recent_Value4
, B.Value5 AS Historical_Value5, A.Value5 AS Recent_Value5
, B.Value6 AS Historical_Value6, A.Value6 AS Recent_Value6
, B.Value7 AS Historical_Value7, A.Value7 AS Recent_Value7
, B.Value8 AS Historical_Value8, A.Value8 AS Recent_Value8
, B.Value9 AS Historical_Value9, A.Value9 AS Recent_Value9
FROM @T A INNER JOIN @T B
ON A.ClientID = B.ClientID
WHERE B.Dated < A.Dated
ORDER BY A.Dated DESC, B.Dated DESC
As you can see, I am pulling out the latest recordings for all the values and the recordings taken prior to those: recent values and historical values, respectively.
Which returns me Data back in the following format.
Current OUTPUT
╔═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╦═══════════════════╦═══════════════╗
║ Historical_Value1 ║ Recent_Value1 ║ Historical_Value2 ║ Recent_Value2 ║ Historical_Value3 ║ Recent_Value3 ║ Historical_Value4 ║ Recent_Value4 ║ Historical_Value5 ║ Recent_Value5 ║ Historical_Value6 ║ Recent_Value6 ║ Historical_Value7 ║ Recent_Value7 ║ Historical_Value8 ║ Recent_Value8 ║ Historical_Value9 ║ Recent_Value9 ║
╠═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╬═══════════════════╬═══════════════╣
║ Medium ║ High ║ Low ║ Low ║ High ║ Medium ║ Medium ║ High ║ Low ║ Medium ║ Medium ║ Low ║ High ║ Medium ║ Low ║ High ║ Medium ║ Low ║
╚═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╩═══════════════════╩═══════════════╝
Desired OUTPUT
But I would like to UNPIVOT the data so it is shown as follows. I have seen a lot of questions on SO, but none of them seems to fit my requirement. Any pointers or advice are most welcome, thank you.
╔════════╦════════════╦════════╗
║ Values ║ Historical ║ Recent ║
╠════════╬════════════╬════════╣
║ Value1 ║ High ║ Medium ║
║ Value2 ║ Low ║ Low ║
║ Value3 ║ Medium ║ High ║
║ Value4 ║ High ║ Medium ║
║ Value5 ║ High ║ Medium ║
╚════════╩════════════╩════════╝
This would be one way to do it:
;WITH up AS
(
    SELECT * FROM @T
    UNPIVOT
    (
        val FOR n IN (Value1, Value2, Value3, Value4, Value5,
                      Value6, Value7, Value8, Value9)
    ) AS pv
)
SELECT
    A.ClientID,
    A.Dated,
    A.n AS [Values],   -- VALUES is a reserved word, so it has to be bracketed
    A.val AS Recent,
    B.val AS History
FROM
    up AS A
    JOIN up AS B
        ON A.ClientID = B.ClientID
       AND A.n = B.n
WHERE B.Dated < A.Dated
ORDER BY
    A.Dated DESC, B.Dated DESC
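Since the title mentions over forty columns, typing the IN list out by hand gets unwieldy. A sketch that generates the list from metadata instead, assuming the data lives in a permanent table (here the hypothetical dbo.Readings, because table variables are not visible in sys.columns) and SQL Server 2017+ for STRING_AGG:

DECLARE @cols nvarchar(max), @sql nvarchar(max)

-- Build a quoted, comma-separated list of all ValueN columns.
SELECT @cols = STRING_AGG(QUOTENAME(c.name), ', ')
FROM sys.columns c
WHERE c.object_id = OBJECT_ID('dbo.Readings')
  AND c.name LIKE 'Value%'

-- Splice the generated list into the UNPIVOT and execute it.
SET @sql = N'SELECT ClientID, Dated, n AS [Values], val
             FROM dbo.Readings
             UNPIVOT (val FOR n IN (' + @cols + N')) AS pv;'

EXEC sys.sp_executesql @sql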

Unique ID for each row in recursive CTE hierarchy with multiple nodes

I have a table containing a set of links that form a hierarchy. The big problem is that each link may be used several times (in different positions). I need to be able to distinguish between each "instance" of each node.
For example in the following data, link "D-G" will show up several times:
╔════════════╦════════╗
║ SOURCE ║ TARGET ║
╠════════════╬════════╣
║ A ║ B ║
║ A ║ C ║
║ B ║ D ║
║ B ║ E ║
║ B ║ F ║
║ C ║ D ║
║ C ║ E ║
║ C ║ F ║
║ D ║ G ║
║ E ║ D ║
║ F ║ D ║
╚════════════╩════════╝
I can build the hierarchy using a recursive CTE without any problems, but I want to give each row in the results a unique ID and link it to the parent node's unique ID.
My original idea was to assign a unique ID to each row using Row_Number() + Max(ID) up to that point and have each row inherit its parent's ID, but further reading and trial & error showed that this won't work :-(
Does anybody have an idea how to solve this problem (or at least give me a clue)?
The results should be something like this:
╔═════════════╦═════════════╦═══════════╦═══════════╗
║ SOURCE_DESC ║ TARGET_DESC ║ Source_ID ║ Target_ID ║
╠═════════════╬═════════════╬═══════════╬═══════════╣
║ A ║ B ║ 0 ║ 1 ║
║ A ║ C ║ 0 ║ 2 ║
║ B ║ D ║ 1 ║ 6 ║
║ B ║ E ║ 1 ║ 7 ║
║ B ║ F ║ 1 ║ 8 ║
║ C ║ D ║ 2 ║ 3 ║
║ C ║ E ║ 2 ║ 4 ║
║ C ║ F ║ 2 ║ 5 ║
║ D ║ G ║ 3 ║ 13 ║
║ E ║ D ║ 4 ║ 11 ║
║ F ║ D ║ 5 ║ 10 ║
║ D ║ G ║ 6 ║ 14 ║
║ E ║ D ║ 7 ║ 12 ║
║ F ║ D ║ 8 ║ 9 ║
║ D ║ G ║ 9 ║ 18 ║
║ D ║ G ║ 10 ║ 17 ║
║ D ║ G ║ 11 ║ 16 ║
║ D ║ G ║ 12 ║ 15 ║
╚═════════════╩═════════════╩═══════════╩═══════════╝
Here the "D-G" link shows up several times, but in each instance it has a different ID and a different parent ID!
I've managed to do it, but I'm not happy with the way I did it. It doesn't seem very efficient (not important for this example, but very important for much larger sets!)
WITH JUNK_DATA AS
(
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY SOURCE) RN
    FROM LINKS
),
RECUR AS
(
    SELECT T1.SOURCE,
           T1.TARGET,
           CAST('ROOT' AS VARCHAR(MAX)) NAME,
           1 AS RAMA,
           CAST(T1.RN AS VARCHAR(MAX)) + ',' AS FULL_RAMA
    FROM JUNK_DATA T1
    LEFT JOIN JUNK_DATA T2
        ON T1.SOURCE = T2.TARGET
    WHERE T2.TARGET IS NULL
    UNION ALL
    SELECT JUNK_DATA.SOURCE,
           JUNK_DATA.TARGET,
           CASE
               WHEN RAMA = 1 THEN (SELECT [DESC]
                                   FROM NAMES
                                   WHERE ID = JUNK_DATA.SOURCE)
               ELSE NAME
           END NAME,
           RAMA + 1 AS RAMA,
           FULL_RAMA + CAST(JUNK_DATA.RN AS VARCHAR(MAX)) + ','
    FROM (SELECT * FROM JUNK_DATA) JUNK_DATA
    INNER JOIN (SELECT * FROM RECUR) RECUR
        ON JUNK_DATA.SOURCE = RECUR.TARGET
),
FINAL_DATA AS
(
    SELECT T2.[DESC] SOURCE_DESC,
           T3.[DESC] TARGET_DESC,
           RECUR.*,
           ROW_NUMBER() OVER (ORDER BY RAMA) ID
    FROM RECUR
    INNER JOIN NAMES T2
        ON RECUR.SOURCE = T2.ID
    INNER JOIN NAMES T3
        ON RECUR.TARGET = T3.ID
)
SELECT T1.SOURCE_DESC,
       T1.TARGET_DESC,
       ISNULL(T2.ID, 0) AS SOURCE_ID,
       T1.ID TARGET_ID
FROM FINAL_DATA T1
LEFT JOIN (SELECT ID, FULL_RAMA FROM FINAL_DATA) T2
    ON LEFT(T1.FULL_RAMA, LEN(T1.FULL_RAMA) - CHARINDEX(',', REVERSE(T1.FULL_RAMA), 2)) + ',' = T2.FULL_RAMA
ORDER BY SOURCE_ID,
         TARGET_ID
Check it out on SQL fiddle.
