Google Maps SQL Server: calculating outlier geographic data within group

There are 100 suppliers, each with between 50 and 1000 items. Each supplier may have items close to their office or spread across an entire country or continent.
As LatLngs are input by a human, some mistakes happen. With lots of data and constant 'churn', mistakes are difficult to identify.
To improve data quality, I want to identify outliers for each supplier so that they can be fixed. If a supplier's items are mostly near New York, one in California would be an outlier.
SUPPLIERS
SupplierID int
Latitude DECIMAL(12,9)
Longitude DECIMAL(12,9)
ITEMS
ItemID int
SupplierID int
LatLng geography
I assume I need to use standard deviation for this, but putting it into T-SQL is giving me a headache.
I'd like to output a list of outliers for each supplier, based on each supplier's specific deviation.
This code outputs Items and the distance between each item and the supplier's office.
WITH cte AS
(
    SELECT
        v.ItemID,
        v.SupplierID,
        v.LatLng,
        v.LatLng.STDistance(GEOGRAPHY::Point(a.Latitude, a.Longitude, 4326)) / 1000 AS Distance
    FROM
        Items v
        JOIN Suppliers a ON v.SupplierID = a.SupplierID
)
SELECT
    ItemID, SupplierID, Distance
FROM cte
Here's the SQL functionality for standard deviation (from a blog post):
DECLARE @StdDev DECIMAL(5,2)
DECLARE @Avg DECIMAL(5,2)

SELECT
    @StdDev = STDEV(Qty),
    @Avg = AVG(Qty)
FROM Sales

SELECT *
FROM Sales
WHERE
    Qty > @Avg - @StdDev AND
    Qty < @Avg + @StdDev
STEPS I NEED TO DO
Calculate STDEV and AVG for distance, GROUP BY SupplierID
Output items where the distance is greater than AVG + STDEV for the item's supplier
This is where I'm scratching my head as this is multiple steps AFTER the multiple steps I've already performed. I guess I could insert what I have into a TEMP table and go from there, but is that really the best way?

You can use window functions for this. Both AVG and STDEV are available as windowed aggregates, so you can compute the per-supplier statistics and filter to the outliers in a single statement:
WITH Distances AS
(
    SELECT
        i.ItemID,
        s.SupplierID,
        i.LatLng,
        v.SupplierLocation,
        i.LatLng.STDistance(v.SupplierLocation) / 1000 AS Distance
    FROM
        Items i
        JOIN Suppliers s ON i.SupplierID = s.SupplierID
        CROSS APPLY (VALUES (
            GEOGRAPHY::Point(s.Latitude, s.Longitude, 4326)
        )) v(SupplierLocation)
),
Averages AS
(
    SELECT
        ItemID,
        SupplierID,
        LatLng,
        SupplierLocation,
        Distance,
        AVG(Distance) OVER (PARTITION BY SupplierID) AS AvgDistance,
        STDEV(Distance) OVER (PARTITION BY SupplierID) AS StDevDistance
    FROM
        Distances
)
SELECT
    ItemID,
    SupplierID,
    Distance,
    AvgDistance,
    StDevDistance
FROM
    Averages
WHERE
    Distance > AvgDistance + StDevDistance;
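If one standard deviation flags too many legitimate rows, a common refinement (a sketch, not part of the original answer) is to widen the band with a multiplier; replace the final SELECT above with:
SELECT
    ItemID,
    SupplierID,
    Distance
FROM
    Averages
WHERE
    Distance > AvgDistance + 2 * StDevDistance;  -- 2 std devs: flags only the wilder outliers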

Related

How to determine data distribution in SQL Server columns using T-SQL

Can someone show me how to write T-SQL that will let me view the distribution of data in columns?
For example, in the sample table there is a column called model, and 50% of the values in that column are Fiestas. I would like a query that helps determine the distribution of data in the columns.
I have included some sample code to help:
CREATE TABLE #tmpTable
(
registration varchar(50),
make varchar(50),
model varchar(50),
engine_size float
)
INSERT INTO #tmpTable VALUES
('JjFw5a0','SKODA','OCTAVIA',1.8),
('VkfCDpZ','FORD','FIESTA',1.7),
('5E93ZEq','SKODA','OCTAVIA',1.3),
('L2PPN0m','FORD','FIESTA',1.1),
('9xKghxp','FORD','FIESTA',1.5),
('WHShdBm','FORD','FIESTA',1.4),
('TNRHyy7','NISSAN','QASHQAI',1.2),
('6RNX0XG','SKODA','OCTAVIA',1.4),
('tJ9bOD8','FORD','FIESTA',1.1),
('ablFUSC','FORD','FIESTA',1),
('4B7RLYL','MERCEDED_BENZ','E CLASS',1.3),
('tlJiwVY','FORD','FIESTA',1),
('Fb9lcvG','FORD','FIESTA',1.4),
('nW4lqBC','FORD','FIESTA',1.6),
('LggTmL5','HYUNDAI','I20',1),
('2mGgSjS','FORD','FIESTA',1.1),
('IDvOzcM','FORD','FIESTA',1.3),
('JefpXK2','FORD','FIESTA',1.5),
('0h1uWfZ','MERCEDED_BENZ','E CLASS',1.4),
('ylBoGbV','MERCEDED_BENZ','E CLASS',1.7),
('XzoILDK','VAUXHALL','CORSA',1.8),
('Xhocs1Z','FORD','FIESTA',1.5),
('Lh2yWGa','KIA','RIO',1.5),
('hM5GWA0','FORD','FIESTA',1.3),
('PbpxkFt','FORD','FIESTA',1.7),
('SDHWV2r','FORD','FIESTA',1.2),
('n83Je2D','FORD','FIESTA',1.8),
('sDN0gex','FORD','FIESTA',1.2),
('7EICOZY','KIA','RIO',1.5),
('PUuMmIH','FORD','FIESTA',1),
('HiBwSg2','FORD','FIESTA',1.8),
('1yk1vDm','KIA','RIO',1.7),
('cMpH72R','HYUNDAI','I20',1.1),
('ZgQL0gt','MERCEDED_BENZ','E CLASS',1.3),
('jhpamQG','KIA','RIO',1.1),
('pk0lU2F','VAUXHALL','CORSA',1.4),
('fDCUeq1','FORD','FIESTA',1.1),
('ono5QFC','FORD','FIESTA',1.7),
('VohWwGR','FORD','FIESTA',1.5),
('Hih8dKc','SUZUKI','SWIFT',1.2),
('D2RNn3h','SUZUKI','SWIFT',1.2),
('QaYQulE','FORD','FIESTA',1.1),
('xmQPxAG','FORD','FIESTA',1.8),
('vmTqkTO','FORD','FIESTA',1.2),
('lvUtVUA','MERCEDED_BENZ','E CLASS',1),
('SFoj00d','FORD','FIESTA',1),
('9S6wrWV','MERCEDED_BENZ','E CLASS',1),
('0SBnW0z','FORD','FIESTA',1.1),
('HnDHdfj','MERCEDED_BENZ','E CLASS',1),
('RV7q947','FORD','FIESTA',1.4),
('JZqCtTg','FORD','FIESTA',1.7),
('XVgBwgi','FORD','FIESTA',1.8),
('iqJDsIF','FORD','FIESTA',1.6),
('CMbpRFa','FORD','FIESTA',1.6),
('vF7K5Xg','SUZUKI','SWIFT',1.1),
('3j6XGDH','FORD','FIESTA',1.5),
('ommqugM','FORD','FIESTA',1.1),
('LMQkPnw','NISSAN','QASHQAI',1.4),
('1dKgcdd','FORD','FIESTA',1.5),
('hC8BxiP','MERCEDED_BENZ','E CLASS',1.1),
('wLTWol7','FORD','FIESTA',1.6),
('TY8ChYN','FORD','FIESTA',1.6),
('Gw1CpI8','FORD','FIESTA',1.4),
('L4OPAJq','FORD','FIESTA',1.1),
('6TyYpfi','NISSAN','QASHQAI',1.6),
('ozoOcGL','FORD','FIESTA',1.4),
('6IME19U','FORD','FIESTA',1.4),
('BxpmJO5','FORD','FIESTA',1.4),
('0zc2n5A','FORD','FIESTA',1.3),
('FqbBZE2','FIAT','500',1.7),
('2EkTOTz','FORD','FIESTA',1.4),
('fNBvIvg','MERCEDED_BENZ','E CLASS',1.2),
('u5j4R4S','KIA','RIO',1.4),
('zpWaUZo','FORD','FIESTA',1.1),
('FQPVQYc','NISSAN','QASHQAI',1.7),
('8RBQADq','KIA','RIO',1.7),
('TOz2bcT','HYUNDAI','I20',1.7),
('jebhCex','FORD','FIESTA',1.3),
('cdHA1gL','FORD','FIESTA',1.2),
('FoaN4AT','FORD','FIESTA',1.7),
('atGn288','FORD','FIESTA',1.5),
('es8VNdW','FIAT','500',1.3),
('hDWoMXa','KIA','RIO',1.4),
('Q9C6Br1','KIA','RIO',1.5),
('mFSy4aF','FORD','FIESTA',1.6),
('bbbKnrM','SKODA','OCTAVIA',1.5),
('qY7lz6I','FORD','FIESTA',1),
('8Ch2OeU','VAUXHALL','CORSA',1.3),
('dcWsjJv','VAUXHALL','CORSA',1.3),
('bnnoBPg','SKODA','OCTAVIA',1.8),
('mvDyYkK','FORD','FIESTA',1.4),
('KpWDYap','FORD','FIESTA',1.3),
('7EK9K4z','FORD','FIESTA',1.3),
('ZPLHtlP','FORD','FIESTA',1.6),
('4EpYeSB','FORD','FIESTA',1.6),
('O1eZ20M','FORD','FIESTA',1),
('WfVntKk','FORD','FIESTA',1.7),
('6VlkBdi','FORD','FIESTA',1.1),
('hFQfKjk','KIA','RIO',1.4),
('3Y4njNP','KIA','RIO',1),
('3UuNqG0','FORD','FIESTA',1.7),
('qpvMYAu','FORD','FIESTA',1.1),
('NCYJUqx','FORD','FIESTA',1.3),
('M0AvWzg','FORD','FIESTA',1.6),
('XbVmtFf','FORD','FIESTA',1.3),
('l8qZy0H','SKODA','OCTAVIA',1.3),
('EDUbxaU','MERCEDED_BENZ','E CLASS',1.6),
('nWLd82o','FORD','FIESTA',1.7),
('4AkoyWx','FORD','FIESTA',1),
('nOoO25v','FORD','FIESTA',1.3),
('VAm5aV8','NISSAN','QASHQAI',1.4),
('zbd3cie','FORD','FIESTA',1.5),
('hyAN71W','NISSAN','QASHQAI',1),
('FxACHDf','FIAT','500',1.7),
('wOZdaeV','FORD','FIESTA',1.6),
('gfxZl99','VAUXHALL','CORSA',1.1),
('06HhwEJ','SKODA','OCTAVIA',1.7),
('PCTgYiG','KIA','RIO',1.7),
('U54WXZQ','KIA','RIO',1.6),
('FHgrRiF','FORD','FIESTA',1.6),
('R3jP73p','SKODA','OCTAVIA',1.5),
('etVPKX9','SUZUKI','SWIFT',1.1),
('BE3yReB','FORD','FIESTA',1.7),
('zXmX878','FORD','FIESTA',1.6),
('wdM3P2m','FORD','FIESTA',1.7),
('tb727BM','FORD','FIESTA',1.1)
SELECT * FROM #tmpTable
You can apply a Windowed Aggregate to get the overall count:
SELECT make
     , model
     , count(*) as cnt                 -- count per Model
     , cast(count(*) * 100.0           -- compared to all counts
            / sum(count(*)) over ()
            as dec(5,2)) as distribution
FROM #tmptable
group by make, model
order by distribution desc;
If you want the percentage of each Model within its Make, you need to add PARTITION BY:
SELECT make
     , model
     , count(*) as cnt                 -- count per Model
     , cast(count(*) * 100.0
            / sum(count(*)) over (partition by Make)  -- compared to all counts per Make
            as dec(5,2)) as distribution
FROM #tmptable
group by make, model
order by make, distribution desc;
You can use conditional aggregation to get the ratio of the count of Ford Fiestas to the total count.
SELECT 100.0
       * count(CASE
                   WHEN make = 'FORD'
                        AND model = 'FIESTA'
                   THEN 1
               END)
       / count(*)
FROM #tmptable;
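An equivalent formulation, offered as a sketch rather than something from the original answer, averages a 0/1 flag instead of dividing two counts (the fiesta_pct alias is made up here):
SELECT 100.0 * AVG(CASE WHEN make = 'FORD' AND model = 'FIESTA'
                        THEN 1.0 ELSE 0.0 END) AS fiesta_pct  -- mean of a 0/1 flag = proportion
FROM #tmptable;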
Edit:
If you want the figures for all car models, you can simply aggregate and group to get the count for each model, then divide by the total count, which you can get via a subquery.
SELECT make,
       model,
       100.0
       * count(*)
       / (SELECT count(*)
          FROM #tmptable) AS distribution
FROM #tmptable
GROUP BY make,
         model;

Set operators problem when trying to find MIN and MAX date for a condition

I want to find the minimum and maximum date on which a given product had its maximum unit cost, but only if the unit cost had a value of more than 10.
The result I should get would look something like this: if the date is the minimum, date_type is MIN, and vice versa.
This is the script that I tried, but I only get two records:
select MIN(dt) as min_date, MAX(dt) as max_date
from products
WHERE price = (SELECT max(price) FROM products where unit_id =1)
Products table: (screenshot omitted)
Unit table: (screenshot omitted)
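A minimal set-operator sketch of the desired output, assuming the products columns dt, price, and unit_id used in the script above (this is an illustration, not a posted answer):
SELECT MIN(dt) AS dt, 'MIN' AS date_type
FROM products
WHERE unit_id = 1
  AND price = (SELECT MAX(price) FROM products WHERE unit_id = 1 AND price > 10)
UNION ALL
SELECT MAX(dt) AS dt, 'MAX' AS date_type
FROM products
WHERE unit_id = 1
  AND price = (SELECT MAX(price) FROM products WHERE unit_id = 1 AND price > 10);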

T-SQL - Get last as-at date SUM(Quantity) was not negative

I am trying to find a way to get the last date, by location and product, on which a sum was positive. The only way I can think to do it is with a cursor, and if that's the case I may as well just do it in code. Before I go down that route, I was hoping someone might have a better idea.
Table:
Product, Date, Location, Quantity
The scenario is: I find the quantity by location and product at a particular date; if it is negative, I need to get the sum and the date when the group was last positive.
select
    Product,
    Location,
    SUM(Quantity) Qty,
    SUM(Value) Value
from
    ProductTransactions PT
where
    Date <= @AsAtDate
group by
    Product,
    Location
I am looking for the last date where the sum of the transactions up to and including it is positive.
Based on your revised question and your comment, here another solution I hope answers your question.
select Product, Location, max(Date) as Date
from (
    select a.Product, a.Location, a.Date
    from ProductTransactions as a
    join ProductTransactions as b
        on a.Product = b.Product and a.Location = b.Location
    where b.Date <= a.Date
    group by a.Product, a.Location, a.Date
    having sum(b.Value) >= 0
) as T
group by Product, Location
The subquery (table T) produces a list of {product, location, date} rows for which the running sum of the values up to and including that date is non-negative. From that set, we select the last date for each {product, location} pair.
This can be done in a set-based way using windowed aggregates to construct the running total. Depending on the number of rows in the table this could be a bit slow, but you can't really limit the time range going backwards, as the last positive date is an unknown quantity.
I've used a CTE for convenience to construct the aggregated data set, but converting that to a temp table should be faster. (A CTE is re-evaluated each time it is referenced, whereas a temp table is populated only once.)
The basic theory is to construct the running totals for all of the previous days, using the OVER clause to partition and order the SUM aggregates. This data set is then filtered to the expected date. When a row in that table has a quantity less than zero, it is joined back to the aggregate data set for all previous days for that product and location where the quantity was greater than zero.
Since this may return multiple positive-date rows, the ROW_NUMBER() function is used to order the rows by the date of the positive-quantity day. This is done in descending order so that row number 1 is the most recent positive day. It isn't possible to use a simple MIN() here because the MIN([Date]) may not correspond to the MIN(Quantity).
WITH x AS (
    SELECT [Date],
           Product,
           [Location],
           SUM(Quantity) OVER (PARTITION BY Product, [Location] ORDER BY [Date] ASC) AS Quantity,
           SUM([Value]) OVER (PARTITION BY Product, [Location] ORDER BY [Date] ASC) AS [Value]
    FROM ProductTransactions
    WHERE [Date] <= @AsAtDate
)
SELECT [Date], Product, [Location], Quantity, [Value], Positive_date, Positive_date_quantity
FROM (
    SELECT x1.[Date], x1.Product, x1.[Location], x1.Quantity, x1.[Value],
           x2.[Date] AS Positive_date, x2.[Quantity] AS Positive_date_quantity,
           ROW_NUMBER() OVER (PARTITION BY x1.Product, x1.[Location] ORDER BY x2.[Date] DESC) AS Positive_date_row
    FROM x AS x1
    LEFT JOIN x AS x2 ON x1.Product = x2.Product AND x1.[Location] = x2.[Location]
        AND x2.[Date] < x1.[Date] AND x1.Quantity < 0 AND x2.Quantity > 0
    WHERE x1.[Date] = @AsAtDate
) AS y
WHERE Positive_date_row = 1
Do you mean that you want the last date on which the running quantity for the group turned positive?
For example, if you are using SQL Server 2012+:
In the following scenario, by 01/03/2017 the running sum of quantity comes to 1 (-10 + 5 + 6).
Is it possible for the running quantity to go negative again on a later date?
;WITH tb(Product, Location, [Date], Quantity) AS (
    SELECT 'A', 'B', CONVERT(DATETIME, '01/01/2017'), -10 UNION ALL
    SELECT 'A', 'B', '01/02/2017', 5 UNION ALL
    SELECT 'A', 'B', '01/03/2017', 6 UNION ALL
    SELECT 'A', 'B', '01/04/2017', 2
)
SELECT t.Product,
       t.Location,
       SUM(t.Quantity) AS Qty,
       MIN(CASE WHEN t.CurrentSum > 0 THEN t.[Date] ELSE NULL END) AS LastPositiveDate
FROM (
    SELECT *, SUM(tb.Quantity) OVER (ORDER BY [Date]) AS CurrentSum FROM tb
) AS t
GROUP BY t.Product, t.Location
Product Location Qty LastPositiveDate
------- -------- ----------- -----------------------
A B 3 2017-01-03 00:00:00.000

Distance based on Location Datatype

I have 2 tables as below.
Cust_Master:
Cust_ID Location Distance WHID
Cust10001 0xE6100000010CA986FD9E58172A40425A63D009685340 ??? ???
Cust10002 0xE6100000010C7BD976DA1A992F4071766B990C835340 ??? ???
WH_Master:
WH_ID Location
WH1001 0xE6100000010C84F068E388C54240373811FDDA5B5340
WH1002 0xE6100000010C5BB1BFEC9E142A407DEA58A5F4675340
I would like to populate Distance and WHID in the Cust_Master table based on the locations in WH_Master. Can someone shed some light on this?
Create a temporary table with a cross join of distances and warehouse IDs for the customers; then, based on your requirement (closest or furthest), select from this table and use the data to populate the customer table. You will need to compute the distances, and then simply run an update based on customer ID.
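A minimal sketch of that temp-table approach, assuming the tables and columns shown above (the #CustWHDistance name is made up here):
-- Cross join every customer with every warehouse and keep the geographic distance.
SELECT c.Cust_ID,
       w.WH_ID,
       c.Location.STDistance(w.Location) AS Distance  -- metres for the geography type
INTO #CustWHDistance
FROM Cust_Master c
CROSS JOIN WH_Master w;

-- Populate each customer row with its closest warehouse.
UPDATE c
SET c.Distance = d.Distance,
    c.WHID = d.WH_ID
FROM Cust_Master c
JOIN #CustWHDistance d
    ON d.Cust_ID = c.Cust_ID
WHERE d.Distance = (SELECT MIN(d2.Distance)
                    FROM #CustWHDistance d2
                    WHERE d2.Cust_ID = d.Cust_ID);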
Using STDistance() to compute distance and then a ranking function should work:
UPDATE Cust_Master
SET
    Distance = OuterQuery.Distance,
    WHID = OuterQuery.WH_ID
FROM (
    SELECT
        Cust_ID,
        WH_ID,
        Distance,
        RANK() OVER (PARTITION BY Cust_ID ORDER BY Distance ASC) AS RNK
    FROM (
        SELECT Cust_ID, WH_ID, Cust_Master.Location.STDistance(WH_Master.Location) AS Distance
        FROM Cust_Master
        CROSS JOIN WH_Master
    ) InnerQuery
) OuterQuery
WHERE RNK = 1 AND Cust_Master.Cust_ID = OuterQuery.Cust_ID
This updates the Cust_Master table with the WH_ID and Distance of the closest warehouse in WH_Master for each customer. If you want the n-th closest, you can change RNK = 1 to RNK = n.

How do I exclude outliers from an aggregate query?

I'm creating a report comparing total time and volume across units. Here a simplification of the query I'm using at the moment:
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING COUNT(*) > 15
However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)
How do I do that?
You can exclude the top and bottom 5% with NTILE(20): ordered by time, bucket 1 holds the lowest 5% of rows and bucket 20 the highest.
SELECT m.Unit,
       COUNT(*) AS Count,
       SUM(m.TimeInMinutes) AS TotalTime
FROM
    (SELECT
         m.Unit,
         m.TimeInMinutes,  -- needed by the outer SUM
         NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
     FROM
         main_table m
     WHERE
         m.unit <> '' AND m.TimeInMinutes > 0
    ) m
WHERE
    Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING COUNT(*) > 15
One way would be to exclude the outliers with a not in clause:
where m.ID not in
(
select top 5 percent ID
from main_table
order by
TimeInMinutes desc
)
And add another not in clause for the bottom five percent, as combined in the sketch below.
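Putting both exclusions together (a sketch, assuming main_table has an ID key as the clause above implies):
SELECT m.Unit,
       COUNT(*) AS Count,
       SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.Unit <> ''
  AND m.TimeInMinutes > 0
  AND m.ID NOT IN (SELECT TOP 5 PERCENT ID   -- drop the slowest 5%
                   FROM main_table
                   ORDER BY TimeInMinutes DESC)
  AND m.ID NOT IN (SELECT TOP 5 PERCENT ID   -- drop the quickest 5%
                   FROM main_table
                   ORDER BY TimeInMinutes ASC)
GROUP BY m.Unit
HAVING COUNT(*) > 15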
NTILE is quite inexact. If you run NTILE against the sample view below, you will see that it catches an indeterminate number of rows instead of 90% from the center. The suggestion to use TOP 95 PERCENT, then reverse with TOP 90 PERCENT, is almost correct, except that 90% x 95% gives you only 85.5% of the original dataset. So you would have to do:
select top 94.7368 percent *
from (
    select top 95 percent *
    from main_table
    order by .. ASC
) X
order by .. DESC
First create a view to match your table column names
create view main_table
as
select type unit, number as timeinminutes from master..spt_values
Try this instead
select Unit, COUNT(*), SUM(TimeInMinutes)
FROM
(
    select *,
           ROW_NUMBER() over (order by TimeInMinutes) rn,
           COUNT(*) over () countRows
    from main_table
) N -- Numbered
where rn between countRows * 0.05 and countRows * 0.95
group by Unit, N.countRows * 0.05, N.countRows * 0.95
having count(*) > 20
The HAVING clause is applied to the remaining set after removing outliers.
For a dataset of 1,1,1,1,1,1,2,5,6,19, the use of ROW_NUMBER allows you to correctly remove just one instance of the 1's.
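To see that tie-breaking at work, here is a self-contained sketch (the inline VALUES table is made up for illustration) using a 10%/90% cut on that ten-row set; it removes exactly one of the 1's and the 19:
;WITH v(TimeInMinutes) AS (
    SELECT x FROM (VALUES (1),(1),(1),(1),(1),(1),(2),(5),(6),(19)) t(x)
)
SELECT TimeInMinutes
FROM (
    SELECT TimeInMinutes,
           ROW_NUMBER() OVER (ORDER BY TimeInMinutes) AS rn,
           COUNT(*) OVER () AS countRows
    FROM v
) n
WHERE rn > countRows * 0.10    -- drops rn 1: one of the six 1's
  AND rn <= countRows * 0.90;  -- drops rn 10: the 19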
I think the most robust way is to sort the list into order and then exclude the top and bottom extremes. For a hundred values, you would sort ascending and take the first 95 PERCENT, then sort descending and take the first 90 PERCENT.
