Measure all distances and return the shortest value - SQL Server

I have one table with a list of stores, approximately 100 or so, with lat/long. In a second table I have a list of customers, also with lat/long, with more than 500k rows.
I need to find the closest store to each customer. Currently I am using the geography data type with the STDistance function to calculate the distance between two points. This is functioning fine, but I am getting hung up on the most efficient ways to process this.
Option #1 - Cartesian join Customer_table to Store_table, process the distance calculation, rank the results, and filter to #1 (sketched below). My concern is that with a 1 million row customer list and 100 stores, you are creating a 100 million row intermediate result, and the rank function on top of that may be taxing.
Option #2 - With some dynamic SQL, create a pivoted table that has each customer in the first column, with each subsequent column holding the calculated distance to each store. From there I can unpivot and apply the same rank/over logic described in Option #1.
EXAMPLE
CUST_ID  LAT    LONG   STORE1DIST  STORE2DIST  STORE3DIST
1        20.00  30.00  4.5         5.6         7.8
2        20.00  30.00  7.4         8.1         8.5
I'm not clear on which would be more efficient, or which will keep the DBAs from wanting to come find me.
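For concreteness, here is a minimal sketch of what Option #1 would look like; the geography columns CustGeo and StoreGeo are assumptions, not actual column names from my schema:

-- Option #1 sketch: cross join customers to stores, rank by distance per
-- customer, keep rank 1. CustGeo/StoreGeo are assumed geography columns.
WITH Ranked AS (
    SELECT c.CUST_ID,
           s.STORE_ID,
           c.CustGeo.STDistance(s.StoreGeo) AS DistMeters,
           ROW_NUMBER() OVER (PARTITION BY c.CUST_ID
                              ORDER BY c.CustGeo.STDistance(s.StoreGeo)) AS rn
    FROM Customer_table c
    CROSS JOIN Store_table s
)
SELECT CUST_ID, STORE_ID, DistMeters
FROM Ranked
WHERE rn = 1;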
Thanks for the input in advance!

You can unpivot the data into one row per customer and store distance, then rank the rows per customer and keep the smallest StoreDistance along with its store name.
SELECT CUST_ID, STOREDIST AS StoreDistance, STORES AS StoreName
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY STOREDIST) AS rn
    FROM (SELECT CUST_ID, LAT, LONG, STORE1DIST, STORE2DIST, STORE3DIST
          FROM Cust /* your table */) p
    UNPIVOT (STOREDIST FOR STORES IN (STORE1DIST, STORE2DIST, STORE3DIST)) AS unpvt
) ranked
WHERE rn = 1
This will give you the result as:
CUST_ID StoreDistance StoreName
-----------------------------------
1 4.5 STORE1DIST
2 7.4 STORE1DIST

I have a similar situation at my job. I use a distance function like this (it returns kilometers; use 3960 * instead of 6373 * to return miles):
CREATE Function MySTDistance(@lat1 float, @lon1 float, @lat2 float, @lon2 float)
returns smallmoney
as
begin
    return IsNull(6373 * acos((sin(radians(@lat1)) * sin(radians(@lat2)))
         + (cos(radians(@lat1)) * cos(radians(@lat2)) * cos(radians(@lon1 - @lon2)))), 0)
end
then you look for the closest store by doing something like...
select C.Cust_Id
      ,Store_id =
       (select top (1) S.Store_id
        from Store_Table S
        order by dbo.MySTDistance(S.lat, S.long, C.lat, C.long))
from Customer_Table C
Now you have each customer id with its closest store id. It's quite fast even with a huge volume of customers (at least in my case).


Handling a very big table in SQL Server - performance

I'm having some trouble dealing with a very big table in my database. Before talking about the problem, let me describe what I want to achieve.
I have two source tables :
Source 1: SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
Source 2: CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)
Target: DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)
The idea is to find the top N nearest SALES_MAN for every client using a ROW_NUMBER in the target table.
What I'm currently doing is calculating the distance between every client and every salesman:
INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
                      [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
SELECT
    [ID_SMAN], [ID_CLIENT],
    -- WKT POINT takes longitude first, then latitude
    geography::STGeomFromText('POINT(' + CLONGITUDE + ' ' + CLATITUDE + ')', 4326)
        .STDistance(geography::STGeomFromText('POINT(' + SM_LONGITUDE + ' ' + SM_LATITUDE + ')', 4326)) / 1000 AS distance,
    [SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE]
FROM
    [dbo].[SALES_MAN], [dbo].[CLIENT]
The DISTANCE table contains approximately 1 billion rows.
The second step, to get my five nearest salesmen per client, is to run this query:
SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
FROM DISTANCE) TAB
WHERE rang < 6
The last query is really expensive. To avoid the SORT operator I tried to create a sorted nonclustered index on DISTANCE and ID_CLIENT, but it did not work. I also tried to include all the needed columns in both indexes.
But when I created a clustered index on DISTANCE and kept the sorted nonclustered index on ID_CLIENT, things went better.
So why is a sorted nonclustered index not working in this case?
And when I use the clustered index, I have another problem loading data: I'm essentially forced to drop it before starting the loading process.
So what do you think? How can we deal with this kind of table so that we can select, insert, or update data without performance issues?
Many thanks
Too long for a comment, but consider the following points.
Item 1) Consider adding a Geography field to each of your source tables. This will eliminate the redundant GEOGRAPHY::Point() function calls
Update YourTable Set GeoPoint = GEOGRAPHY::Point([Lat], [Lng], 4326)
So then the calculation for distance would simply be
,InMeters = C.GeoPoint.STDistance(S.GeoPoint)
,InMiles = C.GeoPoint.STDistance(S.GeoPoint) / 1609.344
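For completeness, a hedged sketch of that one-time setup against the question's schema (the GeoPoint column name is this answer's assumption):

-- Add and populate a precomputed geography column on each source table.
ALTER TABLE CLIENT ADD GeoPoint geography
ALTER TABLE SALES_MAN ADD GeoPoint geography

UPDATE CLIENT SET GeoPoint = GEOGRAPHY::Point(CLATITUDE, CLONGITUDE, 4326)
UPDATE SALES_MAN SET GeoPoint = GEOGRAPHY::Point(SM_LATITUDE, SM_LONGITUDE, 4326)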
Item 2) Rather than generating EVERY possible combination, consider adding a condition to the JOIN. Keep in mind that one degree of latitude is approximately 69 miles (degrees of longitude shrink toward the poles), so you can reduce the search area. For example
From CLIENT C
Join SALES_MAN S
on S.Lat between C.Lat-1 and C.Lat+1
and S.Lng between C.Lng-1 and C.Lng+1
This +/- 1 could be any reasonable value (e.g. 0.5 or even 2.0).
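Putting both items together, a hedged sketch using the question's schema (GeoPoint is the precomputed column from Item 1; rang < 6 keeps the five nearest salesmen per client):

SELECT ID_CLIENT, ID_SMAN, InMeters
FROM (
    SELECT C.ID_CLIENT, S.ID_SMAN,
           C.GeoPoint.STDistance(S.GeoPoint) AS InMeters,
           ROW_NUMBER() OVER (PARTITION BY C.ID_CLIENT
                              ORDER BY C.GeoPoint.STDistance(S.GeoPoint)) AS rang
    FROM CLIENT C
    JOIN SALES_MAN S
      ON S.SM_LATITUDE  BETWEEN C.CLATITUDE  - 1 AND C.CLATITUDE  + 1
     AND S.SM_LONGITUDE BETWEEN C.CLONGITUDE - 1 AND C.CLONGITUDE + 1
) x
WHERE rang < 6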
ROW_NUMBER is a window function that has to sort every row covered by its PARTITION BY/ORDER BY before it can number them, so it's better to filter your result before applying ROW_NUMBER. Change the following code:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE)
rang FROM DISTANCE
) TAB
WHERE rang < 6
into this:
WITH DISTANCE_CLIENT_IDS (CLIENT_ID) AS
(
    SELECT DISTINCT ID_CLIENT
    FROM DISTANCE
)
SELECT Dx.*
FROM DISTANCE_CLIENT_IDS D1
CROSS APPLY
(
    SELECT Dt.*, ROW_NUMBER() OVER (ORDER BY Dt.DISTANCE) AS rang
    FROM (
        SELECT TOP (5) *
        FROM DISTANCE D2
        WHERE D2.ID_CLIENT = D1.CLIENT_ID
        ORDER BY D2.DISTANCE
    ) Dt
) Dx
and make sure you have added indexes on both the ID_CLIENT and DISTANCE columns.
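For instance, an index along these lines (the name is an assumption) lets each client's five nearest rows be read in order without a sort:

CREATE NONCLUSTERED INDEX IX_DISTANCE_Client_Distance
    ON DISTANCE (ID_CLIENT, DISTANCE)
    INCLUDE (ID_SMAN)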

How to calculate the "Nearest Neighbour" for multiple sources in SQL Server?

The "Nearest Neighbour" problem is very common when working with spatial data.
There's even some nice, simple documentation about how to do it with MS Sql Server in their docs!
I usually see examples where a single source lat/long is used and the 'x' nearest neighbour lat/longs are returned. Fine...
e.g.
USE AdventureWorks2012
GO
DECLARE @g geography = 'POINT(-121.626 47.8315)';
SELECT TOP(7) SpatialLocation.ToString(), City FROM Person.Address
WHERE SpatialLocation.STDistance(@g) IS NOT NULL
ORDER BY SpatialLocation.STDistance(@g);
In my case, I have multiple lat/long sources, and for each source I need to return the 'x' nearest neighbours.
Here's my schema
Table: SomeGeogBoundaries
LocationId INTEGER PRIMARY KEY (it's not an identity, but a PK & FK)
CentrePoint GEOGRAPHY
Index:
Spatial Index on CentrePoint column. [Geography || MEDIUM, MEDIUM, HIGH, HIGH]
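In DDL terms, that index corresponds to something like the following sketch (the index name is an assumption):

CREATE SPATIAL INDEX SIdx_SomeGeogBoundaries_CentrePoint
    ON SomeGeogBoundaries (CentrePoint)
    USING GEOGRAPHY_GRID
    WITH (GRIDS = (LEVEL_1 = MEDIUM, LEVEL_2 = MEDIUM, LEVEL_3 = HIGH, LEVEL_4 = HIGH))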
Sample data:
LocationId | CP Lat/Long
1 | 10,10
2 | 11,11
3 | 20,20
..
So for each location in this table, I need to find the closest, say, 5 other locations.
Update
So far it looks like using a CURSOR is the only way, but I'm open to more set-based solutions.
You need to find the nearest neighbors within the same set?
SELECT *
FROM SomeGeogBoundaries AS b
OUTER APPLY (
    SELECT TOP(5) t.CentrePoint
    FROM SomeGeogBoundaries AS t
    WHERE t.CentrePoint.STIntersects(b.CentrePoint.STBuffer(100)) = 1
    ORDER BY b.CentrePoint.STDistance(t.CentrePoint)
) AS nn
Two notes.
The WHERE clause in the OUTER APPLY limits the search to points that are within (in this case) 100 meters of each other (assuming you're using an SRID whose native unit of measure is meters). That may or may not be appropriate for you; if not, just omit the WHERE clause.
I think this is still a cursor. Don't fool yourself into thinking that, just because there is nary a DECLARE CURSOR statement to be seen, the DB engine has much choice but to iterate through your table and evaluate the APPLY for each row.

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example: inside the UnitID column there are two separate records for 105, and I only want the returned data set to include the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows OVER (PARTITION BY UnitID ...) and pick the rows with rank 2; see the sketch below.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
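A minimal sketch of that approach, reusing the JobUnits table and the JobUnitKeyID ordering column that other answers in this thread assume about your schema:

WITH Ranked AS (
    SELECT *,
           RANK() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS rnk
    FROM JobUnits
)
SELECT *
FROM Ranked
WHERE rnk = 2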
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
    SELECT
        DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
        *
    FROM JobUnits
    WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
ORDER BY UnitID DESC;
This is better than a solution that uses Min(JobUnitKeyID) or Max(JobUnitKeyID), for multiple reasons:
There could be more than one duplicate, in which case you would need Min(JobUnitKeyID) in conjunction with UnitID, joining back on UnitID where JobUnitKeyID <> MinJobUnitKeyID.
Using Min or Max requires you to join back to the same data, which is inherently slower.
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

PostgreSQL Crosstab - variable number of columns

A common beef I get when trying to evangelize the benefits of learning freehand SQL to MS Access users is the complexity of recreating the effects of a crosstab query in the manner Access does it. I realize that, strictly speaking, SQL doesn't work that way -- the reason it's possible in Access is that Access handles the rendering of the data.
Specifically, when I have a table with entities, dates and quantities, we frequently want to see a single entity on one line with the dates represented as columns:
This:
entity date qty
------ -------- ---
278700-002 1/1/2016 5
278700-002 2/1/2016 3
278700-002 2/1/2016 8
278700-002 3/1/2016 1
278700-003 2/1/2016 12
Becomes this:
Entity 1/1/16 2/1/16 3/1/16
---------- ------ ------ ------
278700-002 5 11 1
278700-003 12
That said, the common way we've approached this is something similar to this:
with vals as (
select
entity,
case when order_date = '2016-01-01' then qty else 0 end as q16_01,
case when order_date = '2016-02-01' then qty else 0 end as q16_02,
case when order_date = '2016-03-01' then qty else 0 end as q16_03
from mydata
)
select
entity, sum (q16_01) as q16_01, sum (q16_02) as q16_02, sum (q16_03) as q16_03
from vals
group by entity
This is radically oversimplified, but I believe most people will get my meaning.
The main problem with this is not the limit on the number of columns -- the data is typically bounded, and I can make do with a fixed number of date columns -- 36 months, or whatever, depending on the context of the data. My issue is the fact that I have to change the dates every month to make this work.
I had an idea that I could leverage arrays to dynamically assign the quantity to the index of the array, based on the month away from the current date. In this manner, my data would end up looking like this:
Entity Values
---------- ------
278700-002 {5,11,1}
278700-003 {0,12,0}
This would be quite acceptable, as I could manage the rendering of the actual columns within whatever rendering tool I was using (Excel, for example).
The problem is I'm stuck on how to get from my data to this. If this were Perl, I would loop through the data and do something like this:
foreach my $ref (@data) {
    my ($entity, $month_offset, $qty) = @$ref;
    $values{$entity}->[$month_offset] += $qty;
}
But this isn't Perl... so far, this is what I have, and now I'm at a mental impasse.
with offset as (
select
entity, order_date, qty,
(extract (year from order_date ) - 2015) * 12 +
extract (month from order_date ) - 9 as month_offset,
array[]::integer[] as values
from mydata
)
select
    entity, order_date, -- oh my... how do I load into my array?
from offset
The "2015" and the "9" are not really hard-coded -- I put them there for simplicity sake for this example.
Also, if my approach or my assumptions are totally off, I trust someone will set me straight.
As with all things imaginable and unimaginable, there is a way to do this with PostgreSQL. It looks like this:
WITH cte AS (
WITH minmax AS (
SELECT min(extract(month from order_date))::int,
max(extract(month from order_date))::int
FROM mytable
)
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
(SELECT generate_series(min, max) AS mon FROM minmax) allmonths
UNION
SELECT entity, extract(month from order_date)::int, qty FROM mytable
)
SELECT entity, array_agg(sum) AS values
FROM (
SELECT entity, mon, sum(qty) FROM cte
GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;
A few words of explanation:
The standard way to produce an array inside a SQL statement is to use the array_agg() function. Your problem is that you have months without data and then array_agg() happily produces nothing, leaving you with arrays of unequal length and no information on where in the time period the data comes from. You can solve this by adding 0's for every combination of 'entity' and the months in the period of interest. That is what this snippet of code does:
SELECT entity, mon, 0 AS qty
FROM (SELECT DISTINCT entity FROM mytable) entities,
(SELECT generate_series(min, max) AS mon FROM minmax) allmonths
All those 0's are UNIONed to the actual data from 'mytable' and then (in the main query) you can first sum up the quantities by entity and month and subsequently aggregate those sums into an array for each entity. Since it is a double aggregation you need the sub-query. (You could also sum the quantities in the UNION but then you would also need a sub-query because UNIONs don't allow aggregation.)
The minmax CTE can be adjusted to include the year as well (your sample data doesn't need it). Do note that the actual min and max values are immaterial to the index in the array: if min is 743 it will still occupy the first position in the array; those values are only used for GROUPing, not indexing.
SQLFiddle
For ease of use you could wrap this query up in a SQL language function with parameters for the starting and ending month. Adjust the minmax CTE to produce appropriate min and max values for the generate_series() call and in the UNION filter the rows from 'mytable' to be considered.
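A hedged sketch of that wrapper (the function name, parameter names, and the out_entity/vals output columns are assumptions; 'entity' is assumed to be text in mytable):

-- Wraps the query above with month-window parameters, per the suggestion.
CREATE FUNCTION entity_month_qty(first_mon int, last_mon int)
RETURNS TABLE (out_entity text, vals int[]) AS $$
WITH cte AS (
    -- zero-fill every entity/month combination in the requested window
    SELECT entity, mon, 0 AS qty
    FROM (SELECT DISTINCT entity FROM mytable) entities,
         generate_series(first_mon, last_mon) AS mon
    UNION
    -- actual data, filtered to the same window
    SELECT entity, extract(month from order_date)::int, qty
    FROM mytable
    WHERE extract(month from order_date)::int BETWEEN first_mon AND last_mon
)
SELECT entity, array_agg(sum::int ORDER BY mon)
FROM (
    SELECT entity, mon, sum(qty) FROM cte
    GROUP BY 1, 2) sub
GROUP BY 1
ORDER BY 1;
$$ LANGUAGE sql;

Usage would then be, for example: SELECT * FROM entity_month_qty(1, 3);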

Random record from a database table (T-SQL)

Is there a succinct way to retrieve a random record from a sql server table?
I would like to randomize my unit test data, so am looking for a simple way to select a random id from a table. In English, the select would be "Select one id from the table where the id is a random number between the lowest id in the table and the highest id in the table."
I can't figure out a way to do it without having to run the query, test for a null value, and then re-run if null.
Ideas?
Is there a succinct way to retrieve a random record from a sql server table?
Yes
SELECT TOP 1 * FROM table ORDER BY NEWID()
Explanation
A NEWID() is generated for each row and the table is then sorted by it. The first record is returned (i.e. the record with the "lowest" GUID).
Notes
GUIDs are generated as pseudo-random numbers since version four:
The version 4 UUID is meant for generating UUIDs from truly-random or
pseudo-random numbers.
The algorithm is as follows:
Set the two most significant bits (bits 6 and 7) of the
clock_seq_hi_and_reserved to zero and one, respectively.
Set the four most significant bits (bits 12 through 15) of the
time_hi_and_version field to the 4-bit version number from
Section 4.1.3.
Set all the other bits to randomly (or pseudo-randomly) chosen
values.
—A Universally Unique IDentifier (UUID) URN Namespace - RFC 4122
The alternative SELECT TOP 1 * FROM table ORDER BY RAND() will not work as one might think: RAND() returns a single value per query, so all rows share the same value.
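You can see this for yourself with a quick probe (sys.objects is just a convenient rowset):

-- RAND() is evaluated once per query, NEWID() once per row.
SELECT TOP 5 RAND() AS same_for_all_rows, NEWID() AS different_per_row
FROM sys.objects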
While GUID values are pseudo-random, you will need a better PRNG for more demanding applications.
Typical performance is less than 10 seconds for around 1,000,000 rows, depending of course on the system. Note that it's impossible for this query to hit an index, so performance will be relatively limited.
On larger tables you can also use TABLESAMPLE for this to avoid scanning the whole table.
SELECT TOP 1 *
FROM YourTable
TABLESAMPLE (1000 ROWS)
ORDER BY NEWID()
The ORDER BY NEWID() is still required to avoid just returning rows that appear first on the data page.
The number of rows to sample needs to be chosen carefully for the size and definition of the table, and you might consider retry logic if no row is returned. The maths behind this, and why the technique is not suited to small tables, is discussed here
You can also try your method of getting a random Id between MIN(Id) and MAX(Id), and then
SELECT TOP 1 * FROM table WHERE Id >= @yourrandomid
It will always get you one row.
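A fuller sketch of that idea (the YourTable name and integer Id column are assumptions; note that gaps in Id bias the selection toward rows that follow large gaps):

-- Pick a random point in the Id range, then take the first row at or above it.
DECLARE @yourrandomid int
SELECT @yourrandomid = MIN(Id) + CAST(RAND() * (MAX(Id) - MIN(Id)) AS int)
FROM YourTable

SELECT TOP 1 * FROM YourTable WHERE Id >= @yourrandomid ORDER BY Id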
If you are sampling from a large table, the best way that I know is:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM
(keycol1, NEWID())) as int))
% 100) < 10
Source: MSDN
I was looking to improve on the methods I had tried and came across this post. I realize it's old, but this method is not listed. I am creating and applying test data; this shows the method for "address" in a stored procedure called with @st (two-char state):
Create Table ##TmpAddress (id Int Identity(1,1), street VarChar(50), city VarChar(50), st VarChar(2), zip VarChar(5))

Insert Into ##TmpAddress (street, city, st, zip)
Select street, city, st, zip
From tbl_Address (NOLOCK)
Where st = @st

-- unseeded RAND() will return the same number when called in rapid succession, so
-- here I seed it with a guaranteed-different number each time. @@ROWCOUNT is the
-- row count from the most recent table operation.
Set @csr = Ceiling(RAND(convert(varbinary, newid())) * @@ROWCOUNT)

Select street, city, st, Right(('00000' + ltrim(zip)), 5) As zip
From ##TmpAddress (NOLOCK)
Where id = @csr
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that
NEWID() evaluates once per row to achieve sampling on a per-row basis.
The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int) evaluates to a random float value between 0 and 1.
Source: http://technet.microsoft.com/en-us/library/ms189108(v=sql.105).aspx
This is further explained below:
How does this work? Let's split out the WHERE clause and explain it.
The CHECKSUM function is calculating a checksum over the items in the list. It is arguable whether SalesOrderID is even required, since NEWID() is a function that returns a new random GUID, so combining it with a constant should result in a random value in any case. Indeed, excluding SalesOrderID seems to make no difference. If you are a keen statistician and can justify the inclusion of this, please use the comments section below and let me know why I'm wrong!
The CHECKSUM function returns an integer. Performing a bitwise AND operation with 0x7fffffff, which is the equivalent of (111111111...) in binary, yields a decimal value that is effectively a representation of a random string of 0s and 1s. Dividing by the coefficient 0x7fffffff effectively normalizes this decimal figure to a figure between 0 and 1. Then, to decide whether each row merits inclusion in the final result set, a threshold of 1/x is used (in this case, 0.01), where x is the percentage of the data to retrieve as a sample.
Source: https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling
