Distance Calculation with huge SQL Server database - sql-server

I have a huge database of businesses (about 500,000) with zipcode, address, etc. I need to display them in ascending order of distance within a 100-mile area of the user's zipcode. I have a table of zipcodes with their latitude and longitude. Which will be the faster/better solution?
Case 1: calculate the distance and sort by it. I will have the user's current zipcode, latitude and longitude in session, and will calculate the distance using a SQL Server function.
Case 2: get all zipcodes within a 50-mile area and get the businesses with those zipcodes. Here I will have to write a nested select when finding businesses.
I think case 1 will calculate the distance for every business in the database, while case 2 will just fetch zipcodes and end up fetching only the required businesses. Hence case 2 should be better? I would appreciate any suggestions here.
Here is the LINQ query I have for case 1:
var businessListQuery = (from b in _DB.Businesses
let distance = _DB.CalculateDistance(b.Zipcode,userLattitude,userLogntitude)
where b.BusinessCategories.Any(bc => bc.SubCategoryId == subCategoryId)
&& distance < 100
orderby distance
select new BusinessDetails(b, distance.ToString()));
int totalRecords = businessListQuery.Count();
var ret = businessListQuery.ToList().Skip(startRow).Take(pageSize).ToList();
On a side note, the app is in C#.
Thanks

You could do worse than look at the GEOGRAPHY datatype, for example:
CREATE TABLE Places
(
SeqID INT IDENTITY(1,1),
Place NVARCHAR(20),
Location GEOGRAPHY
)
GO
INSERT INTO Places (Place, Location) VALUES ('Coventry', geography::Point(52.4167, -1.55, 4326))
INSERT INTO Places (Place, Location) VALUES ('Sheffield', geography::Point(53.3667, -1.5, 4326))
INSERT INTO Places (Place, Location) VALUES ('Penzance', geography::Point(50.1214, -5.5347, 4326))
INSERT INTO Places (Place, Location) VALUES ('Brentwood', geography::Point(52.6208, 0.3033, 4326))
INSERT INTO Places (Place, Location) VALUES ('Inverness', geography::Point(57.4760, -4.2254, 4326))
GO
SELECT p1.Place, p2.place, p1.location.STDistance(p2.location) / 1000 AS DistanceInKilometres
FROM Places p1
CROSS JOIN Places p2
GO
SELECT p1.Place, p2.place, p1.location.STDistance(p2.location) / 1000 AS DistanceInKilometres
FROM Places p1
INNER JOIN Places p2 ON p1.SeqID > p2.SeqID
GO
geography::Point takes the latitude and longitude as well as an SRID (Spatial Reference Identifier). In this case, the SRID is 4326, which is standard latitude and longitude. As you already have latitude and longitude, you can just ALTER TABLE to add the geography column, then UPDATE to populate it.
I've shown two ways to get the data out of the table; however, you can't create an indexed view with either of them (indexed views can't have self-joins). You could, though, create a secondary table that is effectively a cache, populated from the queries above. You then just have to worry about maintaining it (through triggers or some other process).
Note that with 500,000 rows the cross join will give you 250,000,000,000 rows, but searching is simple because you only need to look at one of the place columns (e.g., SELECT * FROM table WHERE Place1 = 'Sheffield' AND distance < 100). The second query will give you significantly fewer rows, but a search then needs to consider both the Place1 and Place2 columns.
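To make the row counts concrete, here is a small Python sketch of the arithmetic. The function names are made up for illustration; it only counts rows, it does not touch a database. It checks the formulas by brute force against the five sample places and then evaluates them for 500,000 rows.

```python
from itertools import combinations, product

def cross_join_rows(n: int) -> int:
    """Rows produced by a CROSS JOIN of an n-row table with itself."""
    return n * n

def unique_pair_rows(n: int) -> int:
    """Rows produced by the self-join ON p1.SeqID > p2.SeqID."""
    return n * (n - 1) // 2

places = ["Coventry", "Sheffield", "Penzance", "Brentwood", "Inverness"]

# Brute-force check of both formulas on the five sample places
assert len(list(product(places, repeat=2))) == cross_join_rows(len(places))
assert len(list(combinations(places, 2))) == unique_pair_rows(len(places))

print(cross_join_rows(500_000))   # 250000000000
print(unique_pair_rows(500_000))  # 124999750000
```

So the second form stores roughly half as many rows, at the cost of having to search both place columns.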

Related

How to query postgis data by closest point and only return results for that point?

I have a postgis table of points, 460 million records. It has a timestamp & point column.
I'm building graphs based on this data, a list of values for each timestamp that belong to the closest point, leaflet sends the lat/long from the map (where the user clicked) to the script that generates the chart-ready data.
SELECT thevalue
FROM thetable
WHERE ST_DWithin (thepoint, ST_MakePoint($get_lon, $get_lat), 0.04)
ORDER BY thedate
LIMIT 1000
This works great (for some clicks) but there has to be a better/faster way, I'd like the query to know what point to listen to and only return values for that point. Is there a better function for this requirement?
What kind of geometry do you have? What projection are you using?
I'm going to assume that your points are in wgs84 (epsg:4326)
If you want distances to be accurate, it's better to use geography in calculations:
alter table points_table add column geog geography;
update points_table set geog = geom::geography;
Create an index, then run cluster and analyze to speed up queries:
create index my_index_geog on points_table using gist(geog); /* change geog for geom if using geometry */
cluster points_table using my_index_geog;
analyze points_table;
to get the closest point:
SELECT point_id
FROM points_table
ORDER BY geog <-> ST_SetSrid(ST_MakePoint($get_lon, $get_lat),4326)::geography limit 1;
All together, to get the values:
select thevalue
from thetable
where point_id = (SELECT point_id
FROM points_table
ORDER BY geog <-> ST_SetSrid(ST_MakePoint($get_lon, $get_lat),4326)::geography limit 1)
order by thedate
limit 1000;
Additionally, I would suggest keeping a table that contains only the point ids and the geometry/geography, so the closest-point query runs faster. If you create such a table, called only_points, the query becomes:
select thevalue
from thetable
where point_id = (SELECT point_id
FROM only_points
ORDER BY geog <-> ST_SetSrid(ST_MakePoint($get_lon, $get_lat),4326)::geography limit 1)
order by thedate
limit 1000;
If you need to keep using geometry, then you'll need to create the index on the geometry, cluster based on geom and run the query:
select thevalue
from thetable
where point_id = (SELECT point_id
FROM points_table
ORDER BY geom::geography <-> ST_SetSrid(ST_MakePoint($get_lon, $get_lat),4326)::geography limit 1)
order by thedate
limit 1000;
It will be slower, however, because you'll be converting to geography on each step.
See "KNN in PostGIS" and "PostGIS geography type and indexes".

Measure all distances return shortest value

I have one table with a list of stores, approximately 100 or so, with lat/long. The second table is a list of customers with lat/long, and it has more than 500k rows.
I need to find the closest store to each customer. Currently I am using the geography data type with the STDistance function to calculate the distance between two points. This is functioning fine, but I am getting hung up on the most efficient way to process this.
Option #1 - Cartesian join Customer_table to Store_table, process the distance calculation, rank the results and filter to #1. My concern is that with a 1 million row customer list and 100 stores, you are creating a 100 million row table, and the rank function thereafter may be taxing.
Option #2 - With some dynamic SQL, create a pivoted table that has each customer in the first column, with each subsequent column holding the calculated distance to each store. From there, I can unpivot and then apply the same rank/over function described in the first option.
EXAMPLE
CUST_ID LAT LONG STORE1DIST STORE2DIST STORE3DIST
1 20.00 30.00 4.5 5.6 7.8
2 20.00 30.00 7.4 8.1 8.5
I'm not clear which would be the most efficient and would keep the DBAs from wanting to come find me.
Thanks for the input in advance!
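You can unpivot the data into one row per store distance and then pick the minimum StoreDistance for each customer. A plain GROUP BY with MIN(STOREDIST) gives the shortest distance, but MIN(STORES) would return the alphabetically first column name rather than the nearest store, so rank the unpivoted rows and keep the first one:
select CUST_ID, STOREDIST as StoreDistance, STORES as StoreName
from
(
select CUST_ID, STOREDIST, STORES,
ROW_NUMBER() OVER (PARTITION BY CUST_ID ORDER BY STOREDIST) as rn
from
(select CUST_ID, LAT, LONG, STORE1DIST, STORE2DIST, STORE3DIST from Cus/*Your table*/) p
UNPIVOT
(
STOREDIST FOR STORES IN (STORE1DIST, STORE2DIST, STORE3DIST)
) as unpvt
) ranked
where rn = 1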
This will give you the result as:
CUST_ID StoreDistance StoreName
-----------------------------------
1 4.5 STORE1DIST
2 7.4 STORE1DIST
I have a similar situation at my job. I use a distance function like this (it returns km; use 3960* instead of 6373* to return miles):
CREATE FUNCTION dbo.MySTDistance(@lat1 float, @lon1 float, @lat2 float, @lon2 float)
RETURNS smallmoney
AS
BEGIN
-- Spherical law of cosines; 6373 is the Earth radius in km
RETURN ISNULL(6373*ACOS((SIN(RADIANS(@lat1))*SIN(RADIANS(@lat2)))
+(COS(RADIANS(@lat1))*COS(RADIANS(@lat2))*COS(RADIANS(@lon1-@lon2)))),0)
END
then you look for the closest store by doing something like...
select C.Cust_Id
,Store_id=
(select top (1) Store_id
from Store_Table S
order by dbo.MySTDistance(S.lat, S.long, C.lat, C.long)
)
from Customer_Table C
Now you have each customer id with his closest store id. It's quite fast with a huge volume of customers (at least in my case).
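For anyone who wants to sanity-check the numbers outside the database, here is a rough Python sketch of the same "closest store per customer" idea. Note that it swaps in the haversine formula instead of the law of cosines used in the SQL function, and an Earth radius of 6371 km; the function names and the sample coordinates are assumptions for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption (the SQL function uses 6373)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 so rounding error cannot push asin out of its domain
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def nearest_store(cust_lat, cust_lon, stores):
    """stores maps store_id -> (lat, lon); returns (store_id, distance_km)."""
    return min(
        ((store_id, haversine_km(cust_lat, cust_lon, lat, lon))
         for store_id, (lat, lon) in stores.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical sample data
stores = {"Coventry": (52.4167, -1.55), "Inverness": (57.4760, -4.2254)}
store_id, km = nearest_store(53.3667, -1.5, stores)  # a customer in Sheffield
print(store_id, round(km, 1))
```

Like the correlated subquery, this does one pass over the stores per customer, which stays cheap while the store list is small (about 100 here).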

Calculation of Closest Distance of one table's IDs to another reference table without joining in reference table

Really confused on this: so I'm trying to get the rank of the city in terms of population from https://gist.github.com/Miserlou/c5cd8364bf9b2420bb29 (I converted this into a csv and uploaded to our SQL server, let's call it City_table) that each ID in ID_table is closest to.
City_table has the latitude and longitude of each city as well as the rank of each city in terms of population, and the ID_table has the latitude & longitude of each ID. I can't join in City_table, because I'll need to calculate the distance from each ID to each city, and take the minimum of that.
The calculation below gets the distance from one location to another and converts to miles:
(ACOS(COS(RADIANS(90-CAST(ID_table.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-City_table.latitude))+SIN(RADIANS(90-CAST(ID_table.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-City_table.latitude)) *COS(RADIANS(CAST(ID_table.GEO_LONGITUDE AS REAL)-
(-City_table.longitude))))*6371)*0.621371
To recap, ID_table has an ID, latitude, and longitude. City_table has a latitude, longitude, city, and rank according to highest population. I need to get the rank of the city from City_table that is closest to the ID's location.
I really just don't know how to do this and would really appreciate any help.
I'm trying to accomplish something like the following (acknowledging the syntax isn't right, just the idea of what I think I'm trying to accomplish) @hastrb
SELECT A.ID,City_table.rank,
FOR EACH CITY IN City_table{(ACOS(COS(RADIANS(90-CAST(ID_table.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-City_table.latitude))+
SIN(RADIANS(90-CAST(ID_table.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-City_table.latitude)) *
COS(RADIANS(CAST(ID_table.GEO_LONGITUDE AS REAL)-(-City_table.longitude))))*6371)*0.621371} AS DISTANCE
FROM ID_table A
WHERE ROW_NUMBER()OVER(PARTITION BY A.ID ORDER BY DISTANCE ASC) = '1'
I ended up figuring this out. For future reference, if anyone runs into the same problem, here is a solution that works (albeit maybe not the best one):
WITH X
AS
(
SELECT A.ID, A.CITY, A.[STATE], B.[Rank], B.City AS CITY_TABLE_CITY, B.State AS CITY_TABLE_STATE,
((ACOS(COS(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-B.latitude))+
SIN(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-B.latitude))*
COS(RADIANS(CAST(A.GEO_LONGITUDE AS REAL)-B.longitude)))*6371)*0.621371) AS DISTANCE,
ROW_NUMBER()OVER(PARTITION BY A.ID ORDER BY((ACOS(COS(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-B.latitude))+
SIN(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-B.latitude))*
COS(RADIANS(CAST(A.GEO_LONGITUDE AS REAL)-B.longitude)))*6371)*0.621371) ASC) AS DISTANCE_NUMBER
FROM ID_TABLE A
FULL OUTER JOIN CITY_TABLE B ON B.latitude<>A.GEO_LATITUDE
)
SELECT *
FROM X
WHERE DISTANCE_NUMBER='1' AND DISTANCE IS NOT NULL
ORDER BY ID
I strongly suggest (as already mentioned by John Cappelletti) using the geography type. I had to do something similar, where my reference table contained the geography of the locations. I joined it to the main query "ON 1=1". That way, each row of your main query is paired with every location from your locations table (just keep in mind that if the tables you're dealing with are large, this will be slow!). In any case, the query looks something like this:
--The next line declares a reference location (some lat/long example)!
DECLARE @reference_point geography=geography::STGeomFromText('POINT(-84.206230 33.897247)', 4326);
SELECT t.LocationName,
t.PointGeom.STDistance(@reference_point) / 1609.34 AS [DistanceInMiles]
FROM
(
SELECT LocationName,
geography::Point(ISNULL(Geom.STY, 0), ISNULL(Geom.STX, 0), 4326) AS [PointGeom],
Geom,
Geom.STX AS [Longitude],
Geom.STY AS [Latitude]
FROM MyLocationsTable
) AS t;

Handling very big table in SQL Server Performance

I'm having some trouble dealing with a very big table in my database. Before talking about the problem, let's talk about what I want to achieve.
I have two source tables :
Source 1: SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
Source 2: CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)
Target: DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)
The idea is to find the top N nearest SALES_MAN for every client using a ROW_NUMBER in the target table.
What I'm doing currently is calculating the distance between every client and every sales man :
INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
[SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
SELECT
[ID_SMAN], [ID_CLIENT],
geography::STGeomFromText('POINT('+IND_LATITUDE+' '+IND_LONGITUDE+')',4326).STDistance(geography::STGeomFromText('POINT('+DLR.[DLR_N_GPS_LATTITUDE]+' '+DLR.[DLR_N_GPS_LONGITUDE]+')',4326))/1000 as distance,
[SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE]
FROM
[dbo].[SALES_MAN], [dbo].[CLIENT]
The DISTANCE table contains approximately 1 billion rows.
The second step to get my 5 nearest sales man per client is to run this query :
SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
FROM DISTANCE) TAB
WHERE rang < 6
The last query is really expensive. To avoid the SORT operator I tried to create a sorted nonclustered index on DISTANCE and ID_CLIENT, but it did not work. I also tried including all the needed columns in both indexes.
But when I created a clustered index on DISTANCE and kept the nonclustered sorted index on ID_CLIENT, things went better.
So why is a nonclustered sorted index not working in this case?
But when I use the clustered index, I have another problem when loading data, and I'm more or less forced to drop it before starting the loading process.
So what do you think? And how can we deal with this kind of table so that we can select, insert or update data without performance issues?
Many thanks
Too long for a comment, but consider the following points.
Item 1) Consider adding a Geography field to each of your source tables. This will eliminate the redundant GEOGRAPHY::Point() function calls
Update YourTable Set GeoPoint = GEOGRAPHY::Point([Lat], [Lng], 4326)
So then the calculation for distance would simply be
,InMeters = C.GeoPoint.STDistance(S.GeoPoint)
,InMiles = C.GeoPoint.STDistance(S.GeoPoint) / 1609.344
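Item 2) Rather than generating EVERY possible combination, consider adding a condition to the JOIN. Keep in mind that every "1" of Lat is approx 69 miles (and every "1" of Lng is at most that, shrinking with the cosine of the latitude as you move away from the equator), so you can reduce the search area. For example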
From CLIENT C
Join SALES_MAN S
on S.Lat between C.Lat-1 and C.Lat+1
and S.Lng between C.Lng-1 and C.Lng+1
This +/- 1 could be any reasonable value ... (i.e. 0.5 or even 2.0)
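If you would rather derive the +/- values from a radius in miles than guess them, a sketch along these lines may help. It is Python rather than T-SQL, the function name is made up, and it assumes roughly 69 miles per degree of latitude and a point that is not near a pole or the 180° meridian. The longitude margin is widened by 1/cos(latitude), because a degree of longitude covers fewer miles away from the equator.

```python
import math

MILES_PER_DEGREE_LAT = 69.0  # approximate figure, assumed constant here

def bounding_box(lat, lng, radius_miles):
    """Return (min_lat, max_lat, min_lng, max_lng) enclosing the given radius.

    Assumes the point is not close to a pole or the antimeridian.
    """
    dlat = radius_miles / MILES_PER_DEGREE_LAT
    # A degree of longitude spans about 69 * cos(latitude) miles
    dlng = radius_miles / (MILES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return lat - dlat, lat + dlat, lng - dlng, lng + dlng

# Hypothetical client location and a 100-mile search radius
print(bounding_box(33.897247, -84.206230, 100))
```

The four values can then go straight into the BETWEEN conditions of the join, and the exact STDistance check only has to run on the rows that survive the box filter.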
ROW_NUMBER is a window function that needs all the rows of each partition sorted by the ORDER BY column, so it is better to filter your result before applying ROW_NUMBER.
You have to change the following code:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE)
rang FROM DISTANCE
) TAB
WHERE rang < 6
into this:
WITH DISTANCE_CLIENT_IDS (ID_CLIENT) AS
(
SELECT DISTINCT ID_CLIENT
FROM DISTANCE
)
SELECT Dx.*
FROM DISTANCE_CLIENT_IDS D1
CROSS APPLY
(
-- Number only the five nearest rows already picked for this client
SELECT Dt.*, ROW_NUMBER() OVER (ORDER BY Dt.DISTANCE) AS rang
FROM (
SELECT TOP (5) *
FROM DISTANCE D2
WHERE D2.ID_CLIENT = D1.ID_CLIENT
ORDER BY D2.DISTANCE
) Dt
) Dx
and make sure you have added indexes on both the ID_CLIENT and DISTANCE columns
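The idea of keeping only the five nearest rows per client, instead of ranking the whole table, can also be sketched outside SQL. The Python below is only an illustration with made-up names and data; it groups (client, salesman, distance) rows in memory and uses heapq.nsmallest to keep the n closest per client.

```python
import heapq
from collections import defaultdict

def top_n_nearest(rows, n=5):
    """rows: iterable of (id_client, id_sman, distance).

    Returns {id_client: [(distance, id_sman), ...]} with at most n entries
    per client, sorted by ascending distance.
    """
    by_client = defaultdict(list)
    for id_client, id_sman, distance in rows:
        by_client[id_client].append((distance, id_sman))
    return {client: heapq.nsmallest(n, pairs) for client, pairs in by_client.items()}

# Hypothetical rows: (ID_CLIENT, ID_SMAN, DISTANCE in km)
rows = [(1, "a", 3.0), (1, "b", 1.0), (1, "c", 2.0), (2, "a", 5.0)]
print(top_n_nearest(rows, n=2))  # {1: [(1.0, 'b'), (2.0, 'c')], 2: [(5.0, 'a')]}
```

This mirrors what a per-client TOP (5) with an ORDER BY on DISTANCE does: each client's rows are reduced to the five smallest distances before any numbering happens.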

SQL Server spatial feature: .STDistance Does it return miles or meters?

I am trying to use the SQL Server 2014 functions to determine distance between two points on a geographical surface. I have three fields in a table (Lat, Long, Coordinates). [Lat] and [Long] are existing values and I store the geographical point coordinates in the [Coordinates] field using the following:
UPDATE dbo.[MyTable]
SET [Coordinates] = geography::STPointFromText('POINT(' + CAST([Lon] AS VARCHAR(20)) + ' ' + CAST([Lat] AS VARCHAR(20)) + ')', 4326) ;
So now I have a table full of records that have the geographical coordinates pre-computed in the [Coordinates] field. Now I want to determine the distance in miles between Point_A and Point_B. I used the following:
-- Compute Point_A:
DECLARE @g geography = (SELECT [Coordinates] FROM [MyTable] WHERE [Id] = 68);
-- Compute Point_B:
DECLARE @h geography = (SELECT [Coordinates] FROM [MyTable] WHERE [Id] = 1439);
-- Compute Distance:
SELECT ROUND(@g.STDistance(@h), 0) AS [Distance];
The actual distance is about 20 miles but this computation is giving me a number that is thousands of times greater than 20 miles.
Is .STDistance returning meters instead of miles?
On a related note: can anyone point me to an example on the web where one matches thousands of geographical points in a table with the nearest geographical points in another table containing thousand of points? I can see this computation taking a very long time if I can't find a way to shorten the process.
The actual value returned depends on the SRID (Spatial Reference Identifiers) of your geography column - you set this when you created the geography items. If SRID is not specified, the default value of 4326 is assumed, which corresponds to the WGS 84 datum.
As far as performance goes, you can index the geography columns; here's a link to get you on the right path.
https://technet.microsoft.com/en-us/library/bb964712(v=sql.105).aspx
If your SRID is 4326, the unit is the metre.
You can use this SQL to check the unit of another SRID:
Select * from sys.spatial_reference_systems
I was using the default SRID value of 4326.
Through trial and error, I discovered that it is returning meters rather than miles.
It would have been nice if Microsoft would have made that more clear in their documentation.
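For completeness, the conversion itself is just a division by the number of meters in a mile (1609.344 for the international mile). A tiny Python sketch, with a made-up function name, using a value of the size you would expect for a distance of about 20 miles:

```python
METERS_PER_MILE = 1609.344  # international mile

def meters_to_miles(meters: float) -> float:
    """Convert a distance in meters (as returned for SRID 4326) to miles."""
    return meters / METERS_PER_MILE

print(round(meters_to_miles(32186.88), 2))  # 20.0
```

In T-SQL the same thing is the STDistance result divided by 1609.344.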
