PostGIS minimum distance between two large sets of points

I have two tables of points in PostGIS, say A and B, and I want to know, for every point in A, what is the distance to the closest point in B. I am able to solve this for small sets of points with the following query:
SELECT a.id, MIN(ST_Distance_Sphere(a.geom, b.geom))
FROM table_a a, table_b b
GROUP BY a.id;
However, I have a couple million points in each table and this query runs indefinitely. Is there some more efficient way to approach this? I am open to getting an approximate distance rather than an exact one.
Edit: A slight modification to the answer provided by JGH to return distances in meters rather than degrees if points are unprojected.
SELECT
    a.id, nn.id AS id_nn,
    a.geom, nn.geom_closest,
    ST_Distance_Sphere(a.geom, nn.geom_closest) AS min_dist
FROM table_a AS a
CROSS JOIN LATERAL (
    SELECT
        b.id,
        b.geom AS geom_closest
    FROM table_b b
    ORDER BY a.geom <-> b.geom
    LIMIT 1
) AS nn;

Your query is slow because it computes the distance between every pair of points without using any index. You can rewrite it to use the <-> operator, which uses the spatial index when it appears in the ORDER BY clause.
SELECT a.id, closest_pt.id, closest_pt.dist
FROM tablea a
CROSS JOIN LATERAL (
    SELECT
        b.id,
        a.geom <-> b.geom AS dist
    FROM tableb b
    ORDER BY a.geom <-> b.geom
    LIMIT 1
) AS closest_pt;
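For <-> to use an index at all, the geometry column of the inner table must have a spatial GiST index; a minimal sketch (the index name here is illustrative):

CREATE INDEX tableb_geom_idx ON tableb USING GIST (geom);

Note that on unprojected (lat/lon) geometry, <-> orders by distance in degrees, which is why the edit above re-computes the final distance in meters with ST_Distance_Sphere.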

Related

Nearest Neighbour search in postgis giving wrong result

I am trying to find the nearest station to a set of polygons. I use the following query, but the result is always the same station (in fact, the one with the lowest id).
SELECT DISTINCT ON (a.id)
    a.id AS field_id,
    a.name,
    a.geom AS field_location,
    b.stations_id,
    b.stationsname,
    st_distance(a.geom, b.geom) AS dist
FROM fields_filtered a,
     kl_stationsliste b
WHERE b.bis_datum > '2020-04-01'::date
ORDER BY a.id, b.stations_id, (st_distance(a.geom, b.geom));
What am I doing wrong here?
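Presumably the problem is the ORDER BY: with DISTINCT ON (a.id), PostgreSQL keeps the first row per a.id according to the sort order, and because b.stations_id is listed before the distance, the station with the lowest id always wins. A sketch of the fix, sorting by distance immediately after a.id:

SELECT DISTINCT ON (a.id)
    a.id AS field_id,
    a.name,
    a.geom AS field_location,
    b.stations_id,
    b.stationsname,
    st_distance(a.geom, b.geom) AS dist
FROM fields_filtered a,
     kl_stationsliste b
WHERE b.bis_datum > '2020-04-01'::date
ORDER BY a.id, st_distance(a.geom, b.geom);

For large tables, the CROSS JOIN LATERAL / <-> pattern from the question above scales much better.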

In SQL, how to filter records in two joined tables

I output invoices that are made up of info in two separate data tables linked by a unique ID #. I need to update the service provided in a group of invoices (service info is contained in Table_B) for only a certain date period (date info is contained in Table_A).
Here are the two tables I am joining:
Table_A
ID | Name        | Date   | Total
1  | ABC Company | 1/1/17 | $50
2  | John Smith  | 3/1/17 | $240
3  | Mary Jones  | 2/1/16 | $320
1  | ABC Company | 8/1/16 | $500

Table_B
Table_ID (= Table_A ID) | Service   | Unit Price | Qty
1                       | Service A | $50.00     | 10
2                       | Service B | $20.00     | 12
3                       | Service B | $20.00     | 16
1                       | Service A | $50.00     | 10
I am able to join the two tables using:
Select * from Table_B b inner join Table_A a on b.Table_ID = a.ID
which results in following:
Results
Table_ID | Service   | Unit Price | Qty | ID | Name    | Date   | Total
1        | Service A | $50.00     | 10  | 1  | ABC Co. | 1/1/17 | $500
2        | Service B | $20.00     | 12  | 2  | John S. | 3/1/17 | $240
3        | Service B | $20.00     | 16  | 3  | Mary J. | 2/1/16 | $320
1        | Service A | $50.00     | 10  | 1  | ABC Co. | 8/1/16 | $500
Now, I want only the rows for dates after 12/31/16. However, when I add a WHERE clause for the date (see below), my results don't change.
Select * from Table_B b inner join Table_A a on b.Table_ID = a.ID where date > 12/31/16
I would expect just two rows for services on 1/1/17 and 3/1/17. How can I filter for just rows with a particular date value in this newly joined table?
Assuming your date is contained in a column intended for storing dates, and not a string, try making sure that the date you're passing in really is being interpreted as a date:
SELECT *
FROM table_b b
INNER JOIN table_a a
    ON b.Table_ID = a.ID
WHERE a.date > CONVERT(datetime, '20161231', 112)
I suspect that SQL Server is interpreting your date 12/31/16 as arithmetic: twelve divided by thirty-one, divided by sixteen. With integer operands that is integer division, so it evaluates to exactly 0 (12/31 = 0, then 0/16 = 0).
The way dates are handled internally, they are convertible to numbers representing the number of days (and fraction of a day) since a certain point in time, 1 Jan 1900. Hence your 0 represents midnight on 01 Jan 1900, and that's why your results aren't filtering: all the dates satisfy the WHERE clause (they're all later than 1900-01-01)!
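A quick way to see this for yourself (T-SQL, no tables needed):

-- integer division: 12/31 = 0, then 0/16 = 0
SELECT 12/31/16 AS arithmetic_result;
-- implicit conversion of 0 to datetime is day zero, i.e. 1 Jan 1900
SELECT CONVERT(datetime, 12/31/16) AS as_datetime;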
What result are you getting with your current implementation? I don't see any issue with your current query.
Please test the script below; it may give you the expected output.
SELECT *
FROM Table_B b
INNER JOIN Table_A a
    ON b.Table_ID = a.ID
    AND date > CONVERT(date, '20161231', 112)
SELECT *
FROM Table_B b
INNER JOIN Table_A a
    ON b.Table_ID = a.ID
WHERE date > '12/31/16'
Try using quotes around your date, as shown above. Or, better still, use a range:
SELECT *
FROM Table_B b
INNER JOIN Table_A a
    ON b.Table_ID = a.ID
WHERE Date BETWEEN '12/31/2016' AND '1/1/2018'

MS SQL Server - Table Dependency Hierarchy Group

I have a database with approximately 500 tables and a lot of foreign key relationships among them.
I need to group the related tables together, i.e., all tables that are related (directly or transitively) should end up in one group, and no table in one group should be related to any table in another group.
For example:
There are four tables: T1, T2, T3 and T4.
T1 and T2 have a relationship, and T3 and T4 have a relationship. So I can put T1 and T2 in one group and T3 and T4 in another group.
Which version of SQL Server are you using?
select O.name as [Object_Name], C.text as [Object_Definition]
from sys.syscomments C
inner join sys.all_objects O ON C.id = O.object_id
--where C.text like '%table_name%'
Here's a hierarchical query on sys.foreign_keys that should get you pretty close to what you're looking for.
WITH cte AS (
    -- find tables that are parents, but are not children themselves
    SELECT [fk].[referenced_object_id] AS [child_id],
           NULL AS [parent_id],
           CAST(CONCAT('/', [fk].[referenced_object_id], '/') AS VARCHAR(MAX)) AS h,
           1 AS l
    FROM sys.[foreign_keys] AS [fk]
    WHERE [fk].[referenced_object_id] NOT IN (
        SELECT [parent_object_id]
        FROM sys.[foreign_keys]
    )

    UNION ALL

    SELECT child.[parent_object_id],
           [child].[referenced_object_id] AS [parent_id],
           CAST(CONCAT(parent.[h], child.[parent_object_id], '/') AS VARCHAR(MAX)) AS [h],
           parent.l + 1 AS l
    FROM cte AS [parent]
    JOIN sys.[foreign_keys] AS [child]
        ON [parent].[child_id] = child.[referenced_object_id]
),
hier AS (
    SELECT DISTINCT
           OBJECT_NAME([cte].[child_id]) AS [child],
           OBJECT_NAME([cte].[parent_id]) AS [parent],
           h,
           --CAST([cte].[h] AS HIERARCHYID) AS h
           l
    FROM cte
)
SELECT [hier].[child],
       [hier].[parent],
       [hier].[h]--.ToString()
FROM [hier]
ORDER BY
    l, h -- breadth-first search
    --h, l -- depth-first search
    --h.GetLevel(), h -- breadth-first search; hierarchyid
    --h, h.GetLevel() -- depth-first search; hierarchyid
You'll note that I included two ORDER BY clauses. Each has its uses. Assume that you have the following disconnected graphs of foreign keys: (a → b → c), (d → e → f). The first ORDER BY clause returns rows in the order a, d, b, e, c, f: all of the top-level elements first, followed by the tier-two elements, and so on. The second returns them in the order a, b, c, d, e, f (or perhaps d, e, f, a, b, c, depending on the object ids for a and d); the idea there is that you fully exhaust one disconnected graph before moving on to the next.
One note: I'm fairly sure that the above doesn't take self-referential foreign keys into account. If that's important to you, I'd deal with those as a separate action (i.e. fully populate those first, then find the non-self-referential relationships using the above).
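That split is straightforward, since sys.foreign_keys carries both sides of each relationship; a sketch:

-- self-referential foreign keys: handle these first
SELECT name, parent_object_id
FROM sys.foreign_keys
WHERE parent_object_id = referenced_object_id;

-- everything else can feed the recursive CTE above
SELECT name, parent_object_id, referenced_object_id
FROM sys.foreign_keys
WHERE parent_object_id <> referenced_object_id;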
I also left a comment or two in there for making a hierarchyid solution work: in the hier CTE, use the cast of h to HIERARCHYID instead of plain h, and then use the ORDER BY clauses that take advantage of it. None of that is necessary, but it could be a good first exposure to hierarchyid.

OVER (ORDER BY Col) generates 'Sort' operation

I'm working on a query that needs to do filtering, ordering and paging according to the user's input. Now I'm testing a case that's really slow; on inspection of the query plan, a Sort operation is taking 96% of the time.
The data model is really not that complicated, and the following query should be clear enough to understand what's happening:
WITH OrderedRecords AS (
    SELECT
        A.Id
        , A.col2
        , ...
        , B.Id
        , B.col1
        , ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
    FROM A
    LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
    WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT *
FROM OrderedRecords
WHERE RowNumber BETWEEN x AND y
A is a table containing about 100k records, but it will grow to tens of millions in the field, while B is a category-type table with 5 rows (and it will never grow beyond perhaps a few more). There are clustered indexes on A.Id and B.Id.
Performance is really dreadful and I'm wondering if it's possible to remedy this somehow. If, for example, the ordering is on A.Id instead of B.col1, everything is pretty darn fast. Perhaps I can support B.col1 with some sort of index.
I already tried putting an index on the column itself, but this didn't help, probably because the number of distinct values in table B is very small (in itself, and compared to A).
Any ideas?
I think this may be part of the problem:
LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))
Your LEFT JOIN is going to logically act like an INNER JOIN because of the WHERE clause you have in place, since only certain B.Id rows are going to be returned. If that's your intent, then go ahead and use an inner join, which may help the optimizer realize that you are looking for a restricted number of rows.
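Concretely, the rewrite would just swap the join type (a sketch reusing the question's placeholder columns):

FROM A
INNER JOIN B
    ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
WHERE (A.col2 IN (...)) AND (B.Id IN (...))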
I suggest you try the following.
For the B table, create this index:
CREATE INDEX IX_B_1 ON B (col1, Id, SomeThing)
For the A table, create this index:
CREATE INDEX IX_A_1 ON A (col2, BId) INCLUDE (Id, ...)
In the INCLUDE, put all the other columns of table A that are listed in the SELECT of the OrderedRecords CTE.
However, as you can see, index IX_A_1 takes a lot of space, and can end up roughly the size of the table data itself.
So, as an alternative, you may try omitting the extra columns from the INCLUDE part of the index:
CREATE INDEX IX_A_2 ON A (col2, BId) INCLUDE (Id)
but in this case you will have to slightly modify your query:
;WITH OrderedRecords AS (
    SELECT
        AId = A.Id
        , A.col2
        -- remove other A columns from here
        , bid = B.Id
        , B.col1
        , ROW_NUMBER() OVER (ORDER BY B.col1 ASC) AS RowNumber
    FROM A
    LEFT JOIN B ON (B.SomeThing IS NULL) AND (A.BId = B.Id)
    WHERE (A.col2 IN (...)) AND (B.Id IN (...))
)
SELECT
    R.*, A.OtherColumns
FROM OrderedRecords R
JOIN A ON A.Id = R.AId
WHERE R.RowNumber BETWEEN x AND y

Subtract 2 columns from PostgreSQL LEFT JOIN query with NULL values

I have a PostgreSQL query which should return the actual stock of samples in our lab.
The initial samples are taken from one table (tblStudies), but there are two other tables to consult that decrease the number of samples.
So I made a union query over those two tables, and then matched the union query against tblStudies to calculate the actual stock.
But the union query only yields values when there has been a decrease in samples, so when a study still has its initial samples, no value is returned.
I figured out I should use a JOIN operation, but then I have NULL values for the studies with only initial samples.
Here is how far I got; any help please?
SELECT
    "tblStudies"."Studie_ID",
    "SamplesWeggezet",
    c."Stalen_gebruikt",
    "SamplesWeggezet" - c."Stalen_gebruikt" AS "Stock"
FROM "Stability"."tblStudies"
LEFT JOIN (
    SELECT b."Studie_ID", sum(b."Stalen_gebruikt") AS "Stalen_gebruikt"
    FROM (
        SELECT "tblAnalyses"."Studie_ID", sum("tblAnalyses"."Aant_stalen_gebruikt") AS "Stalen_gebruikt"
        FROM "Stability"."tblAnalyses"
        GROUP BY "tblAnalyses"."Studie_ID"
        UNION
        SELECT "tblStalenUitKamer"."Studie_ID", sum("tblStalenUitKamer".aant_stalen) AS "stalen_gebruikt"
        FROM "Stability"."tblStalenUitKamer"
        GROUP BY "tblStalenUitKamer"."Studie_ID"
    ) b
    GROUP BY b."Studie_ID"
) c ON "tblStudies"."Studie_ID" = c."Studie_ID"
Because you're doing a LEFT JOIN to the inline query c, some values of c."Stalen_gebruikt" can be NULL, and any number minus NULL yields NULL. To address this, we can use COALESCE.
So change
"SamplesWeggezet" - c."Stalen_gebruikt" AS "Stock"
to
"SamplesWeggezet" - COALESCE(c."Stalen_gebruikt", 0) AS "Stock"
