I am moving from Oracle to PostgreSQL and am trying to convert some Oracle hierarchical queries. For example, in Oracle, to return a comma-delimited ordered list of all ids under (i.e., the children of) and including the id_to_start_with, I would do the following:
SELECT LISTAGG(id_something, ',') WITHIN GROUP (ORDER BY id_something) AS somethings
FROM (
    SELECT DISTINCT D.id_something
    FROM something_table D
    START WITH D.id_something = :id_to_start_with
    CONNECT BY D.id_something_parent = PRIOR D.id_something
)
The equivalent in Postgres would seem to be:
WITH RECURSIVE the_somethings(id_something) AS (
    SELECT id_something
    FROM something_table
    WHERE id_something = $id_to_start_with
    UNION ALL
    SELECT D.id_something
    FROM the_somethings DR
    JOIN something_table D ON DR.id_something = D.id_something_parent
)
SELECT string_agg(temp_somethings.id_something::TEXT, ',') AS somethings
FROM (
    SELECT id_something
    FROM the_somethings
    ORDER BY id_something
) AS temp_somethings
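To make this concrete, here is a small hypothetical sample table (my own; the real schema is not shown beyond the two column names):

CREATE TABLE something_table (
    id_something        int PRIMARY KEY,
    id_something_parent int REFERENCES something_table (id_something)
);

INSERT INTO something_table VALUES
    (1, NULL),  -- root
    (2, 1),
    (3, 1),
    (4, 2);

-- With $id_to_start_with = 2, the recursive query above returns '2,4'.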
Likewise, if I want to return a comma-delimited ordered list of all ids above (i.e., the parents of) and including the id_to_start_with, I would do the following in Oracle:
SELECT LISTAGG(id_something, ',') WITHIN GROUP (ORDER BY id_something) AS somethings
FROM (
    SELECT DISTINCT D.id_something
    FROM something_table D
    START WITH D.id_something = :id_to_start_with
    CONNECT BY D.id_something = PRIOR D.id_something_parent
)
The equivalent in Postgres would seem to be:
WITH RECURSIVE the_somethings(id_something, path) AS (
    SELECT id_something
         , id_something::TEXT AS path
    FROM something_table
    WHERE id_something_parent IS NULL
    UNION ALL
    SELECT D.id_something
         , (DR.path || ',' || D.id_something::TEXT) AS path
    FROM something_table D
    JOIN the_somethings DR ON DR.id_something = D.id_something_parent
)
SELECT path
FROM the_somethings
WHERE id_something = $id_to_start_with
ORDER BY id_something
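With the hypothetical sample table above and $id_to_start_with = 4, this returns a single row with path = '1,2,4'.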
My question has to do with the last Postgres query. It seems terribly inefficient to me, and I wonder if there is a better way to write it. In Oracle, the query looks for the parent of the id_to_start_with, then the parent of the parent, and so forth up to the root.
The Postgres query, on the other hand, builds every possible root-to-child path and then throws everything away except the one root-to-id_to_start_with path I am looking for. That is potentially a ton of data to generate just to discard all of it except the one row I want.
Is there a way to get a comma-delimited ordered list of all the parents of a particular id_to_start_with that is as performant in Postgres as it is in Oracle?
Edit: Adding explain plans from Oracle and Postgres.
Oracle Explain Plan Output
Postgres EXPLAIN ANALYZE Output
CTE Scan on the_somethings (cost=62.27..74.66 rows=3 width=76) (actual time=0.361..0.572 rows=1 loops=1)
Filter: (id_something = 1047)
Rows Removed by Filter: 82
CTE the_somethings
-> Recursive Union (cost=0.00..62.27 rows=551 width=76) (actual time=0.026..0.433 rows=83 loops=1)
-> Seq Scan on something_table (cost=0.00..2.83 rows=1 width=8) (actual time=0.023..0.034 rows=1 loops=1)
Filter: (id_something_parent IS NULL)
Rows Removed by Filter: 82
-> Hash Join (cost=0.33..4.84 rows=55 width=76) (actual time=0.028..0.065 rows=16 loops=5)
Hash Cond: (d.id_something_parent = dr.id_something)
-> Seq Scan on something_table d (cost=0.00..2.83 rows=83 width=16) (actual time=0.002..0.012 rows=83 loops=5)
-> Hash (cost=0.20..0.20 rows=10 width=76) (actual time=0.009..0.009 rows=17 loops=5)
Buckets: 1024 Batches: 1 Memory Usage: 10kB
-> WorkTable Scan on the_somethings dr (cost=0.00..0.20 rows=10 width=76) (actual time=0.001..0.004 rows=17 loops=5)
Planning time: 0.407 ms
Execution time: 0.652 ms
This is the final query based on Jakub's answer below.
WITH RECURSIVE the_somethings(id_something, path, level, orig_id, id_something_parent) AS (
    SELECT id_something
         , id_something::TEXT AS path
         , 0 AS level
         , id_something AS orig_id
         , id_something_parent
    FROM something_table
    WHERE id_something IN (1047, 448)
    UNION ALL
    SELECT D.id_something
         , (D.id_something::TEXT || ',' || DR.path) AS path
         , DR.level + 1 AS level
         , DR.orig_id AS orig_id
         , D.id_something_parent
    FROM something_table D
    JOIN the_somethings DR ON D.id_something = DR.id_something_parent
)
SELECT DISTINCT ON (orig_id) orig_id, path
FROM the_somethings
ORDER BY orig_id, level DESC;
CTEs in PostgreSQL are an optimization fence, meaning they are materialized first, and only then is the filter from the outer query applied. To make the query perform well, build it the other way around and put the filter inside the CTE.
WITH RECURSIVE the_somethings(id_something, path, level, orig_id) AS (
    SELECT id_something
         , id_something::TEXT AS path, 0 AS level, id_something AS orig_id
    FROM something_table
    WHERE id_something IN ($id_to_start_with, $id_to_start_with2)
    UNION ALL
    SELECT D.id_something
         , (D.id_something::TEXT || ',' || DR.path) AS path, DR.level + 1, DR.orig_id
    FROM something_table D
    JOIN the_somethings DR ON DR.id_something_parent = D.id_something
)
SELECT DISTINCT ON (orig_id) orig_id, path
FROM the_somethings
ORDER BY orig_id, level DESC
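For a single starting id, the same shape reduces to the following (a sketch; with the hypothetical sample table shown earlier and id 4, it returns '1,2,4'):

WITH RECURSIVE the_somethings(id_something, path, level) AS (
    SELECT id_something, id_something::TEXT, 0
    FROM something_table
    WHERE id_something = $id_to_start_with
    UNION ALL
    SELECT D.id_something, D.id_something::TEXT || ',' || DR.path, DR.level + 1
    FROM something_table D
    JOIN the_somethings DR ON DR.id_something_parent = D.id_something
)
SELECT path
FROM the_somethings
ORDER BY level DESC
LIMIT 1;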
I need to get a custom sort order in a CTE, but I get this error:
"The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified."
What's a better way to get the custom order in the CTE?
WITH
ctedivisiondesc
as
(
SELECT * FROM (
SELECT --TOP 1 --[APPID]
DH1.[ID_NUM]
--,[SEQ_NUM_2]
--,[CUR_DEGREE]
--,[NON_DEGREE_SEEKING]
,DH1.[DIV_CDE]
,DDF.DEGREE_DESC 'DivisionDesc'
--,[DEGR_CDE]
--,[PRT_DEGR_ON_TRANSC]
--,[ACAD_DEGR_CDE]
,[DTE_DEGR_CONFERRED]
--,MAX([DTE_DEGR_CONFERRED]) AS Date_degree_conferred
,ROW_NUMBER() OVER (
PARTITION BY [ID_NUM]
ORDER BY [DTE_DEGR_CONFERRED] DESC --Getting last degree
) AS [ROW NUMBER]
FROM [TmsePrd].[dbo].[DEGREE_HISTORY] As DH1
inner join [TmsePrd].[dbo].[DEGREE_DEFINITION] AS DDF
on DH1.[DEGR_CDE] = DDF.[DEGREE]
--ORDER BY
--DIV_CDE Level
--CE Continuing Education
--CT Certificate 1
--DC Doctor of Chiropractic 4
--GR Graduate 3
--PD Pending Division
--UG Undegraduate 2
--The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
ORDER BY CASE
WHEN DDF.DEGREE_DESC = 'Certificate' THEN 1
WHEN DDF.DEGREE_DESC = 'Undegraduate' THEN 2
WHEN DDF.DEGREE_DESC = 'Graduate' THEN 3
WHEN DDF.DEGREE_DESC = 'Doctor of Chiropractic' THEN 4
ELSE 5
END
) AS t
WHERE [ROW NUMBER] <= 1
)
SELECT * FROM ctedivisiondesc
You need to sort the outer query.
Sorting a subquery is not allowed because it is meaningless; consider this simple example:
WITH CTE AS
(   SELECT TOP (2) ID
    FROM (VALUES (1), (2)) AS t (ID)
    ORDER BY ID DESC
)
SELECT *
FROM CTE
ORDER BY ID ASC;
The ordering on the outer query overrides the ordering on the inner query, rendering the inner sort a waste of time.
It is not just about explicit sorting of the outer query either; in more complex scenarios SQL Server may sort the subqueries any way it wishes, to enable merge joins or grouping etc. So the only way to guarantee the order of a result is to order the outer query as you wish.
Since you may not have all the data you need in the outer query, you will probably need to create a further column inside the CTE to use for sorting, e.g.
WITH ctedivisiondesc AS
(
SELECT *
FROM ( SELECT DH1.ID_NUM,
DH1.DIV_CDE,
DDF.DEGREE_DESC AS DivisionDesc,
DTE_DEGR_CONFERRED,
ROW_NUMBER() OVER (PARTITION BY ID_NUM ORDER BY DTE_DEGR_CONFERRED DESC) AS [ROW NUMBER],
CASE
WHEN DDF.DEGREE_DESC = 'Certificate' THEN 1
WHEN DDF.DEGREE_DESC = 'Undegraduate' THEN 2
WHEN DDF.DEGREE_DESC = 'Graduate' THEN 3
WHEN DDF.DEGREE_DESC = 'Doctor of Chiropractic' THEN 4
ELSE 5
END AS SortOrder
FROM TmsePrd.dbo.DEGREE_HISTORY AS DH1
INNER JOIN TmsePrd.dbo.DEGREE_DEFINITION AS DDF
ON DH1.DEGR_CDE = DDF.DEGREE
) AS t
WHERE t.[ROW NUMBER] <= 1
)
SELECT ID_NUM,
DIV_CDE,
DivisionDesc,
DTE_DEGR_CONFERRED
FROM ctedivisiondesc
ORDER BY SortOrder;
I have a query in SQL Server with 6 JOINs and 1 LEFT JOIN to tables and views. It returns 16k records in about 1 second if the select clause is "SELECT *".
As soon as I specify even one column to display (SELECT ItemID, for example), the query slows down to about 70 seconds.
Query #1 (2s) - SELECT *:
SELECT *
FROM (SELECT LinkedToSet, LinkedToCopy, ',' + STRING_AGG(LocationID,',') + ',' Locs, count(1) OVER (PARTITION BY LinkedToSet) Copies
FROM Inventory.Locations WHERE LinkedToSet is not null AND (State & 4096)>0 GROUP BY LinkedToSet, LinkedToCopy) l
JOIN Bricklink_Set_Query bsq on l.LinkedToSet=bsq.Number
JOIN Bricklink.Set_Parts_Query bsp on l.LinkedToSet=bsp.SetNum AND bsp.Extra=0
JOIN Bricklink.Item_List i on bsp.ItemType=i.ItemType AND bsp.ItemID=i.Number
JOIN Bricklink.Category_List cat on i.Category_ID=cat.CatID
JOIN Bricklink.Color_List col on bsp.ColorID=col.ColorID
LEFT JOIN (SELECT LocationID, ItemType, ItemNum, ColorID, sum(QtyFound) as InvPcs
FROM Inventory.Item_History
GROUP BY LocationID, ItemType, ItemNum, ColorID) as h ON l.Locs like concat('%,',h.locationID,',%') AND h.ItemType=bsp.ItemType AND h.ItemNum=bsp.ItemID AND h.ColorID=bsp.ColorID
Actual Execution Plan: https://www.brentozar.com/pastetheplan/?id=SJD7Qemf_
Query #2 (81s) - SELECT a single column:
SELECT bsp.ItemID
FROM (SELECT LinkedToSet, LinkedToCopy, ',' + STRING_AGG(LocationID,',') + ',' Locs, count(1) OVER (PARTITION BY LinkedToSet) Copies
FROM Inventory.Locations WHERE LinkedToSet is not null AND (State & 4096)>0 GROUP BY LinkedToSet, LinkedToCopy) l
JOIN Bricklink_Set_Query bsq on l.LinkedToSet=bsq.Number
JOIN Bricklink.Set_Parts_Query bsp on l.LinkedToSet=bsp.SetNum AND bsp.Extra=0
JOIN Bricklink.Item_List i on bsp.ItemType=i.ItemType AND bsp.ItemID=i.Number
JOIN Bricklink.Category_List cat on i.Category_ID=cat.CatID
JOIN Bricklink.Color_List col on bsp.ColorID=col.ColorID
LEFT JOIN (SELECT LocationID, ItemType, ItemNum, ColorID, sum(QtyFound) as InvPcs
FROM Inventory.Item_History
GROUP BY LocationID, ItemType, ItemNum, ColorID) as h ON l.Locs like concat('%,',h.locationID,',%') AND h.ItemType=bsp.ItemType AND h.ItemNum=bsp.ItemID AND h.ColorID=bsp.ColorID
Actual execution plan: https://www.brentozar.com/pastetheplan/?id=BJTr4x7Gu
The execution plans look totally different from each other and I'm not sure why. I've also tried wrapping the SELECT * and querying that, but some of these tables/views have the exact same field names, especially on the joins, so SQL Server throws an error:
This column 'foo' was specified multiple times.
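For reference, the duplicate-name error is easy to reproduce with a minimal sketch (hypothetical derived tables of mine, both exposing an ID column):

SELECT w.ID
FROM (
    SELECT *
    FROM (VALUES (1)) AS a (ID)
    JOIN (VALUES (1)) AS b (ID) ON a.ID = b.ID
) AS w;
-- Msg 8156: The column 'ID' was specified multiple times for 'w'.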
How do I achieve the performance of SELECT * but limit which columns I display?
P.S. Two notes: 1) my desired select statement is obviously more complex than this, and 2) even using the full select statement, if I add a WHERE clause and restrict the query there, it runs in <1 second. If that plan would be useful I can post it as well.
I have a jsonb document in a table. The document contains an array of cameraIds. I am trying to join this data with the cameras table, a normal table where cameraId is a column, and return unique rows from the table with the jsonb column (which is why I am using a GROUP BY in my query).
Any advice on how to optimize this query for performance would be greatly appreciated.
JSONB Col Example:
{
    "date": {
        "end": "2018-11-02T22:00:00.000Z",
        "start": "2018-11-02T14:30:00.000Z"
    },
    "cameraIds": [100, 101],
    "networkId": 5,
    "filters": [],
    "includeUnprocessed": true,
    "reason": "some reason",
    "vehicleFilter": {
        "bodyInfo": "something",
        "lpInfo": "something"
    }
}
Query:
select ssr.id,
a.name as user_name,
ssr.start_date,
ssr.end_date,
ssr.created_at,
ssr.payload -> 'filters' as pretty_filters,
ssr.payload -> 'reason' as reason,
ssr.payload -> 'includePlates' as include_plates,
ssr.payload -> 'vehicleFilter' -> 'bodyInfo' as vbf,
ssr.payload -> 'vehicleFilter' -> 'lpInfo' as lpInfo,
array_agg(n.name) filter (where n.organization_id = ${orgId}) as network_names,
array_agg(c.name) filter (where n.organization_id = ${orgId}) as camera_names
from
ssr
cross join jsonb_array_elements(ssr.payload -> 'cameraIds') camera_id
inner join cameras as c on c.id = camera_id::int
inner join networks as n on n.id = c.network_id
inner join accounts as a on ssr.account_id = a.id
where n.organization_id = ${someId}
and ssr.created_at between ${startDate} and ${endDate}
group by 1,2,3,4,5,6,7,8,9,10
order BY ssr.created_at desc
OFFSET 0
LIMIT 25;
Your query says:
where n.organization_id = ${someId}
But then the aggregate FILTER says:
where n.organization_id = ${orgId}
... which is a contradiction. The aggregated arrays would always be empty - except where ${orgId} happens to be the same as ${someId}, but then the FILTER clause is useless noise. IOW, the query doesn't seem to make sense as given.
The query might make sense after dropping the aggregate FILTER clauses:
SELECT s.id
, a.name AS user_name
, s.start_date
, s.end_date
, s.created_at
, s.payload ->> 'filters' AS pretty_filters
, s.payload ->> 'reason' AS reason
, s.payload ->> 'includePlates' AS include_plates
, s.payload -> 'vehicleFilter' ->> 'bodyInfo' AS vbf
, s.payload -> 'vehicleFilter' ->> 'lpInfo' AS lpInfo
, cn.camera_names
, cn.network_names
FROM ssr s
JOIN accounts a ON a.id = s.account_id -- assuming referential integrity
CROSS JOIN LATERAL (
SELECT array_agg(c.name) AS camera_names -- sort order?
, array_agg(n.name) AS network_names -- same order? distinct?
FROM jsonb_array_elements_text(s.payload -> 'cameraIds') i(camera_id)
JOIN cameras c ON c.id = i.camera_id::int
JOIN networks n ON n.id = c.network_id
WHERE n.organization_id = ${orgId}
) cn
WHERE s.created_at BETWEEN ${startDate} AND ${endDate} -- ?
ORDER BY s.created_at DESC NULLS LAST
LIMIT 25;
The key is the LATERAL subquery, which avoids duplication of rows from ssr, so we can also drop the outer GROUP BY. It should be considerably faster.
Also note ->> instead of -> and jsonb_array_elements_text(). See:
How to turn JSON array into Postgres array?
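The difference in one line (a quick sketch):

SELECT '{"a": [1, 2]}'::jsonb ->  'a' AS jsonb_value  -- jsonb: [1, 2]
     , '{"a": [1, 2]}'::jsonb ->> 'a' AS text_value;  -- text:  '[1, 2]'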
I left some question marks at more dubious spots in the query. Notably, BETWEEN is almost always the wrong tool for timestamps. See:
Subtract hours from the now() function
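The usual fix is a half-open range, which includes the lower bound and excludes the upper one; a sketch with hypothetical literal bounds:

SELECT *
FROM ssr s
WHERE s.created_at >= '2018-11-02 00:00'::timestamptz
  AND s.created_at <  '2018-11-03 00:00'::timestamptz;  -- upper bound excluded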
Using the Stack Exchange Data Explorer to create queries:
SELECT P.id, creationdate,tags,owneruserid,answercount
--SELECT DISTINCT TAGNAME ,TAGID
FROM TAGS AS T
JOIN POSTTAGS AS PT
ON T.ID = PT.TAGID
JOIN POSTS AS P
ON PT.POSTID = P.ID
--WHERE CAST(P.TAGS AS VARCHAR) IN('JAVA')
WHERE PT.TAGID = 3143
How is it possible to add pagination to the query, so that it returns not only the first 50,000 results but can be run again to fetch the remaining results?
There are a few ways to "page" through T-SQL results; see:
How to return a page of results from SQL?
and
SQL performance: WHERE vs WHERE(ROW_NUMBER)
Here I will use the CTE method as:
It uses convenient row numbers to page through results, rather than trying to track less predictable factors such as creationdate.
It reportedly performs faster than the OFFSET method.
So, that question's query becomes this SEDE query:
-- StartRow: Starting row for paging
-- EndRow: Ending row for paging (Max 50K rows at a time)
WITH allData AS (
SELECT
ROW_NUMBER() OVER (ORDER BY P.creationdate) AS row
, P.id
, P.creationdate
, P.tags
, P.owneruserid
, P.answercount
FROM Posttags AS PT
JOIN Posts AS P ON PT.postid = P.id
WHERE PT.tagid = 3143 -- tag [scala]
)
SELECT *
FROM allData
WHERE row >= ##StartRow:INT?1##
AND row <= ##EndRow:INT?50000##
ORDER BY row
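For comparison, the OFFSET method mentioned above would look roughly like this (a sketch; SQL Server 2012+ syntax, same filter as the query in the question):

SELECT P.id, P.creationdate, P.tags, P.owneruserid, P.answercount
FROM Posttags AS PT
JOIN Posts AS P ON PT.postid = P.id
WHERE PT.tagid = 3143
ORDER BY P.creationdate
OFFSET 50000 ROWS FETCH NEXT 50000 ROWS ONLY;  -- rows 50,001 through 100,000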
I have a data structure that is basically a document with a dictionary of tags. I am attempting to bring back all documents of a given formtype that have a tag named 'Last Name' and a tag value of 'Smith'. There may be 0..N 'Last Name' tags associated with the document.
I am using the following linq query to try to match a source document to children with matching tags:
DB.Documents
.Where(doc => doc.FormID == pd.IndexForm.FormID)
.Where(doc => doc.Document_StringIndex_ReadOnly
.Join(Fields,
dsi => new { FieldName = dsi.FieldName, FieldValue = dsi.StringValue },
dsi2 => new { FieldName = dsi2.FieldName, FieldValue = dsi2.StringValue },
(dsi, dsi2) => dsi.Document).Count() > 0);
Which generates the following query when output using .ToTraceString():
SELECT
[Project1].*
FROM ( SELECT
[Extent1].*,
(SELECT
COUNT(cast(1 as bit)) AS [A1]
FROM [dbo].[Document_StringIndex_ReadOnly] AS [Extent2]
INNER JOIN (SELECT [Extent3].*
FROM [dbo].[Document] AS [Extent3]
INNER JOIN [dbo].[Document_StringIndex_ReadOnly] AS [Extent4] ON [Extent3].[DocumentID] = [Extent4].[DocumentID] ) AS [Join1] ON (([Extent2].[FieldName] = [Join1].[FieldName]) OR (([Extent2].[FieldName] IS NULL) AND ([Join1].[FieldName] IS NULL))) AND (([Extent2].[StringValue] = [Join1].[StringValue]) OR (([Extent2].[StringValue] IS NULL) AND ([Join1].[StringValue] IS NULL)))
LEFT OUTER JOIN [dbo].[Document] AS [Extent5] ON [Extent2].[DocumentID] = [Extent5].[DocumentID]
WHERE ([Extent1].[DocumentID] = [Extent2].[DocumentID]) AND ([Join1].[DocumentID1] = @p__linq__7) AND ([Join1].[FieldName] = @p__linq__8)) AS [C1]
FROM [dbo].[Document] AS [Extent1]
WHERE [Extent1].[FormID] = @p__linq__5
) AS [Project1]
WHERE [Project1].[C1] > 0
If I do a direct substitution of constants for my parameters (as shown below) the query executes very quickly. However, if I leave the parameters in place the query takes several minutes.
SELECT
[Project1].*
FROM ( SELECT
[Extent1].*,
(SELECT
COUNT(cast(1 as bit)) AS [A1]
FROM [dbo].[Document_StringIndex_ReadOnly] AS [Extent2]
INNER JOIN (SELECT [Extent3].*
FROM [dbo].[Document] AS [Extent3]
INNER JOIN [dbo].[Document_StringIndex_ReadOnly] AS [Extent4] ON [Extent3].[DocumentID] = [Extent4].[DocumentID] ) AS [Join1] ON (([Extent2].[FieldName] = [Join1].[FieldName]) OR (([Extent2].[FieldName] IS NULL) AND ([Join1].[FieldName] IS NULL))) AND (([Extent2].[StringValue] = [Join1].[StringValue]) OR (([Extent2].[StringValue] IS NULL) AND ([Join1].[StringValue] IS NULL)))
LEFT OUTER JOIN [dbo].[Document] AS [Extent5] ON [Extent2].[DocumentID] = [Extent5].[DocumentID]
WHERE ([Extent1].[DocumentID] = [Extent2].[DocumentID]) AND ([Join1].[DocumentID1] = 1015) AND ([Join1].[FieldName] = 'DDKey')) AS [C1]
FROM [dbo].[Document] AS [Extent1]
WHERE [Extent1].[FormID] = 22
) AS [Project1]
WHERE [Project1].[C1] > 0
After generating an execution plan, I learned that if I directly substitute the parameter values, SQL Server performs an index seek, and my query is fast. As soon as I leave the parameters in place, SQL Server will perform an index scan, and my query times out. Is there any way to prod SQL server to always seek? Can I force entity framework to not use parameterized queries?
In the generated SQL, this line
[Join1].[FieldName] = @p__linq__8
may be the problem.
If FieldName is varchar(...) and @p__linq__8 is nvarchar(...), then this clause will cause a table scan, since the parameter type doesn't match the index type.
When you directly substitute 'DDKey', the types match, so you get an index seek. Try your query with N'DDKey' and see if you get a table scan.
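The effect is easy to reproduce outside of Linq with a hypothetical temp table (compare the plans of the two selects):

CREATE TABLE #Docs (FieldName varchar(50) NOT NULL);
CREATE INDEX IX_Docs_FieldName ON #Docs (FieldName);

DECLARE @p_nvarchar nvarchar(50) = N'DDKey';
DECLARE @p_varchar  varchar(50)  = 'DDKey';

SELECT COUNT(*) FROM #Docs WHERE FieldName = @p_nvarchar; -- column gets converted: typically a scan
SELECT COUNT(*) FROM #Docs WHERE FieldName = @p_varchar;  -- types match: index seek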
This is an issue with various versions of Linq to Sql and Linq to Entities, but may be fixed in later releases.
One way to work around the problem if you can't update to the latest version would be to change FieldName to be nvarchar(...).
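The schema-side change would be along these lines (a sketch; match the real length and nullability, and any index on the column has to be dropped and recreated around the change):

ALTER TABLE dbo.Document_StringIndex_ReadOnly
    ALTER COLUMN FieldName nvarchar(50) NULL;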