PostgreSQL - Create and join array from multiple rows based on ID

I have two tables:
The table LINKS has:
LINK_ID --- integer, unique ID
FROM_NODE_X -- numbers/floats, indicating a geographical position
FROM_NODE_Y --
FROM_NODE_Z --
TO_NODE_X --
TO_NODE_Y --
TO_NODE_Z --
The table LINK_COORDS has:
LINK_ID --- integer, refers to above UID
ORDER --- integer, indicating order
X ---
Y ---
Z ---
Logically each LINK consists of a number of waypoints. The final order is:
FROM_NODE , 1 , 2 , 3 , ... , TO_NODE
A link has at least two waypoints (FROM_NODE, TO_NODE), but can have a variable number of waypoints in between (0 to 100+).
I now would need a way to aggregate, sort and store the waypoints of each link in an array which later will be used to draw a line.
I'm struggling with the LINK_COORDS being available as individual rows. Having the start and end positions in the other (LINKS) table doesn't help either. If I had a way to at least get all the LINK_COORDS joined/updated to the LINKS table I probably could work out the rest myself again. So if you have an idea on how to get that far, it'd be much appreciated already.
Considering performance would be nice (the tables have somewhere between 500k to 1mio entries now and will have multiples of that later), but is not essential for now.
EDIT:
Thanks for the suggestion, a-horse-with-no-name.
I chose to create the point geometries (PostGIS) for each XYZ before this step, so in the end there's only an array of points to create from the individual points.
The adapted SQL
UPDATE "Link"
SET "POINTS" =
array_append(
(array_prepend(
"FROM_POINT",
(SELECT array_agg(lc."POINT" ORDER BY lc."COUNT")
FROM "LinkCoordinate" lc
WHERE lc."LINK_ID" = "Link"."LINK_ID")))
, "TO_POINT")
however is running extremely slow:
Running it as a test on 10 links took ~120 seconds. Running it for all 1.3 million links and the even larger number of link coordinates would probably take somewhere around half a year. Not really ideal.
How can I figure out where this immense slowness originates from?
If I get the source data in a pre-ordered format (so linkcoordinates of each link_ID), would this allow me to significantly speed up the SQL query?
EDIT: It appears the main slowdown originates from the SELECT subquery used in the array_agg() function. Everything else (incl. ordering) does not really cause any slowdown.
My current guess is that the SELECT query iterates over the entirety of "LinkCoordinate" for each and every link, making it work much harder than it has to, as all LinkCoordinates belonging to a Link are always stored in 'blocks' of rows. A single, sequential processing of the LinkCoordinates would be sufficient, really.

something like this maybe:
select l.link_id,
min(l.from_node_x) as from_node_x,
min(l.from_node_y) as from_node_y,
min(l.from_node_z) as from_node_z,
array_agg(lc.x order by lc."ORDER") as points_x,
array_agg(lc.y order by lc."ORDER") as points_y,
array_agg(lc.z order by lc."ORDER") as points_z,
min(l.to_node_x) as to_node_x,
min(l.to_node_y) as to_node_y,
min(l.to_node_z) as to_node_z
from links l
join link_coords lc on lc.link_id = l.link_id
group by l.link_id;
The min() is necessary due to the group by but won't change the result as all values from the links are the same anyway.
Another possibility is to use a scalar subquery. I'm unsure which of them is faster though - but the join/group by is probably more efficient.
select l.link_id,
l.from_node_x,
l.from_node_y,
l.from_node_z,
(select array_agg(lc.x order by lc."ORDER") from link_coords lc where lc.link_id = l.link_id) as points_x,
(select array_agg(lc.y order by lc."ORDER") from link_coords lc where lc.link_id = l.link_id) as points_y,
(select array_agg(lc.z order by lc."ORDER") from link_coords lc where lc.link_id = l.link_id) as points_z,
l.to_node_x,
l.to_node_y,
l.to_node_z
from links l
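The "single, sequential pass" idea from the question can be sketched outside the database as well. The following is only an illustration in Python, not the PostgreSQL solution itself: the function name build_paths and the input shapes (a dict of start/end points per link and a list of (link_id, order, point) rows) are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def build_paths(links, link_coords):
    """Return {link_id: [from_point, waypoint 1, ..., to_point]}.

    links:       dict link_id -> (from_point, to_point)   (hypothetical shape)
    link_coords: iterable of (link_id, order, point) rows (hypothetical shape)
    """
    # Sort once by (link_id, order) so each link's waypoints form one block.
    rows = sorted(link_coords, key=itemgetter(0, 1))
    # One sequential pass: collect the ordered waypoints per link.
    middle = {
        link_id: [point for _, _, point in group]
        for link_id, group in groupby(rows, key=itemgetter(0))
    }
    # Links without intermediate waypoints still get [from_point, to_point].
    return {
        link_id: [start, *middle.get(link_id, []), end]
        for link_id, (start, end) in links.items()
    }


links = {1: ((0, 0, 0), (9, 9, 9)), 2: ((1, 1, 1), (2, 2, 2))}
coords = [(1, 2, (2, 2, 2)), (1, 1, (1, 1, 1))]
paths = build_paths(links, coords)
```

Note that link 2 has no rows in the coordinate list and still ends up with its two end points, which is the case an inner join in SQL would drop.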


Extracting all records using a conditional SQL Server query?

I have a long database of observations for individuals. There are multiple observations for each individual, all assigned different medcodeid's.
I want to extract all records of individuals with certain medcodeid's assigned, but only if they at some point have had a smaller list of specific codes assigned.
This is an example of what I start with:
long dataset, multiple observations
and this is the records I'd like to extract:
multiple observations, but patients 3 and 5 are not extracted, as they never had a medcode 12
Would this be an additional WHERE clause? I am struggling as this will then only extract the second AND medcodeid list. But I want it to extract all, if the individual has had one of these certain fewer codes at some point. I hope that makes some sense. I am unfamiliar with IF command? And cannot see how CASE WHEN would work either.
Thank you very much in advance!
You definitely don't want to filter out all the rows, so you're right that an additional condition won't help with that. WHERE only lets you look at the current row, and you're trying to make a decision based on all the rows belonging to the patient.
This query just uses a table expression and an analytic count() that tags each row with the number of matches as it lets you look outside the current row just like you need.
-- my additions to your query are in lowercase
with data as (
SELECT obs.patid, yob, obsdate, medcodeid,
count(case when medcodeid IN (<list of mandatory codes>) then 1 end)
over (partition by obs.patid) as medcode_count
-- assuming the relationship looks something like this
from obs inner join medcode on medcode.patid = obs.patid
WHERE medcodeid IN (<list of codes>)
AND obsdate BETWEEN '2004-12-31' AND GETDATE()
AND patienttypeid = 3 AND acceptable = 1 AND gender = 2
AND YEAR(obsdate) - yob > 15 AND YEAR(obsdate) - yob < 45
)
select * from data where medcode_count > 0;
At first I thought you were requiring that at least five of the codes from the full set were found. Now that you've edited the question I believe that you want to require that at least one code from a smaller subset is present. Either way this approach will work.
If I'm understanding what you're asking, I think what you need is an additional WHERE clause with a subquery. This could be done with an EXISTS or a join, but I find an IN query easier to work with.
You left the FROM out of your query so I had to guess at it but try this:
SELECT
obs.patid,
yob,
obsdate,
medcodeid
FROM
obs
WHERE
medcodeid IN (list of 20 codes)
AND (obsdate BETWEEN '2004-12-31' AND GETDATE())
AND patienttypeid = 3
AND acceptable = 1
AND gender = 2
AND ((YEAR(obsdate))-yob) > 15
AND ((YEAR(obsdate)) - yob) < 45
AND obs.patid IN (
SELECT
obs.patid
FROM
obs
WHERE
medcodeid IN (5 of the 20 codes)
);
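To try the IN-subquery idea on toy data, here is a minimal sketch using Python's built-in sqlite3 module. SQLite stands in for SQL Server here, and the cut-down obs table (only patid and medcodeid), the function name and the sample codes are all assumptions made for the example.

```python
import sqlite3


def records_for_qualifying_patients(rows, wanted_codes, mandatory_codes):
    """Return (patid, medcodeid) rows whose code is in wanted_codes, but only
    for patients that have at least one row with a code in mandatory_codes."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE obs (patid INTEGER, medcodeid INTEGER)")
    con.executemany("INSERT INTO obs VALUES (?, ?)", rows)
    # Build one placeholder per code so the values are passed as parameters.
    wanted = ",".join("?" * len(wanted_codes))
    mandatory = ",".join("?" * len(mandatory_codes))
    sql = f"""
        SELECT patid, medcodeid
        FROM obs
        WHERE medcodeid IN ({wanted})
          AND patid IN (SELECT patid FROM obs WHERE medcodeid IN ({mandatory}))
        ORDER BY patid, medcodeid
    """
    result = con.execute(sql, [*wanted_codes, *mandatory_codes]).fetchall()
    con.close()
    return result


sample = [(1, 12), (1, 20), (2, 20), (3, 12), (3, 99)]
kept = records_for_qualifying_patients(sample, [12, 20], [12])
```

Patient 2 never had code 12, so none of that patient's rows come back, while every matching row of patients 1 and 3 is kept.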

Is there a way to sum an entire quantity in SQL with unique values

I am trying to get a total summation of both the ItemDetail.Quantity column and ItemDetail.NetPrice column. For sake of example, let's say the quantity that is listed is for each individual item is 5, 2, and 4 respectively. I am wondering if there is a way to display quantity as 11 for one single ItemGroup.ItemGroupName
The query I am using is listed below
select Location.LocationName, ItemDetail.DOB, SUM (ItemDetail.Quantity) as "Quantity",
ItemGroup.ItemGroupName, SUM (ItemDetail.NetPrice)
from ItemDetail
Join ItemGroupMember
on ItemDetail.ItemID = ItemGroupMember.ItemID
Join ItemGroup
on ItemGroupMember.ItemGroupID = ItemGroup.ItemGroupID
Join Location
on ItemDetail.LocationID = Location.LocationID
Inner Join Item
on ItemDetail.ItemID = Item.ItemID
where ItemGroup.ItemGroupID = '78' and DOB = '11/20/2019'
GROUP BY Location.LocationName, ItemDetail.DOB, Item.ItemName,
ItemDetail.NetPrice, ItemGroup.ItemGroupName
If you are using SQL Server 2012 or later, you can use a windowed SUM (SUM ... OVER) to display the
details and aggregates in the same query.
SUM(SalesYTD) OVER (ORDER BY DATEPART(yy, ModifiedDate))
Link :
https://learn.microsoft.com/en-us/sql/t-sql/functions/sum-transact-sql?view=sql-server-ver15
We can't be certain without seeing sample data. But I suspect you need to remove some fields from your GROUP BY clause -- probably Item.ItemName and ItemDetail.NetPrice.
Generally, you won't GROUP BY a column that you are applying an aggregate function to in the SELECT -- as in SUM(ItemDetail.NetPrice). And it is not very common, in my experience, to GROUP BY columns that aren't included in the SELECT list - as you are doing with Item.ItemName.
I think you need to go back to basics and read about what GROUP BY does.
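The effect of the extra GROUP BY columns can be seen on a toy table. This is a minimal sketch using Python's sqlite3 module in place of SQL Server; the single flattened ItemDetail table and its sample rows (quantities 5, 2 and 4, as in the question) are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE ItemDetail ("
    "ItemName TEXT, ItemGroupName TEXT, Quantity INTEGER, NetPrice REAL)"
)
con.executemany(
    "INSERT INTO ItemDetail VALUES (?, ?, ?, ?)",
    [("A", "Drinks", 5, 1.5), ("B", "Drinks", 2, 2.0), ("C", "Drinks", 4, 3.0)],
)

# Grouping by ItemName and NetPrice as well: one output row per item.
per_item = con.execute(
    "SELECT ItemGroupName, SUM(Quantity) FROM ItemDetail "
    "GROUP BY ItemGroupName, ItemName, NetPrice ORDER BY ItemName"
).fetchall()

# Grouping by the item group only: one output row with the totals.
per_group = con.execute(
    "SELECT ItemGroupName, SUM(Quantity), SUM(NetPrice) FROM ItemDetail "
    "GROUP BY ItemGroupName"
).fetchall()
con.close()
```

With the item-level columns in the GROUP BY, each item stays on its own row; grouping by the item group alone collapses them into a single row with quantity 11.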
First of all welcome to the overflow...
Second: The answer is going to be "It depends"
Any time you aggregate data you will need to Group by the other fields in the query, and you have that in the query. The gotcha is what happens when data is spread across multiple locations.
My suggestion is to rethink your problem and see if you really need these other fields in the query. This will depend on what the person using the data really wants to know.
Do they need to know how many of item X there are, or do they really need to know that item X is spread out over three sites?
You might find you are better off with two smaller queries.

OFFSET and FETCH causing massive performance hit on a query - including when OFFSET = 0

I have a fairly complex SQL query that involves returning about 20 columns from a large number of joins, used to populate a grid of results in a UI. It also uses a couple of CTEs to pre-filter the results. I've included an approximation of the query below (I've commented out the lines that fix the performance)
As the amount of data in the DB increased, the query performance tanked pretty hard, with only about 2500 rows in the main table 'Contract'.
Through experimentation, I found that by just removing the order, offset fetch at the end the performance went from around 30sec to just 1 sec!
order by 1 OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
This makes no sense to me. The final line should be pretty cheap, free even when the OFFSET is zero, so why is it adding 29secs on to my query time?
In order to maintain the same function for the SQL, I adapted it so that I first select into #TEMP, then perform the above order-offset-fetch on the temp table, then drop the temp table. This completes in about 2-3 seconds.
My 'optimisation' feels pretty wrong, surely there's a more sane way to achieve the same speed?
I haven't extensively tested this for larger datasets, it's essentially a quick fix to get performance back for now. I doubt it will be efficient as the data size grows.
Other than the Clustered Indexes on the primary keys, there are no indexes on the tables. The Query Execution plan didn't appear to show any major bottlenecks, but I'm not an expert on interpreting it.
WITH tableOfAllContractIdsThatMatchRequiredStatus(contractId)
AS (
SELECT DISTINCT c.id
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE
ISNULL(s.Deleted, '0') = 0
AND ss.status in ('saved')
)
,tableOfAllStatusesForAContract(contractId, status)
AS (
SELECT DISTINCT c.id, ss.status
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE ss.SupplyType IN ('Electricity') AND ISNULL(s.Deleted, '0') = 0
)
SELECT
[Contract].[Id]
,[Contract].[IsMultiSite]
,statuses.StatusesAsCsv
... lots more columns
,[WaterSupply].[Status] AS ws
--INTO #temp
FROM
(
SELECT
tableOfAllStatusesForAContract.contractId,
string_agg(status, ', ') AS StatusesAsCsv
FROM
tableOfAllStatusesForAContract
GROUP BY
tableOfAllStatusesForAContract.contractId
) statuses
JOIN contract ON Contract.id = statuses.contractId
JOIN tableOfAllContractIdsThatMatchRequiredStatus ON tableOfAllContractIdsThatMatchRequiredStatus.contractId = Contract.id
JOIN Site ON contract.Id = site.contractId and site.isprimarySite = 1 AND ISNULL(Site.Deleted,0) = 0
... several more joins
JOIN [User] ON [Contract].ownerUserId = [User].Id
WHERE isnull(Deleted, 0) = 0
AND
(
[Contract].[Id] = '12659'
OR [Site].[Id] = '12659'
... often more search term type predicates here
)
--select * from #temp
order by 1
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
--drop table #temp
I've not had an answer, so I'm going to try and explain it myself, with my admittedly poor understanding of how SQL works and some pointers from Jeroen in comments above. It's probably not right, but from what I've discovered it could be correct, and I do know how to fix my immediate problem so it could help others.
I'll explain it with an analogy, as this is what I believe is probably happening:
Imagine you're a chef in a restaurant, and you have to prepare a large number of meals (rows in results). You know there's going to be a lot, as your front of house has told you this (TOP 10 or FETCH 10).
You spend time setting out the multitude of ingredients required (table joins) and the equipment you'll need, and as the first order comes in, you make sure you're going to be really efficient: chopping up more than you need for the first order and putting it in little bowls ready to use on the subsequent orders. The first order takes you quite a while (30 secs) as you're planning ahead and want the subsequent dishes to go out as fast as possible.
However, as you sit in the kitchen waiting for the next orders, they don't arrive. That's it, just one order. Well, that was a waste of time! If you'd just tried to get one dish out, you could have done it much faster (1 sec), but you were planning ahead for something that was never needed.
The next night, you ditch your previous strategy and just do each plate at a time. However this time, there are 100s of customers. You can't deliver them fast enough doing them one at a time. The amount of time to deliver all the orders would have been much faster if you'd planned ahead like the previous night. (I've not tested this hypothesis, but I expect it is what would probably happen).
For my query, I don't know if there's going to be 1 result or 100s. I may be able to do some analysis up front based on the search criteria entered by the user, or adapt my UI to give me more information so I can predict this better and pick the appropriate strategy for SQL to use. As it is, I'm optimised for a small number of results, which works fine for now, but I need to do some more extensive testing to see how performance is affected as the dataset grows.
"If you want an answer to something, post something that's wrong on the internet and someone will be sure to correct you"

Identify the correct linkage between two tables' rows based on two conditions

I hope I am explaining this clearly enough for someone to figure the result out.
I left joined two tables based on one common variable which represents in this case NHS number. Both tables have a unique row identifier but they are independent of each other (ID2, TUMOUR_STAGE_LINENO) (if that makes sense). My problem is to identify the correct linkage between them based on a minimum days difference between the dates of diagnosis from the two tables but also each row identifier should not be linked twice to the other row identifiers. I will show you an extract of the linked data with a few examples.
UPDATE NBOCAP.dbo.temp1
SET LINKAGE = 'TRUE'
FROM NBOCAP.dbo.temp1
JOIN (SELECT ID2, MIN(DAYSDIFF) D
FROM NBOCAP.dbo.temp1
GROUP BY ID2) AS X
ON temp1.ID2 = X.ID2
AND temp1.DAYSDIFF = X.D
AND DATE_OF_DIAGNOSIS < '2013-01-01'
WHERE temp1.TUMOUR_STAGE_LINENO IN (SELECT a.TUMOUR_STAGE_LINENO
FROM temp1 a
INNER JOIN temp1 b
ON b.TUMOUR_STAGE_LINENO <> a.TUMOUR_STAGE_LINENO)
I don't think my WHERE condition does anything in this instance....
The linked table is this:
Link to data extract
Apologies for not uploading the image here but I do not have 10 rep points.
As you can see with nhs numbers 2 & 6 I am getting the same TUMOUR_STAGE_LINENO linked twice to both ID2s because DAYSDIFF are the smallest. My question is how to write in sql that after looking at MIN(DAYSDIFF) to make sure that the next linked TUMOUR_STAGE_LINENO should not be the same as the first.
I appreciate everyone's time taken to look at this.
PS. As you may have noticed there is also the possibility of having same DAYSDIFF thus creating a duplicate linkage. That is probably another issue I need to consider.
It is important to mention that my interest is that the ID2 gets the correct linked row.
Many thanks
Adrian
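One possible reading of the requirement above (smallest DAYSDIFF first, and no ID2 or TUMOUR_STAGE_LINENO linked twice) is a greedy one-to-one matching. The sketch below is in Python rather than SQL and is only an illustration of that reading: the function name, the tuple layout and the tie-breaking by identifier are assumptions, and a greedy pass does not necessarily give the globally best assignment.

```python
def greedy_link(candidates):
    """candidates: iterable of (id2, lineno, daysdiff) pairs from the join.

    Returns a dict id2 -> lineno in which every id2 and every lineno
    appears at most once, preferring the smallest daysdiff.
    """
    links = {}
    used_linenos = set()
    # Smallest difference first; ties are broken by the identifiers so the
    # result is deterministic when two pairs share the same DAYSDIFF.
    for id2, lineno, _ in sorted(candidates, key=lambda c: (c[2], c[0], c[1])):
        if id2 not in links and lineno not in used_linenos:
            links[id2] = lineno
            used_linenos.add(lineno)
    return links


# Hypothetical candidate pairs for one patient: both ID2s are closest to "A".
pairs = [(1, "A", 3), (2, "A", 5), (1, "B", 10), (2, "B", 7)]
linked = greedy_link(pairs)
```

Here ID2 1 takes line "A" because its difference is smallest, so ID2 2 falls back to "B" instead of being linked to "A" a second time.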

Using a sort order column in a database table

Let's say I have a Product table in a shopping site's database to keep description, price, etc of store's products. What is the most efficient way to make my client able to re-order these products?
I create an Order column (integer) to use for sorting records but that gives me some headaches regarding performance due to the primitive methods I use to change the order of every record after the one I actually need to change. An example:
Id Order
5 3
8 1
26 2
32 5
120 4
Now what can I do to change the order of the record with ID=26 to 3?
What I did was creating a procedure which checks whether there is a record in the target order (3) and updates the order of the row (ID=26) if not. If there is a record in target order the procedure executes itself sending that row's ID with target order + 1 as parameters.
That causes to update every single record after the one I want to change to make room:
Id Order
5 4
8 1
26 3
32 6
120 5
So what would a smarter person do?
I use SQL Server 2008 R2.
Edit:
I need the order column of an item to be enough for sorting with no secondary keys involved. Order column alone must specify a unique place for its record.
In addition to all, I wonder if I can implement something like of a linked list: A 'Next' column instead of an 'Order' column to keep the next items ID. But I have no idea how to write the query that retrieves the records with correct order. If anyone has an idea about this approach as well, please share.
Update product set order = order+1 where order >= @valueChanged
This adds 1 to the value being changed and to every value after it in one statement.
Over time you'll get larger and larger "spaces" in your order, but it will still sort correctly. The spaces could eventually grow to the point of exceeding the range of an INT.
Alternate solution given desire for no spaces:
Imagine a procedure UpdateSortOrder with parameters @NewOrderVal, @IDToChange, @OriginalOrderVal.
It is a two-step process, depending on whether the row is moving up or down the sort.
IF @NewOrderVal < @OriginalOrderVal -- Moving down chain
BEGIN
-- Create space for the movement; no point in changing the original
UPDATE product SET [order] = [order] + 1
WHERE [order] BETWEEN @NewOrderVal AND @OriginalOrderVal - 1;
END
IF @NewOrderVal > @OriginalOrderVal -- Moving up chain
BEGIN
-- Create space for the movement; no point in changing the original
UPDATE product SET [order] = [order] - 1
WHERE [order] BETWEEN @OriginalOrderVal + 1 AND @NewOrderVal;
END
-- Finally update the one we moved to the correct value
UPDATE product SET [order] = @NewOrderVal WHERE ID = @IDToChange;
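The two-step shift can be tried on the sample data from the question. This is a minimal sketch using Python's sqlite3 module in place of SQL Server; the table layout, the column name sort_order and the function name move_product are assumptions made for the example.

```python
import sqlite3


def move_product(con, product_id, new_order):
    """Move one row to new_order, shifting only the rows in between."""
    (old_order,) = con.execute(
        "SELECT sort_order FROM product WHERE id = ?", (product_id,)
    ).fetchone()
    if new_order < old_order:
        # Moving towards the front: push the rows in between one step back.
        con.execute(
            "UPDATE product SET sort_order = sort_order + 1 "
            "WHERE sort_order BETWEEN ? AND ?",
            (new_order, old_order - 1),
        )
    elif new_order > old_order:
        # Moving towards the end: pull the rows in between one step forward.
        con.execute(
            "UPDATE product SET sort_order = sort_order - 1 "
            "WHERE sort_order BETWEEN ? AND ?",
            (old_order + 1, new_order),
        )
    # Finally place the moved row itself.
    con.execute(
        "UPDATE product SET sort_order = ? WHERE id = ?", (new_order, product_id)
    )
    con.commit()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, sort_order INTEGER)")
con.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(5, 3), (8, 1), (26, 2), (32, 5), (120, 4)],
)
move_product(con, 26, 3)
ordered_ids = [row[0] for row in con.execute("SELECT id FROM product ORDER BY sort_order")]
```

Only the row with ID 5 has to move to make room, and the sequence stays free of gaps.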
Regarding best practice; most environments I've been in typically want something grouped by category and sorted alphabetically or based on "popularity on sale" thus negating the need to provide a user defined sort.
Use the old trick that BASIC programs (amongst other places) used: jump the numbers in the order column by 10 or some other convenient increment. You can then insert a single row (indeed, up to 9 rows, if you're lucky) between two existing numbers (that are 10 apart). Or you can move row 370 to 565 without having to change any of the rows from 570 upwards.
Here is an alternative approach using a common table expression (CTE).
This approach respects a unique index on the SortOrder column, and will close any gaps in the sort order sequence that may have been left over from earlier DELETE operations.
/* For example, move Product with id = 26 into position 3 */
DECLARE @id int = 26
DECLARE @sortOrder int = 3
;WITH Sorted AS (
SELECT Id,
ROW_NUMBER() OVER (ORDER BY SortOrder) AS RowNumber
FROM Product
WHERE Id <> @id
)
UPDATE p
SET p.SortOrder =
(CASE
WHEN p.Id = @id THEN @sortOrder
WHEN s.RowNumber >= @sortOrder THEN s.RowNumber + 1
ELSE s.RowNumber
END)
FROM Product p
LEFT JOIN Sorted s ON p.Id = s.Id
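To see what the renumbering does, the same logic can be written out in plain Python. This is only an illustration of the CASE expression above, not a replacement for the UPDATE; the function name and the dict-based input are assumptions made for the example.

```python
def move_and_renumber(orders, moved_id, new_position):
    """orders: dict id -> current sort value (gaps allowed).

    Returns a new dict id -> position with the sequence 1..n and
    moved_id placed at new_position.
    """
    # Equivalent of ROW_NUMBER() over every row except the moved one.
    others = sorted((i for i in orders if i != moved_id), key=orders.get)
    result = {}
    for row_number, item_id in enumerate(others, start=1):
        # Rows at or after the target position shift back by one.
        result[item_id] = row_number + 1 if row_number >= new_position else row_number
    result[moved_id] = new_position
    return result


# Sample data with gaps left over from earlier deletes (hypothetical values).
current = {5: 30, 8: 10, 26: 20, 32: 50, 120: 40}
renumbered = move_and_renumber(current, 26, 3)
```

The gaps in the input values disappear because every row gets its position from the row number rather than from its old value.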
It is very simple. You need to have a "cardinality hole", i.e. large gaps between neighbouring order values.
Structure: you need to have 2 columns:
pk = 32-bit int
order = 64-bit bigint (BIGINT, NOT DOUBLE!)
Insert/Update:
When you insert the first record, set order = round(max_bigint / 2).
If you insert at the beginning of the table, set order = round("order of first record" / 2).
If you insert at the end of the table, set order = "order of last record" + round((max_bigint - "order of last record") / 2).
If you insert in the middle, set order = round(("order of record before" + "order of record after") / 2).
This method leaves a very large range to work with. If you get a constraint error, or if you think the remaining gaps are getting too small, you can rebuild the order column (normalize).
In the worst case, after normalization, this structure still leaves a gap of roughly 32 bits between neighbouring records.
It is very simple and fast!
Remember: NO DOUBLE! Only INT, because the order must be an exact value.
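The midpoint rule can be sketched in a few lines of Python. The function name order_between, the use of 0 as the lower bound and the ValueError used to signal "time to normalize" are assumptions made for the example.

```python
MAX_BIGINT = 2**63 - 1  # largest value of a signed 64-bit integer


def order_between(before=None, after=None):
    """Return an integer order value strictly between two neighbours.

    before/after are the order values of the neighbouring records;
    None means "start of table" or "end of table" respectively.
    """
    low = 0 if before is None else before
    high = MAX_BIGINT if after is None else after
    if high - low < 2:
        # No free integer left between the neighbours: rebuild the column.
        raise ValueError("no gap left, normalize the order column")
    # Integer midpoint; no floating point involved, so no precision loss.
    return low + (high - low) // 2


first = order_between()                # first record in an empty table
at_start = order_between(after=first)  # insert before the first record
middle = order_between(10, 20)         # insert between two neighbours
```

Staying with integer arithmetic is what the "no DOUBLE" warning is about: the midpoint is always an exact value.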
One solution I have used in the past, with some success, is to use a 'weight' instead of an 'order'. The idea is the obvious one: the heavier an item (i.e. the lower the number), the further it sinks to the bottom; the lighter (the higher the number), the further it rises to the top.
In the event I have multiple items with the same weight, I assume they are of the same importance and I order them alphabetically.
This means your SQL will look something like this:
ORDER BY weight, itemName
hope that helps.
I am currently developing a database with a tree structure that needs to be ordered. I use a link-list kind of method that will be ordered on the client (not the database). Ordering could also be done in the database via a recursive query, but that is not necessary for this project.
I made this document that describes how we are going to implement storage of the sort order, including an example in postgresql. Please feel free to comment!
https://docs.google.com/document/d/14WuVyGk6ffYyrTzuypY38aIXZIs8H-HbA81st-syFFI/edit?usp=sharing
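For the client-side ordering of a linked list ('Next' column) mentioned above, a minimal Python sketch could look like this. The function name and the (id, next_id) row shape are assumptions made for the example; it is not taken from the linked document.

```python
def order_linked_rows(rows):
    """rows: iterable of (id, next_id); next_id is None for the last row.

    Returns the ids in list order by following the 'next' pointers.
    """
    next_of = dict(rows)
    pointed_to = {n for n in next_of.values() if n is not None}
    # The head is the only row that no other row points to.
    heads = [i for i in next_of if i not in pointed_to]
    if len(heads) != 1:
        raise ValueError("expected exactly one head row")
    ordered = []
    current = heads[0]
    while current is not None:
        if len(ordered) >= len(next_of):
            # More steps than rows means the pointers form a cycle.
            raise ValueError("cycle detected in 'next' pointers")
        ordered.append(current)
        current = next_of[current]
    return ordered


# Hypothetical rows as they might come back from the table, in any order.
unordered = [(26, 5), (32, None), (8, 26), (120, 32), (5, 120)]
in_order = order_linked_rows(unordered)
```

Moving an item in this scheme only touches the 'next' values of its old and new neighbours, at the cost of doing the ordering in code (or in a recursive query) instead of a plain ORDER BY.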
