I have two SQL queries doing the same thing; the first query takes 13 seconds to execute while the second takes 1 second. Any reason why?
Not necessarily all the ids in ProcessMessages will have data in ProcessMessageDetails.
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0
--takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
Where Id = 4 and Isdone = 0 )
I have a clustered index on t1.ProcessMessageId (PK) and a non-clustered index on t2.ProcessMessageId (FK).
I would need the actual execution plans to tell you exactly what SQL Server is doing behind the scenes. I can tell you these queries aren't doing exactly the same thing.
The first query goes through and finds all of the rows that meet the conditions for t1, finds all of the rows for t2, and then works out which ones match and joins them together.
The second one says: first find all of the rows that meet my criteria in t1, and then find the rows in t2 that have one of those IDs.
Depending on your statistics, available indexes, hardware and table sizes, SQL Server may decide to do different types of scans or seeks to pick up the data for each part of the query, and it may also decide to join the data together in different ways.
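If you want to dig in yourself, a quick way to compare what each query is actually doing (just a sketch; I'm assuming Id and Isdone are columns of ProcessMessages) is to turn on the I/O and timing statistics, enable "Include Actual Execution Plan" in SSMS, and run both queries in one batch:
-- Show I/O and CPU/elapsed time for each statement in the Messages tab
SET STATISTICS IO, TIME ON;

Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where t1.Id = 4 and t1.Isdone = 0;

Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages
                            Where Id = 4 and Isdone = 0 );

SET STATISTICS IO, TIME OFF;
-- Compare the "logical reads" per table and the two actual plans side by side.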
The answer to your question is really simple: the first query you have used will generate more rows than the second query, so it takes more time to work through all those rows. That is why your first query took 13 seconds and the second one only one second.
So it is generally suggested that you apply your conditions before making your join; otherwise the number of rows increases and you will need more time to work through them once they are joined.
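For example (just a sketch of the idea, again assuming Id and Isdone live on ProcessMessages), you can filter first in a derived table and join only the matching rows:
-- Filter ProcessMessages down first, then join only those rows to the details table
Select d.*
from (
    Select ProcessMessageId
    from dbo.ProcessMessages
    Where Id = 4 and Isdone = 0
) pm
join dbo.ProcessMessageDetails d on d.ProcessMessageId = pm.ProcessMessageId;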
I have a long database of observations for individuals. There are multiple observations for each individual, each assigned a different medcodeid.
I want to extract all records for individuals with certain medcodeids assigned, but only if they have at some point been assigned one of a smaller list of specific codes.
This is an example of what I start with:
long dataset, multiple observations
and this is the records I'd like to extract:
multiple observations, but patients 3 and 5 are not extracted, as they never had a medcode 12
Would this be an additional WHERE clause? I am struggling, as that would then only extract the rows matching the second, smaller medcodeid list. But I want it to extract all of the rows if the individual has had one of those fewer codes at some point. I hope that makes some sense. I am unfamiliar with the IF command, and cannot see how CASE WHEN would work either.
Thank you very much in advance!
You definitely don't want to filter out all the rows, so you're right that an additional condition won't help with that. And WHERE only lets you look at the current row, while you're trying to make a decision based on all the rows belonging to the patient.
This query just uses a common table expression and an analytic count() that tags each row with the number of matches; the analytic function lets you look outside the current row, just like you need.
-- my additions to your query are in lowercase
with data as (
SELECT obs.patid, yob, obsdate, medcodeid,
count(case when medcodeid IN (<list of mandatory codes>) then 1 end)
over (partition by obs.patid) as medcode_count
-- assuming the relationship looks something like this
from obs inner join medcode on medcode.patid = obs.patid
WHERE medcodeid IN (<list of codes>)
AND obsdate BETWEEN '2004-12-31' AND GETDATE()
AND patienttypeid = 3 AND acceptable = 1 AND gender = 2
AND YEAR(obsdate) - yob > 15 AND YEAR(obsdate) - yob < 45
)
select * from data where medcode_count > 0;
At first I thought you were requiring that at least five of the codes from the full set were found. Now that you've edited the question I believe that you want to require that at least one code from a smaller subset is present. Either way this approach will work.
If I'm understanding what you're asking, I think what you need is an additional WHERE clause with a subquery. This could be done with an EXISTS or a join, but I find an IN query easier to work with.
You left the FROM out of your query, so I had to guess at it, but try this:
SELECT
obs.patid,
yob,
obsdate,
medcodeid
FROM
obs
WHERE
medcodeid IN (list of 20 codes)
AND (obsdate BETWEEN '2004-12-31' AND GETDATE())
AND patienttypeid = 3
AND acceptable = 1
AND gender = 2
AND ((YEAR(obsdate))-yob) > 15
AND ((YEAR(obsdate)) - yob) < 45
AND obs.patid IN (
SELECT
obs.patid
FROM
obs
WHERE
medcodeid IN (5 of the 20 codes)
);
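For reference, the EXISTS version mentioned above would look roughly like this (same placeholders as before; o2 is just an alias for the second pass over obs):
-- Same logic with EXISTS instead of IN (sketch)
SELECT
    obs.patid,
    yob,
    obsdate,
    medcodeid
FROM
    obs
WHERE
    medcodeid IN (list of 20 codes)
    AND (obsdate BETWEEN '2004-12-31' AND GETDATE())
    AND patienttypeid = 3
    AND acceptable = 1
    AND gender = 2
    AND ((YEAR(obsdate)) - yob) > 15
    AND ((YEAR(obsdate)) - yob) < 45
    AND EXISTS (
        SELECT 1
        FROM obs o2
        WHERE o2.patid = obs.patid
          AND o2.medcodeid IN (5 of the 20 codes)
    );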
I have a fairly complex SQL query that involves returning about 20 columns from a large number of joins, used to populate a grid of results in a UI. It also uses a couple of CTEs to pre-filter the results. I've included an approximation of the query below (I've commented out the lines that fix the performance)
As the amount of data in the DB increased, the query performance tanked pretty hard, with only about 2500 rows in the main table 'Contract'.
Through experimentation, I found that by just removing the ORDER BY / OFFSET FETCH at the end, the performance went from around 30 seconds to just 1 second!
order by 1 OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
This makes no sense to me. The final line should be pretty cheap, free even when the OFFSET is zero, so why is it adding 29 seconds to my query time?
In order to maintain the same function for the SQL, I adapted it so that I first select into #TEMP, then perform the above order-offset-fetch on the temp table, then drop the temp table. This completes in about 2-3 seconds.
My 'optimisation' feels pretty wrong, surely there's a more sane way to achieve the same speed?
I haven't extensively tested this for larger datasets; it's essentially a quick fix to get performance back for now. I doubt it will be efficient as the data size grows.
Other than the Clustered Indexes on the primary keys, there are no indexes on the tables. The Query Execution plan didn't appear to show any major bottlenecks, but I'm not an expert on interpreting it.
WITH tableOfAllContractIdsThatMatchRequiredStatus(contractId)
AS (
SELECT DISTINCT c.id
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE
ISNULL(s.Deleted, '0') = 0
AND ss.status in ('saved')
)
,tableOfAllStatusesForAContract(contractId, status)
AS (
SELECT DISTINCT c.id, ss.status
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE ss.SupplyType IN ('Electricity') AND ISNULL(s.Deleted, '0') = 0
)
SELECT
[Contract].[Id]
,[Contract].[IsMultiSite]
,statuses.StatusesAsCsv
... lots more columns
,[WaterSupply].[Status] AS ws
--INTO #temp
FROM
(
SELECT
tableOfAllStatusesForAContract.contractId,
string_agg(status, ', ') AS StatusesAsCsv
FROM
tableOfAllStatusesForAContract
GROUP BY
tableOfAllStatusesForAContract.contractId
) statuses
JOIN contract ON Contract.id = statuses.contractId
JOIN tableOfAllContractIdsThatMatchRequiredStatus ON tableOfAllContractIdsThatMatchRequiredStatus.contractId = Contract.id
JOIN Site ON contract.Id = site.contractId and site.isprimarySite = 1 AND ISNULL(Site.Deleted,0) = 0
... several more joins
JOIN [User] ON [Contract].ownerUserId = [User].Id
WHERE isnull(Deleted, 0) = 0
AND
(
[Contract].[Id] = '12659'
OR [Site].[Id] = '12659'
... often more search term type predicates here
)
--select * from #temp
order by 1
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
--drop table #temp
I've not had an answer, so I'm going to try to explain it myself, with my admittedly poor understanding of how SQL works and some pointers from Jeroen in the comments above. It's probably not right, but from what I've discovered it could be, and I do know how to fix my immediate problem, so it could help others.
I'll explain it with an analogy, as this is what I believe is probably happening:
Imagine you're a chef in a restaurant, and you have to prepare a large number of meals (rows in the results). You know there are going to be a lot, as your front of house has told you so (TOP 10 or FETCH 10).
You spend time setting out the multitude of ingredients required (table joins) and the equipment you'll need, and as the first order comes in, you make sure you're going to be really efficient. You chop up more than you need for the first order, putting it in little bowls ready to use on the subsequent orders. The first order takes you quite a while (30 secs) because you're planning ahead and want the subsequent dishes to go out as fast as possible.
However, as you sit in the kitchen waiting for the next orders... they don't arrive. That's it, just one order. Well, that was a waste of time! If you'd just tried to get one dish out, you could have done it much faster (1 sec), but you were planning ahead for something that was never needed.
The next night, you ditch your previous strategy and just do each plate one at a time. However, this time there are hundreds of customers. You can't deliver the dishes fast enough doing them one at a time. All the orders would have gone out much sooner if you'd planned ahead like the previous night. (I've not tested this hypothesis, but I expect that is what would probably happen.)
For my query, I don't know whether there's going to be 1 result or hundreds, although I may be able to do some analysis up front based on the search criteria entered by the user. I may have to adapt my UI to give me more information so I can predict this better, which would let me pick the appropriate strategy for SQL to use up front. As it is, I'm optimised for a small number of results, which works fine for now, but I need to do some more extensive testing to see how performance is affected as the dataset grows.
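One thing I've since come across but haven't tried on this query (so treat it as a sketch): SQL Server 2016 SP1 and later have a documented query hint that tells the optimizer not to plan around the small "row goal" that TOP / FETCH introduces, so it plans as if it had to return everything. Applied to a cut-down version of the query above, it would look like this:
-- Sketch only: keep the paging but ask the optimizer to ignore the row goal
SELECT [Contract].[Id], [Contract].[IsMultiSite]
FROM [Contract]
ORDER BY 1
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL'));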
"If you want a answer to something, post something that's wrong on the internet and someone will be sure to correct you"
In SQL Server, if I try the following query:
select id from table
order by id
offset 1000000 ROWS
fetch next 1000000 ROWS ONLY;
How will SQL Server execute this? What strategy does SQL Server use?
1. Sort the whole table first and then select the 1 million rows we need, or
2. Sort only part of the table and then return the 1 million rows we need.
I assume it is the 2nd option. If so, how does SQL Server decide which range of the table to sort?
Edit 1:
I am asking this question to understand what could cause the query to be slow. I am testing with two queries:
--Query 1:
select id from table
order by id
offset 1 ROWS
fetch next 1 ROWS ONLY;
and
--Query 2:
select id from table
order by id
offset 1000000000 ROWS
fetch next 1 ROWS ONLY;
I found the second query takes about 30 minutes to finish while the first takes almost 0 seconds.
So I am curious about what causes this difference. Do the two spend the same time on the ORDER BY? (Does it even really sort the whole table? The id is the clustered index column of the table, and I cannot imagine it takes 0 seconds to sort a terabyte table.)
Then, if the sorting takes the same time, the only difference would be the clustered index scan. The first query only needs to scan the first 1 or 10 (a small number of) rows, while the second query needs to scan a much bigger number of rows (> 1000000000). But I am not quite sure if this is correct.
Thank you for your help!
Let me take a simple example:
order by id
offset 50 rows fetch next 25 rows only
For the above query, the steps would be:
1. The table has to be sorted by id (if it isn't, you pay the penalty of a sort; there is no partial sort, it is always a full sort).
2. Then scan 50 + 25 rows (paying the cost of 75 rows) and return the last 25 rows only.
In an orders table I have (orderid is the PK, so it is already sorted), even though we are getting only 20 rows, you are paying the cost of 120 rows...
Coming to your question, there is no partial sort (which points to your first option as far as the sort is concerned), even if you try to return only one row like below:
select top 1 * from table
order by orderid
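As an aside (not what you asked, but it is the usual way to avoid paying for all the skipped rows): if id is your clustered index, you can page with a seek instead of an offset by remembering the last id of the previous page. A sketch, where @last_id is a hypothetical variable holding the last id already returned:
-- Keyset/seek paging sketch: never scans the rows that were already paged over
DECLARE @last_id int = 1000000;   -- last id returned on the previous page (hypothetical)

SELECT TOP (25) id
FROM [table]                      -- same placeholder table name as in the question
WHERE id > @last_id               -- seek straight past the skipped rows
ORDER BY id;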
I have a query, that I did not write, that takes 2.5 minutes to run. I am trying to optimize it without being able to modify the underlying tables, i.e. no new indexes can be added.
During my optimization troubleshooting I commented out a filter, and all of a sudden my query ran in .5 seconds. I have messed with the formatting and placement of that filter, and if it is there the query takes 2.5 minutes; without it, .5 seconds. The biggest problem is that the filter is not on the table that is being table-scanned (with over 300k records); it is on a table with 300 records.
The "Actual Execution Plan" of both the 0:0:0.5 and the 0:2:30 runs is identical, down to the exact percentage costs of all steps:
Execution Plan
The only difference is that on the table-scanned table, the "Actual Number of Rows" in the 2.5 minute query shows 3.7 million rows, while the table only has 300k rows; the .5 second query shows an Actual Number of Rows of 2,063. The filter is actually being placed on the FS_EDIPartner table, which only has 300 rows.
With the filter I get the correct 51 records, but it takes 2.5 minutes to return. Without the filter I get duplication, so I get 2,796 rows, but it only takes half a second to return.
I cannot figure out why adding the filter to a table with 300 rows and a correct index causes the table scan of a different table to show such a significant difference in actual number of rows. I am even using the table-scanned table as a sub-query to filter its records down from 300k to 17k prior to doing the join. Here is the actual query in its current state; sorry the tables don't make a lot of sense, but I could not reproduce this behavior with test data.
SELECT dbo.FS_ARInvoiceHeader.CustomerID
, dbo.FS_EDIPartner.PartnerID
, dbo.FS_ARInvoiceHeader.InvoiceNumber
, dbo.FS_ARInvoiceHeader.InvoiceDate
, dbo.FS_ARInvoiceHeader.InvoiceType
, dbo.FS_ARInvoiceHeader.CONumber
, dbo.FS_EDIPartner.InternalTransactionSetCode
, docs.DocumentName
, dbo.FS_ARInvoiceHeader.InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
INNER JOIN dbo.FS_EDIPartner ON dbo.FS_ARInvoiceHeader.CustomerID = dbo.FS_EDIPartner.CustomerID
LEFT JOIN (Select DocumentName
FROM GentranDatabase.dbo.ZNW_Documents
WHERE DATEADD(SECOND,TimeCreated,'1970-1-1') > '2016-06-01'
AND TransactionSetID = '810') docs on dbo.FS_ARInvoiceHeader.InvoiceNumber = docs.DocumentName COLLATE Latin1_General_BIN
WHERE docs.DocumentName IS NULL
AND dbo.FS_ARInvoiceHeader.InvoiceType = 'I'
AND dbo.FS_ARInvoiceHeader.InvoiceStatus <> 'Y'
--AND (dbo.FS_EDIPartner.InternalTransactionSetCode = '810')
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'CB%'))
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'DM%'))
AND InvoiceDate > '2016-06-01'
The commented-out line in the WHERE clause is the culprit; uncommenting it causes the 2.5 minute run.
It could be that the table statistics have gotten out of whack. These include the number of records each table has, which is used to choose the best query plan. Try running this and then running the query again:
EXEC sp_updatestats
Using @Jeremy's comment as a guideline, pointing out that the Actual Number of Rows was not my problem but rather the number of executions, I figured out that the Hash Match took .5 seconds and the Nested Loops took 2.5 minutes. Trying to force the Hash Match using LEFT HASH JOIN was inconsistent depending on what the other filters were set to; changing dates took it from .5 seconds to 30 seconds sometimes. So forcing the hash (which is highly discouraged anyway) wasn't a good solution. Finally I resorted to moving the poorly performing view into a stored procedure, splitting out both of the tables that were related to the poor performance into table variables, and then joining those table variables. This gave the most consistently good performance: on average the SP returns in less than 1 second, which is far better than the 2.5 minutes it started at.
@Jeremy gets the credit, but since his wasn't an answer, I thought I would document what was actually done in case someone else stumbles across this later.
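For anyone curious, the shape of what I ended up with is roughly this (just a sketch: the column lists and types are trimmed and guessed, and the real procedure carries everything the view needed):
-- Pre-filter each of the two problem tables into a table variable, then join the variables
DECLARE @Partners TABLE (CustomerID int, PartnerID varchar(50), InternalTransactionSetCode varchar(10));
DECLARE @Invoices TABLE (CustomerID int, InvoiceNumber varchar(50), InvoiceDate date, InvoiceStatus char(1));

INSERT INTO @Partners (CustomerID, PartnerID, InternalTransactionSetCode)
SELECT CustomerID, PartnerID, InternalTransactionSetCode
FROM dbo.FS_EDIPartner
WHERE InternalTransactionSetCode = '810';

INSERT INTO @Invoices (CustomerID, InvoiceNumber, InvoiceDate, InvoiceStatus)
SELECT CustomerID, InvoiceNumber, InvoiceDate, InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
WHERE InvoiceType = 'I'
  AND InvoiceStatus <> 'Y'
  AND InvoiceDate > '2016-06-01'
  AND CONumber NOT LIKE 'CB%'
  AND CONumber NOT LIKE 'DM%';

SELECT i.CustomerID, p.PartnerID, i.InvoiceNumber, i.InvoiceDate, i.InvoiceStatus
FROM @Invoices i
INNER JOIN @Partners p ON p.CustomerID = i.CustomerID;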
I have two tables:
The table LINKS has:
LINK_ID --- integer, unique ID
FROM_NODE_X -- numbers/floats, indicating a geographical position
FROM_NODE_Y --
FROM_NODE_Z --
TO_NODE_X --
TO_NODE_Y --
TO_NODE_Z --
The table LINK_COORDS has:
LINK_ID --- integer, refers to above UID
ORDER --- integer, indicating order
X ---
Y ---
Z ---
Logically each LINK consists of a number of waypoints. The final order is:
FROM_NODE , 1 , 2 , 3 , ... , TO_NODE
A link has at least two waypoints (FROM_NODE, TO_NODE), but can have a variable number of waypoints in between (0 to 100+).
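In DDL terms the two tables look roughly like this (types are illustrative, not the actual definitions):
-- Rough shape of the two tables, for reference
CREATE TABLE links (
    link_id     integer PRIMARY KEY,
    from_node_x double precision,
    from_node_y double precision,
    from_node_z double precision,
    to_node_x   double precision,
    to_node_y   double precision,
    to_node_z   double precision
);

CREATE TABLE link_coords (
    link_id integer REFERENCES links (link_id),
    "ORDER" integer,            -- position of the waypoint within the link
    x double precision,
    y double precision,
    z double precision
);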
I now need a way to aggregate, sort and store the waypoints of each link in an array, which will later be used to draw a line.
I'm struggling with the LINK_COORDS being available as individual rows. Having the start and end positions in the other (LINKS) table doesn't help either. If I had a way to at least get all the LINK_COORDS joined/updated into the LINKS table, I could probably work out the rest myself. So if you have an idea on how to get that far, it'd be much appreciated already.
Considering performance would be nice (the tables have somewhere between 500k and 1 million entries now and will have multiples of that later), but it is not essential for now.
EDIT:
Thanks for the suggestion, a-horse-with-no-name.
I chose to create the point geometries (PostGIS) for each XYZ before this step, so in the end there's only an array of points to create from the individual points.
The adapted SQL
UPDATE "Link"
SET "POINTS" =
array_append(
(array_prepend(
"FROM_POINT",
(SELECT array_agg(lc."POINT" ORDER BY lc."COUNT")
FROM "LinkCoordinate" lc
WHERE lc."LINK_ID" = "Link"."LINK_ID")))
, "TO_POINT")
however is running extremely slow:
Running it on a sample of 10 links took ~120 seconds. Running it for all 1.3 million links, and many more link coordinates, would probably take somewhere around half a year. Not really ideal.
How can I figure out where this immense slowness originates?
If I got the source data in a pre-ordered format (i.e. the link coordinates grouped by LINK_ID), would that allow me to significantly speed up the SQL query?
EDIT: It appears the main slowdown originates from the SELECT subquery used in the array_agg() function. Everything else (incl. ordering) does not really cause any slowdown.
My current guess is that the SELECT query iterates over the entirety of "LinkCoordinate" for each and every link, making it work much harder than it has to, as all LinkCoordinates belonging to a Link are always stored in 'blocks' of rows. A single, sequential processing of the LinkCoordinates would be sufficient, really.
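If that guess is right, it should show up as a sequential scan on "LinkCoordinate" inside the subplan. A way to check on a small sample of links, plus the usual fix assuming there is no index on the foreign key column yet (sketch):
-- Look at where the time goes for a small sample of links
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE "Link"
SET "POINTS" =
    array_append(
        array_prepend(
            "FROM_POINT",
            (SELECT array_agg(lc."POINT" ORDER BY lc."COUNT")
             FROM "LinkCoordinate" lc
             WHERE lc."LINK_ID" = "Link"."LINK_ID")),
        "TO_POINT")
WHERE "Link"."LINK_ID" IN (SELECT "LINK_ID" FROM "Link" LIMIT 10);
ROLLBACK;  -- EXPLAIN ANALYZE really executes the statement, so roll the changes back

-- If the subplan shows a Seq Scan over "LinkCoordinate" for every link, index the lookup column
CREATE INDEX ON "LinkCoordinate" ("LINK_ID");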
something like this maybe:
select l.link_id,
min(l.from_node_x) as from_node_x,
min(l.from_node_y) as from_node_y,
min(l.from_node_z) as from_node_z,
array_agg(lc.x order by lc."ORDER") as points_x,
array_agg(lc.y order by lc."ORDER") as points_y,
array_agg(lc.z order by lc."ORDER") as points_z,
min(l.to_node_x) as to_node_x,
min(l.to_node_y) as to_node_y,
min(l.to_node_z) as to_node_z
from links l
join link_coords lc on lc.link_id = l.link_id
group by l.link_id;
The min() is necessary because of the group by, but it won't change the result, as all values coming from the links table are the same within each group anyway.
Another possibility is to use a scalar subquery. I'm unsure which of them is faster though - the join/group by is probably more efficient.
select l.link_id,
l.from_node_x,
l.from_node_y,
l.from_node_z,
(select array_agg(lc.x order by lc."ORDER") from link_coords lc where lc.link_id = l.link_id) as points_x,
(select array_agg(lc.y order by lc."ORDER") from link_coords lc where lc.link_id = l.link_id) as points_y,
(select array_agg(lc.z order by lc."ORDER") from link_coords lc where lc.link_id = l.link_id) as points_z,
l.to_node_x,
l.to_node_y,
l.to_node_z
from links l