What technique should I use for Optimizing the SQL Query

What technique should I use for Optimizing the SQL Query - sql-server

Hi I have a stored procedure that is used to fetch records while searching. This procedure returns millions of records. However there was a bug found inside the search procedure which also return duplicate records in some scenario when certain condition are met. I have found the error why it was returning duplicate records: Below is the query that is in question:
With cteAutoApprove (AcctID, AutoApproved,DecisionDate)
AS (
select
A.AcctID,
CAST(autoEnter AS SMALLINT) AS AutoApproved,
DecisionDate
from
(
SELECT
awt.AcctID,
MIN(awt.dtEnter) AS DecisionDate
FROM
dbo.AccountWorkflowTask awt
JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
Join Task T on T.TaskID = wt.TaskID
WHERE
(
(T.TaskStageID = 3 and awt.ReasonIDExit is NULL)
OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
)
GROUP BY
awt.AcctID
) A
Join AccountWorkflowTask awt1
on awt1.dtEnter=A.DecisionDate and awt1.AcctID=a.AcctID
),
This CTE was returning duplicate record because of the condition on awt1.dtEnter=A.DecisionDate the dtEnter for some account was exactly same. This is the reason it returned duplicate record.
My question is what should I use to prevent this. I cannot use Distinct here as it will definitely slow down the search procedure. Shall I use Rank or Dense Rank so that it is optimized and the query takes less time to execute the result? Or some other technique? Please help as I am actually stuck here

It does seem like a good candidate for row_number (not rank, with the same dates on the same acctid, you'd still have multiple records)
Obviously I can't test the query here, but winging it:
select
A.AcctID,
CAST(autoEnter AS SMALLINT) AS AutoApproved,
DecisionDate
from
(
SELECT
awt.AcctID,
awt.dtEnter AS DecisionDate,
autoEnter,
row_number() over (partition by awt.acctid order by awt.dtEnter) rnr
FROM
dbo.AccountWorkflowTask awt
JOIN dbo.WorkflowTask wt ON awt.WorkflowTaskID = wt.WorkflowTaskID
Join Task T on T.TaskID = wt.TaskID
WHERE
(
(T.TaskStageID = 3 and awt.ReasonIDExit is NULL)
OR (wt.TaskID IN (9,15,201,208,220,308,319,320,408,420,508,608,620,1470,1608,1620))
)
) A
where rnr = 1
This way, the group by is no longer necessary: getting the first date is done by row_number. Neither is the second join, the subquery already contains all the data (and the optimizer is smart enough not to do anything with the rows it doesn't need)
PS. because sql server window functions work incredibly efficient, using row_number instead of the min() - join construction, will most likely gain a performance boost, even if there were no double rows.

Related

SQL Selecting all but newest result per id

I need to set a "waived" flag in my table for all but the newest result per id. I thought I had a query that will work here, but when I run a select on the query, I'm getting incorrect results - I saw one case where it selected both of the only two results for a particular id. I'm also getting multiple results with the same exact data.
What am I doing wrong here?
Here's my select statement:
select t.test_row_id, t.test_result_id, t.waived, t.pass, t.comment
from EV.Test_Result
join EV.Test_Result as t on EV.Test_Result.test_row_id = t.test_row_id and EV.Test_Result.start_time < t.start_time and t.device_id = 1219 and t.waived = 0
order by t.test_row_id
Here's the actual query I want to run:
update EV.Test_Result
set waived = 1
from EV.Test_Result
join EV.Test_Result as t on EV.Test_Result.test_row_id = t.test_row_id and EV.Test_Result.start_time < t.start_time and t.device_id = 1219 and t.waived = 0

If I understand this correctly, you are having problems because the Cardinality of the ON predicate returns all matching rows.
EV.Test_Result.test_row_id = t.test_row_id
and EV.Test_Result.start_time < t.start_time
This ON will compare all of the start_time values that have the same id and return every combination of result sets where start_time is lesser than the t.start_time. Clearly, this is not what you want.
and t.device_id = 1219
and t.waived = 0
This is actually a predicate (ON technically is one), but I would prefer to use this in a subquery/CTE for several reasons: You limit the number of rows SQL has to retrieve and compare.
Something like the following might be what you needed:
SELECT A.test_row_id
, A.test_result_id
, A.waived
, A.pass
, A.comment
FROM EV.Test_Result A
INNER JOIN (SELECT MAX(start_time) AS start_time
, test_row_id
FROM EV.Test_Result
WHERE device_id = 1219
AND waived = 0
GROUP BY test_row_id
) AS T ON A.test_row_id = T.test_row_id
AND A.start_time < T.start_time
ORDER BY A.test_row_id
This query then returns a 1:M relationship between the values in the ON predicate, unlike the M:M query you had run.
UPDATE:
Since I sheepishly screwed up trying to alter my Query on SO, I'll redeem myself by explaining the physical and logical orders of basic SQL Query operators:
As you know, you write a simple SELECT statement like the following:
SELECT <aggregate column>, SUM(<non-aggregate column>) AS Cost
FROM <table_name>
WHERE <column> = 'some_value'
GROUP BY <aggregate column>
HAVING SUM(<non-aggregate column>) > some_value
ORDER BY <column>
Note that if you use a aggregate function, all other columns MUST appear in the GROUP BY or another function.
Now, SQL Server requires them to be written in that order although it actually processes this logically by the following order that is worth memorizing:
FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY
There are more details found on SELECT - MSDN, but this is why any columns in the SELECT operator must be in the group by or in a aggregate function (SUM, MIN, MAX, etc)...and also why my lazy code failed on your first attempt. :/
Note also that the ORDER BY is last (technically TOP operator occurs after this), and that without it the result is not deterministic unless a function such as DENSE_RANK enforces it (thought this occurs in the SELECT statement).
Hope this helps solve the problem and better yet how SQL works. Cheers

Can you try ROW_NUMBER () function order by timestamp descending and filtering out values having ROW_NUMBER 1 ;
Below query should fetch all records per id except the latest one
I tried below query in Oracle with a table having fields : id,user_id, record_order adn timestamp and it worked :
select
<table_name_alias>.*
from
(
select
id,
user_id,
row_number() over (partition by id order by record_order desc) as record_number
from
<your_table_name>
) <table_name_alias>
where
record_number <>1;
If you are using Teradata DB, you can also try QUALIFY statement. I'm not sure if all DBs support this.
Select
table_name.*
from table_name
QUALIFY row_number() over (partition by id order by record_order desc) <>1;

PostgreSQL Inserted rows differ from select

I have a problem with an INSERT in PostgreSQL. I have this query:
INSERT INTO track_segments(tid, gdid1, gdid2, distance, speed)
SELECT * FROM (
SELECT DISTINCT ON (pga.gdid)
pga.tid as ntid,
pga.gdid as gdid1, pgb.gdid as gdid2,
ST_Distance(pga.geopoint, pgb.geopoint) AS segdist,
(ST_Distance(pga.geopoint, pgb.geopoint) / EXTRACT(EPOCH FROM (pgb.timestamp - pga.timestamp + interval '0.1 second'))) as speed
FROM fl_pure_geodata AS pga
LEFT OUTER JOIN fl_pure_geodata AS pgb ON (pga.timestamp < pgb.timestamp AND pga.tid = pgb.tid)
ORDER BY pga.gdid ASC) AS sq
WHERE sq.gdid2 IS NOT NULL;
to fill a table with pairwise connected segements of geopoints. When I run the SELECT alone I get the correct pairs, but when I use it in the statement above, then some are paired the wrong way or not at all. Here's what I mean:
result of SELECT alone:
tid;gdid1;gdid2;distance;speed
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";10;11;34.105058803;31.0045989118182
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";11;12;90.099603143;14.7704267447541
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";12;13;23.331326565;21.2102968772727
result after INSERT with the same SELECT:
tid;gdid1;gdid2;distance;speed
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";10;12;122.574;17.2639603638028
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";11;12;90.0996;14.7704267447541
"0f6fd522-5f1e-49a4-b85e-50f11ef7f908";12;13;23.3313;21.2102968772727
What be the cause of that? It's exactly the same SELECT statement for the INSERT, so why does it give different results?

DISTINCT ON (pga.gdid) can pick any row from a set with equal pga.gdid. You can get different result even by execution the same query for several times. Add additional ordering to get consistent results. something like: pga.gdid ASC, pgb.gdid ASC
BTW You may want to order by pga.gdid ASC, pgb.timestamp - pga.timestamp ASC to get the "next" point.
BTW2 It may be easier to use lead() or lag() window functions to calculate differences between current row and next/previous. This way you wont need a self join and will likely get better performance.

You are ordering your query results only by the column pga.gdid, which is the same in all the rows, so postgres will order the results in a different way each time you do the select query.

Adding Conditional around query increases time by over 2400%

Update: I will get query plan as soon as I can.
We had a poor performing query that took 4 minutes for a particular organization. After the usual recompiling the stored proc and updating statistics didn't help, we re-wrote the if Exists(...) to a select count(*)... and the stored procedure when from 4 minutes to 70 milliseconds. What is the problem with the conditional that makes a 70 ms query take 4 minutes? See the examples
These all take 4+ minutes:
if (
SELECT COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)) > 0
print 'records';
-
IF (EXISTS(
SELECT *
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL))
print 'records'
This all take 70 milliseconds:
Declare #recordCount INT;
SELECT #recordCount = COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL);
IF(#recordCount > 0)
print 'records';
It doesn't make sense to me why moving the exact same Count(*) query into an if statement causes such degradation or why 'Exists' is slower than Count. I even tried the exists() in a select CASE WHEN Exists() and it is still 4+ minutes.

Given that my previous answer was mentioned, I'll try to explain again because these things are pretty tricky. So yes, I think you're seeing the same problem as the other question. Namely a row goal issue.
So to try and explain what's causing this I'll start with the three types of joins that are at the disposal of the engine (and pretty much categorically): Loop Joins, Merge Joins, Hash Joins. Loop joins are what they sound like, a nested loop over both sets of data. Merge Joins take two sorted lists and move through them in lock-step. And Hash joins throw everything in the smaller set into a filing cabinet and then look for items in the larger set once the filing cabinet has been filled.
So performance wise, loop joins require pretty much no set up and if you're only looking for a small amount of data they're really optimal. Merge are the best of the best as far as join performance for any data size, but require data to be already sorted (which is rare). Hash Joins require a fair amount of setup but allow large data sets to be joined quickly.
Now we get to your query and the difference between COUNT(*) and EXISTS/TOP 1. So the behavior you're seeing is that the optimizer thinks that rows of this query are really likely (you can confirm this by planning the query without grouping and seeing how many records it thinks it will get in the last step). In particular it probably thinks that for some table in that query, every record in that table will appear in the output.
"Eureka!" it says, "if every row in this table ends up in the output, to find if one exists I can do the really cheap start-up loop join throughout because even though it's slow for large data sets, I only need one row." But then it doesn't find that row. And doesn't find it again. And now it's iterating through a vast set of data using the least efficient means at its disposal for weeding through large sets of data.
By comparison, if you ask for the full count of data, it has to find every record by definition. It sees a vast set of data and picks the choices that are best for iterating through that entire set of data instead of just a tiny sliver of it.
If, on the other hand, it really was correct and the records were very well correlated it would have found your record with the smallest possible amount of server resources and maximized its overall throughput.

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there is two separate records for 105. I only want the returned data set to return the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experience, and I cannot figure it out. Any help would be greatly appreciated.

You need to use GROUP BY for this.
Here's an example: (I can't read your first column name, so I'm calling it JobUnitK
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).

Use RANK function. Rank the rows OVER PARTITION BY UnitId and pick the rows with rank 2 .
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx

Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
SELECT
DupID = Row_Number() OVER (PARTITION BY UnitID, ORDER BY JobUnitKeyID),
*
FROM JobUnits
WHERE DispatchDate = '20151004'
ORDER BY UnitID Desc
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case using Min(JobUnitKeyID) in conjunction with UnitID to join back on the UnitID where the JobUnitKeyID <> MinJobUnitKeyID` is required.
Except, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

Slow running BIDS report query...OR clauses in WHERE branch using IN sub queries

I have a report with 3 sub reports and several queries to optimize in each. The first has several OR clauses in the WHERE branch and the OR's are filtering through IN options which are pulling sub-queries.
I say this mostly from reading this SO post. Specifically LBushkin's second point.
I'm not the greatest at TSQL but I know enough to think this is very inefficient. I think I need to do two things.
I know I need to add indexes to the tables involved.
I think the query can be greatly enhanced.
So it seems that my first step would be to improve the query. From there I can look at what columns and tables are involved and thus determine the indexes.
At this point I haven't posted table schemas as I'm looking more for options / considerations such as using a cte to replace all the IN sub-queries.
If needed I will definitely post whatever would be helpful such as physical reads etc.
SELECT DISTINCT
auth.icm_authorizationid,
auth.icm_documentnumber
FROM
Filteredicm_servicecost AS servicecost
INNER JOIN Filteredicm_authorization AS auth ON
auth.icm_authorizationid = servicecost.icm_authorizationid
INNER JOIN Filteredicm_service AS service ON
service.icm_serviceid = servicecost.icm_serviceid
INNER JOIN Filteredicm_case AS cases ON
service.icm_caseid = cases.icm_caseid
WHERE
(cases.icm_caseid IN
(SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case))
OR (service.icm_serviceid IN
(SELECT icm_serviceid FROM Filteredicm_service AS CRMAF_Filteredicm_service))
OR (servicecost.icm_servicecostid IN
(SELECT icm_servicecostid FROM Filteredicm_servicecost AS CRMAF_Filteredicm_servicecost))
OR (auth.icm_authorizationid IN
(SELECT icm_authorizationid FROM Filteredicm_authorization AS CRMAF_Filteredicm_authorization))

EXISTS is usually much faster than IN as the query engine is able to optimize it better.
Try this:
WHERE EXISTS (SELECT 1 FROM FROM Filteredicm_case WHERE icm_caseid = cases.icm_caseid)
OR EXISTS (SELECT 1 FROM Filteredicm_service WHERE icm_serviceid = service.icm_serviceid)
OR EXISTS (SELECT 1 FROM Filteredicm_servicecost WHERE icm_servicecostid = servicecost.icm_servicecostid)
OR EXISTS (SELECT 1 FROM Filteredicm_authorization WHERE icm_authorizationid = auth.icm_authorizationid)
Furthermore, an index on Filteredicm_case.icm_caseid, an index on Filteredicm_service.icm_serviceid, an index on Filteredicm_servicecost.icm_servicecostid, and an index on Filteredicm_authorization.icm_authorizationid will increase performance of this query. They look like they should be keys already, however, so I suspect that indices already exist.
However, unless I'm misreading, there's no way this WHERE clause will ever evaluate to anything other than true.
The clause you wrote says WHERE cases.icm_caseid IN (SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case). However, cases is an alias to Filteredicm_case. That's the equivalent of WHERE Filteredicm_case.icm_caseid IN (SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case). That will be true as long as Filteredicm_case.icm_caseid isn't NULL.
The same error in logic exists for the remaining portions in the WHERE clause:
(service.icm_serviceid IN (SELECT icm_serviceid FROM Filteredicm_service AS CRMAF_Filteredicm_service))
service is an alias for Filteredicm_service. This is always true as long as icm_serviceid is not null
(servicecost.icm_servicecostid IN (SELECT icm_servicecostid FROM Filteredicm_servicecost AS CRMAF_Filteredicm_servicecost))
servicecost is an alias for Filteredicm_servicecost. This is always true as long as icm_servicecostid is not null.
(auth.icm_authorizationid IN (SELECT icm_authorizationid FROM Filteredicm_authorization AS CRMAF_Filteredicm_authorization))
auth is an alias for Filteredicm_authorization. This is always true as long as icm_authorizationid is not null.
I don't understand what you're trying to accomplish.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight