Correlated SQL Server Subquery taking very long - query-optimization

I have this table: G_HIST with 700K rows and about 200 columns. Below is the
correlated query that is taking almost 6 minutes. Is there a better way to write it so
that it can take less than half minute.
If not, what indexes I need to have on this table? Currently it has only PK Unique
index on Primary Keys made up of 10 columns.
Here is the code below to select current version of cycle filtering
participant_identifier:
select distinct Curr.Cycle_Number, Curr.Process_Date,Curr.Group_Policy_Number,
Curr.Record_Type, Curr.Participant_Identifier,Curr.Person_Type,
Curr.Effective_Date
FROM G_HIST as Curr
WHERE Curr.Participant_Identifier not in (
select prev.Participant_Identifier
from G_HIST as Prev
where Prev.Cycle_Number = (
select max(b.Cycle_Number)-1
FROM G_HIST as b
WHERE b.Group_Policy_Number = Curr. Group_Policy_Number
)
)
AND Curr.[Cycle_Number] = (
select max(a.[Cycle_Number])
FROM G_HIST as a
WHERE a.[Group_Policy_Number] = Curr.[Group_Policy_Number]
)

You have aggregating -- MAX() -- correlated (dependent) subqueries. Those can be slow because they need to be re-evaluated for each row in the main query. Let's refactor them to ordinary subqueries. The ordinary subqueries need only be evaluated once.
You need a virtual table containing the largest Cycle_Number for each Group_Policy_Number. You get that with the following subquery.
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM GHIST
GROUP BY Group_Policy_Number
This subquery will benefit, dramatically, from a multicolumn index on (Group_Policy_Number, Max_Cycle_Number).
And you have this pattern:
WHERE someColumn NOT IN (a correlated subquery)
That NOT IN can be refactored to use the LEFT JOIN ... IS NULL pattern (also known as the antijoin pattern) and an ordinary subquery. I guess your business rule says you start by finding the participant numbers in the previous cycle.
This query, using a Common Table Expression, should get you that list of participant numbers from the previous cycle for each Group_Policy_Number. You might want to inspect some results from this to ensure it gives you what you want.
WITH
Maxc AS (
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM GHIST
GROUP BY Group_Policy_Number
),
PrevParticipant AS (
SELECT Participant_Identifier,
Group_Policy_Number
FROM GHIST
JOIN Maxc ON GHIST.Group_Policy_Number = Maxc.Group_Policy_Number
WHERE GHIST.Cycle_Number = Maxc.Cycle_Number - 1
)
SELECT * FROM PrevParticipant;
Then we can use the LEFT JOIN ... IS NULL pattern.
So here is the refactored query, not debugged, use at your own risk.
WITH
Maxc AS (
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
),
PrevParticipant AS (
SELECT Participant_Identifier,
Group_Policy_Number
FROM G_HIST
JOIN Maxc ON G_HIST.Group_Policy_Number = Maxc.Group_Policy_Number
WHERE GHIST.Cycle_Number = Maxc.Cycle_Number - 1
)
SELECT DISTINCT Curr.whatever
FROM G_HIST Curr
JOIN Maxc
ON Curr.Group_Policy_Number = Maxc.Group_Policy_Number
LEFT JOIN PrevParticipant
ON Curr.Group_Policy_Number = PrevParticipant.Group_Policy_Number
AND Curr.Participant_Number <> PrevParticiant.Participant_Number
WHERE PrevParticipant.Group_Policy_Number IS NULL
AND Curr.Cycle_Number = Maxc.Cycle_Number;
If you have SQL Server 2016 or earlier you won't be able to use Common Table Expressions. Let me know in a comment and I'll show you how to write the query without them.
You can use SSMS's Actual Execution Plan to identify any other indexes you need to speed up the whole query.

Related

SQL - Attain Previous Transaction Informaiton [duplicate]

I need to calculate the difference of a column between two lines of a table. Is there any way I can do this directly in SQL? I'm using Microsoft SQL Server 2008.
I'm looking for something like this:
SELECT value - (previous.value) FROM table
Imagining that the "previous" variable reference the latest selected row. Of course with a select like that I will end up with n-1 rows selected in a table with n rows, that's not a probably, actually is exactly what I need.
Is that possible in some way?
Use the lag function:
SELECT value - lag(value) OVER (ORDER BY Id) FROM table
Sequences used for Ids can skip values, so Id-1 does not always work.
SQL has no built in notion of order, so you need to order by some column for this to be meaningful. Something like this:
select t1.value - t2.value from table t1, table t2
where t1.primaryKey = t2.primaryKey - 1
If you know how to order things but not how to get the previous value given the current one (EG, you want to order alphabetically) then I don't know of a way to do that in standard SQL, but most SQL implementations will have extensions to do it.
Here is a way for SQL server that works if you can order rows such that each one is distinct:
select rank() OVER (ORDER BY id) as 'Rank', value into temp1 from t
select t1.value - t2.value from temp1 t1, temp1 t2
where t1.Rank = t2.Rank - 1
drop table temp1
If you need to break ties, you can add as many columns as necessary to the ORDER BY.
WITH CTE AS (
SELECT
rownum = ROW_NUMBER() OVER (ORDER BY columns_to_order_by),
value
FROM table
)
SELECT
curr.value - prev.value
FROM CTE cur
INNER JOIN CTE prev on prev.rownum = cur.rownum - 1
Oracle, PostgreSQL, SQL Server and many more RDBMS engines have analytic functions called LAG and LEAD that do this very thing.
In SQL Server prior to 2012 you'd need to do the following:
SELECT value - (
SELECT TOP 1 value
FROM mytable m2
WHERE m2.col1 < m1.col1 OR (m2.col1 = m1.col1 AND m2.pk < m1.pk)
ORDER BY
col1, pk
)
FROM mytable m1
ORDER BY
col1, pk
, where COL1 is the column you are ordering by.
Having an index on (COL1, PK) will greatly improve this query.
LEFT JOIN the table to itself, with the join condition worked out so the row matched in the joined version of the table is one row previous, for your particular definition of "previous".
Update: At first I was thinking you would want to keep all rows, with NULLs for the condition where there was no previous row. Reading it again you just want that rows culled, so you should an inner join rather than a left join.
Update:
Newer versions of Sql Server also have the LAG and LEAD Windowing functions that can be used for this, too.
select t2.col from (
select col,MAX(ID) id from
(
select ROW_NUMBER() over(PARTITION by col order by col) id ,col from testtab t1) as t1
group by col) as t2
The selected answer will only work if there are no gaps in the sequence. However if you are using an autogenerated id, there are likely to be gaps in the sequence due to inserts that were rolled back.
This method should work if you have gaps
declare #temp (value int, primaryKey int, tempid int identity)
insert value, primarykey from mytable order by primarykey
select t1.value - t2.value from #temp t1
join #temp t2
on t1.tempid = t2.tempid - 1
Another way to refer to the previous row in an SQL query is to use a recursive common table expression (CTE):
CREATE TABLE t (counter INTEGER);
INSERT INTO t VALUES (1),(2),(3),(4),(5);
WITH cte(counter, previous, difference) AS (
-- Anchor query
SELECT MIN(counter), 0, MIN(counter)
FROM t
UNION ALL
-- Recursive query
SELECT t.counter, cte.counter, t.counter - cte.counter
FROM t JOIN cte ON cte.counter = t.counter - 1
)
SELECT counter, previous, difference
FROM cte
ORDER BY counter;
Result:
counter
previous
difference
1
0
1
2
1
1
3
2
1
4
3
1
5
4
1
The anchor query generates the first row of the common table expression cte where it sets cte.counter to column t.counter in the first row of table t, cte.previous to 0, and cte.difference to the first row of t.counter.
The recursive query joins each row of common table expression cte to the previous row of table t. In the recursive query, cte.counter refers to t.counter in each row of table t, cte.previous refers to cte.counter in the previous row of cte, and t.counter - cte.counter refers to the difference between these two columns.
Note that a recursive CTE is more flexible than the LAG and LEAD functions because a row can refer to any arbitrary result of a previous row. (A recursive function or process is one where the input of the process is the output of the previous iteration of that process, except the first input which is a constant.)
I tested this query at SQLite Online.
You can use the following funtion to get current row value and previous row value:
SELECT value,
min(value) over (order by id rows between 1 preceding and 1
preceding) as value_prev
FROM table
Then you can just select value - value_prev from that select and get your answer

Sort in query plan TSQL

Need to handle query by eliminating and improving performance by deleting sort operators which consumes the greatest amount of resources.
The temp table is around 20,000 rows and the physical table is around 60 million of rows.
I am using LAG function due to that I need to compare values in bigger table, Have You guys any idea to figure it out ?
I am posting query, but if you will need any further info then let me know.
;WITH CTE AS
(
SELECT
a.VIN_NUMBER,
B.CELL_VALUE, B.CELL_VALUE_NEGATIVE_VALUES,
ROW_NUMBER() OVER (PARTITION BY B.VIN_NUMBER, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL
ORDER BY B.VIN_NUMBER, B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL) ROW_NUM,
B.CELL_VALUE - LAG(B.CELL_VALUE, 1) OVER (ORDER BY B.VIN_NUMBER, B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL) CELL_VALUE_NEW
FROM
#TEMP_CHASSI_LAST_LOAD A
JOIN
DBO.LOGS_FROM_CARS B WITH (NOLOCK) ON B.ROW_CREATION_DATE BETWEEN A.MIN_ROW_CREATION_DATE
AND A.MAX_ROW_CREATION_DATE
AND A.VIN_NUMBER = B.VIN_NUMBER
)
SELECT
VIN_NUMBER,
IIF(CELL_VALUE_NEW < 0, 0, CELL_VALUE_NEW) AS CELL_VALUE_NEW,
IIF(CELL_VALUE_NEW < 0, CELL_VALUE_NEW, NULL) AS CELL_VALUE_NEGATIVE_VALUES
FROM
CTE
WHERE
ROW_NUM > 1
AND (CELL_VALUE_NEW <> CELL_VALUE OR CELL_VALUE IS NULL)
It's hard to be sure what you are doing without sample data and full execution plan, but I'd explore a few options.
First, I don't think your LAG() is correct. I think you should add PARTITION BY B.VIN_NUMBER. Pretty sure you do not want to compare values of different VIN's. This will let you get rid of your ROW_NUMBER() as LAG() will now have NULL for the first row. That means your CELL_VALUE_NEW <> CELL_VALUE will filter out, so can remove ROW_NUM > 1
Optimized Query
WITH CTE AS (
SELECT
A.VIN_NUMBER,
B.CELL_VALUE,
B.CELL_VALUE_NEGATIVE_VALUES,
B.CELL_VALUE - LAG(B.CELL_VALUE, 1) OVER (PARTITION BY B.VIN_NUMBER ORDER BY B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL) CELL_VALUE_NEW
FROM #TEMP_CHASSI_LAST_LOAD AS A
INNER JOIN dbo.LOGS_FROM_CARS B WITH (NOLOCK)
ON B.ROW_CREATION_DATE BETWEEN A.MIN_ROW_CREATION_DATE AND A.MAX_ROW_CREATION_DATE
AND A.VIN_NUMBER = B.VIN_NUMBER
)
SELECT
VIN_NUMBER,
IIF(CELL_VALUE_NEW < 0, 0, CELL_VALUE_NEW) AS CELL_VALUE_NEW,
IIF(CELL_VALUE_NEW < 0, CELL_VALUE_NEW, NULL) AS CELL_VALUE_NEGATIVE_VALUES
FROM CTE
WHERE (CELL_VALUE_NEW <> CELL_VALUE OR CELL_VALUE IS NULL)
Things to Review:
Double check data types for your join conditions. Ex. make sure MIN_ROW_CREATION_DATE and MAX_ROW_CREATION_DATE are the same as ROW_CREATION_DATE. Makes sure it's not text vs date. Ideally VIN_NUMBER is using CHAR(17) (all car VIN's are 17 characters)
Create index on larger table (and maybe try one on the temp table. Query performance improvement might be worth the time to create the index on the temp table)
CREATE INDEX ix_test ON dbo.LOGS_FROM_CARS(VIN_NUMBER,ROW_CREATION_DATE)
INCLUDE (CELL_VALUE,CELL_VALUE_NEGATIVE_VALUES,DATE_OF_CELL_READ, LOG_NUM, SEQUENCE_NUM_OF_CELL)
Try FORCESEEK option on table join to LOGS_FROM_CARS. Be cautious using query hints as can lead to issues down the road, but might be worth it for this query
Are you sure you need CELL_VALUE_NEGATIVE_VALUES from LOGS_FROM_CARS? I don't see it used anywhere. Would remove that from the query if you don't need it

Count all max number value in difference tables sql

I got an error when I tried to solve this problem. First I need to count all values of 2 tables then I need in where condition get all max values.
My code:
Select *
FROM (
select Operator.OperatoriausPavadinimas,
(
select count(*)
from Plan
where Plan.operatoriausID= Operator.operatoriausID
) as NumberOFPlans
from Operator
)a
where a.NumberOFPlans= Max(a.NumberOFPlans)
I get this error
Msg 147, Level 15, State 1, Line 19
An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.
I don't know how to solve this.
I need get this http://prntscr.com/p700w9
Update 1
Plan table contains of http://prntscr.com/p7055l values and
Operator table contains of http://prntscr.com/p705k0 values.
Are you looking for... an aggregate query that joins both tables and returns the record that has the maximum count?
I suspect that this might phrase as follows:
SELECT TOP(1) o.OperatoriausPavadinimas, COUNT(*)
FROM Operatorius o
INNER JOIN Planas p ON p.operatoriausID = o.operatoriausID
GROUP BY o.OperatoriausPavadinimas
ORDER BY COUNT(*) DESC
If you want to allow ties, you can use TOP(1) WITH TIES.
You can use top with ties. Your query is a bit hard to follow, but I think you want:
select top (1) with ties o.OperatoriausPavadinimas, count(*)
from plan p join
operator o
on p.operatoriausID = o.operatoriausID
group by o.OperatoriausPavadinimas
order by count(*) desc;

Missing Rows when running SELECT in SQL Server

I have a simple select statement. It's basically 2 CTE's, one includes a ROW_NUMBER() OVER (PARTITION BY, then a join from these into 4 other tables. No functions or anything unusual.
WITH Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey,
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
ORDER BY [Dim_Safety_Check_Date_Wkey] DESC) AS Check_No
FROM
[Pitches].[Fact_Unit_Safety_Checks]
), Last_Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey
FROM
Safety_Check_CTE
WHERE
Check_No = 1
)
SELECT
COUNT(*)
FROM
Last_Safety_Check_CTE lc
JOIN
Pitches.Fact_Unit_Safety_Checks f ON lc.Fact_Unit_Safety_Checks_Wkey = f.Fact_Unit_Safety_Checks_Wkey
JOIN
DIM.Dim_Unit u ON f.Dim_Unit_Wkey = u.Dim_Unit_Wkey
JOIN
DIM.Dim_Safety_Check_Type t ON f.Dim_Safety_Check_Type_Wkey = t.Dim_Safety_Check_Type_Wkey
JOIN
DIM.Dim_Date d ON f.Dim_Safety_Check_Date_Wkey = d.Dim_Date_Wkey
WHERE
f.Safety_Check_Certificate_No IN ('GP/KB11007') --option (maxdop 1)
Sometimes it returns 0, 1 or 2 rows. The result should obviously be consistent.
I have ran a profile trace whilst replicating the issue and my session was the only one in the database.
I have compared the Actual execution plans and they are both the same, except the final hash match returns the differing number of rows.
I cannot replicate if I use MAXDOP 0.
In case you use my comment as the answer.
My guess is ORDER BY [Dim_Safety_Check_Date_Wkey] is not deterministic.
In the CTE's you are finding the [Fact_Unit_Safety_Checks_Wkey] that's associated with the most resent row for any given [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey] combination... With no regard for weather or not [Safety_Check_Certificate_No] is equal to 'GP/KB11007'.
Then, in the outer query, you are filtering results based on [Safety_Check_Certificate_No] = 'GP/KB11007'.
So, unless the most recent [Fact_Unit_Safety_Checks_Wkey] happens to have [Safety_Check_Certificate_No] = 'GP/KB11007', the data is going to be filtered out.

Get random data from SQL Server without performance impact

I need to select random rows from my sql table, when search this cases in google, they suggested to ORDER BY NEWID() but it reduces the performance. Since my table has more than 2'000'000 rows of data, this solution does not suit me.
I tried this code to get random data :
SELECT TOP 10 *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
It also drops performance sometimes.
Could you please suggest good solution for getting random data from my table, I need minimum rows from that tables like 30 rows for each request. I tried TableSAMPLE to get the data, but it returns nothing once I added my where condition because it return the data by the basis of page not basis of row.
Try to calc the random ids before to filter your big table.
since your key is not identity, you need to number records and this will affect performances..
Pay attention, I have used distinct clause to be sure to get different numbers
EDIT: I have modified the query to use an arbitrary filter on your big table
declare #n int = 30
;with
t as (
-- EXTRACT DATA AND NUMBER ROWS
select *, ROW_NUMBER() over (order by YourPrimaryKey) n
from YourBigTable t
-- SOME FILTER
WHERE 1=1 /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
-- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
select distinct top (#n) abs(CHECKSUM(NEWID()) % n)+1 rnd
from sysobjects s
cross join (SELECT MAX(n) n FROM t) t
)
select t.*
from t
join r on r.rnd = t.n
If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.
SELECT t1.*
FROM (VALUES(
NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;

Resources