Sort in query plan TSQL - sql-server

I need to improve the performance of this query by eliminating the Sort operators, which consume the greatest amount of resources in the plan.
The temp table has around 20,000 rows and the physical table has around 60 million rows.
I am using the LAG function because I need to compare values in the bigger table. Do you have any ideas how to work around this?
I am posting the query below; if you need any further info, let me know.
;WITH CTE AS
(
SELECT
a.VIN_NUMBER,
B.CELL_VALUE, B.CELL_VALUE_NEGATIVE_VALUES,
ROW_NUMBER() OVER (PARTITION BY B.VIN_NUMBER, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL
ORDER BY B.VIN_NUMBER, B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL) ROW_NUM,
B.CELL_VALUE - LAG(B.CELL_VALUE, 1) OVER (ORDER BY B.VIN_NUMBER, B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL) CELL_VALUE_NEW
FROM
#TEMP_CHASSI_LAST_LOAD A
JOIN
DBO.LOGS_FROM_CARS B WITH (NOLOCK) ON B.ROW_CREATION_DATE BETWEEN A.MIN_ROW_CREATION_DATE
AND A.MAX_ROW_CREATION_DATE
AND A.VIN_NUMBER = B.VIN_NUMBER
)
SELECT
VIN_NUMBER,
IIF(CELL_VALUE_NEW < 0, 0, CELL_VALUE_NEW) AS CELL_VALUE_NEW,
IIF(CELL_VALUE_NEW < 0, CELL_VALUE_NEW, NULL) AS CELL_VALUE_NEGATIVE_VALUES
FROM
CTE
WHERE
ROW_NUM > 1
AND (CELL_VALUE_NEW <> CELL_VALUE OR CELL_VALUE IS NULL)

It's hard to be sure what you are doing without sample data and the full execution plan, but I'd explore a few options.
First, I don't think your LAG() is correct. I think you should add PARTITION BY B.VIN_NUMBER; you almost certainly do not want to compare values across different VINs. This also lets you get rid of your ROW_NUMBER(): LAG() will now return NULL for the first row of each VIN, that row gets filtered out by CELL_VALUE_NEW <> CELL_VALUE, and you can drop the ROW_NUM > 1 condition.
Optimized Query
WITH CTE AS (
SELECT
A.VIN_NUMBER,
B.CELL_VALUE,
B.CELL_VALUE_NEGATIVE_VALUES,
B.CELL_VALUE - LAG(B.CELL_VALUE, 1) OVER (PARTITION BY B.VIN_NUMBER ORDER BY B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL) CELL_VALUE_NEW
FROM #TEMP_CHASSI_LAST_LOAD AS A
INNER JOIN dbo.LOGS_FROM_CARS B WITH (NOLOCK)
ON B.ROW_CREATION_DATE BETWEEN A.MIN_ROW_CREATION_DATE AND A.MAX_ROW_CREATION_DATE
AND A.VIN_NUMBER = B.VIN_NUMBER
)
SELECT
VIN_NUMBER,
IIF(CELL_VALUE_NEW < 0, 0, CELL_VALUE_NEW) AS CELL_VALUE_NEW,
IIF(CELL_VALUE_NEW < 0, CELL_VALUE_NEW, NULL) AS CELL_VALUE_NEGATIVE_VALUES
FROM CTE
WHERE (CELL_VALUE_NEW <> CELL_VALUE OR CELL_VALUE IS NULL)
Things to Review:
Double-check the data types in your join conditions. For example, make sure MIN_ROW_CREATION_DATE and MAX_ROW_CREATION_DATE have the same type as ROW_CREATION_DATE, not text vs. date. Ideally VIN_NUMBER is CHAR(17) (all car VINs are 17 characters)
Create an index on the larger table (and maybe try one on the temp table; the query performance improvement might be worth the time it takes to build it)
CREATE INDEX ix_test ON dbo.LOGS_FROM_CARS(VIN_NUMBER,ROW_CREATION_DATE)
INCLUDE (CELL_VALUE,CELL_VALUE_NEGATIVE_VALUES,DATE_OF_CELL_READ, LOG_NUM, SEQUENCE_NUM_OF_CELL)
Try the FORCESEEK hint on the join to LOGS_FROM_CARS (see the sketch after this list). Be cautious with query hints, as they can lead to issues down the road, but it might be worth it for this query
Are you sure you need CELL_VALUE_NEGATIVE_VALUES from LOGS_FROM_CARS? I don't see it used anywhere. Would remove that from the query if you don't need it
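For reference, a minimal sketch of where the FORCESEEK hint would go, assuming the index above has been created (illustration only; verify against the actual plan before keeping any hint):
SELECT A.VIN_NUMBER, B.CELL_VALUE, B.DATE_OF_CELL_READ, B.LOG_NUM, B.SEQUENCE_NUM_OF_CELL
FROM #TEMP_CHASSI_LAST_LOAD AS A
INNER JOIN DBO.LOGS_FROM_CARS AS B WITH (FORCESEEK)  -- force an index seek on the big table
    ON A.VIN_NUMBER = B.VIN_NUMBER
    AND B.ROW_CREATION_DATE BETWEEN A.MIN_ROW_CREATION_DATE AND A.MAX_ROW_CREATION_DATE;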

Related

Correlated SQL Server Subquery taking very long

I have a table, G_HIST, with 700K rows and about 200 columns. Below is a correlated query that takes almost 6 minutes. Is there a better way to write it so that it takes less than half a minute?
If not, what indexes do I need on this table? Currently it has only a unique PK index on the primary key, which is made up of 10 columns.
Here is the code that selects the current version of the cycle, filtering on participant_identifier:
select distinct Curr.Cycle_Number, Curr.Process_Date,Curr.Group_Policy_Number,
Curr.Record_Type, Curr.Participant_Identifier,Curr.Person_Type,
Curr.Effective_Date
FROM G_HIST as Curr
WHERE Curr.Participant_Identifier not in (
select prev.Participant_Identifier
from G_HIST as Prev
where Prev.Cycle_Number = (
select max(b.Cycle_Number)-1
FROM G_HIST as b
WHERE b.Group_Policy_Number = Curr.Group_Policy_Number
)
)
AND Curr.[Cycle_Number] = (
select max(a.[Cycle_Number])
FROM G_HIST as a
WHERE a.[Group_Policy_Number] = Curr.[Group_Policy_Number]
)
You have aggregating -- MAX() -- correlated (dependent) subqueries. Those can be slow because they need to be re-evaluated for each row in the main query. Let's refactor them to ordinary subqueries. The ordinary subqueries need only be evaluated once.
You need a virtual table containing the largest Cycle_Number for each Group_Policy_Number. You get that with the following subquery.
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
This subquery will benefit, dramatically, from a multicolumn index on (Group_Policy_Number, Cycle_Number).
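A minimal sketch of that index (the index name here is arbitrary; use your own naming convention):
CREATE INDEX IX_G_HIST_Policy_Cycle ON G_HIST (Group_Policy_Number, Cycle_Number);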
And you have this pattern:
WHERE someColumn NOT IN (a correlated subquery)
That NOT IN can be refactored to use the LEFT JOIN ... IS NULL pattern (also known as the antijoin pattern) and an ordinary subquery. I guess your business rule says you start by finding the participant numbers in the previous cycle.
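If the pattern is unfamiliar, here is a small self-contained illustration using throwaway table variables (not your real tables); note that NOT IN and the antijoin can behave differently when the subquery returns NULLs:
DECLARE @Curr TABLE (id INT);
DECLARE @Prev TABLE (id INT);
INSERT INTO @Curr VALUES (1), (2), (3);
INSERT INTO @Prev VALUES (2);
-- NOT IN version
SELECT c.id FROM @Curr AS c
WHERE c.id NOT IN (SELECT p.id FROM @Prev AS p);
-- Equivalent LEFT JOIN ... IS NULL (antijoin) version; returns ids 1 and 3
SELECT c.id
FROM @Curr AS c
LEFT JOIN @Prev AS p ON p.id = c.id
WHERE p.id IS NULL;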
This query, using a Common Table Expression, should get you that list of participant numbers from the previous cycle for each Group_Policy_Number. You might want to inspect some results from this to ensure it gives you what you want.
WITH
Maxc AS (
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
),
PrevParticipant AS (
SELECT Participant_Identifier,
G_HIST.Group_Policy_Number
FROM G_HIST
JOIN Maxc ON G_HIST.Group_Policy_Number = Maxc.Group_Policy_Number
WHERE G_HIST.Cycle_Number = Maxc.Max_Cycle_Number - 1
)
SELECT * FROM PrevParticipant;
Then we can use the LEFT JOIN ... IS NULL pattern.
So here is the refactored query, not debugged, use at your own risk.
WITH
Maxc AS (
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
),
PrevParticipant AS (
SELECT Participant_Identifier,
G_HIST.Group_Policy_Number
FROM G_HIST
JOIN Maxc ON G_HIST.Group_Policy_Number = Maxc.Group_Policy_Number
WHERE G_HIST.Cycle_Number = Maxc.Max_Cycle_Number - 1
)
SELECT DISTINCT Curr.whatever
FROM G_HIST Curr
JOIN Maxc
ON Curr.Group_Policy_Number = Maxc.Group_Policy_Number
LEFT JOIN PrevParticipant
ON Curr.Group_Policy_Number = PrevParticipant.Group_Policy_Number
AND Curr.Participant_Identifier = PrevParticipant.Participant_Identifier
WHERE PrevParticipant.Group_Policy_Number IS NULL
AND Curr.Cycle_Number = Maxc.Max_Cycle_Number;
Common Table Expressions have been supported since SQL Server 2005, so version is unlikely to be a problem; but if you can't use them in your environment, let me know in a comment and I'll show you how to write the query without them.
You can use SSMS's Actual Execution Plan to identify any other indexes you need to speed up the whole query.

Missing Rows when running SELECT in SQL Server

I have a simple select statement. It's basically two CTEs, one of which includes ROW_NUMBER() OVER (PARTITION BY ...), and then a join from these into four other tables. No functions or anything unusual.
WITH Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey,
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
ORDER BY [Dim_Safety_Check_Date_Wkey] DESC) AS Check_No
FROM
[Pitches].[Fact_Unit_Safety_Checks]
), Last_Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey
FROM
Safety_Check_CTE
WHERE
Check_No = 1
)
SELECT
COUNT(*)
FROM
Last_Safety_Check_CTE lc
JOIN
Pitches.Fact_Unit_Safety_Checks f ON lc.Fact_Unit_Safety_Checks_Wkey = f.Fact_Unit_Safety_Checks_Wkey
JOIN
DIM.Dim_Unit u ON f.Dim_Unit_Wkey = u.Dim_Unit_Wkey
JOIN
DIM.Dim_Safety_Check_Type t ON f.Dim_Safety_Check_Type_Wkey = t.Dim_Safety_Check_Type_Wkey
JOIN
DIM.Dim_Date d ON f.Dim_Safety_Check_Date_Wkey = d.Dim_Date_Wkey
WHERE
f.Safety_Check_Certificate_No IN ('GP/KB11007') --option (maxdop 1)
Sometimes it returns 0, 1 or 2 rows. The result should obviously be consistent.
I have run a Profiler trace whilst replicating the issue, and my session was the only one in the database.
I have compared the Actual execution plans and they are both the same, except the final hash match returns the differing number of rows.
I cannot replicate if I use MAXDOP 0.
In case you use my comment as the answer.
My guess is ORDER BY [Dim_Safety_Check_Date_Wkey] is not deterministic.
In the CTEs you are finding the [Fact_Unit_Safety_Checks_Wkey] that's associated with the most recent row for any given [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey] combination... with no regard for whether or not [Safety_Check_Certificate_No] is equal to 'GP/KB11007'.
Then, in the outer query, you are filtering results based on [Safety_Check_Certificate_No] = 'GP/KB11007'.
So, unless the most recent [Fact_Unit_Safety_Checks_Wkey] happens to have [Safety_Check_Certificate_No] = 'GP/KB11007', the data is going to be filtered out.
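If ties on [Dim_Safety_Check_Date_Wkey] are indeed the cause, one sketch of a fix (assuming Fact_Unit_Safety_Checks_Wkey is unique) is to add it to the ORDER BY as a tiebreaker, so the same row is picked on every run:
SELECT
    Fact_Unit_Safety_Checks_Wkey,
    ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
                       ORDER BY [Dim_Safety_Check_Date_Wkey] DESC,
                                Fact_Unit_Safety_Checks_Wkey DESC) AS Check_No  -- unique tiebreaker makes the choice deterministic
FROM [Pitches].[Fact_Unit_Safety_Checks];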

How to optimize a join to a moderately large type II table on snowflake?

Background
Suppose I have the following tables:
-- 33M rows
CREATE TABLE lkp.session (
session_id BIGINT,
visitor_id BIGINT,
session_datetime TIMESTAMP
);
-- 17M rows
CREATE TABLE lkp.visitor_customer_hist (
visitor_id BIGINT,
customer_id BIGINT,
from_datetime TIMESTAMP,
to_datetime TIMESTAMP
);
Visitor_customer_hist gives the customer_id that is in effect for each visitor at each point in time.
The goal is to look up the customer id that was in effect for each session, using the visitor_id and session_datetime.
CREATE TABLE lkp.session_effective_customer AS
SELECT
s.session_id,
vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id
AND s.session_datetime >= vch.from_datetime
AND s.session_datetime < vch.to_datetime;
Problem
Even with a warehouse scaled to large, this query is extremely slow. It took 1h15m to complete, and it was the only query running on the warehouse.
I verified there are no overlapping values in visitor_customer_hist, the presence of which could cause a duplicative join.
Is Snowflake just really bad at this kind of join? I'm looking for suggestions on how I might optimize the tables for this kind of query (clustering, for example), or any optimization technique or reworking of the query, e.g. maybe a correlated subquery or something.
If the lkp.session table contains a narrow time range, and the lkp.visitor_customer_hist table contains a wide time range, you may benefit from rewriting the query to add a redundant condition restricting the range of rows considered in the join:
CREATE TABLE lkp.session_effective_customer AS
SELECT
s.session_id,
vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id
AND s.session_datetime >= vch.from_datetime
AND s.session_datetime < vch.to_datetime
WHERE vch.to_datetime >= (select min(session_datetime) from lkp.session)
AND vch.from_datetime <= (select max(session_datetime) from lkp.session);
On the other hand, this won't help very much if both tables cover similar wide date range and there are large numbers of customers associated with a given visitor over time.
Following Stuart's answer, we can filter it a bit more by looking at the visitor-wise min and max. Like so:
CREATE TEMPORARY TABLE _vch AS
SELECT
l.visitor_id,
l.customer_id,
l.from_datetime,
l.to_datetime
FROM (
SELECT
l.visitor_id,
min(l.session_datetime) AS mindt,
max(l.session_datetime) AS maxdt
FROM lkp.session l
GROUP BY l.visitor_id
) a
JOIN lkp.visitor_customer_hist l ON a.visitor_id = l.visitor_id
AND l.from_datetime >= a.mindt
AND l.to_datetime <= a.maxdt;
Then with our lighter-weight hist table, maybe we'll have better luck:
CREATE TABLE lkp.session_effective_customer AS
SELECT
s.session_id,
vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN _vch vch ON vch.visitor_id = s.visitor_id
AND s.session_datetime >= vch.from_datetime
AND s.session_datetime < vch.to_datetime;
Unfortunately, in my case, though I filtered out a huge percentage of rows, the problem visitors (those with thousands of records in visitor_customer_hist) remained problematic (i.e. they still had thousands of records, resulting in join explosion).
In other circumstances, though, this could work.
In cases where both tables have high record counts per-visitor, this join is problematic, for reasons Marcin described in the comments. Accordingly, with this kind of scenario, it is best to avoid this kind of join altogether if possible.
The way I ultimately solved this issue was to scrap the visitor_customer_hist table and write a custom window function / udtf.
Initially I created lkp.visitor_customer_hist table because it could be created using existing window functions, and on a non-MPP sql database appropriate indexes could be created which would render lookups sufficiently performant. It was created like so:
CREATE TABLE lkp.visitor_customer_hist AS
SELECT
a.visitor_id AS visitor_id,
a.customer_id AS customer_id,
nvl(lag(a.session_datetime) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime ), '1900-01-01') AS from_datetime,
CASE WHEN lead(a.session_datetime) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime ) IS NULL THEN '9999-12-31'
ELSE a.session_datetime END AS to_datetime
FROM (
SELECT
s.session_id,
vs.visitor_id,
customer_id,
row_number() OVER ( PARTITION BY vs.visitor_id, s.session_datetime
ORDER BY s.session_id ) AS rn,
lead(s.customer_id) OVER ( PARTITION BY vs.visitor_id
ORDER BY s.session_datetime ) AS next_cust_id,
session_datetime
FROM "session" s
JOIN "visitor_session" vs ON vs.session_id = s.session_id
WHERE s.customer_id <> -2
) a
WHERE (a.next_cust_id <> a.customer_id
OR a.next_cust_id IS NULL) AND a.rn = 1;
So, scrapping this approach, I wrote the following UDTF instead:
CREATE OR REPLACE FUNCTION udtf_eff_customer(customer_id FLOAT)
RETURNS TABLE(effective_customer_id FLOAT)
LANGUAGE JAVASCRIPT
IMMUTABLE
AS '
{
initialize: function() {
this.customer_id = -1;
},
processRow: function (row, rowWriter, context) {
if (row.CUSTOMER_ID != -1) {
this.customer_id = row.CUSTOMER_ID;
}
rowWriter.writeRow({EFFECTIVE_CUSTOMER_ID: this.customer_id});
},
finalize: function (rowWriter, context) {/*...*/},
}
';
And it can be applied like so:
SELECT
iff(a.customer_id <> -1, a.customer_id, ec.effective_customer_id) AS customer_id,
a.session_id
FROM "session" a
JOIN table(udtf_eff_customer(nvl2(a.visitor_id, a.customer_id, NULL) :: DOUBLE) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime DESC )) ec
So this accomplishes the desired result: for every session, if the customer_id is not "unknown", then we go ahead and use that; otherwise, we use the next customer_id (if one exists) that can be associated with that visitor (ordering by time of session).
This is a much better solution than creating the lookup table; it essentially takes only one pass at the data, requires a lot less code / complexity, and goes very fast.

Get random data from SQL Server without performance impact

I need to select random rows from my SQL table. When I searched for this on Google, the suggestion was ORDER BY NEWID(), but that reduces performance. Since my table has more than 2,000,000 rows, this solution does not suit me.
I tried this code to get random data :
SELECT TOP 10 *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
It also drops performance sometimes.
Could you please suggest a good solution for getting random data from my table? I need a minimum number of rows, around 30 per request. I tried TABLESAMPLE, but it returns nothing once I add my WHERE condition, because it samples at the page level rather than the row level.
Try to calculate the random ids before filtering your big table.
Since your key is not an identity column, you need to number the records, and this will affect performance.
Note that I have used DISTINCT to be sure to get different numbers.
EDIT: I have modified the query to use an arbitrary filter on your big table.
declare @n int = 30
;with
t as (
-- EXTRACT DATA AND NUMBER ROWS
select *, ROW_NUMBER() over (order by YourPrimaryKey) n
from YourBigTable t
-- SOME FILTER
WHERE 1=1 /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
-- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
select distinct top (@n) abs(CHECKSUM(NEWID()) % n)+1 rnd
from sysobjects s
cross join (SELECT MAX(n) n FROM t) t
)
select t.*
from t
join r on r.rnd = t.n
If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.
SELECT t1.*
FROM (VALUES(
NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;

RowNumber() and Partition By performance help wanted

I've got a table of stock market moving average values, and I'm trying to compare two values within a day, and then compare that result to the same calculation for the prior day. My SQL as it stands is below. When I comment out the last select statement that defines the result set and instead run the last CTE shown as the result set, I get my data back in about 15 minutes. That's long, but manageable, since it will run as an insert sproc overnight. When I run it as shown, I'm 40 minutes in before any results even start to come in. Any ideas? It goes from somewhat slow to blowing up, probably with the addition of ROW_NUMBER() OVER (PARTITION BY). BTW, I'm still working through the logic, which is currently impossible with this performance issue. Thanks in advance.
Edit: I fixed my partition as suggested below.
with initialSmas as
(
select TradeDate, Symbol, Period, Value
from tblDailySMA
),
smaComparisonsByPer as
(
select i.TradeDate, i.Symbol, i.Period FastPer, i.Value FastVal,
i2.Period SlowPer, i2.Value SlowVal, (i.Value-i2.Value) FastMinusSlow
from initialSmas i join initialSmas as i2 on i.Symbol = i2.Symbol
and i.TradeDate = i2.TradeDate and i2.Period > i.Period
),
smaComparisonsByPerPartitioned as
(
select ROW_NUMBER() OVER (PARTITION BY sma.Symbol, sma.FastPer, sma.SlowPer
ORDER BY sma.TradeDate) as RowNum, sma.TradeDate, sma.Symbol, sma.FastPer,
sma.FastVal, sma.SlowPer, sma.SlowVal, sma.FastMinusSlow
from smaComparisonsByPer sma
)
select scp.TradeDate as LatestDate, scp.FastPer, scp.FastVal, scp.SlowPer, scp.SlowVal,
scp.FastMinusSlow, scp2.TradeDate as LatestDate, scp2.FastPer, scp2.FastVal, scp2.SlowPer,
scp2.SlowVal, scp2.FastMinusSlow, (scp.FastMinusSlow * scp2.FastMinusSlow) as Comparison
from smaComparisonsByPerPartitioned scp join smaComparisonsByPerPartitioned scp2
on scp.Symbol = scp2.Symbol and scp.RowNum = (scp2.RowNum - 1)
1) You have some fields both in the Partition By and the Order By clauses. That doesn't make sense since you will have one and only one value for each (sma.FastPer, sma.SlowPer). You can safely remove these fields from the Order By part of the window function.
2) Assuming that you already have indexes for adequate performance on the "initialSmas i join initialSmas" self-join, and that you already have an index on (initialSmas.Symbol, initialSmas.Period, initialSmas.TradeDate), the best you can do is to copy smaComparisonsByPer into a temporary table where you can create an index on (sma.Symbol, sma.FastPer, sma.SlowPer, sma.TradeDate); see the sketch below.
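A rough sketch of that materialization, reusing the logic of the smaComparisonsByPer CTE (temp table and index names are illustrative; adjust to your schema):
-- Materialize the intermediate result instead of keeping it as a CTE
SELECT i.TradeDate, i.Symbol, i.Period AS FastPer, i.Value AS FastVal,
       i2.Period AS SlowPer, i2.Value AS SlowVal, (i.Value - i2.Value) AS FastMinusSlow
INTO #smaComparisonsByPer
FROM tblDailySMA AS i
JOIN tblDailySMA AS i2
    ON i.Symbol = i2.Symbol
    AND i.TradeDate = i2.TradeDate
    AND i2.Period > i.Period;
-- Supports the PARTITION BY / ORDER BY and the subsequent self-join on RowNum
CREATE INDEX IX_smaComparisons ON #smaComparisonsByPer (Symbol, FastPer, SlowPer, TradeDate);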
