MariaDB simple join with order by without temp table - query-optimization

I have a FIFO job queue that can grow to anywhere from 0 to 10MM records. Each record has some value associated with a user. I have a second table that CAN contain users that have priority. The queue gets queried a lot by worker threads, and ordering by this priority causes slow queries once the queue reaches the 1MM record range, e.g.:
select *
from calcqueue
LEFT JOIN calc_priority USING(userId)
where calcqueue.isProcessing IS NULL
order by ISNULL(calc_priority.priority), calc_priority.priority
Running EXPLAIN on this gets me "Using index condition; Using temporary; Using filesort". I attempted to switch this over to a derived table, which scales better at larger row counts, but I can't get the order to stay preserved, which defeats the true intention (though it at least keeps my servers speedy):
SELECT *
FROM
( SELECT priority, p, userId
  FROM
  ( SELECT calc_priority.priority,
           qt_uncalc.userId,
           ISNULL(calc_priority.priority) p
    FROM
    ( SELECT userId
      FROM calcqueue
      WHERE isProcessing IS NULL
    ) qt_uncalc
    LEFT JOIN calc_priority USING(userId)
    ORDER BY p, calc_priority.priority ASC
  ) orderedT
) sortedQ
Is there any way to achieve this using only derived tables? calc_priority can (and does) change a lot, so adding the priority at calcqueue insert time isn't an option.

Plan A
Munch on this:
( SELECT *, 999999 AS priority
from calcqueue
LEFT JOIN calc_priority USING(userId)
where calcqueue.isProcessing IS NULL
AND calc_priority.priority IS NULL
LIMIT 10
)
UNION ALL
( SELECT *, calc_priority.priority
from calcqueue
JOIN calc_priority USING(userId)
where calcqueue.isProcessing IS NULL
ORDER BY calc_priority.priority
LIMIT 10
)
ORDER BY priority
LIMIT 10;
and add this index:
INDEX(isProcessing, userId)
The 999999 value is there to avoid the hassle of dealing with NULL priorities.
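If the suggested index doesn't already exist, it can be added like this (a sketch; only the table and column names come from the question, the index name is made up):
ALTER TABLE calcqueue
    ADD INDEX idx_isprocessing_userid (isProcessing, userId);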
Plan B
You could change the app to always set priority to a suitable value, thereby avoiding the UNION.
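For illustration, if the application guaranteed that every userId in calcqueue had a matching calc_priority row with a real priority (say, 999999 meaning "no special priority"), the whole thing could collapse to something like this sketch:
SELECT *
FROM calcqueue
JOIN calc_priority USING(userId)
WHERE calcqueue.isProcessing IS NULL
ORDER BY calc_priority.priority
LIMIT 10;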

Related

Correlated SQL Server Subquery taking very long

I have this table: G_HIST with 700K rows and about 200 columns. Below is the correlated query that is taking almost 6 minutes. Is there a better way to write it so that it can take less than half a minute?
If not, what indexes do I need on this table? Currently it has only a unique PK index on a primary key made up of 10 columns.
Here is the code to select the current version of the cycle, filtering on Participant_Identifier:
select distinct Curr.Cycle_Number, Curr.Process_Date,Curr.Group_Policy_Number,
Curr.Record_Type, Curr.Participant_Identifier,Curr.Person_Type,
Curr.Effective_Date
FROM G_HIST as Curr
WHERE Curr.Participant_Identifier not in (
select prev.Participant_Identifier
from G_HIST as Prev
where Prev.Cycle_Number = (
select max(b.Cycle_Number)-1
FROM G_HIST as b
WHERE b.Group_Policy_Number = Curr.Group_Policy_Number
)
)
AND Curr.[Cycle_Number] = (
select max(a.[Cycle_Number])
FROM G_HIST as a
WHERE a.[Group_Policy_Number] = Curr.[Group_Policy_Number]
)
You have aggregating -- MAX() -- correlated (dependent) subqueries. Those can be slow because they need to be re-evaluated for each row in the main query. Let's refactor them to ordinary subqueries. The ordinary subqueries need only be evaluated once.
You need a virtual table containing the largest Cycle_Number for each Group_Policy_Number. You get that with the following subquery.
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
This subquery will benefit, dramatically, from a multicolumn index on (Group_Policy_Number, Cycle_Number).
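In SQL Server that index could be created like this (a sketch; the index name is made up):
CREATE INDEX IX_G_HIST_Policy_Cycle
    ON G_HIST (Group_Policy_Number, Cycle_Number);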
And you have this pattern:
WHERE someColumn NOT IN (a correlated subquery)
That NOT IN can be refactored to use the LEFT JOIN ... IS NULL pattern (also known as the antijoin pattern) and an ordinary subquery. I guess your business rule says you start by finding the participant numbers in the previous cycle.
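In generic form the rewrite looks like this (a sketch with hypothetical tables big and excluded, joined on a key column k):
SELECT big.*
FROM big
LEFT JOIN excluded ON excluded.k = big.k
WHERE excluded.k IS NULL;  -- keep only rows with no match in excluded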
This query, using a Common Table Expression, should get you that list of participant numbers from the previous cycle for each Group_Policy_Number. You might want to inspect some results from this to ensure it gives you what you want.
WITH
Maxc AS (
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
),
PrevParticipant AS (
SELECT Participant_Identifier,
Group_Policy_Number
FROM G_HIST
JOIN Maxc ON G_HIST.Group_Policy_Number = Maxc.Group_Policy_Number
WHERE G_HIST.Cycle_Number = Maxc.Max_Cycle_Number - 1
)
SELECT * FROM PrevParticipant;
Then we can use the LEFT JOIN ... IS NULL pattern.
So here is the refactored query, not debugged, use at your own risk.
WITH
Maxc AS (
SELECT MAX(Cycle_Number) Max_Cycle_Number,
Group_Policy_Number
FROM G_HIST
GROUP BY Group_Policy_Number
),
PrevParticipant AS (
SELECT Participant_Identifier,
Group_Policy_Number
FROM G_HIST
JOIN Maxc ON G_HIST.Group_Policy_Number = Maxc.Group_Policy_Number
WHERE G_HIST.Cycle_Number = Maxc.Max_Cycle_Number - 1
)
SELECT DISTINCT Curr.whatever
FROM G_HIST Curr
JOIN Maxc
ON Curr.Group_Policy_Number = Maxc.Group_Policy_Number
LEFT JOIN PrevParticipant
ON Curr.Group_Policy_Number = PrevParticipant.Group_Policy_Number
AND Curr.Participant_Identifier = PrevParticipant.Participant_Identifier
WHERE PrevParticipant.Group_Policy_Number IS NULL
AND Curr.Cycle_Number = Maxc.Max_Cycle_Number;
If your version of SQL Server is too old to support Common Table Expressions (anything before SQL Server 2005), let me know in a comment and I'll show you how to write the query without them.
You can use SSMS's Actual Execution Plan to identify any other indexes you need to speed up the whole query.

Missing Rows when running SELECT in SQL Server

I have a simple select statement. It's basically two CTEs, one of which includes a ROW_NUMBER() OVER (PARTITION BY ...), then a join from these into 4 other tables. No functions or anything unusual.
WITH Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey,
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
ORDER BY [Dim_Safety_Check_Date_Wkey] DESC) AS Check_No
FROM
[Pitches].[Fact_Unit_Safety_Checks]
), Last_Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey
FROM
Safety_Check_CTE
WHERE
Check_No = 1
)
SELECT
COUNT(*)
FROM
Last_Safety_Check_CTE lc
JOIN
Pitches.Fact_Unit_Safety_Checks f ON lc.Fact_Unit_Safety_Checks_Wkey = f.Fact_Unit_Safety_Checks_Wkey
JOIN
DIM.Dim_Unit u ON f.Dim_Unit_Wkey = u.Dim_Unit_Wkey
JOIN
DIM.Dim_Safety_Check_Type t ON f.Dim_Safety_Check_Type_Wkey = t.Dim_Safety_Check_Type_Wkey
JOIN
DIM.Dim_Date d ON f.Dim_Safety_Check_Date_Wkey = d.Dim_Date_Wkey
WHERE
f.Safety_Check_Certificate_No IN ('GP/KB11007') --option (maxdop 1)
Sometimes the count comes back as 0, 1 or 2. The result should obviously be consistent.
I have run a Profiler trace whilst replicating the issue, and my session was the only one in the database.
I have compared the Actual execution plans and they are both the same, except the final hash match returns the differing number of rows.
I cannot replicate if I use MAXDOP 0.
Posting my comment as the answer:
My guess is ORDER BY [Dim_Safety_Check_Date_Wkey] is not deterministic.
In the CTEs you are finding the [Fact_Unit_Safety_Checks_Wkey] that's associated with the most recent row for any given [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey] combination, with no regard for whether or not [Safety_Check_Certificate_No] is equal to 'GP/KB11007'.
Then, in the outer query, you are filtering results based on [Safety_Check_Certificate_No] = 'GP/KB11007'.
So, unless the most recent [Fact_Unit_Safety_Checks_Wkey] happens to have [Safety_Check_Certificate_No] = 'GP/KB11007', the data is going to be filtered out.
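If the count merely needs to be consistent from run to run, one option is to make the window ordering deterministic by adding a unique tie-breaker column in the first CTE (a sketch, assuming Fact_Unit_Safety_Checks_Wkey is unique):
WITH Safety_Check_CTE AS
(
SELECT
Fact_Unit_Safety_Checks_Wkey,
ROW_NUMBER() OVER (PARTITION BY [Dim_Unit_Wkey], [Dim_Safety_Check_Type_Wkey]
ORDER BY [Dim_Safety_Check_Date_Wkey] DESC,
Fact_Unit_Safety_Checks_Wkey DESC) AS Check_No
FROM
[Pitches].[Fact_Unit_Safety_Checks]
)
...
Note this only makes the result repeatable; it does not change the fact that the certificate filter is applied after the latest check is picked.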

How to optimize a join to a moderately large type II table on snowflake?

Background
Suppose I have the following tables:
-- 33M rows
CREATE TABLE lkp.session (
session_id BIGINT,
visitor_id BIGINT,
session_datetime TIMESTAMP
);
-- 17M rows
CREATE TABLE lkp.visitor_customer_hist (
visitor_id BIGINT,
customer_id BIGINT,
from_datetime TIMESTAMP,
to_datetime TIMESTAMP
);
Visitor_customer_hist gives the customer_id that is in effect for each visitor at each point in time.
The goal is to look up the customer id that was in effect for each session, using the visitor_id and session_datetime.
CREATE TABLE lkp.session_effective_customer AS
SELECT
s.session_id,
vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id
AND s.session_datetime >= vch.from_datetime
AND s.session_datetime < vch.to_datetime;
Problem
Even with a warehouse scaled to large, this query is extremely slow. It took 1h15m to complete, and it was the only query running on the warehouse.
I verified there are no overlapping values in visitor_customer_hist, the presence of which could cause a duplicative join.
Is Snowflake just really bad at this kind of join? I am looking for suggestions on how I might optimize the tables for this kind of query (e.g. clustering), or any optimization technique or re-working of the query, e.g. maybe a correlated subquery or something.
Additional info: the query profile screenshot is not included here.
If the lkp.session table contains a narrow time range, and the lkp.visitor_customer_hist table contains a wide time range, you may benefit from rewriting the query to add a redundant condition restricting the range of rows considered in the join:
CREATE TABLE lkp.session_effective_customer AS
SELECT
s.session_id,
vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN lkp.visitor_customer_hist vch ON vch.visitor_id = s.visitor_id
AND s.session_datetime >= vch.from_datetime
AND s.session_datetime < vch.to_datetime
WHERE vch.to_datetime >= (select min(session_datetime) from lkp.session)
AND vch.from_datetime <= (select max(session_datetime) from lkp.session);
On the other hand, this won't help very much if both tables cover similarly wide date ranges and there are large numbers of customers associated with a given visitor over time.
Following Stuart's answer, we can filter it a bit more by looking at the visitor-wise min and max. Like so:
CREATE TEMPORARY TABLE _vch AS
SELECT
l.visitor_id,
l.customer_id,
l.from_datetime,
l.to_datetime
FROM (
SELECT
l.visitor_id,
min(l.session_datetime) AS mindt,
max(l.session_datetime) AS maxdt
FROM lkp.session l
GROUP BY l.visitor_id
) a
JOIN lkp.visitor_customer_hist l ON a.visitor_id = l.visitor_id
AND l.from_datetime >= a.mindt
AND l.to_datetime <= a.maxdt;
Then with our lighter-weight hist table, maybe we'll have better luck:
CREATE TABLE lkp.session_effective_customer AS
SELECT
s.session_id,
vch.customer_id AS effective_customer_id
FROM lkp.session s
JOIN _vch vch ON vch.visitor_id = s.visitor_id
AND s.session_datetime >= vch.from_datetime
AND s.session_datetime < vch.to_datetime;
Unfortunately, in my case, though I filtered out a huge percentage of rows, the problem visitors (those with thousands of records in visitor_customer_hist) remained problematic (i.e. they still had thousands of records, resulting in join explosion).
In other circumstances, though, this could work.
In cases where both tables have high record counts per-visitor, this join is problematic, for reasons Marcin described in the comments. Accordingly, with this kind of scenario, it is best to avoid this kind of join altogether if possible.
The way I ultimately solved this issue was to scrap the visitor_customer_hist table and write a custom window function / udtf.
Initially I created the lkp.visitor_customer_hist table because it could be built using existing window functions, and on a non-MPP SQL database appropriate indexes could be created that would make lookups sufficiently performant. It was created like so:
CREATE TABLE lkp.visitor_customer_hist AS
SELECT
a.visitor_id AS visitor_id,
a.customer_id AS customer_id,
nvl(lag(a.session_datetime) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime ), '1900-01-01') AS from_datetime,
CASE WHEN lead(a.session_datetime) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime ) IS NULL THEN '9999-12-31'
ELSE a.session_datetime END AS to_datetime
FROM (
SELECT
s.session_id,
vs.visitor_id,
customer_id,
row_number() OVER ( PARTITION BY vs.visitor_id, s.session_datetime
ORDER BY s.session_id ) AS rn,
lead(s.customer_id) OVER ( PARTITION BY vs.visitor_id
ORDER BY s.session_datetime ) AS next_cust_id,
session_datetime
FROM "session" s
JOIN "visitor_session" vs ON vs.session_id = s.session_id
WHERE s.customer_id <> -2
) a
WHERE (a.next_cust_id <> a.customer_id
OR a.next_cust_id IS NULL) AND a.rn = 1;
So, scrapping this approach, I wrote the following UDTF instead:
CREATE OR REPLACE FUNCTION udtf_eff_customer(customer_id FLOAT)
RETURNS TABLE(effective_customer_id FLOAT)
LANGUAGE JAVASCRIPT
IMMUTABLE
AS '
{
initialize: function() {
// Each partition (visitor) starts with "unknown" (-1).
this.customer_id = -1;
},
processRow: function (row, rowWriter, context) {
// Carry forward the last known customer_id seen so far in this partition
// (rows arrive ordered by session_datetime DESC; see the query below).
if (row.CUSTOMER_ID != -1) {
this.customer_id = row.CUSTOMER_ID;
}
rowWriter.writeRow({EFFECTIVE_CUSTOMER_ID: this.customer_id});
},
finalize: function (rowWriter, context) {/*...*/},
}
';
And it can be applied like so:
SELECT
iff(a.customer_id <> -1, a.customer_id, ec.effective_customer_id) AS customer_id,
a.session_id
FROM "session" a
JOIN table(udtf_eff_customer(nvl2(a.visitor_id, a.customer_id, NULL) :: DOUBLE) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime DESC )) ec
So this accomplishes the desired result: for every session, if the customer_id is not "unknown", then we go ahead and use that; otherwise, we use the next customer_id (if one exists) that can be associated with that visitor (ordering by time of session).
This is a much better solution than creating the lookup table; it essentially takes only one pass at the data, requires a lot less code / complexity, and goes very fast.

Get random data from SQL Server without performance impact

I need to select random rows from my SQL table. When I searched for this, the usual suggestion was ORDER BY NEWID(), but that reduces performance. Since my table has more than 2,000,000 rows of data, that solution does not suit me.
I tried this code to get random data :
SELECT TOP 10 *
FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10
It also drops performance sometimes.
Could you please suggest a good solution for getting random data from my table? I need a minimum number of rows, around 30 per request. I tried TABLESAMPLE to get the data, but it returns nothing once I add my WHERE condition, because it returns data on a per-page basis, not per-row.
Calculate the random ids first, before filtering your big table.
Since your key is not an identity column, you need to number the records, and this will affect performance.
Note that I have used DISTINCT to be sure of getting different numbers.
EDIT: I have modified the query to use an arbitrary filter on your big table
declare @n int = 30
;with
t as (
-- EXTRACT DATA AND NUMBER ROWS
select *, ROW_NUMBER() over (order by YourPrimaryKey) n
from YourBigTable t
-- SOME FILTER
WHERE 1=1 /* <-- PUT HERE YOUR COMPLEX FILTER LOGIC */
),
r as (
-- RANDOM NUMBERS BETWEEN 1 AND COUNT(*) OF FILTERED TABLE
select distinct top (@n) abs(CHECKSUM(NEWID()) % t.n)+1 rnd
from sysobjects s
cross join (SELECT MAX(n) n FROM t) t
)
select t.*
from t
join r on r.rnd = t.n
If your uniqueidentifier key is a random GUID (not generated with NEWSEQUENTIALID() or UuidCreateSequential), you can use the method below. This will use the clustered primary key index without sorting all rows.
SELECT t1.*
FROM (VALUES(
NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())
,(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID()),(NEWID())) AS ThirtyKeys(ID)
CROSS APPLY(SELECT TOP (1) * FROM dbo.Table1 WHERE ID >= ThirtyKeys.ID) AS t1;

Efficient way to get max date before a given date

Suppose I have a table called Transaction and another table called Price. Price holds the prices for given funds at different dates. Each fund will have prices added at various dates, but they won't have prices at all possible dates. So for fund XYZ I may have prices for the 1 May, 7 May and 13 May and fund ABC may have prices at 3 May, 9 May and 11 May.
So now I'm looking at the price that was prevailing for a fund at the date of a transaction. The transaction was for fund XYZ on 10 May. What I want, is the latest known price on that day, which will be the price for 7 May.
Here's the code:
select d.TransactionID, d.FundCode, d.TransactionDate, v.OfferPrice
from Transaction d
inner join Price v
on v.FundCode = d.FundCode
and v.PriceDate = (
select max(PriceDate)
from Price
where FundCode = v.FundCode
/* */ and PriceDate < d.TransactionDate
)
It works, but it is very slow (several minutes in real world use). If I remove the line with the leading comment, the query is very quick (2 seconds or so) but it then uses the latest price per fund, which is wrong.
The bad part is that the price table is minuscule compared to some of the other tables we use, and it isn't clear to me why it is so slow. I suspect the offending line forces SQL Server to process a Cartesian product, but I don't know how to avoid it.
I keep hoping to find a more efficient way to do this, but it has so far escaped me. Any ideas?
You don't specify the version of SQL Server you're using, but if you are using a version with support for ranking functions and CTE queries I think you'll find this quite a bit more performant than using a correlated subquery within your join statement.
It should be very similar in performance to Andriy's queries. Depending on the exact index topography of your tables, one approach might be slightly faster than another.
I tend to like CTE-based approaches because the resulting code is quite a bit more readable (in my opinion). Hope this helps!
;WITH set_gen (TransactionID, OfferPrice, Match_val)
AS
(
SELECT d.TransactionID, v.OfferPrice, ROW_NUMBER() OVER(PARTITION BY d.TransactionID ORDER BY v.PriceDate DESC) AS Match_val
FROM Transaction d
INNER JOIN Price v
ON v.FundCode = d.FundCode
WHERE v.PriceDate <= d.TransactionDate
)
SELECT sg.TransactionID, d.FundCode, d.TransactionDate, sg.OfferPrice
FROM Transaction d
INNER JOIN set_gen sg ON d.TransactionID = sg.TransactionID
WHERE sg.Match_val = 1
There's a method for finding rows with maximum or minimum values that involves a LEFT JOIN to self, rather than the more intuitive, but probably more costly, INNER JOIN to a self-derived aggregated list.
Basically, the method uses this pattern:
SELECT t.*
FROM t
LEFT JOIN t AS t2 ON t.key = t2.key
AND t2.Value > t.Value /* ">" is when getting maximums; "<" is for minimums */
WHERE t2.key IS NULL
or its NOT EXISTS counterpart:
SELECT *
FROM t
WHERE NOT EXISTS (
SELECT *
FROM t AS t2
WHERE t.key = t2.key
AND t2.Value > t.Value /* same as above applies to ">" here as well */
)
So, the result is all the rows for which there doesn't exist a row with the same key and the value greater than the given.
When there's just one table, applying the above method is pretty straightforward. However, it may not be that obvious how to apply it when there's another table, especially when, like in your case, the other table doesn't merely make the query more complex by being there, but also provides additional filtering for the values we are looking for, namely the upper limits for the dates.
So, here's what the resulting query might look like when applying the LEFT JOIN version of the method:
SELECT
d.TransactionID,
d.FundCode,
d.TransactionDate,
v.OfferPrice
FROM Transaction d
INNER JOIN Price v ON v.FundCode = d.FundCode
                  AND v.PriceDate < d.TransactionDate
LEFT JOIN Price v2 ON v2.FundCode = v.FundCode /* this and */
AND v2.PriceDate > v.PriceDate /* this are where we are applying
the above method; */
AND v2.PriceDate < d.TransactionDate /* and this is where we are limiting
the maximum value */
WHERE v2.FundCode IS NULL
And here's a similar solution with NOT EXISTS:
SELECT
d.TransactionID,
d.FundCode,
d.TransactionDate,
v.OfferPrice
FROM Transaction d
INNER JOIN Price v ON v.FundCode = d.FundCode
                  AND v.PriceDate < d.TransactionDate
WHERE NOT EXISTS (
SELECT *
FROM Price v2
WHERE v2.FundCode = v.FundCode /* this and */
AND v2.PriceDate > v.PriceDate /* this are where we are applying
the above method; */
AND v2.PriceDate < d.TransactionDate /* and this is where we are limiting
the maximum value */
)
Are both PriceDate and TransactionDate indexed? If not, you are doing table scans, which is likely the cause of the performance bottleneck.
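For example, indexes like these would let the lookups seek on fund and date instead of scanning (a sketch; the index names are made up and the column names are taken from the question):
CREATE INDEX IX_Price_Fund_Date ON Price (FundCode, PriceDate) INCLUDE (OfferPrice);
CREATE INDEX IX_Transaction_Fund_Date ON [Transaction] (FundCode, TransactionDate);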
