Why are common table expressions faster even when limiting the result?

I can't get my head around why the following query:
SELECT invoice_number,
       article_number,
       count(article_number)
FROM invoice
INNER JOIN goods_list USING (invoice_number)
WHERE invoice_owner = 'someone'
GROUP BY invoice_number, article_number
LIMIT 1
is way slower than this one:
WITH base_data AS (
    SELECT invoice_number
    FROM invoice
    WHERE invoice_owner = 'someone'
    LIMIT 1
)
SELECT invoice_number,
       article_number,
       count(article_number)
FROM base_data
INNER JOIN goods_list USING (invoice_number)
GROUP BY invoice_number, article_number
Is the limit applied after the whole result set is returned?

The first query processes all the data for the invoice owner, does the GROUP BY, and only then returns one row.
The second query picks one row in the CTE for the invoice owner up front. It joins that single row to the other table and then aggregates over far fewer rows.
Hence it is not surprising that the second query is much faster: it processes many fewer rows in the aggregation.
Note: when using LIMIT you should also use ORDER BY. Otherwise you can get any matching row -- and you might even get different rows on different runs.
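For instance, a deterministic version of the CTE might look like this (a sketch; created_at is a hypothetical column, substitute whatever column defines "first" for your data):
WITH base_data AS (
    SELECT invoice_number
    FROM invoice
    WHERE invoice_owner = 'someone'
    ORDER BY created_at   -- hypothetical column; makes LIMIT 1 deterministic
    LIMIT 1
)
SELECT invoice_number,
       article_number,
       count(article_number)
FROM base_data
INNER JOIN goods_list USING (invoice_number)
GROUP BY invoice_number, article_number;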

Related

Snowflake - Dynamic Filter Condition inside CTE (Common Table Expression)

I am creating a view in Snowflake that has a CTE on the base table without any filters. Other CTEs depend on this parent CTE to fetch further information.
Everything works fine when I query all records from the base table, which has 45K rows. But when I query the view for one particular ID, the explain plan shows the base CTE picking up all 45K rows, joining the rest of the CTEs against those 45K rows, and only then applying my unique-ID filter to return one row.
So I see no difference in performance between pulling data for all records and for one record. Snowflake is not optimizing the base CTE to apply the filter criteria I am looking for.
Any suggestions on how I can resolve this issue? I tried local variables in the filter criteria of the base CTE, but that is not a viable solution.
CREATE OR REPLACE VIEW test_v AS
WITH parent_cte AS (
    SELECT document_id, time, ...
    FROM audit_table
),
emp_cte AS (
    SELECT employee_details, ...
    FROM employee_tab,
         parent_cte
    WHERE parent_cte.document_id = employee_tab.document_id
),
dep_cte AS (
    SELECT dep_details, ...
    FROM dependent_tab,
         emp_cte
    WHERE ..........
)
SELECT *
FROM dep_cte, emp_cte, parent_cte;
Now when I query the view for one document_id, the plan fetches all the data, joins it, and only then applies the filter, which is not efficient.
select * from test_v where document_id = '1001';
I can't combine these tables into one SELECT with join conditions because I am using "LATERAL FLATTEN", which cross-multiplies each base table record, so I am going with the CTE approach.
Appreciate your ideas.
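For illustration, the pushdown the asker is hoping for amounts to filtering inside the parent CTE before any joins happen. A sketch reusing the question's names (hard-coding the filter like this only works in an ad-hoc query, not inside the view definition):
WITH parent_cte AS (
    SELECT document_id, time
    FROM audit_table
    WHERE document_id = '1001'   -- filter applied before any joins
),
emp_cte AS (
    SELECT e.employee_details, p.document_id
    FROM employee_tab e
    JOIN parent_cte p ON p.document_id = e.document_id
)
SELECT *
FROM emp_cte;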

Subquery is returning more than one value? Not getting a result set?

Write a SELECT statement that returns two columns: VendorName and LargestInv
(LargestInv is the correlation name of the subquery)
Subquery portion:
SELECT statement that returns the largest InvoiceTotal from the Invoices table (you will need to perform the JOIN within the subquery in one of the clauses).
Sort the results by LargestInv from largest to smallest. (The subquery must be in the SELECT statement.)
I have tried this, but my subquery is returning more than one value:
USE AP

SELECT VendorName,
       (SELECT MAX(InvoiceTotal)
        FROM Invoices
        JOIN Vendors ON Invoices.VendorID = Vendors.VendorID
        GROUP BY Invoices.VendorID) AS LargestInv
FROM Vendors
Your issue is scope.
The sub-query shouldn't be joining to the Vendors table if the goal is a correlated sub-query. The "correlated" part comes from relating the inner query (the sub-query) to the outer query.
As written, you're grouping by VendorID entirely inside the sub-query, so the results aren't correlated to the outer query at all: the sub-query returns one MAX per vendor, which is many rows. Hence your error message.
SELECT VendorName,
       (SELECT MAX(InvoiceTotal)
        FROM Invoices
        WHERE Invoices.VendorID = Vendors.VendorID
       ) AS LargestInv
FROM Vendors
ORDER BY LargestInv DESC;
Edit (extended explanation):
A correlated sub-query isn't designed to pull up a full result set, like the sub-query you were writing at first. It's designed to go over to another table and use a value (or values) from the outer query to bring back a single result, one row at a time.
In this case, using the VendorID from the Vendors table, go over to Invoices, calculate a MAX value "WHERE" the VendorID in Invoices matches the VendorID ON THIS ROW, bring that single value back, then, next row, go back and do that again. And again and again.
It's one way to get the data, but it's not usually efficient. Later, though, you'll learn to use correlated sub-queries in (NOT) EXISTS clauses, and in that context they tend to be extremely efficient. Story for another day, but it's one reason the construct is important to know.
So your instinct was good: a set-based sub-query in the FROM clause would tend to be more efficient. But this way, row by row, is important to understand conceptually.
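As a quick illustration of the (NOT) EXISTS pattern mentioned above, a sketch against the same tables that finds vendors with no invoices at all:
SELECT VendorName
FROM Vendors
WHERE NOT EXISTS (SELECT 1
                  FROM Invoices
                  WHERE Invoices.VendorID = Vendors.VendorID);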
This is how I would do it.
SELECT Vendors.VendorName, LargestInv.MaxI
FROM Vendors
JOIN (
    SELECT VendorName, MAX(InvoiceTotal) AS MaxI
    FROM Invoices
    JOIN Vendors ON Invoices.VendorID = Vendors.VendorID
    GROUP BY VendorName
) AS LargestInv ON LargestInv.VendorName = Vendors.VendorName
Now the sub-query returning more than one row won't give you an error, and you can look at the results.

How does DISTINCT work in SQL Server 2008 R2? Are there other options? [duplicate]

I need to retrieve all rows from a table where the combination of 2 columns is distinct. So I want all the sales that do not have any other sale that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.
So I'm thinking:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
             FROM sales
             HAVING count = 1)
But my brain hurts going any farther than that.
SELECT DISTINCT a, b, c FROM t
is roughly equivalent to:
SELECT a, b, c FROM t GROUP BY a, b, c
It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
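For example, GROUP BY lets you attach aggregates and HAVING filters that plain DISTINCT cannot express. A minimal sketch on the question's table, listing only the duplicated pairs:
SELECT saleprice, saledate, COUNT(*) AS cnt
FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(*) > 1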
For your query, I'd do it like this:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN
(
    SELECT id
    FROM sales S
    INNER JOIN
    (
        SELECT saleprice, saledate
        FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(*) = 1
    ) T
        ON S.saleprice = T.saleprice AND S.saledate = T.saledate
)
If you put together the answers so far, clean up and improve, you would arrive at this superior query:
UPDATE sales
SET status = 'ACTIVE'
WHERE (saleprice, saledate) IN (
    SELECT saleprice, saledate
    FROM sales
    GROUP BY saleprice, saledate
    HAVING count(*) = 1
);
This is much faster than either of them; in my tests on PostgreSQL 8.4 and 9.1 it beat the currently accepted answer by a factor of 10 to 15.
But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
    SELECT FROM sales s1   -- SELECT list can be empty for EXISTS
    WHERE s.saleprice = s1.saleprice
    AND   s.saledate  = s1.saledate
    AND   s.id <> s1.id    -- except for the row itself
)
AND s.status IS DISTINCT FROM 'ACTIVE';  -- avoid empty updates, see below
Unique key to identify row
If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):
AND s1.ctid <> s.ctid
Every table should have a primary key. Add one if you don't have one yet. I suggest a serial or an IDENTITY column in Postgres 10+.
Related:
In-order sequence generation
Auto increment table column
How is this faster?
The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
Exclude empty updates
For rows that already have status = 'ACTIVE', this UPDATE would not change anything, but it would still write a new row version at full cost (minor exceptions apply). Normally you do not want that. Add another WHERE condition, as demonstrated above, to avoid it and make the query even faster:
If status is defined NOT NULL, you can simplify to:
AND status <> 'ACTIVE';
The data type of the column must support the <> operator. Some types like json don't. See:
How to query a json column for empty objects?
Subtle difference in NULL handling
This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):
(123, NULL)
(123, NULL)
Such rows would also pass a unique index, and almost any other uniqueness check, since NULL values do not compare equal according to the SQL standard. See:
Create unique constraint with null columns
OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:
How to delete duplicate rows without unique identifier
If all columns being compared are defined NOT NULL, there is no room for disagreement.
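To make the NULL-safe variant explicit, here is the query above with IS NOT DISTINCT FROM swapped in (a sketch, Postgres syntax):
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
    SELECT FROM sales s1
    WHERE s.saleprice IS NOT DISTINCT FROM s1.saleprice  -- NULLs now compare equal
    AND   s.saledate  IS NOT DISTINCT FROM s1.saledate
    AND   s.id <> s1.id
)
AND s.status IS DISTINCT FROM 'ACTIVE';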
The problem with your query is that when using a GROUP BY clause (which you essentially do by using DISTINCT) you can only use the columns you group by and aggregate functions. You cannot use the column id because it potentially has different values within a group. In your case there is always only one value because of the HAVING clause, but most RDBMSs are not smart enough to recognize that.
This should work however (and doesn't need a join):
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (
    SELECT MIN(id)
    FROM sales
    GROUP BY saleprice, saledate
    HAVING COUNT(id) = 1
)
You could also use MAX or AVG instead of MIN; it is only important to use a function that returns the value of the column when there is exactly one matching row.
If your DBMS doesn't support DISTINCT with multiple columns, like this:
SELECT DISTINCT (col1, col2) FROM table
a multi-column distinct can generally be executed safely as follows:
SELECT DISTINCT * FROM (SELECT col1, col2 FROM table) AS x
This works on most DBMSs, and it can be faster than the GROUP BY solution because it avoids the grouping step.
I want to select the distinct values from one column, 'GrondOfLucht', but they should be sorted in the order given in the column 'sortering'. I cannot get the distinct values of just one column using:
SELECT DISTINCT GrondOfLucht, sortering
FROM CorWijzeVanAanleg
ORDER BY sortering
It will also return the column 'sortering', and because the combination of 'GrondOfLucht' and 'sortering' is not unique, the result will be ALL rows.
Use GROUP BY to select the values of 'GrondOfLucht' in the order given by 'sortering':
SELECT GrondOfLucht
FROM dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht, sortering
ORDER BY MIN(sortering)

SQL Server: order of returned rows when using IN clause

Running the following query returns 4 rows. As I can see in SSMS, the order of the returned rows is the same as I specified in the IN clause.
SELECT * FROM Table WHERE ID IN (4,3,2,1)
Can I say that the order of returned rows is ALWAYS the same as they appear in the IN clause?
If yes, is it then true that the following two queries return the rows in the same order? (As far as I've tested, the orders are the same, but I don't know if I can trust this behavior.)
SELECT TOP 10 * FROM Table ORDER BY LastModification DESC
SELECT * FROM Table WHERE ID IN (SELECT TOP 10 ID FROM Table ORDER BY LastModification DESC)
I ask this question because I have a quite complex SELECT query, and using this trick on it brings me about a 30% performance gain in my case.
You cannot guarantee the rows to be in any particular order unless you use an ORDER BY clause. You may use tricks that work some of the time, but they won't guarantee the order.
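If the IN-list order matters, make it explicit. A sketch for SQL Server, reusing the question's table: join against a VALUES list that carries its own ordinal and sort by that ordinal:
SELECT t.*
FROM [Table] t
JOIN (VALUES (4, 1), (3, 2), (2, 3), (1, 4)) AS ids(ID, ord)
    ON t.ID = ids.ID
ORDER BY ids.ord;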

SQL Server Query - ORDER BY killing query performance on small result set

I have a query in SQL Server 2008 R2 in the following form:
SELECT TOP (2147483647) *
FROM (
    SELECT *
    FROM sub_query_a
) hierarchy
LEFT JOIN (
    SELECT *
    FROM sub_query_b
) expenditure
    ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y
ORDER BY hierarchy.c, hierarchy.d, hierarchy.e
The hierarchy subquery contains UNIONS and INNER JOINS. The expenditure subquery is based on several levels of sub-subqueries, and contains UNIONS, INNER and LEFT JOINS, and ultimately, a PIVOT aggregate.
The hierarchy subquery by itself runs in 2 seconds and returns 467 rows. The expenditure subquery by itself runs in 7 seconds and returns 458 rows. Together, without the ORDER BY clause, the query runs in 11 seconds. However, with the ORDER BY clause, the query runs in 11 minutes.
The Actual Execution Plan reveals what's different. Without the ORDER BY clause, both the hierarchy and expenditure subqueries run once each, and their results are combined with a Merge Join (Right Outer Join). When the ORDER BY clause is included, the hierarchy query still runs once, but the expenditure portion runs once per row from the hierarchy query, and the results are combined with Nested Loops (Left Outer Join). It's as if the ORDER BY clause turned the expenditure subquery into a correlated subquery (which it is not).
To verify that SQL Server was actually capable of doing the query and producing a sorted result set in 11 seconds, as a test, I created a temp table and inserted the results of the query without the ORDER BY clause into it. Then I did a SELECT * FROM #temp_table ORDER BY c, d, e. The entire script took the expected 11 seconds, and returned the desired results.
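In T-SQL that test looks roughly like this (a sketch reusing the question's placeholder subqueries; the selected column list is abbreviated to the columns named in the question):
SELECT hierarchy.c, hierarchy.d, hierarchy.e   -- plus whatever other columns are needed
INTO #temp_table
FROM (SELECT * FROM sub_query_a) hierarchy
LEFT JOIN (SELECT * FROM sub_query_b) expenditure
    ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y;

SELECT * FROM #temp_table ORDER BY c, d, e;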
I want to make the query work efficiently with the ORDER BY clause as one query--I don't want to have to create a stored procedure just to enable the #temp_table hacky solution.
Any ideas on the cause of this issue, or a fix?
To avoid nested loop joins, you can give an option to the compiler:
SELECT TOP (2147483647) *
FROM (
    SELECT *
    FROM sub_query_a
) hierarchy
LEFT JOIN (
    SELECT *
    FROM sub_query_b
) expenditure
    ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y
ORDER BY hierarchy.c, hierarchy.d, hierarchy.e
OPTION (MERGE JOIN, HASH JOIN)
I generally much prefer to have the optimizer figure out the right query plan. On rare occasions, however, I run into a problem similar to yours and need to give it a suggestion to push it in the right direction.
Thanks to @MartinSmith's comment, I looked at what could cause the major discrepancies between the estimated and actual rows delivered by the expenditure subquery in the non-ORDER BY version, even though I eventually wanted to ORDER it. I thought that if I could optimize that version a bit, it might also benefit the ORDER BY version.
As I mentioned in the OP, the expenditure subquery contains a PIVOT aggregation across yet another subquery (let's call it unaggregated_expenditure). I added a layer between the PIVOT and the unaggregated_expenditure subquery which aggregated the required column before PIVOTing the same column across the required few pivot columns. This added a bit of conceptual complexity, yet was able to reduce the estimated number of rows coming from the PIVOT from 106,245,000 to 10,307. This change, when applied to the ORDER BY version of the whole query, resulted in a different Actual Execution Plan that was able to process and deliver the query within the desired 11 seconds.
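A sketch of that pre-aggregation layer (all table and column names here are hypothetical stand-ins, since the real query isn't shown): collapse the rows with GROUP BY first, then PIVOT over the already-aggregated set:
SELECT account, [2019], [2020], [2021]
FROM (
    SELECT account, fiscal_year, SUM(amount) AS amount   -- added aggregation layer
    FROM unaggregated_expenditure
    GROUP BY account, fiscal_year
) AS pre
PIVOT (SUM(amount) FOR fiscal_year IN ([2019], [2020], [2021])) AS p;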
