How to efficiently compare two subsets of rows of a large table?

I use PostgreSQL 9.6 and my table schema is as follows: department, key, value1, value2, value3, ... Each department has hundreds of millions of unique keys, but the set of keys is more or less the same for all departments. It's possible that some keys don't exist for some departments, but such situations are rare.
I would like to prepare a report that for two departments points out differences in values for each key (comparison involves some logic based only on values for the key).
My first approach was to write an external tool in python that:
creates a server-side cursor for query: SELECT * FROM my_table WHERE department = 'ABC' ORDER BY key;
creates another server-side cursor for query: SELECT * FROM my_table WHERE department = 'XYZ' ORDER BY key;
iterates over both cursors, and compares the values.
It worked fine, but I thought it would be more efficient to perform the comparison inside a stored procedure in PostgreSQL. So I wrote a stored procedure that takes two cursors as arguments, iterates over them, and compares the values. Any differences are written to a temporary table. At the end, the external tool iterates over just the temporary table; there shouldn't be many rows there.
I thought that the latter approach would be more efficient, because it doesn't require transferring lots of data outside the database.
To my surprise, it turned out to be slower by almost 40%.
To isolate the problem, I compared the performance of iterating over a cursor inside a stored procedure and in Python:
FETCH cur_1 INTO row_1;
WHILE (row_1 IS NOT NULL) LOOP
    rows := rows + 1;
    FETCH FROM cur_1 INTO row_1;
END LOOP;
vs.
import psycopg2

conn = psycopg2.connect(PG_URI)
cur = conn.cursor('test')  # named cursor => server-side
cur.execute(query)
cnt = 0
for row in cur:
    cnt += 1
The query was the same in both cases. Again, the external tool was faster.
My hypothesis is that this is because the stored procedure fetches rows one by one (FETCH FROM cur_1 INTO row_1) while the application fetches rows in batches of 2000. But I couldn't find a way to FETCH a batch of rows from a cursor inside a PL/pgSQL procedure, so I can't test the hypothesis.
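For reference, batch fetching does exist at the SQL level; it is effectively what the driver does under the hood. A minimal sketch (assuming the cursor is used inside a transaction):
BEGIN;
DECLARE cur_1 CURSOR FOR
    SELECT * FROM my_table WHERE department = 'ABC' ORDER BY key;
-- a single round trip returns up to 2000 rows
FETCH FORWARD 2000 FROM cur_1;
CLOSE cur_1;
COMMIT;
Inside PL/pgSQL, however, FETCH ... INTO always retrieves a single row, which is why this is hard to test from within the procedure.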
So my question is: is it possible to speed up my stored procedure?
What is the best approach for problems like this?

Why can you not do a self-join rather than using cursors? Something like:
SELECT t1.key, t1.value1 - t2.value1 AS diff1,
       t1.value2 - t2.value2 AS diff2, ...
FROM my_table t1
INNER JOIN my_table t2 ON t1.key = t2.key
WHERE t1.department = 'XYZ' AND t2.department = 'ABC'
UNION
SELECT t1.key, t1.value1 AS diff1,
       t1.value2 AS diff2, ...
FROM my_table t1
WHERE NOT EXISTS (SELECT 1 FROM my_table t2
                  WHERE t1.key = t2.key AND t2.department = 'ABC')
  AND t1.department = 'XYZ'
UNION
SELECT t1.key, t1.value1 AS diff1,
       t1.value2 AS diff2, ...
FROM my_table t1
WHERE NOT EXISTS (SELECT 1 FROM my_table t2
                  WHERE t1.key = t2.key AND t2.department = 'XYZ')
  AND t1.department = 'ABC';
The first part deals with all the common keys, and the two UNIONs pick up the keys missing on either side. I would have thought this would be much faster than a cursor-based approach.
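A possible refinement (my note, not part of the original answer): the three branches return disjoint sets of keys, so UNION ALL gives the same result while skipping the duplicate-eliminating sort. Spelling out the value columns (value1..value3 assumed, as elsewhere on this page):
SELECT t1.key, t1.value1 - t2.value1 AS diff1,
       t1.value2 - t2.value2 AS diff2,
       t1.value3 - t2.value3 AS diff3
FROM my_table t1
INNER JOIN my_table t2 ON t1.key = t2.key
WHERE t1.department = 'XYZ' AND t2.department = 'ABC'
UNION ALL
SELECT t1.key, t1.value1, t1.value2, t1.value3
FROM my_table t1
WHERE NOT EXISTS (SELECT 1 FROM my_table t2
                  WHERE t1.key = t2.key AND t2.department = 'ABC')
  AND t1.department = 'XYZ'
UNION ALL
SELECT t1.key, t1.value1, t1.value2, t1.value3
FROM my_table t1
WHERE NOT EXISTS (SELECT 1 FROM my_table t2
                  WHERE t1.key = t2.key AND t2.department = 'XYZ')
  AND t1.department = 'ABC';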

This might be faster, as it will only return those rows that differ in at least one of the values:
SELECT *
FROM (
    SELECT key, value1, value2, value3
    FROM my_table
    WHERE department = 'ABC'
) d1
FULL JOIN (
    SELECT key, value1, value2, value3
    FROM my_table
    WHERE department = 'XYZ'
) d2 USING (key)
WHERE d1 IS DISTINCT FROM d2;
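Either way, the set-based queries want an index leading on department so each department's subset can be read without scanning the whole table. Assuming no such index exists yet (the question doesn't say), something like:
CREATE INDEX IF NOT EXISTS my_table_department_key_idx
    ON my_table (department, key);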


Optimizing SQL Function

I'm trying to optimize or completely rewrite this query. It currently takes about 1,500 ms to run. I know the DISTINCTs are fairly inefficient, as is the UNION, but I'm struggling to figure out exactly where to go from here.
I am thinking that the first SELECT statement might not be needed to return output of the form:
[Key | User_ID,(User_ID)]
Note: Program and Program_Scenario both use clustered indexes. I can provide a screenshot of the execution plan if needed.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID] (@_CompKey INT)
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @UseID AS VARCHAR(8000);
    SET @UseID = '';

    SELECT @UseID = @UseID + ', ' + x.User_ID
    FROM (SELECT DISTINCT UPPER(p.User_ID) AS User_ID
          FROM [dbo].[Program] AS p WITH (NOLOCK)
          WHERE p.CompKey = @_CompKey
          UNION
          SELECT DISTINCT UPPER(ps.User_ID) AS User_ID
          FROM [dbo].[Program] AS p WITH (NOLOCK)
          LEFT OUTER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK)
              ON p.ProgKey = ps.ProgKey
          WHERE p.CompKey = @_CompKey
            AND ps.User_ID IS NOT NULL) x;

    RETURN SUBSTRING(@UseID, 3, 8000);
END
There are two things happening in this query:
1. Locating rows in the [Program] table matching the specified CompKey (@_CompKey).
2. Locating rows in the [Program_Scenario] table that have the same ProgKey as the rows located in (1).
Finally, the non-null User_IDs from both of these row sets are concatenated into a scalar.
For step 1 to be efficient, you'd need an index on the CompKey column (clustered or non-clustered).
For step 2 to be efficient, you'd need an index on the join key, ProgKey, on the Program_Scenario table (likely a non-clustered index, as I can't imagine ProgKey being the PK). SQL Server would likely resort to a nested-loop join strategy: for each row found in [Program] matching the CompKey criterion, it would look up the corresponding rows in [Program_Scenario] with the same ProgKey. This is a guess, though, as there is not sufficient information on the cardinality and distribution of the data.
Ensure the above two indexes are present.
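For illustration, assuming neither index exists yet (names and INCLUDE columns are my guesses at what would cover this query):
CREATE NONCLUSTERED INDEX IX_Program_CompKey
    ON [dbo].[Program] (CompKey)
    INCLUDE (User_ID, ProgKey);

CREATE NONCLUSTERED INDEX IX_Program_Scenario_ProgKey
    ON [dbo].[Program_Scenario] (ProgKey)
    INCLUDE (User_ID);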
Also, as others have noted, the LEFT OUTER JOIN is a bit confusing; combined with the ps.User_ID IS NOT NULL filter, an INNER JOIN is the right way to express it.
Per my interpretation, the inner part of the query can be rewritten this way. This is also the query you'd ideally run and optimize before tackling the string concatenation part. The DISTINCTs are dropped, as deduplication is automatic with UNION. Try this version along with the indexes above, and if it provides the necessary boost, then add the string concatenation or the XML STUFF approach to return a scalar.
SELECT UPPER(p.User_ID) AS User_ID
FROM [dbo].[Program] AS p WITH (NOLOCK)
WHERE p.CompKey = @_CompKey
UNION
SELECT UPPER(ps.User_ID) AS User_ID
FROM [dbo].[Program] AS p WITH (NOLOCK)
INNER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = @_CompKey
  AND ps.User_ID IS NOT NULL
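As a side note, if you happen to be on SQL Server 2017 or later (the question doesn't say), STRING_AGG can replace both the variable-concatenation and XML STUFF approaches in one step:
SELECT STRING_AGG(x.User_ID, ', ') AS UserIDs
FROM (
    SELECT UPPER(p.User_ID) AS User_ID
    FROM [dbo].[Program] AS p
    WHERE p.CompKey = @_CompKey
    UNION
    SELECT UPPER(ps.User_ID) AS User_ID
    FROM [dbo].[Program] AS p
    INNER JOIN [dbo].[Program_Scenario] AS ps ON p.ProgKey = ps.ProgKey
    WHERE p.CompKey = @_CompKey
      AND ps.User_ID IS NOT NULL
) AS x;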
I am taking a shot in the dark here. I am guessing that the last code you posted is still a scalar function, and it did not have all the logic of your original query either. Again, this is a shot in the dark, since there are no table definitions or sample data posted.
This might be how it would look as an inline table-valued function:
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID]
(
    @_CompKey INT
) RETURNS TABLE AS RETURN
SELECT MyResult = STUFF(
    (
        SELECT ', ' + x.User_ID
        FROM (
            SELECT UPPER(p.User_ID) AS User_ID
            FROM dbo.Program AS p
            WHERE p.CompKey = @_CompKey
            UNION
            SELECT UPPER(ps.User_ID) AS User_ID
            FROM dbo.Program AS p
            INNER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
            WHERE p.CompKey = @_CompKey
              AND ps.User_ID IS NOT NULL
        ) x
        FOR XML PATH ('')
    ), 1, 2, '');

SQL Server (2014) Query Optimization when using filter variable passed to procedure

I am looking for help with optimization techniques, or a hint to move ahead with the problem I have. Using a temp table for the IN clause makes my query run for more than 5 seconds; changing it to a static value returns the data in under a second. I am trying to understand how to optimize this.
-- details about the number of rows in the table
-- dept_activity table
--   total rows: 17,319,666
--   rows for (dept_id = 10): 36,054

-- temp table
CREATE TABLE #tbl_depts (
    Id INT IDENTITY(1, 1) PRIMARY KEY,
    dept_id INT
);

-- for example I inserted one row, but depending on conditions multiple department numbers are inserted into this temp table
INSERT INTO #tbl_depts (dept_id) VALUES (10);
-- this query takes more than 5 seconds
SELECT activity_type, COUNT(1) rc
FROM dept_activity da
WHERE (
        @filter_by_dept IS NULL
        OR da.depart_id IN (SELECT td.dept_id FROM #tbl_depts td)
      )
GROUP BY activity_type;
-- this query takes less than 500 milliseconds
SELECT activity_type, COUNT(1) rc
FROM dept_activity da
WHERE (
        @filter_by_dept IS NULL
        OR da.depart_id IN (10) -- changed to a static value
      )
GROUP BY activity_type;
How can I optimize the first query so that it returns data in under a second?
You're testing this with just one value, but isn't your real case different?
The problem the optimizer has here is that it can't know how many rows the temp table in the IN clause will actually contain, so it has to make a guess, and that is probably why the results differ. Looking at estimated vs. actual row counts in the plan might give some insight into this.
If your WHERE clause only contains this one criterion:
@filter_by_dept IS NULL OR da.depart_id IN (...)
it might be good to test what happens if you separate the logic into IF blocks: one branch that fetches everything, and one that filters.
If that's not the real case, you might want to test OPTION (RECOMPILE), which could result in a better plan but will use (a little) more CPU, since the plan is regenerated on every execution. Alternatively, construct the query with dynamic SQL (either keeping the temp table and just optimizing away the OR, or building a literal IN list if there isn't a ridiculous number of values), but that can get really ugly.
There are different ways of writing the same thing; use whichever fits your requirements.
Separate Block
IF @filter_by_dept IS NULL
BEGIN
    SELECT da.activity_type, COUNT(1) rc
    FROM dept_activity da
    GROUP BY da.activity_type
END
ELSE
BEGIN
    SELECT da.activity_type, COUNT(1) rc
    FROM dept_activity da
    INNER JOIN #tbl_depts td ON td.dept_id = da.depart_id
    GROUP BY da.activity_type
END
Dynamic Query
DECLARE @sql_stmt VARCHAR(5000)
SET @sql_stmt = '
    SELECT activity_type, COUNT(1) rc
    FROM dept_activity da
'
IF @filter_by_dept IS NOT NULL
    SET @sql_stmt = @sql_stmt + ' INNER JOIN #tbl_depts td ON td.dept_id = da.depart_id'
SET @sql_stmt = @sql_stmt + ' GROUP BY da.activity_type '
EXEC (@sql_stmt)
Simple Left Join
Comparatively, it can be slower than the two options above.
SELECT da.activity_type, COUNT(1) rc
FROM dept_activity da
LEFT JOIN #tbl_depts td ON td.dept_id = da.depart_id
WHERE @filter_by_dept IS NULL OR td.Id IS NOT NULL
GROUP BY da.activity_type
The biggest issue is most likely the use of an "optional parameter". The query optimizer has no idea whether or not @filter_by_dept is going to have a value the next time the query executes, so it plays it safe and opts for an index scan rather than an index seek. This is where OPTION (RECOMPILE) can be your friend, especially on simple, easy-to-compile queries like this one.
Also, there are potential gains from using WHERE EXISTS in place of the IN.
Try the following...
DECLARE @filter_by_dept INT = 10;

SELECT
    da.activity_type,
    rc = COUNT(1)
FROM
    dbo.dept_activity da
WHERE
    @filter_by_dept IS NULL
    OR EXISTS (SELECT 1 FROM #tbl_depts td WHERE da.depart_id = td.dept_id)
GROUP BY
    da.activity_type
OPTION (RECOMPILE);
HTH, Jason

SQL Server copy rows to second table

I have a table for bookings (table_b) that has around 1.3M rows. A second table (table_s) is used to note when these rows need to be accessed by a separate application.
Currently, triggers create a record in table_s, but this doesn't help with all the existing data.
I believe I need a query that selects the rows that exist in table_b but not in table_s, and then inserts a row for each of them.
Here is my current syntax, but I don't think it is formed correctly:
DECLARE @b_id [INT] = 0;

WHILE (1 = 1)
BEGIN
    SELECT TOP 10
        @b_id = MIN([b].[b_id])
    FROM
        [table_b] AS [b]
    LEFT JOIN
        [table_s] AS [s] ON [b].[b_id] = [s].[b_id]
    WHERE
        [s].[b_id] IS NULL;

    IF @b_id IS NULL
        BREAK;

    INSERT INTO [table_s] ([b_id], [processed])
    VALUES (@b_id, 0);
END;
Syntactically, everything is fine, but there are some misconceptions present in your query:
select top 10 @b_id = MIN(b.b_id)
A variable can hold just one value; even though you SELECT TOP 10, a single value is assigned to the variable. Your current approach will loop once for each missing record.
I don't think an insert of around a million records needs to be split into batches. Try it this way:
INSERT INTO table_s (b_id, processed)
SELECT b_id, 0
FROM table_b AS b
WHERE NOT EXISTS (SELECT 1
                  FROM table_s AS s
                  WHERE b.b_id = s.b_id)
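If transaction log growth ever did become a concern (an assumption; nothing in the question suggests it), the same set-based insert can be batched without resorting to per-row loops:
-- hypothetical batch size of 10,000 rows per iteration
WHILE 1 = 1
BEGIN
    INSERT INTO table_s (b_id, processed)
    SELECT TOP (10000) b.b_id, 0
    FROM table_b AS b
    WHERE NOT EXISTS (SELECT 1
                      FROM table_s AS s
                      WHERE b.b_id = s.b_id);

    IF @@ROWCOUNT = 0
        BREAK;
END;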

Switching on queries to run in a view

I have a huge view with many queries concatenated using UNION ALL, where the first column of every query is a constant.
e.g.
CREATE VIEW M AS (
SELECT 'A' ID, Value FROM A
UNION ALL
SELECT 'B' ID, Value FROM B
...
)
The queries are more complex in reality, but the purpose here is just to switch which queries run, like this:
SELECT * FROM M WHERE ID = 'A'
The execution plan shows that the queries that don't match on ID never run.
I thought this was a really nice feature that I could use to combine querying different but similar things through the same view.
However, I end up with an even better execution plan when querying against a CTE like this:
WITH M AS (
SELECT 'A' ID, Value FROM A
UNION ALL
SELECT 'B' ID, Value FROM B
...
)
SELECT * FROM M WHERE ID = 'A'
Here's a partial sample of the actual query:
SELECT CONVERT(char(4), 'T ') EntityTypeID, SystemID, TaskId EntityID
FROM [dbo].[Task]
UNION ALL
SELECT CONVERT(char(4), 'T ') EntityTypeID, s.SystemID, [dbo].[Task].TaskId EntityID
FROM [dbo].[Task]
INNER JOIN [dbo].[System] s ON s.MasterSystemID = [dbo].[Task].SystemID
INNER JOIN SystemEntitySettings ON SystemEntitySettings.SystemID = s.SystemID
AND SystemEntitySettings.EntityTypeID = 'T '
AND SystemEntitySettings.IsSystemPrivate = 0
Given the above T-SQL, if I run something like WHERE EntityTypeID <> 'T' it ignores the first query entirely, but does something with the second (never returning any actual rows).
The issue I'm having, or rather, my question is, why is it that it cannot eliminate the query entirely from the view, when it does so in the CTE case?
EDIT
I've observed some interesting things so far. I'm not ruling out that it's related to parameterization, but I can also achieve the desired effect either by specifying a query hint (apparently any will do) or by rewriting the second join as an IN predicate, since it is only a filter.
INNER JOIN SystemEntitySettings ON SystemEntitySettings.SystemID = s.SystemID
AND SystemEntitySettings.EntityTypeID = 'T '
AND SystemEntitySettings.IsSystemPrivate = 0
...becomes...
WHERE s.SystemID IN (
SELECT SystemID
FROM dbo.SystemEntitySettings
WHERE EntityTypeID = 'T ' AND IsSystemPrivate = 0
)
But the following query has the same issue. It appears to be related to the JOIN operations somehow. (Note the additional JOIN with [Group] taking place in this query.)
SELECT CONVERT(char(4), 'CF ') EntityTypeID, s.SystemID, [dbo].[CareerForum].GroupID EntityID
FROM [dbo].[CareerForum]
INNER JOIN [dbo].[Group] ON [dbo].[Group].GroupID = [dbo].[CareerForum].GroupID
INNER JOIN [dbo].[System] s ON s.MasterSystemID = [dbo].[Group].SystemID
WHERE s.SystemID IN (SELECT SystemID FROM dbo.SystemEntitySettings WHERE EntityTypeID = 'CF ' AND IsSystemPrivate = 0)
Reproducible
The following script can be used to reproduce the issue. Notice how the execution plan is completely different when the query is run with a query hint, or when the view's body is run as a CTE (the desired result).
CREATE DATABASE test_jan_20
USE test_jan_20
create table source (
    x int not null primary key
)
insert into source values (1)
insert into source values (2)
insert into source values (3)
insert into source values (4)
insert into source values (5)
insert into source values (6)
create table other (
    y int not null primary key
)
insert into other values (1)
insert into other values (2)
insert into other values (3)
insert into other values (4)
insert into other values (5)
insert into other values (6)
GO
create view dummy AS (
SELECT 'A' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 1 AND 2
UNION ALL
SELECT 'B' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 3 AND 4
UNION ALL
SELECT 'B' id, source.x, y
FROM SOURCE
INNER JOIN other ON y = source.x
INNER JOIN source s2 ON s2.x = y - 1 --i need this join for the issue to occur in the execution plan
WHERE source.x BETWEEN 5 AND 6
)
GO
--this one fails to remove the JOIN, not OK
SELECT * FROM dummy WHERE id = 'c'
--this is OK
SELECT * FROM dummy WHERE id = 'c' OPTION (HASH JOIN) --NOTE: any query hint seems to do the trick
--this is OK
;WITH a AS (
SELECT 'A' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 1 AND 2
UNION ALL
SELECT 'B' id, x, NULL y
FROM SOURCE
WHERE x BETWEEN 3 AND 4
UNION ALL
SELECT 'B' id, source.x, y
FROM SOURCE
INNER JOIN other ON y = source.x
INNER JOIN source s2 ON s2.x = y - 1 --i need this join for the issue to occur in the execution plan
WHERE source.x BETWEEN 5 AND 6
)
SELECT * FROM a WHERE id = 'c'
In your test case this is what is happening.
For the query with the view and the query hint or the CTE the Query Optimiser is using "contradiction detection". You can see in the execution plan properties that the OPTIMIZATION LEVEL is TRIVIAL. The trivial plan churned out is exactly the same as the one shown in point 8 of this article.
For the query with the view without the query hint this gets auto parameterised. This can prevent the contradiction detection from kicking in as covered here.
"The execution plan shows that the queries that don't match on ID never run."
That is correct since you provided a constant 'A', so the plan is built against the specific string 'A', which cuts off one part.
"The issue I'm having, or rather, my question is: why is it that it cannot eliminate the query entirely from the view, when it does so in the CTE case?"
I thought you just stated that it did? I guess you are using it in a parameterized way: in a stored procedure, a function, or a parameterized query. This causes a plan to be created that MUST be able to take various parameter values, so cutting one part out is out of the question.
To achieve what you want, you would have to generate dynamic SQL that presents the query with a constant value to the query optimizer. This is true whether you use a view or an inline table-valued function.
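A minimal sketch of that idea against the view M (with @entity_type as a hypothetical parameter holding the value to inline):
DECLARE @entity_type char(4) = 'T ';
DECLARE @sql nvarchar(max) =
    N'SELECT * FROM M WHERE ID = ' + QUOTENAME(@entity_type, '''') + N';';
-- the optimizer now sees a literal, so the non-matching branches can be eliminated
EXEC sys.sp_executesql @sql;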
EDIT: following the addition of the repro script.
These two forms seem to work as well
select * from (SELECT * FROM dummy) y WHERE id = 'c'
with a as (Select * from dummy) SELECT * FROM a WHERE id = 'c'
With your last update, the query is optimized too.
If you provide c (or any other missing value) as a filter, you will get this plan:
|--Compute Scalar(DEFINE:([Union1019]=[Expr1018], [Union1020]=[ee].[dbo].[source].[x], [Union1021]=[ee].[dbo].[other].[y]))
|--Compute Scalar(DEFINE:([Expr1018]='B'))
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1022]))
|--Constant Scan
|--Clustered Index Seek(OBJECT:([ee].[dbo].[source].[PK__source__3BD019E5171A1207] AS [s2]), SEEK:([s2].[x]=[Expr1022]) ORDERED FORWARD)
with the Constant Scan expanding as follows:
<RelOp AvgRowSize="19" EstimateCPU="1.57E-07" EstimateIO="0" EstimateRebinds="0" EstimateRewinds="0" EstimateRows="0" LogicalOp="Constant Scan" NodeId="3" Parallel="false" PhysicalOp="Constant Scan" EstimatedTotalSubtreeCost="1.57E-07">
<OutputList>
<ColumnReference Database="[ee]" Schema="[dbo]" Table="[source]" Column="x" />
<ColumnReference Database="[ee]" Schema="[dbo]" Table="[other]" Column="y" />
<ColumnReference Column="Expr1022" />
</OutputList>
<ConstantScan />
</RelOp>
In other words, source and other are never touched; the Constant Scan produces no output rows and hence there is no input for the Nested Loops, so no actual seeks are performed.
If you substitute the parameter with b, you will see a more complex plan with actual JOIN operations against all three tables.

set difference in SQL query

I'm trying to select records with a statement
SELECT *
FROM A
WHERE
LEFT(B, 5) IN
(SELECT * FROM
(SELECT LEFT(A.B,5), COUNT(DISTINCT A.C) c_count
FROM A
GROUP BY LEFT(B,5)
) p1
WHERE p1.c_count = 1
)
AND C IN
(SELECT * FROM
(SELECT A.C , COUNT(DISTINCT LEFT(A.B,5)) b_count
FROM A
GROUP BY C
) p2
WHERE p2.b_count = 1)
which takes a long time to run (~15 seconds).
Is there a better way of writing this SQL?
If you would like to express a set difference (A - B) in SQL, here is a solution.
Say you have two tables A and B, and you want to retrieve all records that exist only in A and not in B, where A and B are related via an attribute named ID.
An efficient query for this is:
# (A-B)
SELECT DISTINCT A.* FROM (A LEFT OUTER JOIN B on A.ID=B.ID) WHERE B.ID IS NULL
-from Jayaram Timsina's blog.
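Where the engine supports it, the same set difference can also be written with the built-in EXCEPT operator (MINUS in Oracle); a minimal sketch over the same ID column:
-- rows whose ID appears in A but not in B
SELECT ID FROM A
EXCEPT
SELECT ID FROM B;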
You don't need to return data from the nested subqueries. I'm not sure this will make a difference without indexing, but it's easier to read.
And EXISTS/JOIN is probably nicer, IMHO, than using IN:
SELECT *
FROM A
JOIN
    (SELECT LEFT(B, 5) AS b1
     FROM A
     GROUP BY LEFT(B, 5)
     HAVING COUNT(DISTINCT C) = 1
    ) t1 ON LEFT(A.B, 5) = t1.b1
JOIN
    (SELECT C AS c1
     FROM A
     GROUP BY C
     HAVING COUNT(DISTINCT LEFT(B, 5)) = 1
    ) t2 ON A.C = t2.c1
But you'll need a computed column, as marc_s said, plus at least two indexes: one on (computed, C) and another on (C, computed).
Well, I'm not sure what you're really trying to do here, but that LEFT(B, 5) expression obviously keeps popping up, and since you're using a function, you're giving up any chance to use an index. What you could do in your SQL Server table is create a computed, persisted column for that expression, and then put an index on it:
ALTER TABLE dbo.A
    ADD LeftB5 AS LEFT(B, 5) PERSISTED;

CREATE NONCLUSTERED INDEX IX_LeftB5 ON dbo.A (LeftB5);
Now use the new computed column LeftB5 instead of LEFT(B, 5) anywhere in your query; that should help speed up certain lookups and GROUP BY operations. Applied to the EXISTS/JOIN rewrite above, it might look like the sketch below.
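A sketch (my illustration, reusing the question's table and column names):
SELECT a.*
FROM dbo.A AS a
JOIN (SELECT LeftB5
      FROM dbo.A
      GROUP BY LeftB5
      HAVING COUNT(DISTINCT C) = 1
     ) t1 ON a.LeftB5 = t1.LeftB5
JOIN (SELECT C
      FROM dbo.A
      GROUP BY C
      HAVING COUNT(DISTINCT LeftB5) = 1
     ) t2 ON a.C = t2.C;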
Also, you have a GROUP BY C in there: is that column C indexed?
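If not, something along these lines could help (index name hypothetical; the second key column follows the (C, computed) suggestion above):
CREATE NONCLUSTERED INDEX IX_A_C ON dbo.A (C, LeftB5);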
If you are looking for just the set difference between table1 and table2, the query below is a simple one that gives the rows present in table1 but not in table2, assuming both tables are instances of the same schema with column names columnone, columntwo, ...
with
col1 as (
    select columnone from table2
),
col2 as (
    select columntwo from table2
)
...
select * from table1
where (
    columnone not in (select columnone from col1)
    and columntwo not in (select columntwo from col2)
    ...
);
