What's the difference between these T-SQL queries using OR? - sql-server

I use Microsoft SQL Server 2008 (SP1, x64). I have two queries that do the same thing, or so I think, but they have completely different query plans and performance.
Query 1:
SELECT c_pk
FROM table_c
WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
OR c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1)
Query 2:
SELECT c_pk
FROM table_c
LEFT JOIN (SELECT b_id FROM table_b WHERE b_z = 1) AS b ON c_b_id = b_id
LEFT JOIN (SELECT a_id FROM table_a WHERE a_z = 1) AS a ON c_a_id = a_id
WHERE b_id IS NOT NULL
OR a_id IS NOT NULL
Query 1 is fast as I would expect, whereas query 2 is very slow. The query plans look quite different.
I would like query 2 to be as fast as query 1. I have software that uses query 2, and I cannot change that into query 1. I can change the database.
Some questions:
why are the query plans different?
can I "teach" SQL Server somehow that query 2 is equal to query 1?
All tables have (clustered) primary keys and proper indexes on all columns:
CREATE TABLE table_a (
a_pk int NOT NULL PRIMARY KEY,
a_id int NOT NULL UNIQUE,
a_z int
)
GO
CREATE INDEX IX_table_a_z ON table_a (a_z)
GO
CREATE TABLE table_b (
b_pk int NOT NULL PRIMARY KEY,
b_id int NOT NULL UNIQUE,
b_z int
)
GO
CREATE INDEX IX_table_b_z ON table_b (b_z)
GO
CREATE TABLE table_c (
c_pk int NOT NULL PRIMARY KEY,
c_a_id int,
c_b_id int
)
GO
CREATE INDEX IX_table_c_a_id ON table_c (c_a_id)
GO
CREATE INDEX IX_table_c_b_id ON table_c (c_b_id)
GO
The tables are not modified after the initial fill. I'm the only one querying them. They contain millions of records (table_a: 5M, table_b: 4M, table_c: 12M), but using only 1% of the data gives similar results.
Edit: I tried adding FOREIGN KEYs for c_a_id and c_b_id, but that only made query 1 slower...
I hope someone can have a look at the query plans and explain the difference.

Joins are slower, by design so to speak. The first query uses a subquery (cacheable) to filter records, so it produces less data (and fewer accesses to each table).
Did you read these:
http://www.sql-server-performance.com/2006/tuning-joins/
http://blogs.msdn.com/b/craigfr/archive/2006/12/04/semi-join-transformation.aspx
What I mean is that with IN the database can apply better optimizations, such as removing duplicates or stopping at the first match (these are from school memories, so I'm sure it does much better than that). So I guess the question isn't why the query plan is different, but how smart the optimizer is and how deep its optimizations can go.

You are comparing non-equivalent queries, and you are using LEFT JOIN in quite an unusual way.
Generally, if your intention was to select all entries in table_c that have linked records in either table_a or table_b, you should use EXISTS:
SELECT c_pk
FROM table_c
WHERE Exists(
SELECT 1
FROM table_b
WHERE b_z = 1 and c_b_id = b_id
) OR Exists(
SELECT 1
FROM table_a
WHERE a_z = 1 and c_a_id = a_id
)
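Whether the two original queries can actually return different rows depends on the UNIQUE constraints on a_id and b_id: with them, each table_c row matches at most one row per LEFT JOIN, so no duplicates appear. The sketch below checks this on a handful of made-up rows. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption made only so the example is runnable), so it says nothing about query plans or speed, only about result sets.

```python
import sqlite3

# SQLite stands in for SQL Server here (an assumption); only result sets are
# compared, not query plans or timings. The sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (a_pk INTEGER PRIMARY KEY, a_id INT NOT NULL UNIQUE, a_z INT);
CREATE TABLE table_b (b_pk INTEGER PRIMARY KEY, b_id INT NOT NULL UNIQUE, b_z INT);
CREATE TABLE table_c (c_pk INTEGER PRIMARY KEY, c_a_id INT, c_b_id INT);
INSERT INTO table_a VALUES (1, 10, 1), (2, 11, 0);
INSERT INTO table_b VALUES (1, 20, 1), (2, 21, 0);
INSERT INTO table_c VALUES
  (1, 10, 20), (2, 11, 21), (3, NULL, 20), (4, 10, NULL), (5, NULL, NULL);
""")

QUERIES = {
    # Query 1 from the question.
    "in": """
        SELECT c_pk FROM table_c
        WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
           OR c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1)
        ORDER BY c_pk""",
    # Query 2 from the question.
    "left_join": """
        SELECT c_pk FROM table_c
        LEFT JOIN (SELECT b_id FROM table_b WHERE b_z = 1) AS b ON c_b_id = b_id
        LEFT JOIN (SELECT a_id FROM table_a WHERE a_z = 1) AS a ON c_a_id = a_id
        WHERE b_id IS NOT NULL OR a_id IS NOT NULL
        ORDER BY c_pk""",
    # The EXISTS form suggested in this answer.
    "exists": """
        SELECT c_pk FROM table_c
        WHERE EXISTS (SELECT 1 FROM table_b WHERE b_z = 1 AND c_b_id = b_id)
           OR EXISTS (SELECT 1 FROM table_a WHERE a_z = 1 AND c_a_id = a_id)
        ORDER BY c_pk""",
}

results = {name: [row[0] for row in conn.execute(sql)]
           for name, sql in QUERIES.items()}
print(results)
```

On this sample all three forms select the same c_pk values. If a_id or b_id were not unique, the LEFT JOIN form could return a table_c row more than once, which is where the forms would diverge.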

Since you can't change the query, at least you can improve the query's environment.
Highlight your query, right-click on it in SSMS and select "Analyze
Query in Database Engine Tuning Advisor."
Run the analysis to find out if you need any additional indexes or
statistics built.
Heed SQL Server's advice.

Related

selecting keys which are not in another table takes forever

I have a query like this:
select key, name from localtab where key not in (select key from remotetab);
The query takes forever, and I don't understand why.
localtab is a local table, and remotetab is a remote table on another server. key is an int column which has a unique index in both tables. When I query both tables separately, it takes just a few seconds.
Linked servers have terrible performance. Get the data you need onto the local server and do the majority of the hard work and processing there, instead of mixing local and remote tables in a single query.
Select remotetab into a temp table:
select [key] into #remote_made_local from remotetab
Use the #temp table in the where clause filtering, and use exists instead of in for better performance:
select a.[key], a.name from localtab a where not exists (select 1 from #remote_made_local b where b.[key] = a.[key] )
rather than:
select [key], name from localtab where [key] not in (select [key] from #remote_made_local)
There is also a solution without using temporary tables.
By using a left join instead of not in (select ...), you can massively speed up the query. Like this:
select l.[key], l.name
from localtab l left join remotetab r on l.[key] = r.[key]
where r.[key] is null ;
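Apart from speed, the three forms above are not interchangeable once the inner key column can hold a NULL (in SQL Server a unique index still allows one NULL). A minimal sketch of the difference, using Python's sqlite3 module as a stand-in for SQL Server and an ordinary table named remote_made_local in place of the #temp table (both are assumptions for the sake of a runnable example):

```python
import sqlite3

# SQLite stands in for SQL Server (an assumption); the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE localtab ([key] INT, name TEXT);
CREATE TABLE remote_made_local ([key] INT);
INSERT INTO localtab VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO remote_made_local VALUES (1), (NULL);  -- note the NULL key
""")

def keys(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# NOT IN: a NULL in the subquery makes the predicate unknown for every
# non-matching row, so nothing is returned.
not_in = keys("""
    SELECT [key] FROM localtab
    WHERE [key] NOT IN (SELECT [key] FROM remote_made_local)
    ORDER BY [key]""")

# NOT EXISTS: the NULL row simply never matches.
not_exists = keys("""
    SELECT a.[key] FROM localtab a
    WHERE NOT EXISTS (SELECT 1 FROM remote_made_local b WHERE b.[key] = a.[key])
    ORDER BY a.[key]""")

# LEFT JOIN ... IS NULL: behaves like NOT EXISTS here.
left_join = keys("""
    SELECT l.[key] FROM localtab l
    LEFT JOIN remote_made_local r ON l.[key] = r.[key]
    WHERE r.[key] IS NULL
    ORDER BY l.[key]""")

print(not_in, not_exists, left_join)
```

If the key column is declared NOT NULL on both sides, all three forms return the same rows and the choice is purely about performance.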

Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query on. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B), containing the record id from the Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which should I choose?
SELECT *
FROM TableA
WHERE record_id IN
(SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
SELECT a.*
FROM TableA a
INNER JOIN TableB b
ON a.record_id = b.record_id
AND b.col1 IS NOT NULL
AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I hadn't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS
(SELECT b.record_id, b.col1, b.col2
FROM TableB b
WHERE b.col1 IS NOT NULL
AND b.col2 IS NOT NULL --if the field isn't NULL, it might be quicker to do <> ''
)
SELECT a.record_id, a.identifyColumnsNeededExplicitely
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice, the execution plan will do whatever it likes depending on your current indexes, clustered index, foreign keys, constraints and table statistics (number of rows, general content of your rows, ...). Any analysis should be done case by case, and what is true for two tables may not be true for two others.
Theoretically:
Without any index, the first one should be the best, since it will be optimized into 1 table scan on TableB, 2 constant scans on TableB and 1 table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id, or an index on both columns, the second should be faster, since it will do an index scan and 2 constant scans.
In rare cases, it could be the third one, depending on TableB's statistics, but it will not be far from number 2, since number 3 will scan all of TableB.
In even rarer cases, none of the three.
What I'm trying to say is: "Since we have neither your tables nor your rows, open SQL Server Management Studio, turn the statistics on (for example SET STATISTICS IO ON and SET STATISTICS TIME ON) and try it yourself."
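Before timing the candidates, it can be worth confirming that they return the same rows. The sketch below does that for the three queries from the question on a few made-up rows. It runs on Python's sqlite3 module as a stand-in for SQL Server (an assumption), so it checks results only; plans and timings have to be measured on the real server and data.

```python
import sqlite3

# SQLite stands in for SQL Server (an assumption); the rows are made up and
# keep the one-to-one relationship between TableA and TableB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableA (record_id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE TableB (record_id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
INSERT INTO TableA VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO TableB VALUES
  (1, 'x', 'y'), (2, NULL, 'y'), (3, 'x', NULL), (4, 'p', 'q');
""")

CANDIDATES = {
    "in_subquery": """
        SELECT record_id FROM TableA
        WHERE record_id IN
          (SELECT record_id FROM TableB
           WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
        ORDER BY record_id""",
    "join_where": """
        SELECT a.record_id FROM TableA a
        INNER JOIN TableB b ON a.record_id = b.record_id
        WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
        ORDER BY a.record_id""",
    "join_on": """
        SELECT a.record_id FROM TableA a
        INNER JOIN TableB b
          ON a.record_id = b.record_id
         AND b.col1 IS NOT NULL
         AND b.col2 IS NOT NULL
        ORDER BY a.record_id""",
}

final_ids = {name: [row[0] for row in conn.execute(sql)]
             for name, sql in CANDIDATES.items()}
print(final_ids)
```

For an INNER JOIN, moving the filter between WHERE and ON does not change the result, which is why the second and third candidates agree here.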

SQL Server differences between same table on different databases

I have the same table in two different databases. It has the same columns, primary keys, etc. but data in this table may differ from one database to another. So I am trying to get the differences. For example:
Database A: Table_A
Database B: Table_B
Table_A and Table_B have the Id1 and Id2 fields as primary key.
Table_A and Table_B are exactly the same in structure but may contain different data. I would like to obtain the differences, that is, the data that is in Table_A but not in Table_B, and insert it into Table_B or, if it cannot be inserted automatically, generate a list of inserts.
To obtain the data that is in Table_A and not in Table_B and vice versa, I do the following:
SELECT a.*, b.*
FROM Table_A a
FULL JOIN Table_B b ON (a.Id1=b.Id1 and a.Id2=b.Id2)
WHERE a.Id1 IS NULL OR b.Id1 IS NULL or a.Id2 IS NULL OR b.Id2 IS NULL
Then I use Excel to generate the inserts to run against Table_B.
Is that correct? or is there any better way to do this?
For your scenario I would go with the MERGE statement:
MERGE INTO Table_B AS Trg
USING (SELECT ID1, ID2, YourDataColumn FROM Table_A) AS Src
ON Trg.ID1 = Src.ID1 AND Trg.ID2 = Src.ID2
WHEN NOT MATCHED BY TARGET THEN
INSERT (ID1, ID2, YourDataColumn )
VALUES (Src.ID1, Src.ID2, Src.YourDataColumn );
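The same one-way copy can also be written without MERGE, as a plain INSERT ... SELECT guarded by NOT EXISTS. This is a swapped-in alternative, not the answer's method. The sketch runs on Python's sqlite3 module as a stand-in for SQL Server (an assumption), with made-up rows and the placeholder column YourDataColumn taken from the MERGE example above.

```python
import sqlite3

# SQLite stands in for SQL Server (an assumption); the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table_A (Id1 INT, Id2 INT, YourDataColumn TEXT, PRIMARY KEY (Id1, Id2));
CREATE TABLE Table_B (Id1 INT, Id2 INT, YourDataColumn TEXT, PRIMARY KEY (Id1, Id2));
INSERT INTO Table_A VALUES (1, 1, 'a'), (1, 2, 'b'), (2, 1, 'c');
INSERT INTO Table_B VALUES (1, 1, 'a'), (3, 3, 'z');
""")

def count_b():
    return conn.execute("SELECT COUNT(*) FROM Table_B").fetchone()[0]

before = count_b()

# Copy every Table_A row whose (Id1, Id2) key is missing from Table_B.
conn.execute("""
    INSERT INTO Table_B (Id1, Id2, YourDataColumn)
    SELECT a.Id1, a.Id2, a.YourDataColumn
    FROM Table_A AS a
    WHERE NOT EXISTS (
        SELECT 1 FROM Table_B AS b
        WHERE b.Id1 = a.Id1 AND b.Id2 = a.Id2
    )
""")

inserted = count_b() - before
rows_b = conn.execute(
    "SELECT Id1, Id2, YourDataColumn FROM Table_B ORDER BY Id1, Id2"
).fetchall()
print(inserted, rows_b)
```

Rows that exist only in Table_B are left untouched, matching the WHEN NOT MATCHED BY TARGET branch of the MERGE.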
It would be great to know which database you're on.
Oracle SQL Developer has tooling for this: http://www.thatjeffsmith.com/archive/2012/09/sql-developer-database-diff-compare-objects-from-multiple-schemas/

Nonclustered index functionality relative to clustered index seek

The question is quite simple, but we've had so many issues with index/statistics updates not always resulting in the proper new execution plans in low-load environments that I need to check this with you to be sure.
Say that you have the following tables:
/*
TABLES:
TABLE_A (PK_ID INT, [random columns], B_ID INT (INDEXED, and references TABLE_B.PK_ID))
TABLE_B (PK_ID INT, [random columns], C_ID INT (INDEXED, and references TABLE_C.PK_ID))
TABLE_C (PK_ID INT, [random columns])
*/
SELECT *
FROM TABLE_A A
JOIN TABLE_B B ON B.PK_ID = A.B_ID
JOIN TABLE_C C ON C.PK_ID = B.C_ID
WHERE A.randcolumn1 = 'asd' AND B.randcolumn2 <> 5
Now, since B is joined to A with its clustered PK column, shouldn't that mean that the index on B.C_ID will not be used as the information is already returned through the B.PK_ID clustered index? In fact, is it not true that the index on B.C_ID will never be used unless the query specifically targets the ID values on that index?
This may seem like a simple and even stupid question, but I want to make absolutely sure I'm getting this right. I'm thinking of making adjustments on our indexing, since we have a lot of unused indexes which have been inherited from an old datamodel and they're taking up quite a bit of space in a DB this size. And experience has shown that we cannot fully trust the execution plans on any environment apart from the production thanks to its extreme load compared to testing environments, which makes it difficult to test this out reliably.
Thanks!
The query optimizer is free to do as it pleases. It could execute the second join by scanning the C table, and for each row, looking up the matching row in B. The index you describe would help with that lookup.
SQL Server provides statistics to tell you if an index is actually used:
select db_name(ius.database_id) as Db
, object_name(ius.object_id) as [Table]
, i.name as [Index] -- report each index separately
, max(ius.last_user_lookup) as LastLookup
, max(ius.last_user_scan) as LastScan
, max(ius.last_user_seek) as LastSeek
, max(ius.last_user_update) as LastUpdate
from sys.dm_db_index_usage_stats as ius
join sys.indexes as i
on i.[object_id] = ius.[object_id]
and i.index_id = ius.index_id
where ius.[database_id] = db_id()
and ius.[object_id] = object_id('YourTableName')
group by
ius.database_id
, ius.object_id
, i.name
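If an index hasn't been used for more than 2 months, it is usually safe to drop it. Keep in mind that these counters are reset when the SQL Server service restarts, so make sure the server has actually been up that long.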

How to avoid table scan and index scan for huge tables

I am using MSSQL 2008 R2. I have a table with a huge number of rows (the test table).
I have the following SQL code; please suggest where I can use index hints, FORCESEEK or any other means to improve performance.
Indexes
1. non-clustered - idx_id (id)
2. non-clustered - idx_name (name)
SELECT DISTINCT
p.id,
p.name
FROM
test p
LEFT OUTER JOIN
(
SELECT
e.id
FROM
test e
INNER JOIN
(
SELECT
c.id
FROM
test c
GROUP BY
c.id
HAVING
COUNT(1) > 1
) f
ON e.id = f.id
WHERE
e.name = 'test_name'
) m
ON p.id = m.id
WHERE
m.id is null
Prerequisite: the table has a primary key (called PrimaryKey below).
select distinct
p.id
, p.name
from test p
where not exists (
SELECT TOP(1)
1
FROM test e
WHERE e.PrimaryKey <> p.PrimaryKey
AND e.id = p.id
AND 'test_name' IN (e.name, p.name)
)
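A rewrite like this is only useful if it returns the same rows as the original, so a small comparison on made-up data is a reasonable first step. The sketch below uses Python's sqlite3 module as a stand-in for SQL Server (an assumption), drops TOP(1), which SQLite does not support and which does not affect an EXISTS test, and uses the hypothetical column name PrimaryKey from the rewrite.

```python
import sqlite3

# SQLite stands in for SQL Server (an assumption); the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (PrimaryKey INTEGER PRIMARY KEY, id INT, name TEXT);
INSERT INTO test VALUES
  (1, 1, 'test_name'), (2, 1, 'other'),   -- duplicated id containing test_name
  (3, 2, 'test_name'),                    -- single row with test_name
  (4, 3, 'x'), (5, 3, 'y'),               -- duplicated id without test_name
  (6, 4, 'z');
""")

ORIGINAL = """
    SELECT DISTINCT p.id, p.name
    FROM test p
    LEFT OUTER JOIN (
        SELECT e.id
        FROM test e
        INNER JOIN (
            SELECT c.id FROM test c GROUP BY c.id HAVING COUNT(1) > 1
        ) f ON e.id = f.id
        WHERE e.name = 'test_name'
    ) m ON p.id = m.id
    WHERE m.id IS NULL
    ORDER BY p.id, p.name"""

REWRITE = """
    SELECT DISTINCT p.id, p.name
    FROM test p
    WHERE NOT EXISTS (
        SELECT 1
        FROM test e
        WHERE e.PrimaryKey <> p.PrimaryKey
          AND e.id = p.id
          AND 'test_name' IN (e.name, p.name)
    )
    ORDER BY p.id, p.name"""

original_rows = conn.execute(ORIGINAL).fetchall()
rewrite_rows = conn.execute(REWRITE).fetchall()
print(original_rows == rewrite_rows, rewrite_rows)
```

Agreement on a sample like this is not a proof of equivalence; rows with NULL names, for instance, are not covered here.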
How many columns does your table contain? If there are only these two columns, it makes no sense to add a nonclustered index. You should create a CLUSTERED index on the ID column, and that's it; you'll see a performance increase.
If you have many columns, consider two options:
Create clustered index on NAME column and nonclustered index on ID column.
Create nonclustered index on ID column, and INCLUDE NAME column (you'll create covering index that way)
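The idea behind the second option (a covering index) can be illustrated with a small, runnable sketch. It uses Python's sqlite3 module as a stand-in for SQL Server (an assumption). SQLite has no INCLUDE clause, so a composite index on (id, name) plays the role of the covering index, and the pk and payload columns are hypothetical extras.

```python
import sqlite3

# SQLite stands in for SQL Server (an assumption). A composite index on
# (id, name) substitutes for "index on id INCLUDE (name)".
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (pk INTEGER PRIMARY KEY, id INT, name TEXT, payload TEXT);
CREATE INDEX idx_id_name ON test (id, name);
""")

def plan(sql):
    """Return the human-readable detail of SQLite's query plan for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Every referenced column lives in the index: no table access is needed.
covered = plan("SELECT id, name FROM test WHERE id = 42")

# payload is not in the index, so each match needs an extra table lookup.
not_covered = plan("SELECT id, name, payload FROM test WHERE id = 42")

print(covered)
print(not_covered)
```

The first plan is reported as using a covering index, while the second uses the same index but still has to visit the table for payload. In SQL Server the equivalent observation would be a Key Lookup operator appearing or disappearing in the execution plan.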
Generally speaking, relational databases (being relational) are written in such a way to optimize join statements. When using a "JOIN" clause with "ON" criteria, the database engine can create an optimized execution plan that takes the table structure, indexes, etc. into account. When joining on a sub-select, sometimes the same optimizing factors are not available, or are not taken into account the same way. It depends on your schema, but it is a good rule of thumb to assume that a standard join with an "on" clause is going to be more efficient than a join on a sub-select.
Your schema is pretty vague, so I am not even sure that you need the joins, but if you do, you should try performing the joins directly with "on" criteria.
