Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B) containing the record id from Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which of the following should I choose?
Option 1:
SELECT *
FROM TableA
WHERE record_id IN
    (SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
Option 2:
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
Option 3:
SELECT a.*
FROM TableA a
INNER JOIN TableB b
    ON a.record_id = b.record_id
    AND b.col1 IS NOT NULL
    AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I haven't thought of, please share. I'd also be very curious to know why one query is faster than the others.

WITH cte AS
(SELECT b.record_id, b.col1, b.col2
 FROM TableB b
 WHERE col1 IS NOT NULL
   AND col2 IS NOT NULL) -- if "empty" is stored as '' rather than NULL, it might be quicker to test <> ''
SELECT a.record_id, a.identifyColumnsNeededExplicitly
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id

In practice the execution plan will do whatever it likes depending on your current indexes / clustered index / foreign keys / constraints / table statistics (i.e. number of rows, general content of your rows, ...). Any analysis should be done case by case, and what's true for two tables may not be true for two others.
Theoretically:
Without any index, the first one should be the best, since it will optimize the operations into 1 table scan on TableB, 2 constant scans on TableB, and 1 table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id, or an index on both columns, the second should be faster, since it will do an index scan and 2 constant scans.
In rare cases it could be the 3rd one, depending on TableB's statistics, but it won't be far from number 2, since number 3 will scan all of TableB.
In even rarer cases, none of the three.
What I'm trying to say is: since we have neither your tables nor your rows, open SQL Server Management Studio, turn the statistics on, and try it yourself.
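A minimal harness for that comparison, assuming the TableA/TableB names from the question: enable the session statistics, run each candidate, and compare the figures reported in the Messages tab.
SET STATISTICS IO ON;   -- reports logical reads per table
SET STATISTICS TIME ON; -- reports CPU and elapsed time

-- Candidate 1: IN with a subquery
SELECT *
FROM TableA
WHERE record_id IN
    (SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL);

-- Candidate 2: join with the filter in the WHERE clause
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
Comparing the logical reads of each run, together with the actual execution plan, usually settles the question faster than reasoning from first principles.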

Related

Can joins effectively ignore field indexes if they are a constant?

It is much easier to explain this with an example.
Table A has PK on (store,line).
Table B has PK on (id,store,line).
[id] is int, [store] is nvarchar(100) and [line] is int in both cases.
If I run:
select *
from A inner join B
on A.store=B.store and A.line=B.line
where B.id=0
will the engine be able to make a fast (I'm thinking merge) join? Or would it help to add a dummy column id, valued 0, to A?
Your statement will work, but if you write it like this the optimizer will be more effective:
select *
from A
inner join B on A.store=B.store and A.line=B.line and B.id=0
Here it is able to exclude items where B.id does not equal zero before it does the merge. Depending on table size, topology, etc., this could be quite significant.
For example, consider the case where you have 50 million rows of table B shared across 5 nodes and 1 node for table A: with your code all records would have to be moved to the node with the A table, while with the code above only the records that have id = 0 would need to be moved.
This can be very non-intuitive when A is a small table (small tables are often on only one node).
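On a single-instance SQL Server, a related option worth testing is a filtered index (available since SQL Server 2008) that bakes the constant predicate into the index itself; the index name here is purely illustrative.
-- Illustrative sketch: the WHERE clause pre-applies the constant predicate,
-- so a join against this index only ever reads id = 0 rows.
CREATE NONCLUSTERED INDEX IX_B_store_line_id0
ON B (store, line)
WHERE id = 0;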

How to get multiple sets of distinct values

This is not about distinct combinations of values (Select distinct col1, col2 from table)
I have a table with a newly loaded csv file.
Some columns are linked to foreign key dimensions but the values in a given column may not exist in the reference tables.
My desire is to find all the values in each column that do not exist in the reference tables, in such a way as to minimize the number of scans of our source table.
My current approach consumes the output of a bunch of queries like the following:
SELECT DISTINCT col2 FROM table WHERE col2 NOT IN (SELECT val FROM DimCol2)
SELECT DISTINCT col3 FROM table WHERE col3 NOT IN (SELECT val FROM DimCol3)
however, for N columns, this results in N table scans.
Table is up to 10M rows and columns range in cardinality from 5 through to 5M, but almost all values are already present in the dim tables (>99%).
DimColN ranges in size from 5 values to 50M values, and is well indexed.
The csv is loaded into the table via SSIS, so doing some pre-processing inside SSIS is possible, but I would have to avoid issuing a SQL query for each row.
The SSIS server does not have enough spare RAM to cache all the dim tables.
What about using a LEFT JOIN and checking where the result of the join is NULL, meaning the value doesn't exist in DimCol2:
SELECT DISTINCT a.Col2
FROM table a
LEFT JOIN DimCol2 b ON a.Col2 = b.val
WHERE b.val IS NULL
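If the N scans are the real pain point, one direction worth testing (a sketch only, and it assumes col2 and col3 share a compatible data type, which UNPIVOT requires) is to unpivot the columns so the staging table is scanned once, then anti-join each slice to its own dimension:
-- One scan of the staging table feeds the distinct-value list for every column.
-- Note that UNPIVOT silently drops NULLs, which is usually what you want here.
WITH vals AS (
    SELECT u.colname, u.val
    FROM (SELECT col2, col3 FROM [table]) AS s
    UNPIVOT (val FOR colname IN (col2, col3)) AS u
    GROUP BY u.colname, u.val
)
SELECT v.colname, v.val
FROM vals v
WHERE (v.colname = 'col2'
       AND NOT EXISTS (SELECT 1 FROM DimCol2 d WHERE d.val = v.val))
   OR (v.colname = 'col3'
       AND NOT EXISTS (SELECT 1 FROM DimCol3 d WHERE d.val = v.val));
Whether this beats N separate scans depends on the plan, but with more than 99% of values already present the anti-joins discard almost everything, so the single scan tends to dominate the cost.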

Searching for the unique key (not in meta)

Which standard SQL Server tool can search for a unique key in a table's data (rather than in its metadata declaration)?
P.S. I am thinking of writing such a script myself. Maybe you could point me to a snippet for combinatorics in T-SQL, e.g. for generating all combinations of n columns taken 1..n at a time?
P.P.S. About the problem's complexity, for those who do not see it: it is important that we do not need to analyze the whole table to dismiss the hypothesis that a given pair of columns is the unique key. With real-world, 'report-like', sorted data, even after analyzing the first two rows it should be possible to eliminate many column combinations. So I feel such an algorithm should have a 'before full table compare' phase. The open question is what portion of the data to choose for this phase. The best candidate I can think of is the page: if the data is unique within the page, we can test uniqueness on the whole table; if it is not unique (on the page), we move on to the next column set.
select t1.col, count(*)
from table t1
join table t2
  on t1.col = t2.col
group by t1.col
having count(*) > 1
If zero rows are returned, the column is unique.
For more than one column:
select t1.cola, t1.colb, count(*)
from table t1
join table t2
  on t1.cola = t2.cola
 and t1.colb = t2.colb
group by t1.cola, t1.colb
having count(*) > 1
(The self-join is not strictly needed; grouping the single table by the candidate columns and checking HAVING COUNT(*) > 1 gives the same verdict.)
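For the "script it yourself" route, here is a minimal dynamic-SQL sketch (the table name dbo.MyTable is a placeholder) that probes every single column for uniqueness; extending it to multi-column combinations is the combinatorial part the question asks about.
-- Build one SELECT per column: a column is a unique-key candidate when the
-- number of distinct values equals the number of rows.
-- Caveat: COUNT_BIG(DISTINCT col) ignores NULLs, so columns containing NULLs
-- can be reported unique even when they are not usable as a key.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
    + N' UNION ALL SELECT N''' + c.name + N''' AS candidate_col,'
    + N' CASE WHEN COUNT_BIG(DISTINCT ' + QUOTENAME(c.name) + N') = COUNT_BIG(*)'
    + N' THEN 1 ELSE 0 END AS is_unique FROM dbo.MyTable'
FROM sys.columns AS c
WHERE c.object_id = OBJECT_ID(N'dbo.MyTable');

SET @sql = STUFF(@sql, 1, 11, N'');  -- strip the leading ' UNION ALL '
EXEC sys.sp_executesql @sql;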

What's the difference between these T-SQL queries using OR?

I use Microsoft SQL Server 2008 (SP1, x64). I have two queries that do the same thing, or so I think, but they have completely different query plans and performance.
Query 1:
SELECT c_pk
FROM table_c
WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
OR c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1)
Query 2:
SELECT c_pk
FROM table_c
LEFT JOIN (SELECT b_id FROM table_b WHERE b_z = 1) AS b ON c_b_id = b_id
LEFT JOIN (SELECT a_id FROM table_a WHERE a_z = 1) AS a ON c_a_id = a_id
WHERE b_id IS NOT NULL
OR a_id IS NOT NULL
Query 1 is fast as I would expect, whereas query 2 is very slow. The query plans look quite different.
I would like query 2 to be as fast as query 1. I have software that uses query 2, and I cannot change that into query 1. I can change the database.
Some questions:
Why are the query plans different?
Can I "teach" SQL Server somehow that query 2 is equal to query 1?
All tables have (clustered) primary keys and proper indexes on all columns:
CREATE TABLE table_a (
a_pk int NOT NULL PRIMARY KEY,
a_id int NOT NULL UNIQUE,
a_z int
)
GO
CREATE INDEX IX_table_a_z ON table_a (a_z)
GO
CREATE TABLE table_b (
b_pk int NOT NULL PRIMARY KEY,
b_id int NOT NULL UNIQUE,
b_z int
)
GO
CREATE INDEX IX_table_b_z ON table_b (b_z)
GO
CREATE TABLE table_c (
c_pk int NOT NULL PRIMARY KEY,
c_a_id int,
c_b_id int
)
GO
CREATE INDEX IX_table_c_a_id ON table_c (c_a_id)
GO
CREATE INDEX IX_table_c_b_id ON table_c (c_b_id)
GO
The tables are not modified after being filled initially. I'm the only one querying them. They contain millions of records (table_a: 5M, table_b: 4M, table_c: 12M), but using only 1% gives similar results.
Edit: I tried adding FOREIGN KEYs for c_a_id and c_b_id, but that only made query 1 slower...
I hope someone can have a look at the query plans and explain the difference.
Joins are slower, let me say, by design. The first query uses a sub-query (cacheable) to filter records, so it produces less data (and fewer accesses to each table).
Did you read these?
http://www.sql-server-performance.com/2006/tuning-joins/
http://blogs.msdn.com/b/craigfr/archive/2006/12/04/semi-join-transformation.aspx
What I mean is that with IN the database can apply better optimizations, like removing duplicates, stopping at the first match, and similar (and these are from school memories, so I'm sure it does much better by now). So I guess the question isn't why the query plans are different, but how smart and how deep the optimizations can go.
You are comparing non-equivalent queries, and you are using LEFT JOIN in quite an unusual way.
Generally, if your intention was to select all entries in table_c which have linked records in either table_a or table_b, you should use an EXISTS statement:
SELECT c_pk
FROM table_c
WHERE Exists(
SELECT 1
FROM table_b
WHERE b_z = 1 and c_b_id = b_id
) OR Exists(
SELECT 1
FROM table_a
WHERE a_z = 1 and c_a_id = a_id
)
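Another rewrite that is sometimes worth testing when an OR keeps the optimizer from using one index per branch (untested against these tables, so treat it as a sketch): split the OR into two semi-joins and combine them.
-- UNION (not UNION ALL) de-duplicates rows of table_c that match both branches.
SELECT c_pk
FROM table_c
WHERE c_b_id IN (SELECT b_id FROM table_b WHERE b_z = 1)
UNION
SELECT c_pk
FROM table_c
WHERE c_a_id IN (SELECT a_id FROM table_a WHERE a_z = 1);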
Since you can't change the query, at least you can improve the query's environment:
1. Highlight your query, right-click on it in SSMS and select "Analyze Query in Database Engine Tuning Advisor."
2. Run the analysis to find out whether you need any additional indexes or statistics built.
3. Heed SQL Server's advice.

How can I "subtract" one table from another?

I have a master table A, with ~9 million rows. Another table B (same structure) has ~28K rows from table A. What would be the best way to remove all contents of B from table A?
The combination of all columns (~10) is unique; there is nothing more in the form of a unique key.
If you have sufficient rights you can create a new table and rename that one to A. In SQL Server this is done with SELECT ... INTO and EXCEPT (CREATE TABLE ... AS and MINUS are the Oracle equivalents):
SELECT *
INTO TEMP_A
FROM A
EXCEPT
SELECT *
FROM B
This should perform pretty well.
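The rename step itself can be done with sp_rename; a sketch of the swap, keeping the old table around until the result is verified (table names as above):
EXEC sp_rename 'A', 'A_old';      -- move the original out of the way
EXEC sp_rename 'TEMP_A', 'A';     -- promote the subtracted copy
-- DROP TABLE A_old;              -- only once the new A has been verified
Note that SELECT ... INTO copies data only, not indexes, constraints, or triggers, so those would need to be recreated on the new table.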
DELETE FROM TableA WHERE ID IN (SELECT ID FROM TableB)
Should work. Might take a while though.
One way: just list out all the columns.
delete a
from tableA a
where exists (select 1 from tableB b
              where b.Col1 = a.Col1
                and b.Col2 = a.Col2
                and b.Col3 = a.Col3
                and b.Col4 = a.Col4)  -- ... and so on through all ~10 columns
Delete t2
from t1
inner join t2
on t1.col1 = t2.col1
and t1.col2 = t2.col2
and t1.col3 = t2.col3
and t1.col4 = t2.col4
and t1.col5 = t2.col5
and t1.col6 = t2.col6
and t1.col7 = t2.col7
and t1.col8 = t2.col8
and t1.col9 = t2.col9
and t1.col10 = t2.col10
This is likely to be very slow, as you would need every column indexed, which is highly unlikely in an environment where a table this size has no primary key, so do it during off-peak hours. What possessed you to have a table with 9 million records and no primary key?
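If the single big delete proves too heavy on locks and log, a common mitigation (a sketch, with placeholder names and an arbitrary 5000-row batch size) is to delete in chunks:
WHILE 1 = 1
BEGIN
    -- Each iteration removes at most 5000 matching rows, keeping
    -- transaction-log growth and lock escalation in check.
    DELETE TOP (5000) t1
    FROM TableA t1
    INNER JOIN TableB t2
        ON  t1.col1 = t2.col1
        AND t1.col2 = t2.col2;  -- extend with the remaining join columns as above

    IF @@ROWCOUNT = 0 BREAK;    -- nothing left to delete
END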
If this is something you'll have to do on a regular basis, the first choice should be to try to improve the database design (looking for primary keys, trying to get the "join" condition to be on as few columns as possible).
If that is not possible, the clear second option is to figure out the "selectivity" of each of the columns (i.e. how many "different" values each column has: 'name' would be more selective than 'address country', which would be more selective than 'male/female').
The general type of statement I'd suggest would be like this:
Delete from tableA
where exists (select * from tableB
              where tableA.colx1 = tableB.colx1
                and tableA.colx2 = tableB.colx2
                -- ... through ...
                and tableA.colx10 = tableB.colx10)
The idea is to list the columns in order of selectivity and build an index on colx1, colx2, etc. on tableB. The exact number of columns to index would be the result of some trial and measurement. (Offset the time for building the index on tableB against the improved time of the delete statement.)
If this is just a one time operation, I'd just pick one of the slow methods outlined above. It's probably not worth the effort to think too much about this when you can just start a statement before going home ...
Is there a key value (or values) that can be used?
Something like:
DELETE a
FROM tableA a
INNER JOIN tableB b
on b.id = a.id
