It is much easier to explain this with an example.
Table A has PK on (store,line).
Table B has PK on (id,store,line).
[id] is int, [store] is nvarchar(100) and [line] is int in both cases.
If I run:
select *
from A inner join B
on A.store=B.store and A.line=B.line
where B.id=0
will the engine be able to make a fast (I'm thinking merge) join? Or would it help to add a dummy column id with constant value 0 to A?
Your statement will work, but the optimizer will be more effective if you write it like this:
select *
from A
inner join B
    on A.store = B.store
    and A.line = B.line
    and B.id = 0
Here it is able to exclude items where B.id does not equal zero before it does the merge. Depending on table size, topology, etc., this can be quite significant.
For example, consider the case where you have 50 million rows spread across 5 nodes for table B and 1 node for table A. With your code, all records would have to be moved to the node holding table A, while with the code above only the records with id = 0 would need to be moved.
This can be very non-intuitive when A is a small table (small tables are often stored on a single node).
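For a single-node engine the two placements are semantically identical for an inner join; only the plan (and, on a distributed engine, the data movement) can differ. A small sqlite3 sketch with invented rows, confirming that both forms return the same result set:

```python
import sqlite3

# Miniature versions of tables A and B (contents invented for the example).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (store TEXT, line INTEGER, PRIMARY KEY (store, line));
    CREATE TABLE B (id INTEGER, store TEXT, line INTEGER,
                    PRIMARY KEY (id, store, line));
    INSERT INTO A VALUES ('s1', 1), ('s1', 2), ('s2', 1);
    INSERT INTO B VALUES (0, 's1', 1), (0, 's2', 1), (1, 's1', 1), (1, 's1', 2);
""")

# Filter in the WHERE clause (the question's form).
where_form = con.execute("""
    SELECT A.store, A.line, B.id
    FROM A INNER JOIN B ON A.store = B.store AND A.line = B.line
    WHERE B.id = 0
    ORDER BY A.store, A.line
""").fetchall()

# Filter in the ON clause (the answer's form).
on_form = con.execute("""
    SELECT A.store, A.line, B.id
    FROM A INNER JOIN B ON A.store = B.store AND A.line = B.line AND B.id = 0
    ORDER BY A.store, A.line
""").fetchall()

print(where_form == on_form)  # True: identical result sets
```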
I have a table (let's call it A) in SQL Server 2016 that I want to query. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B) containing the record id from Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which should I choose?
SELECT *
FROM TableA
WHERE record_id IN
(SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
SELECT a.*
FROM TableA a
INNER JOIN TableB b
ON a.record_id = b.record_id
AND b.col1 IS NOT NULL
AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I hadn't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS
(SELECT b.record_id, b.col1, b.col2
 FROM TableB b
 WHERE b.col1 IS NOT NULL
   AND b.col2 IS NOT NULL) -- if the field can't be NULL, it might be quicker to test <> ''
SELECT a.record_id, a.identifyColumnsNeededExplicitly
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice the execution plan will do whatever it likes depending on your current indexes / clustered index / foreign keys / constraints / table statistics (i.e. number of rows / general content of your rows / ...). Any analysis should be done case by case, and what's true for 2 tables may not be for 2 other tables.
Theoretically:
Without any index, the first one should be the best, since it will optimize the operations into 1 table scan on TableB, 2 constant scans on TableB and 1 table scan on TableA.
With a foreign key on TableA.record_id referencing TableB.record_id OR an index on both columns, the second should be faster, since it will do an index scan and 2 constant scans.
In rare cases it could be the 3rd one, depending on TableB's stats. But not far from number 2, since number 3 will scan all of TableB.
In even rarer case, neither of the 3.
What I'm trying to say is: "Since we have neither your tables nor your rows, open SQL Server Management Studio, turn statistics on, and try it yourself."
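While relative speed must be measured on your real data, the three candidate queries at least agree on results. A toy sqlite3 reproduction with invented rows:

```python
import sqlite3

# Hypothetical miniature TableA/TableB; only record 1 has both columns non-NULL.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableA (record_id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE TableB (record_id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT);
    INSERT INTO TableA VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO TableB VALUES (1, 'a', 'b'), (2, NULL, 'b'), (3, 'a', NULL);
""")

# Form 1: IN with a subquery.
q1 = con.execute("""
    SELECT * FROM TableA
    WHERE record_id IN (SELECT record_id FROM TableB
                        WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
""").fetchall()

# Form 2: join with the filter in the WHERE clause.
q2 = con.execute("""
    SELECT a.* FROM TableA a
    INNER JOIN TableB b ON a.record_id = b.record_id
    WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
""").fetchall()

# Form 3: join with the filter in the ON clause.
q3 = con.execute("""
    SELECT a.* FROM TableA a
    INNER JOIN TableB b ON a.record_id = b.record_id
        AND b.col1 IS NOT NULL AND b.col2 IS NOT NULL
""").fetchall()

print(q1 == q2 == q3)  # True: all three return only record 1
```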
With which standard SQL Server tool is it possible to search for a unique key in a table's data (rather than in its metadata declaration)?
P.S. I am thinking of writing such a script myself. Maybe you could point me to a snippet for combinatorics in T-SQL? E.g. for generating all combinations of n columns taken 1..n at a time?
P.P.S. About the problem's complexity, for those who do not see it: it is important that we do not need to analyze the whole table to dismiss the hypothesis that a given pair of columns is the unique key. With real-world, 'report-like', sorted data, even after analyzing the first two rows it should be possible to eliminate many column combinations. So such an algorithm should have a 'before full table compare' phase. The question is what portion of the data to use for that phase; the best candidate I can think of is the 'page'. If the data is unique within the page, we test uniqueness on the whole table; if it is not unique (on the page), we go on to the next column set.
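The combinatorics plus the page-first idea can be sketched in Python rather than T-SQL: generate all column combinations of size 1..n and test each for uniqueness, checking a small "page" of rows first so most candidates are dismissed cheaply. Table and column names here are invented for the example:

```python
import sqlite3
from itertools import combinations

# A made-up 'report-like' table for the demonstration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE report (store TEXT, line INTEGER, amount INTEGER);
    INSERT INTO report VALUES
        ('s1', 1, 10), ('s1', 2, 10), ('s2', 1, 20), ('s2', 2, 30);
""")
cols = ['store', 'line', 'amount']

# One 'page' of rows for the cheap pre-check phase.
page = con.execute("SELECT * FROM report ORDER BY rowid LIMIT 2").fetchall()

def unique_on(rows, idxs):
    """True if the given column indexes have no duplicate tuple in rows."""
    keys = [tuple(r[i] for i in idxs) for r in rows]
    return len(keys) == len(set(keys))

candidate_keys = []
for n in range(1, len(cols) + 1):
    for combo in combinations(range(len(cols)), n):
        if not unique_on(page, combo):        # dismissed on the page alone
            continue
        col_list = ", ".join(cols[i] for i in combo)
        dup = con.execute(
            f"SELECT {col_list} FROM report GROUP BY {col_list} "
            "HAVING COUNT(*) > 1").fetchone()
        if dup is None:                        # full-table check passed
            candidate_keys.append(tuple(cols[i] for i in combo))

print(candidate_keys)
```

A production version would also skip any combination that is a superset of a key already found, since those are trivially unique as well.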
select t1.col, count(*)
from table t1
join table t2
on t1.col = t2.col
group by t1.col
having count(*) > 1
If zero rows are returned, then the column is unique.
For more than one column:
select t1.cola, t1.colb, count(*)
from table t1
join table t2
on t1.cola = t2.cola
and t1.colb = t2.colb
group by t1.cola, t1.colb
having count(*) > 1
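The self-join above is not strictly needed: grouping the table by the candidate columns and keeping groups with COUNT(*) > 1 already reports every duplicated combination. A small sqlite3 check of that simpler form, with invented rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (cola TEXT, colb INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                [('x', 1), ('x', 2), ('y', 1), ('x', 1)])  # ('x', 1) repeats

# No self-join: GROUP BY + HAVING alone finds the duplicated combinations.
dupes = con.execute("""
    SELECT cola, colb, COUNT(*)
    FROM t
    GROUP BY cola, colb
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)        # [('x', 1, 2)] -> (cola, colb) is not a unique key
print(dupes == [])  # an empty result would mean the combination is unique
```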
The question is quite simple, but we've had so many issues with index/statistics updates not always resulting in the proper new execution plans in low-load environments that I need to check this with you guys here to be sure.
Say that you have the following tables:
/*
TABLES:
TABLE_A (PK_ID INT, [random columns], B_ID INT (INDEXED, and references TABLE_B.PK_ID))
TABLE_B (PK_ID INT, [random columns], C_ID INT (INDEXED, and references TABLE_C.PK_ID))
TABLE_C (PK_ID INT, [random columns])
*/
SELECT *
FROM TABLE_A A
JOIN TABLE_B B ON B.PK_ID = A.B_ID
JOIN TABLE_C C ON C.PK_ID = B.C_ID
WHERE A.randcolumn1 = 'asd' AND B.randcolumn2 <> 5
Now, since B is joined to A with its clustered PK column, shouldn't that mean that the index on B.C_ID will not be used as the information is already returned through the B.PK_ID clustered index? In fact, is it not true that the index on B.C_ID will never be used unless the query specifically targets the ID values on that index?
This may seem like a simple and even stupid question, but I want to make absolutely sure I'm getting this right. I'm thinking of making adjustments on our indexing, since we have a lot of unused indexes which have been inherited from an old datamodel and they're taking up quite a bit of space in a DB this size. And experience has shown that we cannot fully trust the execution plans on any environment apart from the production thanks to its extreme load compared to testing environments, which makes it difficult to test this out reliably.
Thanks!
The query optimizer is free to do as it pleases. It could execute the second join by scanning the C table, and for each row, looking up the matching row in B. The index you describe would help with that lookup.
SQL Server provides statistics to tell you if an index is actually used:
select db_name(ius.database_id) as Db
, object_name(ius.object_id) as [Table]
, max(ius.last_user_lookup) as LastLookup
, max(ius.last_user_scan) as LastScan
, max(ius.last_user_seek) as LastSeek
, max(ius.last_user_update) as LastUpdate
from sys.dm_db_index_usage_stats as ius
where ius.[database_id] = db_id()
and ius.[object_id] = object_id('YourTableName')
group by
ius.database_id
, ius.object_id
If the index hasn't been used for more than 2 months, it is usually safe to drop it. (Note that sys.dm_db_index_usage_stats is cleared when the SQL Server instance restarts, so make sure the server has been up long enough for the numbers to be representative.)
I have two tables in SQL Server.
System{
AUTO_ID -- identity auto increment primary key
SYSTEM_GUID -- index created, unique key
}
File{
AUTO_ID -- identity auto increment primary key
File_path
System_Guid -- foreign key refers system.System_guid, index is created
-- on this column
}
System table has 100,000 rows.
File table has 200,000 rows.
File table has only 10 distinct values for System_guid.
My Query is as follows:
select *
from File
left outer join System
    on File.system_guid = System.system_guid
SQL Server is using a hash match join to produce the result, which is taking a long time.
I want to optimize this query to make it go faster. The fact that there are only 10 distinct system_guid values probably means the hash match wastes effort. How can I use this knowledge to speed up the query?
When an indexed column has very few distinct values, the index loses much of its usefulness. If all you want is to extract the records from System whose system_guid is one of the ones in File, then you may be better off (in your case) with a query like:
select * from System
where system_guid in (select distinct system_guid from File).
Is the LEFT join really necessary? How does the query perform as an INNER join? Do you get a different join plan?
I doubt hash join is much of a problem with this amount of I/O.
You could do a UNION like this... maybe coax a different plan out of it.
select f.*, null as AUTO_ID, null as SYSTEM_GUID
from File f
where f.System_Guid not in (select system_guid from System)
union all
select f.*, s.*
from File f
inner join System s
    on f.system_guid = s.system_guid
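Note that for the two branches of the UNION ALL to line up, the unmatched branch must pad System's columns with NULLs, and system_guid must never be NULL (NOT IN misbehaves with NULLs). A toy sqlite3 check with made-up rows that the rewrite then reproduces the LEFT OUTER JOIN:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE System (AUTO_ID INTEGER PRIMARY KEY, SYSTEM_GUID TEXT UNIQUE);
    CREATE TABLE File (AUTO_ID INTEGER PRIMARY KEY, File_path TEXT,
                       System_Guid TEXT);
    INSERT INTO System VALUES (1, 'g1'), (2, 'g2');
    INSERT INTO File VALUES (10, '/a', 'g1'), (11, '/b', 'g1'),
                            (12, '/c', 'g9');  -- 'g9' has no System row
""")

left_join = con.execute("""
    SELECT f.*, s.* FROM File f
    LEFT OUTER JOIN System s ON f.System_Guid = s.SYSTEM_GUID
    ORDER BY f.AUTO_ID
""").fetchall()

union_all = con.execute("""
    SELECT f.*, NULL, NULL FROM File f
    WHERE f.System_Guid NOT IN (SELECT SYSTEM_GUID FROM System)
    UNION ALL
    SELECT f.*, s.* FROM File f
    INNER JOIN System s ON f.System_Guid = s.SYSTEM_GUID
    ORDER BY 1
""").fetchall()

print(left_join == union_all)  # True: same rows, including the NULL-padded one
```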
I have a master table A, with ~9 million rows. Another table B (same structure) has ~28K rows from table A. What would be the best way to remove all contents of B from table A?
The combination of all columns (~10) is unique; there is nothing else in the form of a unique key.
If you have sufficient rights, you can create a new table and rename it to A. To create the new table you can use the following script:
CREATE TABLE TEMP_A AS
SELECT *
FROM A
MINUS
SELECT *
FROM B
This should perform pretty well. (Note: MINUS is Oracle syntax; on SQL Server the equivalent set operator is EXCEPT.)
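The new-table idea can be sketched with sqlite3, where EXCEPT plays the role of MINUS and returns every distinct row of A that has no identical row in B (table contents here are invented for the demonstration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (c1 TEXT, c2 INTEGER);
    CREATE TABLE B (c1 TEXT, c2 INTEGER);
    INSERT INTO A VALUES ('keep', 1), ('drop', 2), ('keep', 3);
    INSERT INTO B VALUES ('drop', 2);

    -- Build the replacement table: rows of A minus rows of B.
    CREATE TABLE TEMP_A AS
        SELECT * FROM A
        EXCEPT
        SELECT * FROM B;
""")

rows = con.execute("SELECT * FROM TEMP_A ORDER BY c2").fetchall()
print(rows)  # [('keep', 1), ('keep', 3)]
```

One caveat: EXCEPT also removes duplicate rows within A itself. Since the question states that the combination of all columns is unique, nothing is lost here, but it matters for tables with exact-duplicate rows.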
DELETE FROM TableA WHERE ID IN (SELECT ID FROM TableB)
Should work. Might take a while though.
One way: just list out all the columns.
delete from a
where exists (select 1 from b
    where b.Col1 = a.Col1
    and b.Col2 = a.Col2
    and b.Col3 = a.Col3
    and b.Col4 = a.Col4)
Delete t2
from t1
inner join t2
on t1.col1 = t2.col1
and t1.col2 = t2.col2
and t1.col3 = t2.col3
and t1.col4 = t2.col4
and t1.col5 = t2.col5
and t1.col6 = t2.col6
and t1.col7 = t2.col7
and t1.col8 = t2.col8
and t1.col9 = t2.col9
and t1.col10 = t2.col10
This is likely to be very slow, as you would need every column indexed, which is highly unlikely in an environment where a table this size has no primary key, so do it during off-peak hours. What possessed you to have a table with 9 million records and no primary key?
If this is something you'll have to do on a regular basis, the first choice should be to try to improve the database design (looking for primary keys, trying to get the "join" condition to be on as few columns as possible).
If that is not possible, the second option is to figure out the "selectivity" of each of the columns (i.e. how many "different" values each column has; 'name' would be more selective than 'address country', which would be more selective than 'male/female').
The general type of statement I'd suggest would be like this:
delete from tableA
where exists (select * from tableB
    where tableA.colx1 = tableB.colx1
    and tableA.colx2 = tableB.colx2
    etc.
    and tableA.colx10 = tableB.colx10)
The idea is to list the columns in order of selectivity and build an index on colx1, colx2, etc. on tableB. The exact number of columns to index would be the result of some trial and measurement. (Offset the time for building the index on tableB against the improved time of the delete statement.)
If this is just a one time operation, I'd just pick one of the slow methods outlined above. It's probably not worth the effort to think too much about this when you can just start a statement before going home ...
Is there a key value (or values) that can be used?
something like
DELETE a
FROM tableA a
INNER JOIN tableB b
on b.id = a.id