How can I "subtract" one table from another? - sql-server

I have a master table A, with ~9 million rows. Another table B (same structure) has ~28K rows from table A. What would be the best way to remove all contents of B from table A?
The combination of all columns (~10) are unique. Nothing more in the form a of a unique key.

If you have sufficient rights you can create a new table and rename that one to A. To create the new table you can use the following script:
CREATE TABLE TEMP_A AS
SELECT *
FROM A
MINUS
SELECT *
FROM B
This should perform pretty good.

DELETE FROM TableA WHERE ID IN(SELECT ID FROM TableB)
Should work. Might take a while though.

one way, just list out all the columns
delete table a
where exists (select 1 from table b where b.Col1= a.Col1
AND b.Col2= a.Col2
AND b.Col3= a.Col3
AND b.Col4= a.Col4)

Delete t2
from t1
inner join t2
on t1.col1 = t2.col1
and t1.col2 = t2.col2
and t1.col3 = t2.col3
and t1.col4 = t2.col4
and t1.col5 = t2.col5
and t1.col6 = t2.col6
and t1.col7 = t2.col7
and t1.col8 = t2.col8
and t1.col9 = t2.col9
and t1.col10 = t2.col0
This is likely to be very slow as you would have to have every col indexed which is highly unlikely in an environment when a table this size has no primary key, so do it during off peak. What possessed you to have a table with 9 million records and no primary key?

If this is something you'll have to do on a regular basis, the first choice should be to try to improve the database design (looking for primary keys, trying to get the "join" condition to be on as few columns as possible).
If that is not possible, the distinct second option is to figure out the "selectivity" of each of the columns (i.e. how many "different" values does each column have, 'name' would be more selective than 'address country' than 'male/female').
The general type of statement I'd suggest would be like this:
Delete from tableA
where exists (select * from tableB
where tableA.colx1 = tableB.colx1
and tableA.colx2 = tableB.colx2
etc. and tableA.colx10 = tableB.colx10).
The idea is to list the columns in order of the selectivity and build an index on colx1, colx2 etc. on tableB. The exact number of columns in tableB would be a result of some trial&measure. (Offset the time for building the index on tableB with the improved time of the delete statement.)
If this is just a one time operation, I'd just pick one of the slow methods outlined above. It's probably not worth the effort to think too much about this when you can just start a statement before going home ...

Is there a key value (or values) that can be used?
something like
DELETE a
FROM tableA a
INNER JOIN tableB b
on b.id = a.id

Related

Is this join overcomplicated?

I have inherited an application made by a previous developer. Some of the database calls are running slow in places where there is a large amount of data. I have found in general the SQL code is well written but there are places that make me think, 'what the..?'
Here is one example:
select a.*
from bs_ResearchEnquiry a
left join bs_StateWorkflowState_Map b
on (
select c.MapId from bs_StateWorkflowState_Map c
where c.StateId = a.StateId AND c.StateWorkflowId = a.StateWorkflowId
)=b.MapId
where
b.IsFinal=1
The MapId field is a unique primary key to the bs_StateWorkflowState_Map table.
StateId and StateWorkflowId together also form a unique key.
There will always be a match on these keys to rows in the foreign table bs_ResearchEnquiry
Therefore, could I rewrite the left join more efficiently, and safely, as:
inner join bs_StateWorkflowState_Map b
on b.StateId = a.StateId AND b.StateWorkflowId = a.StateWorkflowId
Or was the original developer trying to achieve something I've missed ?
Your simplification looks good to me. Note that the presence of:
where b.IsFinal = 1
Means that the outer join is effectively inner join.
With your explanation on keys given, you are right, the query can be simplified. It selects records from bs_ResearchEnquiry where the associated bs_StateWorkflowState_Map record is final. So use EXISTS:
select *
from bs_ResearchEnquiry re
where exists
(
select *
from bs_StateWorkflowState_Map m
where m.StateId = re.StateId
and m.StateWorkflowId = re.StateWorkflowId
and m.IsFinal = 1
);
(From your explanation on uniqueness, I gather that there already exist indexes on (StateId, StateWorkflowId) in both tables. If not, create them.)

Which is the fastest way to run this SQL query?

I have a table (let's call it A) in SQL Server 2016 that I want to query on. I need to select only those rows that have a definitive status, so I need to exclude some rows. There's another table (B), containing the record id from the Table A and two columns, col1 and col2. If these columns are non-empty, the corresponding record can be considered final. There is a one-to-one relationship between tables A and B. Because these tables are rather large, I want to use the most efficient query. Which should I choose?
SELECT *
FROM TableA
WHERE record_id IN
(SELECT record_id FROM TableB WHERE col1 IS NOT NULL AND col2 IS NOT NULL)
SELECT a.*
FROM TableA a
INNER JOIN TableB b ON a.record_id = b.record_id
WHERE b.col1 IS NOT NULL AND b.col2 IS NOT NULL
SELECT a.*
FROM TableA a
INNER JOIN TableB b
ON a.record_id = b.record_id
AND b.col1 IS NOT NULL
AND b.col2 IS NOT NULL
Of course, if there's an even faster way that I hadn't thought of, please share. I'd also be very curious to know why one query is faster than the others.
WITH cte AS
(SELECT b.record_id, b.col1, b.col2
FROM TableB b
WHERE col1 IS NULL
AND col2 IS NULL --if the field isn't NULL, it might be quicker to do <> '')
SELECT a.record_id, a.identifyColumnsNeededExplicitely
FROM cte
JOIN TableA a ON a.record_id = cte.record_id
ORDER BY a.record_id
In practice the execution plan will do whatever it likes depending on your current indexes / clustered index / foreign keys / constraints / table stastics (aka number of rows / general containt of your rows/...). Any analysis should be done case by case and what's true for 2 tables may not be to 2 others table.
Theorically,
Without any index, the first one should be the best since it will make an optimization on operations with 1 table scan on TableB, 2 contants scan on TableB and 1 table scan on Table1.
With a foreign key on TableA.record_id referencing TableB.record_id OR an index in both column, the second should be faster since it will make a scan index and 2 constant scan.
In rare case, it could be the 3rd one depending on TableB stats. But not far from number 2 since number 3 will scan all the TableB.
In even rarer case, neither of the 3.
What I'm tryng to say is : "Since we don't have neither your tables nor rows, open your SQL Management, put the stats ON and try it yourself."

SQL View Optimization

I am trying to build a view that does basically 2 things, whether a record in table 1 is in table 2 and whether a link to another table is still there. it worked on a subset of data, but when i tried to run the full query it timed out in the view designer.
The view worked fine until I added in the check to see whether the link to another table was present.
Initially it joined table A to Table B and filtered out where A.ID wasnt present in the ID column in table B
I was then told that if the link between the person and the address table (stored in table C) was removed then we would have no way of knowing other than to get a full extract of that table again and see which links are no longer present. I am trying to use that check to determine whether to display some data in particular columns
I am using the following structure close to 60 times to choose whether to show information in a column:
Column1 = case when exists (select LinkID from LinkTable C
where cast(C.LinkAddressID as varchar) = A.AddressID
and cast(C.LinkID as varchar) = A.ID)
then Column1
else NULL
end
There is about 1.6m records in Table A just over 4m records in the Link table.
is there a better way to write this query / view that would be more optimized?
Please let me know if more information is needed
Select C.LinkID
From A
Left Join C On C.LinkAddressID = A.AddressID And C.LinkID = A.ID
This will give you C.LinkID if a match exists on the two conditions and NULL if both criteria are not satisfied.
Having indexes / keys such as primary key on A.ID and foreign key relationships based on what is in the join clause will provide very good performance.
As Joe suggested, if for all 60 columns you use the same AddressId and Id fields to match two tables, I believe so you can use something as following query
SELECT
Column1 = CASE WHEN C.LinkID IS NULL THEN NULL ELSE A.Column1 END,
....
FROM A
Left Join LinkTable C
ON C.LinkAddressID = A.AddressID AND C.LinkID = A.ID
Casting data types will definitely disable the advantage from index. So keep away data type cast if possible on joins and in WHERE clauses

Searching for the unique key (not in meta)

With which SQL Server standard tool it is possible to search unique key in the table's data (but not in meta declaration)?
P.S. I am thinking to write such script by myself. May be you could point a snippet for
combinatorics in t-sql? e.g. for generation all Combinations from n by 1..n ?
P.P.S About problem complexity for those who do not see it. It is important that we do not need to analyze the whole data to dismiss the hypnotize that those two columns is the 'unique key'. With real world, 'report-like', sorted data even after analysing first two rows, I think, it is possible to remove many of columns combinations. So I feel such algorithm should have 'before full table compare' phase. But there it is a question for what portion of data to choose for this 'before full table compare' phase . The best candidate about which I think is the 'page'... If data unique in the page we could test the uniqueness on whole table, if not unique (on the page), then go to the next column set.
select t1.col, count(*)
from table t1
join table t2
on t1.col = t2.col
group by t1.col
having count(*) > 1
if zero rows are returned then it is unique
more than one column
select t1.cola, t1.colb, count(*)
from table t1
join table t2
on t1.cola = t2.cola
and t1.colb = t2.colb
group by t1.cola, t2.colb
having count(*) > 1

Pivot Table vs Parent_ID for 1-Many Relation

In a general one-to-many (parent-to-child) relationship, is there a significant efficiency difference between (a) putting parent_id in the child table and (b) using a pivot table of only parent_id, child_id?
NOTE: Assume Oracle if necessary, otherwise answer using a general RDBMS.
If by PIVOT table you mean a many-to-many link table, then no, it will only hamper performance.
You should keep parent_id in the child table.
The many-to-many link table takes an extra JOIN and therefore is less efficient.
Compare the following queries:
SELECT *
FROM child_table c
JOIN child_to_parent cp
ON cp.child = c.id
JOIN parent p
ON p.id = cp.parent
WHERE c.property = 'some_property'
and this one:
SELECT *
FROM child_table c
JOIN parent p
ON p.id = c.parent
WHERE c.property = 'some_property'
The latter one is one JOIN shorter and more efficient.
The only possible exception to that rule is that you run these queries often:
SELECT *
FROM child_table c
JOIN parent_table p
ON p.id = c.parent
WHERE c.id IN (id1, id2, ...)
, i. e. you know the id's of the child rows beforehand.
This may be useful if you use natural keys for your child_table.
In this case yes, the child_to_parent link table will be more efficient, since you can just replace it with the following query:
SELECT *
FROM child_to_parent cp
JOIN parent_table p
ON p.id = cp.parent
WHERE cp.child IN (id1, id2, ...)
and child_to_parent will be always less or equal in size to child_table, and hence more efficient.
However, in Oracle you can achieve just the same result creating a composite index on child_table (id, parent_id).
Since Oracle does not index NULL's, this index will be just like your child_to_parent table, but without the table itself and implied maintenance overhead.
In other systems (which index NULL's), the index may be less efficient than a dedicated table, especially if you have lots of NULL parents.

Resources