Find missing values on the same column of two tables - sql-server

I have two tables in a SQL Server database with the same schema. I want to compare a single column across both tables and find the values that are in table2 but missing from table1. So far I've been doing this manually in Excel with a macro, after getting a distinct list from each query, but a single query would be less work. How can I find the missing records via T-SQL? I'd like to do this for the following data types: datetime, nvarchar and bigint.
SELECT DISTINCT [dbo].[table1].[column1]
FROM [dbo].[table1]
ORDER BY [dbo].[table1].[column1] DESC
SELECT DISTINCT [dbo].[table2].[column1]
FROM [dbo].[table2]
ORDER BY [dbo].[table2].[column1] DESC

There are several ways you can do this...
LEFT JOIN:
SELECT DISTINCT t2.column1
FROM dbo.table2 t2
LEFT JOIN dbo.table1 t1
ON t2.Column1 = t1.Column1
WHERE t1.Column1 IS NULL
NOT EXISTS:
SELECT DISTINCT t2.column1
FROM dbo.table2 t2
WHERE NOT EXISTS (
SELECT 1
FROM dbo.table1 t1
WHERE t1.column1 = t2.column1
)
NOT IN:
SELECT DISTINCT t2.column1
FROM dbo.table2 t2
WHERE t2.column1 NOT IN (
SELECT t1.column1
FROM dbo.table1 t1
)
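These approaches differ slightly in behavior and efficiency, mostly depending on whether the columns contain NULL values. In particular, NOT IN returns no rows at all if table1.column1 contains a NULL, whereas LEFT JOIN and NOT EXISTS still return the unmatched rows. Try each approach to find the most efficient one that gives the results you expect.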
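The NULL-related difference between the three approaches can be reproduced on a tiny data set. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption, chosen so the example runs anywhere), and the sample rows are made up. The three-valued comparison logic it shows is standard SQL behavior.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 INTEGER);
    CREATE TABLE table2 (column1 INTEGER);
    INSERT INTO table1 VALUES (1), (2), (NULL);  -- note the NULL
    INSERT INTO table2 VALUES (2), (3), (3);
""")

left_join = conn.execute("""
    SELECT DISTINCT t2.column1
    FROM table2 t2
    LEFT JOIN table1 t1 ON t2.column1 = t1.column1
    WHERE t1.column1 IS NULL
""").fetchall()

not_exists = conn.execute("""
    SELECT DISTINCT t2.column1
    FROM table2 t2
    WHERE NOT EXISTS (SELECT 1 FROM table1 t1 WHERE t1.column1 = t2.column1)
""").fetchall()

not_in = conn.execute("""
    SELECT DISTINCT t2.column1
    FROM table2 t2
    WHERE t2.column1 NOT IN (SELECT t1.column1 FROM table1 t1)
""").fetchall()

print(left_join)   # [(3,)]
print(not_exists)  # [(3,)]
print(not_in)      # [] because "3 NOT IN (1, 2, NULL)" evaluates to unknown
```

If NOT IN is the preferred form, adding WHERE t1.column1 IS NOT NULL inside the subquery avoids the empty result.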

SELECT DISTINCT [dbo].[table2].[column1]
FROM [dbo].[table2]
except
SELECT DISTINCT [dbo].[table1].[column1]
FROM [dbo].[table1]
This returns all the values of column1 in table2 that are not present in column1 of table1.
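EXCEPT differs from the join-based queries in two ways worth knowing: it removes duplicates on its own, and it treats two NULLs as matching. A minimal sketch, again using Python's sqlite3 as a stand-in for SQL Server (an assumption) with made-up rows:

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 INTEGER);
    CREATE TABLE table2 (column1 INTEGER);
    INSERT INTO table1 VALUES (1), (NULL);
    INSERT INTO table2 VALUES (1), (3), (3), (NULL);
""")

# EXCEPT is a set operator: duplicates collapse and NULL matches NULL.
except_rows = conn.execute("""
    SELECT column1 FROM table2
    EXCEPT
    SELECT column1 FROM table1
""").fetchall()

# The LEFT JOIN form compares with "=", so the NULL in table2 never matches.
left_join_rows = conn.execute("""
    SELECT DISTINCT t2.column1
    FROM table2 t2
    LEFT JOIN table1 t1 ON t2.column1 = t1.column1
    WHERE t1.column1 IS NULL
""").fetchall()

print(except_rows)  # [(3,)]
```

Here the LEFT JOIN version additionally reports the NULL from table2 as missing, so the two queries are only interchangeable when the column has no NULLs.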

Basically, you can use a LEFT JOIN.
TableB is the main table in this case. When you join it with TableA using LEFT JOIN, the records that have no match in TableA still appear in the result list, but their TableA columns are NULL. So, to keep only the non-matching records, add a filter condition that selects only the rows where the TableA column is NULL.
SELECT b.*
FROM tableB b
LEFT JOIN tableA a
ON a.column1 = b.column1
WHERE a.column1 IS NULL
To learn more about joins, see the link below:
Visual Representation of SQL Joins

From SQL Server 2005 onwards, you can use EXCEPT:
SELECT DISTINCT [dbo].[table2].[column1]
FROM [dbo].[table2]
Except
SELECT DISTINCT [dbo].[table1].[column1]
FROM [dbo].[table1]

Related

Efficiently group query by one column, taking the maximum of another column and a third column that comes from the same row as the maximum column

I have a table of 100,000,000+ values, so efficiency is very important to me. I need to take information from table A, join it to an index table B, then join to table C using the index retrieved from table B. The problem is, there are multiple indexes for each value in table A, and I want to retrieve the one with the most recent date.
The query below creates duplicates:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM Table_1 t1
INNER JOIN Table_2 t2 ON t1.ID_1=t2.ID_1
INNER JOIN Table_3 t3 ON t2.ID_2=t3.ID_2
This one does not, but when going from 35,000 to 40,000 elements, the execution time jumps from <5 sec to >1 min:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM
(SELECT * FROM Table_1 t1 CROSS APPLY Table_2 t2 WHERE t1.ID_1=t2.ID_1) t_temp
LEFT JOIN Table_3 t3 ON t_temp.ID_2=t3.ID_2
How can I decrease my execution time as much as possible?
Here is an example table:
For this table, I would be trying to get the most recent location for each person.
None of the columns are indexed and I cannot create indexes on this table.
First of all, when you are working with 100 million+ records and joining them to other tables, the first thing I would ask is what the rationale is for not creating indexes that can cover your query. If you are not the admin of that system, I suggest you bring this up with the admin group and try to understand the exact reason (if any) they do not want an index on that huge table, especially since you mentioned that "efficiency is very important to me".
Remember that SQL tuning is only one step of database performance tuning, and there is only so much you can achieve by writing a good SQL query. When the data volume gets huge, a good SQL query is never sufficient without other performance tuning measures.
Apart from what Roger has already provided, here are a few solutions that you can try out:
Solution 1:
SELECT T1.ID_1, OA.ID_2, OA.Location
FROM Table1 T1
OUTER APPLY (
SELECT TOP 1 T3.ID_2, T3.Location
FROM Table2 T2
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2
WHERE T2.ID_1 = T1.ID_1
ORDER BY T3.Date DESC
) OA;
Solution 2:
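SELECT DISTINCT
T1.ID_1
-- Take ID_2 from the same most recent row; selecting T2.ID_2 directly would keep every ID_2 per ID_1
,ID_2 = FIRST_VALUE(T2.ID_2) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)
,Location = FIRST_VALUE(T3.Location) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)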
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.ID_1 = T2.ID_1
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2;
Data Preparation:
DROP TABLE IF EXISTS Table1
DROP TABLE IF EXISTS Table2
DROP TABLE IF EXISTS Table3
SELECT TOP 10000 ID_1 = object_id, name
INTO Table1
FROM sys.all_objects
ORDER BY object_id
SELECT ID_1 = T1.ID_1, ID_2 = IDENTITY(INT, 1, 1)
INTO Table2
FROM Table1 T1
CROSS JOIN Table1 T2
SELECT ID_2, Location = 'City_'+ CAST(ID_2 AS VARCHAR(100)), Date = CAST(DATEADD(DAY, ID_2/10000, GETDATE()) AS DATE)
INTO Table3
FROM Table2
Indexes to cover the Solution 1:
CREATE NONCLUSTERED INDEX IX_TABLE1_ID_1 ON Table1 (ID_1)
CREATE NONCLUSTERED INDEX IX_TABLE2_ID_2 ON Table2 (ID_1, ID_2)
CREATE NONCLUSTERED INDEX IX_TABLE3_ID_2 ON Table3 (ID_2, Date DESC) INCLUDE (Location)
Execution Plan:
You can see that all operators are 'Index Seek' except for Table1, which is a legitimate 'Index Scan' because you are processing every ID_1 value of Table1. If you put a WHERE clause on the outer query to search for a few specific ID_1 values, that 'Index Scan' will turn into an 'Index Seek' as well.
I will leave the index strategy for the second solution to you (as homework :) ). Tip: you have to make Location a key column as well, or you can go with a COLUMNSTORE index.
You can use something like this:
select top (1) with ties
a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
order by row_number() over(partition by a.A_Id order by b.Date desc);
Alternatively, you can try an olde fashioneth approache:
select a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
where not exists (
select 0 from dbo.TableB pb where pb.A_Id = b.A_Id and pb.Date > b.Date
);
However, as in all such situations, performance will depend heavily on indexes. SSMS can suggest some if you look at the execution plan; off the top of my head, you will need all Id columns to be indexed, and you will need either a single-column index on (Date) or a composite index on (A_Id, Date, B_Id) on TableB.
UPD: If you can't create or modify any indices, and performance is paramount, I would suggest copying the data in question into a separate schema or database, where you might have appropriate permissions. Apart from that... it's impossible to get something out of nothing.
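For readers who want to try the "most recent row per ID_1" idea on a small scale, here is a minimal sketch. It uses a ROW_NUMBER() filter in a derived table, which is a close relative of the TOP (1) WITH TIES query above rather than any answer's exact code. It runs on Python's sqlite3 as a stand-in for SQL Server (an assumption), and the sample rows are invented.

```python
import sqlite3

# In-memory SQLite stand-in for SQL Server (assumption).
# Window functions need SQLite 3.25 or newer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (ID_1 INTEGER);
    CREATE TABLE Table2 (ID_1 INTEGER, ID_2 INTEGER);
    CREATE TABLE Table3 (ID_2 INTEGER, Location TEXT, Date TEXT);
    INSERT INTO Table1 VALUES (1), (2);
    INSERT INTO Table2 VALUES (1, 10), (1, 11), (2, 20);
    INSERT INTO Table3 VALUES
        (10, 'Paris', '2020-01-01'),
        (11, 'Rome',  '2021-06-01'),
        (20, 'Oslo',  '2019-03-05');
""")

# Number the rows of each ID_1 from newest to oldest, then keep row 1.
latest = conn.execute("""
    SELECT ID_1, ID_2, Location
    FROM (
        SELECT t1.ID_1 AS ID_1, t3.ID_2 AS ID_2, t3.Location AS Location,
               ROW_NUMBER() OVER (PARTITION BY t1.ID_1
                                  ORDER BY t3.Date DESC) AS rn
        FROM Table1 t1
        JOIN Table2 t2 ON t2.ID_1 = t1.ID_1
        JOIN Table3 t3 ON t3.ID_2 = t2.ID_2
    ) ranked
    WHERE rn = 1
    ORDER BY ID_1
""").fetchall()

print(latest)  # [(1, 11, 'Rome'), (2, 20, 'Oslo')]
```

This only demonstrates the result shape; it says nothing about performance at 100,000,000+ rows, which still depends on the available indexes.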

Implement Bitwise OR instead of multiple In clause in sql query

I have a query which uses IN clause (can use EXISTS also) for multiple columns which are filtered using OR Clause inside WHERE Clause. Is there any better approach to write this query.
SELECT columndata FROM TABLE1
WHERE column1key in (select columnkey from #temptable1)
OR column2key in (select columnkey from #temptable2)
OR column3key IN (SELECT columnkey FROM #temptable3)
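You can use LEFT JOINs as shown below. The WHERE clause keeps only the rows that matched at least one of the temp tables; note that if a temp table contains duplicate keys, the joins will return duplicate rows.
SELECT columndata
FROM TABLE1 tab1
LEFT JOIN #temptable1 t1 on tab1.column1key = t1.columnkey
LEFT JOIN #temptable2 t2 on tab1.column2key = t2.columnkey
LEFT JOIN #temptable3 t3 on tab1.column3key = t3.columnkey
WHERE t1.columnkey IS NOT NULL
OR t2.columnkey IS NOT NULL
OR t3.columnkey IS NOT NULL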
You may get better performance by this, which breaks down the SELECT into separate queries with a de-duplication later.
SELECT columndata FROM TABLE1
WHERE column1key in (select columnkey from #temptable1)
UNION
SELECT columndata FROM TABLE1
WHERE column2key in (select columnkey from #temptable2)
UNION
SELECT columndata FROM TABLE1
WHERE column3key IN (SELECT columnkey FROM #temptable3)
But you would really have to try it.
With no indexes or bad ones, you still have to scan the same amount of data. With good indexes, this may work better.
As a side note, EXISTS and IN will give the same plan here.
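One detail to check before swapping the OR query for the UNION version: UNION de-duplicates on the selected columns, so the two forms return different row counts when columndata is not unique. A small sketch of that difference, using Python's sqlite3 as a stand-in for SQL Server (an assumption), ordinary tables in place of the # temp tables, and invented rows:

```python
import sqlite3

# In-memory SQLite stand-in for SQL Server (assumption).
# SQLite has no #temp naming, so plain tables play the temp tables' role.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (columndata TEXT, column1key INTEGER,
                         column2key INTEGER, column3key INTEGER);
    CREATE TABLE temptable1 (columnkey INTEGER);
    CREATE TABLE temptable2 (columnkey INTEGER);
    CREATE TABLE temptable3 (columnkey INTEGER);
    INSERT INTO TABLE1 VALUES ('a', 1, 0, 0), ('a', 0, 2, 0),
                              ('b', 0, 0, 3), ('c', 9, 9, 9);
    INSERT INTO temptable1 VALUES (1);
    INSERT INTO temptable2 VALUES (2);
    INSERT INTO temptable3 VALUES (3);
""")

or_rows = conn.execute("""
    SELECT columndata FROM TABLE1
    WHERE column1key IN (SELECT columnkey FROM temptable1)
       OR column2key IN (SELECT columnkey FROM temptable2)
       OR column3key IN (SELECT columnkey FROM temptable3)
    ORDER BY columndata
""").fetchall()

union_rows = conn.execute("""
    SELECT columndata FROM TABLE1
    WHERE column1key IN (SELECT columnkey FROM temptable1)
    UNION
    SELECT columndata FROM TABLE1
    WHERE column2key IN (SELECT columnkey FROM temptable2)
    UNION
    SELECT columndata FROM TABLE1
    WHERE column3key IN (SELECT columnkey FROM temptable3)
    ORDER BY columndata
""").fetchall()

print(or_rows)     # [('a',), ('a',), ('b',)]
print(union_rows)  # [('a',), ('b',)]
```

If the duplicate 'a' rows matter, selecting a unique key alongside columndata in each UNION branch keeps them apart.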

How to use BETWEEN in a reference table

I was wondering how to write a query that checks whether a column's value falls between two values taken from a reference table,
such as
SELECT *
FROM Table1 WHERE Column1 BETWEEN ( SELECT Column1 , Column2 FROM TABLE2 )
I just don't know how to implement it in a correct way.
Thank you.
If you can have overlapping ranges in Table2, and all you want are (unique) Table1 records that are in any range in Table2, then this query will do it.
SELECT *
FROM Table1
WHERE EXISTS (
SELECT *
FROM Table2
WHERE Table1.Column1 BETWEEN Table2.Column1 AND Table2.Column2)
You can also solve this using JOINs, if the ranges in Table2 are not overlapping, otherwise you will need to use either DISTINCT or ROW_NUMBER() to pare them down to unique Table1 records.
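The effect of overlapping ranges is easy to see on a few rows. The following sketch uses Python's sqlite3 as a stand-in for SQL Server (an assumption) with made-up values: the value 5 falls inside both ranges, so the JOIN returns it twice while EXISTS returns it once.

```python
import sqlite3

# In-memory SQLite stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (Column1 INTEGER);
    CREATE TABLE Table2 (Column1 INTEGER, Column2 INTEGER);
    INSERT INTO Table1 VALUES (5), (15), (50);
    INSERT INTO Table2 VALUES (1, 10), (3, 20);  -- overlapping ranges
""")

exists_rows = conn.execute("""
    SELECT Table1.Column1
    FROM Table1
    WHERE EXISTS (
        SELECT 1 FROM Table2
        WHERE Table1.Column1 BETWEEN Table2.Column1 AND Table2.Column2)
    ORDER BY Table1.Column1
""").fetchall()

join_rows = conn.execute("""
    SELECT t1.Column1
    FROM Table1 t1
    INNER JOIN Table2 t2 ON t1.Column1 BETWEEN t2.Column1 AND t2.Column2
    ORDER BY t1.Column1
""").fetchall()

print(exists_rows)  # [(5,), (15,)]
print(join_rows)    # [(5,), (5,), (15,)]
```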
Try This....
SELECT *
FROM Table1 as t1
INNER JOIN Table2 t2 ON t1.Column1 BETWEEN t2.Column1 AND t2.Column2
This works:
SELECT * FROM table1 as t1,table2 as t2
WHERE t1.Column1 BETWEEN t2.Column1 AND t2.Column2

SQL Server - Invalid object name while joining table to itself

I've tried to find an answer to my problem but couldn't find similar example.
I have results from such a query
SELECT * FROM (
SELECT id FROM table
) AS t1
Now I would like to join t1 to another instance of itself because I need to shift it. For example if I wanted to compare a row with the previous one. I tried:
SELECT * FROM (
SELECT id FROM table
) AS t1
LEFT JOIN t1 AS t2 ON (my conditions)
But I get an error that t1 is invalid object name. When I copy my select statement:
SELECT * FROM (
SELECT id FROM table
) AS t1
LEFT JOIN (
SELECT id FROM table
) AS t2 ON (my conditions)
The above works, but is it not slower than joining to already returned results?
Any help would be appreciated
The first one is incorrect:
SELECT * FROM (
SELECT id FROM table
) AS t1
LEFT JOIN t1 AS t2 ON (my conditions)
Because t1 is the alias of a derived table, and it cannot be referenced as a table source elsewhere in the same FROM clause. You can do something similar using a CTE, like so:
;WITH cte
AS
(
SELECT * FROM Table
)
SELECT *
FROM Cte t1
INNER JOIN cte t2 ON --
I think your select should be of the form:
SELECT *
FROM [table] t1
LEFT JOIN [table] t2 ON (your conditions)
From a performance perspective, this is identical to your last select and to the CTE solution in Mahmoud's answer (I've reviewed the execution plan for all three in SQL Server).
It might only be a matter of taste, but I find this form to be more readable/maintainable.
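As a concrete version of "compare a row with the previous one", here is a minimal sketch of a CTE joined to itself. It runs on Python's sqlite3 as a stand-in for SQL Server (an assumption); the table name readings, its value column, and the join condition t2.id = t1.id - 1 are all hypothetical, since the question does not give its real conditions.

```python
import sqlite3

# In-memory SQLite stand-in for SQL Server (assumption).
# "readings" and "value" are hypothetical names for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (id INTEGER, value INTEGER);
    INSERT INTO readings VALUES (1, 10), (2, 15), (3, 12);
""")

# The CTE is defined once and referenced twice under different aliases.
rows = conn.execute("""
    WITH cte AS (
        SELECT id, value FROM readings
    )
    SELECT t1.id, t1.value, t2.value AS previous_value
    FROM cte t1
    LEFT JOIN cte t2 ON t2.id = t1.id - 1
    ORDER BY t1.id
""").fetchall()

print(rows)  # [(1, 10, None), (2, 15, 10), (3, 12, 15)]
```

The join on id - 1 assumes ids are consecutive; with gaps, a row number computed in the CTE would be the safer join key.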

Updating column with value from other table, can't use distinct function

My original data is in Table2. I created Table1 from scratch. I populated Column A like this:
INSERT INTO Table1("item")
SELECT DISTINCT(Table2."item")
FROM Table2
I populated Table1.Totals (Column B) like this:
UPDATE Table1
SET totals = t2.q
FROM Table1 INNER JOIN
(
SELECT t2."item"
, SUM(t2.quantity) AS q
FROM Table2 AS t2
GROUP BY t2."item"
) AS t2
ON Table1."item" = t2."item"
How can I populate Table1."date"? My UPDATE above doesn't work here because I can't use an aggregate function on a date. I was able to get the results I wanted using the following code in a separate query:
SELECT DISTINCT Table1."item"
, Table2."date"
FROM Table1 INNER JOIN Table2
ON Table1."item" = Table2."item"
ORDER BY Table1."item"
But how do I use the results of this query to SET the value of the column? I'm using SQL Server 2008.
If you can't do the insert all over again, as @Lamak suggested, then you could perform an UPDATE this way:
UPDATE t1
SET t1.Date = s.Date
FROM Table1 AS t1
INNER JOIN
(
SELECT Item, [Date] = MAX([Date]) -- or MIN()
FROM Table2
GROUP BY Item
) AS s
ON t1.Item = s.Item;
For SQL Server, you could have used a single INSERT statement:
INSERT INTO Table1(Item, Totals, [Date])
SELECT Item, SUM(Quantity), MIN([Date]) -- It could be MAX([Date])
FROM Table2
GROUP BY Item
The easiest way is to use a simple CTAS (create table as select), which in SQL Server is written as SELECT ... INTO:
select item as item, SUM(quantity) as Q, MIN(date) as d into Table1
from Table2
group by item
Instead of creating a table, you could create a view, using a select statement like the one in @Lamak's answer. That way you wouldn't have to update the new row set each time Table2 is updated.
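The view approach can be sketched as follows. This is an illustration only: it runs on Python's sqlite3 as a stand-in for SQL Server 2008 (an assumption), the view name item_summary and the sample rows are hypothetical, and MIN is used for the date as in the INSERT answer above.

```python
import sqlite3

# In-memory SQLite stand-in for SQL Server (assumption).
# "item_summary" is a hypothetical view name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table2 (item TEXT, quantity INTEGER, date TEXT);
    INSERT INTO Table2 VALUES
        ('pen', 2, '2012-01-05'),
        ('pen', 3, '2012-01-02'),
        ('ink', 1, '2012-02-01');

    -- One row per item: total quantity and earliest date.
    CREATE VIEW item_summary AS
    SELECT item, SUM(quantity) AS totals, MIN(date) AS date
    FROM Table2
    GROUP BY item;
""")

before = conn.execute("SELECT * FROM item_summary ORDER BY item").fetchall()

# New source rows show up in the view without any UPDATE statement.
conn.execute("INSERT INTO Table2 VALUES ('ink', 4, '2012-01-15')")
after = conn.execute("SELECT * FROM item_summary ORDER BY item").fetchall()

print(before)  # [('ink', 1, '2012-02-01'), ('pen', 5, '2012-01-02')]
print(after)   # [('ink', 5, '2012-01-15'), ('pen', 5, '2012-01-02')]
```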
