Implement bitwise OR instead of multiple IN clauses in a SQL query - sql-server

I have a query that uses IN clauses (EXISTS would work as well) on multiple columns, combined with OR inside the WHERE clause. Is there a better approach to writing this query?
SELECT columndata FROM TABLE1
WHERE column1key IN (SELECT columnkey FROM #temptable1)
OR column2key IN (SELECT columnkey FROM #temptable2)
OR column3key IN (SELECT columnkey FROM #temptable3)

You can go for LEFT JOINs as shown below, keeping only the rows where at least one of the joins found a match. (Note that if the temp tables can contain duplicate keys, the joins can multiply rows and you would need SELECT DISTINCT.)
SELECT columndata
FROM TABLE1 tab1
LEFT JOIN #temptable1 t1 ON tab1.column1key = t1.columnkey
LEFT JOIN #temptable2 t2 ON tab1.column2key = t2.columnkey
LEFT JOIN #temptable3 t3 ON tab1.column3key = t3.columnkey
WHERE t1.columnkey IS NOT NULL
OR t2.columnkey IS NOT NULL
OR t3.columnkey IS NOT NULL

You may get better performance from this form, which breaks the SELECT down into separate queries, with UNION de-duplicating the combined results.
SELECT columndata FROM TABLE1
WHERE column1key IN (SELECT columnkey FROM #temptable1)
UNION
SELECT columndata FROM TABLE1
WHERE column2key IN (SELECT columnkey FROM #temptable2)
UNION
SELECT columndata FROM TABLE1
WHERE column3key IN (SELECT columnkey FROM #temptable3)
But you would really have to try it: with no or bad indexes, you still have to scan the same amount of data; with good indexes, this may work better.
As a side note, EXISTS and IN will give the same plan here.
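For reference, the EXISTS form of the original query would look like the sketch below (same tables and columns as in the question); the optimizer typically compiles both shapes to the same plan:
SELECT columndata
FROM TABLE1 t
WHERE EXISTS (SELECT 1 FROM #temptable1 t1 WHERE t1.columnkey = t.column1key)
OR EXISTS (SELECT 1 FROM #temptable2 t2 WHERE t2.columnkey = t.column2key)
OR EXISTS (SELECT 1 FROM #temptable3 t3 WHERE t3.columnkey = t.column3key)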

Related

What happens if the sub-query on an IN operator fails?

I have a T-SQL query where I am using the IN operator to find all records whose GUID is in the result of the subquery. However, I recently changed the schema so that Table6 no longer has a GUID field and instead has an AlternateID field, so the subquery used by the IN operator fails if you run it on its own. Yet if I execute the query as a whole, it always returns all records in the TableGUIDResolving table. It's almost as if the IN operator returns TRUE for every record because the subquery is failing.
I have tried fixing the subquery, and the query executes as expected when I do.
Can someone explain this? Is this behavior intentional?
SELECT ID
FROM TableGUIDResolving
WHERE GUID IN (SELECT AlternateID AS GUID FROM Table1
UNION
SELECT GUID FROM Table2
UNION
SELECT GUID FROM Table3
UNION
SELECT GUID FROM Table4
UNION
SELECT GUID FROM Table5
UNION
SELECT GUID FROM Table6)
Yup. That is what happens when you use subqueries without qualified column names. You think you are saying:
select table6.GUID from table6
but no such column exists in Table6, so the scoping rules in SQL resolve it against the outer query instead:
select TableGUIDResolving.GUID from table6
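One safeguard (a small sketch using the question's table names): always qualify subquery columns with a table alias, so the mistake fails at compile time with an 'Invalid column name' error instead of silently binding to the outer table:
SELECT t6.GUID FROM Table6 t6 -- fails: Invalid column name 'GUID'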
I would recommend that you change your logic to a series of EXISTS checks:
SELECT ID
FROM TableGUIDResolving tgr
WHERE EXISTS (SELECT 1 FROM Table1 t1 WHERE t1.AlternateID = tgr.GUID) OR
EXISTS (SELECT 1 FROM Table2 t2 WHERE t2.GUID = tgr.GUID) OR
EXISTS (SELECT 1 FROM Table3 t3 WHERE t3.GUID = tgr.GUID) OR
EXISTS (SELECT 1 FROM Table4 t4 WHERE t4.GUID = tgr.GUID) OR
EXISTS (SELECT 1 FROM Table5 t5 WHERE t5.GUID = tgr.GUID) OR
EXISTS (SELECT 1 FROM Table6 t6 WHERE t6.AlternateID = tgr.GUID)
If you have an index on GUID/AlternateID in each of the tables, then this should have much better performance.
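A sketch of those supporting indexes (index names are illustrative):
CREATE INDEX IX_Table1_AlternateID ON Table1 (AlternateID);
CREATE INDEX IX_Table2_GUID ON Table2 (GUID);
-- ...and likewise for Table3 through Table6 on their GUID/AlternateID columns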

Efficiently group a query by one column, taking the maximum of a second column along with a third column from the same row as that maximum

I have a table of 100,000,000+ values, so efficiency is very important to me. I need to take information from table A, join it to an index table B, then join to table C using the index retrieved from table B. The problem is that there are multiple index values for each value in table A, and I want to retrieve the one with the most recent date.
The query below creates duplicates:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM Table_1 t1
INNER JOIN Table_2 t2 ON t1.ID_1=t2.ID_1
INNER JOIN Table_3 t3 ON t2.ID_2=t3.ID_2
This one does not, but when the input grows from 35,000 to 40,000 elements, the execution time goes from under 5 seconds to over a minute:
SELECT ID_1, ID_2, Date
INTO #DEST_TABLE FROM
(SELECT * FROM Table_1 t1 CROSS APPLY Table_2 t2 WHERE t1.ID_1=t2.ID_1) t_temp
LEFT JOIN Table_3 t3 ON t_temp.ID_2=t3.ID_2
How can I decrease my execution time as much as possible?
As an example, I would be trying to get the most recent location for each person.
None of the columns are indexed and I cannot create indexes on this table.
First of all, when you are working with 100 million+ records, and joining them to other tables at that, the first thing I would ask is: what is the rationale behind not creating indexes that can cover your query? If you are not the admin of that system, I would suggest you raise this with the admin group and try to understand the exact reason (if any) why they do not want an index on that huge table, especially since you mentioned that "efficiency is very important to me".
Remember that 'SQL Tuning' is only one of the steps of 'Database Performance Tuning', and a well-written query can only take you so far. When the data volume gets huge, a good SQL query alone is never sufficient without other performance-tuning measures.
Apart from what Roger has already provided, here are a few solutions that you can try out:
Solution 1:
SELECT T1.ID_1, OA.ID_2, OA.Location
FROM Table1 T1
OUTER APPLY (
SELECT TOP 1 T3.ID_2, T3.Location
FROM Table2 T2
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2
WHERE T2.ID_1 = T1.ID_1
ORDER BY T3.Date DESC
) OA;
Solution 2:
SELECT DISTINCT
T1.ID_1
,T2.ID_2
,Location = FIRST_VALUE(T3.Location) OVER (PARTITION BY T1.ID_1 ORDER BY T3.Date DESC)
FROM Table1 T1
INNER JOIN Table2 T2
ON T1.ID_1 = T2.ID_1
INNER JOIN Table3 T3
ON T2.ID_2 = T3.ID_2;
Data Preparation:
DROP TABLE IF EXISTS Table1
DROP TABLE IF EXISTS Table2
DROP TABLE IF EXISTS Table3
SELECT TOP 10000 ID_1 = object_id, name
INTO Table1
FROM sys.all_objects
ORDER BY object_id
SELECT ID_1 = T1.ID_1, ID_2 = IDENTITY(INT, 1, 1)
INTO Table2
FROM Table1 T1
CROSS JOIN Table1 T2
SELECT ID_2, Location = 'City_'+ CAST(ID_2 AS VARCHAR(100)), Date = CAST(DATEADD(DAY, ID_2/10000, GETDATE()) AS DATE)
INTO Table3
FROM Table2
Indexes to cover the Solution 1:
CREATE NONCLUSTERED INDEX IX_TABLE1_ID_1 ON Table1 (ID_1)
CREATE NONCLUSTERED INDEX IX_TABLE2_ID_2 ON Table2 (ID_1, ID_2)
CREATE NONCLUSTERED INDEX IX_TABLE3_ID_2 ON Table3 (ID_2, Date DESC) INCLUDE (Location)
Execution Plan:
All operators are Index Seeks, except for Table1, which shows a legitimate Index Scan, because the query probes for every ID_1 value in Table1. If you put a WHERE clause on the outer query to search for a few specific ID_1 values, that Index Scan will turn into an Index Seek as well.
I will leave the index strategy for the second solution to you (as homework :) ). Tip: you would have to make Location a key column as well, or you can go with a COLUMNSTORE index approach, sketched below.
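For the columnstore route, a minimal sketch against the demo Table3 above (the index name is illustrative, and whether it beats the row-store index depends on your workload):
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Table3
ON Table3 (ID_2, Date, Location);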
You can use something like this:
select top (1) with ties
a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
order by row_number() over(partition by a.A_Id order by b.Date desc);
Alternatively, you can try an olde fashioneth approache:
select a.A_Id, b.B_Id, b.Date
from dbo.TableA a
inner join dbo.TableB b on a.A_Id = b.A_Id
inner join dbo.TableC c on c.B_Id = b.B_Id
where not exists (
select 0 from dbo.TableB pb where pb.B_Id = b.B_Id and pb.Date > b.Date
);
However, as with all such situations, performance will heavily depend on indexes. SSMS can suggest some if you look at the execution plan; off the top of my head, you will need all the Id columns indexed, and on TableB you will need either a single-column index on (Date) or a composite index on (A_Id, Date, B_Id).
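A sketch of that composite index (the name is illustrative; DESC on Date matches the ORDER BY in the queries above):
CREATE NONCLUSTERED INDEX IX_TableB_AId_Date_BId
ON dbo.TableB (A_Id, Date DESC, B_Id);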
UPD: If you can't create or modify any indexes and performance is paramount, I would suggest copying the data in question into a separate schema or database where you do have the appropriate permissions. Apart from that... it's impossible to get something out of nothing.

SQL get counts using subqueries from multiple linked tables

Suppose I have tables 1-4, where all the other tables are linked to table1. For what it's worth, table1, table2, and table3 are relatively small, but table4 contains a lot of data.
Now I have the following query:
SELECT t1.id
, (SELECT COUNT(*) FROM table2 WHERE table1_id = t1.id) AS t2_count
, (SELECT COUNT(*) FROM table3 WHERE table1_id = t1.id) AS t3_count
, (SELECT COUNT(*) FROM table4 WHERE table1_id = t1.id) AS t4_count
FROM table1 t1
Because the subqueries are dependent/correlated, I assumed there must be a better way (performance-wise) to get the data.
I tried the following, but it drastically increased the execution time (from about 2 s to 35 s). I'm guessing that the multiple left joins create a very big intermediate data set?!
SELECT t1.id
, COUNT(t2.id) AS t2_count
, COUNT(t3.id) AS t3_count
, COUNT(t4.id) AS t4_count
FROM table1 t1
LEFT JOIN table2 t2 ON t2.table1_id = t1.id
LEFT JOIN table3 t3 ON t3.table1_id = t1.id
LEFT JOIN table4 t4 ON t4.table1_id = t1.id
GROUP BY t1.id
Is there a better way to get the counts? I don't need the data from the other tables.
UPDATE:
Bart's answer got me thinking that the table1_id columns are nullable. I added an IS NOT NULL check to the WHERE clauses, and this brought the time down to 1 s.
SELECT t1.id
, (SELECT COUNT(*) FROM table2 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t2_count
, (SELECT COUNT(*) FROM table3 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t3_count
, (SELECT COUNT(*) FROM table4 WHERE table1_id IS NOT NULL AND table1_id = t1.id) AS t4_count
FROM table1 t1
I guess not. If you execute a SELECT COUNT(*) FROM [table], it should perform a count on the table's PK. That should be pretty fast, even for very large tables.
Is your table4 a real table (and not a view, or a table-valued function, or something else that looks like a table)? And does it have a primary key? If so, I don't think that the performance of a SELECT COUNT(*) FROM [table4] query can be increased significantly.
It may also be that your table4 is heavily targeted (by concurrent transactions over multiple connections), or that your SQL Server is doing some heavy IO or computation. I cannot assume anything about that, however. You could check whether the query is also slow on a restored database backup on a physically separate test server.
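For completeness, one rewrite worth trying (a sketch using the question's table and column names) is to pre-aggregate each child table once and join the resulting one-row-per-id sets; unlike the multi-join attempt above, this avoids the row explosion, and unlike the correlated subqueries it scans each child table only once:
SELECT t1.id
, ISNULL(c2.cnt, 0) AS t2_count
, ISNULL(c3.cnt, 0) AS t3_count
, ISNULL(c4.cnt, 0) AS t4_count
FROM table1 t1
LEFT JOIN (SELECT table1_id, COUNT(*) AS cnt FROM table2 GROUP BY table1_id) c2 ON c2.table1_id = t1.id
LEFT JOIN (SELECT table1_id, COUNT(*) AS cnt FROM table3 GROUP BY table1_id) c3 ON c3.table1_id = t1.id
LEFT JOIN (SELECT table1_id, COUNT(*) AS cnt FROM table4 GROUP BY table1_id) c4 ON c4.table1_id = t1.id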

Find missing values on the same column of two tables

Suppose you have two tables in a SQL Server database with the same schema. I want to compare a single column on both tables and find the values that are missing in table1 but are in table2. I've been doing this manually in Excel with a macro after getting a distinct list from each query, but a query would be less work. How can I find the missing records via T-SQL? I'd like to do this for the following data types: datetime, nvarchar, and bigint.
SELECT DISTINCT [dbo].[table1].[column1]
FROM [dbo].[table1]
ORDER BY [dbo].[table1].[column1] DESC
SELECT DISTINCT [dbo].[table2].[column1]
FROM [dbo].[table2]
ORDER BY [dbo].[table2].[column1] DESC
There are several ways you can do this...
LEFT JOIN:
SELECT DISTINCT t2.column1
FROM dbo.table2 t2
LEFT JOIN dbo.table1 t1
ON t2.Column1 = t1.Column1
WHERE t1.Column1 IS NULL
NOT EXISTS:
SELECT DISTINCT t2.column1
FROM dbo.table2 t2
WHERE NOT EXISTS (
SELECT 1
FROM dbo.table1 t1
WHERE t1.column1 = t2.column1
)
NOT IN:
SELECT DISTINCT t2.column1
FROM dbo.table2 t2
WHERE t2.column1 NOT IN (
SELECT t1.column1
FROM dbo.table1 t1
)
There are some slight variations in the behavior and efficiency of these approaches, based mostly on the presence of NULL values in the columns: in particular, NOT IN returns no rows at all if the subquery produces even a single NULL, so NOT EXISTS is usually the safer choice. Try each approach to find the most efficient one that gives the results you expect.
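A quick self-contained demonstration of the NOT IN pitfall (a sketch with throwaway table variables):
DECLARE @t1 TABLE (v INT);
DECLARE @t2 TABLE (v INT);
INSERT INTO @t1 VALUES (1), (2);
INSERT INTO @t2 VALUES (1), (NULL);
-- Returns no rows: the NULL makes every NOT IN comparison UNKNOWN
SELECT v FROM @t1 WHERE v NOT IN (SELECT v FROM @t2);
-- Returns 2, as intended
SELECT v FROM @t1 a WHERE NOT EXISTS (SELECT 1 FROM @t2 b WHERE b.v = a.v);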
EXCEPT (available from SQL Server 2005 onwards):
SELECT DISTINCT [dbo].[table2].[column1]
FROM [dbo].[table2]
EXCEPT
SELECT DISTINCT [dbo].[table1].[column1]
FROM [dbo].[table1]
This returns all the values of column1 in table2 that are not present in column1 of table1. (EXCEPT already returns distinct rows, so the DISTINCT keywords are redundant here.)
Basically, you can use a LEFT JOIN.
TableB is set as the main table in this case. By joining it to TableA with a LEFT JOIN, records that have no match in TableA will still be in the result list, but with NULL values on the TableA side. So to keep only the non-matching records, add a filter that selects the rows where the TableA value is NULL.
SELECT b.*
FROM tableB b
LEFT JOIN tableA a
ON a.column1 = b.column1
WHERE a.column1 IS NULL
To learn more about joins, visit the link below:
Visual Representation of SQL Joins

How to use BETWEEN in a reference table

I was just wondering how to create a query that checks whether a column's value falls between two columns in a reference table, such as:
SELECT *
FROM Table1 WHERE Column1 BETWEEN ( SELECT Column1 , Column2 FROM TABLE2 )
I just don't know how to implement it correctly.
Thank you.
If you can have overlapping ranges in Table2, and all you want are (unique) Table1 records that are in any range in Table2, then this query will do it.
SELECT *
FROM Table1
WHERE EXISTS (
SELECT *
FROM Table2
WHERE Table1.Column1 BETWEEN Table2.Column1 AND Table2.Column2)
You can also solve this using JOINs if the ranges in Table2 are not overlapping; otherwise you will need either DISTINCT or ROW_NUMBER() to pare the result down to unique Table1 records, as in the sketch below.
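For the overlapping case, a deduplicating sketch (same table names as the question; DISTINCT collapses a Table1 row that falls into several ranges back into a single result row):
SELECT DISTINCT t1.*
FROM Table1 t1
INNER JOIN Table2 t2
ON t1.Column1 BETWEEN t2.Column1 AND t2.Column2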
Try This....
SELECT *
FROM Table1 as t1
INNER JOIN Table2 t2 ON t1.Column1 BETWEEN t2.Column1 AND t2.Column2
This also works, using the old-style comma-join syntax:
SELECT * FROM table1 AS t1, table2 AS t2
WHERE t1.Column1 BETWEEN t2.Column1 AND t2.Column2
