TSQL Operators IN vs INNER JOIN - sql-server

Using SQL Server 2014:
Is there any performance difference between the following statements?
DELETE FROM MyTable WHERE PKID IN (SELECT PKID FROM #TmpTableVar)
and
DELETE MyTable FROM MyTable INNER JOIN #TmpTableVar t ON MyTable.PKID = t.PKID

In your given example the execution plans will most probably be the same.
But having the same execution plan doesn't mean it is the best plan you could possibly get for this statement.
The problem I see in both of your queries is the use of the table variable.
SQL Server keeps no statistics on a table variable and, prior to the deferred compilation behaviour added in SQL Server 2019, always assumes that it contains only 1 row.
So no matter how many rows you actually put into #TmpTableVar, SQL Server will assume there is only one.
You can change your code slightly to give SQL Server a better idea of how many rows it is dealing with by replacing the table variable with a temporary table, and since PKID is a primary-key column you can also create an index on that temporary table, giving SQL Server the best chance of coming up with a good execution plan for this query.
SELECT PKID INTO #Temp
FROM #TmpTableVar;

-- index the lookup column (example index; adjust the name and type as needed)
CREATE CLUSTERED INDEX IX_Temp_PKID ON #Temp (PKID);

DELETE FROM MyTable
WHERE EXISTS (SELECT 1
              FROM #Temp t
              WHERE MyTable.PKID = t.PKID);
Note
The IN operator will work fine here since PKID is a primary-key column in the table variable, but if you ever use IN (or especially NOT IN) against a nullable column the results may surprise you: IN simply never matches a NULL, and NOT IN returns no rows at all as soon as the subquery produces a single NULL.
I personally prefer EXISTS for such queries; an inner join also works just fine, but avoid IN if you can.
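A quick throwaway illustration of that NULL behaviour (made-up values, not from the original question):
DECLARE @t TABLE (Id INT NULL);
INSERT INTO @t (Id) VALUES (1), (NULL);

-- Returns only 1: IN never matches the NULL entry.
SELECT x.Val FROM (VALUES (1), (2)) AS x(Val) WHERE x.Val IN (SELECT Id FROM @t);

-- Returns nothing at all: the NULL in the subquery makes NOT IN unknowable for every row.
SELECT x.Val FROM (VALUES (1), (2)) AS x(Val) WHERE x.Val NOT IN (SELECT Id FROM @t);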

Related

Index for using IN clause in where condition

My application accesses data from a table in SQL Server. Consider the table name to be PurchaseDetail, with some other columns.
The select query has the WHERE clauses below:
1. name - the name column has only 10000 distinct values.
2. createdDateTime
The actual query is
select *
from PurchaseDetail
where name in (~2000 names)
and createdDateTime = 'someDateValue';
The SQL Tuning Advisor gave some recommendations. I tried the recommended indexes; performance improved a bit, but not enough.
Is there anything wrong with my query? Or is it possible to change/improve my select query?
I haven't used IN in a WHERE clause before. My table has more than 100 million records.
Any suggestions, please?
In this case, using IN with that much data is not a good idea at all.
The best way is to use an INNER JOIN instead.
It would be better to insert those names into a temp table and INNER JOIN it in your SELECT query.
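Something like this (a sketch only; the temp table and index names are made up, and the name list would come from your application):
CREATE TABLE #Names (name VARCHAR(100) NOT NULL);
CREATE CLUSTERED INDEX IX_Names_name ON #Names (name);

INSERT INTO #Names (name) VALUES ('name1'), ('name2'); -- ...and the rest of the ~2000 names

SELECT pd.*
FROM PurchaseDetail pd
INNER JOIN #Names n ON n.name = pd.name
WHERE pd.createdDateTime = 'someDateValue';

DROP TABLE #Names;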

Performance of User-Defined Table Types in SQL Server

We have been using User-Defined Table Types to pass a list of integers to our stored procedures.
We then use these to join to other tables in our stored proc queries.
For example:
CREATE PROCEDURE [dbo].[sp_Name]
(
    @Ids [dbo].[OurTableType] READONLY
)
AS
SET NOCOUNT ON;

SELECT *
FROM SOMETABLE
INNER JOIN @Ids AS [OurTableType] ON [OurTableType].Id = SOMETABLE.Id
We have seen very poor performance from this when using larger datasets.
One approach we've used to speed things up is to dump the contents into a temp table and join off that instead.
For example:
CREATE PROCEDURE [dbo].[sp_Name]
(
    @Ids [dbo].[OurTableType] READONLY
)
AS
SET NOCOUNT ON;

CREATE TABLE #TempTable (Id INT);

INSERT INTO #TempTable
SELECT Id FROM @Ids;

SELECT *
FROM SOMETABLE
INNER JOIN #TempTable ON #TempTable.Id = SOMETABLE.Id;

DROP TABLE #TempTable;
This does improve performance significantly, but I wanted to get some opinions on this approach and on any consequences we haven't considered. An explanation of why this improves performance would also be useful.
N.B. Sometimes we may need to pass in more than just an integer, which is why we don't use a comma-separated list or something like that.
SQL Server 2019 and SQL Azure
Microsoft has implemented a new feature called Table Variable Deferred Compilation that largely resolves the performance issues with table variables in previous versions of SQL Server:
With table variable deferred compilation, compilation of a statement that references a table variable is deferred until the first actual execution of the statement. This is identical to the behavior of temporary tables, and this change results in the use of actual cardinality instead of the original one-row guess.
This behaviour is available out-of-the-box once the database runs under compatibility level 150 and requires no per-query opt-in. Unfortunately it can still suffer from parameter-sniffing issues, but overall it's a massive improvement.
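To check whether a database will get this behaviour (a sketch; substitute your own database name):
-- Deferred compilation kicks in under compatibility level 150 (SQL Server 2019)
SELECT name, compatibility_level FROM sys.databases WHERE name = 'YourDatabase';
ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 150;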
SQL Server 2017 and earlier
The primary reason for the poor performance of the JOIN is that the Table-Valued Parameter (TVP) is a table variable. Table variables do not keep statistics and appear to the Query Optimizer to have only 1 row. Hence they are perfectly fine for something like INSERT INTO Table (column_list) SELECT column_list FROM @TVP; but not for a JOIN.
There are a few things to try to get around this:
Dump everything to a local temporary table (you are already doing this). A technical downside here is that you are duplicating the data passed into the TVP in tempdb (where both the TVP and temp table store their data).
Try defining the User-Defined Table Type to have a Clustered Primary Key. You can do this inline on the [Id] field:
[ID] INT NOT NULL PRIMARY KEY
Not sure how much this helps performance, but worth a try.
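For example (a sketch; the type name simply mirrors the one used in the question):
CREATE TYPE [dbo].[OurTableType] AS TABLE
(
    [Id] INT NOT NULL PRIMARY KEY CLUSTERED
);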
Add OPTION (RECOMPILE) to the query. This is a way of getting the Query Optimizer to see how many rows are in a Table Variable so that it can have proper estimates.
SELECT column_list
FROM SOMETABLE
INNER JOIN @Ids AS [OurTableType]
        ON [OurTableType].Id = SOMETABLE.Id
OPTION (RECOMPILE);
The downside here is that you have a RECOMPILE which takes additional time each time this proc is called. But that might be an overall net gain.
Starting in SQL Server 2014, you can take advantage of In-Memory OLTP and specify WITH (MEMORY_OPTIMIZED = ON) for the User-Defined Table Type. Please see Scenario: Table variable can be MEMORY_OPTIMIZED=ON for details. I have heard that this definitely helps. Unfortunately, in SQL Server 2014 and SQL Server 2016 RTM this feature is only available in 64-bit Enterprise Edition. But, starting with SQL Server 2016 SP1, this feature was made available to all editions (possible exception being SQL Server Express LocalDB).
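A sketch of what such a memory-optimized type might look like (the type name and bucket count are just example values):
CREATE TYPE [dbo].[OurTableType_InMem] AS TABLE
(
    [Id] INT NOT NULL PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1024)
)
WITH (MEMORY_OPTIMIZED = ON);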
P.S. Don't do SELECT *. Always specify a column list, unless you are doing something like IF EXISTS (SELECT * FROM ...).

Does MS SQL Server automatically create a temp table if the query contains a lot of ids in an 'IN' clause?

I have a big query that fetches multiple rows by id, like
SELECT *
FROM TABLE
WHERE Id IN (1001, ..., 10000)
This query runs very slowly and ends up with a timeout exception.
A temporary fix is to query in batches, breaking the query into 10 parts of 1000 ids each.
I heard that using temp tables may help in this case, but it also looks like MS SQL Server does this automatically underneath.
What is the best way to handle problems like this?
You could write the query as follows using a temporary table:
CREATE TABLE #ids (Id INT NOT NULL PRIMARY KEY);
INSERT INTO #ids (Id) VALUES (1001), (1002), /*add your individual Ids here*/ (10000);

SELECT t.*
FROM [Table] AS t
INNER JOIN #ids AS ids ON ids.Id = t.Id;

DROP TABLE #ids;
My guess is that it will probably run faster than your original query. The lookup can then be done directly using an index (if one exists on the [Table].Id column).
Your original query translates to
SELECT *
FROM [TABLE]
WHERE Id=1001 OR Id=1002 OR /*...*/ OR Id=10000;
This would require evaluation of the expression Id=1001 OR Id=1002 OR /*...*/ OR Id=10000 for every row in [Table], which probably takes longer than with a temporary table. The example with a temporary table takes each Id in #ids and looks for a corresponding Id in [Table] using an index.
This all assumes that there are gaps in the Ids between 1001 and 10000. Otherwise it would be easier to write
SELECT *
FROM [TABLE]
WHERE Id BETWEEN 1001 AND 10000;
This would also require an index on [Table].Id to speed it up.

How to force SQL Server to process CONTAINS clauses before WHERE clauses?

I have a SQL query that uses both standard WHERE clauses and full text index CONTAINS clauses. The query is built dynamically from code and includes a variable number of WHERE and CONTAINS clauses.
In order for the query to be fast, it is very important that the full text index be searched before the rest of the criteria are applied.
However, SQL Server chooses to process the WHERE clauses before the CONTAINS clauses, and that causes table scans, so the query is very slow.
I'm able to rewrite this using two queries and a temporary table. When I do so, the query executes 10 times faster. But I don't want to do that in the code that creates the query because it is too complex.
Is there a way to force SQL Server to process the CONTAINS before anything else? I can't force a plan (USE PLAN) because the query is built dynamically and varies a lot.
Note: I have the same problem on SQL Server 2005 and SQL Server 2008.
You can signal your intent to the optimiser like this:
SELECT *
FROM
    (
    SELECT *
    FROM ...
    WHERE CONTAINS(...)
    ) T1
WHERE
    (normal conditions)
However, SQL is declarative: you say what you want, not how to do it. So the optimiser may decide to ignore the nesting above.
You can force the derived table with CONTAINS to be materialised before the classic WHERE clause is applied. I won't guarantee performance.
SELECT *
FROM
    (
    SELECT TOP 2000000000 *
    FROM ...
    WHERE CONTAINS(...)
    ORDER BY SomeID
    ) T1
WHERE
    (normal conditions)
Try doing it with 2 queries without temp tables:
SELECT *
FROM table
WHERE id IN (
    SELECT id
    FROM table
    WHERE contains_criteria
)
AND further_where_clauses
As I noted above, this is NOT as clean a way to "materialize" the derived table as the TOP clause that @gbn proposed, but a loop join hint forces an order of evaluation, and has worked for me in the past (admittedly usually with two different tables involved). There are a couple of problems though:
The query is ugly.
You still don't get any guarantee that the other WHERE parameters aren't evaluated until after the join (I'll be interested to see what you get).
Here it is though, given that you asked:
SELECT OriginalTable.XXX
FROM (
    SELECT XXX
    FROM OriginalTable
    WHERE CONTAINS(XXX)
) AS ContainsCheck
INNER LOOP JOIN OriginalTable
    ON ContainsCheck.PrimaryKeyColumns = OriginalTable.PrimaryKeyColumns
    AND OriginalTable.OtherWhereConditions = OtherValues

UPSERT in SSIS

I am writing an SSIS package to run on SQL Server 2008. How do you do an UPSERT in SSIS?
IF KEY NOT EXISTS
    INSERT
ELSE
    IF DATA CHANGED
        UPDATE
    ENDIF
ENDIF
See SQL Server 2008 - Using Merge From SSIS. I've implemented something like this, and it was very easy. Just using the BOL page Inserting, Updating, and Deleting Data using MERGE was enough to get me going.
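For reference, the T-SQL MERGE pattern that approach is built on looks roughly like this (a sketch; the table and column names are placeholders, not from the question):
MERGE INTO dbo.TargetTable AS tgt
USING dbo.StagingTable AS src
    ON tgt.KeyColumn = src.KeyColumn
WHEN MATCHED AND tgt.SomeColumn <> src.SomeColumn THEN
    UPDATE SET tgt.SomeColumn = src.SomeColumn
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyColumn, SomeColumn)
    VALUES (src.KeyColumn, src.SomeColumn);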
Apart from T-SQL based solutions (and this is not even tagged as sql/tsql), you can use an SSIS Data Flow Task with a Merge Join as described here (and elsewhere).
The crucial part is the Full Outer Join in the Merge Join (if you only want to insert/update and not delete, a Left Outer Join works as well) of your sorted sources.
This is followed by a Conditional Split to decide what to do next: insert into the destination (which is also my source here), update it (via a SQL Command), or delete from it (again via a SQL Command); a sketch of such a command follows the list below.
INSERT: if the gid is found only in the source (left)
UPDATE: if the gid exists in both the source and the destination
DELETE: if the gid is not found in the source but exists in the destination (right)
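As a rough idea, the update branch of that Conditional Split could feed an OLE DB Command containing a parameterized statement like the one below (the table and column names are placeholders; the ? markers get mapped to the data-flow input columns):
UPDATE dbo.Destination
SET SomeColumn = ?
WHERE gid = ?;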
I would suggest you have a look at Mat Stephen's weblog post on SQL Server's upsert:
SQL 2005 - UPSERT: In nature but not by name; but at last!
Another way to create an upsert in SQL (if you have pre-stage or staging tables):
--Insert Portion
INSERT INTO FinalTable
( Columns )
SELECT T.TempColumns
FROM TempTable T
WHERE
(
SELECT 'Bam'
FROM FinalTable F
WHERE F.Key(s) = T.Key(s)
) IS NULL
--Update Portion
UPDATE FinalTable
SET NonKeyColumn(s) = T.TempNonKeyColumn(s)
FROM TempTable T
WHERE FinalTable.Key(s) = T.Key(s)
AND CHECKSUM(FinalTable.NonKeyColumn(s)) <> CHECKSUM(T.NonKeyColumn(s))
The basic Data Manipulation Language (DML) commands that have been in use over the years are Update, Insert and Delete. They do exactly what you expect: Insert adds new records, Update modifies existing records and Delete removes records.
An UPSERT modifies existing records and, if a record is not present, inserts it as a new record.
The functionality of an UPSERT statement can be achieved with two relatively new T-SQL set operators:
EXCEPT
INTERSECT
EXCEPT:
Returns any distinct values from the query to the left of the EXCEPT operand that are not also returned from the right query
INTERSECT:
Returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operand.
Example: let's say we have two tables, Table_1 and Table_2.
Table_1 (column Number, datatype int)
----------
1
2
3
4
5
Table_2 (column Number, datatype int)
----------
1
2
5
SELECT * FROM TABLE_1 EXCEPT SELECT * FROM TABLE_2
will return 3, 4 as they are present in Table_1 but not in Table_2.
SELECT * FROM TABLE_1 INTERSECT SELECT * FROM TABLE_2
will return 1,2,5 as they are present in both tables Table_1 and Table_2.
All the pains of Complex joins are now eliminated :-)
To use this functionality in SSIS, all you need to do is add an "Execute SQL" task and put the code in there.
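For instance, using the two example tables above and treating Table_2 as the target, the insert half of an upsert could look like this (a sketch; the update half would still need its own UPDATE or MERGE):
-- Insert the rows that exist in Table_1 but not yet in Table_2
INSERT INTO Table_2 (Number)
SELECT Number FROM Table_1
EXCEPT
SELECT Number FROM Table_2;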
I usually prefer to let the SSIS engine manage the delta merge: only new items are inserted and changed ones are updated.
If your destination server does not have enough resources to manage a heavy query, this method lets you use the resources of your SSIS server instead.
We can use the Slowly Changing Dimension component in SSIS to upsert.
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/transformations/configure-outputs-using-the-slowly-changing-dimension-wizard?view=sql-server-ver15
I would use the 'Slowly Changing Dimension' task.
