Okay, so I have a query that looks like this:
Declare @Table1 Table (some columns)
Insert into @Table1 [QueryA]
Update @Table1
set Field1 = A.Value1
from ([QueryB]) A
where Field2 = A.Value2
Select * from @Table1
QueryA is a simple query that returns ~150 rows. QueryB is more complex and returns 3 rows. When run on its own, QueryB returns in less than 1 second. When run inside of the update statement, QueryB takes about 1 minute to run.
Now, if the query is reformatted like this, the whole thing takes less than a second:
Declare @Table1 Table (some columns)
Insert into @Table1 [QueryA]
Declare @Table2 Table (some columns)
Insert into @Table2 [QueryB]
Update @Table1
set Field1 = A.Value1
from (select * from @Table2) A
where Field2 = A.Value2
Select * from @Table1
Does anyone know why this is happening? My guess is that something wonky is going on with the optimizer engine, but if I'm missing something, I'd love to hear it.
SQL Server does not create statistics on table variables, so the query plans will involve scans: lots and lots of repeated scans. The other issue is that it does not save the query plan, so on every run through it recreates the query plan.
So what you are getting is, for every row, a scan * (recreate a query plan + execute a query plan).
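A minimal sketch of two common workarounds, under stated assumptions: the column names and int types are placeholders (the question only says "some columns"), and the inline VALUES rows stand in for QueryA and QueryB. The small result of QueryB is materialized once into a temporary table, which does get statistics, and OPTION (RECOMPILE) is added so the optimizer can see the actual row count of the table variable when the statement is compiled.

```sql
-- Placeholder schema: the real column lists are not shown in the question.
DECLARE @Table1 TABLE (Field1 int NULL, Field2 int NOT NULL);
INSERT INTO @Table1 (Field1, Field2)
VALUES (NULL, 1), (NULL, 2), (NULL, 3);          -- stands in for QueryA

-- Run the expensive query once and keep its rows in a temp table.
CREATE TABLE #Table2 (Value1 int NOT NULL, Value2 int NOT NULL PRIMARY KEY);
INSERT INTO #Table2 (Value1, Value2)
VALUES (10, 1), (20, 2), (30, 3);                -- stands in for QueryB

UPDATE t
SET    t.Field1 = A.Value1
FROM   @Table1 AS t
JOIN   #Table2 AS A ON A.Value2 = t.Field2
OPTION (RECOMPILE);   -- compile with the table variable's real row count

SELECT * FROM @Table1;
DROP TABLE #Table2;
```

Whether either change helps in a given case is something to confirm by comparing the actual execution plans.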
I know how to generate scripts to script insert lines allowing me to backup some data. I was wondering though if it was possible to write a query (using WHERE clause as an example) to target a very small subset of data in a very large table?
In the end I want to generate a script that has a bunch of insert lines and will allow for inserting primary key values (where it normally would not let you).
SSMS will not generate INSERT statements for specific rows in a table. You can do this with the GenerateInsert stored procedure, which is not built in and has to be created in your database first. For example:
EXECUTE dbo.GenerateInsert @ObjectName = N'YourTableName'
,@SearchCondition='[ColumnName]=ColumnValue';
will give you INSERT statements for just the rows matched by @SearchCondition.
Let's say your table name is Table1 which has columns Salary & Name and you want the insert queries for those who have salary greater than 1000 whose name starts with Mr., then you can use this :
EXECUTE dbo.GenerateInsert @ObjectName = N'Table1'
,@SearchCondition='[Salary]>1000 AND [Name] LIKE ''Mr.%'''
,@PopulateIdentityColumn=1;
If I read your requirement correctly, what you actually want to do is simply make a copy of some data in your table. I typically do this by using a SELECT INTO. This will also generate the target table for you.
CREATE TABLE myTable (Column1 int, column2 NVARCHAR(50))
;
INSERT INTO myTable VALUES (1, 'abc'), (2, 'bcd'), (3, 'cde'), (4, 'def')
;
SELECT * FROM myTable
;
SELECT
*
INTO myTable2
FROM myTable WHERE Column1 > 2
;
SELECT * FROM myTable;
SELECT * FROM myTable2;
DROP TABLE myTable;
DROP TABLE myTable2;
myTable will contain the following:
Column1 column2
1 abc
2 bcd
3 cde
4 def
myTable2 will only have the last 2 rows:
Column1 column2
3 cde
4 def
Edit: Just saw the bit about the Primary Key values. Does this mean you want to insert the data into an existing table, rather than just creating a backup set? If so, you can issue SET IDENTITY_INSERT myTable2 ON to allow for this.
However, be aware that might cause issues in case the id values you are trying to insert already exist.
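As a rough sketch of that IDENTITY_INSERT approach: the tables `dbo.SourceRows` and `dbo.TargetRows` and their columns are hypothetical, chosen only to make the example self-contained. The NOT EXISTS filter is one way to skip key values that are already present in the target.

```sql
-- Hypothetical source and target; the target's key is an IDENTITY column.
CREATE TABLE dbo.SourceRows (Id int PRIMARY KEY, Label nvarchar(50));
INSERT INTO dbo.SourceRows (Id, Label) VALUES (3, N'cde'), (4, N'def');

CREATE TABLE dbo.TargetRows (Id int IDENTITY(1,1) PRIMARY KEY, Label nvarchar(50));

SET IDENTITY_INSERT dbo.TargetRows ON;

-- An explicit column list is required while IDENTITY_INSERT is ON.
INSERT INTO dbo.TargetRows (Id, Label)
SELECT s.Id, s.Label
FROM   dbo.SourceRows AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.TargetRows AS t WHERE t.Id = s.Id);

SET IDENTITY_INSERT dbo.TargetRows OFF;

SELECT * FROM dbo.TargetRows;
```

Only one table per session can have IDENTITY_INSERT set to ON at a time, so it is worth switching it OFF again as soon as the insert is done.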
Consider the following query:
begin
;with
t1 as (
select top(10) x from tableX
),
t2 as (
select * from t1
),
t3 as (
select * from t1
)
-- --------------------------
select *
from t2
join t3 on t3.x=t2.x
end
go
I was wondering whether t1 is evaluated twice, so that tableX is read twice (which means t1 acts like a table),
or just once, with its rows saved in t1 for the whole query (like a variable in a programming language)?
I'm just trying to figure out how the T-SQL engine optimises this. It is important to know, because if t1 has millions of rows and is evaluated many times in the whole query, producing the same result each time, then there should be a better way to do it.
Just create the table:
CREATE TABLE tableX
(
x int PRIMARY KEY
);
INSERT INTO tableX
VALUES (1)
,(2)
Turn on the execution plan generation and execute the query. You will get something like this:
So, yes, the table is queried two times. If you are using complex common table expressions and working with a huge amount of data, I would advise storing the result in a temporary table.
Sometimes I get very bad execution plans for complex CTEs that worked nicely in the past. Also, you can define indexes on temporary tables and improve performance further.
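A minimal sketch of that suggestion, reusing tableX from the setup above. The temp table name `#t1` and the index name `IX_t1_x` are made up for illustration.

```sql
-- Materialize what the CTE t1 produced, exactly once.
SELECT TOP (10) x
INTO   #t1
FROM   tableX;

-- Optional: index the intermediate result before it is joined.
CREATE CLUSTERED INDEX IX_t1_x ON #t1 (x);

-- t2 and t3 from the question now both read the stored rows.
SELECT *
FROM   #t1 AS t2
JOIN   #t1 AS t3 ON t3.x = t2.x;

DROP TABLE #t1;
```

With this form tableX is read once, and both sides of the join are guaranteed to see the same ten rows.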
To be honest, there is no answer... The only answer is Race your horses (Eric Lippert).
The way you write your query does not tell you how the engine will execute it. This depends on many, many influences...
You tell the engine, what you want to get and the engine decides how to get this.
This may even differ between identical calls depending on statistics, currently running queries, existing cached results etc.
Just as a hint, try this:
USE master;
GO
CREATE DATABASE testDB;
GO
USE testDB;
GO
--I create a physical test table with 1.000.000 rows
CREATE TABLE testTbl(ID INT IDENTITY PRIMARY KEY, SomeValue VARCHAR(100));
WITH MioRows(Nr) AS (SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values v1 CROSS JOIN master..spt_values v2 CROSS JOIN master..spt_values v3)
INSERT INTO testTbl(SomeValue)
SELECT CONCAT('Test',Nr)
FROM MioRows;
--Now we can start to test this
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @dt DATETIME2 = SYSUTCDATETIME();
--Your approach with CTEs
;with t1 as (select * from testTbl)
,t2 as (select * from t1)
,t3 as (select * from t1)
select t2.ID AS t2_ID,t2.SomeValue AS t2_SomeValue,t3.ID AS t3_ID,t3.SomeValue AS t3_SomeValue INTO target1
from t2
join t3 on t3.ID=t2.ID;
SELECT 'Final CTE',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @dt DATETIME2 = SYSUTCDATETIME();
--Writing the intermediate result into a physical table
SELECT * INTO test1 FROM testTbl;
SELECT 'Write into test1',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
select t2.ID AS t2_ID,t2.SomeValue AS t2_SomeValue,t3.ID AS t3_ID,t3.SomeValue AS t3_SomeValue INTO target2
from test1 t2
join test1 t3 on t3.ID=t2.ID
SELECT 'Final physical table',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
GO
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
DECLARE @dt DATETIME2 = SYSUTCDATETIME();
--Same as before, but with a primary key on the intermediate table
SELECT * INTO test2 FROM testTbl;
SELECT 'Write into test2',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
ALTER TABLE test2 ADD PRIMARY KEY (ID);
SELECT 'Add PK',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
select t2.ID AS t2_ID,t2.SomeValue AS t2_SomeValue,t3.ID AS t3_ID,t3.SomeValue AS t3_SomeValue INTO target3
from test2 t2
join test2 t3 on t3.ID=t2.ID
SELECT 'Final physical table with PK',DATEDIFF(MILLISECOND,@dt,SYSUTCDATETIME());
--Clean up (Careful with real data!!!)
GO
USE master;
GO
--DROP DATABASE testDB;
GO
On my system the
first takes 674 ms, the
second 1,205 ms (297 ms for writing into test1) and the
third 1,727 ms (285 ms for writing into test2 and ~650 ms for creating the index).
Although the query is performed twice, the engine can take advantage of cached results.
Conclusion
The engine is really smart... Don't try to be smarter...
If the table had a lot more columns and much more data per row, the whole test might return something else...
If your CTEs (sub-queries) involve much more complex data with joins, views, functions and so on, the engine might have trouble finding the best approach.
If performance matters, you can race your horses to test it out. One hint: I have sometimes used a query hint quite successfully: FORCE ORDER. This performs the joins in the order specified in the query.
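For reference, FORCE ORDER goes in the OPTION clause at the end of the statement. The sketch below assumes the test script above has been run in testDB, so that test1 and test2 exist. Joining these two particular tables is only an illustration.

```sql
USE testDB;
GO
-- The optimizer keeps the join order as written: test1 first, then test2.
SELECT t2.ID, t3.SomeValue
FROM   test1 AS t2
JOIN   test2 AS t3 ON t3.ID = t2.ID
OPTION (FORCE ORDER);
```

Because the hint removes the optimizer's freedom to reorder joins, it is best measured against the unhinted query rather than applied by default.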
Here is a simple example to test the theories:
First, via temporary table which calls the matter only once.
declare @r1 table (id int, v uniqueidentifier);
insert into @r1
SELECT * FROM
(
select id=1, NewId() as 'v' union
select id=2, NewId()
) t
-- -----------
begin
;with
t1 as (
select * from @r1
),
t2 as (
select * from t1
),
t3 as (
select * from t1
)
-- ----------------
select * from t2
union all select * from t3
end
go
On the other hand, if we put the matter inside t1 instead of the temporary table, it gets called twice.
t1 as (
select id=1, NewId() as 'v' union
select id=2, NewId()
)
Hence, my conclusion is to use a temporary table and not rely on cached results.
Also, I've implemented this on a large-scale query that called the "matter" only twice, and after moving it to a temporary table the execution time was cut in half!
The problem I'm trying to solve is avoiding duplicate data getting into my table. I'm using XML to send bulk data to a stored procedure. The procedure I wrote works with 100 or 200 records, but with 20,000 of them there is a timeout exception.
This is the stored procedure:
DECLARE @TEMP TABLE (Page_No varchar(MAX))
DECLARE @TEMP2 TABLE (Page_No varchar(MAX))
INSERT INTO @TEMP(Page_No)
SELECT
CAST(CC.query('data(PageId)') AS NVARCHAR(MAX)) AS Page_No
FROM
@XML.nodes('DocumentElement/CusipsFile') AS tt(CC)
INSERT INTO @TEMP2(Page_No)
SELECT Page_No
FROM tbl_Cusips_Pages
INSERT INTO tbl_Cusips_Pages(Page_No, Download_Status)
SELECT Page_No, 'False'
FROM @TEMP
WHERE Page_No NOT IN (SELECT Page_No FROM @TEMP
INTERSECT
SELECT Page_No FROM @TEMP2)
How can I solve this? Is there a better way to write this procedure?
As was already suggested, NVARCHAR(MAX) column/variable is very slow and has limited options. If you can change it, it would help a lot.
MERGE tbl_Cusips_Pages
USING (
SELECT
CAST(CC.query('data(PageId)') AS NVARCHAR(4000))
FROM
@XML.nodes('DocumentElement/CusipsFile') AS tt(CC)
) AS source (Page_No)
ON tbl_Cusips_Pages.Page_No = source.Page_No
WHEN NOT MATCHED BY TARGET
THEN INSERT (Page_No, Download_Status)
VALUES (source.Page_No, 'false')
Anyway, your query is not that bad either; just put the queries directly into the third statement (the TEMP2 one for sure) instead of inserting the data into table variables. Table variables are quite slow in comparison.
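A possible shape for that "put the queries directly into the insert" suggestion, as a sketch rather than a drop-in replacement. It assumes tbl_Cusips_Pages (Page_No, Download_Status) exists as in the question. The sample XML values `P1` and `P2` are invented, and `.value()` is swapped in for the original `CAST(CC.query(...))` to read the element text.

```sql
-- Sample payload shaped like the question's XML; the PageId values are made up.
DECLARE @XML xml = N'<DocumentElement>
  <CusipsFile><PageId>P1</PageId></CusipsFile>
  <CusipsFile><PageId>P2</PageId></CusipsFile>
</DocumentElement>';

INSERT INTO tbl_Cusips_Pages (Page_No, Download_Status)
SELECT src.Page_No, 'False'
FROM (
    -- DISTINCT guards against the same PageId appearing twice in one payload.
    SELECT DISTINCT CC.value('(PageId/text())[1]', 'nvarchar(4000)') AS Page_No
    FROM   @XML.nodes('DocumentElement/CusipsFile') AS tt(CC)
) AS src
-- Skip pages that are already stored; no table variables involved.
WHERE NOT EXISTS (SELECT 1
                  FROM   tbl_Cusips_Pages AS p
                  WHERE  p.Page_No = src.Page_No);
```

An index on tbl_Cusips_Pages.Page_No would likely matter more for the existence check than the exact form of the query.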
Replace the last INSERT statement with the following script. I have replaced the IN clause with NOT EXISTS, which may give you better performance.
DECLARE @CommanPageNo TABLE (Page_No varchar(MAX))
INSERT INTO @CommanPageNo
SELECT Page_No FROM @TEMP
INTERSECT
SELECT Page_No FROM @TEMP2
INSERT INTO tbl_Cusips_Pages(Page_No, Download_Status)
SELECT t.Page_No, 'False'
FROM @TEMP AS t
-- Alias both sides so the comparison is between @TEMP and @CommanPageNo
WHERE NOT EXISTS (SELECT 1 FROM @CommanPageNo AS c WHERE c.Page_No = t.Page_No)
Below are two approaches to accomplishing one task.
The first selects from the table directly multiple times; the second selects the desired columns from the table into a table variable first and then uses that table variable multiple times. Which one would perform better, and why?
declare
@var1 varchar(10),
@var2 varchar(10)
----------------------------------------------------------------------------
-- 1st approach
----------------------------------------------------------------------------
select *
from tab1
where tab1.col1 in (select tab2.col1 from tab2 where tab2.col2 <> @var1) or
tab1.col2 in (select tab2.col2 from tab2 where tab2.col3 <> @var2)
----------------------------------------------------------------------------
-- 2nd approach
----------------------------------------------------------------------------
-- col3 is included because the second filter below reads t.col3
declare @tab2 table (col1 varchar(10), col2 varchar(10), col3 varchar(10))
insert into @tab2
select col1,
col2,
col3
from tab2
select *
from tab1
where tab1.col1 in (select t.col1 from @tab2 as t where t.col2 <> @var1) or
tab1.col2 in (select t.col2 from @tab2 as t where t.col3 <> @var2)
In my opinion the first approach will be faster and more efficient.
If you look at the execution plan, the second approach adds the extra cost of the table insert.
Execution Plan for first approach:
Execution Plan for second approach:
EDIT: I wasn't understanding the question. Forget my answer. Please.
I don't think there is any difference of performance between your 2 methods, because the only difference is a tiny request to retrieve your column, which should be negligible.
The bonus that you get with the second approach is that if you change your column names in the future, you won't need to update your script.
PS:
I think your query to retrieve your columns isn't quite right. You're not retrieving column names here, but data. I don't know your DBMS, but if it's Oracle, it should be something like:
SELECT column_name
FROM USER_TAB_COLUMNS
WHERE table_name = 'MYTABLE'
Why in the world would you think two selects are faster than one?
Why would you not just select col1, col2 from tab1 where ... ?
In both cases you have a select with a where clause.
A select on a table is faster than a select on a table variable.
So all you have done is add the overhead of inserting into a table variable to get a less efficient select.
A table variable is stored in tempdb.
Microsoft has all sorts of warnings on the use of table variables (see the documentation on table variables).
For one, it should not be used for more than 100 rows.
It does not have indexes unless you declare them inline, for example through a PRIMARY KEY or UNIQUE constraint.
Really, what if tab1 had a million rows and the where clause limited it to 10?
Do you really think inserting a million rows into @tab2 is going to make it faster?
I have a table which I need to copy records from back into itself. As part of that, I want to capture the new rows using an OUTPUT clause into a table variable so I can perform other operations on the rows as well in the same process. I want each row to contain its new key and the key it was copied from. Here's a contrived example:
INSERT
MyTable (myText1, myText2) -- myId is an IDENTITY column
OUTPUT
Inserted.myId,
Inserted.myText1,
Inserted.myText2
INTO
-- How do I get previousId into this table variable AND the newly inserted ID?
@MyTable
SELECT
-- MyTable.myId AS previousId,
MyTable.myText1,
MyTable.myText2
FROM
MyTable
WHERE
...
SQL Server barks if the number of columns on the INSERT doesn't match the number of columns from the SELECT statement. Because of that, I can see how this might work if I added a column to MyTable, but that isn't an option. Previously, this was implemented with a cursor which is causing a performance bottleneck -- I'm purposely trying to avoid that.
How do I copy these records while preserving the copied row's key in a way that will achieve the highest possible performance?
I'm a little unclear as to the context: is this in an AFTER INSERT trigger?
Anyway, I can't see any way to do this in a single call. The OUTPUT clause will only allow you to return rows that you have inserted. What I would recommend is as follows:
DECLARE @MyTable TABLE (
myID INT,
previousID INT,
myText1 VARCHAR(20),
myText2 VARCHAR(20)
)
INSERT @MyTable (previousID, myText1, myText2)
SELECT myID, myText1, myText2 FROM inserted
INSERT MyTable (myText1, myText2)
SELECT myText1, myText2 FROM inserted
-- @@IDENTITY now points to the last identity value inserted, so...
UPDATE m SET myID = i.newID
FROM @MyTable m, (SELECT @@IDENTITY - ROW_NUMBER() OVER(ORDER BY myID DESC) + 1 AS newID, myID FROM inserted) i
WHERE m.previousID = i.myID
...
Of course, you wouldn't put this into an AFTER INSERT trigger, because it will give you a recursive call, but you could do it in an INSTEAD OF INSERT trigger. I may be wrong on the recursive issue; I've always avoided the recursive call, so I've never actually found out. Using @@IDENTITY and ROW_NUMBER(), however, is a trick I've used several times in the past to do something similar.