CTE vs multiple inserts - sql-server

As we all know, CTEs were introduced in SQL Server 2005 and the crowd went wild.
I have a case where I'm inserting a whole lot of static data into a table. What I want to know is which of the following is faster and what other factors I should be aware of.
INSERT INTO MyTable (MyField) VALUES ('Hello')
INSERT INTO MyTable (MyField) VALUES ('World')
or
WITH MyCTE(Field1) AS (SELECT 'Hello' UNION SELECT 'World')
INSERT INTO MyTable (MyField) SELECT Field1 FROM MyCTE
I have an uncomfortable feeling the answer will be dependent on things like what triggers exist on MyTable...
(Also, I know and don't care that BULK INSERTing a CSV and any number of other methods are objectively faster and better ways of inserting static data. I specifically want to know the concerns I should be aware of with a CTE vs multiple inserts.)

Not sure which version of SQL Server you're using (2005 or 2008) - but no matter which version, I honestly don't see any big benefit in using a CTE for multiple inserts in this case. CTEs are indeed great for a great many situations - but this is not one of them.
So basically, I would suggest you just use several INSERT statements.
In SQL Server 2008, you could simplify those by just specifying multiple values tuples:
INSERT INTO MyTable (MyField)
VALUES ('Hello'), ('World'), ('and outer space')
As always, your table structure and the presence (or absence) of indexes and triggers have a significant impact on your INSERT speed. If you need to load a lot of data, it's sometimes easier to turn constraints and triggers OFF for the duration of the INSERT and then back on - but again: there's really no way to give a clear indication of whether that's worthwhile in your specific situation - just too many variables we don't know about play a significant role. Measure it, compare it, make a decision for yourself!
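If you do go down the road of disabling triggers and constraints for a big load, a minimal sketch looks like the following (assuming the table is dbo.MyTable and you have permission to ALTER it; re-enabling with WITH CHECK re-validates the constraints afterwards):

-- Sketch only: disable triggers and constraint checking for the load.
ALTER TABLE dbo.MyTable DISABLE TRIGGER ALL;
ALTER TABLE dbo.MyTable NOCHECK CONSTRAINT ALL;

INSERT INTO dbo.MyTable (MyField)
VALUES ('Hello'), ('World'), ('and outer space');

-- Re-enable and re-validate once the load is done.
ALTER TABLE dbo.MyTable WITH CHECK CHECK CONSTRAINT ALL;
ALTER TABLE dbo.MyTable ENABLE TRIGGER ALL;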

Related

Improving Merge Performance with Change Data Capture and Hash

Today I'm trying to tune the performance of an audit database. I have a legal reason for tracking changes to rows, and I've implemented a set of tables using system-versioned temporal tables in SQL Server 2016.
My overall process lands "RAW" data into an initial table from a source system. From there, a MERGE process takes data from the RAW table, compares every column against what exists in the audit-able system-versioned staging table, and decides what has changed. System row versioning then tells me what has changed and what hasn't.
The trouble with this approach is that my tables are very wide. Some of them have 400 columns or more. Even tables with 450,000 records take SQL Server about 17 minutes to perform a MERGE operation. It's really slowing down the performance of our solution, and it seems it would help greatly if we could speed it up. We presently have hundreds of tables we need to do this for.
At the moment both the RAW and STAGE tables are indexed on an ID column.
I've read in several places that we might consider using a CHECKSUM or HASHBYTES function to record a value in the RAW extract. (What would you call this? GUID? UUID? Hash?) We'd then compare the calculated value to what exists in the STAGE table. But here's the rub: there are often quite a few NULL values across many columns. It's been suggested that we cast all the column types to the same type (nvarchar(max)), because NULL values seem to cause the entire computation of the checksum to fall flat. So I'm also coding lots of ISNULL(..., 'UNKNOWN') statements into my code.
So - are there better methods for improving the performance of the MERGE here? I thought I could use a row-updated timestamp column as a single value to compare instead of the checksum, but I'm not certain that would pass legal muster/scrutiny. Legal is concerned that rows may be edited outside of an interface and the column wouldn't always be updated. I've seen approaches where developers use a concatenation function (shown below) to combine many column values together. That seems code intensive, and expensive to compute and cast columns for.
So my questions are:
Given the situational reality, can I improve MERGE performance in any way here?
Should I use a checksum, or hashbytes, and why?
Which HASHBYTES method makes the most sense here? (I'm only comparing one RAW row to another STAGE row based on an ID match, right?)
Did I miss something with functions that might make this comparison faster or easier in the reading I have done? It seems odd there aren't better functions besides CONCAT available to do this in SQL Server.
I wrote the below code to show some of the ideas I am considering. Is there something better than what I wrote below?
DROP TABLE IF EXISTS MyTable;

CREATE TABLE MyTable
(
    C1 VARCHAR(10),
    C2 VARCHAR(10),
    C3 VARCHAR(10)
);

INSERT INTO MyTable (C1, C2, C3)
VALUES
    (NULL, NULL, NULL),
    (NULL, NULL, 3),
    (NULL, 2, 3),
    (1, 2, 3);
SELECT
    HASHBYTES('SHA2_256',
        CONCAT(C1, '-',
               C2, '-',
               C3)) AS HashbytesValueNoCastNoNullCheck,
    HASHBYTES('SHA2_256',
        CONCAT(CAST(C1 AS varchar(max)), '-',
               CAST(C2 AS varchar(max)), '-',
               CAST(C3 AS varchar(max)))) AS HashbytesValueCastWithNoNullCheck,
    HASHBYTES('SHA2_256',
        CONCAT(ISNULL(CAST(C1 AS varchar(max)), 'UNKNOWN'), '-',
               ISNULL(CAST(C2 AS varchar(max)), 'UNKNOWN'), '-',
               ISNULL(CAST(C3 AS varchar(max)), 'UNKNOWN'))) AS HashbytesValueWithCastWithNullCheck,
    CONCAT(ISNULL(CAST(C1 AS varchar(max)), 'UNKNOWN'), '-',
           ISNULL(CAST(C2 AS varchar(max)), 'UNKNOWN'), '-',
           ISNULL(CAST(C3 AS varchar(max)), 'UNKNOWN')) AS StringValue,
    CONCAT(C1, '-', C2, '-', C3) AS ConcatString,
    C1,
    C2,
    C3
FROM
    MyTable;
Given the situational reality, can I improve MERGE performance in any way here?
You should test, but storing a hash for every row, computing the hash for the incoming rows, and comparing on (key, hash) should be cheaper than comparing every column.
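As a rough illustration only (this is not your actual schema - the table names, payload columns and the persisted RowHash column are assumptions), the comparison might end up looking something like this:

-- Sketch: both tables carry a non-null RowHash computed from the payload columns,
-- so the MERGE only compares (ID, RowHash) instead of 400+ columns.
MERGE dbo.Stage AS tgt
USING dbo.Raw AS src
    ON tgt.ID = src.ID
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
    UPDATE SET tgt.Col1    = src.Col1,
               tgt.Col2    = src.Col2,
               tgt.RowHash = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, Col1, Col2, RowHash)
    VALUES (src.ID, src.Col1, src.Col2, src.RowHash);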
Should I use a checksum, or hashbytes, and why?
HASHBYTES has a much lower probability of missing a change. Roughly: with CHECKSUM you'll probably eventually miss a change or two; with HASHBYTES you almost certainly never will. See the remarks here: BINARY_CHECKSUM.
Did I miss something with functions that might make this comparison faster or easier in the reading I have done?
No. There's no special way to compare multiple columns.
Is there something better than what I wrote below?
You definitely should replace NULLs, otherwise rows like (1, NULL, 'A') and (1, 'A', NULL) would get the same hash. You should also replace NULLs, and delimit, with something that won't appear as a value in any column. And if you have Unicode text, converting to varchar may erase some changes, so it's safer to use nvarchar, e.g.:
HASHBYTES('SHA2_256',
CONCAT(ISNULL(CAST(C1 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C2 as nvarchar(max)),N'~'),N'|',
ISNULL(CAST(C3 as nvarchar(max)),N'~'))) AS HashbytesValueWithCastWithNullCheck
JSON in SQL Server is very fast. So you might try a pattern like:
select t.Id, z.RowJSON, hashbytes('SHA2_256', RowJSON) RowHash
from SomeTable t
cross apply (select t.* for json path) z(RowJSON)

SQL Server import wizard and temp tables

I have a VERY long query, so to make it readable I divided it in two.
SELECT field_1, field_2
INTO #temp
FROM table;
and
SELECT field_1, field_2, field_1b
FROM #temp TMP
INNER JOIN table_2 ON TMP.field_2 = table_2.field_2b;
That works fine in SQL Server Management Studio.
Now I need to make a job that loads this data to another database. I have some import-export wizards projects that work without problem.
In this particular case I can't make the wizard work; it throws an error on #temp.
I tried
set fmtonly off
but I get a timeout (the timeout value is set to 0).
Source is SQL Server 2014 (v12)
Destination is SQL Server 2016 (v13)
Any idea how I can make this work? My last resort is to make one query out of the two, but I'd like to maintain some order and readability if possible.
Thank you!
If you want to split your query into two purely for readability purposes, then there are alternative ways of formatting your query, such as a CTE:
WITH t1 AS (
SELECT field_1, field_2
FROM table
)
SELECT t1.field_1, t1.field_2, table_2.field_1b
FROM t1
INNER JOIN table_2 ON t1.field_2 = table_2.field_2b;
While I can't begin to speculate on exact numbers (because I know nothing of your actual query or your underlying schema), this may well improve performance as well, because it removes the overhead of populating a temp table. In general, you shouldn't be compromising performance just to make your query 'readable'.
First and foremost, please post the error message. That is most often the most enlightening point of a question, and it can enable others to help you along.
Since you're using SSIS, please copy/paste what gets printed in the "output" window. That's where your error messages would go.
A few points
fmtonly has a rather different purpose: the difference between fmtonly on/off is whether you get only headers, or headers and data. Please see the link below with the documentation for fmtonly:
https://learn.microsoft.com/en-us/sql/t-sql/statements/set-fmtonly-transact-sql?view=sql-server-2017
Have you tried alternatives to the temp table solution? That seems to be the most likely culprit. Here are some alternative ideas:
using a staging table (a permanent table used for ETL); you would truncate it, populate it using the two queries mentioned in your question, pass it along to the destination server and voila! (see the sketch after this list)
Using a CTE. This can avoid the temp table issues, though they aren't the easiest to read.
Not breaking up the query for readability. This may not be fun, though it will eliminate the trouble.
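A minimal sketch of the staging-table idea, assuming a permanent table called dbo.Staging_MyExtract (the name and its columns are placeholders - adjust to your real schema):

-- Step 1: reset and repopulate the permanent staging table.
TRUNCATE TABLE dbo.Staging_MyExtract;

INSERT INTO dbo.Staging_MyExtract (field_1, field_2)
SELECT field_1, field_2
FROM [table];

-- Step 2: the second query reads from the permanent table instead of #temp,
-- so the import wizard / SSIS source can resolve its metadata.
SELECT s.field_1, s.field_2, t2.field_1b
FROM dbo.Staging_MyExtract s
INNER JOIN table_2 t2 ON s.field_2 = t2.field_2b;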
I have other points which may help you along, but without the error message this is enough to give you a head start.
It's hard to identify the problem when you haven't posted how you load the data into the other database.
But I suggest you use a semi-permanent temp table.
That is, create a real table at the start of the job and drop that table when the job is done. Remember, you can have multiple steps in the job; it doesn't all have to be in the same query.

Eliminating code duplication when querying multiple tables with the same schema

I've inherited some code which uses multiple tables to store the same information depending on how old it is (one for the current day, the last month, etc.).
Currently most of the code is duplicated for every condition, and I'd like to try and eliminate the majority of the duplication in the stored procedures. Right now re-architecting the design is not an option as there are a number of applications that depend on the current design that I have no control over.
One option I've tried so far is loading the needed data into a temp table, which I found to have a rather large performance hit. I've also tried using a CTE structured like this:
;WITH cte_table(...)
AS
(
    SELECT ...
    FROM a
    WHERE @queried_date = CONVERT(DATE, GETDATE())
    UNION ALL
    SELECT ...
    FROM b
    WHERE @queried_date BETWEEN --some range
)
This works and the performance isn't terrible, but it's not very nice looking.
Could anyone offer a better alternative?
Two suggestions:
Just use UNION, not UNION ALL. The UNION operator removes duplicates in that case. UNION ALL preserves dupes.
Using the CTE, the SELECT clause on the outside / end can have a DISTINCT operator to bring back unique rows. Of course, I'm not sure why you'd be using a CTE in this scenario since UNION should work just fine. (In fact, I believe SQL Server will optimize the query to the same plan structure either way...)
Any way you slice it, if you have duplicate data, you either have to do something like the above, or you have to write explicit clauses that remove the dupe cases, using things like #temp tables or WHERE ... NOT IN ().
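For what it's worth, a minimal sketch of the DISTINCT-over-CTE shape described above (the table names a and b come from the question; the columns, parameter and date range are placeholders):

DECLARE @queried_date DATE = CONVERT(DATE, GETDATE());
DECLARE @range_start  DATE = DATEADD(MONTH, -1, @queried_date);
DECLARE @range_end    DATE = @queried_date;

;WITH cte_table (col1, col2) AS
(
    SELECT col1, col2
    FROM a
    WHERE @queried_date = CONVERT(DATE, GETDATE())
    UNION ALL
    SELECT col1, col2
    FROM b
    WHERE @queried_date BETWEEN @range_start AND @range_end
)
SELECT DISTINCT col1, col2   -- DISTINCT removes any duplicates UNION ALL lets through
FROM cte_table;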

Handling a huge amount of data in one table

I would like to ask a couple of questions about how to handle roughly 100 million rows in a single table.
The table will be subject to INSERT, SELECT & UPDATE.
I have been advised to index the table and to archive it into a couple of tables.
Any other suggestions that can help tweak the SQL performance?
Case:
SQL Server 2008.
Most of the updates are to decimal values, and to a tinyint status column.
The INSERTs will not use BULK INSERT, since I'm assuming a lot of users (let's say 10,000-500,000 per minute) performing INSERT and UPDATE statements against the table.
You should consider what kind of columns you have.
The more nvarchar/text/etc columns you have included in the different indexes, the slower the index will be.
Also what RDBMS are you going to use? You have different options based on SQL Server, Oracle and MySQL...
But the crucial thing is definitely to build the right indexes for the queries you will run...
One other thing, you could use BULK INSERT on SQL Server to speed up the inserts.
But ask away, I have dealt with databases being populated with 70 million rows per day ;)
EDIT ----- after more information was added
I'll try to take a slightly different approach to the case and compare it to data scraping.
There is no doubt that INSERTs are faster than UPDATEs, so you might want to make a table that acts as a "collect" table. What I mean is a table that only ever gets inserts; no updates, everything is handled with inserts.
Then you use a trigger/event/scheduler to process what has arrived in that table and populate what you need into another table (or tables).
This way you can apply a little business logic in the "cleanup" (update), keep the load on the DB server down, and avoid holding up a connection while these things are done.
This of course also depends on what the "final" data are to be used for...
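As a rough sketch of that insert-only "collect" table idea (every name and column type below is made up for illustration; a real job would also track which collected rows it has already processed):

-- Writers only ever INSERT into the collect table.
CREATE TABLE dbo.MeasurementCollect
(
    CollectId  BIGINT IDENTITY(1,1) PRIMARY KEY,
    ItemId     INT            NOT NULL,
    Amount     DECIMAL(18, 4) NOT NULL,
    StatusCode TINYINT        NOT NULL,
    InsertedAt DATETIME2      NOT NULL DEFAULT SYSUTCDATETIME()
);

-- A scheduled job (e.g. SQL Server Agent) periodically folds the collected rows
-- into the main table, then clears what it has processed.
-- (If an ItemId can appear more than once, pick only the latest collected row per ItemId first.)
UPDATE m
SET    m.Amount     = c.Amount,
       m.StatusCode = c.StatusCode
FROM   dbo.MainTable m
JOIN   dbo.MeasurementCollect c ON c.ItemId = m.ItemId;

DELETE FROM dbo.MeasurementCollect;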
Clearly SQL 2008 is capable of 100 million records, but there are a lot of details to look at that just do not come into play at 100 thousand. Pick a good primary key. Fill factor. Other indexes (they will slow down inserts but speed up selects). Concurrency (locking). If you can accept dirty reads, that will help performance. This question needs a lot more detail. You need to post the table design and your SELECT, UPDATE, and INSERT TSQL statements. I did not vote your question down, but if you don't provide more detail it will get voted down.
For inserts, be aware that you can insert multiple rows at once, which is much faster than multiple INSERT statements if BULK INSERT is not an option.
INSERT INTO Production.UnitMeasure
VALUES (N'FT2', N'Square Feet ', '20080923'), (N'Y', N'Yards', '20080923'), (N'Y3', N'Cubic Yards', '20080923');

Use of With Clause in SQL Server

How does the WITH clause work in SQL Server? Does it really give me some performance boost, or does it just help to make scripts more readable?
When is it right to use it? What should you know about the WITH clause before you start to use it?
Here's an example of what I'm talking about:
http://www.dotnetspider.com/resources/33984-Use-With-Clause-Sql-Server.aspx
I'm not entirely sure about performance advantages, but I think it can definitely help in the case where using a subquery results in the subquery being performed multiple times.
Apart from that it can definitely make code more readable, and can also be used in the case where multiple subqueries would be a cut and paste of the same code in different places.
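For example, here is a sketch of the "same subquery in two places" case (the table dbo.Orders and its columns are made up for illustration):

-- The grouped totals are needed twice; naming them once in a CTE avoids
-- pasting the same subquery into both places.
WITH DeptTotals (DeptId, Total) AS
(
    SELECT DeptId, SUM(Amount)
    FROM dbo.Orders
    GROUP BY DeptId
)
SELECT d.DeptId, d.Total
FROM DeptTotals d
WHERE d.Total > (SELECT AVG(Total) FROM DeptTotals);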
What should you know before you use it?
A big downside is that when you have a CTE in a view, you cannot create a clustered index on that view. This can be a big pain because SQL Server does not have materialised views, and has certainly bitten me before.
Unless you use its recursive abilities, a CTE is no better performance-wise than a simple inline view.
It just saves you some typing.
The optimizer is free to decide whether or not to reevaluate it when it's reused, and in most cases it decides to reevaluate:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will return you two different NEWIDs.
Note that other engines may behave differently.
PostgreSQL, unlike SQL Server, materializes the CTEs.
Oracle supports a special hint, /*+ MATERIALIZE */, that tells the optimizer whether it should materialize the CTE or not.
WITH introduces a common table expression (CTE): a named result set that exists only for the duration of the statement that follows it (it is not actually stored in a temporary table). Example:
-- a is the CTE name, id is its column
WITH a (id) AS
(
    SELECT column_name FROM table_name
)
SELECT * FROM a;
