I've developed a couple of T-SQL stored procedures that iterate over a fair bit of data. The first one takes a couple of minutes to run over a year's worth of data which is fine for my purposes. The second one, which uses the same structure/algorithm, albeit over more data, takes two hours, which is unbearable.
I'm using SQL Server and Query Analyzer. Are there any profiling tools, and, if so, how do they work?
Alternatively, any thoughts on how to improve the speed, based on the pseudo-code below? In short, I use a cursor to iterate over the data from a straightforward SELECT (from a few joined tables). Then I build an INSERT statement based on the values and INSERT the result into another table. Some of the SELECTed variables require a bit of manipulation before INSERTion. This includes extracting some date parts from a date value, some basic float operations, and some string concatenation.
--- Rough algorithm / pseudo-code
DECLARE <necessary variables>
DECLARE @cmd varchar(1000)
DECLARE @insert varchar(100)
SET @insert = 'INSERT INTO MyTable (COL1, COL2, ..., COLN) VALUES ('
DECLARE MyCursor CURSOR FOR
SELECT <columns> FROM TABLE_1 t1
INNER JOIN TABLE_2 t2 ON t1.key = t2.foreignKey
INNER JOIN TABLE_3 t3 ON t2.key = t3.foreignKey
OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @VAL1, @VAL2, ..., @VALn
WHILE @@FETCH_STATUS = 0
BEGIN
SET @F = @VAL2 / 1.1 --- float op
SET @S = @VAL3 + ' ' + @VAL1 --- string concatenation
SET @cmd = @insert
SET @cmd = @cmd + CAST(DATEPART(yy, @VAL1) AS varchar) + ', ' --- extract a date part
SET @cmd = @cmd + STR(@F) + ', '
SET @cmd = @cmd + @S
SET @cmd = @cmd + ')'
EXEC (@cmd)
FETCH NEXT FROM MyCursor INTO @VAL1, @VAL2, ..., @VALn
END
CLOSE MyCursor
DEALLOCATE MyCursor
The first thing to do - get rid of the cursor...
INSERT INTO MyTable (COL1, COL2, ..., COLN)
SELECT ...cols and manipulations...
FROM TABLE_1 t1
INNER JOIN TABLE_2 t2 on t1.key = t2.foreignKey
INNER JOIN TABLE_3 t3 on t2.key = t3.foreignKey
Most things should be possible directly in T-SQL (it is hard to be definite without an example), and you could consider a UDF for more complex operations.
Lose the cursor. Now. (See here for why: Why is it considered bad practice to use cursors in SQL Server?).
Without being rude, you seem to be taking a procedural programmer's approach to SQL, which is pretty much always going to be sub-optimal.
If what you're doing is complex and you're not confident I'd do it in three steps:
1) Select the core data into a temporary table using INSERT ... SELECT or SELECT ... INTO.
2) Use UPDATE to do the manipulation. You may be able to do this just by updating existing columns, or you may need to add a few extra ones in the right format when you create the temporary table. You can use multiple UPDATE statements to break it down further if you want.
3) Select it out into wherever you want it.
If you want to call it all as one step then you can then wrap the whole thing up into a stored procedure.
This makes it easy to debug and easy for someone else to work with if they need to. You can break your updates down into individual steps so you can quickly identify what's gone wrong where.
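A minimal sketch of those three steps, with hypothetical table and column names standing in for the real ones:

```sql
-- Step 1: pull the core data into a temp table (column names are hypothetical)
SELECT t1.SomeDate, t2.SomeFloat, t3.SomeText,
       CAST(NULL AS varchar(50)) AS ConcatCol   -- extra column for step 2
INTO #staging
FROM TABLE_1 t1
INNER JOIN TABLE_2 t2 ON t1.key = t2.foreignKey
INNER JOIN TABLE_3 t3 ON t2.key = t3.foreignKey;

-- Step 2: do the manipulation in place (can be split into several UPDATEs)
UPDATE #staging
SET SomeFloat = SomeFloat / 1.1,
    ConcatCol = SomeText + ' ' + CONVERT(varchar(10), SomeDate, 120);

-- Step 3: move the finished rows to the destination
INSERT INTO MyTable (COL1, COL2, COL3)
SELECT SomeDate, SomeFloat, ConcatCol
FROM #staging;
```

Because each step is a separate statement, you can inspect `#staging` between steps to see exactly where a value goes wrong.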
That said, from the looks of it I believe what you're doing can in fact be done in a single insert statement. It might not be attractive, but it could be done:
INSERT INTO NewTable
SELECT DATEPART(yy, VAL1) DateCol,
STR(VAL2 / 1.1) FloatCol,
VAL3 + ' ' + VAL1 ConcatCol
FROM TABLE_1 t1
INNER JOIN TABLE_2 t2 on t1.key = t2.foreignKey
INNER JOIN TABLE_3 t3 on t2.key = t3.foreignKey
DateCol, FloatCol and ConcatCol are whatever names you want the columns to have. Although they're not strictly needed, it's best to assign them, as (a) it makes it clearer what you're doing and (b) some languages struggle with unnamed columns (and handle them in a very unclear way).
Get rid of the cursor and the dynamic SQL:
INSERT INTO MyTable
(COL1, COL2, ... COLN)
SELECT
<columns>
,DATEPART(yy, VAL1) AS DateCol
,STR(VAL2 / 1.1) AS FloatCol
,VAL3 + ' ' + VAL1 AS ConcatCol
FROM TABLE_1 t1
INNER JOIN TABLE_2 t2 on t1.key = t2.foreignKey
INNER JOIN TABLE_3 t3 on t2.key = t3.foreignKey
Are there any profiling tools, and, if so, how do they work?
To answer your question regarding query tuning tools, you can use TOAD for SQL Server to assist in query tuning.
I really like this tool as it will run your SQL statement something like 20 different ways and compare execution plans for you to determine the best one. Sometimes I'm amazed at what it does to optimize my statements, and it works quite well.
More importantly, I've used it to become a better T-SQL writer, as I apply its tips to future scripts that I write. I don't know how TOAD would work with this script because, as others have mentioned, it uses a cursor, and I don't use them, so I have never tried to optimize one.
TOAD is a huge toolbox of SQL Server functionality, and query optimization is only a small part. Incidentally, I am not affiliated with Quest Software in any way.
SQL Server also comes with a profiling tool called SQL Server Profiler. It's the first pick on the menu under Tools in SSMS.
There are ~10 different subquestions that could be answered here, but the main question is in the title. TLDR version: I have a table like the example below and I want to replace all double quote marks across the whole table. Is there a simple way to do this?
My solution using cursor seems fairly straightforward. I know there's some CURSOR hatred in the SQL Server community (bad runtime?). At what point (num rows and/or num columns) would CURSOR stink at this?
Create Reproducible Example Table
DROP TABLE IF EXISTS #example;
CREATE TABLE #example (
NumCol INT
,CharCol NVARCHAR(20)
,DateCol NVARCHAR(100)
);
INSERT INTO #example VALUES
(1, '"commas, terrible"', '"2021-01-01 20:15:57,2021:04-08 19:40:50"'),
(2, '"loadsrc,.txt"', '2020-01-01 00:00:05'),
(3, '".txt,from.csv"', '1/8/2021 10:14');
Right now, my identified solutions are:
Manually update for each column UPDATE X SET CharCol = REPLACE(CharCol, '"',''). Horribly annoying to do at any more than 2 columns IMO.
Use a CURSOR to update (similar to the annoyingly complicated-looking solution at SQL Server - SQL Replace on all columns in all tables across an entire DB).
REPLACE character using CURSOR
This gets a little convoluted with all the cursor-related script, but seems to work well otherwise.
-- declare variable to store colnames, cursor to filter through list, string for dynamic sql code
DECLARE @colname SYSNAME -- was VARCHAR(10); column names can be up to 128 characters
,@sql VARCHAR(MAX)
,@namecursor CURSOR;
-- run cursor and set colnames and update table
SET @namecursor = CURSOR FOR SELECT ColName FROM #colnames;
OPEN @namecursor;
FETCH NEXT FROM @namecursor INTO @colname;
WHILE (@@FETCH_STATUS <> -1) -- alt: WHILE @@FETCH_STATUS = 0
BEGIN
SET @sql = 'UPDATE #example SET ' + QUOTENAME(@colname) + ' = REPLACE(' + QUOTENAME(@colname) + ', ''"'', '''')';
EXEC(@sql); -- parentheses VERY important: EXEC(sql-as-string) NOT EXEC storedprocedure
FETCH NEXT FROM @namecursor INTO @colname;
END;
CLOSE @namecursor;
DEALLOCATE @namecursor;
GO
-- see results
SELECT * FROM #example
Subquestion: While I've seen it in our database elsewhere, for this particular example I'm opening a .csv file in Excel and exporting it as tab delimited. Is there a way to change the settings to export without the double quotes? If I remember correctly, BULK INSERT doesn't have a way to handle that or a way to handle importing a csv file with extra commas.
And yes, I'm going to pretend that I'm fine that there's a list of datetimes in the date column (necessitating varchar data type).
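On the BULK INSERT subquestion: if you are on SQL Server 2017 or later, BULK INSERT can strip the quotes itself when told the file is CSV. A sketch, with a hypothetical file path:

```sql
BULK INSERT #example
FROM 'C:\data\example.csv'  -- hypothetical path
WITH (
    FORMAT = 'CSV',         -- parse as RFC 4180 CSV, so embedded commas are handled
    FIELDQUOTE = '"',       -- strip the double-quote qualifiers on import
    FIRSTROW = 2            -- skip a header row, if present
);
```

On older versions, a format file or a staging-table-plus-REPLACE approach is still needed.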
Why not just dynamically build the SQL?
Presumably it's a one-time task, so just run the query below for your table, paste the output into SSMS, and run it. If not, you could build an automated process to execute it. It would of course be better to properly sanitize the data when inserting it in the first place!
select
'update <table> set ' +
String_Agg(QuoteName(COLUMN_NAME) + '=Replace(' + QuoteName(column_name) + ',''"'','''')',',')
from INFORMATION_SCHEMA.COLUMNS
where table_name='<table>' and TABLE_SCHEMA='<schema>' and data_type in ('varchar','nvarchar')
example DB<>Fiddle
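If it turns out not to be a one-time task, the generated statement can be built and executed in one batch. A sketch (needs SQL Server 2017+ for STRING_AGG; table and schema names are placeholders):

```sql
DECLARE @sql nvarchar(max);

SELECT @sql = 'UPDATE ' + QUOTENAME(TABLE_SCHEMA) + '.' + QUOTENAME(TABLE_NAME) + ' SET '
    + STRING_AGG(QUOTENAME(COLUMN_NAME)
        + ' = REPLACE(' + QUOTENAME(COLUMN_NAME) + ', ''"'', '''')', ', ')
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'MyTable'       -- placeholder table name
  AND TABLE_SCHEMA = 'dbo'         -- placeholder schema
  AND DATA_TYPE IN ('varchar', 'nvarchar')
GROUP BY TABLE_SCHEMA, TABLE_NAME;

EXEC sp_executesql @sql;
```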
You might try this approach, not fast, but easy to type (or generate).
SELECT NumCol = y.value('(NumCol/text())[1]','int')
,CharCol = y.value('(CharCol/text())[1]','nvarchar(100)')
,DateCol = y.value('(DateCol/text())[1]','nvarchar(100)')
FROM #example e
CROSS APPLY(SELECT e.* FOR XML PATH('')) A(x)
CROSS APPLY(SELECT CAST(REPLACE(A.x,'"','') AS XML)) B(y);
The idea in short:
The first APPLY will transform all columns to a root-less XML.
Without using ,TYPE this will be of type nvarchar(max) implicitly
The second APPLY will first replace any " in the whole text (which is one row actually) and cast this to XML.
The SELECT uses .value to fetch the values type-safe from the XML.
Update: Just add INTO dbo.SomeNotExistingTableName right before FROM to create a new table with this data. This looks better than updating the existing table (might be a #-table too). I'd see this as a staging environment...
Good luck, messy data is always a pain in the neck :-)
My goal is to find the affected stored procedures across multiple databases when a table or view is updated. I am working in an environment with multiple databases where these stored procedures can exist. Below is a query that can do what I want for one database.
How can I achieve the results without having to change the USE statement to DatabaseB, DatabaseC, etc. or lengthy queries involving UNIONs?
USE DatabaseA
SELECT
O.name
, O.type_desc
, M.definition
FROM sys.sql_modules M
LEFT JOIN sys.objects O
ON O.object_id = M.object_id
WHERE 1=1
AND definition LIKE '%Error%'
I have played around with looping, but to no avail.
DECLARE @name sysname; -- holds each database name
DECLARE name_Cursor CURSOR FOR
SELECT name
FROM master.dbo.sysdatabases;
OPEN name_Cursor;
FETCH NEXT FROM name_Cursor INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-database work would go here
    FETCH NEXT FROM name_Cursor INTO @name;
END;
CLOSE name_Cursor;
DEALLOCATE name_Cursor;
GO
If you took your existing cursor (from your second code sample), and you plugged your query (from your first code sample) in as a dynamic query, it should work:
SET @SQL = '
USE ' + QUOTENAME(@DatabaseName) + '
SELECT
O.name
...'
If you are wanting to run this query against ALL of your databases, a possibly easier solution would be to use the undocumented system stored procedure sp_msforeachdb.
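For example, plugging the query from the question into sp_msforeachdb might look roughly like this ('?' is the placeholder sp_msforeachdb replaces with each database name):

```sql
EXEC sp_msforeachdb N'
USE [?];
SELECT ''?'' AS database_name, O.name, O.type_desc
FROM sys.sql_modules M
LEFT JOIN sys.objects O ON O.object_id = M.object_id
WHERE M.definition LIKE ''%Error%'';
';
```

Keep in mind it is undocumented and unsupported, so behavior can change between versions.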
I know you are reluctant to use a UNION, but it can be helpful for your situation. If you have the luxury of using Linked Servers, you can use a UNION with fully qualified paths. I would also recommend using INFORMATION_SCHEMA.ROUTINES if you can.
SELECT ROUTINE_NAME FROM [DB1].INFORMATION_SCHEMA.ROUTINES WHERE ROUTINE_DEFINITION LIKE '%Error%'
UNION ALL
SELECT ROUTINE_NAME FROM [DB2].INFORMATION_SCHEMA.ROUTINES WHERE ROUTINE_DEFINITION LIKE '%Error%'
You can also use a memory-optimized table approach:
https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/cross-database-queries
If you need to do something for each proc found, use this as the Cursor's SELECT statement.
Background
This question is a follow-up to a previous question. To give you context here as well, let me summarize it: in my previous question I wanted a way to execute selections without sending their results to the client. The goal was to measure performance without eating up a lot of resources by sending millions of rows. I am only interested in the time needed to execute the queries, not in the time needed to send the results to the client app. I intend to optimize the queries, so their results will not change at all, but the methodology will, and I want to be able to compare the methodologies.
Current knowledge
In my other question several ideas were presented. One was to select the count of the records into a variable; however, that changed the query plan significantly, so the results were not accurate in terms of performance. The idea of using a temporary table was presented as well, but creating a temporary table and inserting into it is difficult if we do not know in advance which query we will measure, and it also introduces a lot of white noise, so, even though the idea was creative, it was not ideal for my problem. Finally, Vladimir Baranov came up with the idea of creating as many variables as there are columns in the selection. That was a great idea, but I refined it further by creating a single variable of nvarchar(max) and selecting all my columns into it. The idea works great, except for a few problems. I have the solution for most of them, but I would like to share them, so I will describe them regardless; do not misunderstand me, though, I have a single question.
Problem1
If I have a @container variable and I do a @container = columnname inside each selection, then I will have conversion problems.
Solution1
Instead of just doing a @container = columnname, I need to do a @container = cast(columnname as nvarchar(max)).
Problem2
I will need to convert <whatever> as something into @container = cast(<whatever> as nvarchar(max)) for each column in the selection, but not for subselections, and I will need a general solution that handles CASE WHEN and parentheses. I do not want any instances of @container = anywhere, except to the left of the columns of the main selection.
Solution2
Since I am clueless about regular expressions, I can solve this by iterating over the query string until I find the FROM of the main query; each time I find a parenthesis, I will do nothing until it is closed. I will find the indexes where @container = should be put and where as [customname] should be taken out, and apply those changes in the query string from right to left. This will be long and inelegant code.
Question
Is it possible to make sure that all my main columns, but nothing else, start with @container = and end without as [Customname]?
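For readers landing here, the variable-assignment technique under discussion looks like this (a sketch against a hypothetical two-column query):

```sql
DECLARE @container nvarchar(max);

-- original query: SELECT col1, col2 FROM dbo.SomeTable
SELECT @container = CAST(col1 AS nvarchar(max)),
       @container = CAST(col2 AS nvarchar(max))
FROM dbo.SomeTable; -- every row is processed, but nothing is sent to the client
```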
This is much too long for a comment but I'd like to add my $.02 to the other answers and share the scripts I used to test the suggested methods.
I like @MartinSmith's TOP 0 solution but am concerned that it could result in a different execution plan shape in some cases. I didn't see this in the tests I ran, but I think you'll need to verify the plan is about the same as the unmolested query for each query you test. However, the results of my tests suggest the number of columns and/or data types might skew performance with this method.
The SQLCLR method in @VladimirBaranov's answer should provide the exact plan the app code generates (assuming identical SET options for the test), though there will still be some slight overhead (YMMV) from SqlClient consuming results within the SQLCLR. There will be less server overhead with this method compared to returning results back to the calling application.
The SSMS discard results method I suggested in my first comment will incur more overhead than the other methods but does include the server-side work SQL Server will perform not only in running the query, but also filling buffers for the returned result. Whether or not this additional SQL Server work should be taken into account depends on the purpose of the test. For unit-level performance tests, I prefer to execute tests using the same API as the app code.
I captured server-side performance with these 3 methods using @MartinSmith's original query. The averages of 1,000 iterations on my machine were:
test method cpu_time duration logical_reads
SSMS discard 53031.000000 55358.844000 7190.512000
TOP 0 52374.000000 52432.936000 7190.527000
SQLCLR 49110.000000 48838.532000 7190.578000
I did the same with a trivial query returning 10,000 rows and 2 columns (int and nvarchar(100)) from a user table:
test method cpu_time duration logical_reads
SSMS discard 4204.000000 9245.426000 402.004000
TOP 0 2641.000000 2752.695000 402.008000
SQLCLR 1921.000000 1878.579000 402.000000
Repeating the same test but with a varchar(100) column instead of nvarchar(100):
test method cpu_time duration logical_reads
SSMS discard 3078.000000 5901.023000 402.004000
TOP 0 2672.000000 2616.359000 402.008000
SQLCLR 1750.000000 1798.098000 402.000000
Below are the scripts I used for testing:
Source code for the SQLCLR proc like @VladimirBaranov suggested:
public static void ExecuteNonQuery(string sql)
{
using (var connection = new SqlConnection("Context Connection=true"))
{
connection.Open();
var command = new SqlCommand(sql, connection);
command.ExecuteNonQuery();
}
}
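For completeness, registering such a method so it can be called from T-SQL looks roughly like this (assembly name, class name, and path are assumptions; adjust to your build):

```sql
CREATE ASSEMBLY PerfTestUtil
FROM 'C:\clr\PerfTestUtil.dll'  -- hypothetical path to the compiled assembly
WITH PERMISSION_SET = SAFE;
GO
CREATE PROCEDURE dbo.ExecuteNonQuery
    @sql nvarchar(max)
AS EXTERNAL NAME PerfTestUtil.StoredProcedures.ExecuteNonQuery; -- class name assumed
GO
```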
Extended Events session to capture the actual server-side timings and resource usage:
CREATE EVENT SESSION [test] ON SERVER
ADD EVENT sqlserver.sql_batch_completed(SET collect_batch_text=(1))
ADD TARGET package0.event_file(SET filename=N'QueryTimes')
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF);
GO
User table create and load:
CREATE TABLE dbo.Foo(
FooID int NOT NULL CONSTRAINT PK_Foo PRIMARY KEY
, Bar1 nvarchar(100)
, Bar2 varchar(100)
);
WITH
t10 AS (SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(n))
,t10k AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS num FROM t10 AS a CROSS JOIN t10 AS b CROSS JOIN t10 AS c CROSS JOIN t10 AS d)
INSERT INTO dbo.Foo WITH (TABLOCKX)
SELECT num, REPLICATE(N'X', 100), REPLICATE('X', 100)
FROM t10k;
GO
SQL script run from SSMS with the discard results query option to run 1000 iterations of the test with the 3 different methods:
SET NOCOUNT ON;
GO
--return and discard results
SELECT v.*,
o.name
FROM master..spt_values AS v
JOIN sys.objects o
ON o.object_id % NULLIF(v.number, 0) = 0;
GO 1000
--TOP 0
DECLARE @X NVARCHAR(MAX);
SELECT @X = (SELECT TOP 0 v.*,
o.name
FOR XML PATH(''))
FROM master..spt_values AS v
JOIN sys.objects o
ON o.object_id % NULLIF(v.number, 0) = 0;
GO 1000
--SQLCLR ExecuteNonQuery
EXEC dbo.ExecuteNonQuery @sql = N'
SELECT v.*,
o.name
FROM master..spt_values AS v
JOIN sys.objects o
ON o.object_id % NULLIF(v.number, 0) = 0;
'
GO 1000
--return and discard results
SELECT FooID, Bar1
FROM dbo.Foo;
GO 1000
--TOP 0
DECLARE @X NVARCHAR(MAX);
SELECT @X = (SELECT TOP 0 FooID, Bar1
FOR XML PATH(''))
FROM dbo.Foo;
GO 1000
--SQLCLR ExecuteNonQuery
EXEC dbo.ExecuteNonQuery @sql = N'
SELECT FooID, Bar1
FROM dbo.Foo
';
GO 1000
--return and discard results
SELECT FooID, Bar2
FROM dbo.Foo;
GO 1000
--TOP 0
DECLARE @X NVARCHAR(MAX);
SELECT @X = (SELECT TOP 0 FooID, Bar2
FOR XML PATH(''))
FROM dbo.Foo;
GO 1000
--SQLCLR ExecuteNonQuery
EXEC dbo.ExecuteNonQuery @sql = N'
SELECT FooID, Bar2
FROM dbo.Foo
';
GO 1000
I would try to write a single CLR function that runs as many queries as needed to measure. It may have a parameter with the text(s) of queries to run, or names of stored procedures to run.
You have a single request to the server. Everything is done locally on the server. No network overhead. You discard query result in the .NET CLR code without using explicit temp tables by using ExecuteNonQuery for each query that you need to measure.
Don't change the query that you are measuring. The optimizer is complex, and changes to the query may have various effects on performance.
Also, use SET STATISTICS TIME ON and let the server measure the time for you. Fetch what the server has to say, parse it and send it back in the format that suits you.
I think, that results of SET STATISTICS TIME ON / OFF are the most reliable and accurate and have the least amount of noise.
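A minimal illustration of that approach (the query is just a stand-in):

```sql
SET STATISTICS TIME ON;

-- statement under test; timings arrive as informational messages, e.g.
-- "SQL Server Execution Times: CPU time = ... ms, elapsed time = ... ms."
SELECT COUNT(*) FROM sys.objects;

SET STATISTICS TIME OFF;
```

In .NET, those messages can be captured client-side via the connection's info-message events and parsed into whatever format suits you.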
I have a large SQL query with multiple statements and UNION ALL. I am doing something like this now:
DECLARE @condition BIT;
SET @condition = 0;
SELECT * FROM table1
WHERE @condition = 1
UNION ALL
SELECT * FROM table2;
In this case, table1 won't return any results. However, that query is complex with many joins (such as FullTextTable). The execution plan's estimate shows a high cost, but the actual number of rows and time to execute seems to show otherwise. Is this the most efficient way of filtering a whole query, or is there a better way? I don't want anything in the first select to run, if possible.
I would imagine that your eventual SQL query, with all of the unions and conditions that depend on pre-calculated values, gets pretty complicated. If you're interested in reducing the complexity of the query (not to the machine, but for maintenance purposes), I would go with moving the individual queries into views or table-valued functions to move that logic elsewhere. Then you can use the if @condition = 1 syntax that has been suggested elsewhere.
The best way to solve this is with dynamic SQL. The problem with DForck's solution is that it may lead to parameter sniffing. Just to give a rough idea, your query might look something like this:
DECLARE @query NVARCHAR(MAX);
SET @query = N'';
IF (@condition = 1) -- include table1 only when the condition is set
    SET @query = N'SELECT * FROM table1
UNION ALL
';
SET @query = @query + N'SELECT * FROM table2';
EXEC sp_executesql @query;
This is just a simplified case, but in an actual implementation you would parameterize the dynamic query, which solves the parameter sniffing problem. Here is an excellent explanation of the problem: Parameter Sniffing (or Spoofing) in SQL Server.
I think you might be better off with this:
if (#condition=1)
begin
select * from table1
union all
select * from table2
end
else
begin
select * from table2
end
Imagine a table with, say, a hundred different columns in it. Imagine, then, that I have a user-data table from which I want to copy data to the base table. So I wrote a simple insert-select statement, and this error pops up. What's the most elegant way to figure out which column raises the error?
My initial thoughts on the solution are about wrapping it in a transaction that I will ultimately rollback and use a sort of Divide and Conquer approach:
begin tran
insert into BaseTable (c1,c2,c3,...,cN)
select c1,c2,c3,...,cN
from UserTable
rollback tran
And this obviously fails. So we divide the column set in half like so:
begin tran
insert into BaseTable (c1,c2,c3,...,cK) --where K = N/2
select c1,c2,c3,...,cK --where K = N/2
from UserTable
rollback tran
And if it fails then the failing column is in the other half. And we continue the process, until we find the pesky column.
Anything more elegant than that?
Note: I also found a near-duplicate of this question but it barely answers it.
The following script creates a SELECT statement for each integer column of BaseTable.
Executing the resulting SELECT statements should pinpoint the offending columns in your UserTable.
SELECT 'PRINT '''
+ sc.Name
+ '''; SELECT MIN(CAST('
+ sc.Name
+ ' AS INTEGER)) FROM Usertable'
FROM sys.columns sc
INNER JOIN sys.types st ON st.system_type_id = sc.system_type_id
WHERE OBJECT_NAME(Object_ID) = 'BaseTable'
AND st.name = 'INT'
If this is just something you are running manually then depending upon how much data you are inserting you could use the OUTPUT clause to output the inserted rows to the client.
The row after the last one that is output should be the one with the problem.
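A sketch of the OUTPUT approach against the hypothetical tables from the question:

```sql
INSERT INTO BaseTable (c1, c2, c3)
OUTPUT inserted.c1, inserted.c2, inserted.c3 -- streams each row back as it is inserted
SELECT c1, c2, c3
FROM UserTable;
```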
I took Lieven Keersmaekers' approach but extended it. If a table has various numeric field lengths, this script will change the Cast based on the type name and precision. Credit still goes to Lieven for thinking of this solution - it helped me a lot.
DECLARE @tableName VARCHAR(100)
SET @tableName = 'tableName'
SELECT 'PRINT ''' + sc.NAME + '''; SELECT MIN(CAST([' + sc.NAME + '] as ' + CASE
    WHEN st.NAME = 'int'
        THEN 'int'
    ELSE st.NAME + '(' + cast(sc.precision AS VARCHAR(5)) + ',' + cast(sc.scale AS VARCHAR(5)) + ')'
    END + ')) from ' + @tableName
FROM sys.columns sc
INNER JOIN sys.types st ON st.system_type_id = sc.system_type_id
WHERE OBJECT_NAME(Object_ID) = #tableName
AND st.NAME NOT IN ('nvarchar', 'varchar', 'image', 'datetime', 'smalldatetime', 'char', 'nchar')
A lot of the time the brute-force method you suggest is the best way.
However, if you have a copy of the database that you can load fake data into, run the query on it so you don't have the transaction hiding the column that is breaking it. Sometimes the error itself will give a hint as to what is wrong; usually, if you look at what is going in, you can see when text is going into an int or vice versa.
Doing this rules out anything else in my code causing the problem.
You will need to get a copy of the generated query so that you can copy and paste it into a query tool.
I think you are taking the wrong approach. If you are getting an arithmetic overflow by simply selecting columns from one table and inserting into another, then you must be selecting from bigger columns (e.g. bigint) and inserting into smaller columns (e.g. int). This is a fundamentally incorrect thing to be doing, and you need to alter your DB structure so that inserting rows from one table into the other will work. Inspect each column of each table, see where it is possible to get an overflow, and adjust your destination table so that the data you are inserting will fit.
I still think my point stands, but in response to your comments: if you want a quick and dirty solution, make all your columns in BaseTable varchar(MAX).
Then:
insert into BaseTable (c1,c2,...,cN)
select CAST(c1 AS varchar(max)),CAST(c2 AS varchar(max))...,cN
from UserTable