Why is it so difficult to do a loop in T-SQL - sql-server

OK, I know it can be done, I do it quite often, but why is it so difficult to do a loop in T-SQL? I can think of a ton of reasons I'd want to parse thru a query result set and do something that simply can't be done without a loop, yet the code to set up and execute my loop is > 20 lines.
I'm sure others have similar opinions, so why are we still without a simple way to perform a loop?
An aside: we finally got an UPSERT (aka MERGE) in SQL2008 so maybe all hope isn't lost.

SQL is a set-based, declarative language, not a procedural or imperative one. T-SQL tries to straddle the two, but it's still built on a fundamentally set-based paradigm.

I can think of a ton of reasons I'd want to parse thru a query result set and do something that simply can't be done without a loop
And for the vast majority of those, I can either show you how to do it as a set-based operation instead or explain why it should be done in your client code rather than on the database. Needing to do a loop in SQL is exceedingly rare.

T-SQL is not designed to be an imperative language. It's designed to be declarative. Its declarative nature allows the optimizer to slice up the various tasks, run them in parallel, and in other ways do things in an order that is most efficient.
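To illustrate the difference (a minimal sketch; the Products table and its columns are hypothetical), compare a row-by-row loop with the single set-based statement that replaces it:
-- Imperative: one statement per row, order fixed by the loop
declare @id int, @maxId int
select @id = 1, @maxId = max(ProductId) from Products
while @id <= @maxId
begin
    update Products set Price = Price * 1.1 where ProductId = @id
    set @id = @id + 1
end

-- Declarative: one statement; the optimizer chooses the access path
update Products set Price = Price * 1.1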

Because SQL is a set-based language. The power of SQL is in finding a smaller group within a larger group of data based on specific characteristics. To handle this task, looping is largely unnecessary. Obviously it's been added for the convenience of handling some situations, but the intended use of the language makes this feature irrelevant.

Almost everything can be done set-based; try using a numbers table (there's a sketch of one after the example below).
Why 20 lines? This is all you need:
select *, identity(int, 1, 1) as SomeId into #temp
from sysobjects

declare @id int, @MaxId int
select @id = 1, @MaxId = max(SomeId) from #temp

while @id <= @MaxId
begin
    -- do your stuff here
    print @id
    set @id = @id + 1
end
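And the numbers-table idea mentioned above, sketched out (sys.objects is just a convenient row source here; any table with enough rows works):
-- Build a small numbers table once
select top 1000 row_number() over (order by (select null)) as n
into #numbers
from sys.objects a cross join sys.objects b

-- Then many "loops" become joins, e.g. the next 30 dates without iterating
select dateadd(day, n - 1, '2008-01-01') as d
from #numbers
where n <= 30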

It depends what you want to do in the loop. Using a WHILE loop is not difficult at all:
declare @i int
set @i = 20
while @i > 0
begin
    -- do some stuff
    set @i = @i - 1
end
It only becomes cumbersome when using cursors, which should be avoided anyway.

You might try using user-defined functions to do most of the work instead of taking a loop-based approach. This would preserve the intention of the SQL language, which is set-based.

SQL is a SET-based system, not a procedural (loop) one. Generally it's regarded as bad practice to use loops in SQL because they perform poorly compared to their set-based equivalents.
WHILE is the most common looping structure; CURSORS can also be used, but have their own problems (forgetting to deallocate/close).
...an example of WHILE (you may not need it but others may)
DECLARE @iterator INT
SET @iterator = 0
WHILE @iterator < 20
BEGIN
    SELECT * FROM table WHERE rowKey = @iterator
    /* do stuff */
    SET @iterator = @iterator + 1
END
The real question is "What is it that you are trying to do that simply cannot be done in a set based way?"

I'm not an expert in DBs, but I believe the atomic nature of database transactions would make loops difficult to achieve, because a transaction must either complete or not occur at all. Maintaining state can be pesky!
Wikipedia Article on Atomicity

Related

How to join a table with the result of some manipulation of data (be it a stored procedure, function, ...)?

This is one of the quirks of SQL Server that I find most puzzling: not being able to manipulate data within a function (execute UPDATE or INSERT commands), or not being able to join the result of a stored procedure to a query.
I want to code an object that returns the next value from a table of counters, and be able to use its result in selects.
Something like:
create function getNewCounterValue(@Counter varchar(100))
returns int
as
begin
    declare @Value int

    select @Value = Value
    from CounterValues
    where Counter = @Counter

    set @Value = coalesce(@Value, 0) + 1

    update CounterValues set Value = @Value
    where Counter = @Counter

    if @@rowcount = 0 begin
        insert into CounterValues (Counter, Value) values (@Counter, @Value)
    end

    return @Value
end
So then I would be able to run commands like:
declare @CopyFrom date = '2022-07-01'
declare @CopyTo date = '2022-08-01'

insert into Bills (IdBill, Date, Provider, Amount)
select dbo.getNewCounterValue('BILL'), @CopyTo, Provider, Amount
from Bills
where Date = @CopyFrom
But SQL Server doesn't allow creating functions that change data ("Invalid use of a side-effecting operator"), so it forces me to write getNewCounterValue as a stored procedure, and then I can't execute it and join it to a query.
Is there any way to have an object that manipulates data and can join its result to a query?
PS: I know that I could use sequences to get new counter values without needing to change data, but I'm working on a huge legacy database that already uses counter tables, not sequences. So I cannot change that without breaking a zillion other things.
I also know that I could declare IdBill as an Identity column, so I wouldn't need to retrieve new counter values to insert rows, but again this is a huge legacy database that uses counter tables, not identity columns, so I cannot change the column types without breaking the system.
Besides, these counters are just an example of why being able to join the result of some data manipulation to a query would be very useful. I like to write a lot of logic in the database, so I would take advantage of it in plenty of other situations.
A few years ago I saw a very dirty trick to do this by executing the data manipulation instructions as openrowset calls within the function, but it was a seriously ugly hack. Is there still no better way to achieve this?
Thank you.
You're clearly aware that a function is for returning data, and you're aware of sequences, and identity columns, and you have given a completely reasonable explanation in your question as to why you can't use this in this case.
But as you also said, the question is a bit more general than just sequence/identity problems. There is a coherent idea of "some kind of construct that can change data, and whose output can be composed directly into a select".
There's no "object" that exactly fits that description. Asking "why doesn't language X have feature Y" leads to philosophical discussions, with good answers already provided by Eric Lippert here and here.
I think there are a few more concrete answers in this case though:
1) Guaranteed idempotency.
A select returns a set (bag, collection, however you want to think about it). Then there is an obvious expectation that any process that runs for the result of a select may run for multiple rows. If the process is not idempotent, then the state of the system when the select is complete might depend on the number of rows in the result. It's also possible that the execution of the modifying process might change the semantics of the select, or the next iteration of the process, which leads to situations like the Halloween Problem.
2) Plan Compilation
Related to (1) but not precisely the same. How can the query optimizer approach this functionality? It must generate a plan "ahead of time", and that plan depends on stateful information. Yes, we get adaptive memory grants with 2019, but that's a trivial sort of "mid-flight change", and even that took years before it was implemented (by which I mean that I believe Oracle has been able to do this for years, though I could be wrong, I'm no Oracle guy).
3) It's not actually beneficial in a lot of use cases
Take the use case of generating a sequence. Why not just iterate and execute a stored procedure? One answer might be "because I want to avoid imperative iteration; we should try to be set-based and declarative". But as your hypothetical function demonstrates, it would still be imperative and iterative, it would just be "hidden" behind the select. I think - or let's say I have an intuition - that many cases where it seems like it might be nice to put a state-changing operation into a select fall into this basket. We wouldn't really be gaining anything.
4) There actually is a way to do it! (but only in the most trivial case)
When I said "composed directly into a select" I didn't use the word "compose" on a whim. We do have composable DML:
create table T (i int primary key, c char);

declare @output table (i int, c char);

insert @output (i, c)
select dml.i, dml.c
from (
    insert T (i, c)
    output inserted.i, inserted.c
    values (1, 'a')
) dml
/* OK, you can't add this:
join SomeOtherTable on ...
*/
Of course, this isn't substantially different from insert exec in that you can't have a "naked" select; it has to be the source for an insert first. And you can't join to the dml output directly; you have to capture the output and then do the join. But at least it gives you a way to avoid the "nested insert exec" problem.
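In other words, the pattern under those constraints is two steps: capture the modified rows, then join. Continuing the sketch above (SomeOtherTable is hypothetical):
-- @output is now an ordinary table variable, so the join is legal
select o.i, o.c, s.SomeColumn
from @output o
join SomeOtherTable s on s.i = o.i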

What can I do to improve performance of my pure User Defined Function in SQL Server?

I have made a simple, but relatively computationally complex, UDF that queries a rarely changing table. In typical usage this function is called many, many times from a WHERE clause over a very small domain of parameters.
What can I do to make my usage of the UDF faster? My thought is that there should be some way to tell SQL Server that my function returns the same result given the same parameters and thus should be memoized. There doesn't seem to be a way to do it within the UDF, because UDFs are required to be pure and thus can't write to a temp table.
For completeness my UDF is below, though I am seeking a general answer on how to make calling UDFs on small domains faster, and not how to optimize this particular UDF.
CREATE function [dbo].[WorkDay] (
    @inputDate datetime,
    @offset int)
returns datetime as
begin
    declare @result datetime
    set @result = @inputDate

    while @offset != 0
    begin
        set @result = dateadd(day, sign(@offset), @result)

        while (DATEPART(weekday, @result) not between 2 and 6)
           or @result in (select date from myDB.dbo.holidays
                          where calendar = 'US' and date = @result)
        begin
            set @result = dateadd(day, sign(@offset), @result)
        end

        set @offset = @offset - sign(@offset)
    end

    return @result
END
My first thought here is: what's the performance problem? Sure, you have a loop (once per row to apply the WHERE) within a loop that runs a query. But are you getting poor execution plans? Are your result sets huge? But let's turn to the generic question: how does one solve this problem? SQL doesn't really do memoization (as the illustrious @Martin_Smith points out). So what's a boy to do?
Option 1 - New Design
Create an entirely new design. In this specific case @Aaron_Bertrand points out that a calendar table may meet your needs. Quite right. This doesn't really help you with non-calendar situations, but, as is often the case in SQL, you need to think a bit differently.
Option 2 - Call the UDF Less
Narrow the set of items that call this function. This reminds me a lot of how to do successful paging/row counting. Generate a small result set that has the distinct values required and then call your UDF so it is only called a few times. This may or may not be an option, but can work in many scenarios.
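A sketch of that idea against the WorkDay UDF from the question (Orders and OrderDate are hypothetical): evaluate the function once per distinct input, then join the small result back to the full set:
-- Evaluate WorkDay once per distinct date instead of once per row
select distinct OrderDate
into #distinctDates
from Orders

select OrderDate, dbo.WorkDay(OrderDate, 5) as DueDate
into #dueDates
from #distinctDates

select o.*, d.DueDate
from Orders o
join #dueDates d on d.OrderDate = o.OrderDate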
Option 3 - Dynamic UDF
I'll probably get booed out of the room for this suggestion, but here goes. What makes this UDF slow is the select statement inside the loop. If your Holiday table really changes infrequently, you could put a trigger on the table. The trigger would write out an updated UDF. The new UDF could brute-force all the holiday decisions. Would it be a bit like cannibalism, with SQL writing SQL? Sure. But it would get rid of the sub-query and speed the UDF up. Let the heckling begin.
Option 4 - Memoize It!
While SQL can't directly memoize, we do have SQL CLR. Convert the UDF to a SQL CLR UDF. In CLR you get to use static variables. You could easily grab the Holidays table at some regular interval and store it in a hashtable. Then just rewrite your loop in the CLR. You could even go further and memoize the entire answer if that's appropriate logic.
Update:
Option 1 - I was really trying to focus on the general here, not the example function you used above. However, the current design of your UDF allows for multiple calls to the Holiday table if you happen to hit a few in a row. Using some sort of calendar-style-table that contains a list of 'bad days' and the corresponding 'next business day' will allow you to remove the potential for multiple hits & queries.
Option 3 - While the domain is unknown ahead of time you could very well modify your holiday table. For a given holiday day it would contain the next corresponding work day. From this data you could spit out a UDF with a long case statement (when '5/5/2012' then '5/14/2012' or something similar) at the bottom. This strategy may not work for every type of problem, but could work well for some types of problems.
Option 4 - There are implications to every technology. CLR needs to be deployed, the SQL Server configuration modified and SQL CLR is limited to the 3.5 framework. Personally, I've found these adjustments easy enough, but your situation may be different (say a recalcitrant DBA, or restrictions on modifications to production servers).
Using static variables requires the assemblies be granted FULL TRUST. You'll have to make sure you get your locking correct.
There is some evidence that at very high transaction levels CLR doesn't perform as well as direct SQL. In your scenario, however, this observation might not be applicable, because there isn't a direct SQL equivalent for what you're trying to do (memoize).
You could write to a real table keyed off your parameters, select from that first, and if that comes up null, then calculate and insert into the table, doing your own caching.
It might make more sense to pre-fill a table with all possible values for the date range you are interested in and then just join to that. You are then only doing the calculation once for each combination of parameters and letting SQL handle the join.
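A sketch of that pre-filled lookup (WorkDayCache and Orders are hypothetical names):
-- Compute WorkDay once for every (date, offset) pair you care about
create table WorkDayCache (
    inputDate datetime not null,
    offset    int      not null,
    result    datetime not null,
    primary key (inputDate, offset)
)

-- Fill it once (or from a trigger on the holidays table), then just join
select o.*, w.result as DueDate
from Orders o
join WorkDayCache w on w.inputDate = o.OrderDate and w.offset = 5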

How do I return an empty result set from a procedure using T-SQL?

I'm interested in returning an empty result set from SQL Server stored procedures in certain events.
The intended behaviour is that a L2SQL DataContext.SPName().SingleOrDefault() will result in a CLR null value.
I'm presently using the following solution, but I'm unsure whether it would be considered bad practice, a performance hazard (I could not find one by reading the execution plan), or if there is simply a better way:
SELECT * FROM [dbo].[TableName]
WHERE 0 = 1;
The execution plan is a constant scan with a trivial cost associated with it.
The reason I am asking this, instead of simply not running any SELECTs, is that I'm concerned previous SELECT @scalar or SELECT INTO statements could cause unintended result sets to be served back to L2SQL. Am I worrying over nothing?
If you need column names in the response, then proceed with SELECT TOP 0 * from that table; otherwise just use SELECT TOP 0 NULL. It should work pretty fast :)
That is a reasonable approach. Another alternative is:
SELECT TOP 0 * FROM [dbo].[TableName]
If you want to simply retrieve the metadata of a result set without any actual rows, use SET FMTONLY ON.
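For example (FMTONLY returns only metadata without executing the query; note it is deprecated in later versions of SQL Server):
SET FMTONLY ON;
SELECT * FROM [dbo].[TableName];
SET FMTONLY OFF;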
I think the best solution is TOP 0, but without using a dummy table.
This does it for me:
select top 0 null as column1, null as column2
Using e.g. a system table may be fine for performance but looks unclean.
It's an entirely reasonable approach.
To alleviate any worries about performance (which you shouldn't have in the first place - the server's smart enough to avoid a table scan for 0 = 1), pick a table that's very small and not heavily used - I'm sure your DB schema has one.

Why is it considered bad practice to use cursors in SQL Server?

I knew of some performance reasons back in the SQL 7 days, but do the same issues still exist in SQL Server 2005? If I have a resultset in a stored procedure that I want to act upon individually, are cursors still a bad choice? If so, why?
Because cursors take up memory and create locks.
What you are really doing is attempting to force set-based technology into non-set based functionality. And, in all fairness, I should point out that cursors do have a use, but they are frowned upon because many folks who are not used to using set-based solutions use cursors instead of figuring out the set-based solution.
But, when you open a cursor, you are basically loading those rows into memory and locking them, creating potential blocks. Then, as you cycle through the cursor, you are making changes to other tables and still keeping all of the memory and locks of the cursor open.
All of which has the potential to cause performance issues for other users.
So, as a general rule, cursors are frowned upon. Especially if that's the first solution arrived at in solving a problem.
The above comments about SQL being a set-based environment are all true. However there are times when row-by-row operations are useful. Consider a combination of metadata and dynamic-sql.
As a very simple example, say I have 100+ records in a table that define the names of tables that I want to copy/truncate/whatever. Which is best? Hardcoding the SQL to do what I need to? Or iterate through this resultset and use dynamic-SQL (sp_executesql) to perform the operations?
There is no way to achieve the above objective using set-based SQL.
So, to use cursors or a while loop (pseudo-cursors)?
SQL Cursors are fine as long as you use the correct options:
INSENSITIVE will make a temporary copy of your result set (saving you from having to do this yourself for your pseudo-cursor).
READ_ONLY will make sure no locks are held on the underlying result set. Changes in the underlying result set will be reflected in subsequent fetches (same as if getting TOP 1 from your pseudo-cursor).
FAST_FORWARD will create an optimised forward-only, read-only cursor.
Read about the available options before ruling all cursors as evil.
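For example, a minimal fetch loop using those options (sys.tables stands in here for your own result set):
DECLARE @name sysname

DECLARE c CURSOR FAST_FORWARD FOR
    SELECT name FROM sys.tables

OPEN c
FETCH NEXT FROM c INTO @name
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @name  -- per-row work goes here
    FETCH NEXT FROM c INTO @name
END
CLOSE c
DEALLOCATE c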
There is a workaround for cursors that I use every time I need one.
I create a table variable with an identity column in it.
Insert all the data I need to work with into it.
Then make a WHILE block with a counter variable, and select the data I want from the table variable with a select statement where the identity column matches the counter.
This way I don't lock anything and use a lot less memory, and it's safe: I will not lose anything to a memory corruption or something like that.
And the block code is easy to see and handle.
This is a simple example:
DECLARE @TAB TABLE (ID INT IDENTITY, COLUMN1 VARCHAR(10), COLUMN2 VARCHAR(10))

DECLARE @COUNT INT,
        @MAX INT,
        @CONCAT VARCHAR(MAX),
        @COLUMN1 VARCHAR(10),
        @COLUMN2 VARCHAR(10)

SET @COUNT = 1

INSERT INTO @TAB VALUES ('TE1S', 'TE21')
INSERT INTO @TAB VALUES ('TE1S', 'TE22')
INSERT INTO @TAB VALUES ('TE1S', 'TE23')
INSERT INTO @TAB VALUES ('TE1S', 'TE24')
INSERT INTO @TAB VALUES ('TE1S', 'TE25')

SELECT @MAX = @@IDENTITY

WHILE @COUNT <= @MAX BEGIN
    SELECT @COLUMN1 = COLUMN1, @COLUMN2 = COLUMN2 FROM @TAB WHERE ID = @COUNT

    IF @CONCAT IS NULL BEGIN
        SET @CONCAT = ''
    END ELSE BEGIN
        SET @CONCAT = @CONCAT + ','
    END

    SET @CONCAT = @CONCAT + @COLUMN1 + @COLUMN2
    SET @COUNT = @COUNT + 1
END

SELECT @CONCAT
I think cursors get a bad name because SQL newbies discover them and think "Hey a for loop! I know how to use those!" and then they continue to use them for everything.
If you use them for what they're designed for, I can't find fault with that.
SQL is a set based language--that's what it does best.
I think cursors are still a bad choice unless you understand enough about them to justify their use in limited circumstances.
Another reason I don't like cursors is clarity. The cursor block is so ugly that it's difficult to use in a clear and effective way.
All that having been said, there are some cases where a cursor really is best--they just aren't usually the cases that beginners want to use them for.
Cursors are usually not the disease, but a symptom of it: not using the set-based approach (as mentioned in the other answers).
Not understanding this problem, and simply believing that avoiding the "evil" cursor will solve it, can make things worse.
For example, replacing cursor iteration with other iterative code, such as moving data to temporary tables or table variables, to loop over the rows in a way like:
SELECT * FROM #temptable WHERE Id = @counter
or
SELECT TOP 1 * FROM #temptable WHERE Id > @lastId
Such an approach, as shown in the code of another answer, makes things much worse and doesn't fix the original problem. It's an anti-pattern called cargo cult programming: not knowing WHY something is bad and thus implementing something worse to avoid it! I recently changed such code (using a #temptable with no index on the identity/PK) back to a cursor, and updating slightly more than 10000 rows took only 1 second instead of almost 3 minutes. Still lacking a set-based approach (the cursor being the lesser evil), but it was the best I could do at that moment.
Another symptom of this lack of understanding can be what I sometimes call "one object disease": database applications which handle single objects through data access layers or object-relational mappers. Typically code like:
var items = new List<Item>();
foreach (int oneId in itemIds)
{
    items.Add(dataAccess.GetItemById(oneId));
}
instead of
var items = dataAccess.GetItemsByIds(itemIds);
The first will usually flood the database with tons of SELECTs, one round trip for each, especially when object trees/graphs come into play and the infamous SELECT N+1 problem strikes.
This is the application side of not understanding relational databases and the set-based approach, just the same way cursors are when writing procedural database code, like T-SQL or PL/SQL!
Sometimes the nature of the processing you need to perform requires cursors, though for performance reasons it's always better to write the operation(s) using set-based logic if possible.
I wouldn't call it "bad practice" to use cursors, but they do consume more resources on the server (than an equivalent set-based approach) and more often than not they aren't necessary. Given that, my advice would be to consider other options before resorting to a cursor.
There are several types of cursors (forward-only, static, keyset, dynamic). Each one has different performance characteristics and associated overhead. Make sure you use the correct cursor type for your operation. Forward-only is the default.
One argument for using a cursor is when you need to process and update individual rows, especially for a dataset that doesn't have a good unique key. In that case you can use the FOR UPDATE clause when declaring the cursor and process updates with UPDATE ... WHERE CURRENT OF.
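A sketch of that pattern (dbo.Bills and Amount are hypothetical):
DECLARE @amount money

DECLARE c CURSOR FOR
    SELECT Amount FROM dbo.Bills
    FOR UPDATE OF Amount

OPEN c
FETCH NEXT FROM c INTO @amount
WHILE @@FETCH_STATUS = 0
BEGIN
    -- some per-row computation that is hard to express set-based
    UPDATE dbo.Bills SET Amount = @amount * 1.05
    WHERE CURRENT OF c
    FETCH NEXT FROM c INTO @amount
END
CLOSE c
DEALLOCATE c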
Note that "server-side" cursors used to be popular (from ODBC and OLE DB), but ADO.NET does not support them, and AFAIK never will.
There are very, very few cases where the use of a cursor is justified. There are almost no cases where it will outperform a relational, set-based query. Sometimes it is easier for a programmer to think in terms of loops, but the use of set logic, for example to update a large number of rows in a table, will result in a solution that is not only many fewer lines of SQL code, but that runs much faster, often by several orders of magnitude.
Even the fast-forward cursor in SQL Server 2005 can't compete with set-based queries. The graph of performance degradation often starts to look like an n^2 operation compared to set-based, which tends to stay more linear as the data set grows very large.
@Daniel P -> you don't need to use a cursor to do it. You can easily use set-based theory to do it. E.g. with SQL 2008:
DECLARE @commandname NVARCHAR(1000) = '';

SELECT @commandname += 'truncate table ' + tablename + '; '
FROM tableNames;

EXEC sp_executesql @commandname;
will simply do what you have said above. And you can do the same with SQL 2000, but the syntax of the query would be different.
However, my advice is to avoid cursors as much as possible.
Gayam
Cursors do have their place; however, I think they get a bad name mainly because they are often used when a single select statement would suffice to provide aggregation and filtering of results.
Avoiding cursors allows SQL Server to more fully optimize the performance of the query, very important in larger systems.
The basic issue, I think, is that databases are designed and tuned for set-based operations -- selects, updates, and deletes of large amounts of data in a single quick step based on relations in the data.
In-memory software, on the other hand, is designed for individual operations, so looping over a set of data and potentially performing different operations on each item serially is what it is best at.
Looping is not what the database or storage architecture is designed for, and even in SQL Server 2005 you are not going to get performance anywhere close to what you get if you pull the basic data set out into a custom program and do the looping in memory, using data objects/structures that are as lightweight as possible.

WAITFOR command

Given the problem: a stored procedure on SQL Server 2005, which loops through a cursor, must be run once an hour; it takes about 5 minutes to run, but it takes up a large chunk of processor time.
edit: I'd remove the cursor if I could; unfortunately, I have to do a bunch of processing and run other stored procs/queries based on each row.
Can I use
WAITFOR DELAY '0:0:0.1'
before each fetch to act as SQL's version of .Net's Thread.Sleep? Thus allowing the other processes to complete faster at the cost of this procedure's execution time.
Or is there another solution I'm not seeing?
Thanks
Putting the WAITFOR inside the loop would indeed slow it down and allow other things to go faster. You might also consider a WHILE loop instead of a cursor - in my experience it runs faster. You might also consider moving your cursor to a fast-forward, read-only cursor - that can limit how much memory it takes up.
declare @minid int, @maxid int, @somevalue int
select @minid = 1, @maxid = 5

while @minid <= @maxid
begin
    set @somevalue = null
    select @somevalue = somefield from sometable where id = @minid
    print @somevalue
    set @minid = @minid + 1
    waitfor delay '00:00:00.1'
end
I'm not sure that would solve the problem. IMHO the performance problem with cursors comes from the amount of memory used to keep the dataset resident while you loop through it; if you then add a WAITFOR inside the loop, you're hogging those resources for longer.
But I may be wrong here. What I would suggest is to use perfmon to check the server's performance under both conditions, and then make a decision on whether it is worth it or not to add the wait.
Looking at the tag, I'm assuming you're using MS SQL Server, and not any of the other flavours.
You could delay the procedure, but that might or might not help you. It depends on how the procedure works. Is it in a transaction, why a cursor (horribly inefficient in SQL Server), where is the slowdown, etc. Perhaps reworking the procedure would make more sense.
Ever since SQL 2005 included windowing functions and other neat features, I've been able to eliminate cursors in almost all instances. Perhaps your problem would best be served by eliminating the cursor itself?
Definitely check out Ranking functions http://msdn.microsoft.com/en-us/library/ms189798.aspx and Aggregate window functions http://msdn.microsoft.com/en-us/library/ms189461.aspx
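For instance, the classic "latest row per group" task, which used to invite a cursor, collapses into one ROW_NUMBER() query (dbo.Orders is hypothetical):
-- Latest order per customer, no cursor required
SELECT CustomerId, OrderId, OrderDate
FROM (
    SELECT CustomerId, OrderId, OrderDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerId
                              ORDER BY OrderDate DESC) AS rn
    FROM dbo.Orders
) x
WHERE rn = 1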
I'm guessing that whatever code you have means that the other processes can't access the table your cursor is derived from.
Provided that you make the cursor READ_ONLY FAST_FORWARD, you should not lock the tables the cursor is derived from.
If, however, you need to write, then WAITFOR wouldn't help. Once you've locked the table, it's locked.
An option would be to snapshot the tables into a temp table, then cursor/loop through that instead. You would then not be locking the underlying tables, but equally the tables could change while you're processing the snapshot...
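A sketch of that snapshot approach (dbo.SourceTable and its columns are hypothetical):
-- Point-in-time copy; the live table is released immediately
SELECT KeyColumn, SomeColumn
INTO #snapshot
FROM dbo.SourceTable

DECLARE @key int
DECLARE c CURSOR LOCAL FAST_FORWARD FOR
    SELECT KeyColumn FROM #snapshot
OPEN c
FETCH NEXT FROM c INTO @key
WHILE @@FETCH_STATUS = 0
BEGIN
    -- per-row processing against the snapshot, not the live table
    FETCH NEXT FROM c INTO @key
END
CLOSE c
DEALLOCATE c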
Dems
