Pros and cons of using a cursor (in SQL Server) - sql-server

I asked a question here, Using cursor in OLTP databases (SQL Server),
where people responded saying cursors should never be used.
I feel cursors are very powerful tools that are meant to be used (I don't think Microsoft would support cursors just to accommodate bad developers). Suppose you have a table where the value of a column in a row depends on the value of the same column in the previous row. If it is a one-time back-end process, don't you think using a cursor would be an acceptable choice?
Off the top of my head I can think of a couple of scenarios where I feel there should be no shame in using cursors. Please let me know if you guys feel otherwise.
A one-time back-end process to clean bad data that completes within a few minutes.
Batch processes that run infrequently (something like once a year).
If, in the above scenarios, there is no visible strain on other processes, wouldn't it be unreasonable to spend extra hours writing code to avoid cursors? In other words, in certain cases the developer's time matters more than the performance of a process that has almost no impact on anything else.
In my opinion these would be some scenarios where you should seriously try to avoid using a cursor.
A stored procedure called very often from a website.
A SQL job that would run multiple times a day and consume a lot of resources.
I think it's very superficial to make a general statement like "cursors should never be used" without analyzing the task at hand and actually weighing it against the alternatives.
Please let me know of your thoughts.

There are several scenarios where cursors actually perform better than set-based equivalents. Running totals is the one that always comes to mind - look for Itzik's words on that (and ignore any that involve SQL Server 2012, which adds new windowing functions that give cursors a run for their money in this situation).
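For reference, the SQL Server 2012+ set-based form looks like this; a minimal sketch assuming a hypothetical dbo.Sales(OrderID, Amount) table:
-- Running total with a window function (SQL Server 2012+);
-- dbo.Sales and its columns are hypothetical placeholders.
SELECT OrderID,
       Amount,
       SUM(Amount) OVER (ORDER BY OrderID
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunningTotal
FROM dbo.Sales
ORDER BY OrderID;
On earlier versions, running totals are exactly the situation where a well-optioned cursor often beats the set-based alternatives (the triangular self-join in particular).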
One of the big problems people have with cursors is that they perform slowly, they use temporary storage, etc. This is partially because the default syntax is a global cursor with all kinds of inefficient default options. The next time you're doing something with a cursor that doesn't need to do things like UPDATE...WHERE CURRENT OF (which I've been able to avoid my entire career), give it a fair shake by comparing these two syntax options:
-- Option 1: the bare default syntax (a global cursor with the heavier
-- default options):
DECLARE c CURSOR
FOR <SELECT QUERY>;

-- Option 2: the same cursor with the cheaper options spelled out:
DECLARE c CURSOR
LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR <SELECT QUERY>;
In fact, the default options in the first version are responsible for a bug in the undocumented stored procedure sp_MSforeachdb that makes it skip databases if the status of any database changes during execution. I subsequently wrote my own version of the stored procedure (see here) which both fixed the bug (simply by using the latter version of the syntax above) and added several parameters to control which databases would be chosen.
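To make the comparison concrete, here is a minimal sketch of the kind of per-database loop that benefits from the second form; the DBCC command is just an arbitrary example of per-database work:
-- Iterate over online databases with the lightweight cursor options.
DECLARE @db SYSNAME, @sql NVARCHAR(MAX);

DECLARE c CURSOR
LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR SELECT name FROM sys.databases WHERE state_desc = N'ONLINE';

OPEN c;
FETCH NEXT FROM c INTO @db;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'DBCC CHECKDB(' + QUOTENAME(@db) + N') WITH NO_INFOMSGS;';
    EXEC sys.sp_executesql @sql;
    FETCH NEXT FROM c INTO @db;
END
CLOSE c;
DEALLOCATE c;
Because the cursor is STATIC, the list of databases is snapshotted up front, so a status change mid-loop can't derail the iteration the way it does in sp_MSforeachdb.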
A lot of people think an approach isn't a cursor just because it doesn't say DECLARE CURSOR. I've seen people argue that a WHILE loop is faster than a cursor (which I hope I've dispelled here) or that using FOR XML PATH to perform grouped concatenation is not performing a hidden cursor-like operation. Looking at the plan will, in a lot of cases, show the truth.
In a lot of cases, cursors are used where a set-based approach would be more appropriate. But there are plenty of valid use cases where a set-based equivalent is much more complicated to write, much harder for the optimizer to generate a good plan for, both, or simply not possible (e.g. maintenance tasks where you're looping through tables to update statistics, or calling a stored procedure for each value in a result set). The same is true for a lot of big multi-table queries where the plan gets too monstrous for the optimizer to handle; in these cases it can be better to dump some of the intermediate results into a temporary structure first. The same goes for some set-based equivalents to cursors (like running totals). I've also written about the other direction, where people almost always reach instinctively for a WHILE loop or cursor when there are clever set-based alternatives that are much better.
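As one sketch of the intermediate-dump idea, with hypothetical dbo.Orders and dbo.Customers tables:
-- Stage an aggregate once, then apply it set-based, instead of
-- recomputing it per row or feeding one monstrous multi-table query.
SELECT CustomerID, SUM(Amount) AS Total
INTO #totals
FROM dbo.Orders
GROUP BY CustomerID;

UPDATE c
   SET c.LifetimeTotal = t.Total
FROM dbo.Customers AS c
JOIN #totals AS t
  ON t.CustomerID = c.CustomerID;

DROP TABLE #totals;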
UPDATE 2013-07-25
Just wanted to add some additional blog posts I've written about cursors, which options you should be using if you do have to use them, and using set-based queries instead of loops to generate sets:
Best Approaches for Running Totals - Updated for SQL Server 2012
What impact can different cursor options have?
Generate a Set or Sequence Without Loops: [Part 1] [Part 2] [Part 3]

The issue with cursors in SQL Server is that the engine is set-based internally, unlike other DBMSs such as Oracle, which are cursor-based internally. This means that when you create a cursor in SQL Server, temporary storage needs to be created and the set-based result set needs to be copied over to the temporary cursor storage. You can see why this would be expensive right off the bat, not to mention any row-by-row processing that you might be doing on top of the cursor itself. The bottom line is that set-based processing is more efficient, and oftentimes your cursor-based operation can be done better using a CTE or a temp table.
That being said, there are cases where a cursor is probably acceptable, as you said, for one-off operations. The most common use I can think of is in a maintenance plan where you may be iterating through all the databases on a server executing various maintenance tasks. As long as you limit your usage and don't design whole applications around RBAR (row-by-agonizing-row) processing, you should be fine.

In general, cursors are a bad thing. However, in some cases it is more practical to use a cursor, and in some it is even faster. A good example is a cursor through a contact table sending emails based on some criteria. (Not to open up the question of whether sending an email from your DBMS is a good idea; let's just assume it is for the problem at hand.) There is no way to write that set-based. You could use some trickery to come up with a set-based solution that generates dynamic SQL, but a real set-based solution does not exist.
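A sketch of that pattern, assuming Database Mail is configured with a default profile and a hypothetical dbo.Contacts(Email, Subscribed) table:
-- Row-by-row by necessity: each row becomes one sp_send_dbmail call.
DECLARE @email NVARCHAR(320);

DECLARE c CURSOR
LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR SELECT Email FROM dbo.Contacts WHERE Subscribed = 1;

OPEN c;
FETCH NEXT FROM c INTO @email;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC msdb.dbo.sp_send_dbmail
        @recipients = @email,
        @subject    = N'Newsletter',
        @body       = N'Hello!';   -- assumes a default Database Mail profile
    FETCH NEXT FROM c INTO @email;
END
CLOSE c;
DEALLOCATE c;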
However, a calculation involving the previous row can be done using a self join. That is usually still faster than a cursor.
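A minimal sketch of that self-join, assuming a hypothetical dbo.Readings(ReadingID, Value) table with contiguous IDs (with gaps, number the rows with ROW_NUMBER() first):
-- Each row is joined to its predecessor; LEFT JOIN keeps the first row.
SELECT cur.ReadingID,
       cur.Value,
       cur.Value - prev.Value AS Delta
FROM dbo.Readings AS cur
LEFT JOIN dbo.Readings AS prev
  ON prev.ReadingID = cur.ReadingID - 1;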
In all cases you need to balance the effort involved in developing a faster solution. If nobody cares whether your process runs in one minute or in one hour, use whatever gets the job done quickest. If you are looping through a dataset that grows over time, like an [orders] table, try to stay away from a cursor if possible. If you are not sure, do a performance test comparing a cursor-based solution with a set-based one at several significantly different data sizes.

I had always disliked cursors because of their slow performance. However, I found I didn't fully understand the different types of cursors and that in certain instances, cursors are a viable solution.
When you have a business problem that can only be solved by processing one row at a time, then a cursor is appropriate.
So, to improve performance with a cursor, change the type of cursor you are using. Something I didn't know: if you don't specify which type of cursor you are declaring, you get the dynamic optimistic type by default, which is the slowest one because it does a lot of work under the hood. Declaring your cursor as a different type, say a static cursor, can give very good performance.
See these articles for a fuller explanation:
The Truth About Cursors: Part I
The Truth About Cursors: Part II
The Truth About Cursors: Part III
I think the biggest con of cursors is performance; not laying out a task in a set-based approach would probably rank second. Third would be readability and layout, as cursor-based tasks usually don't carry a lot of helpful comments.
SQL Server is optimized for the set-based approach. You write a query to return a result set, like a join on tables, and the SQL Server execution engine determines which physical join to use: a merge join, nested loops join, or hash join. It picks the best joining algorithm based upon the participating columns, data volume, indexing structure, and the distribution of values in those columns. So a set-based approach generally beats the procedural cursor approach on performance.
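One hedged way to see how much that choice matters is to force each physical join with a hint and compare the resulting plans (hypothetical tables; hints are for experimentation, not production):
-- Force a specific physical join to compare against the optimizer's pick.
SELECT o.OrderID, c.Name
FROM dbo.Orders AS o
INNER HASH JOIN dbo.Customers AS c
   ON c.CustomerID = o.CustomerID;
-- Swap HASH for LOOP or MERGE and compare the plans and costs.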

They are necessary for things like dynamic SQL pivoting, but you should try to avoid using them whenever possible.
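For what it's worth, the column list for a dynamic pivot can often be built without an explicit cursor; a minimal sketch with a hypothetical dbo.Sales(Region, Amount) table:
-- Build the IN (...) column list, then execute a dynamic PIVOT.
DECLARE @cols NVARCHAR(MAX), @sql NVARCHAR(MAX);

SELECT @cols = STUFF((SELECT DISTINCT N',' + QUOTENAME(Region)
                      FROM dbo.Sales
                      FOR XML PATH('')), 1, 1, N'');

SET @sql = N'SELECT ' + @cols + N'
FROM (SELECT Region, Amount FROM dbo.Sales) AS s
PIVOT (SUM(Amount) FOR Region IN (' + @cols + N')) AS p;';

EXEC sys.sp_executesql @sql;
(Though, per the answer above about FOR XML PATH, this concatenation is itself doing cursor-like work under the hood.)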

Related

Inline Queries: Is it more performant to run a bunch of little queries, or a single huge one?

For an app I am working on, the direction is that we use parameterized inline MS SQL queries to obtain data from our database.
One of the queries is rather huge and complex, so I'm considering breaking it out into multiple methods that each run a subquery and return the value as a string that gets incorporated into the super-query, which is then run to perform the operation.
But this got me thinking about performance: is this 'sub-query' approach more or less performant than keeping my query big and unified?
(Bear in mind, this is my first foray into database technology beyond simple PL/SQL and MS/SQL queries.)
Using many small queries instead of one big one is the single worst performance mistake you can make. See this Simple-Talk article on RBAR: row-by-agonizing-row.
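The classic contrast, sketched with hypothetical tables:
-- RBAR: one query per ID, issued from a loop in application code.
-- (Imagine this running N times, once per order.)
DECLARE @id INT = 1;
SELECT Amount FROM dbo.Orders WHERE OrderID = @id;

-- Set-based: all the rows, one round trip, one plan.
SELECT o.OrderID, o.Amount, c.Name
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerID = o.OrderID;
Each small query pays parse, plan, and round-trip overhead that the single query pays only once.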

Which is faster in SQL, While loop, Recursive Stored proc, or Cursor?

Which is faster in SQL: a WHILE loop, a recursive stored procedure, or a cursor?
I want to optimize the performance in a couple of spots in a stored procedure.
The code I'm optimizing formats some strings for output to a file.
I'll assume you are using SQL Server.
First of all, as someone already said, recursive stored procedures, while possible, are not a good idea in SQL Server because of the limited stack size. So any deeply recursive logic will break.
However, if you have two or three levels of nesting at most, you might try recursion, or a CTE, which is also a bit recursive (SQL Server 2005 and up). Once you manage to wrap your head around CTEs, they're an immensely useful technique.
I haven't measured, but I've never had performance issues in the few places where I've used CTEs.
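A minimal recursive CTE sketch, generating the numbers 1 through 10:
-- Anchor row plus a recursive member; shallow recursion like this is cheap.
WITH n AS
(
    SELECT 1 AS i
    UNION ALL
    SELECT i + 1 FROM n WHERE i < 10
)
SELECT i FROM n
OPTION (MAXRECURSION 10); -- guard against runaway recursion
Note the recursion limit: MAXRECURSION defaults to 100 and can be raised, but it underscores that deep recursion isn't the engine's sweet spot.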
Cursors, on the other hand, are big performance hogs, so I (and half the internet) would recommend not using them in code that is called often. But since cursors are a more classical programming structure, akin to a foreach in C#, some people find SQL that uses cursors for data manipulation easier to read, understand, and maintain than some convoluted multiple-inner-select SQL monstrosity, so it's not the worst idea to use them in code that is called only once in a while.
Speaking of WHILE loops: they also shift the mindset from set-based to procedural, so while a WHILE loop is relatively fast and does not consume many resources, it can still dramatically increase the number of data-manipulation statements you issue to the database itself.
To summarize, if I had to make a complex stored proc where performance is paramount, I'd try:
1. Using a set-based approach (inner selects, joins, unions and such)
2. Using CTEs (clear and manageable for an experienced user, a bit shady for a beginner)
3. Using control-flow statements (IF, WHILE...)
4. Using cursors (procedural code, easy to follow)
in that order.
If the code is used much less often, I'd probably move 3 and 4 before 1 and 2, but only for complex scenarios that use lots of tables and lots of relations. Of course, YMMV, so I'd test whatever procedure I make in a real-world scenario to actually measure the performance, because we can talk until we are blue in the face about what is fast and what is slow, but until you get real measurements there is no way to tell whether changes make things better or worse.
And do not forget: the code is only as fast as your data. There is no substitute for good indexing.
D) None of the above.
A set-based method will almost always be the fastest method. Without knowing what your actual code is (or a close approximation) it's hard to say whether or not that's possible or which method would be fastest.
Your best bet is to test all of the possible methods that you have and see which one is truly fastest.
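A simple way to do that comparison, for what it's worth:
-- Surface elapsed/CPU time and I/O counts for each candidate you run.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- ... run candidate query / loop / cursor here ...

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
Run each candidate against realistic data volumes; DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS between runs (on a test box only) keep caching from skewing the results.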
If you want to improve performance, you need to look at set-based operations; WHILE loops and cursors are basically the same thing. SQL works in sets; it is not a procedural language. Use it how it is intended to be used.
A recursive stored procedure is likely to be the slowest option, and WHILE loops and cursors are not mutually exclusive. Cursor operations are pretty quick (in my experience), but I've only ever used them from external (non-SQL) code. The other posters are correct: if you can do your processing in a set-oriented manner, you'll get the best performance.

Alternatives to MS SQL 2005 FullText Catalog

I can't seem to get acceptable performance from FullText Catalogs. We have situations where we must run 100k+ queries as quickly as possible. Some of the queries use FREETEXT, some don't. Here's an example of a query:
IF EXISTS (SELECT 1 FROM user_data d WHERE d.userid = @userid AND FREETEXT(*, @activities)) SET @match = 1
This can take between 3 and 15 seconds. I need it to be much faster: under 1 second if possible.
I like the "flexibility" of the full-text query in that it can search across multiple columns, and the syntax is pretty intuitive. I'd rather not use a LIKE statement because we want to be able to match words like "Writer" and "Writing".
I've tried some of the suggestions listed here http://msdn.microsoft.com/en-us/library/ms142560(SQL.90).aspx
We've got as much memory and CPU as we can afford; unfortunately, we can't put the catalogs on their own disk controllers.
I'm stumped and ready to explore other alternatives to FullText Queries. Is there anything else out there that gives that kind of "Writer"/"Writing" similar matches? Perhaps even something that uses the CLR?
Check out these alternatives, although I doubt they'll improve performance without isolating them onto separate hardware:
Which search technology to use with ASP.NET?
Lucene.Net and SQL Server
Due to the nature of FREETEXT, its performance is lower than CONTAINS, simply because it has to take into account less precise alternatives for the given keywords. CONTAINS can find "Writing" when you specify "Write", by the way; I'm not sure if you've checked whether CONTAINS will do the trick or not.
Also be sure to avoid IF statements in SQL, as they often lead to a complete recompilation of the execution plan for every query execution, which likely contributes to the poor performance you're seeing. I'm not sure how the IF statement is used, as it's likely inside a bigger piece of SQL. Try to merge the EXISTS query into that bigger piece of SQL: you can set the @match variable from within a SELECT statement, or get rid of the variable altogether and use the EXISTS clause as a predicate in the bigger query.
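Putting both suggestions together, a hedged sketch of the rewrite, reusing the question's table and variables (the DECLAREs stand in for the procedure's parameters):
-- CONTAINS with FORMSOF(INFLECTIONAL, ...) matches forms such as
-- "writes"/"wrote"/"writing", and the CASE assignment replaces the IF.
DECLARE @userid INT = 1, @match BIT = 0;

SELECT @match = CASE WHEN EXISTS
    (SELECT 1
     FROM user_data AS d
     WHERE d.userid = @userid
       AND CONTAINS(*, N'FORMSOF(INFLECTIONAL, write)'))
    THEN 1 ELSE 0 END;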
SQL is a set-oriented, interpreted language. Therefore, it's often faster to get rid of imperative programming constructs and use SQL's native set operators instead.
Perhaps https://github.com/MahyTim/LuceneNetSqlDirectory can help you; it allows you to store a Lucene.NET index in SQL Server.

Cursor versus WHILE loop - what are the advantages/disadvantages of cursors?

Is it a good idea to use a WHILE loop instead of a cursor?
What are the advantages/disadvantages of cursors?
I'm following this bit of advice:
[...] which is better: cursors or WHILE loops? Again, it really depends on your situation. I almost always use a cursor to loop through records when necessary. The cursor format is a little more intuitive for me and, since I just use the constructs to loop through the result set once, it makes sense to use the FAST_FORWARD cursor. Remember that the type of cursor you use will have a huge impact on the performance of your looping construct.
— Tim Chapman in Comparing cursor vs. WHILE loop performance in SQL Server 2008
The linked article contains simple examples of how to implement each approach.
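As a rough sketch of the two shapes being compared (iterating table names purely as an example):
-- 1) FAST_FORWARD cursor:
DECLARE @name SYSNAME;
DECLARE c CURSOR FAST_FORWARD
FOR SELECT name FROM sys.tables;
OPEN c;
FETCH NEXT FROM c INTO @name;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @name; -- per-row work goes here
    FETCH NEXT FROM c INTO @name;
END
CLOSE c;
DEALLOCATE c;

-- 2) WHILE loop over a keyed temp table (no DECLARE CURSOR, same idea;
--    reuses @name from above):
SELECT IDENTITY(INT, 1, 1) AS rn, name
INTO #rows
FROM sys.tables;

DECLARE @i INT = 1, @max INT;
SELECT @max = MAX(rn) FROM #rows;
WHILE @i <= @max
BEGIN
    SELECT @name = name FROM #rows WHERE rn = @i;
    PRINT @name; -- per-row work goes here
    SET @i += 1;
END
DROP TABLE #rows;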
Some of this depends on the DBMS, but generally:
Pros:
Outperform loops when it comes to row-by-row processing
Works reasonably well with large datasets
Cons:
Don't scale as well
Use more server resources
Increases load on tempdb
Can cause resource leaks if used incorrectly (e.g. OPEN without a corresponding CLOSE and DEALLOCATE)
I would ask you what you are doing with that cursor/while loop.
If you are updating or returning data, why don't you use a proper WHERE clause? I know people who would say you should never use cursors.

Ways to avoid eager spool operations on SQL Server

I have an ETL process that involves a stored procedure that makes heavy use of SELECT INTO statements (minimally logged and therefore faster, as they generate less log traffic). Of the batch of work that takes place in one particular stored procedure, several of the most expensive operations are eager spools that appear to just buffer the query results and then copy them into the table being created.
The MSDN documentation on eager spools is quite sparse. Does anyone have a deeper insight into whether these are really necessary (and under what circumstances)? I have a few theories that may or may not make sense, but no success in eliminating these from the queries.
The .sqlplan files are quite large (160kb) so I guess it's probably not reasonable to post them directly to a forum.
So, here are some theories that may be amenable to specific answers:
The query uses some UDFs for data transformation, such as parsing formatted dates. Does this data transformation necessitate the use of eager spools to allocate sensible types (e.g. varchar lengths) to the table before it constructs it?
As an extension of the question above, does anyone have a deeper view of what does or does not drive this operation in a query?
My understanding of spooling is that it's a bit of a red herring on your execution plan. Yes, it accounts for a lot of your query cost, but it's actually an optimization that SQL Server undertakes automatically so that it can avoid costly rescanning. If you were to avoid spooling, the cost of the execution tree it sits on would go up, and almost certainly the cost of the whole query would increase. I don't have any particular insight into what might cause the query optimizer to parse the execution that way, especially without seeing the SQL code, but you're probably better off trusting its behavior.
However, that doesn't mean your execution plan can't be optimized, depending on exactly what you're up to and how volatile your source data is. When you're doing a SELECT INTO, you'll often see spooling items on your execution plan, and it can be related to read isolation. If it's appropriate for your particular situation, you might try just lowering the transaction isolation level to something less costly, and/or using the NOLOCK hint. I've found in complicated performance-critical queries that NOLOCK, if safe and appropriate for your data, can vastly increase the speed of query execution even when there doesn't seem to be any reason it should.
In this situation, if you try READ UNCOMMITTED or the NOLOCK hint, you may be able to eliminate some of the Spools. (Obviously you don't want to do this if it's likely to land you in an inconsistent state, but everyone's data isolation requirements are different). The TOP operator and the OR operator can occasionally cause spooling, but I doubt you're doing any of those in an ETL process...
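If that's a direction you can take, the two forms look like this (only where dirty reads are acceptable; table names are hypothetical):
-- Session-wide:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Or per table, via hint:
SELECT s.Col1, s.Col2
INTO #staging
FROM dbo.SourceTable AS s WITH (NOLOCK);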
You're right in saying that your UDFs could also be the culprit. If you're only using each UDF once, it would be an interesting experiment to try putting them inline to see if you get a large performance benefit. (And if you can't figure out a way to write them inline with the query, that's probably why they might be causing spooling).
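For example, if a hypothetical scalar UDF such as dbo.ParseDate is called once per row, inlining the expression it wraps may remove both the spool and the per-row function overhead:
-- Instead of: SELECT dbo.ParseDate(s.RawDate) FROM dbo.Staging AS s;
SELECT CONVERT(datetime, s.RawDate, 103) AS ParsedDate -- 103 = dd/mm/yyyy
FROM dbo.Staging AS s;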
One last thing I would look at is that, if you're doing any joins that can be re-ordered, try using a hint to force the join order to happen in what you know to be the most selective order. That's a bit of a reach but it doesn't hurt to try it if you're already stuck optimizing.
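A sketch of that last idea (hypothetical tables again):
-- FORCE ORDER makes joins happen in the order written, so lead with
-- the most selective table if you're confident about the data.
SELECT f.ID, b.Payload
FROM dbo.FilteredSmall AS f
JOIN dbo.BigTable AS b ON b.ID = f.ID
OPTION (FORCE ORDER);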
