What are some valid cursor uses?

Short version
What are some (real) valid use cases of cursors?
Long version
A while ago, when the topic of cursors came up in my database class, I asked the teacher about valid use cases. She answered with a presentation-oriented use of cursors: you have a book database and you want to show all the books "grouped" by author, showing each author's name only once.
I am not convinced by this example, because it seems to me that this kind of presentation concern belongs not to the database but to the client (a UI that shows the data in a pretty way).
I have thought about valid use cases for cursors, but I can't find any that I couldn't express more clearly without them. It seems to me that using cursors makes you think in an "imperative" way rather than the "declarative, set-based" way you should be thinking most of the time.
So I would like to know when it's good to use cursors. I would prefer to hear about real use cases.

Generally speaking, you should go out of your way to avoid cursors, and try to use set-based standard SQL when you can.
One case where a cursor is actually required is if you want to return multiple result sets from one function call in PostgreSQL.
CREATE FUNCTION myfunc(refcursor, refcursor) RETURNS SETOF refcursor AS $$
BEGIN
    OPEN $1 FOR SELECT * FROM table_1;
    RETURN NEXT $1;
    OPEN $2 FOR SELECT * FROM table_2;
    RETURN NEXT $2;
END;
$$ LANGUAGE plpgsql;
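The caller then opens a transaction, calls the function, and fetches from each returned cursor. A usage sketch (the cursor names 'a' and 'b' are arbitrary):

BEGIN;
SELECT * FROM myfunc('a', 'b');
FETCH ALL IN "a";   -- first result set
FETCH ALL IN "b";   -- second result set
COMMIT;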

Related

Postgres bulk insert/update that's injection-safe. Perhaps a function that takes an array? [duplicate]

I'm working on paying back some technical debt this week, and it hit me that I have no idea how to make multi-value inserts safe from accidental or malicious SQL injection. We're on Postgres 11.4. I've got a test bed to work from that includes a small table with about 26K rows; here's the declaration of the small table I'm using for testing:
BEGIN;
DROP TABLE IF EXISTS "data"."item" CASCADE;
CREATE TABLE IF NOT EXISTS "data"."item" (
    "id" uuid NOT NULL DEFAULT NULL,
    "marked_for_deletion" boolean NOT NULL DEFAULT false,
    "name_" citext NOT NULL DEFAULT NULL,
    CONSTRAINT item_id_pkey
        PRIMARY KEY ("id")
);
CREATE INDEX item_marked_for_deletion_ix_bgin ON "data"."item" USING GIN("marked_for_deletion") WHERE marked_for_deletion = true;
ALTER TABLE "data"."item" OWNER TO "user_change_structure";
COMMIT;
I've been inserting to this table, and many others, using multi-value inserts, along the lines of:
BEGIN;
INSERT
bundle up hundreds or thousands of rows
ON CONFLICT do what I need
COMMIT or ROLLBACK on the client side
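For concreteness, here's a hedged sketch of what one of those multi-value upserts looks like against the item table above (the values are just examples):

BEGIN;
INSERT INTO data.item (id, marked_for_deletion, name_)
VALUES
    ('2f888809-2777-524b-abb7-13df413440f5', true,  'Salad fork'),
    ('f2924dda-8e63-264b-be55-2f366d9c3caa', false, 'Melon baller')
    -- ...hundreds or thousands of rows...
ON CONFLICT (id) DO UPDATE SET
    marked_for_deletion = EXCLUDED.marked_for_deletion,
    name_ = EXCLUDED.name_;
COMMIT; -- or ROLLBACK on the client side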
Works fine. But how do you make a multi-value statement safe? That's what I can't figure out. This is one of those areas where I can't reason about the problem well. I don't have the appetite, aptitude, or patience for hacking things. That I can't think up an exploit means nothing; I would suck as a hacker. And, for that matter, I'm generally more concerned about errors than evil in code, since I run into errors a whole lot more often.
The standard advice I see for safe insertion is to use a prepared statement. A prepared statement for an INSERT is pretty much a temporary, runtime function for interpolation on a code template. For me, it's simpler to write an actual function, like this one:
DROP FUNCTION IF EXISTS data.item_insert_s (uuid, boolean, citext);

CREATE OR REPLACE FUNCTION data.item_insert_s (uuid, boolean, citext)
RETURNS int
AS $$
    INSERT INTO item (
        id,
        marked_for_deletion,
        name_)
    VALUES
        ($1, $2, $3)
    ON CONFLICT (id) DO UPDATE SET
        marked_for_deletion = EXCLUDED.marked_for_deletion,
        name_ = EXCLUDED.name_;

    SELECT 1; -- No clue what to return, but you have to return something.
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_s(uuid, boolean, citext) OWNER TO user_bender;
All of that works, and I've tried some timing tests. I truncate the table, do a multi-value insert, truncate, do a series of function call inserts, and see what the difference is. I've tried multiple runs, doing the operations in different orders, etc. Both cases use a BEGIN/COMMIT block in the same way, so I'll end up with the same number of transactions on either test. The results vary more across tests than within them, but the multi-value insert is always faster. Congratulations to me for confirming the obvious.
Is there a way to safely do bulk inserts and updates? It occurred to me that I could write a function that takes an array of arrays, parses it out, and runs the code in a loop within the function. I'd like to test that out, but I get flummoxed by the Postgres array syntax. I've looked around, and it sounds like an array of objects and a FOREACH loop might be just what I'm after. This topic has been addressed before, but I haven't found a straightforward example of how to prepare data for insertion and then unpack it. I suspect that I won't be able to use plain SQL and a simple unnest() because 1) I want to make the inputs safe and 2) I might have functions that don't take all of the fields in a table as their input.
To make things a bit easier, I'm fine with functions with fixed parameter lists, and array inputs with fixed formats. I'll write code generators for my various tables, so I don't need to make the Postgres-side code any more complex than necessary.
Thanks for any help!
Note: I got a message to explain why this question is different than my newer, related question:
Improving a function that UPSERTs based on an input array
Answer: Yes, it's the same starting point. In this question, I was asking about SQL injection, in the second question I was trying to focus on the array-input solution. I'm not quite sure when to split out new questions, and when to let questions turn into multi-part threads.
It's morning here on the Far South Coast of NSW, and I figured I'd take another crack at this. I should have mentioned before that our deployment environment is RDS, which makes COPY less appealing. But the idea of passing in an array where each element includes the row data is very appealing. It's much like a multi-value INSERT, but with different syntactic sugar. I've poked at arrays in Postgres a bit, and always come away befuddled by the syntax. I found a few really excellent threads with lots of details from some top posters to study:
https://dba.stackexchange.com/questions/224785/pass-array-of-mixed-type-into-stored-function
https://dba.stackexchange.com/questions/131505/use-array-of-composite-type-as-function-parameter-and-access-it
https://dba.stackexchange.com/questions/225176/how-to-pass-an-array-to-a-plpgsql-function-with-variadic-parameter/
From there, I've got a working test function:
DROP FUNCTION IF EXISTS data.item_insert_array (item[]);

CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item[])
RETURNS int
AS $$
    INSERT INTO item (
        id,
        marked_for_deletion,
        name_)
    SELECT
        d.id,
        d.marked_for_deletion,
        d.name_
    FROM unnest(data_in) AS d
    ON CONFLICT (id) DO UPDATE SET
        marked_for_deletion = EXCLUDED.marked_for_deletion,
        name_ = EXCLUDED.name_;

    SELECT cardinality(data_in); -- array_length() needs a dimension argument, e.g. array_length(data_in, 1); cardinality() doesn't. ¯\_(ツ)_/¯
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_array(item[]) OWNER TO user_bender;
To close the circle, here's an example of some input:
select * from item_insert_array(
    array[
        ('2f888809-2777-524b-abb7-13df413440f5', true,  'Salad fork'),
        ('f2924dda-8e63-264b-be55-2f366d9c3caa', false, 'Melon baller'),
        ('d9ecd18d-34fd-5548-90ea-0183a72de849', true,  'Fondue fork')
    ]::item[]
);
Going back to my test results, this performs roughly as well as my original multi-value insert. The other two methods I posted originally are, let's say, 4x slower. (The results are pretty erratic, but they're always a lot slower.) But I'm still left with my original question:
Is this injection safe?
If not, I guess I need to rewrite it in PL/pgSQL with a FOREACH loop and EXECUTE ... USING or format() to get the injection-cleaning text processing/interpolation features there. Does anyone know?
I have a lot of other questions about this function (Should it be a procedure so that I can manage the transaction? How do I make the input anyarray? What would be a sensible result to return?) But I think I'll have to pursue those as their own questions.
Thanks for any help!

Performance of Postgresql stored procedures/functions in a multi-tenant environment that has one db with many schemata (one for each tenant)

I am new to PostgreSQL and I am trying to figure out some details about stored procedures (which I think are actually called functions in PostgreSQL) when used in a multi-schema environment.
The application I have in mind involves a multi-tenant DB design where one schema is used for each tenant and all schemata, which have the same table structure and names, are part of the same database. As far as I know from DBs in general, stored procedures/functions are pre-compiled and therefore faster, so I would like to use them for performing operations on each schema's tables by sending the required parameters from the application server instead of sending a list of SQL commands. In addition, I would like to have a SINGLE set of functions that implement all the SELECT (including JOIN), INSERT, UPDATE, etc. operations on the tables of each schema. This will allow me to easily make changes in each function and avoid SQL code replication and redundancy. As I found out, it is possible to create a set of functions in a schema s0 and then create s1, s2, ... schemata (all having the same tables) that use these functions.
For example, I can create a template schema named s0 (identical to all others) and create an SQL or PL/pgSQL function that belongs to this schema and contains operations on the schema's tables. In this function, the table names are written without the schema prefix, i.e.
first_table and not s0.first_table
An example function could be:
CREATE FUNCTION sel() RETURNS BIGINT
AS 'SELECT count(a) from first_table;'
LANGUAGE SQL;
As I have tested, this function works well. If I move to schema s1 by entering:
set search_path to s1;
and then call the function again, the function acts upon s1 schema's identically named table first_table.
The function could also take a path parameter so that it can be called with a schema name, together with a command that changes the search_path, similar to this:
CREATE OR REPLACE FUNCTION doboth(path TEXT, is_local BOOLEAN DEFAULT false) RETURNS BIGINT AS $$
SELECT set_config('search_path', regexp_replace(path, '[^\w ,]', '', 'g'), is_local);
SELECT count(a) from first_table;
$$ LANGUAGE sql;
as shown in the proposed solution in PostgreSQL: how do I set the search_path from inside a function?
However, when I tried this and called the function for a schema, I noticed that the second SELECT of the function was executed before the first, which led to the second SELECT running against the wrong schema! This was really unexpected. Does anybody know the explanation for this behavior?
In order to bypass this issue, I created a plpgsql function that does the same thing and it worked without any execution order issues:
CREATE OR REPLACE FUNCTION doboth(path TEXT, is_local BOOLEAN DEFAULT false) RETURNS BIGINT AS $$
DECLARE result BIGINT;
BEGIN
PERFORM set_config('search_path', regexp_replace(path, '[^\w ,]', '', 'g'), is_local);
SELECT count(a) from first_table INTO result;
RETURN result;
END
$$ LANGUAGE plpgsql;
So, now some questions, this time about performance:
1) Apart from a) keeping the selection of the schema and the specified operation on that schema in one transaction, which is necessary for my multi-tenant implementation, and b) bundling SQL commands together and avoiding some extra data exchange between the application server and the DB server, which is beneficial, do PostgreSQL functions have any performance benefits over executing the same code as separate SQL commands?
2) In the described multi-tenant scenario with many schemata and one DB, does a function that is defined once and called against any schema identical to the one it was defined in lose any of its performance benefits (if any)?
3) Is there any difference in performance between an SQL function and a PL/pgSQL function that contains the same operations?
Before I answer your questions, a remark about your SQL function.
It does not fail because the statements are executed in a wrong order, but because both queries are parsed before the first one is executed. The error message you get is somewhat like
ERROR: relation "first_table" does not exist
[...]
CONTEXT: SQL function "doboth" during startup
Note the “during startup”.
Answers
You may experience a slight performance boost, particularly if the SQL statements are complicated, because the plans of SQL statements in a PL/pgSQL function are cached for the duration of a database session or until they are invalidated.
If the plan for the query is cached by the PL/pgSQL function, but the SQL statement calling the function has to be planned every time, you might actually be worse off from a performance angle because of the overhead of executing the function.
Whenever you call the function with a different schema name, the query plan will be invalidated and has to be created anew. So if you change the schema name for every invocation, you won't gain anything.
SQL functions don't cache query plans, so they don't perform better than the plain SQL query.
Note, however, that the gains from caching simple SQL statements in functions are not tremendous.
Use functions that just act as containers for SQL statements only if it makes life simpler for you; otherwise use plain SQL.
Do not focus only on performance during design; focus on a good architecture and a simple design.
If the same statements keep repeating over and over, you might gain more performance using prepared statements than using functions.
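For reference, a minimal sketch of the prepared-statement alternative mentioned above (assuming first_table is on the current search_path):

PREPARE count_first AS
    SELECT count(a) FROM first_table;

EXECUTE count_first;   -- parsed once per session; the plan can be cached and reused

DEALLOCATE count_first;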
Firstly, I do not really believe there can be any issues with line execution order in functions. If you have issues, it's your code not working, not Postgres.
Secondly, multi-tenant behavior is well implemented with set search_path to s1, s0;. There is usually no need to switch anything inside procedures.
Thirdly, there are no performance benefits to using stored procedures except for minimizing data flows between the DB and the application. If you consider a query like SELECT count(*) FROM mytable WHERE somecolumn = $1, there is absolutely nothing you can optimize before you know the value of $1.
And finally, no, there is no significant difference between functions in SQL and PL/pgSQL. Most of the time is still spent reading through tables, so focus on perfecting that.
Hope that clarifies the situation. Also, you may want to consider the security benefits of stored procedures. Just a hint.
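Picking up the search_path suggestion above, a hedged sketch of how the tenant schema can be selected outside the functions (the role name tenant1_user is invented):

SET search_path TO s1, s0;                          -- for the current session only
ALTER ROLE tenant1_user SET search_path TO s1, s0;  -- persisted default for that login role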

Which executes faster: a simple query in a stored procedure, or the same query written inline?

Given one simple query written in a stored procedure and the same query written inline, which will execute faster in SQL Server?
Someone on an interview panel asked me this question. I said the stored procedure, my reasoning being that a procedure is compiled, but he said I was wrong.
Please explain.
I suppose that a "simple query" means some read-only code.
A VIEW is fully inlineable and precompiled. The biggest advantage is that you can bind it into a bigger statement with joins and filters, and this will - in most cases - be able to use indexes and statistics.
A table-valued function (TVF) in the "new" syntax (without BEGIN and END) is very similar to a VIEW, but its parameters and their handling are precompiled too. The advantages of the VIEW apply here as well.
A UDF returning a table ("old" syntax) is in most cases something one should not do. The biggest disadvantage is that the optimizer cannot estimate the result and will treat it as a one-row table, which is - in most cases - really bad...
A stored procedure that does nothing more than a VIEW or a TVF could do is a pain in the neck - at least in my eyes. I know there are other opinions... The biggest drawback: whenever you want to continue working with the returned result set, you have to insert it into a table (or a declared table variable). Further joins or filters against this new table will miss indexes and statistics. A simple "Hey SP, give me your result!" might be fast, but everything after that call is dog slow.
So my conclusion: use SPs when there is something to do, and use a VIEW or TVF when there is something to read.
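To make the "new"-syntax TVF point above concrete, a hedged sketch (the table and column names are invented):

CREATE FUNCTION dbo.OrdersByCustomer (@CustomerId INT)
RETURNS TABLE
AS
RETURN
(
    SELECT OrderId, OrderDate, TotalDue
    FROM dbo.Orders
    WHERE CustomerId = @CustomerId
);
GO

-- Because it is inlined, it can be joined and filtered like a VIEW,
-- and the outer predicates can still use indexes and statistics:
SELECT c.Name, o.OrderId, o.TotalDue
FROM dbo.Customers AS c
CROSS APPLY dbo.OrdersByCustomer(c.CustomerId) AS o;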

Why is it so difficult to do a loop in T-SQL

OK, I know it can be done, and I do it quite often, but why is it so difficult to do a loop in T-SQL? I can think of a ton of reasons I'd want to parse through a query result set and do something that simply can't be done without a loop, yet the code to set up and execute my loop is > 20 lines.
I'm sure others have similar opinions, so why are we still without a simple way to perform a loop?
An aside: we finally got an UPSERT (aka MERGE) in SQL2008 so maybe all hope isn't lost.
SQL is a set-based, declarative language; not a procedural or imperative language. T-SQL tries to straddle the two, but it's still built on a fundamentally set-based paradigm.
I can think of a ton of reasons I'd want to parse thru a query result set and do something that simply can't be done without a loop
And for the vast majority of those I can either show you how to do it in a set-based operation instead or explain why it should be done in your client code rather than on the database. Needing to do a loop in SQL is exceedingly rare.
T-SQL is not designed to be an imperative language. It's designed to be declarative. Its declarative nature allows the optimizer to slice up the various tasks and run them in parallel, and in other ways do things in the order that is most efficient.
Because SQL is a set-based language. The power of SQL is in finding a smaller group within a larger group of data based on specific characteristics. For this task, looping is largely unnecessary. Obviously it's been added for the convenience of handling some situations, but the intended use of the language makes this feature mostly irrelevant.
Almost everything can be done set-based; try using a numbers table.
Why 20 lines? This is all you need:
select *, identity(int, 1, 1) as Someid into #temp
from sysobjects

declare @id int, @MaxId int
select @id = 1, @MaxId = max(Someid) from #temp

while @id < @MaxId
begin
    -- do your stuff here
    print @id
    set @id = @id + 1
end
it depends what you want to do in a loop. using a while loop is not difficult at all:
declare @i int
set @i = 20

while @i > 0 begin
    ... do some stuff
    set @i = @i - 1
end
It only becomes cumbersome when using cursors, which should be avoided anyway.
You might try using user defined functions to do most of the work instead of taking a loop based approach. This would preserve the intention of the SQL language which is set based.
SQL is a SET-based system, not a procedural (loop) one. Generally it's regarded as bad practice to use loops in SQL because they perform poorly compared to their set-based equivalents.
WHILE is the most common looping structure; CURSORS can also be used, but they have their own problems (forgetting to deallocate/close).
...an example of WHILE (you may not need it but others may)
DECLARE @iterator INT
SET @iterator = 0

WHILE @iterator < 20
BEGIN
    SELECT * FROM table WHERE rowKey = @iterator
    /* do stuff */
    SET @iterator = @iterator + 1
END
The real question is "What is it that you are trying to do that simply cannot be done in a set based way?"
I'm not an expert in DBs, but I believe the atomic nature of database transactions would make loops difficult to achieve, because a transaction must either complete or not occur at all. Maintaining state can be pesky!
Wikipedia Article on Atomicity

Why is it considered bad practice to use cursors in SQL Server?

I knew of some performance reasons back in the SQL 7 days, but do the same issues still exist in SQL Server 2005? If I have a resultset in a stored procedure that I want to act upon individually, are cursors still a bad choice? If so, why?
Because cursors take up memory and create locks.
What you are really doing is attempting to force set-based technology into non-set based functionality. And, in all fairness, I should point out that cursors do have a use, but they are frowned upon because many folks who are not used to using set-based solutions use cursors instead of figuring out the set-based solution.
But, when you open a cursor, you are basically loading those rows into memory and locking them, creating potential blocks. Then, as you cycle through the cursor, you are making changes to other tables and still keeping all of the memory and locks of the cursor open.
All of which has the potential to cause performance issues for other users.
So, as a general rule, cursors are frowned upon. Especially if that's the first solution arrived at in solving a problem.
The above comments about SQL being a set-based environment are all true. However, there are times when row-by-row operations are useful. Consider a combination of metadata and dynamic SQL.
As a very simple example, say I have 100+ records in a table that define the names of tables that I want to copy/truncate/whatever. Which is best? Hardcoding the SQL to do what I need? Or iterating through that result set and using dynamic SQL (sp_executesql) to perform the operations?
There is no way to achieve the above objective using set-based SQL.
So, to use cursors or a while loop (pseudo-cursors)?
SQL Cursors are fine as long as you use the correct options:
INSENSITIVE will make a temporary copy of your result set (saving you from having to do this yourself for your pseudo-cursor).
READ_ONLY will make sure no locks are held on the underlying result set. Changes in the underlying result set will be reflected in subsequent fetches (same as if getting TOP 1 from your pseudo-cursor).
FAST_FORWARD will create an optimised forward-only, read-only cursor.
Read about the available options before ruling all cursors as evil.
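As an illustration of those options, a minimal FAST_FORWARD cursor loop (the per-row work here is just a PRINT):

DECLARE @Name sysname;

DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT name FROM sys.tables;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @Name;

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @Name;                              -- do the per-row work here
    FETCH NEXT FROM table_cursor INTO @Name;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;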
There is a workaround for cursors that I use whenever I need one.
I create a table variable with an identity column in it,
insert all the data I need to work with into it,
and then make a WHILE block with a counter variable, selecting the data I want from the table variable with a SELECT statement where the identity column matches the counter.
This way I don't lock anything and use a lot less memory, and it's safe: I won't lose anything to memory corruption or something like that.
And the block of code is easy to see and handle.
This is a simple example:
DECLARE @TAB TABLE(ID INT IDENTITY, COLUMN1 VARCHAR(10), COLUMN2 VARCHAR(10))

DECLARE @COUNT INT,
        @MAX INT,
        @CONCAT VARCHAR(MAX),
        @COLUMN1 VARCHAR(10),
        @COLUMN2 VARCHAR(10)

SET @COUNT = 1

INSERT INTO @TAB VALUES('TE1S', 'TE21')
INSERT INTO @TAB VALUES('TE1S', 'TE22')
INSERT INTO @TAB VALUES('TE1S', 'TE23')
INSERT INTO @TAB VALUES('TE1S', 'TE24')
INSERT INTO @TAB VALUES('TE1S', 'TE25')

SELECT @MAX = @@IDENTITY

WHILE @COUNT <= @MAX BEGIN
    SELECT @COLUMN1 = COLUMN1, @COLUMN2 = COLUMN2 FROM @TAB WHERE ID = @COUNT

    IF @CONCAT IS NULL BEGIN
        SET @CONCAT = ''
    END ELSE BEGIN
        SET @CONCAT = @CONCAT + ','
    END

    SET @CONCAT = @CONCAT + @COLUMN1 + @COLUMN2

    SET @COUNT = @COUNT + 1
END

SELECT @CONCAT
I think cursors get a bad name because SQL newbies discover them and think "Hey a for loop! I know how to use those!" and then they continue to use them for everything.
If you use them for what they're designed for, I can't find fault with that.
SQL is a set based language--that's what it does best.
I think cursors are still a bad choice unless you understand enough about them to justify their use in limited circumstances.
Another reason I don't like cursors is clarity. The cursor block is so ugly that it's difficult to use in a clear and effective way.
All that having been said, there are some cases where a cursor really is best--they just aren't usually the cases that beginners want to use them for.
Cursors are usually not the disease, but a symptom of it: not using the set-based approach (as mentioned in the other answers).
Not understanding this problem, and simply believing that avoiding the "evil" cursor will solve it, can make things worse.
For example, replacing cursor iteration with other iterative code, such as moving data to temporary tables or table variables, in order to loop over the rows with something like:
SELECT * FROM #temptable WHERE Id = @counter
or
SELECT TOP 1 * FROM #temptable WHERE Id > @lastId
Such an approach, as shown in the code of another answer, makes things much worse and doesn't fix the original problem. It's an anti-pattern called cargo cult programming: not knowing WHY something is bad and thus implementing something worse to avoid it! I recently changed such code (using a #temptable and no index on the identity/PK) back to a cursor, and updating slightly more than 10,000 rows took only 1 second instead of almost 3 minutes. That still lacks a set-based approach (the cursor being the lesser evil), but it was the best I could do at that moment.
Another symptom of this lack of understanding can be what I sometimes call "one object disease": database applications which handle single objects through data access layers or object-relational mappers. Typically code like:
var items = new List<Item>();
foreach (int oneId in itemIds)
{
    items.Add(dataAccess.GetItemById(oneId));
}
instead of
var items = dataAccess.GetItemsByIds(itemIds);
The first will usually flood the database with tons of SELECTs, one round trip for each, especially when object trees/graphs come into play and the infamous SELECT N+1 problem strikes.
This is the application side of not understanding relational databases and set based approach, just the same way cursors are when using procedural database code, like T-SQL or PL/SQL!
Sometimes the nature of the processing you need to perform requires cursors, though for performance reasons it's always better to write the operation(s) using set-based logic if possible.
I wouldn't call it "bad practice" to use cursors, but they do consume more resources on the server (than an equivalent set-based approach) and more often than not they aren't necessary. Given that, my advice would be to consider other options before resorting to a cursor.
There are several types of cursors (forward-only, static, keyset, dynamic). Each one has different performance characteristics and associated overhead. Make sure you use the correct cursor type for your operation. Forward-only is the default.
One argument for using a cursor is when you need to process and update individual rows, especially for a dataset that doesn't have a good unique key. In that case you can use the FOR UPDATE clause when declaring the cursor and process updates with UPDATE ... WHERE CURRENT OF.
Note that "server-side" cursors used to be popular (from ODBC and OLE DB), but ADO.NET does not support them, and AFAIK never will.
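A hedged sketch of the FOR UPDATE / WHERE CURRENT OF pattern mentioned above (the table and column names are invented):

DECLARE @Price money;

DECLARE price_cursor CURSOR LOCAL FOR
    SELECT Price FROM dbo.Products
    FOR UPDATE OF Price;

OPEN price_cursor;
FETCH NEXT FROM price_cursor INTO @Price;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @Price < 1.00
        UPDATE dbo.Products
        SET Price = 1.00
        WHERE CURRENT OF price_cursor;        -- update the row the cursor is positioned on

    FETCH NEXT FROM price_cursor INTO @Price;
END

CLOSE price_cursor;
DEALLOCATE price_cursor;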
There are very, very few cases where the use of a cursor is justified. There are almost no cases where it will outperform a relational, set-based query. Sometimes it is easier for a programmer to think in terms of loops, but using set logic, for example to update a large number of rows in a table, will result in a solution that is not only many fewer lines of SQL code, but that runs much faster, often several orders of magnitude faster.
Even the fast-forward cursor in SQL Server 2005 can't compete with set-based queries. The performance degradation curve often starts to look like an n^2 operation compared to set-based approaches, which tend to stay closer to linear as the data set grows very large.
@Daniel P -> you don't need to use a cursor to do it. You can easily do it set-based. E.g., with SQL 2008:
DECLARE @commandname NVARCHAR(1000) = '';

SELECT @commandname += 'truncate table ' + tablename + '; '
FROM tableNames;

EXEC sp_executesql @commandname;
This will simply do what you described above. And you can do the same with SQL 2000, but the syntax of the query would be different.
However, my advice is to avoid cursors as much as possible.
Gayam
Cursors do have their place; however, I think their bad reputation comes mainly from the fact that they are often used when a single SELECT statement would suffice to provide aggregation and filtering of results.
Avoiding cursors allows SQL Server to more fully optimize the performance of the query, very important in larger systems.
The basic issue, I think, is that databases are designed and tuned for set-based operations -- selects, updates, and deletes of large amounts of data in a single quick step based on relations in the data.
In-memory software, on the other hand, is designed for individual operations, so looping over a set of data and potentially performing different operations on each item serially is what it is best at.
Looping is not what the database or storage architecture is designed for, and even in SQL Server 2005, you are not going to get performance anywhere close to what you get if you pull the basic data set out into a custom program and do the looping in memory, using data objects/structures that are as lightweight as possible.
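As a tiny illustration of the set-based style these answers keep recommending, one statement replaces an entire loop (the table and column names are invented):

UPDATE dbo.Orders
SET    Status = 'archived'
WHERE  OrderDate < DATEADD(year, -2, GETDATE());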
