Postgres bulk insert/update that's injection-safe. Perhaps a function that takes an array? [duplicate] - arrays

This question already has answers here:
Improving a function that UPSERTs based on an input array
(2 answers)
Closed 3 years ago.
I'm working on paying back some technical debt this week, and it hit me that I have no idea how to make multi-value inserts safe from accidental or malicious SQL injection. We're on Postgres 11.4. I've got a test bed to work from that includes a small table with about 26K rows; here's the declaration of the table I'm using for testing:
BEGIN;
DROP TABLE IF EXISTS "data"."item" CASCADE;
CREATE TABLE IF NOT EXISTS "data"."item" (
    "id" uuid NOT NULL DEFAULT NULL,
    "marked_for_deletion" boolean NOT NULL DEFAULT false,
    "name_" citext NOT NULL DEFAULT NULL,
    CONSTRAINT item_id_pkey
        PRIMARY KEY ("id")
);
CREATE INDEX item_marked_for_deletion_ix_bgin ON "data"."item" USING GIN("marked_for_deletion") WHERE marked_for_deletion = true;
ALTER TABLE "data"."item" OWNER TO "user_change_structure";
COMMIT;
I've been inserting to this table, and many others, using multi-value inserts, along the lines of:
BEGIN;
INSERT
bundle up hundreds or thousands of rows
ON CONFLICT do what I need
COMMIT or ROLLBACK on the client side
Works fine. But how do you make a multi-value statement safe? That's what I can't figure out. This is one of those areas where I can't reason about the problem well. I don't have the appetite, aptitude, or patience for hacking things. That I can't think up an exploit means nothing, I would suck as a hacker. And, for that matter, I'm generally more concerned about errors than evil in code, since I run into errors a whole lot more often.
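For reference, one batch of that pattern against the item table above looks roughly like this (values shown inline only to illustrate the shape; in real code they're assembled on the client side, which is exactly the exposure I'm worried about):
BEGIN;
INSERT INTO data.item (id, marked_for_deletion, name_)
VALUES
    ('2f888809-2777-524b-abb7-13df413440f5', true, 'Salad fork'),
    ('f2924dda-8e63-264b-be55-2f366d9c3caa', false, 'Melon baller'),
    ('d9ecd18d-34fd-5548-90ea-0183a72de849', true, 'Fondue fork')
ON CONFLICT (id) DO UPDATE SET
    marked_for_deletion = EXCLUDED.marked_for_deletion,
    name_ = EXCLUDED.name_;
COMMIT;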
The standard advice I see for safe insertion is to use a prepared statement. A prepared statement for an INSERT is pretty much a temporary, runtime function for interpolation on a code template. For me, it's simpler to write an actual function, like this one:
DROP FUNCTION IF EXISTS data.item_insert_s (uuid, boolean, citext);
CREATE OR REPLACE FUNCTION data.item_insert_s (uuid, boolean, citext)
RETURNS int
AS $$
    INSERT INTO item (
        id,
        marked_for_deletion,
        name_)
    VALUES
        ($1, $2, $3)
    ON CONFLICT (id) DO UPDATE SET
        marked_for_deletion = EXCLUDED.marked_for_deletion,
        name_ = EXCLUDED.name_;

    SELECT 1; -- No clue what to return, but you have to return something.
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_s(uuid, boolean, citext) OWNER TO user_bender;
All of that works, and I've tried some timing tests. I truncate the table, do a multi-value insert, truncate, do a series of function call inserts, and see what the difference is. I've tried multiple runs, doing the operations in different orders, etc. Both cases use a BEGIN/COMMIT block in the same way, so I'll end up with the same number of transactions on either test. The results vary more across tests than within them, but the multi-value insert is always faster. Congratulations to me for confirming the obvious.
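For completeness, the prepared-statement route mentioned above looks roughly like this at the SQL level (a sketch, with a made-up statement name; in practice the client driver issues the PREPARE/EXECUTE and binds the values, so nothing gets pasted into the SQL text):
PREPARE item_upsert (uuid, boolean, citext) AS
INSERT INTO data.item (id, marked_for_deletion, name_)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO UPDATE SET
    marked_for_deletion = EXCLUDED.marked_for_deletion,
    name_ = EXCLUDED.name_;

EXECUTE item_upsert('2f888809-2777-524b-abb7-13df413440f5', true, 'Salad fork');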
Is there a way to safely do bulk inserts and updates? It occurred to me that I could write a function that takes an array (or arrays), parses it out, and runs the code in a loop within the function. I'd like to test that out, but I get flummoxed by the Postgres array syntax. From what I've read, it sounds like an array of objects and a FOREACH loop might be just what I'm after. This is a topic that has been addressed before, but I haven't found a straightforward example of how to prepare data for insertion, and of the unpacking of it. I suspect I won't be able to use plain SQL and unnest() because 1) I want to keep the inputs safe and 2) I might have functions that don't take all of the fields in a table as their input.
To make things a bit easier, I'm fine with functions with fixed parameter lists, and array inputs with fixed formats. I'll write code generators for my various tables, so I don't need to make the Postgres-side code any more complex than necessary.
Thanks for any help!
Note: I got a message to explain why this question is different than my newer, related question:
Improving a function that UPSERTs based on an input array
Answer: Yes, it's the same starting point. In this question, I was asking about SQL injection, in the second question I was trying to focus on the array-input solution. I'm not quite sure when to split out new questions, and when to let questions turn into multi-part threads.

It's morning here on the Far South Coast of NSW, and I figured I'd take another crack at this. I should have mentioned before that our deployment environment is RDS, which makes COPY less appealing. But the idea of passing in an array where each element includes the row data is very appealing. It's much like a multi-value INSERT, but with different syntactic sugar. I've poked at arrays in Postgres a bit, and always come away befuddled by the syntax. I found a few really excellent threads with lots of details from some top posters to study:
https://dba.stackexchange.com/questions/224785/pass-array-of-mixed-type-into-stored-function
https://dba.stackexchange.com/questions/131505/use-array-of-composite-type-as-function-parameter-and-access-it
https://dba.stackexchange.com/questions/225176/how-to-pass-an-array-to-a-plpgsql-function-with-variadic-parameter/
From there, I've got a working test function:
DROP FUNCTION IF EXISTS data.item_insert_array (item[]);
CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item[])
RETURNS int
AS $$
    INSERT INTO item (
        id,
        marked_for_deletion,
        name_)
    SELECT
        d.id,
        d.marked_for_deletion,
        d.name_
    FROM unnest(data_in) d
    ON CONFLICT (id) DO UPDATE SET
        marked_for_deletion = EXCLUDED.marked_for_deletion,
        name_ = EXCLUDED.name_;

    SELECT cardinality(data_in); -- array_length() doesn't work. ¯\_(ツ)_/¯
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_array(item[]) OWNER TO user_bender;
To close the circle, here's an example of some input:
select * from item_insert_array(
array[
('2f888809-2777-524b-abb7-13df413440f5',true,'Salad fork'),
('f2924dda-8e63-264b-be55-2f366d9c3caa',false,'Melon baller'),
('d9ecd18d-34fd-5548-90ea-0183a72de849',true,'Fondue fork')
]::item[]
);
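For what it's worth, the way I'd expect a client to call this safely is to send the whole array as a single bound parameter rather than building the ARRAY[...] constructor as text, roughly like this (the text form in the comment is my best reading of the composite-array literal syntax):
SELECT * FROM data.item_insert_array($1::item[]);
-- $1 is bound by the driver; as text, the bound value would look something like:
-- {"(2f888809-2777-524b-abb7-13df413440f5,t,\"Salad fork\")","(f2924dda-8e63-264b-be55-2f366d9c3caa,f,\"Melon baller\")"}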
Going back to my test results, this performs roughly as well as my original multi-value insert. The other two methods I posted originally are, let's say, 4x slower. (The results are pretty erratic, but they're always a lot slower.) But I'm still left with my original question:
Is this injection safe?
If not, I guess I need to rewrite it in PL/pgSQL with a FOREACH loop and EXECUTE ... USING or format() to get the injection-proofing text processing/interpolation features there. Does anyone know?
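In case it comes to that, here is a rough, untested sketch of what that PL/pgSQL version might look like (the function name is made up); the row values only ever travel through USING, and the identifiers go through format() with %I, so nothing from the input is concatenated into the query text:
CREATE OR REPLACE FUNCTION data.item_insert_loop (data_in item[])
RETURNS int
AS $$
DECLARE
    r item;
BEGIN
    FOREACH r IN ARRAY data_in LOOP
        EXECUTE format(
            'INSERT INTO %I.%I (id, marked_for_deletion, name_)
             VALUES ($1, $2, $3)
             ON CONFLICT (id) DO UPDATE SET
                 marked_for_deletion = EXCLUDED.marked_for_deletion,
                 name_ = EXCLUDED.name_',
            'data', 'item')
        USING r.id, r.marked_for_deletion, r.name_;
    END LOOP;
    RETURN cardinality(data_in);
END;
$$ LANGUAGE plpgsql;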
I have a lot of other questions about this function (Should it be a procedure so that I can manage the transaction? How do I make the input anyarray? What would be a sensible result to return?) But I think I'll have to pursue those as their own questions.
Thanks for any help!

Related

How to join a table with the result of some manipulation of data (be it a stored procedure, a function, ...)?

This is one of the quirks of SQL Server that I find most puzzling: not being able to manipulate data within a function (execute UPDATE or INSERT commands), or to join the result of a stored procedure to a query.
I want to code an object that returns the next value from a table of counters, and be able to use its result in SELECTs.
Something like:
create function getNewCounterValue(@Counter varchar(100))
returns int
as
begin
    declare @Value int

    select @Value = Value
    from CounterValues
    where Counter = @Counter

    set @Value = coalesce(@Value, 0) + 1

    update CounterValues set Value = @Value
    where Counter = @Counter

    if @@rowcount = 0 begin
        insert into CounterValues (Counter, Value) values (@Counter, @Value)
    end

    return @Value
end
So then I would be able to run commands like:
declare @CopyFrom date = '2022-07-01'
declare @CopyTo date = '2022-08-01'

insert into Bills (IdBill, Date, Provider, Amount)
select getNewCounterValue('BILL'), @CopyTo, Provider, Amount
from Bills
where Date = @CopyFrom
But SQL Server doesn't allow you to create functions that change data (Invalid use of a side-effecting operator), so it forces me to write getNewCounterValue as a stored procedure, but then I can't execute it and join it to a query.
Is there any way to have an object that manipulates data and can join its result to a query?
PS: I know that I could use sequences to get new counter values without needing to change data, but I'm working on a huge legacy database that already uses counter tables, not sequences, so I cannot change that without breaking a zillion other things.
I also know that I could declare IdBill as an identity column, so I wouldn't need to retrieve new counter values to insert rows, but again, this is a huge legacy database that uses counter tables, not identity columns, so I cannot change the column types without breaking the system.
Besides, these counters are just an example of why being able to join the result of some data manipulation into a query would be very useful. I like to write a lot of logic in the database, so I would take advantage of this in plenty of other situations.
A few years ago I saw a very dirty trick that did this by executing the data-manipulation instructions as OPENROWSET calls within the function, but it was a seriously ugly hack. Is there still no better way to achieve this?
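(For reference, the stored-procedure form that SQL Server does allow looks roughly like this; it's the same body as the function above, the new value just comes back as a one-row result set instead of a return value, and it can't be joined straight into a SELECT, which is the whole point of the question:)
create procedure GetNewCounterValue @Counter varchar(100)
as
begin
    set nocount on
    declare @Value int

    select @Value = Value
    from CounterValues
    where Counter = @Counter

    set @Value = coalesce(@Value, 0) + 1

    update CounterValues set Value = @Value
    where Counter = @Counter

    if @@rowcount = 0 begin
        insert into CounterValues (Counter, Value) values (@Counter, @Value)
    end

    select @Value as NewValue -- result set, so callers capture it with INSERT ... EXEC
end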
Thank you.
You're clearly aware that a function is for returning data, and you're aware of sequences, and identity columns, and you have given a completely reasonable explanation in your question as to why you can't use this in this case.
But as you also said, the question is a bit more general than just sequence/identity problems. There is a coherent idea of "some kind of construct that can change data, and whose output can be composed directly into a select".
There's no "object" that exactly fits that description. Asking "why doesn't language X have feature Y" leads to philosophical discussions, with good answers already provided by Eric Lippert here and here.
I think there are a few more concrete answers in this case though:
1) Guaranteed idempotency.
A select returns a set (bag, collection, however you want to think about it). Then there is an obvious expectation that any process that runs for the result of a select may run for multiple rows. If the process is not idempotent, then the state of the system when the select is complete might depend on the number of rows in the result. It's also possible that the execution of the modifying process might change the semantics of the select, or the next iteration of the process, which leads to situations like the Halloween Problem.
2) Plan Compilation
Related to (1) but not precisely the same. How can the query optimizer approach this functionality? It must generate a plan "ahead of time", and that plan depends on stateful information. Yes, we get adaptive memory grants with 2019, but that's a trivial sort of "mid flight change", and even that took years before it was implemented (by which I mean that I believe Oracle has been able to do this for years, though I could be wrong, I'm no Oracle guy).
3) It's not actually beneficial in a lot of use cases
Take the use case of generating a sequence. Why not just iterate and execute a stored procedure? One answer might be "because I want to avoid imperative iteration, we should try to be set based and declarative". But as your hypothetical function demonstrates, it would still be imperative and iterative, it would just be "hidden" behind the select. I think - or let's say I have an intuition - that many cases where it seems like it might be nice to put a state-changing operation into a select fall into this basket. We wouldn't really be gaining anything.
4) There actually is a way to do it! (but only in the most trivial case)
When I said "composed directly into a select" I didn't use the word "compose" on a whim. We do have composable DML:
create table T(i int primary key, c char);
declare @output table (i int, c char);

insert @output (i, c)
select dml.i, dml.c
from (
    insert T (i, c)
    output inserted.i, inserted.c
    values (1, 'a')
) dml
/* OK, you can't add this
join SomeOtherTable on ...
*/
Of course, this isn't substantially different from insert exec in that you can't have a "naked" select, it has to be the source for an insert first. And you can't join to the dml output directly, you have to get the output and then do the join. But at least it gives you a way to avoid the "nested insert exec" problem.
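(For anyone who hasn't seen it, "insert exec" is the usual way to capture a procedure's result set first and join afterwards; a sketch, assuming a procedure form of getNewCounterValue that returns the new value as a one-row result set:)
declare @counter table (NewValue int);

insert @counter (NewValue)
exec GetNewCounterValue @Counter = 'BILL';

-- @counter can now be joined to other tables, but only in a separate, second statement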

Using an IN clause with table variable causes my query to run MUCH slower

I am using an SSRS report in which I need to pass multiple parameters to some SQL code.
Based on this blog post, the best way to handle multiple parameters is to use a split function, so that is the road I am following.
However, I am having some bad performance after following this.
For example, the following WHERE clause will return the data in 4 seconds:
AND DimBusinessDivision.Id IN (
22
)
This will also correctly return in 4 seconds:
DECLARE @BusinessDivisionId INT = 22
AND DimBusinessDivision.Id IN (
@BusinessDivisionId
)
However, using the split function as below, it takes 2 minutes (which is the same time it takes without a WHERE clause):
AND DimBusinessDivision.Id IN (
SELECT Item FROM dbo.FuncSplit(@BusinessDivisionId, ',')
)
I've also tried creating a temp table and a table variable before the SQL statement with the results of the table but there's no difference. I have a feeling this has to do with the fact that the values are not literal values and that SQL server doesn't know what query plan to follow, or something similar. Does anyone know of any ways to increase the performance of this?
It simply doesn't like using a table to get the values, even if the table has the same number of rows.
UPDATE: I have used the table function as an inner join, which has fixed the issue. Any ideas why this made all the difference?
INNER JOIN
dbo.FuncSplit(@BusinessDivisionIds, ',') AS FilteredBusinessDivisions ON
FilteredBusinessDivisions.Item = DimBusinessDivision.Id
A few things to play with:
Try the non-performant query and add OPTION (RECOMPILE); at the end of the query (see the sketch just after this list). If it magically runs much faster, then yes, the issue was a bad cached query plan. For more information on this specific problem, you can Google "parameter sniffing" for a more thorough explanation.
You may also want to look at the function definition and toss a RECOMPILE in there too, and see what difference that makes.
Look at the estimated query plan and try to determine the difference.
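For instance, the first suggestion applied to the filter from the question would look roughly like this (just a sketch, reusing the names above; the hint goes at the very end of the whole statement):
AND DimBusinessDivision.Id IN (
    SELECT Item FROM dbo.FuncSplit(@BusinessDivisionIds, ',')
)
-- ...rest of the query...
OPTION (RECOMPILE);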
But the root of the problem, I think, is that you are reinventing the wheel with this "split" function. You can have multi-valued parameters in SSRS and use "WHERE col IN (@param)": https://technet.microsoft.com/en-us/library/aa337396(v=sql.105).aspx
Unless there's a very specific reason you must split a comma separated list and cannot use normal parameters, just use a regular parameter that accepts multiple values.
Edit: I looked at the article you linked to. It's quite easy to have a SELECT ALL option in any reporting tool (not just SSRS), though it's not obvious. Using the "magic value" as written in the article you linked to works just fine. Can I ask what limitation is prompting you to need to do this string splitting?

A simple query written in a stored procedure vs. the same query written inline: which executes faster?

Take a simple query written in a stored procedure and the same query written as an inline query: which will execute faster in SQL Server?
Someone on an interview panel asked me this question. I said the stored procedure, the reason being that a procedure is compiled, but he said I was wrong.
Please explain.
I suppose that a simple query is some read-only code.
A VIEW is fully inlineable and precompiled. The biggest advantage is that you can bind it into a bigger statement with joins and filters, and this will, in most cases, be able to use indexes and statistics.
A table-valued function (TVF) in the "new" syntax (without BEGIN and END) is very similar to a VIEW, but the parameters and their handling are precompiled too. The advantages of the VIEW apply here as well.
A UDF returning a table ("old" syntax) is, in most cases, something one should not do. The biggest disadvantage is that the optimizer cannot pre-estimate the result and will treat it as a one-row table, which is, in most cases, really bad...
A stored procedure which does nothing more than a VIEW or a TVF could do is a pain in the neck, at least in my eyes. I know that there are other opinions... The biggest drawback: whenever you want to continue with the returned result set, you have to insert it into a table (or a declared table variable). Further joins or filters against this new table will miss indexes and statistics. A simple "Hey SP, give me your result!" might be fast, but everything after that call goes downhill.
So my conclusion: use SPs when there is something to do, and use a VIEW or TVF when there is something to read.
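To make the distinction concrete, the "new" (inline) TVF syntax referred to above looks roughly like this; the table and column names are made up purely for illustration. The "old" multi-statement version of the same thing would declare a return table and wrap the body in BEGIN ... END, which is what costs the optimizer its estimates.
create function dbo.OrdersForCustomer (@CustomerId int)
returns table
as
return
(
    select OrderId, OrderDate, Amount
    from dbo.Orders
    where CustomerId = @CustomerId
);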

SQL Server and CLR, batching SqlFunction

I have a CLR function that returns "n" rows with random data. For example, to prime an empty table with 100 rows of test data I could write
INSERT INTO CustomerInfo(FirstName, LastName, City...)
SELECT FirstName, LastName, City...
FROM MyCLRFunction(100)
This would return 100 "customers" with random information. If I were to call this with a very high number I would get an out of memory error, since the entire dataset is created before it gets sent to the caller. I can, of course, use the SqlPipe object and send rows as they are created but as far as I can tell you can only use this approach with SqlProcedures. That would mean that I can't use an INSERT INTO approach since you can't SELECT from a stored proc.
I'm hoping that I've just missed something here and that it is actually possible to combine SqlPipe.SendResultRow with a function, or that someone has a clever workaround.
I could leave it as a proc and have that proc put these records into a session-scoped temporary table. Then the caller could use that table in their SELECT clause but I'm hoping for the best of all worlds where I can provide a nice, clean syntax to the caller and still scale to a large number of records.
Frankly, the original solution is probably "good enough" since we will probably never want this much test data and even if we did we could run the INSERT statement multiple times. But I'm trying to get a full understanding of CLR integration and wondering how I would address this if a similar use case presented itself in a business scenario.
Looking into streaming SQLCLR table valued functions - http://msdn.microsoft.com/en-us/library/ms131103.aspx
You basically return an IEnumerable to SQL Server and let it consume it, thereby not needing to materialize all the results before returning them.
I found a solution. Instead of returning the entire list of items, the solution was to use
yield return mything;
This causes the fill-row method to be fired for each entity processed, effectively streaming the results back to the caller.
Glad to have this figured out, but I'm a little embarrassed by how simple the final solution was.

T-SQL Update Table using current columns as input parameters of function

I am trying to update table columns using a function. The input parameters of the function are data fields from the table that I want to update.
Let's say I have a table with two columns ("Country" and "Capital"). The "Capital" column is populated, and I am using a function that returns a country name given a capital name as its input parameter. So, my update code is something like this:
UPDATE #TableName
SET Country=(SELECT Country FROM dbo.fn_GetCountryByCapital(Capital))
There is no error flagged by IntelliSense, but when I press F5 it says:
Incorrect syntax near 'Capital'.
Please note that this is just an example (it may look silly to you); I give this sample in order to describe my problem. My real situation involves the use of several functions in the update statement.
Thank you in advance for the help.
Joro
Possible solution:
I have found another way to do this. It does not look so good, but it works:
I added an index to my temp table in order to use a WHILE statement
For each record in the table (using the WHILE statement), I used temp variables to store the field information I needed
Then I passed this information to my functions and used the outcome to update the table
My guess is that the brackets '( )' that surround the select statement and the function did not allow the function to use the correct values from the table.
Learn the right (most efficient) way to build SQL:
UPDATE a
SET Country=b.Country
FROM #TableName a
INNER JOIN YourCountryCapitalTable b ON a.Capital=b.Capital
You cannot code SQL like an application program; you need to use set logic and NOT per-row logic. When you throw a bunch of functions into a SQL statement, they will most likely need to be run per row, slowing down your queries (unless they are table functions in your FROM clause). If you just incorporate the function into the query, you will most likely see massive performance improvements, because of index usage and because operations occur on the complete set and not row by row.
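If the lookup genuinely has to stay a function, the set-based form puts it in the FROM clause with APPLY instead of calling it once per row in the SET (a sketch reusing the names from the question, and assuming fn_GetCountryByCapital is a table-valued function as used above):
UPDATE a
SET Country = f.Country
FROM #TableName a
CROSS APPLY dbo.fn_GetCountryByCapital(a.Capital) f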
It is sad to have to write SQL code that isn't elegant and often repeats itself all over the place. However, your main SQL goal is fast data retrieval (index usage and set operations), not some fancy coding beauty contest.
"I have found another way to do this." Yuck, yuck, yuck; that sounds like a future question here on SO when the next person needs to maintain this code. You don't need an index to use a WHILE. If you have so many rows in your temp table that you need an index, a WHILE is the LAST thing you should be doing!
