T-SQL Update Table using current columns as input parameters of function - sql-server

I am trying to update table columns using a function. The input parameters of the function are data fields from the table that I want to update.
Let's say I have a table with two columns ("Country" and "Capital"). The "Capital" is entered, and I am using a function that returns a country name for a capital name passed as an input parameter. So my update code is something like this:
UPDATE #TableName
SET Country=(SELECT Country FROM dbo.fn_GetCountryByCapital(Capital))
IntelliSense does not flag any error, but when I press F5 it says:
Incorrect syntax near 'Capital'.
Please note that this is just an example (it may look silly to you); I give it only to describe my problem. My real situation involves several functions in the update statement.
Thank you in advance for the help.
Joro
Possible Solution:
I have found another way to do this. It does not look great, but it works:
I added an index to my temp table so that I can iterate with a WHILE statement
For each record in the table (using the WHILE statement) I used temp variables to store the field values I needed
Then I passed those values to my functions and used the results to update the table
My guess is that the brackets '( )' surrounding the SELECT statement and the function keep the function from seeing the correct values from the table.

Learn the right (most efficient) way to build this SQL:
UPDATE a
SET Country=b.Country
FROM #TableName a
INNER JOIN YourCountryCapitalTable b ON a.Capital=b.Capital
You cannot code SQL like an application program; you need to use set logic and NOT per-row logic. When you throw a bunch of functions into a SQL statement, they will most likely need to run per row, slowing down your queries (unless they are table functions in your FROM clause). If you incorporate the function's logic directly into the query, you can most likely see massive performance improvements because indexes get used and operations happen on the complete set, not row by row.
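If dbo.fn_GetCountryByCapital really has to stay (say it is an inline table-valued function), the same set-based shape still applies: put the function in the FROM clause with CROSS APPLY instead of a per-row subquery. A sketch, assuming the function returns a Country column:
UPDATE t
SET t.Country = c.Country
FROM #TableName t
CROSS APPLY dbo.fn_GetCountryByCapital(t.Capital) c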
It is sad to have to write SQL code that isn't elegant and often repeats itself all over the place. However, your main SQL goal is fast data retrieval (index usage and set operations), not winning some coding beauty contest.
"I have found another way to do this" - yuck, yuck, yuck; that sounds like a future question here on SO when the next person has to maintain this code. You don't need an index to use a WHILE. If you have so many rows in your temp table that you need an index, a WHILE loop is the LAST thing you should be doing!

Related

How to join a table with the result of some manipulation of data (be it a stored procedure, function, ...)?

This is one of the quirks of SQL Server that I find most puzzling: not being able to manipulate data within a function (execute UPDATE or INSERT commands), and not being able to join the result of a stored procedure to a query.
I want to code an object that returns the next value from a table of counters, and be able to use its result in SELECTs.
Something like:
create function getNewCounterValue(@Counter varchar(100))
returns int
as
begin
    declare @Value int
    select @Value = Value
    from CounterValues
    where Counter = @Counter
    set @Value = coalesce(@Value, 0) + 1
    update CounterValues set Value = @Value
    where Counter = @Counter
    if @@rowcount = 0 begin
        insert into CounterValues (Counter, Value) values (@Counter, @Value)
    end
    return @Value
end
So then I would be able to run commands like :
declare @CopyFrom date = '2022-07-01'
declare @CopyTo date = '2022-08-01'
insert into Bills (IdBill, Date, Provider, Amount)
select dbo.getNewCounterValue('BILL'), @CopyTo, Provider, Amount
from Bills
where Date = @CopyFrom
But SQL Server doesn't allow creating functions that change data ("Invalid use of a side-effecting operator"), so it forces me to write getNewCounterValue as a stored procedure, and then I can't execute it and join its result to a query.
Is there any way to have an object that manipulates data and can join its result to a query?
PS: I know that I could use sequences to get new counter values without needing to change data, but I'm working on a huge legacy database that already uses counter tables, not sequences. So I cannot change that without breaking a zillion other things.
I also know that I could declare IdBill as an Identity column, so I wouldn't need to retrieve new counter values to insert rows, but again this is a huge legacy database that uses counter tables, not identity columns, so I cannot change the column types without breaking the system.
Besides, these counters are just one example of why being able to join the result of some data manipulation into a query would be very useful. I like to write a lot of logic in the database, so I would take advantage of it in plenty of other situations.
A few years ago I saw a very dirty trick that did this by executing the data manipulation instructions through OPENROWSET calls within the function, but it was a seriously ugly hack. Is there still no better way to achieve this?
Thank you.
You're clearly aware that a function is for returning data, and you're aware of sequences, and identity columns, and you have given a completely reasonable explanation in your question as to why you can't use this in this case.
But as you also said, the question is a bit more general than just sequence/identity problems. There is a coherent idea of "some kind of construct that can change data, and whose output can be composed directly into a select".
There's no "object" that exactly fits that description. Asking "why doesn't language X have feature Y" leads to philosophical discussions, with good answers already provided by Eric Lippert here and here.
I think there are a few more concrete answers in this case though:
1) Guaranteed idempotency.
A select returns a set (bag, collection, however you want to think about it). Then there is an obvious expectation that any process that runs for the result of a select may run for multiple rows. If the process is not idempotent, then the state of the system when the select is complete might depend on the number of rows in the result. It's also possible that the execution of the modifying process might change the semantics of the select, or the next iteration of the process, which leads to situations like the Halloween Problem.
2) Plan Compilation
Related to (1) but not precisely the same. How can the query optimizer approach this functionality? It must generate a plan "ahead of time", and that plan depends on stateful information. Yes, we get adaptive memory grants with 2019, but that's a trivial sort of "mid-flight change", and even that took years before it was implemented (by which I mean that I believe Oracle has been able to do this for years, though I could be wrong, I'm no Oracle guy).
3) It's not actually beneficial in a lot of use cases
Take the use case of generating a sequence. Why not just iterate and execute a stored procedure? One answer might be "because I want to avoid imperative iteration; we should try to be set based and declarative". But as your hypothetical function demonstrates, it would still be imperative and iterative, it would just be "hidden" behind the select. I think - or let's say I have an intuition - that many cases where it seems like it might be nice to put a state-changing operation into a select fall into this basket. We wouldn't really be gaining anything.
4) There actually is a way to do it! (but only in the most trivial case)
When I said "composed directly into a select" I didn't use the word "compose" on a whim. We do have composable DML:
create table T(i int primary key, c char);
declare @output table (i int, c char);
insert @output (i, c)
select dml.i, dml.c
from (
    insert T (i, c)
    output inserted.i, inserted.c
    values (1, 'a')
) dml
/* OK, you can't add this
join SomeOtherTable on ...
*/
Of course, this isn't substantially different from insert exec in that you can't have a "naked" select, it has to be the source for an insert first. And you can't join to the dml output directly, you have to get the output and then do the join. But at least it gives you a way to avoid the "nested insert exec" problem.
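Once the rows have landed in @output, you can join from there like any other table variable. A sketch continuing the example above (SomeOtherTable and its i column are placeholders):
select o.i, o.c, s.i as matched_i
from @output o
join SomeOtherTable s on s.i = o.i;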

Postgres bulk insert/update that's injection-safe. Perhaps a function that takes an array? [duplicate]

I'm working on paying back some technical debt this week, and it hit me that I have no idea how to make multi-value inserts safe from accidental or malicious SQL injection. We're on Postgres 11.4. I've got a test bed to work from that includes a small table with about 26K rows; here's the declaration of the table I'm using for testing:
BEGIN;
DROP TABLE IF EXISTS "data"."item" CASCADE;
CREATE TABLE IF NOT EXISTS "data"."item" (
"id" uuid NOT NULL DEFAULT NULL,
"marked_for_deletion" boolean NOT NULL DEFAULT false,
"name_" citext NOT NULL DEFAULT NULL,
CONSTRAINT item_id_pkey
PRIMARY KEY ("id")
);
CREATE INDEX item_marked_for_deletion_ix_bgin ON "data"."item" USING GIN("marked_for_deletion") WHERE marked_for_deletion = true;
ALTER TABLE "data"."item" OWNER TO "user_change_structure";
COMMIT;
I've been inserting to this table, and many others, using multi-value inserts, along the lines of:
BEGIN;
INSERT
bundle up hundreds or thousands of rows
ON CONFLICT do what I need
COMMIT or ROLLBACK on the client side
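Concretely, for the item table above, such a statement looks something like this (values shown inline; in the real code they come from the client):
BEGIN;
INSERT INTO data.item (id, marked_for_deletion, name_)
VALUES
    ('2f888809-2777-524b-abb7-13df413440f5', true,  'Salad fork'),
    ('f2924dda-8e63-264b-be55-2f366d9c3caa', false, 'Melon baller')
ON CONFLICT (id) DO UPDATE SET
    marked_for_deletion = EXCLUDED.marked_for_deletion,
    name_ = EXCLUDED.name_;
COMMIT;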
Works fine. But how do you make a multi-value statement safe? That's what I can't figure out. This is one of those areas where I can't reason about the problem well. I don't have the appetite, aptitude, or patience for hacking things. The fact that I can't think up an exploit means nothing; I would suck as a hacker. And, for that matter, I'm generally more concerned about errors than evil in code, since I run into errors a whole lot more often.
The standard advice I see for safe insertion is to use a prepared statement. A prepared statement for an INSERT is pretty much a temporary, runtime function for interpolation on a code template. For me, it's simpler to write an actual function, like this one:
DROP FUNCTION IF EXISTS data.item_insert_s (uuid, boolean, citext);
CREATE OR REPLACE FUNCTION data.item_insert_s (uuid, boolean, citext)
RETURNS int
AS $$
INSERT INTO item (
id,
marked_for_deletion,
name_)
VALUES
($1,$2,$3)
ON CONFLICT(id) DO UPDATE SET
marked_for_deletion = EXCLUDED.marked_for_deletion,
name_ = EXCLUDED.name_;
SELECT 1; -- No clue what to return, but you have to return something.
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_s(uuid, boolean, citext) OWNER TO user_bender;
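For reference, calling it looks something like this (sample values):
SELECT data.item_insert_s(
    '2f888809-2777-524b-abb7-13df413440f5'::uuid,
    true,
    'Salad fork'::citext
);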
All of that works, and I've tried some timing tests. I truncate the table, do a multi-value insert, truncate, do a series of function call inserts, and see what the difference is. I've tried multiple runs, doing the operations in different orders, etc. Both cases use a BEGIN/COMMIT block in the same way, so I'll end up with the same number of transactions on either test. The results vary more across tests than within them, but the multi-value insert is always faster. Congratulations to me for confirming the obvious.
Is there a way to safely do bulk inserts and updates? It occurred to me that I could write a function that takes an array of arrays, parse it out, and run the code in a loop within the function. I'd like to test that out, but I get flummoxed by the Postgres array syntax. I've looked around, and it sounds like an array of objects and a FOREACH loop might be just what I'm after. This topic has been addressed before, but I haven't found a straightforward example of how to prepare data for insertion, and then how to unpack it. I suspect that I won't be able to use plain SQL with unnest() because 1) I want to keep the inputs safe and 2) I might have functions that don't take all of the fields in a table as input.
To make things a bit easier, I'm fine with functions with fixed parameter lists, and array inputs with fixed formats. I'll write code generators for my various tables, so I don't need to make the Postgres-side code any more complex than necessary.
Thanks for any help!
Note: I got a message to explain why this question is different than my newer, related question:
Improving a function that UPSERTs based on an input array
Answer: Yes, it's the same starting point. In this question I was asking about SQL injection; in the second question I was trying to focus on the array-input solution. I'm not quite sure when to split out new questions, and when to let questions turn into multi-part threads.
It's morning here on the Far South Coast of NSW, and I figured I'd take another crack at this. I should have mentioned before that our deployment environment is RDS, which makes COPY less appealing. But the idea of passing in an array where each element includes the row data is very appealing. It's much like a multi-value INSERT, but with different syntactic sugar. I've poked at arrays in Postgres a bit, and always come away befuddled by the syntax. I found a few really excellent threads with lots of details from some top posters to study:
https://dba.stackexchange.com/questions/224785/pass-array-of-mixed-type-into-stored-function
https://dba.stackexchange.com/questions/131505/use-array-of-composite-type-as-function-parameter-and-access-it
https://dba.stackexchange.com/questions/225176/how-to-pass-an-array-to-a-plpgsql-function-with-variadic-parameter/
From there, I've got a working test function:
DROP FUNCTION IF EXISTS data.item_insert_array (item[]);
CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item[])
RETURNS int
AS $$
INSERT INTO item (
id,
marked_for_deletion,
name_)
SELECT
d.id,
d.marked_for_deletion,
d.name_
FROM unnest(data_in) d
ON CONFLICT(id) DO UPDATE SET
marked_for_deletion = EXCLUDED.marked_for_deletion,
name_ = EXCLUDED.name_;
SELECT cardinality(data_in); -- array_length() doesn't work. ¯\_(ツ)_/¯
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_array(item[]) OWNER TO user_bender;
To close the circle, here's an example of some input:
select * from item_insert_array(
array[
('2f888809-2777-524b-abb7-13df413440f5',true,'Salad fork'),
('f2924dda-8e63-264b-be55-2f366d9c3caa',false,'Melon baller'),
('d9ecd18d-34fd-5548-90ea-0183a72de849',true,'Fondue fork')
]::item[]
);
Going back to my test results, this performs roughly as well as my original multi-value insert. The other two methods I posted originally are, let's say, 4x slower. (The results are pretty erratic, but they're always a lot slower.) But I'm still left with my original question:
Is this injection safe?
If not, I guess I need to rewrite it in PL/pgSQL with a FOREACH loop and EXECUTE...USING or FORMAT to get the injection-cleaning text processing/interpolation features there. Does anyone know?
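For what it's worth, here's roughly what I imagine that rewrite would look like (untested sketch; item_insert_loop is just a name I made up):
CREATE OR REPLACE FUNCTION data.item_insert_loop (data_in item[])
RETURNS int
AS $$
DECLARE
    rec item;
BEGIN
    FOREACH rec IN ARRAY data_in LOOP
        -- EXECUTE ... USING binds the values as parameters,
        -- so nothing is spliced into the SQL text itself.
        EXECUTE 'INSERT INTO item (id, marked_for_deletion, name_)
                 VALUES ($1, $2, $3)
                 ON CONFLICT (id) DO UPDATE SET
                     marked_for_deletion = EXCLUDED.marked_for_deletion,
                     name_ = EXCLUDED.name_'
        USING rec.id, rec.marked_for_deletion, rec.name_;
    END LOOP;
    RETURN cardinality(data_in);
END;
$$ LANGUAGE plpgsql;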
I have a lot of other questions about this function (Should it be a procedure so that I can manage the transaction? How do I make the input anyarray? What would be a sensible result to return?) But I think I'll have to pursue those as their own questions.
Thanks for any help!

Stored procedure to update different columns

I have an API that I'm trying to read that gives me just the updated field. I'm trying to take that and update my tables using a stored procedure. So far the only way I have been able to figure out how to do this is with dynamic SQL, but I would prefer not to do that if there is another way.
If it were just a couple of columns, I'd write a proc for each, but we are talking about 100 fields, and any of them could be updated together. One ticket might just need a timestamp updated, the next might need a timestamp and who modified it, while the next might just need a note.
Everything I've read and been taught has told me that dynamic SQL is bad, and while I'll write it if I have to, I'd prefer a plain proc.
You can perhaps do something like this:
IF EXISTS (SELECT * FROM NEWTABLE EXCEPT SELECT * FROM OLDTABLE)
BEGIN
UPDATE o
SET o.OLDRECORDS = n.NEWRECORDS
FROM OLDTABLE o
INNER JOIN NEWTABLE n ON o.PRIMARYKEY = n.PRIMARYKEY
END
The best way to solve your problem is using MERGE:
Performs insert, update, or delete operations on a target table based on the results of a join with a source table. For example, you can synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table.
As you can see, your update could be more complex but more efficient as well. Using MERGE requires some proficiency, but once you start using it you'll reach for it again and again.
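A minimal sketch of the shape (the table and column names here are made up; #Incoming stands for whatever staging table or TVP holds the values coming from the API):
MERGE dbo.Ticket AS target
USING #Incoming AS source
    ON target.TicketId = source.TicketId
WHEN MATCHED THEN
    UPDATE SET target.ModifiedAt = source.ModifiedAt,
               target.ModifiedBy = source.ModifiedBy,
               target.Note       = source.Note
WHEN NOT MATCHED BY TARGET THEN
    INSERT (TicketId, ModifiedAt, ModifiedBy, Note)
    VALUES (source.TicketId, source.ModifiedAt, source.ModifiedBy, source.Note);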
I am not sure how your business logic works that determines what columns are updated at what time. If there are separate business functions that require updating different but consistent columns per function, you will probably want to have individual update statements for each function. This will ensure that each process updates only the columns that it needs to update.
On the other hand, if your API is such that you really don't know ahead of time what needs to be updated, then building a dynamic SQL query is a good idea.
Another option is to build a save proc that sets every user-configurable field. As long as the calling process has all of that data, it can call the save procedure and pass every updateable column. There is no harm in having an UPDATE MyTable SET MyCol = @MyCol with the same values on each side.
Note that even if all of the values are the same, the rowversion (or timestamp) columns will still be updated, if present.
With our software, the tables that users can edit have a widely varying range of columns. We chose to create a single save procedure for each table that has all of the update-able columns as parameters. The calling processes (our web servers) have all the required columns in memory. They pass all of the columns on every call. This performs fine for our purposes.
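As a hypothetical sketch of that pattern, using the MyTable/MyCol naming from above (every updateable column becomes a parameter and is always set):
CREATE PROCEDURE dbo.MyTable_Save
    @Id     int,
    @MyCol  nvarchar(100),
    @MyCol2 nvarchar(100)    -- ...one parameter per updateable column
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE dbo.MyTable
    SET MyCol  = @MyCol,
        MyCol2 = @MyCol2     -- ...and every column is set on every call
    WHERE Id = @Id;
END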

Using an IN clause with table variable causes my query to run MUCH slower

I am using an SSRS report where I need to pass multiple parameters to some SQL code.
Based on this blog post, the best way to handle multiple parameters is to use a split function, so that is the road I am following.
However, I am seeing bad performance after following this approach.
For example, the following WHERE clause will return the data in 4 seconds:
AND DimBusinessDivision.Id IN (
22
)
This will also correctly return in 4 seconds:
DECLARE @BusinessDivisionId INT = 22
AND DimBusinessDivision.Id IN (
@BusinessDivisionId
)
However, using the split function as below, it takes 2 minutes (which is the same time it takes without the WHERE clause):
AND DimBusinessDivision.Id IN (
SELECT Item FROM dbo.FuncSplit(@BusinessDivisionId, ',')
)
I've also tried creating a temp table and a table variable before the SQL statement to hold the split results, but there's no difference. I have a feeling this has to do with the fact that the values are not literal values and SQL Server doesn't know what query plan to follow, or something similar. Does anyone know of any ways to improve the performance of this?
It simply doesn't like using a table to get the values, even if the table has the same number of rows.
UPDATE: I have used the table function as an inner join, which has fixed the issue. Any ideas why this made all the difference?
INNER JOIN
dbo.FuncSplit(@BusinessDivisionIds, ',') AS FilteredBusinessDivisions ON
FilteredBusinessDivisions.Item = DimBusinessDivision.Id
A few things to play with:
Try the non-performant query and add OPTION (RECOMPILE); at the end of the query (see the sketch after these points). If it magically runs much faster, then yes, the issue was a bad cached query plan. For more information on this specific problem, you can Google "parameter sniffing" for a more thorough explanation.
You may also want to look at the function definition and toss a RECOMPILE in there too, and see what difference that makes.
Look at the estimated query plan and try to determine the difference.
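For the first point, the hint goes at the very end of the whole statement; here's a sketch (the SELECT list and FROM clause are stand-ins, since only the WHERE fragment was posted):
SELECT d.Id                        -- stand-in column list
FROM DimBusinessDivision AS d      -- stand-in for the real FROM/JOINs
WHERE d.Id IN (
    SELECT Item FROM dbo.FuncSplit(@BusinessDivisionIds, ',')
)
OPTION (RECOMPILE);                -- the hint applies to the whole statement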
But the root of the problem, I think, is that you are reinventing the wheel with this "split" function. You can have multi-valued parameters in SSRS and use "WHERE col IN (@param)": https://technet.microsoft.com/en-us/library/aa337396(v=sql.105).aspx
Unless there's a very specific reason you must split a comma separated list and cannot use normal parameters, just use a regular parameter that accepts multiple values.
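In dataset terms that just means writing the predicate directly against the multi-value parameter (with "Allow multiple values" checked on the report parameter); SSRS expands the list for you at run time, e.g.:
AND DimBusinessDivision.Id IN (@BusinessDivisionId)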
Edit: I looked at the article you linked to. It's quite easy to have a SELECT ALL option in any reporting tool (not just SSRS), though it's not obvious. Using the "magic value" approach described there works just fine. Can I ask what limitation is prompting you to need to do this string splitting?

TSQL "LIKE" or Regular Expressions?

I have a bunch (750K) of records in one table that I have to check for in another table. The second table has millions of records, and the data is something like this:
Source table
9999-A1B-1234X, with the middle part potentially being longer than three digits
Target table
DescriptionPhrase9999-A1B-1234X(9 pages) - yes, the parens and the words are in the field.
Currently I'm running a .NET app that loads the source records, then runs through and searches with a LIKE (using a T-SQL function) to determine if there are any matching records. If yes, the source table is updated with a positive. If not, the record is left alone.
The app processes about 1,000 records an hour. When I did this as a cursor sproc on SQL Server, I got pretty much the same speed.
Any ideas if regular expressions or any other methodology would make it go faster?
What about doing it all in the DB, rather than pulling records into your .Net app:
UPDATE s SET some_field = 1
FROM source_table s
WHERE EXISTS
(
SELECT target_join_field FROM target_table t
WHERE t.target_join_field LIKE '%' + s.source_join_field + '%'
)
This will reduce the total number of queries from 750k update queries down to 1 update.
First, I would redesign if at all possible. Better to add a column that contains the correct value and be able to join on it. If you still need the long one, you can use a trigger to extract the data into the new column at insert time.
If you have data you can match on, you need neither LIKE '%somestuff%' (which can't use indexes) nor a cursor, both of which are performance killers. This should be a set-based task if you have designed properly. If the design is bad and can't be changed to a good design, I see no good way to get good performance using T-SQL, and I would attempt the regular expression route. Not knowing how many different phrases there are and the structure of each, I cannot say whether the regular expression route would be easy or even possible. But short of a redesign (which I strongly suggest you do), I don't see another possibility.
BTW, if you are working with tables that large, I would resolve never to write another cursor. They are extremely bad for performance, especially when you start talking about that many records. Learn to think in sets, not record-by-record processing.
One thing to be aware of with using a single update (mbeckish's answer) is that the transaction log (enabling a rollback if the query is cancelled) will be huge. This will drastically slow down your query. As such, it is probably better to process the rows in blocks of 1,000 or so.
Also, the condition (b.field like '%' + a.field + '%') will need to check every single record in b (millions) for every record in a (750,000). That equates to more than 750 billion string comparisons. Not great.
The gut feel "index stuff" won't help here either. An index keeps things in order, so the first character(s) dictate the position in the index, not the ones you're interested in.
First Idea
For this reason I would actually consider creating another table, and parsing the long/messy value into something nicer. An example would be to just strip off any text from the last '(' onwards. (This assumes all the values follow that pattern.) This would simplify the query condition to (b.field like '%' + a.field).
Still, an index wouldn't help here either, as the important characters are at the end. So, bizarrely, it could well be worthwhile storing the values of both tables in reverse order. The index on your temporary table would then come into use.
It may seem very wasteful to spend that much time, but in this case a small benefit would yield a great reward. (A few hours' work to halve the comparisons from 750 billion to 375 billion, for example. And if you can get the index into play, you could reduce this a thousand-fold, thanks to indexes being tree searches, not just ordered tables...)
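To make the reversed-storage idea concrete, here's a sketch using the table/column names from the earlier query (the temp table #target_rev, its reversed_value column, and the index name are made up; it also assumes each target value has a single '(' before the page count):
-- Stage the target values with the '(n pages)' suffix stripped
-- and a reversed copy that an index can actually work with.
SELECT target_join_field,
       REVERSE(LEFT(target_join_field,
                    CHARINDEX('(', target_join_field + '(') - 1)) AS reversed_value
INTO   #target_rev
FROM   target_table;

CREATE INDEX IX_target_rev ON #target_rev (reversed_value);

-- With both sides reversed, the wildcard moves to the end,
-- so the comparison becomes 'starts with' instead of 'ends with'.
UPDATE s
SET    some_field = 1
FROM   source_table s
WHERE  EXISTS (
    SELECT 1
    FROM   #target_rev t
    WHERE  t.reversed_value LIKE REVERSE(s.source_join_field) + '%'
);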
Second Idea
Assuming you do copy the target table into a temp table, you may get an extra benefit from processing them in blocks of 1,000 by also deleting the matching records from the target table. (This is only worthwhile if you delete a meaningful amount from the target table, such that after all 750,000 records have been checked, the target table is, for example, half the size it started at.)
EDIT:
Modified Second Idea
Put the whole target table into a temp table.
Pre-process the values as much as possible to make the string comparison faster, or even bring indexes into play.
Loop through each record from the source table one at a time. Use the following logic in your loop...
DELETE target WHERE field LIKE '%' + @source_field + '%'
IF (@@ROWCOUNT = 0)
[no matches]
ELSE
[matches]
The continuous deleting makes the query faster on each loop, and you're only using one query on the data (instead of one to find matches, and a second to delete the matches)
Try this --
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
First thing is to make sure you have an index for that column on the searched table. Second is to do the LIKE without a % sign on the left side. Check the execution plan to see if you are not doing a table scan on every row.
As le dorfier correctly pointed out, there is little hope if you are using a UDF.
There are lots of ways to skin the cat - I would think that first it would be important to know whether this is a one-time operation or a recurring task.
Not knowing all the details of your problem, if it were me, and this was a one-time or infrequent operation (which it sounds like it is), I'd probably extract just the pertinent fields from the two tables, including the primary key from the source table, and export them to a local machine as text files. The file sizes will likely be significantly smaller than the full tables in your database.
I'd run it locally on a fast machine using a routine written in something like C/C++ or another "lightweight" language that has raw processing power, and write out a table of primary keys that "match", which I would then load back into SQL Server and use as the basis of an update query (i.e. UPDATE the source table WHERE id IN (SELECT id FROM the temp table)).
You might spend a few hours writing the routine, but it would run in a fraction of the time you are seeing in sql.
By the sound of your SQL, you may be trying to do 750,000 table scans against a multi-million-record table.
Tell us more about the problem.
Holy smoke, what great responses!
The system is on a disconnected network, so I can't copy and paste, but here's the retyped version.
Current UDF:
Create function CountInTrim
(@caseNo varchar(255))
returns int
as
Begin
declare @reccount int
select @reccount = count(recId) from targettable where title like '%' + @caseNo + '%'
return @reccount
end
Basically, if there's a record count, then there's a match, and the .net app updates the record. The cursor based sproc had the same logic.
Also, this is a one-time process, determining which entries in a legacy record/case management system migrated successfully into the new system, so I can't redesign anything. Of course, the developers of either system are no longer available, and while I have some SQL experience, I am by no means an expert.
I parsed the case numbers out of the crazy way the old system stored them to build the source table, and that's the only thing in common with the new system: the case number format. I COULD attempt to parse out the case number in the new system, then run matches against the two sets, but with a possible set of data like:
DescriptionPhrase1999-A1C-12345(5 pages)
Phrase/Two2000-A1C2F-5432S(27 Pages)
DescPhraseThree2002-B2B-2345R(8 pages)
Parsing that became a bit more complex so I thought I'd keep it simpler.
I'm going to try the single update statement, then fall back to regex in the clr if needed.
I'll update the results. And, since I've already processed more than half the records, that should help.
Try either Dan R's update query from above:
update t1
set ContainsBit = 1
from SourceTable t1
join (select TargetField
from dbo.TargetTable t2) t2
on charindex(t1.SourceField, t2.TargetField) > 0
Alternatively, if the timeliness of this is important and this is sql 2005 or later, then this would be a classic use for a calculated column using SQL CLR code with Regular Expressions - no need for a standalone app.
