SQL Server and CLR, batching SqlFunction

I have a CLR function that returns "n" rows with random data. For example, to prime an empty table with 100 rows of test data I could write
INSERT INTO CustomerInfo(FirstName, LastName, City...)
SELECT FirstName, LastName, City...
FROM MyCLRFunction(100)
This would return 100 "customers" with random information. If I were to call this with a very high number I would get an out of memory error, since the entire dataset is created before it gets sent to the caller. I can, of course, use the SqlPipe object and send rows as they are created but as far as I can tell you can only use this approach with SqlProcedures. That would mean that I can't use an INSERT INTO approach since you can't SELECT from a stored proc.
I'm hoping that I've just missed something here and that it is actually possible to combine SqlPipe.SendResultRow with a function, or that someone has a clever workaround.
I could leave it as a proc and have that proc put these records into a session-scoped temporary table. Then the caller could use that table in their SELECT clause but I'm hoping for the best of all worlds where I can provide a nice, clean syntax to the caller and still scale to a large number of records.
Frankly, the original solution is probably "good enough" since we will probably never want this much test data and even if we did we could run the INSERT statement multiple times. But I'm trying to get a full understanding of CLR integration and wondering how I would address this if a similar use case presented itself in a business scenario.

Look into streaming SQLCLR table-valued functions: http://msdn.microsoft.com/en-us/library/ms131103.aspx
You basically return an IEnumerable to SQL Server and let it consume it, thereby not needing to materialize all the results before returning them.
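For reference, nothing changes on the T-SQL side: a streaming CLR TVF is declared and consumed like any other CLR table-valued function. A minimal sketch, with made-up assembly, class, method, and column names:
-- Hypothetical registration; the EXTERNAL NAME parts are placeholders.
CREATE FUNCTION dbo.MyCLRFunction (@RowCount INT)
RETURNS TABLE (FirstName NVARCHAR(50), LastName NVARCHAR(50), City NVARCHAR(50))
AS EXTERNAL NAME MyAssembly.[MyNamespace.CustomerGenerator].GetCustomers;
GO
-- Because the CLR method returns an IEnumerable and yields rows one at a time,
-- this INSERT consumes them as they are produced instead of materializing the full set first.
INSERT INTO CustomerInfo (FirstName, LastName, City)
SELECT FirstName, LastName, City
FROM dbo.MyCLRFunction(1000000);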

I found a solution. Instead of building and returning the entire list of items, the trick was to use
yield return myThing;
This causes the fill-row method (the one named in FillRowMethodName) to be fired for each entity as it is processed, effectively streaming the results back to the caller.
Glad to have this figured out, but a little embarrassed by how simple the final solution was.


Postgres bulk insert/update that's injection-safe. Perhaps a function that takes an array? [duplicate]

I'm working on paying back some technical debt this week, and it hit me that I have no idea how to make multi-value inserts safe from accidental or malicious SQL injection. We're on Postgres 11.4. I've got a test bed to work from that includes a small table with about 26K rows. Here's the declaration of the table I'm using for testing:
BEGIN;
DROP TABLE IF EXISTS "data"."item" CASCADE;
CREATE TABLE IF NOT EXISTS "data"."item" (
    "id" uuid NOT NULL DEFAULT NULL,
    "marked_for_deletion" boolean NOT NULL DEFAULT false,
    "name_" citext NOT NULL DEFAULT NULL,
    CONSTRAINT item_id_pkey
        PRIMARY KEY ("id")
);
CREATE INDEX item_marked_for_deletion_ix_bgin ON "data"."item" USING GIN("marked_for_deletion") WHERE marked_for_deletion = true;
ALTER TABLE "data"."item" OWNER TO "user_change_structure";
COMMIT;
I've been inserting to this table, and many others, using multi-value inserts, along the lines of:
BEGIN;
INSERT
bundle up hundreds or thousands of rows
ON CONFLICT do what I need
COMMIT or ROLLBACK on the client side
Works fine. But how do you make a multi-value statement safe? That's what I can't figure out. This is one of those areas where I can't reason about the problem well. I don't have the appetite, aptitude, or patience for hacking things. That I can't think up an exploit means nothing; I would suck as a hacker. And, for that matter, I'm generally more concerned about errors than evil in code, since I run into errors a whole lot more often.
The standard advice I see for safe insertion is to use a prepared statement. A prepared statement for an INSERT is pretty much a temporary, runtime function for interpolation on a code template. For me, it's simpler to write an actual function, like this one:
DROP FUNCTION IF EXISTS data.item_insert_s (uuid, boolean, citext);

CREATE OR REPLACE FUNCTION data.item_insert_s (uuid, boolean, citext)
RETURNS int
AS $$
    INSERT INTO item (
        id,
        marked_for_deletion,
        name_)
    VALUES
        ($1, $2, $3)
    ON CONFLICT (id) DO UPDATE SET
        marked_for_deletion = EXCLUDED.marked_for_deletion,
        name_ = EXCLUDED.name_;

    SELECT 1; -- No clue what to return, but you have to return something.
$$ LANGUAGE sql;

ALTER FUNCTION data.item_insert_s(uuid, boolean, citext) OWNER TO user_bender;
ALTER FUNCTION data.item_insert_s(uuid, boolean, citext) OWNER TO user_bender;
All of that works, and I've tried some timing tests. I truncate the table, do a multi-value insert, truncate, do a series of function call inserts, and see what the difference is. I've tried multiple runs, doing the operations in different orders, etc. Both cases use a BEGIN/COMMIT block in the same way, so I'll end up with the same number of transactions on either test. The results vary more across tests than within them, but the multi-value insert is always faster. Congratulations to me for confirming the obvious.
Is there a way to safely do bulk inserts and updates? It occurred to me that I could write a function that takes an array of arrays, parses it out, and runs the code in a loop within the function. I'd like to test that out, but I get flummoxed by the Postgres array syntax. I've looked around, and it sounds like an array of objects and a FOREACH loop might be just what I'm after. This is a topic that has been addressed before, but I haven't found a straightforward example of how to prepare data for insertion and then unpack it. I suspect that I won't be able to use plain SQL and a simple unnest() because 1) I want to make the inputs safe and 2) I might have functions that don't take all of the fields in a table as their input.
To make things a bit easier, I'm fine with functions with fixed parameter lists, and array inputs with fixed formats. I'll write code generators for my various tables, so I don't need to make the Postgres-side code any more complex than necessary.
Thanks for any help!
Note: I got a message to explain why this question is different than my newer, related question:
Improving a function that UPSERTs based on an input array
Answer: Yes, it's the same starting point. In this question, I was asking about SQL injection, in the second question I was trying to focus on the array-input solution. I'm not quite sure when to split out new questions, and when to let questions turn into multi-part threads.
It's morning here on the Far South Coast of NSW, and I figured I'd take another crack at this. I should have mentioned before that our deployment environment is RDS, which makes COPY less appealing. But the idea of passing in an array where each element includes the row data is very appealing. It's much like a multi-value INSERT, but with different syntactic sugar. I've poked at arrays in Postgres a bit, and always come away befuddled by the syntax. I found a few really excellent threads with lots of details from some top posters to study:
https://dba.stackexchange.com/questions/224785/pass-array-of-mixed-type-into-stored-function
https://dba.stackexchange.com/questions/131505/use-array-of-composite-type-as-function-parameter-and-access-it
https://dba.stackexchange.com/questions/225176/how-to-pass-an-array-to-a-plpgsql-function-with-variadic-parameter/
From there, I've got a working test function:
DROP FUNCTION IF EXISTS data.item_insert_array (item[]);

CREATE OR REPLACE FUNCTION data.item_insert_array (data_in item[])
RETURNS int
AS $$
    INSERT INTO item (
        id,
        marked_for_deletion,
        name_)
    SELECT
        d.id,
        d.marked_for_deletion,
        d.name_
    FROM unnest(data_in) d
    ON CONFLICT (id) DO UPDATE SET
        marked_for_deletion = EXCLUDED.marked_for_deletion,
        name_ = EXCLUDED.name_;

    SELECT cardinality(data_in); -- array_length() doesn't work. ¯\_(ツ)_/¯
$$ LANGUAGE sql;
ALTER FUNCTION data.item_insert_array(item[]) OWNER TO user_bender;
To close the circle, here's an example of some input:
select * from item_insert_array(
array[
('2f888809-2777-524b-abb7-13df413440f5',true,'Salad fork'),
('f2924dda-8e63-264b-be55-2f366d9c3caa',false,'Melon baller'),
('d9ecd18d-34fd-5548-90ea-0183a72de849',true,'Fondue fork')
]::item[]
);
Going back to my test results, this performs roughly as well as my original multi-value insert. The other two methods I posted originally are, let's say, 4x slower. (The results are pretty erratic, but they're always a lot slower.) But I'm still left with my original question:
Is this injection safe?
If not, I guess I need to rewrite it in PL/pgSQL with a FOREACH loop and EXECUTE...USING or format() to get the injection-cleaning text processing/interpolation features there. Does anyone know?
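In case it helps clarify what I mean, here's a rough, untested sketch of that FOREACH version (the function name is made up):
CREATE OR REPLACE FUNCTION data.item_insert_loop (data_in item[])
RETURNS int
AS $$
DECLARE
    r item;
BEGIN
    FOREACH r IN ARRAY data_in LOOP
        -- Each element's fields are used as PL/pgSQL variables here, not spliced into SQL text.
        INSERT INTO item (id, marked_for_deletion, name_)
        VALUES (r.id, r.marked_for_deletion, r.name_)
        ON CONFLICT (id) DO UPDATE SET
            marked_for_deletion = EXCLUDED.marked_for_deletion,
            name_ = EXCLUDED.name_;
    END LOOP;
    RETURN cardinality(data_in);
END;
$$ LANGUAGE plpgsql;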
I have a lot of other questions about this function (Should it be a procedure so that I can manage the transaction? How do I make the input anyarray? What would be a sensible result to return?) But I think I'll have to pursue those as their own questions.
Thanks for any help!

One simple query written in a stored procedure vs. the same written as an inline query: which will execute faster?

One simple query written in a stored procedure and the same query written as an inline query: which will execute faster in SQL Server?
Someone on an interview panel asked me this question. I answered the stored procedure, my reasoning being that a procedure is compiled, but he said I was wrong.
Please explain.
I suppose that a "simple query" means some read-only code.
A VIEW is fully inlineable and precompiled. The biggest advantage is that you can bind it into a bigger statement with joins and filters, and this will - in most cases - still be able to use indexes and statistics.
A table-valued function (TVF) in the "new" syntax (without BEGIN and END) is very similar to a VIEW, but the parameters and their handling are precompiled too. The advantages of the VIEW apply here as well.
A UDF returning a table (the "old" syntax) is in most cases something one should not do. The biggest disadvantage is that the optimizer cannot pre-estimate the result and will treat it as a one-row table, which is - in most cases - really bad...
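To illustrate the two syntaxes (the table and column names are made up):
-- Inline ("new" syntax) TVF: a single RETURN SELECT; the optimizer can expand it into the outer query.
CREATE FUNCTION dbo.OrdersForCustomer_Inline (@CustomerID INT)
RETURNS TABLE
AS
RETURN
(
    SELECT OrderID, OrderDate, Total
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID
);
GO
-- Multi-statement ("old" syntax) TVF: fills a table variable between BEGIN and END;
-- the optimizer cannot see inside it and assumes very few rows.
CREATE FUNCTION dbo.OrdersForCustomer_Multi (@CustomerID INT)
RETURNS @result TABLE (OrderID INT, OrderDate DATE, Total MONEY)
AS
BEGIN
    INSERT INTO @result (OrderID, OrderDate, Total)
    SELECT OrderID, OrderDate, Total
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID;
    RETURN;
END;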
A stored procedure that does nothing more than a VIEW or a TVF could do as well is a pain in the neck - at least in my eyes. I know that there are other opinions... The biggest drawback: whenever you want to continue working with the returned result set, you have to insert it into a table (or a declared table variable). Further joins or filters against this new table will miss indexes and statistics. A simple "Hey SP, give me your result!" might be fast, but everything that comes after that call suffers.
So my conclusion: use SPs when there is something to do, and use a VIEW or TVF when there is something to read.

Way to persist function result as a constant

I needed to create a function today which will always return the exact same value on the specific database it's executed on. It may or may not be the same across databases, which is why it has to be able to load the value from a table the first time it's required.
CREATE FUNCTION [dbo].[PAGECODEGET] ()
RETURNS nvarchar(6)
AS
BEGIN
    DECLARE @PageCode nvarchar(6) = ( SELECT PCO_IDENTITY FROM PAGECODES WHERE PCO_PAGE = 'SWTE' AND PCO_TAB = 'RECORD' )
    RETURN @PageCode
END
The PCO_IDENTITY field is a sql identity field, so once the record is inserted for the first time, it's always going to return the same result thereafter.
My question is, is there any way to persist this value to something equivalent to a C# readonly variable?
From a performance point of view I know SQL will optimise the plan etc., but from a best-practice point of view I'm thinking there may possibly be a better way of doing it.
We use a mix of SQL Servers, but the lowest is 2008 R2 in case there's a version specific solution.
I'm afraid there's no such thing as a global variable like you suggest in SQL Server.
As you've pointed out, the function will potentially return different results on another database, depending on a variety of factors, such as when the row was inserted, what other values exist in the table already etc. - basically, the PCO_IDENTITY value for this row cannot be relied upon to be consistent.
A few observations:
I don't see how getting this value occasionally is really going to be a performance bottleneck. I don't think best practices cover this, as selecting a value from a table is as basic as you can get.
If this is part of another larger query, you will probably get better performance by using a join to the PAGECODES table directly, rather than potentially running this function for every row
However, if you are really worried:
There are objects in the database which are persistent - tables. When you first insert this value, retrieve the PCO_IDENTITY value and create a new table with just that in it, which you can join to in your queries. Seems a bit of a waste for one value, doesn't it? (Note you could also make a view, but how would that be any better performing than the function you started with?)
You could force these values into a row with a specific PCO_IDENTITY value, using IDENTITY_INSERT. That way the value is consistent, and you know what it is - you could hard code it in your queries. (NB: Turn IDENTITY_INSERT off again afterwards, and other rows inserted into this table will continue to be automatically generated again)
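A rough sketch of that approach, reusing the table and column names from the question (the specific value 100000 is made up):
-- Force a known, fixed value into the identity column so queries can rely on it.
SET IDENTITY_INSERT PAGECODES ON;
INSERT INTO PAGECODES (PCO_IDENTITY, PCO_PAGE, PCO_TAB)
VALUES (100000, 'SWTE', 'RECORD');
SET IDENTITY_INSERT PAGECODES OFF;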
TL;DR: How you are doing it is probably fine. I suspect you are trying to optimise something that isn't a problem. As always - if in doubt, try out a few approaches and measure.

Efficient Cross Apply with a CLR integrated table function

In SQL Server, I have a CLR integration based table valued function, GetArchiveImages. I call it something like this:
SELECT ...
FROM Items
CROSS APPLY GetArchiveImages(Items.ID) AS archiveimages
WHERE ...
The issue is that there is overhead for each individual call to the function.
If it could be joined with the whole table at once, the overhead would be quite minor, but since it's called once for each row, that overhead scales with the number of rows.
I don't use a stored procedure, because a table returned by a stored procedure can't be joined with anything (as far as I know).
Is there an efficient way to join tables with the results of a stored procedure or function in bulk, instead of row by row?
As the result of GetArchiveImages depends on Items.ID, SQL Server has to call the function for each item; otherwise you won't get correct results.
The only kind of function that SQL Server can "break up" is a T-SQL inline table-valued function. So if you can rewrite your CLR function as an ITVF, you will get better performance.
In my experience, however, the overhead of calling a CLR function is not that big. It is much more likely that you are having problems somewhere else in the query. For example, SQL Server has no idea how many rows will be returned by that function and just assumes it will be one (for each call). That can lead to misinformed decisions in other places during the optimization process.
UPDATE:
SQL Server does not allow keeping static, non-constant data within a CLR class. There are ways to trick the system, e.g. by creating a static readonly collection object (you can still add and remove items from such a collection), but I would advise against that for stability reasons.
In your case it might make sense to create a cache table that is refreshed either automatically, with some sort of (database- or file-system-) trigger, or on a schedule. Instead of calling the function you can then just join with that table.
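A rough sketch of that idea (the cache table and its column names are made up, since the real output of GetArchiveImages() isn't shown):
-- Cache table, refreshed on a schedule (or by a trigger) instead of calling the CLR TVF per row.
CREATE TABLE dbo.ArchiveImagesCache
(
    ItemID    INT           NOT NULL,
    ImagePath NVARCHAR(260) NOT NULL,
    CONSTRAINT PK_ArchiveImagesCache PRIMARY KEY (ItemID, ImagePath)
);
GO
-- Scheduled refresh, e.g. from a SQL Agent job:
TRUNCATE TABLE dbo.ArchiveImagesCache;
INSERT INTO dbo.ArchiveImagesCache (ItemID, ImagePath)
SELECT i.ID, ai.ImagePath
FROM dbo.Items AS i
CROSS APPLY dbo.GetArchiveImages(i.ID) AS ai;
GO
-- Queries then use a plain join instead of the per-row function call:
SELECT i.ID, c.ImagePath
FROM dbo.Items AS i
JOIN dbo.ArchiveImagesCache AS c ON c.ItemID = i.ID;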
If the GetArchiveImages() function does not need to be used in multiple queries, or at least not used outside of similar queries, you can switch the outer and inner aspects of this: do the main SELECT fields FROM [Items] WHERE ... inside the SQLCLR TVF, and make it a streaming TVF.
The basic structure needed would be:
1. Define a variable of type SqlDataRecord containing all of the fields you want to return from [Items] plus the ones being returned by the current GetArchiveImages() function.
2. Read the "several files in the file system" (taken from the first comment on @Sebastian Meine's answer).
3. Open a SqlConnection using "Trusted_Connection = true; Enlist = false;" as the ConnectionString.
4. Execute the main SELECT fields FROM [Items] {optional WHERE}. If it is possible at this point to narrow down some of the rows, then fill out the WHERE. You can even pass values into the function to pass along to the WHERE clause.
5. Loop through the SqlDataReader:
   - Fill out the SqlDataRecord variable for this row.
   - Get the related items that the current GetArchiveImages() function is getting, based on [Items].[ItemID].
   - Call yield return;
6. Close the SqlConnection.
7. Dispose of the SqlDataReader, SqlCommand, and SqlConnection.
8. Close any files opened in step 2 (if they can't be closed earlier in the process).

What should be returned when inserting into SQL?

A few months back, I started using a CRUD script generator for SQL Server. The default insert statement that this generator produces SELECTs the inserted row at the end of the stored procedure. It does the same for the UPDATE, too.
The previous way (and the only other way I have seen online) is to just return the newly inserted Id back to the business object, and then have the business object update the Id of the record.
Having an extra SELECT is obviously an additional database call, and more data is being returned to the application. However, it allows additional flexibility within the stored procedure, and allows the application to reflect the actual data in the table.
The additional SELECT also increases the complexity when wanting to wrap the insert/update statements in a transaction.
I am wondering what people think is better way to do it, and I don't mean the implementation of either method. Just which is better, return just the Id, or return the whole row?
We always return the whole row on both an Insert and Update. We always want to make sure our client apps have a fresh copy of the row that was just inserted or updated. Since triggers and other processes might modify values in columns outside of the actual insert/update statement, and since the client usually needs the new primary key value (assuming it was auto generated), we've found it's best to return the whole row.
The SELECT statement will have some sort of advantage only if the data is generated in the procedure. Otherwise, the data you have inserted is generally available to you already, so there is no point in selecting and returning it again, IMHO. If it's the id you're after, you can get it with SCOPE_IDENTITY(), which returns the last identity value created in the current scope by the insert.
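For example (the table and column names are made up):
-- SCOPE_IDENTITY() returns the identity value generated by this INSERT in the current scope.
INSERT INTO dbo.Customer (Name, City)
VALUES ('Widget Co', 'Springfield');
SELECT SCOPE_IDENTITY() AS NewId;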
Based on my prior experience, my knee-jerk reaction is to just return the freshly generated identity value. Everything else the application is inserting, it already knows--names, dollars, whatever. But a few minutes reflection and reading the prior 6 (hmm, make that 5) replies, leads to a number of “it depends” situations:
At the most basic level, what you inserted is what you’d get – you pass in values, they get written to a row in the table, and you’re done.
Slightly more complex than that is when there are simple default values assigned during an insert statement. “DateCreated” columns that default to the current datetime, or “CreatedBy” that default to the current SQL login, are a prime example. I’d include identity columns here, since not every table will (or should) contain them. These values are generated by the database upon table insertion, so the calling application cannot know what they are. (It is not unknown for web server clocks to not be synchronized with database server clocks. Fun times…) If the application needs to know the values just generated, then yes, you’d need to pass those back.
And then there are situations where additional processing is done within the database before data is inserted into the table. Such work might be done within stored procedures or triggers. Once again, if the application needs to know the results of such calculations, then the data would need to be returned.
With that said, it seems to me the main issue underlying your decision is: how much control/understanding do you have over the database? You say you are using a tool to automatically generate your CRUD procedures. Ok, that means that you do not have any elaborate processing going on within them, you’re just taking data and loading it on in. Next question: are there triggers (of any kind) present that might modify the data as it is being written to the tables? Extend that to: do you know whether or not such triggers exist? If they’re there and they matter, plan accordingly; if you do not or cannot know, then you might need to “follow up” on the insert to see if changes occurred. Lastly: does the application care? Does it need to be informed of the results of the insert action it just requested, and if so, how much does it need to know? (New identity value, date time it was added, whether or not something changed the Name from “Widget” to “Widget_201001270901”.)
If you have complete understanding and control over the system you are building, I would only put in as much as you need, as extra code that performs no useful function impacts performance and maintainability. On the flip side, if I were writing a tool to be used by others, I’d try to build something that did everything (so as to increase my market share). And if you are building code where you don't really know how and why it will be used (application purpose), or what it will in turn be working with (database design), then I guess you'd have to be paranoid and try to program for everything. (I strongly recommend not doing that. Pare down to do only what needs to be done.)
Quite often the database will have a mechanism that gives you the ID of the last inserted item without having to do an additional select. For example, MS SQL Server has the @@IDENTITY variable. You can pass this back to your application as an output parameter of your stored procedure and use it to update your data with the new ID. MySQL has something similar.
INSERT
INTO mytable (col1, col2)
OUTPUT INSERTED.*
VALUES ('value1', 'value2')
With this clause, returning the whole row does not require an extra SELECT and performance-wise is the same as returning only the id.
"Which is better" totally depends on your application needs. If you need the whole row, return the whole row, if you need only the id, return only the id.
You may add an extra setting to your business object which can trigger this option and return the whole row only if the object needs it:
IF @return_whole_row = 1
    INSERT
    INTO mytable (col1, col2)
    OUTPUT INSERTED.*
    VALUES ('value1', 'value2');
ELSE
    INSERT
    INTO mytable (col1, col2)
    OUTPUT INSERTED.id
    VALUES ('value1', 'value2');
I don't think I would in general return an entire row, but it could be a useful technique.
If you are code-generating, you could generate two procs (one which calls the other, perhaps) or parametrize a single proc to determine whether to return it over the wire or not. I doubt the DB overhead is significant (single-row, got to have a PK lookup), but the data on the wire from DB to client could be significant when all added up, and if it's just discarded in 99% of the cases, I see little value. Having an SP which returns different things with different parameters is a potential problem for clients, of course.
I can see where it would be useful if you have logic in triggers or calculated columns which are managed by the database, in which case, a SELECT is really the only way to get that data back without duplicating the logic in your client or the SP itself. Of course, the place to put any logic should be well thought out.
Putting ANY logic in the database is usually a carefully-thought-out tradeoff which starts with the minimally invasive and maximally useful things like constraints, unique constraints, referential integrity, etc and growing to the more invasive and marginally useful tools like triggers.
Typically, I like logic in the database when you have multi-modal access to the database itself, and you can't force people through your client assemblies, say. In this case, I would still try to force people through views or SPs which minimize the chance of errors, duplication, logic sync issues or misinterpretation of data, thereby providing as clean, consistent and coherent a perimeter as possible.
