In SQL Server, I have a CLR-integration-based table-valued function, GetArchiveImages. I call it something like this:
SELECT ...
FROM Items
CROSS APPLY GetArchiveImages(Items.ID) AS archiveimages
WHERE ...
The issue is that there is overhead for each individual call to the function.
If it could be joined with the whole table at once, the overhead would be quite minor, but since it's called once for each row, that overhead scales with the number of rows.
I don't use a stored procedure, because a table returned by a stored procedure can't be joined with anything (as far as I know).
Is there an efficient way to join tables with the results of a stored procedure or function in bulk, instead of row by row?
Because the result of GetArchiveImages depends on Items.ID, SQL Server has to call the function once for each row; otherwise you wouldn't get correct results.
The only kind of function that SQL Server can "break up" is a T-SQL inline table-valued function (ITVF). So if you can rewrite your CLR function as an ITVF, you will get better performance.
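For illustration, here is a minimal sketch of what such an ITVF could look like, assuming (hypothetically) that the archive-image metadata were available in a table rather than only through the CLR code; the table and column names are made up:

CREATE FUNCTION dbo.GetArchiveImagesInline (@ItemID INT)
RETURNS TABLE
AS
RETURN
    SELECT ai.ImageID, ai.ImagePath
    FROM dbo.ArchiveImages AS ai
    WHERE ai.ItemID = @ItemID;

Because SQL Server inlines this definition into the calling query, the CROSS APPLY effectively becomes a join against dbo.ArchiveImages with no per-row call overhead.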
In my experience, the overhead of calling a CLR function, however, is not that big. It is much more likely that you are having problems somewhere else in the query. For example, SQL Server has no idea how many rows will be returned by that function and just assumes it will be one (for each call). That can lead to misinformed decisions in other places during the optimization process.
UPDATE:
SQL Server does not allow you to keep static non-constant data within a CLR class. There are ways to trick the system, e.g. by declaring a static readonly collection object (you can still add and remove items from such a collection); however, I would advise against that for stability reasons.
In your case it might make sense to create a cache table that is refreshed either automatically by some sort of (database or file-system) trigger or on a schedule. Instead of calling the function you can then just join with that table.
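A minimal sketch of that idea (the cache table and its columns are hypothetical):

CREATE TABLE dbo.ArchiveImageCache
(
    ItemID    INT           NOT NULL,
    ImagePath NVARCHAR(400) NOT NULL,
    CONSTRAINT PK_ArchiveImageCache PRIMARY KEY (ItemID, ImagePath)
);

-- After the cache is populated (on a schedule or by a trigger),
-- the original query becomes a plain join:
SELECT ...
FROM Items
JOIN dbo.ArchiveImageCache AS c ON c.ItemID = Items.ID
WHERE ...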
If the GetArchiveImages() function does not need to be used in multiple queries, or at least not outside of similar queries, you can switch the outer and inner aspects of this: do the main SELECT fields FROM [Items] WHERE ... in the SQLCLR TVF, and make it a streaming TVF.
The basic structure needed would be:
1) Define a variable of type SqlDataRecord to hold all of the fields you want to return from [Items] plus the ones being returned by the current GetArchiveImages() function.
2) Read the "several files in the file system" (taken from the first comment on @Sebastian Meine's answer).
3) Open a SqlConnection using "Trusted_Connection = true; Enlist = false;" as the ConnectionString.
4) Execute the main SELECT fields FROM [Items] {optional WHERE}. If it is possible at this point to narrow down some of the rows, then fill out the WHERE. You can even pass values into the function to pass along to the WHERE clause.
5) Loop through the SqlDataReader:
   - Fill out the SqlDataRecord variable for this row.
   - Get the related items that the current GetArchiveImages() function is getting, based on [Items].[ItemID].
   - Call yield return;
6) Close the SqlConnection.
7) Dispose of the SqlDataReader, SqlCommand, and SqlConnection.
8) Close any files opened in Step 2 (if they can't be closed earlier in the process).
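For reference, the T-SQL side of registering such a streaming CLR TVF would look something like this (the assembly, class, column, and parameter names are hypothetical):

CREATE FUNCTION dbo.GetItemsWithArchiveImages (@filter NVARCHAR(4000))
RETURNS TABLE
(
    ItemID    INT,
    ItemName  NVARCHAR(200),
    ImagePath NVARCHAR(400)
)
AS EXTERNAL NAME [MyClrAssembly].[UserDefinedFunctions].[GetItemsWithArchiveImages];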
Related
I often use stored procedures for data access, but I don't know which is best: a view, a stored procedure, or a function?
Please tell me which of the above is best for data access and why, with reasons and an example.
I searched Google to learn which one is best but got no clear answer.
View
A view is a "virtual" table consisting of a SELECT statement. By "virtual" I mean that no physical data is stored by the view; only the definition of the view is stored inside the database, unless you materialize the view by putting an index on it.
By definition, you cannot pass parameters to a view.
No DML operations (e.g. INSERT, UPDATE, and DELETE) are allowed inside a view definition; only SELECT statements.
Most of the time, a view encapsulates complex joins so that it can be reused in queries or stored procedures. It can also provide a level of isolation and security by hiding sensitive columns of the underlying tables.
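A minimal example (table and column names are hypothetical):

CREATE VIEW dbo.vw_ActiveCustomers
AS
SELECT CustomerID, FirstName, LastName
FROM dbo.Customers
WHERE IsActive = 1;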
Stored procedure
A stored procedure is a group of Transact-SQL statements compiled into a single execution plan; in other words, a saved collection of Transact-SQL statements.
A stored procedure:
accepts parameters
can NOT be used as a building block in a larger query
can contain several statements, loops, IF ELSE, etc.
can perform modifications to one or several tables
can NOT be used as the target of an INSERT, UPDATE or DELETE statement
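A minimal example of a stored procedure (hypothetical names) that accepts a parameter and modifies a table:

CREATE PROCEDURE dbo.DeactivateCustomer
    @CustomerID INT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE dbo.Customers
    SET IsActive = 0
    WHERE CustomerID = @CustomerID;
END;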
A view:
does NOT accept parameters
can be used as a building block in a larger query
can contain only one single SELECT query
can NOT perform modifications to any table
but can (sometimes) be used as the target of an INSERT, UPDATE or DELETE statement.
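For instance, the hypothetical vw_ActiveCustomers view from above can be referenced like a table inside a larger query, and because it maps directly to a single base table it can even be the target of an UPDATE:

-- Used as a building block in a larger query:
SELECT v.CustomerID, v.LastName
FROM dbo.vw_ActiveCustomers AS v
JOIN dbo.Orders AS o ON o.CustomerID = v.CustomerID;

-- Used as the target of an UPDATE (allowed for simple single-table views):
UPDATE dbo.vw_ActiveCustomers
SET LastName = 'Smith'
WHERE CustomerID = 42;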
Functions
Functions are subroutines made up of one or more Transact-SQL statements that can be used to encapsulate code for reuse.
There are three types of UDF (scalar, inline table-valued, and multi-statement table-valued), and each of them serves a different purpose; you can read more about functions/UDFs in Books Online.
UDFs have a big limitation: by definition they cannot change the state of the database. What I mean by this is that you cannot perform data manipulation operations (INSERT, UPDATE, DELETE, etc.) inside a UDF.
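As an example, here is a minimal scalar UDF (hypothetical names) that encapsulates logic for reuse inside queries but cannot modify any data:

CREATE FUNCTION dbo.FullName (@FirstName NVARCHAR(50), @LastName NVARCHAR(50))
RETURNS NVARCHAR(101)
AS
BEGIN
    RETURN @FirstName + N' ' + @LastName;
END;

-- Reused inside a query:
SELECT dbo.FullName(FirstName, LastName) AS FullName
FROM dbo.Customers;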
Stored procedures are good for DDL statements, which you can't execute inside functions. Both stored procedures and user-defined functions accept parameters and can return values, but they can't run the same kinds of statements:
User-defined functions are limited to read-only (SELECT) logic; they cannot modify data.
A view doesn't accept parameters and can only contain a single SELECT statement.
I hope the information below will help you understand the use of SQL stored procedures, views, and functions.
Stored Procedure - A stored procedure can be used for any database operation (insert, update, delete, and fetch), which you mentioned you are already using.
View - A view can only be used to fetch data, and it has limitations: you can't pass parameters to a view, e.g. to filter the data based on a passed parameter.
Function - A function is usually used for a specific operation; there are also many built-in SQL Server functions for dates, math, string manipulation, etc.
I'll make it very short and straight.
When you are accessing data from different tables and don't need to pass parameters, use a view.
When you want to compute or return values for use inside a query, go for a function.
When you want to perform DML or DDL statements, go for a stored procedure.
The rest depends on your knowledge and whatever idea hits you at a particular point in time.
And for performance reasons, many would argue:
- avoid functions (especially scalar ones) if possible
- it's easier to tune stored procedures (query plans) and views
IMO, views (and indexed views) are just fancier SELECTs; stored procedures are more versatile, since you can transform/manipulate data within them.
I am new to PostgreSQL and I am trying to figure out some details about stored procedures (which I think are actually called functions in PostgreSQL) when used in a multiple-schema environment.
The application I have in mind involves a multi-tenant DB design where one schema is used for each tenant, and all schemata, which have the same table structure and names, are part of the same database. As far as I know from DBs in general, stored procedures/functions are pre-compiled and therefore faster, so I would like to use them for performing operations on each schema's tables by sending the required parameters from the application server instead of sending a list of SQL commands.
In addition, I would like to have a SINGLE set of functions that implement all the SELECT (including JOIN type), INSERT, UPDATE, etc. operations on the tables of each schema. This will make it easy to apply changes in each function and avoid SQL code replication and redundancy. As I found out, it is possible to create a set of functions in a schema s0 and then create s1, s2, ... schemata (having all the same tables) that use these functions.
For example, I can create a template schema named s0 (identical to all the others) and create an SQL or PL/pgSQL function that belongs to this schema and contains operations on the schema's tables. In this function, the table names are written without the schema prefix, i.e.
first_table and not s0.first_table
An example function could be:
CREATE FUNCTION sel() RETURNS BIGINT
AS 'SELECT count(a) from first_table;'
LANGUAGE SQL;
As I have tested, this function works well. If I move to schema s1 by entering:
set search_path to s1;
and then call the function again, the function acts on schema s1's identically named table first_table.
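For instance, assuming schemata s1 and s2 both contain a first_table, the single function defined in s0 can serve every tenant:

SET search_path TO s1;
SELECT s0.sel();  -- counts rows in s1.first_table

SET search_path TO s2;
SELECT s0.sel();  -- counts rows in s2.first_table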
The function could also take a parameter path so it can be called with a schema name, and include a command to change the search_path, similar to this:
CREATE OR REPLACE FUNCTION doboth(path TEXT, is_local BOOLEAN DEFAULT false) RETURNS BIGINT AS $$
SELECT set_config('search_path', regexp_replace(path, '[^\w ,]', '', 'g'), is_local);
SELECT count(a) from first_table;
$$ LANGUAGE sql;
as shown in the proposed solution in PostgreSQL: how do I set the search_path from inside a function?
However, when I tried this and called the function for a schema, I noticed that the second SELECT of the function was executed before the first, which led to the second SELECT being executed on the wrong schema! This was really unexpected. Does anybody know the explanation for this behavior?
In order to bypass this issue, I created a PL/pgSQL function that does the same thing, and it worked without any execution-order issues:
CREATE OR REPLACE FUNCTION doboth(path TEXT, is_local BOOLEAN DEFAULT false) RETURNS BIGINT AS $$
DECLARE result BIGINT;
BEGIN
PERFORM set_config('search_path', regexp_replace(path, '[^\w ,]', '', 'g'), is_local);
SELECT count(a) from first_table INTO result;
RETURN result;
END
$$ LANGUAGE plpgsql;
So, now some questions about performance this time:
1) Apart from (a) having the selection of the schema to operate on and the specified operation on that schema in one transaction, which is necessary for my multi-tenant implementation, and (b) bundling SQL commands together and avoiding some extra data exchange between the application server and the DB server, which is beneficial, do PostgreSQL functions have any performance benefits over executing the same code as separate SQL commands?
2) In the described multi-tenant scenario with many schemata and one DB, does a function that is defined once and called against any schema identical to the one it was defined in lose any of its performance benefits (if any)?
3) Is there any difference in performance between an SQL function and a PL/pgSQL function that contains the same operations?
Before I answer your questions, a remark about your SQL function.
It does not fail because the statements are executed in the wrong order, but because both queries are parsed before the first one is executed. The error message you get is something like:
ERROR: relation "first_table" does not exist
[...]
CONTEXT: SQL function "doboth" during startup
Note the “during startup”.
Answers
You may experience a slight performance boost, particularly if the SQL statements are complicated, because the plans of SQL statements in a PL/pgSQL function are cached for the duration of a database session or until they are invalidated.
If the plan for the query is cached by the PL/pgSQL function, but the SQL statement calling the function has to be planned every time, you might actually be worse off from a performance angle because of the overhead of executing the function.
Whenever you call the function with a different schema name, the query plan will be invalidated and has to be created anew. So if you change the schema name for every invocation, you won't gain anything.
SQL functions don't cache query plans, so they don't perform better than the plain SQL query.
Note, however, that the gains from caching simple SQL statements in functions are not tremendous.
Use functions that just act as containers for SQL statements only if it makes life simpler for you, otherwise use plain SQL.
Do not focus only on performance during design; focus on a good architecture and a simple design.
If the same statements keep repeating over and over, you might gain more performance using prepared statements than using functions.
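For example, a prepared statement in PostgreSQL is planned once per session and can then be re-executed cheaply; the WHERE clause here is a made-up addition, assuming column a is numeric:

PREPARE count_rows (BIGINT) AS
    SELECT count(a) FROM first_table WHERE a > $1;

EXECUTE count_rows(0);
EXECUTE count_rows(100);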
Firstly, I do not really believe there can be any issues with line execution order in functions. If you have any issues, it's your code not working, not Postgres.
Secondly, multi-tenant behavior is well implemented with set search_path to s1, s0;. There is usually no need for switching anything inside procedures.
Thirdly, there are no performance benefits in using stored procedures except for minimizing data flows between the DB and the application. If you consider a query like SELECT count(*) FROM mytable WHERE somecolumn = $1, there is absolutely nothing you can optimize before you know the value of $1.
And finally, no, there is no significant difference between functions in SQL and PL/pgSQL. Most of the time is still spent reading the tables, so focus on perfecting that.
Hope that clarifies the situation. Also, you may want to consider the security benefits of stored procedures. Just a hint.
I am aware that derived tables and common table expressions (CTEs) do not persist. They live in memory until the end of the outer query, and every reference to them is a repeated execution.
Do functions such as inline table-valued functions persist, meaning they are only calculated once? Can we index an inline table-valued function?
An inline function is basically the same thing as a view or a CTE, except that it has parameters. If you look at the query plan, you'll see that the logic from the function is included in the query using it; so no, you can't index it, and SQL Server doesn't cache its results as such, though of course the pages will be in the buffer pool for future use.
I wouldn't say that each call to a CTE is a repeated execution either, since SQL Server can freely decide how to run the query, as long as the results are correct.
For a multi-statement UDF, each call (at least in versions up to SQL Server 2014) is a separate execution, as far as I know, and the results are not cached in the sense I assume you mean.
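To make the contrast concrete, here is a hypothetical multi-statement TVF; its body fills a table variable on every call, so nothing gets inlined into the outer query:

CREATE FUNCTION dbo.GetRecentOrders (@CustomerID INT)
RETURNS @Result TABLE (OrderID INT, OrderDate DATETIME)
AS
BEGIN
    INSERT INTO @Result (OrderID, OrderDate)
    SELECT OrderID, OrderDate
    FROM dbo.Orders
    WHERE CustomerID = @CustomerID;
    RETURN;
END;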
Do functions such as inline table-valued functions persist
No. If you look at the syntax of a table-valued function, it returns the result of a SELECT statement in essence, so it doesn't store the fetched data anywhere (same as a view). So no, there is no question of creating an index on it, since the data doesn't get stored.
That is, unless you store the fetched data in another table like below, in which case you can create an index (or other structures) on that table:

SELECT *
INTO TestTable
FROM yourInlineTableValuedFunction(parameter);
Can we index an inline table-valued function?
No, but if you materialize the result into a temp table, you can certainly index it and speed things up. The overhead of creating the temp table will more than pay off through improved indexed access, caching, and, depending on your use case, repeated use of the same temp table in a multi-user scenario.
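A sketch of that approach, with hypothetical names and assuming the result set has an ID column worth indexing:

SELECT *
INTO #Results
FROM dbo.yourInlineTableValuedFunction(@parameter);

CREATE CLUSTERED INDEX IX_Results_ID ON #Results (ID);

SELECT * FROM #Results WHERE ID = 42;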
I am creating a Java function that needs to use a SQL query with a lot of joins before doing a full scan of its result. Instead of hard-coding a lot of joins, I decided to create a view with this complex query. Then the Java function just uses the following query to get this result:
SELECT * FROM VW_####
So the program is working fine, but I want to make it faster since this SELECT command is taking a lot of time. After taking a look at its execution plan, I created some indexes and made it about 30% faster, but I want to make it faster still.
The problem is that every operation in the execution plan has a cost between 0% and 4% except one, a clustered-index insert that accounts for about 50% of the execution cost. I think the system is using a temporary table to store the view's data, but an index on this view isn't useful for me because I need all the rows from it.
So what can I do to optimize that insert into the CWT_PrimaryKey? I think I can't turn that index off because it seems to be part of SQL Server's internals. I read somewhere that this operation can appear when you use cursors, but I don't think I am using any (or does the view use one?).
The command to create the view is something simple (no T-SQL, no OPTION, etc) like:
create view VW_#### as SELECTS AND JOINS HERE
And here is a picture of the problematic part from the execution plan: http://imgur.com/PO0ZnBU
EDIT: More details:
Well, the query behind the problematic view is a big query that joins a lot of tables. Based on a single parameter, the Java client modifies the query string before creating the view. This view represents a "data unit" from a legacy database migrated to SQL Server that didn't have any foreign or primary keys, so our team chose to follow this strategy. Because of that, the view has more than 50 columns and is made from the join of seven other views.
Main view's query (with a lot of Portuguese words): http://pastebin.com/Jh5vQxzA
The other views (VW_Sintese1 through VW_Sintese7) are created like this one but without using extra views; they just use joins with the tables that contain the data requested by the main view.
Then the Java client creates a PreparedStatement with the query "Select * from VW_Sintese####" and runs it using executeQuery, something like:
String query = "Select * from VW_Sintese####";
PreparedStatement ps = myConn.prepareStatement(query, ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = ps.executeQuery();
And then the program goes on until the end.
Thanks for the attention.
First: you should post the code of the view along with whatever is using the view; the rest of this answer explains why.
Second: the definition of a view in SQL Server is substituted into the query that uses it. In other words, you created a view, but since (I'm assuming) it isn't an indexed view, it is the same as writing the original, long SELECT statement; SQL Server essentially just swaps it in within the DML statement.
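For contrast, an indexed view does persist its result. A minimal sketch, with hypothetical table and column names (indexed views require SCHEMABINDING, two-part names, and COUNT_BIG(*) when grouping):

CREATE VIEW dbo.VW_Indexed
WITH SCHEMABINDING
AS
SELECT ID, Name, COUNT_BIG(*) AS RowCnt
FROM dbo.SomeTable
GROUP BY ID, Name;
GO
-- The unique clustered index is what materializes the view's data:
CREATE UNIQUE CLUSTERED INDEX IX_VW_Indexed ON dbo.VW_Indexed (ID, Name);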
From Microsoft's 'Querying Microsoft SQL Server 2012': T-SQL supports the following table expressions: derived tables, common table expressions (CTEs), views, and inline table-valued functions.
And a direct quote:
It's important to note that, from a performance standpoint, when SQL Server optimizes queries involving table expressions, it first unnests the table expression's logic, and therefore interacts with the underlying tables directly. It does not somehow persist the table expression's result in an internal work table and then interact with that work table. This means that table expressions don't have a performance side to them (neither good nor bad), just no side.
This is a long way of reinforcing the first statement: please include the SQL code in the view and what you're actually using as the SELECT statement. Otherwise, we can't help much :) Cheers!
Edit: Okay, so you've created a view (no performance gain there) that does 4-5 LEFT JOINs onto the main view (again, you're not helping yourself much here, since you're not eliminating rows, etc.). If there are search arguments you can use to filter the result set down to fewer rows, you should have them in here. And lastly, you're ordering all of this at the top, so your query engine has to fetch those views, join them up into a massive SELECT, figure out the correct order, and (I'm guessing here) the row count is HUGE, so SQL Server's engine is sorting it in some kind of temporary table.
The short answer: get less data (fewer columns and only the rows you need), and don't order the results if the result set is very large; just get the data to the client and sort it there.
Again, if you want more help, you'll need to post table schemas and index strategies for all tables that are in the query (including the views that are joined) and you'll need to include all view definitions (including the views that are joined).
I have a CLR function that returns "n" rows with random data. For example, to prime an empty table with 100 rows of test data I could write
INSERT INTO CustomerInfo(FirstName, LastName, City...)
SELECT FirstName, LastName, City...
FROM MyCLRFunction(100)
This would return 100 "customers" with random information. If I were to call this with a very high number, I would get an out-of-memory error, since the entire dataset is created before it is sent to the caller. I can, of course, use the SqlPipe object and send rows as they are created, but as far as I can tell you can only use that approach with SqlProcedures. That would mean I can't use an INSERT INTO approach, since you can't SELECT from a stored procedure.
I'm hoping that I've just missed something here and that it is actually possible to combine SqlPipe.SendResultsRow with a function, or that someone has a clever workaround.
I could leave it as a proc and have that proc put the records into a session-scoped temporary table. Then the caller could use that table in their SELECT clause, but I'm hoping for the best of all worlds, where I can provide a nice, clean syntax to the caller and still scale to a large number of records.
Frankly, the original solution is probably "good enough" since we will probably never want this much test data and even if we did we could run the INSERT statement multiple times. But I'm trying to get a full understanding of CLR integration and wondering how I would address this if a similar use case presented itself in a business scenario.
Look into streaming SQLCLR table-valued functions: http://msdn.microsoft.com/en-us/library/ms131103.aspx
You basically return an IEnumerable to SQL Server and let it consume it, thereby not needing to materialize all the results before returning them.
I found a solution. Instead of returning the entire list of items, the solution was to use

yield return mything;

This causes the FillRow method to be fired for each entity processed, effectively streaming the results back to the caller.
Glad to have this figured out, but a little embarrassed by how simple the final solution was.