I want to memoize function results for performance, i.e. lazily populate a cache indexed on the function arguments. The first time I call a function, the cache won't have anything for the input arguments, so the function will calculate the result and store it before returning it. Subsequent calls just use the cache.
However, it seems that SQL Server 2000 has a stupid arbitrary rule about functions being "deterministic". INSERTs, UPDATEs, and regular stored procedure calls are forbidden. However, extended stored procedures are allowed. How is this deterministic? If another session modifies the database state, the function output will change anyway.
I'm steaming mad. I had thought I could make caching transparent to the user. Is this possible? I don't have the permissions to deploy extended stored procedures.
EDIT:
This limitation is still in 2008. You can't call RAND, for God's sake!
The cache would be implemented by me in the DB. A cache is any data store used for caching...
EDIT:
There are no cases where the same arguments to a function will yield different results, outside of changes to the underlying data. This is a BI platform, and the only changes come from scheduled ETL, at which time I would TRUNCATE the cache table.
These are I/O-intensive time-series calculations, on the order of O(n^4). I don't have the mandate to change the underlying tables or indexes. Also, a lot of these functions use the same intermediate functions, and caching lets those intermediate results be reused.
UDFs are not truly deterministic, unless they account for changes in database state. What's the point? Is SQL Server caching? (Ironic.) If SQL Server is caching, then it must be expiring on changes to tables that are schema bound. If they're schema bound, then why not bind tables that the function modifies? I can see why procs aren't allowed, although that's just sloppy; just schema bind procs. And, BTW, why allow extended stored procs? You can't possibly track what those do to ensure determinism!!! Argh!!!
EDIT:
My question is: Is there any way to lazily cache function results in a way that can be used in a view?
Deterministic means that the same inputs return the same output independent of time and database state.
SQL Server (any version) does no caching of UDFs - I believe it will avoid calling the UDF twice on a single row, but that's it.
One trick I've used (I think I posted it here on SO) is to:
Refactor the UDF, if you can, so that there is effectively a usable, discrete subset of values returned for a given set of inputs. For numerical calculations, one can sometimes refactor the logic to return a factor or rate that is multiplied by the passed-in value outside the UDF, instead of doing the multiplication inside the UDF.
Call the UDF over the DISTINCT rowset and cache the results to a temporary table. If you are only calling the UDF with 100,000 tuples of parameters over a 17,000,000-row set, this is far more efficient.
JOIN to the temporary table (basically converting from code-based logic to table-based logic) to get values.
This table can be re-used as necessary or even kept.
Additions to the table can be made by first LEFT JOINing to find missing cached entries.
This works for both single-row table-valued UDFs and scalar UDFs. I'm mainly using it for table-valued UDFs. There is a hotfix to SQL Server 2005 which is supposed to address the UDF performance - I'm waiting on the DBAs to test it before deploying to production.
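A rough sketch of that pattern (the scalar UDF dbo.fn_CalcRate and the table dbo.BigTable are invented names for illustration):

    -- Call the UDF once per DISTINCT parameter tuple and keep the results.
    SELECT d.AccountId,
           d.AsOfDate,
           dbo.fn_CalcRate(d.AccountId, d.AsOfDate) AS Rate
    INTO   #UdfCache
    FROM   (SELECT DISTINCT AccountId, AsOfDate FROM dbo.BigTable) AS d;

    CREATE UNIQUE CLUSTERED INDEX IX_UdfCache ON #UdfCache (AccountId, AsOfDate);

    -- Consume the cached values via a JOIN (code-based logic becomes table-based logic).
    SELECT t.*,
           t.Amount * c.Rate AS CalculatedValue
    FROM   dbo.BigTable AS t
    JOIN   #UdfCache    AS c
           ON  c.AccountId = t.AccountId
           AND c.AsOfDate  = t.AsOfDate;

    -- Later additions: LEFT JOIN to find parameter tuples that are not yet cached.
    INSERT INTO #UdfCache (AccountId, AsOfDate, Rate)
    SELECT DISTINCT t.AccountId, t.AsOfDate,
           dbo.fn_CalcRate(t.AccountId, t.AsOfDate)
    FROM   dbo.BigTable AS t
    LEFT JOIN #UdfCache AS c
           ON  c.AccountId = t.AccountId
           AND c.AsOfDate  = t.AsOfDate
    WHERE  c.AccountId IS NULL;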
Related
The question is: Can inline table valued functions (ITVF) be used to encapsulate and reuse code? Or will this result in performance issues?
I was researching inline table valued functions, which led me to this discussion:
When would you use a table-valued function?
An answer in the discussion states that an inline table valued function "allows the optimizer to treat these functions no differently than the objects they encapsulate giving you optimum performance (assuming that your indexes and statistics are ideal)."
My original problem was that I was trying to reformat different data sources into a standard format and then union them. I tested unioning 6 different ITVFs versus performing the unions and transformations all in one query. The execution plans were identical.
Since my background is in OOP, I would prefer to split up queries into smaller functions, but before I commit to doing this throughout future projects, I was wondering if using too many ITVFs will eventually cause performance issues.
Can inline table valued functions (ITVF) be used to encapsulate and reuse code?
Yes. And they are superior in this to multi-statement TVFs because with multi-statement TVFs the encapsulation prevents the query optimizer from pushing predicates into the TVF logic, and prevents it from accurately estimating the number of rows returned.
Or will this result in performance issues?
Short answer, not typically.
Longer answer:
There are five ways to encapsulate and reuse query logic (whole queries, not just scalar expressions):
Views
Inline Table Valued Functions
Multi-Statement Table Valued Functions
Temporary Tables
Table Variables
Views and Inline TVFs don't inherently degrade performance, but they add to the complexity of query optimization.
Where the optimizer fails to consistently find low-cost plans, you may need to intervene. A common way to do that is to force spooling (i.e. materializing) of intermediate results, for instance by replacing an Inline TVF with a multi-statement TVF, or by spooling results to a temp table ahead of time.
Spooling reduces the complexity of the encapsulating query, at the cost of losing optimizations the encapsulated query might have received when run in the context of the larger query.
When spooling results, temp tables are typically the best choice, as they can have indexes and statistics that enable SQL Server to accurately assess the cost of the plans that will consume the intermediate results.
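For instance, a minimal sketch of spooling an intermediate result to an indexed temp table before joining it into the larger query (table and column names are invented):

    -- Materialize the intermediate result once...
    SELECT o.CustomerId, SUM(o.Amount) AS TotalAmount
    INTO   #CustomerTotals
    FROM   dbo.Orders AS o
    GROUP BY o.CustomerId;

    -- ...give it an index (and therefore statistics) the optimizer can use...
    CREATE UNIQUE CLUSTERED INDEX IX_CustomerTotals ON #CustomerTotals (CustomerId);

    -- ...then consume it in the larger query.
    SELECT c.CustomerName, t.TotalAmount
    FROM   dbo.Customers AS c
    JOIN   #CustomerTotals AS t ON t.CustomerId = c.CustomerId;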
ITVFs are perfect for encapsulating query logic for re-use. I have a dozen or so financial reports that all query the same set of tables for roughly the same information, and by creating a function to provide that data I can be sure that all of my reports are pulling from the same body of data with the same filters and transformations, etc.
That being said, you could also just as easily create a view instead of an ITVF, but ITVFs also provide a way to filter or otherwise transform data based on the parameters sent in. For example, my financial functions could accept a district name as an optional input parameter and only return data for that district. By using ITVFs this way, the optimizer can, over time, optimize the query plan based on the parameters sent in, which helps rather than hinders performance.
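As a rough illustration (object names are invented), such a function might look like this:

    CREATE FUNCTION dbo.fn_FinancialData (@DistrictName nvarchar(100) = NULL)
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT f.DistrictName, f.AccountCode, f.Amount, f.PostingDate
        FROM dbo.FinancialFacts AS f
        WHERE @DistrictName IS NULL
           OR f.DistrictName = @DistrictName   -- optional filter applied inside the encapsulated query
    );
    GO

    -- Every report pulls from the same body of logic:
    SELECT * FROM dbo.fn_FinancialData(N'North');
    SELECT * FROM dbo.fn_FinancialData(DEFAULT);  -- all districts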
I'd recommend that instead of a union on six different ITVFs, just pull all of your tables together into a single ITVF: that way you only have one place to make updates if your table schemas or reporting demands change.
Greetings,
Recently I've started to work on an application where 8 different modules use the same table at some point in the workflow. This table has an INSTEAD OF trigger that is 5,000 lines long (the first 500 and last 500 lines are common to all modules, and each module then has its own 500 lines of code).
Since the number of modules is going to grow and I want to keep things as clear (and separate) as possible, I was wondering: is there some sort of best practice for splitting the trigger into stored procedures, or should I leave it all in one place?
P.S. Are there going to be any performance penalties for calling procedures from the trigger and passing 15+ parameters to them?
Bearing in mind that the inserted and deleted pseudo-tables are only accessible from within trigger code, and that they can contain multiple rows, you're facing two choices:
Process the rows in inserted and deleted in a RBAR1 fashion, to be able to pass scalar parameters to the stored procedures, or,
Copy all of the data from inserted and deleted into table variables that are then passed to the procedures as appropriate (a sketch of this option follows below).
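A hedged sketch of that second option, assuming SQL Server 2008+ table-valued parameters (all object names are invented):

    CREATE TYPE dbo.OrderRowList AS TABLE
    (
        OrderId int           NOT NULL,
        Status  varchar(20)   NOT NULL,
        Amount  decimal(18,2) NOT NULL
    );
    GO

    CREATE PROCEDURE dbo.ModuleA_ProcessOrders
        @Inserted dbo.OrderRowList READONLY,
        @Deleted  dbo.OrderRowList READONLY
    AS
    BEGIN
        SET NOCOUNT ON;
        -- module-specific logic against the copied rows goes here
    END
    GO

    CREATE TRIGGER dbo.trg_Orders_Modules ON dbo.Orders
    INSTEAD OF UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- being INSTEAD OF, the trigger must also perform the actual UPDATE itself

        DECLARE @ins dbo.OrderRowList, @del dbo.OrderRowList;

        INSERT INTO @ins (OrderId, Status, Amount)
        SELECT OrderId, Status, Amount FROM inserted;

        INSERT INTO @del (OrderId, Status, Amount)
        SELECT OrderId, Status, Amount FROM deleted;

        -- common logic here, then hand off to the module-specific procedures
        EXEC dbo.ModuleA_ProcessOrders @Inserted = @ins, @Deleted = @del;
        -- EXEC dbo.ModuleB_ProcessOrders @Inserted = @ins, @Deleted = @del;
    END
    GO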
I'd expect either approach to impose some2 performance overhead, just from the copying.
That being said, it sounds like too much is happening inside the triggers themselves - does all of this code have to be part of the same transaction that performed the DML statement? If not, consider using some form of queue (a table of requests or Service Broker, say) in which to place information on work to perform, and then process the data later - if you use Service Broker, you could have it inspect a shared message and then send appropriate messages to dedicated endpoints for each of your modules.
1 Row By Agonizing Row - using either a cursor or something else to simulate one, to access each row in turn - usually frowned upon in a set-based language like SQL.
2 How much is impossible to know without getting into the specifics of your code and probably trying all possible approaches and measuring the result.
I don't think there is a meaningful performance penalty in this case.
Anyway, it is bad practice to write it all inside the trigger (when it is 5,000 lines long...).
I think the main consideration is maintainability, which will be much better if you split it into several stored procedures.
I am performing stress testing on SQL Server 2008 with JMeter.
I wish to improve a stored procedure that has to serve 20 requests per second.
The procedure takes an xml parameter and returns an xml result.
Should I use only one parameter value or test multiple scenarios?
My main doubts are:
recompilations of the procedure execution plan (this may slow down the procedure)
extraction of data from disk (not all necessary data may be held in main memory)
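One rough way to keep an eye on the first doubt while the test runs is to watch the plan cache DMVs (a sketch only; 'MyXmlProc' is a placeholder for the procedure name):

    -- plan_generation_num climbs each time a statement in the plan recompiles
    SELECT OBJECT_NAME(ps.object_id, ps.database_id) AS proc_name,
           ps.execution_count,
           ps.cached_time,
           qs.plan_generation_num
    FROM   sys.dm_exec_procedure_stats AS ps
    JOIN   sys.dm_exec_query_stats     AS qs
           ON qs.plan_handle = ps.plan_handle
    WHERE  OBJECT_NAME(ps.object_id, ps.database_id) = 'MyXmlProc';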
Designing a realistic Stress Test/Load Test in SQL Server is an art.
There are many factors that can impact performance:
Hardware: You need to run your tests against the same hardware on which you defined your target (20 calls per second). This includes disk configuration, redundancy, clustering, and so on. This is not always possible, so you need to make the test environment as close as possible; the more it differs from production, the more unrealistic the results can be. For example, if you test with 2 CPUs instead of 4, you cannot simply adjust the results to compensate.
Data load: in terms of the number of records you need to test with, it is ideal to have around 30%-40% more than the maximum number of rows you expect in the tables.
Data and index distribution: It is a common mistake to load the server with preset or completely random data. Both are wrong. The distribution of values needs to be realistic. For example, the distribution of marital status is not the same across all possible values, so you need to design your data generation to reflect this.
Index fragmentation: this is a tough one. Normally indexes are rebuilt overnight, but during the course of the day indexes become fragmented, so performance can be very different at those times.
Concurrent load: A server could give you 20 requests per second if it is the only call you are making to the database, but as soon as you start making other calls, it all falls to pieces. The load needs to include other related parts of the system.
Operation load: There is absolutely no point in making 20 calls per second if the requests are all the same. You need to use data generation techniques to make the requests realistic, not purely random.
If you are using C#, I built a tool a while back which might help you with creating realistic random data.
I have an interesting dilemma. I have a very expensive query that involves doing several full table scans and expensive joins, as well as calling out to a scalar UDF that calculates some geospatial data.
The end result is a resultset that contains data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the entire original dataset and apply GROUP BYs and joins, etc., to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
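A stripped-down sketch of that pattern (table, procedure, and column names are invented; the real resultset would obviously be wider):

    CREATE TABLE dbo.ResultCache
    (
        CacheId  uniqueidentifier NOT NULL,
        ItemId   int              NOT NULL,
        Distance float            NULL,
        CachedAt datetime         NOT NULL DEFAULT (GETDATE())
    );
    CREATE CLUSTERED INDEX IX_ResultCache_CacheId ON dbo.ResultCache (CacheId);
    GO

    CREATE PROCEDURE dbo.GetExpensiveResults
        @CacheId uniqueidentifier = NULL OUTPUT
    AS
    BEGIN
        SET NOCOUNT ON;

        IF @CacheId IS NULL
        BEGIN
            SET @CacheId = NEWID();

            INSERT INTO dbo.ResultCache (CacheId, ItemId, Distance)
            SELECT @CacheId, x.ItemId, x.Distance
            FROM ( /* the expensive scans, joins, and UDF calls go here */
                   SELECT 1 AS ItemId, 0.0 AS Distance ) AS x;
        END

        SELECT ItemId, Distance
        FROM dbo.ResultCache
        WHERE CacheId = @CacheId;
    END
    GO

    -- The periodic job then deletes anything older than, say, 30 minutes:
    DELETE FROM dbo.ResultCache
    WHERE CachedAt < DATEADD(minute, -30, GETDATE());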
This works great, and really speeds things up under zero-load testing. However, I am concerned that this technique may cause an issue under load with massive amounts of reads and writes against the cache table.
So, long story short, am I crazy? Or is this a good idea?
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
I have done that before, especially when I did not have the luxury of editing the application. I think it's a valid approach sometimes, but in general having a cache/distributed cache in the application is preferred, because it better reduces the load on the DB and scales better.
The tricky thing with the naive "just do it in the application" solution is that many times you have multiple applications interacting with the DB, which can put you in a bind if you have no application messaging bus (or something like memcached), because it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging in a cheaper manner, and not need to churn through ALL the data just to get page N. But sometimes it's not possible. Keep in mind that streaming data out of the DB can be cheaper than streaming it out of the DB and back into the same DB. You could introduce a new service that is responsible for executing these long queries and then have your main application talk to the DB via the service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than trying to cache the table for every user.
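If you try the indexed-view route, the rough shape is below (hedged: indexed views come with restrictions such as SCHEMABINDING, no outer joins, and COUNT_BIG for aggregates, and non-Enterprise editions need the NOEXPAND hint; names are invented):

    CREATE VIEW dbo.vw_ItemGeo
    WITH SCHEMABINDING
    AS
    SELECT i.ItemId, i.Name, g.Latitude, g.Longitude
    FROM dbo.Items AS i
    JOIN dbo.GeoData AS g ON g.ItemId = i.ItemId;
    GO

    -- The unique clustered index is what materializes the view.
    CREATE UNIQUE CLUSTERED INDEX IX_vw_ItemGeo ON dbo.vw_ItemGeo (ItemId);
    GO

    -- On Standard edition, reference it with NOEXPAND so the index is used:
    SELECT ItemId, Name, Latitude, Longitude
    FROM dbo.vw_ItemGeo WITH (NOEXPAND)
    WHERE Latitude BETWEEN 40 AND 41;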
We are building a new application in .NET 3.5 with a SQL Server database. The database is fairly large, having around 60 tables with loads of data. The .NET application has functionality to bring data into this database from data entry and from third-party systems.
After all the data is available in the database, the system has to do lots of calculations. The calculation logic is pretty complex. All the data required for the calculations is in the database, and the output also needs to be stored in the database. The data gathering happens every week, and the calculations need to be done every week to generate the required reports.
Given this scenario, I was thinking of doing all these calculations using stored procedures. The problem is that we also need database independence, and stored procedures will not be able to provide us that. But if I do all this in .NET by querying the database all the time, I don't think it will be able to finish the work quickly.
For example, I need to query one table which will return me 2000 rows; then for each row I need to query another table, which will return me 300 results; then for each of those rows I need to query multiple tables (around 10) to get the required data, do the calculation, and store the output in another table.
Now my question: should I go ahead with the stored-procedure solution and forget about database independence, since performance is important? I also think development time will be much less if we use the stored-procedure solution. If any client wants this solution on, say, an Oracle database (because they don't want to maintain another database), then we port the stored procedures to Oracle and maintain two versions for any future changes/enhancements. Similarly, other clients may ask for other databases.
The 2000 rows I mentioned above are product SKUs. The 300 rows I mentioned are the different attributes we want to calculate, e.g. handling cost, transport cost, etc. The 10 tables I mentioned have information about currency conversion, unit conversion, network, area, company, sell price, number sold per day, etc. The resulting table stores all the information as a star schema for analysis and reporting purposes. The goal is to get any minute information about the product so that one knows which attribute of a product's sales is costing us money and where we can make improvements.
I wouldn't consider doing the data manipulation anywhere other than in the database.
Most people try to work with database data using looping algorithms. If you need real speed, think of your data as a SET of rows; you can update thousands of rows within a single update. I have rewritten so many cursor loops written by novice programmers into single update statements where the execution time was massively improved.
you say:
I need to query one table which will return me 2000 rows, then for each row I need to query another table which will return me 300 results, then for each row of this I need to query multiple tables (around 10) to get required data
From your question it looks like you are not using joins, and you are already thinking in loops. Even if you do intend to loop, it is much better to write a query to join in all the data necessary and then loop over it. Remember, UPDATE and INSERT statements can have massively complex queries driving them. Throw in CASE statements, derived tables, and conditional joins (LEFT OUTER JOIN), and you can just about solve any problem in a single UPDATE/INSERT.
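For example (table and column names are invented), the nested loops described in the question collapse into a single set-based statement along these lines:

    -- One INSERT driven by joins instead of nested per-row queries.
    INSERT INTO dbo.SkuAttributeCost (SkuId, AttributeId, Cost)
    SELECT s.SkuId,
           a.AttributeId,
           s.UnitsSoldPerDay
               * ISNULL(uc.Factor, 1)                    -- unit conversion
               * ISNULL(cc.Rate, 1)                      -- currency conversion
               * CASE WHEN a.AttributeType = 'Transport'
                      THEN n.TransportRate
                      ELSE a.BaseRate END AS Cost
    FROM dbo.Skus AS s
    CROSS JOIN dbo.Attributes AS a                       -- every SKU x every attribute
    LEFT OUTER JOIN dbo.UnitConversions AS uc
           ON uc.UnitCode = s.UnitCode
    LEFT OUTER JOIN dbo.CurrencyConversions AS cc
           ON cc.CurrencyCode = s.CurrencyCode
    LEFT OUTER JOIN dbo.Networks AS n
           ON n.NetworkId = s.NetworkId
    WHERE a.IsActive = 1;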
Well, without any specific details of what data you have in these tables, a back-of-the-napkin calculation shows that you're talking about processing around 6 million rows of information in the example you provided (2,000 rows * 300 rows * 10 tables).
Are all of these rows distinct, or are the 10 tables lookup information that has a relatively low cardinality? In other words, would it be possible to make a program that has the information from the 10 lookup tables in memory, and then just process the 300 row result set in memory to perform the calculations?
Also, I would be concerned about scalability -- if you do this in a stored procedure, it is guaranteed to be a serial process limited by the speed of the single database server. If you have the possibility of multiple copies of a client program, each processing a chunk of the 2,000 initial record set, then you can perform some of the calculations in parallel perhaps speeding up your overall processing time, as well as making it scalable for when your initial record set is 10 times larger.
Programming things like calculation code tend to be easier and more maintainable in C#. Also, normally keeping processing on the SQL Server to a minimum is a good practice since the database is the hardest to scale.
Having said that, from your description it sounds like the stored procedure approach is the way to go. When calculation code is dependent on large volumes of data, it's going to be more expensive to move the data off the server for calculation. So unless you have reasonable ways of optimizing the dependent data (such as caching lookup tables?), you are most likely going to find it more painful than it's worth not to use a stored proc.
Stored procedures every time, but as KM said, within those stored procedures keep iterations to a minimum; that is to say, use joins in your SQL. Relational databases are soooooo good at joining.
Database scalability will be a small issue, especially as it sounds like you'd be performing these calculations in a batch process.
Database independence doesn't really exist except for the most trivial of CRUD applications, so if your initial requirement is to get this all working with SQL Server, then leverage the tools that the RDBMS provides (after all, your client will have spent a great deal of money on it). If (and it's a big if) a subsequent client really, really doesn't want to use SQL Server, then you'll have to bite the bullet and code it up in another flavour of stored procedure. But then, as you identified ("if I do all this in .NET by querying the database all the time, I don't think it will be able to finish the work quickly"), you've deferred that expense until if and when it is required.
I would consider doing this in SQL Server Integration Services (SSIS). I'd put the calculations into SSIS, but leave the queries as stored procedures. This would provide you database independence - SSIS can process data from any database with an ODBC connection - as well as high performance. Only the simple SELECT statements would be in stored procedures, and those are the parts of the SQL standard most likely to be identical across multiple database products (assuming you stick to standard forms of query).