Greeting,
Recently I've started to work on an application, where 8 different modules are using the same table at some point in the workflow. This table have an Instead-Of trigger, which is 5,000 lines long (where first 500 and last 500 lines are common for all modules, and then each module has its own 500 lines of code).
Since the number of modules are going to grow and I want to keep thing as clear (and separate) as possible, I was wondering is there some sort of best practice to split trigger into stored procedures, or should I leave it all in one place?
P.S. Are there going to be any performance penalties for calling procedures from the trigger and passing 15+ parameters to them?
Bearing in mind that the inserted and deleted pseudo-tables are only accessible from within trigger code, and that they can contain multiple rows, you're facing two choices:
Process the rows in inserted and deleted in a RBAR1 fashion, to be able to pass scalar parameters to the stored procedures, or,
Copy all of the data from inserted and deleted into table variables that are then passed to the procedures as appropriate.
I'd expect either approach to impose some2 performance overhead, just from the copying
That being said, it sounds like too much is happening inside the triggers themselves - does all of this code have to be part of the same transaction that performed the DML statement? If not, consider using some form of queue (a table of requests or Service Broker, say) in which to place information on work to perform, and then process the data later - if you use Service Broker, you could have it inspect a shared message and then send appropriate messages to dedicated endpoints for each of your modules, as appropriate.
1 Row By Agonizing Row - using either a cursor of something else to simulate one to access each row in turn - usually frowned upon in a Set-based language like SQL.
2 How much is impossible to know without getting into the specifics of your code and probably trying all possible approaches and measuring the result.
I don't think there is a meaningful performance penalty in this case.
Any way, it is bad practice to write it all inside the trigger (when it is 5000 lines long...).
I think the main consideration is maintainability, which will be much better if you split it
To several SPs
Related
Suppose I have a table which contains relevant information. However, the data is only relevant for, let's say, 30 minutes.
After that it's just database junk, so I need to get rid of it asap.
If I wanted, I could clean this table periodically, setting an expiration date time for each record individually and deleting expired records through a job or something. This is my #1 option, and it's what will be done unless someone convince me otherwise.
But I think this solution may be problematic. What if someone stops the job from running and no one notices? I'm looking for something like a built-in way to insert temporary data into a table. Or a table that has "volatile" data itself, in a way that it automagically removes data after x amount of time after its insertion.
And last but not least, if there's no built-in way to do that, could I be able to implement this functionality in SQL server 2008 (or 2012, we will be migrating soon) myself? If so, could someone give me directions as to what to look for to implement something like it?
(Sorry if the formatting ends up bad, first time using a smartphone to post on SO)
As another answer indicated, TRUNCATE TABLE is a fast way to remove the contents of a table, but it's aggressive; it will completely empty the table. Also, there are restrictions on its use; among others, it can't be used on tables which "are referenced by a FOREIGN KEY constraint".
Any more targeted removal of rows will require a DELETE statement with a WHERE clause. Having an index on relevant criteria fields (such as the insertion date) will improve performance of the deletion and might be a good idea (depending on its effect on INSERT and UPDATE statements).
You will need something to "trigger" the DELETE statement (or TRUNCATE statement). As you've suggested, a SQL Server Agent job is an obvious choice, but you are worried about the job being disabled or removed. Any solution will be vulnerable to someone removing your work, but there are more obscure ways to trigger an activity than a job. You could embed the deletion into the insertion process-- either in whatever stored procedure or application code you have, or as an actual table trigger. Both of those methods increase the time required for an INSERT and, because they are not handled out of band by the SQL Server Agent, will require your users to wait slightly longer. If you have the right indexes and the table is reasonably-sized, that might be an acceptable trade-off.
There isn't any other capability that I'm aware of for SQL Server to just start deleting data. There isn't automatic data retention policy enforcement.
See #Yuriy comment, that's relevant.
If you really need to implement it DB side....
Truncate table is fast way to get rid of records.
If all you need is ONE table and you just need to fill it with data, use it and dispose it asap you can consider truncating a (permanent) "CACHE_TEMP" table.
The scenario can become more complicated you are running concurrent threads/jobs and each is handling it's own data.
If that data is just existing for a single "job"/context you can consider using #TEMP tables. They are a bit volatile and maybe can be what you are looking for.
Also you maybe can use table variables, they are a bit more volatile than temporary tables but it depends on things you don't posted, so I cannot say what's really better.
I'm not sure I 100% understand what the database does. If I just have some misconception, please point it out.
Let's say I have a function that wants to create 100 new entry in the database with has 100,000 entries.
It seems a lot faster when those 100 entries get create and the commit is made after the last entry is created.
Now, if those 100 entries get created by different users, is there a easy way to commit only after 100 entries are created?
Edit:
Should I maybe write some sort of buffer?
Databases are optimized for set-based operations, so yes it wouldbe faster to insert 100 records in a set than one at a time. However, when you are talking about users entering records one ata atime, you would not want to group them together under any circumstances that I can think of. Why?
First, if there was one bad record, the others would fail. This would make for 99 cranky users out of 100 (actually 100, but one would not really have reason to be cranky becasue he did the bad data entry to begin with).
Second, users would not see the records immediately after being entered. It is also true that they would not be able to do something further with those records until they are entered such as enter data into related tables. Having a delay like this would make users cranky. If users are entering data from customers through a phone call, they will be especially cranky at the wait (I worked at a call center with a horribly slow commercial product and believe me I know how upset the users used to get!)
Third, users will have gone on to something else and would not realize that their data was rejected for bad information, not a good thing at all.
How long are you going to wait to get your set number of records? 5 seconds, ten minutes?
What happens if for some reason the netwrok connection is lost during that time, wouldn;t the users lose the data they entered.
You might be able to hack something like that together, but you really shouldn't, because it wrecks your data integrity, which is the whole point of using transactions.
In your proposed solution, a problem with any insert in the batch would cause all the other (possibly totally valid) inserts from completely different users to fail. Also, users wouldn't be able to see the data they just tried to insert because the system was waiting to do the insert until the batch was full.
P.S. Here's a quick intro to transaction processing.
I think you do have a misconception. It sounds like you're looking at the database as something that is only for some sort of "long-term" memory. This is a bad concept; the database is the only memory your application has. Even when this isn't true, it's best to pretend that it is.
To go a little deeper, your application has:
scoped memory: variables that you define within view functions, for example. These all get destroyed when flow leaves the function.
globals: variables that are defined in the outermost part of your code. It is really important not to use these for any sort of state except perhaps configuration constants. The important thing is that you should rely on any dynamic behavior. Otherwise you will have to battle concurrency and forked processes (depending on server gateway) that aren't aware of each other. Just don't do it.
a caching scheme, if you choose to implement one. This is entirely optional in django, and there are many ways to do it. However, one typically uses some scheme to ensure that even if the cache crashes, the database reflects the current state of the data accurately.
your local filesystem. From a design point of view, most ways of taking advantage of this will either resemble a caching system (above) or be clumsy and fragile. From a performance point of view, it might be about as slow as a database.
your database.
So you see that there's not much place for you to put your data besides the database.
Suppose i am having stored procedures which performs Insert/update/delete operations on table.
Depending upon some criteria i want to perform some operations.
Should i create trigger or do the operation in stored procedure itself.
Does using the triggers decreases the performance?
Does these two tables viz Inserted and deleted exists(persistent) or are created dynamically?
If they are created dynamically does it have performance issue.
If they are persistent tables then where are they?
Also if they exixts then can i access Inserted and Deleted tables in stored procedures?
Will it be less performant than doing the same thing in a stored proc. Probably not but with all performance questions the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000 record table don't test with a table with 100 records!)
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger though there are several things you should be aware of. First, the trigger fires once for each batch, so whether you inserted one record or 100,000 records the trigger only fires once. You cannot assume ever that only one record will be affected. Nor can you assume that it will always only be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also do not send emails from a cursor, you do not want to stop all inserts, updates, or deletes if the email server is down.
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
Inserted and deleted tables are available within the trigger, so calling them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to manually be doing whatever the trigger is doing anyway then they increase performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? the trigger will perform an update on the DB after the event so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your Operation is important and must never be missed, then you have 2 choice
Execute your operation immediately after Update/Delete with durability
Delay the operation by making it loosely coupled with durability.
We also faced the same issue and our production MSSQL 2016 DB > 1TB with >500 tables and need to send changes(insert, update, delete) of few columns from 20 important tables to 3rd party. Number of business process that updates those few columns in 20 important tables were > 200 and it's a tedious task to modify them because it's a legacy application. Our existing process must work without any dependency of data sharing. Data Sharing order must be important. FIFO must be maintained
eg User Mobile No: 123-456-789, it change to 123-456-123 and again change to 123-456-456
order of sending this 123-456-789 --> 123-456-123 --> 123-456-456. Subsequent request can only be send if response of first previous request is successful.
We created 20 new tables with limited columns that we want. We compare main tables and new table (MainTable1 JOIN MainTale_LessCol1) using checksum of all columns and TimeStamp Column to Identify change.
Changes are logged in APIrequest tables and updated back in MainTale_LessCol1. Run this logic in Scheduled Job every 15 min.
Separate process will pick from APIrequest and send data to 3rd party.
We Explored
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, and cumulative changes on those 20 tables were > 1000/sec and our system were already on peak capacity, our current design work.
You can try CDC share your experience
I have an interesting delimma. I have a very expensive query that involves doing several full table scans and expensive joins, as well as calling out to a scalar UDF that calculates some geospatial data.
The end result is a resultset that contains data that is presented to the user. However, I can't return everything I want to show the user in one call, because I subdivide the original resultset into pages and just return a specified page, and I also need to take the original entire dataset, and apply group by's and joins etc to calculate related aggregate data.
Long story short, in order to bind all of the data I need to the UI, this expensive query needs to be called about 5-6 times.
So, I started thinking about how I could calculate this expensive query once, and then each subsequent call could somehow pull against a cached result set.
I hit upon the idea of abstracting the query into a stored procedure that would take in a CacheID (Guid) as a nullable parameter.
This sproc would insert the resultset into a cache table using the cacheID to uniquely identify this specific resultset.
This allows sprocs that need to work on this resultset to pass in a cacheID from a previous query and it is a simple SELECT statement to retrieve the data (with a single WHERE clause on the cacheID).
Then, using a periodic SQL job, flush out the cache table.
This works great, and really speeds things up on zero load testing. However, I am concerned that this technique may cause an issue under load with massive amounts of reads and writes against the cache table.
So, long story short, am I crazy? Or is this a good idea.
Obviously I need to be worried about lock contention, and index fragmentation, but anything else to be concerned about?
I have done that before, especially when I did not have the luxury to edit the application. I think its a valid approach sometimes, but in general having a cache/distributed cache in the application is preferred, cause it better reduces the load on the DB and scales better.
The tricky thing with the naive "just do it in the application" solution, is that many time you have multiple applications interacting with the DB which can put you in a bind if you have no application messaging bus (or something like memcached), cause it can be expensive to have one cache per application.
Obviously, for your problem the ideal solution is to be able to do the paging in a cheaper manner, and not need to churn through ALL the data just to get page N. But sometimes its not possible. Keep in mind that streaming data out of the db can be cheaper than streaming data out of the db back into the same db. You could introduce a new service that is responsible for executing these long queries and then have your main application talk to the db via the service.
Your tempdb could balloon like crazy under load, so I would watch that. It might be easier to put the expensive joins in a view and index the view than trying to cache the table for every user.
If my program was to hit the database with multiple updates would it be better to pull in the tables into a dataset, change the values and then send it back to the database. Does anyone know what's more expensive?
No matter what, the database needs to perform all those updates based on the edits you did to the local DataSet. As I understand it, that will be just as expensive as sequentially updating. The only advantage is its easier to iterate over a dataset rather than pull-and-push one result after another.
What will be expensive is all the workaround code for dealing with the potential exceptions that can occur because you choose "which costs less" over "which is simplest". Premature optimization.
Depends on the size of the DataSet. If your data set is too large, it doesn't worth it. Otherwise, it might be a good approach. However, nothing prevents you to do multiple updates in a batch even without using a DataSet. You could write a stored procedure with an XML parameter that will do batch updates for you.