Generate change log with minimal SQL calls? - sql-server

I am trying to run a lot of update statements from code, and we have a requirement to summarize what changed for every operation for an audit log.
The update basically persists an entire graph consisting of dozens of tables to SQL Server. Right now, before we begin, we collect the data from all the tables and assemble the graph(s) as a "before" picture, apply the updates, then re-collect the data from all the tables and re-assemble the graph(s) as the "after" picture, serialize the before and after graph(s) to JSON, and finally post a message to an ESB queue for an off-process consumer to crunch through the graphs, identify the deltas, and update the audit log. All of the SQL operations occur in a single transaction.
Needless to say, this is an expensive and time-consuming process.
I've been playing with the OUTPUT clause in T-SQL. I like the idea of getting the results of the operation in the same command as the update, but it seems to have some limitations. For example, ideally it'd be great if I could get the INSERTED and DELETED result sets back at the same time, but there doesn't seem to be a concept of a UNION between the two, so that gets unwieldy very quickly. Also, because the updates don't actually modify every column, I can't simply take the changes I made and compare them to DELETED, since we'd show deltas for columns we didn't change.
...but maybe I'm missing some syntax with the OUTPUT clause, or I'm not using it correctly, so I figured I'd ask the SO community.
What is the most efficient way to collect the deltas of an update operation in SQL Server? The goal is to minimize the calls to SQL Server, and collect the minimum necessary amount of information for writing an accurate audit log, without writing a bunch of custom code for every single operation.
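For reference, T-SQL does let a single UPDATE's OUTPUT clause project columns from both the deleted and inserted pseudo-tables side by side, so old and new values can be captured together in one pass; a minimal sketch, with hypothetical table and column names:

    DECLARE @audit TABLE
    (
        PersonId     int,
        OldFirstName nvarchar(50),
        NewFirstName nvarchar(50),
        OldLastName  nvarchar(50),
        NewLastName  nvarchar(50)
    );

    -- Old and new values come back together, one row per updated record
    UPDATE dbo.Person
    SET    FirstName = 'Jane',
           LastName  = 'Doe'
    OUTPUT inserted.PersonId,
           deleted.FirstName, inserted.FirstName,
           deleted.LastName,  inserted.LastName
    INTO   @audit (PersonId, OldFirstName, NewFirstName, OldLastName, NewLastName)
    WHERE  PersonId = 42;

    SELECT * FROM @audit;

This still reports every column listed in the OUTPUT clause, so filtering out columns whose values did not actually change remains a separate step, as noted in the question.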

Related

Are there best practices for deduplicating records when using auto-ingest Snowpipes?

Currently in Snowflake we have configured an auto-ingest Snowpipe connected to an external S3 stage as documented here. This works well and we're copying records from the pipe into a "landing" table. The end goal is to MERGE these records into a final table to deal with any duplicates, which also works well. My question is around how best to safely perform this MERGE without missing any records. At the moment we perform a single data extraction job per day, so there is normally a point where the Snowpipe queue is empty, which we use as an indicator that it is safe to proceed. However, we are looking to move to more frequent extractions, where it will become harder and harder to guarantee there will be no new records ingested at any given point.
Things we've considered:
Temporarily pause the pipe, MERGE the records, TRUNCATE the landing table, then unpause the pipe. I believe this should technically work, but it is not clear to me that this is an advised way to work with Snowpipes. I'm not sure how resilient they are to being paused/unpaused, how long it tends to take to pause/unpause, etc. I am aware that paused pipes can become "stale" after 14 days (link), but we're talking about pausing for a few minutes, not multiple days.
Utilize transactions in some way. I have a general understanding of SQL transactions, but I'm having a hard time determining exactly if/how they could be used in this situation to guarantee no data loss. The general thought is that if the MERGE and DELETE could be contained in a transaction, it might provide a safe way to process the incoming data throughout the day, but I'm not sure if that's true.
Add in a third "processing" table and a task to swap the landing table with the processing table. The task to swap the tables could run on a schedule (e.g. every hour), and I believe the key is to have the conditional statement check both that there are records in the landing table AND that the processing table is empty. At that point the MERGE and TRUNCATE would work off the processing table, and the landing table would continue to receive the incoming records.
Any additional insights into these options or completely different suggestions are very welcome.
Look into table streams, which record insertions/updates/deletions to your Snowpipe table. You can then merge off the stream into your target table, which resets the stream's offset. Use a task to run your merge statement. Also, given it is Snowpipe, when creating your stream it is probably best to use an append-only stream.
However, I had a question here where, in some circumstances, we were missing some rows. Our task was set to 1-minute intervals, which may be partly the reason, but I never did get to the bottom of it, even with Snowflake support.
What we did notice, though, was that using a stored procedure with a transaction, and also running a SELECT on the stream before the MERGE, seems to have solved the issue, i.e. no more missing rows.
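A minimal sketch of the stream + task approach (all object names, the warehouse, and the schedule are placeholders; the MERGE key and dedup logic would follow your own table structure):

    -- Append-only stream on the landing table fed by the Snowpipe
    CREATE OR REPLACE STREAM landing_stream
      ON TABLE landing_table
      APPEND_ONLY = TRUE;

    -- Task that merges only when the stream has data; consuming the stream
    -- in the MERGE advances its offset within the task's transaction
    CREATE OR REPLACE TASK merge_landing_task
      WAREHOUSE = etl_wh
      SCHEDULE  = '5 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('landing_stream')
    AS
      MERGE INTO final_table AS t
      USING (
          -- keep only the latest row per key within this batch
          SELECT id, payload, loaded_at
          FROM landing_stream
          QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) = 1
      ) AS s
      ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET t.payload = s.payload, t.loaded_at = s.loaded_at
      WHEN NOT MATCHED THEN INSERT (id, payload, loaded_at)
                            VALUES (s.id, s.payload, s.loaded_at);

    ALTER TASK merge_landing_task RESUME;

Because the stream's offset only advances when the consuming transaction commits, the landing table can keep receiving Snowpipe loads while the task runs, without pausing the pipe or truncating anything.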

Rolling back specific, older SQL transaction

We have a fairly large stored proc that merges two people found in our system with similar names. It deletes and updates rows in many different tables (about 10 or so, all very different). It's all wrapped in a transaction and rolls back if it fails, of course. This may be a dumb question, but is it possible to somehow store and roll back just this specific transaction at a later time, without having to create and insert into many "history" tables that keep track of exactly what happened? I don't want to restore the whole database, just the results of a stored procedure's specific transaction, and at a later date.
It sounds like you may want to investigate Change Data Capture.
It will still be capturing data as it changes, and if you're only doing it for one execution or for a very small amount of data, other methods may be better.
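If CDC does fit (it needs an edition and version of SQL Server that support it), enabling it is a couple of system procedure calls; the table name here is hypothetical:

    -- Enable Change Data Capture at the database level (run in the target database)
    EXEC sys.sp_cdc_enable_db;

    -- Track changes on a hypothetical dbo.Person table
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Person',
        @role_name     = NULL;

    -- Changed rows can later be read via the generated
    -- cdc.fn_cdc_get_all_changes_dbo_Person function.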
Once a transaction has been committed, it's not possible to roll back just that one transaction at a later date. You're "committed" in quite a literal sense. You can obviously restore from a backup and roll back every transaction since a particular point, but that's probably not what you are looking for.
So making auditing tables is about your only option. As another answer pointed out, you can use Change Data Capture, but unless you have forked out the big money for Enterprise Edition, this isn't an option for you. If you're just interested in undoing this particular type of change, it's probably easiest to add some code to the procedure that does the merge to store the data necessary to re-split the records, and to create a procedure that does the actual split. But keep in mind that you must handle any changes to the "merged" data that might break your ability to perform the split. This is why SQL Server can't do it for you automatically: it doesn't know how you might want to handle changes to the data that occur after your original transaction.
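As a sketch of that idea (all table, column, and parameter names here are hypothetical): record enough inside the merge procedure to re-point the rows later, and let a companion split procedure reverse it.

    -- Inside the (hypothetical) merge procedure: before re-pointing child rows
    -- from the person being merged away, record where they came from.
    -- @MergeId, @SourcePersonId and @TargetPersonId are procedure parameters.
    INSERT INTO dbo.AddressMergeHistory (MergeId, AddressId, OldPersonId, NewPersonId, MergedAt)
    SELECT @MergeId, a.AddressId, a.PersonId, @TargetPersonId, SYSUTCDATETIME()
    FROM   dbo.Address AS a
    WHERE  a.PersonId = @SourcePersonId;

    -- The companion split procedure re-points the rows later, subject to the
    -- caveat above about data that changed after the merge.
    UPDATE a
    SET    a.PersonId = h.OldPersonId
    FROM   dbo.Address AS a
    JOIN   dbo.AddressMergeHistory AS h ON h.AddressId = a.AddressId
    WHERE  h.MergeId = @MergeId;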

Is there a quicker way to add column or change fields in SSIS packages?

We have 10 slightly different SSIS packages that transfer data from one database to another. Whenever we make a change to the first db, such as adding a new field or changing a property of an existing field (e.g. extending a varchar's length), we have to update the packages as well.
Each of these packages has a long flow with multiple merge joins, sorts, conditional statements, etc. If the field that needs to be changed is at the beginning of the process, I have to go through each merge and update it with the new change, and each time I do so it takes a few minutes to process before I can move on to the next one. As I get near the end, each merge join takes longer and longer to recompute. Doing this for 10 different packages, even if they are done at the same time, still takes upwards of 3 hours. This is time-consuming and very monotonous. There's got to be a better way, right?
BIML is very good for this. BIML is an XML-based technology which translates to dtsx packages. BIMLScript is BIML interleaved with C# or VB to provide control-flow logic, so you can create multiple packages/package elements based on conditions. You can easily query the table structure or custom metadata, so if you are only doing db-to-db transformations, you can make structural changes to the database(s) and regenerate your SSIS packages without having to do any editing.
The short answer is no. The metadata that SSIS generates makes it very awkward when data sources change. You can go down the road of dynamically generated packages, but it's not ideal.
Your other option is damage reduction. Consider whether you could implement the Canonical Data Model pattern:
http://www.eaipatterns.com/CanonicalDataModel.html
It would involve mapping the data to some kind of internal format immediately on receiving it, possibly via a temporary table or cache, and then only using your internal format from then on. Then you map back to your output format at the end of processing.
While this does increase the overall complexity of your package, it means that an external data source changing will only affect the transforms at the beginning and end of your processing, which may well save you lots of time in the long run.
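In T-SQL terms, the mapping step might look something like this sketch (schema, table, and column names are invented): land the raw feed as-is, then translate it once into the internal shape that the downstream transforms consume.

    -- Map the external feed into the internal (canonical) format once,
    -- so downstream transforms never see the source schema directly.
    INSERT INTO staging.CanonicalCustomer (CustomerKey, FullName, EmailAddress)
    SELECT src.cust_id,
           LTRIM(RTRIM(src.first_nm + ' ' + src.last_nm)),
           LOWER(src.email)
    FROM   landing.RawCustomer AS src;

When the source adds or widens a column, only this mapping (and the final output mapping) needs to change, rather than every merge join along the way.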

Splitting Trigger in SQL Server

Greetings,
Recently I've started to work on an application where 8 different modules use the same table at some point in the workflow. This table has an INSTEAD OF trigger which is 5,000 lines long (the first 500 and last 500 lines are common to all modules, and then each module has its own 500 lines of code).
Since the number of modules is going to grow and I want to keep things as clear (and separate) as possible, I was wondering: is there some sort of best practice for splitting a trigger into stored procedures, or should I leave it all in one place?
P.S. Are there going to be any performance penalties for calling procedures from the trigger and passing 15+ parameters to them?
Bearing in mind that the inserted and deleted pseudo-tables are only accessible from within trigger code, and that they can contain multiple rows, you're facing two choices:
Process the rows in inserted and deleted in a RBAR[1] fashion, to be able to pass scalar parameters to the stored procedures, or,
Copy all of the data from inserted and deleted into table variables that are then passed to the procedures as appropriate (sketched below).
I'd expect either approach to impose some[2] performance overhead, just from the copying.
That being said, it sounds like too much is happening inside the triggers themselves - does all of this code have to be part of the same transaction that performed the DML statement? If not, consider using some form of queue (a table of requests or Service Broker, say) in which to place information about the work to perform, and then process the data later - if you use Service Broker, you could have it inspect a shared message and then send messages to dedicated endpoints for each of your modules, as appropriate.
[1] Row By Agonizing Row - using either a cursor or something else to simulate one, to access each row in turn - usually frowned upon in a set-based language like SQL.
[2] How much is impossible to know without getting into the specifics of your code, and probably trying all possible approaches and measuring the results.
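A sketch of the second option using table-valued parameters (all object names are hypothetical; each module procedure would declare the parameters as READONLY table-valued parameters of the same type):

    -- Table type mirroring the columns the module procedures care about
    CREATE TYPE dbo.WorkflowRowSet AS TABLE
    (
        RowId     int,
        Status    nvarchar(50),
        ModuleKey nvarchar(20)
    );
    GO

    CREATE TRIGGER dbo.trg_Workflow_IO_Update
    ON dbo.Workflow
    INSTEAD OF UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- ... shared leading logic (the common "first 500 lines") ...

        DECLARE @ins dbo.WorkflowRowSet,
                @del dbo.WorkflowRowSet;

        INSERT INTO @ins SELECT RowId, Status, ModuleKey FROM inserted;
        INSERT INTO @del SELECT RowId, Status, ModuleKey FROM deleted;

        -- Each module's 500 lines become a procedure that declares
        -- @inserted and @deleted as dbo.WorkflowRowSet READONLY
        EXEC dbo.Module1_HandleUpdate @inserted = @ins, @deleted = @del;
        EXEC dbo.Module2_HandleUpdate @inserted = @ins, @deleted = @del;

        -- ... shared trailing logic (the common "last 500 lines") ...
    END;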
I don't think there is a meaningful performance penalty in this case.
Anyway, it is bad practice to write it all inside the trigger (when it is 5,000 lines long...).
I think the main consideration is maintainability, which will be much better if you split it into several stored procedures.

Do triggers decrease performance? Inserted and deleted tables?

Suppose I have stored procedures which perform insert/update/delete operations on a table.
Depending upon some criteria, I want to perform some additional operations.
Should I create a trigger, or do the operation in the stored procedure itself?
Does using triggers decrease performance?
Do the two tables, viz. inserted and deleted, persist, or are they created dynamically?
If they are created dynamically, is there a performance issue?
If they are persistent tables, then where are they?
Also, if they exist, can I access the inserted and deleted tables from stored procedures?
Will it be less performant than doing the same thing in a stored proc? Probably not, but as with all performance questions, the only way to really know is to test both approaches with a realistic data set (if you have a 2,000,000-record table, don't test with a table of 100 records!).
That said, the choice between a trigger and another method depends entirely on the need for the action in question to happen no matter how the data is updated, deleted, or inserted. If this is a business rule that must always happen no matter what, a trigger is the best place for it or you will eventually have data integrity problems. Data in databases is frequently changed from sources other than the GUI.
When writing a trigger, though, there are several things you should be aware of. First, the trigger fires once per statement, so whether you inserted one record or 100,000 records, the trigger only fires once. You cannot ever assume that only one record will be affected, nor can you assume that it will always be a small record set. This is why it is critical to write all triggers as if you are going to insert, update or delete a million rows. That means set-based logic and no cursors or while loops if at all possible. Do not take a stored proc written to handle one record and call it in a cursor in a trigger.
Also, do not send emails from the trigger (in a cursor or otherwise); you do not want to stop all inserts, updates, or deletes if the email server is down.
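For example, a set-based audit trigger (hypothetical table and column names) handles one affected row or a million with a single INSERT and no cursor:

    CREATE TRIGGER dbo.trg_Person_AuditEmail
    ON dbo.Person
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- One INSERT covers every affected row; no cursor, no per-row proc calls
        INSERT INTO dbo.PersonAudit (PersonId, OldEmail, NewEmail, ChangedAt)
        SELECT d.PersonId, d.Email, i.Email, SYSUTCDATETIME()
        FROM   inserted AS i
        JOIN   deleted  AS d ON d.PersonId = i.PersonId
        WHERE  ISNULL(d.Email, '') <> ISNULL(i.Email, '');
    END;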
Yes, a table with a trigger will not perform as well as it would without it. Logic dictates that doing something is more expensive than doing nothing.
I think your question would be more meaningful if you asked in terms of whether it is more performant than some other approach that you haven't specified.
Ultimately, I'd select the tool that is most appropriate for the job and only worry about performance if there is a problem, not before you have even implemented a solution.
The inserted and deleted tables are only available within the trigger, so accessing them from stored procedures is a no-go.
It decreases performance on the query by definition: the query is then doing something it otherwise wasn't going to do.
The other way to look at it is this: if you were going to do manually whatever the trigger is doing anyway, then a trigger increases performance by saving a round trip.
Take it a step further: that advantage disappears if you use a stored procedure and you're running within one server roundtrip anyway.
So it depends on how you look at it.
Performance on what? The trigger will perform an update on the DB after the event, so the user of your system won't even know it's going on. It happens in the background.
Your question is phrased in a manner quite difficult to understand.
If your operation is important and must never be missed, then you have two choices:
Execute your operation immediately after the update/delete, with durability.
Delay the operation by making it loosely coupled, with durability.
We also faced the same issue: our production SQL Server 2016 database is over 1 TB with more than 500 tables, and we needed to send changes (insert, update, delete) of a few columns from 20 important tables to a third party. More than 200 business processes update those few columns in the 20 important tables, and modifying them all is a tedious task because it's a legacy application. Our existing processes had to keep working without any dependency on the data sharing, the order of data sharing matters, and FIFO must be maintained.
E.g. a user's mobile number 123-456-789 changes to 123-456-123, and then again to 123-456-456.
The order of sending is 123-456-789 --> 123-456-123 --> 123-456-456, and a subsequent request can only be sent if the response to the previous request was successful.
We created 20 new tables with only the columns we want. We compare each main table with its new table (MainTable1 JOIN MainTale_LessCol1) using a checksum of all columns and a timestamp column to identify changes.
Changes are logged in the APIrequest tables and written back to MainTale_LessCol1. This logic runs in a scheduled job every 15 minutes.
A separate process picks rows up from APIrequest and sends the data to the third party.
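A rough sketch of that comparison step in T-SQL (column names are invented; MainTable1 is the wide source table and MainTale_LessCol1 the narrow copy described above):

    -- Rows whose checksum differs from the last captured copy (or that are new)
    -- become API requests, in timestamp order so FIFO can be preserved downstream.
    INSERT INTO dbo.APIrequest (SourceId, MobileNo, Email, CapturedAt)
    SELECT m.Id, m.MobileNo, m.Email, m.TimeStampCol
    FROM   dbo.MainTable1 AS m
    LEFT JOIN dbo.MainTale_LessCol1 AS s
           ON s.Id = m.Id
    WHERE  s.Id IS NULL
       OR  CHECKSUM(m.MobileNo, m.Email) <> CHECKSUM(s.MobileNo, s.Email)
    ORDER BY m.TimeStampCol;

    -- After logging, refresh the narrow copy so the next run only sees new changes
    MERGE dbo.MainTale_LessCol1 AS s
    USING dbo.MainTable1 AS m ON s.Id = m.Id
    WHEN MATCHED THEN UPDATE SET s.MobileNo = m.MobileNo, s.Email = m.Email
    WHEN NOT MATCHED THEN INSERT (Id, MobileNo, Email) VALUES (m.Id, m.MobileNo, m.Email);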
We explored:
Triggers
CDC (Change Data Capture)
200+ Process Changes
Since our deadlines were strict, cumulative changes on those 20 tables exceeded 1,000/sec, and our system was already at peak capacity, our current design works for us.
You can try CDC and share your experience.

Resources