Should I use a merge with this scenario? - sql-server

I have a table that gets updated from an outside source. It normally sits empty until they push data to me. With this data I am supposed to add, update or delete records in two other tables (link by a primary/foreign key). Data is pushed to me one row at a time and occasionally in a large download twice a year. They want me to update my tables in real time. SHould I use a trigger and have it read line by line or merge the tables?

I'd have a scheduled job that runs a sproc to check for work to do in that table, and them process them in batches. Have a column on the import/staging table that you can update with a batch number or timestamp so if something goes wrong (like they have pushed you some goofy data) you know where to restart from and can identify which row caused the problem.
If you use a trigger, not only might it slow down them feeding you a large batch of data, but you'll also possibly lose the ability to keep a record of where the process got to if it fails.
If it was always one row at a time then I think the trigger method would be okay option.
Edit: Just to clarify the point about batch number/timestamp, this is so if you have new/unexpected data which crashes your import, you can alter the code and re-run the process as much as you like without having to ask for a fresh import.

Related

Snowflake: is it ever okay to run infinite loops? I need to track changes from a table every second

My use case is that I need to track all the changes (insertions/updates/deletions) from a table.
My idea is to create a stream on that table, and consume that stream every second or so, exporting all the changes to another history table (mytable_history).
A task would be the perfect candidate for that. But unfortunately, a task can only be scheduled for 1 minute or more. I'll be getting same-row updates per second, so I'd really need the task to run every second at least.
My idea now is to run an infinite LOOP, using SYSTEM$WAIT to consume the stream every 1 second and inserting the data to the history table.
Is this a bad idea? What could go wrong?
Thanks
I can add two points to your idea:
Please note that "DML updates to the source object in parallel transactions are tracked by the change tracking system but do not update the stream until the explicit transaction statement is committed and the existing change data is consumed." (https://docs.snowflake.com/en/user-guide/streams-intro.html#table-versioning)
Your warehouse would run all day to process this, that's why your costs would increase noticeable.

Backing up a table before deleting all the records and reloading it in SSIS

I have a table named abcTbl, the data in there is populated
from other tables from a different database. Every time I am loading
data to abcTbl, I am doing a delete all to it and loading the buffer
data into it.
This package runs daily. My question is how do I avoid losing data
from the table abcTbl if we fail to load the data into it. So my
first step is deleting all the data in the abcTbl and then
selecting the data from various sources into a buffer and then
loading the buffer data into abcTbl.
Since we can encounter issues like failed connections, package
stopping prematurely, supernatural forces trying to stop/break my
package from running smoothly, etc. which will end up with the
package losing all the data in the buffer after I have already
deleted the data from abcTbl. 
My first intuition was to save the data from the abcTbl into a
backup table and then deleting the data in the abcTbl but my DBAs
wouldn't be too thrilled about creating a backup table for in every
environment for the purpose of this package, and giving me juice to
create backup tables on the fly and then deleting it again is out of
the question too. This data is not business critical and can be repopulated
again if lost.
But, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading data from one table (Original) to another table (Backup), you can just rename your original table to something (back-up table), create original table again like the back-up table and then drop the renamed table only when your data load is successful. This may save some time to transfer data from one table to another. You may want to test which approach is faster for you depending on your data/table structure etc., But what I wanted to mention is, this is also one of the way to do it. If you have lot of data in that table below approach may be faster.
sp_rename 'abcTbl', 'abcTbl_bkp';
CREATE TABLE abcTbl ;
While creating this table, you can keep similar table structure as that of abcTbl_bkp
Load your new data to abcTbl table
DROP TABLE abcTbl_bkp;
Trying to figure this out but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBA's that a seperate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DateTime field(say history_date). Each load you would just flow all the data in your primary table to the backup table. Use a Derived Column task in the Data Flow to add the history_date value to the backup table.
Once the backup table is complete, either truncate or delete the contents of the current table. Then load the new data.
Instead of created additional tables you can set the package to execute as a single transaction. By doing this, if any component fails all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package. This will allow that the package will begin a transaction. After this set all this property to Supported for all components that you want to succeed or fail together. The Supported level will have these tasks join a transaction that is already in progress by the parent container, being the package in this case. If there are other components in the package that you want to commit or rollback independent of these tasks you can place the related objects in a Sequence container, and apply the Required level to the Sequence instead. An important thing to note is that if anything performs a TRUNCATE then all other components that access the truncated object will need to have the ValidateExternalMetadata option set to false to avoid the known blocking issue that is a result of this.

How to find out the rows affected in SQL Profiler or trace?

I'm using tracing to log all delete or update queries run through the system. The problem is, if I run a query like DELETE FROM [dbo].[Artist] WHERE ArtistId>280, I know how many rows were deleted but I'm unable to find out which rows were deleted (the data they had).
I'm thinking of doing this as a logging system so it would be useful to see which rows were affected and what data they had if at all possible. I don't really want to use triggers for this job but I will if I have to (and if it's feasible).
If you need the original data and are planning on storing all the deleted data in a separate table why not just logically delete the original data rather than physically delete it? i.e.
UPDATE dbo.Artist SET Artist_deleted = 1 WHERE ArtistId>280
Then you only need add one column to your current table rather than creating new tables and scripts to support these. You could then partition the current table based on the deleted flag if you are worried about disk space/performance etc.

Persist Data in SSIS for Next Execution

I have data to load where I only need to pull records since the last time I pulled this data. There are no date fields to save this information in my destination table so I have to keep track of the maximum date that I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separate from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether or not the package failed/succeeded. You are then able to pull the max start time of the last successfully ran execution.
You can't persist anything in a package between executions.
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day - you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred etc.
Is it viable to have a SQL table with only one row and one column?
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on on whatever local file-system you're running SSIS on.

How to update and add new records at the same time

I have a table that contains over than a million records (products).
Now, daily, I need to either update existing records, and/or add new ones.
Instead of doing it one-by-one (takes couple of hours), I managed to use SqlBulkCopy to work with bunch of records and managed to do my inserts in the matter of seconds, but it can handle only new inserts. So I am thinking about creating a new table that contains new records and old records; and then use that temporary table (on the SQL end) to update/add to the main table.
Any advice how can I perform that update?
One of the better ways to handle this is with the MERGE command in SQL. Mssqltips has a good tutorial on it, it can be a bit trickier to use than some of the other commands.
Also, due to locking you may want to break this up into multiple smaller transactions, unless you know you can tolerate blocking during the update.
We handle this situation in our code in the way you described; we have a temp table, then run an update where the ID in the temp table matches the table to be updated, then run an insert where the ID in the table to be updated is null. We normally do this for updates to library/program settings, though, so it is only run infrequently, on smaller tables. Performance may not be up to par for that many records, or daily runs.
The main "gotcha" I've encountered with this method is that for the update, we did a comparison to make sure at least one of several fields changed before actually running the update. (Our initial reason for this was to avoid overwriting some defaults, which could affect server behavior. Your reason for this might be performance, if your temp table could contain records that haven't actually changed). We encountered a case where we did actually want to update one of the defaults, but our old script didn't catch that. So if you do any comparisons to determine which products you want to update, make sure it is either complete from the start, or document well any fields you don't compare, and why.

Resources