I have data to load where I only need to pull records added since the last time I pulled this data. There are no date fields in my destination table to save this information, so I have to keep track of the maximum date that I last pulled. The problem is that I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separately from my process. This means I have to know more than just the last time my process ran.
The only solution I can think of is to create a table, but it seems a bit much to create a table just to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether the package failed or succeeded. You can then pull the max start time of the last successful execution.
You can't persist anything in a package between executions.
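For illustration, a minimal T-SQL sketch of that kind of execution table and the watermark lookup (the table, column and package names below are invented, not anything from the original posts):

    CREATE TABLE dbo.PackageExecutionLog
    (
        ExecutionId INT IDENTITY(1,1) PRIMARY KEY,
        PackageName SYSNAME      NOT NULL,
        StartTime   DATETIME2(3) NOT NULL,
        EndTime     DATETIME2(3) NULL,
        Succeeded   BIT          NULL  -- NULL while running, 1/0 once finished
    );

    -- An Execute SQL Task at the start of the package inserts a row;
    -- another at the end updates EndTime and Succeeded.

    -- Watermark for the next run: max start time of the last successful execution.
    SELECT MAX(StartTime) AS LastSuccessfulStart
    FROM dbo.PackageExecutionLog
    WHERE PackageName = 'MyIncrementalLoad.dtsx'
      AND Succeeded = 1;

The SELECT would typically run from an Execute SQL Task with its result mapped into an SSIS variable that drives the WHERE clause of the source query.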
What you're talking about is a form of differential replication, and this has been done many, many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table, your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day - you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred, etc.
Is it viable to have a SQL table with only one row and one column?
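To make that a bit more concrete, here is a hedged sketch of the kind of state table being described, with one row per source table rather than a single value (all names below are invented):

    CREATE TABLE dbo.ReplicationWatermark
    (
        SourceSystem   VARCHAR(100) NOT NULL,
        SourceTable    SYSNAME      NOT NULL,
        LastMaxDate    DATETIME2(3) NULL,  -- the watermark: the max date pulled so far
        LastRunStart   DATETIME2(3) NULL,
        LastRunSeconds INT          NULL,
        LastRowsCopied INT          NULL,
        CONSTRAINT PK_ReplicationWatermark PRIMARY KEY (SourceSystem, SourceTable)
    );

    -- Read the watermark for one source table before extracting:
    SELECT LastMaxDate
    FROM dbo.ReplicationWatermark
    WHERE SourceSystem = 'ERP'
      AND SourceTable  = 'SalesOrder';

    -- After a successful transfer, the package updates that row with the new
    -- max date, the run duration and the row count.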
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.
My use case is that I need to track all the changes (insertions/updates/deletions) from a table.
My idea is to create a stream on that table, and consume that stream every second or so, exporting all the changes to another history table (mytable_history).
A task would be the perfect candidate for that, but unfortunately a task can only be scheduled at intervals of 1 minute or more. I'll be getting updates to the same rows every second, so I'd really need the task to run at least once per second.
My idea now is to run an infinite LOOP, using SYSTEM$WAIT to consume the stream every 1 second and inserting the data to the history table.
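Something like this, roughly (a sketch only: mytable_stream is a placeholder name, and mytable_history is assumed to have a column for every stream column plus a captured_at timestamp):

    EXECUTE IMMEDIATE
    $$
    BEGIN
        LOOP
            -- Consuming the stream in a DML statement advances its offset,
            -- so each pass only sees changes committed since the previous pass.
            INSERT INTO mytable_history
                SELECT *, CURRENT_TIMESTAMP() AS captured_at
                FROM mytable_stream;
            -- Pause roughly one second before the next pass.
            CALL SYSTEM$WAIT(1);
        END LOOP;
    END;
    $$;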
Is this a bad idea? What could go wrong?
Thanks
I can add two points to your idea:
Please note that "DML updates to the source object in parallel transactions are tracked by the change tracking system but do not update the stream until the explicit transaction statement is committed and the existing change data is consumed." (https://docs.snowflake.com/en/user-guide/streams-intro.html#table-versioning)
Your warehouse would run all day to process this, which is why your costs would increase noticeably.
I need to do a one-time historical data load, followed by incremental loads every 10 minutes.
Is there a way to parameterize a Snowflake task to first run the historical load and then change the parameter to execute incremental loads? If not, can you suggest a better approach to handling historical (one-time) and incremental loads via tasks?
Note: The underlying table of the Snowflake stream contains the historical records, and any new data arriving after the stream/tasks are implemented is considered incremental.
If you have a task call a stored procedure, you could have the stored procedure first check whether the target table is empty (or whatever check you want - as long as you can write it as code, it'll work; heck, you could have it insert a task run log into a separate table and check whether it's the first time it's run) and do the initial historical load in that case, and not otherwise.
Then the first time you run it, it will do one code path, and forever after it will do the other.
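A minimal sketch of that pattern in Snowflake SQL, assuming a source table SRC_ORDERS with columns (ID, AMOUNT), a stream SRC_ORDERS_STREAM on it, a target table ORDERS_COPY with the same columns, and a warehouse MY_WH (all of these names are invented for illustration):

    CREATE OR REPLACE PROCEDURE LOAD_ORDERS()
    RETURNS STRING
    LANGUAGE SQL
    AS
    $$
    DECLARE
        target_rows INTEGER;
    BEGIN
        SELECT COUNT(*) INTO :target_rows FROM ORDERS_COPY;
        IF (target_rows = 0) THEN
            -- First run: one-time historical load straight from the base table.
            INSERT INTO ORDERS_COPY (ID, AMOUNT)
                SELECT ID, AMOUNT FROM SRC_ORDERS;
            RETURN 'historical load';
        ELSE
            -- Later runs: only the changes captured by the stream since it was last consumed.
            -- (Updates surface as INSERT rows with METADATA$ISUPDATE = TRUE; deletes are ignored here.)
            INSERT INTO ORDERS_COPY (ID, AMOUNT)
                SELECT ID, AMOUNT
                FROM SRC_ORDERS_STREAM
                WHERE METADATA$ACTION = 'INSERT';
            RETURN 'incremental load';
        END IF;
    END;
    $$;

    -- Task that calls the procedure every 10 minutes (created suspended; resume it to start).
    CREATE OR REPLACE TASK LOAD_ORDERS_TASK
        WAREHOUSE = MY_WH
        SCHEDULE = '10 MINUTE'
    AS
        CALL LOAD_ORDERS();

    ALTER TASK LOAD_ORDERS_TASK RESUME;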
I have a controlling package to manage data transfers between two different data sources, so we can copy multiple tables and capture how many rows were transferred and the duration of each transfer.
The package was working as intended until yesterday, when I promoted a change to the controlling package that allows us to have variables which determine whether the transfer requires the destination table to be truncated or not. The next day I saw the bizarre behaviour of the package log stating that all rows had transferred, while the actual table counts were not what the log stated (see attached file). Thinking, as usual, that it might be something my change did, I reverted the change and, guess what, the same errors happened, but on some of the same tables and some different ones.
As it is the child packages doing the transfers and nothing has changed in them, has anyone got any ideas how this could possibly happen? We did have a SAN controller fail a week or so ago and we are now back on that same controller, but either way I cannot see how the log could say all rows were transferred when not all of the rows were actually moved.
If you need anything further please let me know as this has me well and truly baffled...
Process output - where you can see counts and log
Table Transfer Controller (as txt not dtsx)
An example of the data transfer child packages (as txt not dtsx)
regards,
Anthony
Is there an automatic way of knowing which rows are the latest to have been added to an OpenEdge table? I am working with a client and have access to their database, but they are not saving IDs or timestamps for the data.
I was wondering if, hopefully, OpenEdge is somehow doing this out of the box. (I doubt it is but it won't hurt to check)
Edit: My Goal
My goal is to be able to import only the new data, i.e. the delta, of a specific table. Without knowing which rows are new, I am forced to import everything because I have no clue what was added.
1) Short answer is No - there's no "in the box" way for you to tell which records were added, or the order they were added.
The only way to tell the order of creation is by applying a sequence or by time-stamping the record. Since your application does neither, you're out of luck.
2) If you're looking for changes without applying schema changes, you can use session or db triggers to capture updates to the db and save that activity log somewhere.
3) If you're just looking for a "delta" - you can take a periodic backup of the database, and then use queries to compare the current db with the backup db and get the differences that way (see the sketch after this list).
4) Maintain a db on the customer site with the contents of the last table dump. The next time you want to get deltas from the customer, compare that table's contents with the current table, dump the differences, then update the db table to match the current db's table.
5) Personally, I'd talk to the customer and (a) see if they actually require this functionality, and (b) find out what they think about adding some fields and a bit of code to the system to get an activity log. Adding a few fields and some code to update them shouldn't be that big of a deal.
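In generic SQL terms (not OpenEdge-specific - whether the SQL-92 engine you have access to supports EXCEPT is something to verify), the compare-against-a-copy idea from points 3 and 4 boils down to something like this, with placeholder schema names:

    -- Rows added or changed since the last snapshot:
    SELECT * FROM current_db.customer
    EXCEPT
    SELECT * FROM snapshot_db.customer;

    -- Rows deleted or changed since the last snapshot:
    SELECT * FROM snapshot_db.customer
    EXCEPT
    SELECT * FROM current_db.customer;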
You could use database triggers to meet this need. In order to do so you will need to be able to write and deploy trigger procedures. And you need to keep in mind that the 4GL and SQL-92 engines do not recognize each other's triggers. So if updates are possible via SQL, 4GL triggers will be blind to those updates. And vice-versa. (If you do not use SQL none of this matters.)
You would probably want to use WRITE triggers to catch both insertions and updates to data. Do you care about deletes?
Simple-minded 4gl WRITE trigger:
TRIGGER PROCEDURE FOR WRITE OF Customer.
/* OLD BUFFER oldCustomer. */
/* OLD BUFFER is optional and not needed in this use case ... */
output to "customer.dat" append.
export customer.
output close.
return.
end.
I have a table that gets updated from an outside source. It normally sits empty until they push data to me. With this data I am supposed to add, update or delete records in two other tables (linked by a primary/foreign key). Data is pushed to me one row at a time, and occasionally in a large download twice a year. They want me to update my tables in real time. Should I use a trigger and have it read line by line, or merge the tables?
I'd have a scheduled job that runs a sproc to check for work to do in that table, and then process it in batches. Have a column on the import/staging table that you can update with a batch number or timestamp, so if something goes wrong (like they have pushed you some goofy data) you know where to restart from and can identify which row caused the problem.
If you use a trigger, not only might it slow them down when they feed you a large batch of data, but you could also lose the ability to keep a record of where the process got to if it fails.
If it was always one row at a time then I think the trigger method would be an okay option.
Edit: Just to clarify the point about batch number/timestamp, this is so if you have new/unexpected data which crashes your import, you can alter the code and re-run the process as much as you like without having to ask for a fresh import.
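For what it's worth, a rough T-SQL sketch of the staging/batch pattern described above, shown for a single target table (every table and column name here is invented, including the Action flag that says what to do with each row):

    -- 1. Stamp the unprocessed rows with a batch id so a failed run can be identified and re-run.
    DECLARE @BatchId UNIQUEIDENTIFIER = NEWID();

    UPDATE dbo.ImportStaging
    SET BatchId   = @BatchId,
        BatchTime = SYSDATETIME()
    WHERE BatchId IS NULL;

    -- 2. Apply the batch to the target table (insert/update/delete as appropriate).
    MERGE dbo.TargetTable AS t
    USING (SELECT * FROM dbo.ImportStaging WHERE BatchId = @BatchId) AS s
        ON t.KeyColumn = s.KeyColumn
    WHEN MATCHED AND s.Action = 'D' THEN
        DELETE
    WHEN MATCHED THEN
        UPDATE SET t.SomeColumn = s.SomeColumn
    WHEN NOT MATCHED AND s.Action <> 'D' THEN
        INSERT (KeyColumn, SomeColumn) VALUES (s.KeyColumn, s.SomeColumn);

    -- 3. Clear down (or archive) the rows that were processed successfully.
    DELETE FROM dbo.ImportStaging WHERE BatchId = @BatchId;

The same MERGE pattern would repeat for the second table, and because the rows are stamped with a batch id first, a crashed run can be re-processed after fixing the code without asking for a fresh push.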