DWH - Incremental Data import - database

I am designing a new ETL process that should take data from a source DB to a target DB.
The process should run once a day.
My problem is how to detect the delta and do an incremental import.
I cannot perform a full import once a day because of performance issues.
To have correct data at the target, I have to check the LastUPDATE column for every record I already have in the target DB, because existing records in the source can be updated and may differ from yesterday to today.
The solution I thought of is to take all the keys I have in the target, look up the same keys in the source, and compare the LastUPDATE values. Where the date in the source differs from the one in the target, I will delete the record in the target and perform a new insert.
I have thousands of records and the number keeps growing, which means the process will take longer and longer to run.
Any idea how to make this kind of process more efficient?
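To make the compare-by-key idea concrete, here is a minimal sketch using Python's built-in sqlite3 module with two in-memory databases standing in for the source and target. The table name items, the payload column, and the function name sync_changed_rows are assumptions made for illustration; only LastUPDATE comes from the question. It only pulls keys and timestamps for the comparison and then replaces the rows that differ.

```python
import sqlite3

def sync_changed_rows(source, target):
    """Replace target rows whose LastUPDATE differs from the source; add missing ones."""
    # Compare only (key, LastUPDATE) pairs instead of full rows.
    src = dict(source.execute("SELECT id, LastUPDATE FROM items"))
    tgt = dict(target.execute("SELECT id, LastUPDATE FROM items"))
    changed = [key for key, stamp in src.items() if tgt.get(key) != stamp]
    for key in changed:
        row = source.execute(
            "SELECT id, payload, LastUPDATE FROM items WHERE id = ?", (key,)
        ).fetchone()
        # Delete-then-insert, as described in the question.
        target.execute("DELETE FROM items WHERE id = ?", (key,))
        target.execute("INSERT INTO items VALUES (?, ?, ?)", row)
    target.commit()
    return sorted(changed)

# Hypothetical demo data: row 2 changed in the source, row 3 is new.
DDL = "CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, LastUPDATE TEXT)"
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute(DDL)
source.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b2", "2024-01-03"), (3, "c", "2024-01-03")],
)
target.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01"), (2, "b", "2024-01-02")],
)
changed = sync_changed_rows(source, target)
```

This still reads every key from both sides on each run, which is the cost the question is worried about as the table grows.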

Related

How many versions are created in a delta table in a Data lake on Azure

I have a clarification question. As per what I have read, Delta tables create version 0 (the original data) and version 1 (the updated data) of a row in a table.
So do we basically have just two versions of the data in Delta tables, or is this configurable? What happens when we update the same row multiple times: does the Delta table simply keep the latest version of the updates?
Thanks in advance.
Delta will create a new version for each operation (insert/update/delete), and also for additional operations such as changing table properties, OPTIMIZE, VACUUM, etc. Some of these operations do not create new data files (updating table properties), and some even delete unused files (VACUUM).
Keep in mind that data files in Delta are immutable: when you update or delete data, Delta identifies which files contain the affected data and creates new files with the modified data. That's why it's important to run VACUUM periodically to get rid of the old files, although doing so limits your ability to time travel to the retention period (one week by default).

SSIS ETL Pattern based on Rowversion occasionally missing rows, how to correct?

We have been using an SSIS pattern based around rowversion to synchronize records between two databases by looking at only rows in the source that have been inserted or updated since the last package run. Note data is never deleted from the source table, which is a prerequisite for this SSIS pattern.
However, we recently discovered that, despite running daily, our import missed rows from last month, leaving them out of our data warehouse entirely!
This is what I'm seeking a solution to: how can we change our ETL pattern to avoid this problem without going back to reading every row from the source every day?
From internet searching we found an explanation for why this might be happening, but not a solution. The flaw seems to be related to the fact that the SQL column rowversion gets its value when an insert/update starts, not when it commits, which can lead to rows not being available at package execution time, but getting committed later with rowversion values less than your stored ETLRowversion value, so next time your job runs they get skipped.
In brief our pattern currently is like this: (I've left out steps involving index maintenance, etc for simplicity.)
1. Get the last active rowversion from the source DB using min_active_rowversion(). Call that #MaxRv.
2. Get the rowversion value as of the last successful execution of our SSIS task (stored in our data warehouse in a table called ETLRowversions). Call that #LSERV.
3. Read rows from the source table WHERE rowversion is >= #LSERV and rowversion is <= #MaxRv.
4. For each row read, check if the row exists in the target DB (if so, add the row to an update staging table) or not (in which case, insert it directly into the Target table).
5. Update the Target table using the update staging table.
6. Update the ETLRowversions table in our data warehouse with the #MaxRv value.
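For readers who want to see the pattern as code, the following is a rough sketch of the flow described above, using Python's sqlite3 module. It is not the actual SSIS package: rowversion is simulated by a plain integer column rv, the upper bound is passed in as max_rv because min_active_rowversion() is specific to SQL Server, the staging table is replaced by an in-memory list, and the orders table and its columns are hypothetical.

```python
import sqlite3

def run_increment(source, dwh, max_rv):
    """One run of the windowed load; returns (rows inserted, rows updated)."""
    # Rowversion recorded by the last successful run (#LSERV in the question).
    last_rv = dwh.execute(
        "SELECT rv FROM ETLRowversions WHERE table_name = 'orders'"
    ).fetchone()[0]
    # Read only the source rows inside the [last_rv, max_rv] window.
    rows = source.execute(
        "SELECT id, amount, rv FROM orders WHERE rv >= ? AND rv <= ?",
        (last_rv, max_rv),
    ).fetchall()
    # New rows are inserted directly; existing rows are staged for an update.
    inserts, staged_updates = [], []
    for row in rows:
        exists = dwh.execute("SELECT 1 FROM orders WHERE id = ?", (row[0],)).fetchone()
        (staged_updates if exists else inserts).append(row)
    dwh.executemany("INSERT INTO orders VALUES (?, ?, ?)", inserts)
    dwh.executemany(
        "UPDATE orders SET amount = ?, rv = ? WHERE id = ?",
        [(amount, rv, row_id) for row_id, amount, rv in staged_updates],
    )
    # Remember the upper bound for the next run.
    dwh.execute("UPDATE ETLRowversions SET rv = ? WHERE table_name = 'orders'", (max_rv,))
    dwh.commit()
    return len(inserts), len(staged_updates)

# Hypothetical demo data.
source = sqlite3.connect(":memory:")
dwh = sqlite3.connect(":memory:")
for db in (source, dwh):
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, rv INTEGER)")
dwh.execute("CREATE TABLE ETLRowversions (table_name TEXT PRIMARY KEY, rv INTEGER)")
dwh.execute("INSERT INTO ETLRowversions VALUES ('orders', 4)")
dwh.execute("INSERT INTO orders VALUES (1, 9.0, 3)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, 5), (2, 20.0, 8), (3, 30.0, 12)],
)
counts = run_increment(source, dwh, max_rv=10)
```

In this demo, the row with rv 12 falls outside the window and is left for a later run.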
Edit: Comments have suggested implementing Change Tracking and Snapshot Isolation as the best solution to this problem. Unfortunately, change tracking and allow_snapshot_isolation are both OFF for the source database, and I am pessimistic about my chances of getting these features turned on. For better or worse, our BI concerns carry far less weight than the performance concerns of the production application/DB that is our source.

Detecting and Publishing Changes to Data in SQL Server in Real-time

I have an ERP System (Navision) where product data and stock numbers are frequently updated. Every time an attribute of a product is updated I want this change to be pushed to another SQL Server using Service Broker. I was considering using triggers for the detection, but I am unsure if that is the best way, and whether this is scalable. I expect updates to happen approx. once per second, but this number might double or triple.
Any feedback would be appreciated.
Add a Last Modified Date column to each record and update it using a trigger each time a record is updated. Then run a scheduled job at a specific time each day (off-business hours preferred) so that all records updated since the last scheduled run are processed.
So the following items need to be done:
1. Add a new column LastModifiedDate to the table with the DATETIME data type.
2. Create a trigger to update LastModifiedDate each time the record is updated.
3. Create a new table to store the scheduled run date and time.
4. Create a scheduled job on the database that will run at a specified time every day.
5. This job will pick all the records whose LastModifiedDate is greater than the date stored in the table created in step 3.
Since only one column is updated in the trigger, it won't affect the performance of the table. Also, since the update job runs only once a day, it will reduce database traffic.
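As a rough illustration of this recipe, here is a sketch in Python with sqlite3 rather than SQL Server, so the trigger syntax and date functions differ from T-SQL. The products table, the job_runs table, the trigger name, and the run_job function are hypothetical names chosen for the example; the scheduling itself (a daily job) is left out.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Table with the extra LastModifiedDate column.
CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT,
    stock INTEGER,
    LastModifiedDate TEXT DEFAULT '2000-01-01 00:00:00.000'
);
-- Trigger that stamps the row whenever its data columns change.
CREATE TRIGGER trg_products_modified AFTER UPDATE OF name, stock ON products
BEGIN
    UPDATE products
    SET LastModifiedDate = strftime('%Y-%m-%d %H:%M:%f', 'now')
    WHERE id = NEW.id;
END;
-- Table that stores when the job ran (seeded with an old date).
CREATE TABLE job_runs (run_at TEXT);
INSERT INTO job_runs VALUES ('2020-01-01 00:00:00.000');
INSERT INTO products (id, name, stock) VALUES (1, 'bolt', 10), (2, 'nut', 20);
""")

def run_job(conn):
    """Return ids modified since the last recorded run, then record this run."""
    # Take the timestamp before reading, so later updates are seen by the next run.
    started = conn.execute("SELECT strftime('%Y-%m-%d %H:%M:%f', 'now')").fetchone()[0]
    last_run = conn.execute("SELECT MAX(run_at) FROM job_runs").fetchone()[0]
    changed = conn.execute(
        "SELECT id FROM products WHERE LastModifiedDate > ? ORDER BY id", (last_run,)
    ).fetchall()
    conn.execute("INSERT INTO job_runs VALUES (?)", (started,))
    conn.commit()
    return [row[0] for row in changed]

db.execute("UPDATE products SET stock = 19 WHERE id = 2")  # fires the trigger
first = run_job(db)    # picks up product 2
second = run_job(db)   # nothing changed since the first run
```

Recording the start time rather than the end time of the job is a design choice in this sketch: it avoids skipping rows that are updated while the job is still reading.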

Persist Data in SSIS for Next Execution

I have data to load where I only need to pull records since the last time I pulled this data. There are no date fields to save this information in my destination table so I have to keep track of the maximum date that I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separate from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether the package failed or succeeded. You are then able to pull the max start time of the most recent successful execution.
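A minimal sketch of such an execution table, written with Python's sqlite3 module for illustration. The table name etl_execution, its column names, the package names, and the helper last_successful_start are all assumptions, not something prescribed by SSIS.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE etl_execution (
        package_name TEXT,
        start_time   TEXT,
        end_time     TEXT,
        succeeded    INTEGER  -- 1 = success, 0 = failure
    )
""")
db.executemany(
    "INSERT INTO etl_execution VALUES (?, ?, ?, ?)",
    [
        ("LoadOrders", "2024-03-01 02:00:00", "2024-03-01 02:05:00", 1),
        ("LoadOrders", "2024-03-02 02:00:00", "2024-03-02 02:01:00", 0),  # failed run
        ("LoadCustomers", "2024-03-02 03:00:00", "2024-03-02 03:02:00", 1),
    ],
)

def last_successful_start(conn, package_name):
    """Max start time among successful runs of a package, or None if there are none."""
    row = conn.execute(
        "SELECT MAX(start_time) FROM etl_execution "
        "WHERE package_name = ? AND succeeded = 1",
        (package_name,),
    ).fetchone()
    return row[0]

watermark = last_successful_start(db, "LoadOrders")
```

Because the failed run on 2024-03-02 is ignored, the next load would start again from the 2024-03-01 watermark and re-read whatever the failed run did not deliver.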
You can't persist anything in a package between executions.
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day - you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred etc.
Is it viable to have a SQL table with only one row and one column?
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a script task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.
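SSIS script tasks are written in C# or VB.NET, so the Python below only illustrates the read/write logic of the text-file approach and is not code to paste into a package. The file name last_loaded.txt, the default date, and both function names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

def read_watermark(path, default="1900-01-01T00:00:00"):
    """Return the stored last-loaded date, or a default on the very first run."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else default

def write_watermark(path, value):
    """Write to a temporary file first, then swap it in, so a crash cannot leave a half-written file."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(value)
    os.replace(tmp, path)

# Demo in a throwaway directory.
state_file = Path(tempfile.mkdtemp()) / "last_loaded.txt"
before = read_watermark(state_file)            # no file yet, so the default is returned
write_watermark(state_file, "2024-03-02T02:00:00")
after = read_watermark(state_file)
```

The watermark should only be written after the load has succeeded; otherwise a failed run would move it forward and rows would be skipped.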

data migration in informatica

A large amount of data is coming from source to target. After a successful insertion into the target, we have to change the status of every row to "committed". But how will we know whether all the data has arrived in the target without directly querying the source?
For example, suppose 10 records have migrated from source to target.
We cannot change the status of all the records to "committed" before all records have been successfully inserted into the target.
So before changing the status of all the records, how will we know whether an 11th record is coming or not?
Is there anything that will give me the information about total records in source?
I need a real-time based answer.
We had the same scenario, and this is what we did.
First of all:
To check whether the data is loaded in the target, you can join the source and target tables. The update will lock the rows, so a commit must first be issued at the database level on the target table (so that the lock for the update can be acquired).
After joining, update the loaded data based on the join with the target column.
A few things:
You have to stop your session (we used pmcmd to stop the session in a command task).
Update the data in your source table and restart the session.
Keep each load to batches of 20k-30k rows so the update goes smoothly.

Resources