Data import via Stored procedure or Triggers - database

We've a legacy system (MAS200 if you need to know) and there's an old vbs script which pull data from MAS and populates two staging tables in our production SQL database. And after some processing / cleanup that data goes into actual tables.
Data flow : MAS200 --> Staging tables --> Production table
To simplify consider there's an "Order" parent table and an "Items" child table. Order can have multiple items, each item record will have an FK OrderId. So, during import first we import the Order data and create an entry in the "Order" table and then fetch "Items" entries and import them.
Existing TRIGGER based approach -
At present we've two TRIGGERs - one on each staging table (Order & Items). So each new insert is tapped, and after processing data a new entry is inserted into actual production table. My only concern is that the trigger is executed for each Items entry instead of BULK insert. And it seems less manageable.
SP based approach -
If I remove both the TRIGGERs then import data into staging tables and finally execute an SP which will import Order data and then perform a BULK insert into the Items table. Could that be more efficient / faster?
Its not a comparison actually just a diff design. I'd like to know which one seems better or if there's a 3rd better approach to import from MAS to production SQL db.
EDIT 1 : Thanks. As asked by many - the data volume is not big or too frequent. Lets say 10-12 Orders (with 20-30 Items) every hour. Also with TRIGGERs, thought we don't get a TRANSACTION but only two simple TRIGGERs are suffice. I believe more scripting is needed with SP.
Goal : Need to keep it as simple, clean and efficient as possible.

Using Triggers:
Pros:
The data sync is real time. As you create data by data entry, the volume of data should not be big, so having bulk insert doesn't improve a lot. performance using trigger is good enough
Cons:
Data sync is not real time and if the connection breaks between MAS200 and production, you'll have a big problem. Also (as you mentioned) you can not have transaction, which is a big issue.
I suggest you use SP to transfer data in a time interval basis (if you can tolerate synchronization delay).

If you really want fast approach , you need :
1) to disable the FK on the ITEMS table for the duration of the load
2) then LOAD the ORDERS , and then enable the FK
All this should be done using SP , trigger approach is safe but very slow when its come to large bulks load
I hope you will find it useful
Thanks

Related

"master-slave" table replication in Oracle

Would there be something similar as the master-slave database but at the table level in the database?
For example, I have the following scenario:
I have a table with millions of records and the reason is because the system is more than 15 years old.
I only want to show the records of the last year (2019-2020).
I decided to create a view that only shows the records of that range (1 year) from the information of that table that contains millions of records.
Thanks to the view, the loading time of that system screen is lighter, thanks to the fact that I have less load of records.
The problem: What if the user adds a new record to the table that contains millions of records? how do I make my view modify when the other table are modified ...
I can use triggers to update the view I think, but, is there a functionality in oracle that allows me something similar to what I just asked (master-slave) where the "slave" table is updated as the "master" table suffers changes?
First of all, you misunderstood views. View is not a physical table, and does not store any information. If you insert data into view, you are actually inserting into the source table.
Since the view is not physical, you are just filtering the data. This does not have any performance benefits.
For the big tables, you can use partitioning which drastically improves performance. And if you still need archival you can archive the partitioned data.
Partitioning is generally the best method, because you can typically archive data by simply doing an "exchange" command to archive off old data.
Data doesn't "move" in that scenario, it simply gets 'detached' from the table via data dictionary manipulation.
Would there be something similar as the master-slave database but at the table level in the database
If you are asking about master/slave replication on a table level, then,
I suppose, table/materialized view relationship is appropriate to call as a master-slave. Quote from Oracle Docs:
A materialized view is a database object that contains the results of a query. The FROM clause of the query can name tables, views, and other materialized views. Collectively these objects are called master tables (a replication term)...
When you need to "update" or, more appropriately - refresh the mview, you can use different options:
update mview periodically and refresh it periodically
update mview each time the data in the master table is changed and commited.
update manually calling DBMS_MVIEW.REFRESH or DBMS_SNAPSHOT.REFRESH
Mview could be faster then view because each time you select from a mview you select from a different "table" which was replicated from the original one. Especially if you have complex logic in a sql, you can put the logic to mview definition.
The drawbacks are you need extra disk space for mview, and there will be a delay of refreshing the data.

Backing up a table before deleting all the records and reloading it in SSIS

I have a table named abcTbl, the data in there is populated
from other tables from a different database. Every time I am loading
data to abcTbl, I am doing a delete all to it and loading the buffer
data into it.
This package runs daily. My question is how do I avoid losing data
from the table abcTbl if we fail to load the data into it. So my
first step is deleting all the data in the abcTbl and then
selecting the data from various sources into a buffer and then
loading the buffer data into abcTbl.
Since we can encounter issues like failed connections, package
stopping prematurely, supernatural forces trying to stop/break my
package from running smoothly, etc. which will end up with the
package losing all the data in the buffer after I have already
deleted the data from abcTbl. 
My first intuition was to save the data from the abcTbl into a
backup table and then deleting the data in the abcTbl but my DBAs
wouldn't be too thrilled about creating a backup table for in every
environment for the purpose of this package, and giving me juice to
create backup tables on the fly and then deleting it again is out of
the question too. This data is not business critical and can be repopulated
again if lost.
But, what is the best approach here? What are the best practices for this issue?
For backing up your table, instead of loading data from one table (Original) to another table (Backup), you can just rename your original table to something (back-up table), create original table again like the back-up table and then drop the renamed table only when your data load is successful. This may save some time to transfer data from one table to another. You may want to test which approach is faster for you depending on your data/table structure etc., But what I wanted to mention is, this is also one of the way to do it. If you have lot of data in that table below approach may be faster.
sp_rename 'abcTbl', 'abcTbl_bkp';
CREATE TABLE abcTbl ;
While creating this table, you can keep similar table structure as that of abcTbl_bkp
Load your new data to abcTbl table
DROP TABLE abcTbl_bkp;
Trying to figure this out but I think what you are asking for is a method to capture the older data before loading the new data. I would agree with your DBA's that a seperate table for every reload would be extremely messy and not very usable if you ever need it.
Instead, create a table that copies your load table but adds a single DateTime field(say history_date). Each load you would just flow all the data in your primary table to the backup table. Use a Derived Column task in the Data Flow to add the history_date value to the backup table.
Once the backup table is complete, either truncate or delete the contents of the current table. Then load the new data.
Instead of created additional tables you can set the package to execute as a single transaction. By doing this, if any component fails all the tasks that have already executed will be rolled back and subsequent ones will not run. To do this, set the TransactionOption to Required on the package. This will allow that the package will begin a transaction. After this set all this property to Supported for all components that you want to succeed or fail together. The Supported level will have these tasks join a transaction that is already in progress by the parent container, being the package in this case. If there are other components in the package that you want to commit or rollback independent of these tasks you can place the related objects in a Sequence container, and apply the Required level to the Sequence instead. An important thing to note is that if anything performs a TRUNCATE then all other components that access the truncated object will need to have the ValidateExternalMetadata option set to false to avoid the known blocking issue that is a result of this.

Best way to handle updates on a table

I am looking for much more better way to update tables using SSIS. Specifically, i wanted to optimize the updates on tables (around 10 tables uses same logic).
The logic is,
Select the source data from staging then inserts into physical temp table in the DW (i.e TMP_Tbl)
Update all data matching by customerId column from TMP_Tbl to MyTbl.
Inserts all non-existing customerId column from TMP_Tbl1 to MyTbl.
Using the above steps, this takes some time populating TMP_Tbl. Hence, i planned to change the logic to delete-insert but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contains 5M rows, some contains 2k rows
each table update took up to 2-3 minutes, which took for about (15 to 20 minutes) all in all
these updates we're in separate sequence container simultaneously runs
Anyone knows what's the best way to use, seems like using physical temp table needs to be remove, is this normal?
With SSIS you usually BULK INSERT, not INSERT. So if you do not mind DELETE - reinserting the rows should in general outperform UPDATE.
Considering this the faster approach will be:
[Execute SQL Task] Delete all records which you need to update. (Depending on your DB design and queries, some index may help here).
[Data Flow Task] Fast load (using OLE DB Destination, Data access mode: Table of fiew - fast load) both updated and new records from source into MyTbl. No need for temp tables here.
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per every record updated is way too long.
If it is 2-3 minutes for updating millions of records though - then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.

SSIS - Fast way of detecting true deletes and updating data warehouse

I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the ID's don't match those rows will be redirected to the no match output which we set to go to our staging table. Then, we will join staging to the fact table and apply the deletions as shown below inside an execute SQL task.
I think you'll need to adopt you dataflow to use a merge join instead of a lookup.
That way you can see whats new/changed & deleted.
You'll need to sort both Flows by the same joining key (in this case your hash column).
Personally i'm not sure I'd bother and Instead I'd simply stage all my prod data and then do a 3-way SQL merge statement to handle Inserts updates & deletes in one pass. You can keep your hash column as a joining key if you like.

Large Data Service Architecture

Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table - can be costly if its tuned for selects. I believe with merge its a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: turns out you cannot merge (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again I've seen in production a copy of the current database used to perform a long running action (reporting and aggregation of data in this instance), however this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT val1, val2, val3 FROM #Tempo
WHERE NOT EXISTS
(
SELECT *
FROM MyTable t
WHERE t.val1 = val1 AND t.val2 = val2 AND t.val3 = val3
)
We do much larger imports than that all the time. Create an SSIS pacakge to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.

Resources