MS SQL Trigger for ETL vs Performance

I need to know what the impact on the production DB might be of creating triggers on ~30 production tables that capture every UPDATE, DELETE and INSERT statement and write the following information to a separate table: "PK", "Table Name", "Time of modification".
I have limited ability to test this, as I have read-only permissions to both the Prod and Test environments (and I can get one work day for 10 end users to test it).
I estimate that the number of records inserted by those triggers will be around 150-200k daily.
Background:
I have a project to deploy a Data Warehouse for a database that is heavily customized, plus there are jobs running every day that manipulate the data. The "Updated on Date" column is not being maintained (customization) and there are hard deletes occurring on the tables. We decided to ask the DEV team to add triggers like:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted
on the ~30 core production tables.
Based on this table we will pull the delta from the last 24 hours and push it to the Data Warehouse staging tables.
If anyone has had a similar issue and can help me estimate how this could impact performance on the production database, I would really appreciate it. (If it works I am saved; if not I need to propose another solution. Mirroring or replication might currently be hard to get, as the local DEVs have no idea how to set them up...)
Other ideas on how to handle this situation or how to perform tests are welcome (my deadline is Friday 26-01).

First of all I would suggest you encode the table name as a smaller data type rather than a character one (30 tables => tinyint).
Second of all you need to understand how big the payload you are going to write is, and how it will be written:
if you choose a suitable clustered index (the date column) then the server just needs to write the data row by row, in sequence. That is a silly-easy job even if you put all 200k rows in at once.
if you code the table name as a tinyint, then basically it has to write:
1 byte (table name) + PK size (hopefully numeric, so <= 8 bytes) + 8 bytes datetime - so approx. 17 bytes per row on the data page, plus indexes if any, plus the log file. This is very lightweight and again will put no "real" pressure on SQL Server.
The trigger itself will add a small overhead, but with the amount of rows you are talking about, it is negligible.
I have seen systems doing similar things at a much larger scale with close to zero effect on the workload, so I would say it's a safe bet. The only problem with this approach is that it will not work in some cases (e.g. outputs to temp tables from DML statements). But if you do not have that kind of blocker, go for it.
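As a rough, hedged sketch of what the narrow audit table (and one of the triggers) could look like - the Table_Code column and all object names here are illustrative, not the actual design:
-- Illustrative audit table: roughly 17 bytes of payload per row.
CREATE TABLE dbo.For_ETL_Warehouse (
    Table_Code  TINYINT  NOT NULL,   -- 1 = ProductionTable, 2 = ..., kept in a tiny mapping table
    Regular_PK  BIGINT   NOT NULL,
    Insert_Date DATETIME NOT NULL
);

-- Clustered on the date so every insert is an append at the "end" of the table.
CREATE CLUSTERED INDEX CIX_For_ETL_Warehouse_Date ON dbo.For_ETL_Warehouse (Insert_Date);
GO

CREATE TRIGGER dbo.trg_ProductionTable_ETL ON dbo.ProductionTable
FOR INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.For_ETL_Warehouse (Table_Code, Regular_PK, Insert_Date)
    SELECT 1, PK_ID, GETDATE() FROM inserted
    UNION ALL
    SELECT 1, PK_ID, GETDATE() FROM deleted;
END;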
I hope it helps.

Related

Best way to handle updates on a table

I am looking for a much better way to update tables using SSIS. Specifically, I want to optimize the updates on tables (around 10 tables use the same logic).
The logic is:
Select the source data from staging, then insert it into a physical temp table in the DW (i.e. TMP_Tbl)
Update all data matching by the customerId column from TMP_Tbl to MyTbl.
Insert all non-existing customerId rows from TMP_Tbl to MyTbl.
Using the above steps, it takes some time to populate TMP_Tbl. Hence, I planned to change the logic to delete-insert, but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contain 5M rows, some contain 2k rows
each table update takes up to 2-3 minutes, which adds up to about 15-20 minutes all in all
the updates run simultaneously in separate sequence containers
Does anyone know what the best approach is? It seems like the physical temp table needs to be removed - is this normal?
With SSIS you usually BULK INSERT, not INSERT. So if you do not mind DELETE - reinserting the rows should in general outperform UPDATE.
Considering this, the faster approach will be (a sketch of the delete follows the steps):
[Execute SQL Task] Delete all records which you need to update. (Depending on your DB design and queries, some index may help here).
[Data Flow Task] Fast load (using OLE DB Destination, Data access mode: Table or view - fast load) both updated and new records from the source into MyTbl. No need for temp tables here.
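For the Execute SQL Task in step 1, a hedged sketch (MyTbl and customerId are from the question; StagingTbl is a stand-in for whatever your source staging table is called):
-- Delete the rows that are about to be re-loaded; an index on MyTbl.customerId helps this seek.
DELETE t
FROM MyTbl AS t
WHERE EXISTS (SELECT 1 FROM StagingTbl AS s WHERE s.customerId = t.customerId);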
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per updated record would be way too long.
If it is 2-3 minutes for updating millions of records, though - then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.
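If you keep the UPDATE, the index in question would be something along these lines (index name assumed, key column taken from the question):
-- Lets the UPDATE locate matching rows with a seek instead of scanning the whole table.
CREATE NONCLUSTERED INDEX IX_MyTbl_customerId ON dbo.MyTbl (customerId);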

Copy data from one column to another in an Oracle table

My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never seems to complete.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, commit 1000 records, then commit the next 1000... normally breaking the data up based on the primary key.
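Roughly, I understood the suggestion as something like the following (a hedged sketch only - it assumes COLUMN_A starts out NULL so it can double as the "not yet copied" marker, rather than iterating by primary key):
-- Anonymous PL/SQL block: copy in chunks of 1000 rows, committing after each chunk.
BEGIN
  LOOP
    UPDATE MYTABLE
       SET COLUMN_A = COLUMN_B
     WHERE COLUMN_A IS NULL
       AND COLUMN_B IS NOT NULL   -- rows where both are NULL need no copy
       AND ROWNUM <= 1000;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
  COMMIT;
END;
/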
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table, or linking to another database. I'm simply copying data from one column to another. I don't consider 13 million records to be large considering there are systems out there in the orders of billions of records. I can't imagine it takes a computer hours and hours (only to fail) at copying a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE run for 4 hours and fail (which simply copies one column into another column), but the create table which copies the entire table takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you are updating the table, transaction logs must be generated in case a rollback is needed. When creating a table from a SELECT, that isn't necessary.
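If the CTAS route from the question is acceptable, a hedged sketch of finishing the swap (it assumes the CTAS selected every column you want to keep, and that any constraints, indexes, grants and triggers are re-created afterwards):
-- Swap the newly built table in place of the original (irreversible once the old table is dropped).
DROP TABLE MYTABLE;
ALTER TABLE MYTABLE_2 RENAME TO MYTABLE;
-- Then re-create the primary key, the index on COLUMN_B, grants, triggers, etc.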

improve sql delete performance and reduce log file and tempDB size?

I have a huge database; it processes the email traffic every day. The system needs to delete some old emails every day:
Delete from EmailList(nolock)
WHERE EmailId IN (
SELECT EmailId
FROM Emails
WHERE EmailDate < DATEADD([days], -60, GETDATE())
)
It works, but the problem is that it takes a long time to finish and the log file becomes huge because of this. The log file size increases by more than 100GB every day.
I'm thinking we can change it to
Delete from EmailList(nolock)
WHERE EXISTS (
SELECT EmailId
FROM Emails
WHERE (Emails.EmailId = EmailList.EmailId) AND
(EmailDate < DATEADD([days], -60, GETDATE()))
)
But other than this, is there anything we can do to improve the performance and, most of all, reduce the log file size?
EmailId is indexed.
I've seen
GetDate()-60
style syntax perform MUCH better than
DATEADD([days], -60, GETDATE())
especially if there is an index on the date column. A few fellow DBAs and I spent quite a bit of time trying to understand WHY it would perform better, but the proof was in the pudding.
Another thing you might want to look at, given the volume of records I presume you have to delete, is chunking the deletes into batches of, say, 1000 or 10000 records. This would probably speed up the delete process.
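A hedged sketch of such a batched delete (the TOP size is just a starting point to tune; keeping each transaction small lets the log be reused between batches under SIMPLE recovery, or with frequent log backups under FULL):
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM EmailList
    WHERE EXISTS (SELECT 1
                  FROM Emails
                  WHERE Emails.EmailId = EmailList.EmailId
                    AND Emails.EmailDate < DATEADD(DAY, -60, GETDATE()));
    SET @rows = @@ROWCOUNT;
END;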
Have you tried partitioning by date? Then you can just drop the partitions for the days you are not interested in anymore. Given a "huge" database you surely run Enterprise edition of SQL Server (after all, huge is bigger than very large), and that has table partitioning.
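For reference, a hedged sketch of what date partitioning looks like (boundary values and object names are illustrative; partitioning requires Enterprise edition on older SQL Server versions):
CREATE PARTITION FUNCTION pfEmailDate (DATETIME)
    AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

CREATE PARTITION SCHEME psEmailDate
    AS PARTITION pfEmailDate ALL TO ([PRIMARY]);

-- The table (or its clustered index) is then created ON psEmailDate(EmailDate); old data is
-- removed by switching a whole partition out to a staging table and truncating it, which is
-- a metadata operation rather than a fully logged row-by-row DELETE.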
[EDIT]:
regarding #TomTom's comment:
If you have SQL Server Enterprise edition available you should use Table Partitioning.
If this is not the case, my original post may be helpful:
[ORIGINAL POST]
Deleting a large amount of data is always difficult. I faced the same problem and I went with the following solution:
Depending on your requirements this will not work, but maybe you can get some ideas from it.
Instead of using 1 table, use 2 tables with the same schema. Create a synonym (I assume you are using MS SQL Server) that points to the "active" table of the 2 (active meaning the table that you currently write to). Use this synonym for the inserts in your application; or, instead of using the synonym, the application could simply change which table it writes to every x days.
Every x days you can truncate the old/inactive table and afterwards recreate the synonym to target the truncated table (if you use the synonym solution), so effectively you are partitioning the data by time.
You have to synchronize the switch of the active table. I automated this completely by using a shared app lock for the application and an exclusive app lock when changing the synonym (== blocking the writing application during the switching process).
If changing your application's code is not an option, consider using the same principle, but instead of writing to the synonym you could create a view with INSTEAD OF triggers (the insert operation would insert into the "active" partition). The trigger code would need to synchronize using something like the app lock mentioned above (so that writes during the switching process work).
My solution is a little more complex, so I currently cannot post the code here, but it works without problems for a high-load application and the switching/cleanup process is completely automated.
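A minimal, hedged sketch of the synonym-switch mechanics only (not the actual solution described above - all names are illustrative, and the real column list would match your EmailList schema):
CREATE TABLE dbo.EmailList_A (EmailId INT NOT NULL PRIMARY KEY /* , ... other columns ... */);
CREATE TABLE dbo.EmailList_B (EmailId INT NOT NULL PRIMARY KEY /* , ... other columns ... */);
CREATE SYNONYM dbo.EmailList_Active FOR dbo.EmailList_A;  -- the application writes through this name

-- Every x days: block writers briefly, truncate the inactive table, repoint the synonym.
EXEC sp_getapplock  @Resource = 'EmailListSwitch', @LockMode = 'Exclusive', @LockOwner = 'Session';
TRUNCATE TABLE dbo.EmailList_B;
DROP SYNONYM dbo.EmailList_Active;
CREATE SYNONYM dbo.EmailList_Active FOR dbo.EmailList_B;
EXEC sp_releaseapplock @Resource = 'EmailListSwitch', @LockOwner = 'Session';
-- (The writing application takes a Shared lock on the same resource around its inserts.)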

Handling a huge amount of records in 1 table

I would like to ask a couple of questions about how to handle a huge table of around 100 million records.
The table will perform INSERT, SELECT & UPDATE.
I have gotten some advice to index the table and to archive it into a couple of tables.
Any other suggestions that can help tweak the SQL performance?
Case:
SQL Server 2008.
Most of the time the updates concern a decimal value and a tinyint status.
The INSERT statements will not use BULK INSERT, since I'm assuming that per minute there are a lot of users, let's say 10,000-500,000, performing INSERT statements and updating the table.
You should consider what kind of columns you have.
The more nvarchar/text/etc columns you have included in the different indexes, the slower the index will be.
Also what RDBMS are you going to use? You have different options based on SQL Server, Oracle and MySQL...
But the crucial thing is definitely to build the right indexes for the queries you will run...
One other thing, you could use BULK INSERT on SQL Server to speed up the inserts.
But ask away, I have dealt with databases being populated with 70 million data rows per day ;)
EDIT ----- After more information has come
I'll try to take a slightly different approach to the case and compare it to data scraping.
There is no doubt that INSERTs are faster than UPDATEs. So you might want to make a table that acts as a "collect" table. What I mean is that it only gets inserts, all the time; no updates, everything is handled with inserts.
Then you use a trigger/event/scheduler to process what has come into that table and populate what you need into another table (or tables).
This way you will be able to apply a little business logic during the "cleanup" (update), keep the performance on the DB server, and not hold up a connection while these things are done.
This of course also has something to do with what the "final" data are to be used for...
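As a hedged illustration of that "collect" idea (all table and column names here are invented for the example):
-- Insert-only "collect" table: the application never updates it, it only appends.
CREATE TABLE dbo.StatusChanges_Collect (
    Id         BIGINT IDENTITY(1,1) PRIMARY KEY,
    ItemId     INT           NOT NULL,
    NewAmount  DECIMAL(18,4) NOT NULL,
    NewStatus  TINYINT       NOT NULL,
    InsertedAt DATETIME2     NOT NULL DEFAULT SYSDATETIME()
);

-- A scheduled job then folds the newest collected row per item into the main table.
WITH latest AS (
    SELECT ItemId, NewAmount, NewStatus,
           ROW_NUMBER() OVER (PARTITION BY ItemId ORDER BY Id DESC) AS rn
    FROM dbo.StatusChanges_Collect
)
UPDATE i
SET    i.Amount = l.NewAmount,
       i.Status = l.NewStatus
FROM   dbo.Items AS i
JOIN   latest    AS l ON l.ItemId = i.ItemId AND l.rn = 1;
-- (Processed rows can then be deleted from, or archived out of, the collect table.)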
Clearly SQL 2008 is capable of handling 100 million records, but there are a lot of details to look at that just do not come into play at 100 thousand. Pick a good primary key. Fill factor. Other indexes (they will slow down inserts but speed up selects). Concurrency (locking). If you can accept dirty reads then that will help performance. This question needs a lot more detail. You need to post the table design and your SELECT, UPDATE, and INSERT TSQL statements. I did not vote your question down, but if you don't provide more detail it will get voted down.
For inserts, be aware that you can insert multiple rows at once, which is much faster than multiple insert statements if BULK INSERT is not an option.
INSERT INTO Production.UnitMeasure
VALUES (N'FT2', N'Square Feet', '20080923'),
       (N'Y',   N'Yards',       '20080923'),
       (N'Y3',  N'Cubic Yards', '20080923');

What is the best database for my needs?

I am currently using MS SQL Server 2008 but I'm not sure it is the best system for this particular task.
I have a single table like so:
PK_ptA PK_ptB DateInserted LookupColA LookupColB ... LookupColF DataCol (ntext)
A common query is
SELECT TOP(1000000) DataCol FROM table
WHERE LookupColA=x AND LookupColD=y AND LookupColE=z
ORDER BY DateInserted DESC
The table has about a billion rows with 5 million inserted per day.
My main problem with SQL Server is that it isn't too easy to shard or spread out the data files. Also, exporting seems to max out at 1000 rows per second (about 1MB/s), which seems very slow.
Another problem I have with SQL Server is that if I want to add a new LookupCol, the log file grows enormously, requiring a large amount of rarely used free space to be kept on tap.
Are there any obvious better solutions for this problem?
You have a problem, and it is not SQL Server. Let me also ignore, for the moment, that you seem to have a bad table design.
Spreading data files is actually pretty easy. REORGANIZING later is not that easy, but also doable. How is your table, filegroup and file layout?
Exporting 1 MB per second is a joke. Seriously. I have been handling 150-million-row files in minutes - that comes down to a LOT more than 60,000 rows per minute. Something is freaking out. Temp space? Did you do a performance analysis? How does the hardware look?
Nothing will help with the log usage. Basically, like most professional databases, the log contains all database pages changed during a transaction. Adding a field changes - ALL pages.
You should:
Redesign the database (use a view to keep the same old table shape in place if you have to) so that it does not have "LookupColA" etc., but is normalized (a LookupValue, and a LookupTable that is keyed by "column"). This way you get additional fields instantly. This turns into a data-warehouse-like star schema.
Do a performance analysis. It looks like you have some problems.
Definitely tell us about your hardware ;)
The problem here is definitely NOT SQL Server; it is related to bad table design AND - possibly - insufficient or badly utilized hardware.
OK, the table design (separate answer). The Lookup columns are basically lookup tables.
So....
LookupTable
pk (int)
TableType
Value
as fields
ValueTable
pk
ValueLookupMap table
pk of ValueTable entry
pk of LookupTable entry
So, basically, if you add a lookup "field" then you just create a set of entries in the LookupTable then add entries in the ValueLookupMap.
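A hedged reading of that design in DDL (my interpretation, with invented data types and constraint choices, not the author's exact schema):
-- One row per possible lookup value, tagged with which logical "column" it belongs to.
CREATE TABLE LookupTable (
    pk        INT IDENTITY(1,1) PRIMARY KEY,
    TableType TINYINT      NOT NULL,   -- which logical lookup "column" (A, B, ... F)
    Value     VARCHAR(100) NOT NULL
);

-- One row per stored data item (the former wide row, minus the Lookup columns).
CREATE TABLE ValueTable (
    pk           BIGINT IDENTITY(1,1) PRIMARY KEY,
    DateInserted DATETIME      NOT NULL,
    DataCol      NVARCHAR(MAX) NOT NULL
);

-- Many-to-many map: which lookup values apply to which data row.
CREATE TABLE ValueLookupMap (
    ValueTable_pk  BIGINT NOT NULL REFERENCES ValueTable (pk),
    LookupTable_pk INT    NOT NULL REFERENCES LookupTable (pk),
    PRIMARY KEY (ValueTable_pk, LookupTable_pk)
);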
