Background
A customer is collecting prices of different products and wants to update the price in their database whenever it changes. We are talking about ~100 million rows of data that come from scraping and are stored in spreadsheets.
We are using SQL Server to store this data.
Issue
The problem starts when we upload those spreadsheets so the data can be inserted into a staging table and then used to update the curated price table.
The operation has two steps:
Step 1: the data is loaded into the staging table.
Step 2: prices are checked, and the ones that changed since the last run are updated in the main curated table "price".
But SQL Server is taking too long
What I have done so far
To help step 2 run faster, we created an index on the staging table, but that made step 1 slower.
So to speed step 1 up again, we disable the index, bulk insert into the table, and then rebuild the index, which takes about 15 minutes. We don't see any significant improvement.
What I want help with
I am trying to improve performance. Any ideas on how to accomplish this?
(This question is "focused" towards performance improvement)
Have you considered using MERGE in your query? It would reduce the I/O operations on the disk drastically. Here is an example I found on the very first Google search of your problem.
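For illustration, a minimal sketch of how MERGE could be applied here (the table and column names below are assumptions, not taken from your setup):

-- upsert changed and new prices from staging into the curated table in one pass
MERGE dbo.price AS tgt
USING dbo.staging_price AS src
    ON tgt.product_id = src.product_id
WHEN MATCHED AND tgt.price <> src.price THEN
    UPDATE SET tgt.price = src.price
WHEN NOT MATCHED BY TARGET THEN
    INSERT (product_id, price) VALUES (src.product_id, src.price);

Only rows whose price actually changed are touched, which is where the I/O saving comes from.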
Related
I need to know what the impact on the production DB might be of creating triggers on ~30 production tables that capture every UPDATE, DELETE and INSERT statement and write the following information to a separate table: "PK", "Table Name", "Time of modification".
I have limited ability to test it, as I have read-only permissions on both the Prod and Test environments (and I can get one work day for 10 end users to test it).
I have estimated that number of records inserted from those triggers will be around ~150-200k daily.
Background:
I have a project to deploy a Data Warehouse for a database that is heavily customized, and there are jobs that run every day and manipulate the data. The "Updated On" date column is not being maintained (customization), and there are hard deletes occurring on tables. We decided to ask the DEV team to add triggers like this:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
SET NOCOUNT ON;  -- keep the trigger from sending extra rowcount messages to callers
-- log the key of every inserted (or post-update) row
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted
-- log the key of every deleted (or pre-update) row; note an UPDATE therefore logs the key twice
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted
on core ~30 production tables.
Based on this table we will pull delta from last 24 hours and push it to Data Warehouse staging tables.
If anyone has had a similar issue and can help me estimate how this could impact performance on the production database, I would really appreciate it. (If it works, I am saved; if not, I need to propose another solution. Mirroring or replication might currently be hard to get in place, as the local DEVs have no idea how to set it up...)
Other ideas how to handle this situation or perform tests are welcome (My deadline is Friday 26-01).
First of all, I would suggest you encode the table name as a smaller data type rather than a character one (30 tables => tinyint).
Second of all, you need to understand how big the payload you are going to write is, and how it gets written:
If you choose a suitable clustered index (the date column), the server just has to write the data out row by row in sequence. That is an easy job even if you put in all 200k rows at once.
If you encode the table name as a tinyint, then for each row it basically has to write:
1 byte (table code) + PK size (hopefully numeric, so <= 8 bytes) + 8 bytes for the datetime - so approx. 17 bytes on the data page, plus indexes if any, plus the log file. This is very lightweight and again will put no "real" pressure on SQL Server.
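As a rough sketch of that narrow audit table (names and types below are illustrative, not a spec):

-- ~17 bytes per row: 1-byte table code + up to 8-byte PK + 8-byte datetime
CREATE TABLE dbo.For_ETL_Warehouse
(
    Table_Code  tinyint  NOT NULL,  -- 1 = ProductionTable, 2 = ..., mapped in a small lookup table
    Regular_PK  bigint   NOT NULL,  -- PK of the modified row
    Insert_Date datetime NOT NULL CONSTRAINT DF_ETL_InsertDate DEFAULT (GETDATE())
);
-- clustered on the date column so new rows are appended at the end, in sequence
CREATE CLUSTERED INDEX CIX_ETL_InsertDate ON dbo.For_ETL_Warehouse (Insert_Date);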
The trigger itself will add a small overhead, but with the amount of rows you are talking about, it is negligible.
I saw systems that do similar stuff on a way larger scale with close to 0 effect on the work process, so I would say that it's a safe bet. The only problem with this approach is that it will not work in some cases (ex: outputs to temp tables from DML statements). But if you do not have these kind of blockers then go for it.
I hope it helps.
I've tried to search for some ideas but can't find anything that's very suitable for my scenario.
I have a table which I write and update data to from multiple sites, maybe a row per second during specific hours of the day, with around 50k records added daily on average. Separate to this, I have dashboards where people can query this data, but some of the queries may be quite complex and take a number of seconds to complete.
I can't afford my write/updates to slow down
Although the dashboards don't need to be real time, it would be a bonus
I'm hosting on Azure DB S2. What options are available?
My current idea is to use an 'active' table for writes/updates and flush the data to the full table every x minutes. My only concern is that I have a seeded bigint as a PK on the main table, and because I also save other data linked to this, I'd have problems linking to this id until I commit to the main table. An option would be to reseed the active table and use SET IDENTITY_INSERT on the main table to populate the IDs myself, but I'm not 100% happy with this.
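For what it's worth, a minimal sketch of what that flush could look like, assuming the 'active' table already carries the final bigint IDs (all names and columns here are hypothetical):

BEGIN TRANSACTION;
    SET IDENTITY_INSERT dbo.MainTable ON;
    -- copy the accumulated rows, keeping the IDs that other data already links to
    INSERT INTO dbo.MainTable (Id, SiteId, Payload, CreatedAt)
    SELECT Id, SiteId, Payload, CreatedAt
    FROM dbo.ActiveTable;
    SET IDENTITY_INSERT dbo.MainTable OFF;
    -- empty the active table for the next interval
    DELETE FROM dbo.ActiveTable;
COMMIT;

This only works if the two tables never hand out overlapping ID ranges, which is exactly the reseeding concern above.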
Just looking for suggestions until I go ahead with my current idea! Thanks
I am looking for a much better way to update tables using SSIS. Specifically, I want to optimize the updates on tables (around 10 tables use the same logic).
The logic is,
Select the source data from staging, then insert it into a physical temp table in the DW (i.e. TMP_Tbl).
Update all rows in MyTbl that match on the customerId column from TMP_Tbl.
Insert all rows whose customerId does not yet exist from TMP_Tbl1 into MyTbl.
Using the above steps, populating TMP_Tbl takes some time. Hence, I planned to change the logic to delete-insert, but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contains 5M rows, some contains 2k rows
each table update took up to 2-3 minutes, which took for about (15 to 20 minutes) all in all
these updates are in separate sequence containers that run simultaneously
Does anyone know the best approach to use? It seems like the physical temp table needs to be removed; is this normal?
With SSIS you usually do a BULK INSERT, not row-by-row INSERTs. So if you do not mind a DELETE, re-inserting the rows should in general outperform UPDATE.
Considering this the faster approach will be:
[Execute SQL Task] Delete all records which you need to update. (Depending on your DB design and queries, some index may help here).
[Data Flow Task] Fast load (using OLE DB Destination, Data access mode: "Table or view - fast load") both updated and new records from source into MyTbl. No need for temp tables here.
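A sketch of the delete in step 1, assuming the incoming customer keys are available in a staging table (StagingSource is a placeholder name); an index on MyTbl.customerId lets this seek instead of scanning:

-- remove the rows that are about to be reloaded by the data flow
DELETE m
FROM MyTbl AS m
WHERE EXISTS (SELECT 1
              FROM StagingSource AS s
              WHERE s.customerId = m.customerId);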
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per record updated is way too long.
If it is 2-3 minutes for updating millions of records though - then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never seems to complete.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records and commit 1000 records, then commit the next 1000... normally breaking the data up based on the primary key.
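For what it's worth, a rough sketch of that chunked-commit pattern, assuming rows still needing the copy can be identified by comparing the two columns (this is not the exact procedure they described):

BEGIN
  LOOP
    UPDATE MYTABLE
       SET COLUMN_A = COLUMN_B
     WHERE DECODE(COLUMN_A, COLUMN_B, 1, 0) = 0  -- null-safe "columns differ"
       AND ROWNUM <= 1000;                       -- one chunk per iteration
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;                                      -- keep undo per transaction small
  END LOOP;
  COMMIT;
END;
/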
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table or linking to another database. I'm simply copying data from one column to another. I don't consider 13 million records to be large, considering there are systems out there on the order of billions of records. I can't imagine it takes a computer hours and hours (only to fail) to copy a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE, which simply copies one column into another, run for 4 hours and fail, while the CREATE TABLE, which copies the entire table, takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you are updating the table, transaction logs must be created in case a rollback is needed. When creating a table, that isn't necessary.
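If you do go the CREATE TABLE AS route from your experiment, the swap afterwards is just a couple of renames (a sketch; indexes, constraints, grants and triggers have to be re-created on the new table):

ALTER TABLE MYTABLE RENAME TO MYTABLE_OLD;
ALTER TABLE MYTABLE_2 RENAME TO MYTABLE;
-- DROP TABLE MYTABLE_OLD;  -- once the result has been verified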
I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is to delete all previous records for a given user and then insert the new set. There is no PK on the table; only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is that I cannot do this in a batch because of the way the script works: it examines one user at a time, deletes that user's previous 10-20 records, and inserts a new set of 10-20 records, over and over. I am worried that the transaction log will run out of space or that other performance issues could occur. I would like to configure the table not to worry about data preservation or other things that could slow it down. I cannot drop the indexes and so on, because people are accessing the table concurrently while it is being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes it will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
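As a rough sketch of both suggestions (every name below is made up, since the real schema isn't shown):

-- index so per-user work can seek instead of scanning the whole snapshot table
CREATE NONCLUSTERED INDEX IX_UserSnapshot_UserId ON dbo.UserSnapshot (UserId);

DECLARE @UserId int = 42;                 -- the user currently being refreshed

MERGE dbo.UserSnapshot AS tgt
USING #NewResults AS src                  -- the 10-20 freshly calculated rows for this user
    ON tgt.UserId = src.UserId AND tgt.MetricId = src.MetricId
WHEN MATCHED THEN
    UPDATE SET tgt.Value = src.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (UserId, MetricId, Value) VALUES (src.UserId, src.MetricId, src.Value)
WHEN NOT MATCHED BY SOURCE AND tgt.UserId = @UserId THEN
    DELETE;                               -- drop stale rows for this user only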
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
This is called bulk insert: you have to drop all indexes on the destination table and send the insert commands in large batches (hundreds of insert statements separated by ";").
Another way is to use the BULK INSERT statement (http://msdn.microsoft.com/en-us/library/ms188365.aspx), but it involves dumping the data to a file first.
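A minimal sketch of that statement (table name, file path and delimiters are placeholders):

BULK INSERT dbo.UserSnapshot
FROM 'C:\exports\snapshot.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);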
See also: Bulk Insert Sql Server millions of record
It really depends upon many things
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
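For example, something along these lines (index and table names are made up):

-- non-clustered indexes only; disabling a clustered index makes the table unreadable
ALTER INDEX IX_UserSnapshot_UserId ON dbo.UserSnapshot DISABLE;

-- ... run the nightly load here ...

ALTER INDEX IX_UserSnapshot_UserId ON dbo.UserSnapshot REBUILD;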
Partitioning the table may yield performance benefits if you partition by active and inactive users (although the data set may be a little small for this).
Make sure you test how much time each tweak adds to or removes from your load, and work from there.