I have to gather a large volume of data from various SQL Server tables (around 300 million rows) and upsert it into a single fact table in my data warehouse.
1/ What is the best strategy to import all these rows?
2/ Is it good practice to import in batches? How big should a batch be? Is 10k rows OK?
The way I designed this was as a data movement between 3 different layers:
Landing Area
Staging area (where most of the look ups and key substitutions happened)
Data Warehouse
We created bulk tables in the landing area without any sort of keys or anything else on them. We would simply land the data in that area and then move it further along the system.
The way I designed the package was to create 2 very simple tables in SQL Server with 4 columns each. The first table I called ToBeProcessed, and the 2nd (quite obviously) Processed.
The columns that I had were
1)
CREATE TABLE dbo.ToBeProcessed
(
    ID INT IDENTITY(1,1),
    BeginDate DATETIME,
    EndDate DATETIME,
    Processed VARCHAR(1)
);
2)
CREATE TABLE dbo.Processed
(
    ID INT IDENTITY(1,1),
    ProcessedEndDate DATETIME,
    TableName VARCHAR(24),
    CompletedDateTime DATETIME
);
What I did was populate the ToBeProcessed table with date ranges spanning a week each. For example, the 1st row would be from 01/01/2014 to 01/07/2014, the next row would be from 01/08/2014 to 01/15/2014, and so on. This makes sure that you don't overlap any piece of data that you are pulling in.
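To make that concrete, here is a minimal sketch of seeding dbo.ToBeProcessed with one row per week (the 2014 start date, the 52-week span, and 'N' for not-yet-processed are assumptions for illustration):
DECLARE @Start DATETIME = '20140101';
DECLARE @i INT = 0;
WHILE @i < 52
BEGIN
    -- Each row covers one week: [BeginDate, EndDate)
    INSERT INTO dbo.ToBeProcessed (BeginDate, EndDate, Processed)
    VALUES (DATEADD(WEEK, @i, @Start), DATEADD(WEEK, @i + 1, @Start), 'N');
    SET @i = @i + 1;
END;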
On the SSIS side you would want to create a For Each Loop container and parse through all the dates in the 1st table one by one. You can parameterize your Data Flow Task with the variables you create to store the dates from the For Each Loop container. Every time a week's worth of data gets processed, you simply insert the end date into your 2nd table.
This makes sure that you keep track of the data you have processed. The reason for doing this is that if the package fails for any reason, you can start from the point of failure without re-pulling all the data that you have already processed (in your case, you may want to switch to the SIMPLE recovery model to keep transaction logging to a minimum if you are not working in a production environment).
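As a rough sketch, the tracking insert at the end of each loop iteration and the restart query could look like this (the 'FactSales' table name and the @EndDate variable are placeholders; in SSIS this would be an Execute SQL Task parameterized from the loop variables):
-- Run after each week's data flow completes
INSERT INTO dbo.Processed (ProcessedEndDate, TableName, CompletedDateTime)
VALUES (@EndDate, 'FactSales', GETDATE());

-- On restart, only pick up the date ranges that were never completed
SELECT t.BeginDate, t.EndDate
FROM dbo.ToBeProcessed AS t
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.Processed AS p
                  WHERE p.ProcessedEndDate = t.EndDate);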
As for upserting, I think a MERGE statement could be an option, but it all depends on what your processing time frames are. If you are looking to turn this around over the weekend, I would suggest using a stored proc on the data set and making sure that your transaction log can grow comfortably with that amount of data.
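For the upsert itself, a minimal MERGE sketch against a hypothetical fact table (the table, key, and measure names are assumptions):
MERGE dbo.FactSales AS tgt
USING stg.FactSales AS src
    ON tgt.BusinessKey = src.BusinessKey
WHEN MATCHED THEN
    UPDATE SET tgt.Amount   = src.Amount,
               tgt.LoadDate = GETDATE()
WHEN NOT MATCHED BY TARGET THEN
    INSERT (BusinessKey, Amount, LoadDate)
    VALUES (src.BusinessKey, src.Amount, GETDATE());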
This is a brief summary of the quick and dirty way which worked for me. This does not mean it's the best method out there, but it certainly got the job done for me. Let me know if you have any questions.
Related
I have two tables (T_1 & T_2) with the same fields. What I need is that, after every hour, T_2 only contains the data that was inserted into T_1 within that hour (the previous hour's data should be erased). I am using SQL Server. Please help me.
Why would you set up two tables to do this?
Your use-case seems like a canonical case for table partitioning. This is a way of storing data in separate "units" (files). You seem to want T_1 to have its data split by hour.
Then you can directly access the data for a particular hour. This will be as efficient from an access perspective as copying the data into a separate table.
If you really wanted to, you could copy the most recent partition to another table every hour -- swapping in the new data for the older data. But that seems unnecessary in practice.
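If you do end up copying rather than partitioning, a minimal sketch of the hourly refresh, assuming T_1 has an InsertedAt DATETIME column (that column is an assumption on my part):
BEGIN TRANSACTION;

-- Throw away the previous hour's copy
TRUNCATE TABLE T_2;

-- Keep only rows inserted into T_1 within the last hour
INSERT INTO T_2
SELECT *
FROM T_1
WHERE InsertedAt >= DATEADD(HOUR, -1, GETDATE());

COMMIT TRANSACTION;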
I need to know what the impact on the production DB might be of creating triggers on ~30 production tables that capture any UPDATE, DELETE, and INSERT statement and write the following information to a separate table: "PK", "Table Name", "Time of modification".
I have limited ability to test it, as I have read-only permissions to both the Prod and Test environments (and I can get one work day for 10 end users to test it).
I have estimated that the number of records inserted by those triggers will be around 150-200k daily.
Background:
I have a project to deploy a Data Warehouse for a database that is heavily customized, plus there are jobs running every day that manipulate the data. The "Updated On" date column is not being maintained (a customization) and there are hard deletes occurring on tables. We decided to ask the DEV team to add triggers like:
CREATE TRIGGER [dbo].[triggerName] ON [dbo].[ProductionTable]
FOR INSERT, UPDATE, DELETE
AS
-- "inserted" covers INSERTs and the new side of UPDATEs
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM inserted;
-- "deleted" covers DELETEs and the old side of UPDATEs,
-- so a single UPDATE logs two rows per affected row
INSERT INTO For_ETL_Warehouse (Table_Name, Regular_PK, Insert_Date)
SELECT 'ProductionTable', PK_ID, GETDATE() FROM deleted;
on core ~30 production tables.
Based on this table we will pull delta from last 24 hours and push it to Data Warehouse staging tables.
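The daily delta pull from that table would be something along these lines (just a sketch; the column names match the trigger above):
-- Distinct keys touched in the last 24 hours, per source table
SELECT DISTINCT Table_Name, Regular_PK
FROM For_ETL_Warehouse
WHERE Insert_Date >= DATEADD(DAY, -1, GETDATE());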
If anyone has had a similar issue and can help me estimate how it might impact performance on the production database, I would really appreciate it. (If it works, I am saved; if not, I need to propose another solution. Currently mirroring or replication might be hard to get, as the local DEVs have no idea how to set it up...)
Other ideas on how to handle this situation or perform the tests are welcome (my deadline is Friday 26-01).
First of all, I would suggest you encode the table name as a smaller data type rather than a character one (30 tables => tinyint).
Second, you need to understand how big the payload you are going to write is, and how it gets written:
If you choose a suitable clustered index (the date column), then the server just needs to write data row by row in sequence. That is a trivially easy job even if you put in all 200k rows at once.
If you encode the table name as a tinyint, then basically it has to write:
1 byte (table name) + PK size (hopefully numeric, so <= 8 bytes) + 8 bytes datetime, so approximately 17 bytes on the data page, plus indexes if any, plus the log file. This is very lightweight and again will put no "real" pressure on SQL Server.
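A minimal sketch of that encoding (the lookup table and the slimmed-down audit table below are my own illustration, not part of your original design):
-- Hypothetical lookup table: one row per audited production table
CREATE TABLE dbo.Audited_Tables
(
    Table_ID   TINYINT NOT NULL PRIMARY KEY,
    Table_Name SYSNAME NOT NULL
);

-- Audit table storing the 1-byte code instead of the table name
CREATE TABLE dbo.For_ETL_Warehouse_Slim
(
    Table_ID    TINYINT  NOT NULL,
    Regular_PK  BIGINT   NOT NULL,
    Insert_Date DATETIME NOT NULL DEFAULT (GETDATE())
);

-- Clustered index on the date column, as suggested above,
-- so new audit rows are simply appended in sequence
CREATE CLUSTERED INDEX IX_For_ETL_Warehouse_Slim_Date
    ON dbo.For_ETL_Warehouse_Slim (Insert_Date);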
The trigger itself will add a small overhead, but with the amount of rows you are talking about, it is negligible.
I saw systems that do similar stuff on a far larger scale with close to zero effect on the work process, so I would say it's a safe bet. The only problem with this approach is that it will not work in some cases (e.g. outputs to temp tables from DML statements). But if you do not have that kind of blocker, then go for it.
I hope it helps.
I've tried to search for some ideas but can't find anything that's very suitable for my scenario.
I have a table which I write and update data to from multiple sites, maybe a row per second during specific hours of the day, with around 50k records added daily on average. Separate to this, I have dashboards where people can query this data, but some of the queries may be quite complex and take a number of seconds to complete.
I can't afford my write/updates to slow down
Although the dashboards don't need to be real time, it would be a bonus
I'm hosting on Azure SQL DB S2. What options are available?
My current idea is to use an 'active' table for writes/updates and flush the data to the full table every x minutes. My only concern is that I have a seeded BIGINT as the PK on the main table, and because I also save other data linked to this ID, I'd have problems linking to it until I commit to the main table. An option would be to reseed the active table and use SET IDENTITY_INSERT ON on the main table to populate the IDs myself, but I'm not 100% happy with this.
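Roughly what I have in mind for the flush step, keeping the original IDs so the linked data stays consistent (ActiveWrites, MainTable, and the column names are just placeholders):
DECLARE @MaxId BIGINT = (SELECT MAX(Id) FROM dbo.ActiveWrites);

BEGIN TRANSACTION;

SET IDENTITY_INSERT dbo.MainTable ON;

-- Copy everything written so far into the main table, keeping the IDs
INSERT INTO dbo.MainTable (Id, SiteId, Payload, CreatedAt)
SELECT Id, SiteId, Payload, CreatedAt
FROM dbo.ActiveWrites
WHERE Id <= @MaxId;

SET IDENTITY_INSERT dbo.MainTable OFF;

-- Remove only the rows that were just copied
DELETE FROM dbo.ActiveWrites
WHERE Id <= @MaxId;

COMMIT TRANSACTION;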
Just looking for suggestions until I go ahead with my current idea! Thanks
Here is my situation, my client wants to bulk insert 100,000+ rows into the database from a csv file which is simple enough but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse these files will also be uploaded into the live system during the day so I need to make sure I’m not locking any tables for long. The data that is inserted will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ and marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest
IF EXISTS (SELECT blah FROM blah WHERE....)
UPDATE (blah)
ELSE
INSERT (blah)
You could do this in chunks to avoid server load, but it is by no means a quick solution, so SSIS would be preferable.
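For the staging-table route described in the question, a set-based sketch that only touches rows the validation step left unflagged (table and column names are hypothetical):
-- Update products that already exist
UPDATE p
SET p.Price        = s.Price,
    p.Discontinued = s.Discontinued
FROM dbo.Products AS p
JOIN dbo.ProductStaging AS s
    ON s.ProductCode = p.ProductCode
WHERE s.IsInvalid = 0;

-- Insert products that do not exist yet
INSERT INTO dbo.Products (ProductCode, Price, Discontinued)
SELECT s.ProductCode, s.Price, s.Discontinued
FROM dbo.ProductStaging AS s
WHERE s.IsInvalid = 0
  AND NOT EXISTS (SELECT 1
                  FROM dbo.Products AS p
                  WHERE p.ProductCode = s.ProductCode);
Both statements can also be run in TOP (n) batches if holding locks on the live tables for too long is a concern.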
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This runs for 4 hours but never completes.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, commit 1000 records, then commit the next 1000... normally breaking the data up based on the primary key.
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table, or linking to another database. I'm simply copying data from one column to another. I don't consider 13 million records to be large considering there are systems out there in the orders of billions of records. I can't imagine it takes a computer hours and hours (only to fail) at copying a simple column of data in a table that as a whole takes up less than 1 GB of storage.
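For reference, the chunked-commit pattern described to me would look roughly like this as an anonymous PL/SQL block (the COLUMN_A IS NULL condition is an assumption about how unprocessed rows are identified, and it only works if COLUMN_B is never NULL):
BEGIN
  LOOP
    -- Copy the column for up to 1000 not-yet-processed rows
    UPDATE mytable
       SET column_a = column_b
     WHERE column_a IS NULL
       AND ROWNUM <= 1000;

    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;

  COMMIT;
END;
/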
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE, which simply copies one column into another, run for 4 hours and fail, while the CREATE TABLE, which copies the entire table, takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you update the table, undo and redo (Oracle's equivalent of transaction logging) must be generated for every row in case a rollback is needed. Creating a new table with CREATE TABLE ... AS SELECT incurs far less of that overhead.
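If the CREATE TABLE route is acceptable, the full swap mentioned in the question looks roughly like this (the column list is abbreviated, and indexes, constraints, grants, and triggers have to be re-created because CTAS does not carry them over):
CREATE TABLE mytable_2 AS
SELECT /* include every other column of MYTABLE here as well */
       column_b,
       column_b AS column_a
FROM mytable;

DROP TABLE mytable;
RENAME mytable_2 TO mytable;

-- Re-create the primary key, indexes, grants, and triggers on the new table here.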