I have an Oracle data warehouse that contains a huge amount of data (around 11 million rows), and I want to extract this data to a SQL Server database on a daily basis.
SSIS Package
I have created a package to import data from Oracle to SQL Server using the Slowly Changing Dimension transformation; however, it only handles around 600 rows per second.
I need my package to just insert new records, without updating or touching old records, because the data volume is huge.
Is there any way to do this much faster with other data flow components?
You could try a Merge Join in SSIS; that should let you compare the two sources so that only new records are inserted. Also, I don't like using just a datetime to determine which data does and does not get inserted, though I guess it depends on your source data. It sounds like there is no sequential ID field on the Oracle source data? If there is, I'd use that ID and the datetime in combination to decide what to insert. This could be done in SQL or in SSIS.
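A minimal T-SQL sketch of the ID-plus-datetime idea, assuming hypothetical names (dbo.TargetFact for the SQL Server table, stg.OracleExtract for a staging table the data flow lands the Oracle rows in, source_id/created_at for the key columns):

-- Remember the highest ID already loaded and insert only rows beyond that watermark.
DECLARE @last_id BIGINT = (SELECT ISNULL(MAX(source_id), 0) FROM dbo.TargetFact);

INSERT INTO dbo.TargetFact (source_id, created_at, payload)
SELECT s.source_id, s.created_at, s.payload
FROM   stg.OracleExtract AS s
WHERE  s.source_id > @last_id;                 -- sequential ID watermark
-- optionally tighten the filter with the datetime column as well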
600 rows per second is not too bad in your case.
Assume those 11 million rows were collected over just one year. That means only about 30K new records per day, which is roughly one minute of loading at that rate.
The biggest problem is identifying which records to insert.
Ideally you have a timestamp or a sequential ID to identify the most recently inserted records.
If your ID is not sequential, you can extract ONLY the ID field from the Oracle table into SSIS, compare it to the existing dataset, and then request only the newest records from Oracle.
If you don't have these fields, you can extract all 11 million records, generate a hash on both sides, and compare the hash values to determine what is new and needs to be inserted.
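A sketch of the ID-only comparison on the SQL Server side, assuming hypothetical names (stg.OracleIds is a staging table holding only the ID column pulled from Oracle, dbo.TargetTable is the destination):

-- The anti-join yields the IDs that are new; only those rows are then requested from Oracle.
SELECT s.id
FROM   stg.OracleIds AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.TargetTable AS t
                   WHERE  t.id = s.id);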
I have a table in SQL Server with 200 million records that exists on two different servers. I need to move this table from server 1 to server 2.
The table on server 1 can be a subset or a superset of the table on server 2. Some of the records (around 1 million) on server 1 have been updated, and I need to apply those updates on server 2. So currently I am following this approach:
1) Use SSIS to move the data from server 1 to a staging database on server 2.
2) Compare the staging data with the table on server 2 column by column; if any column is different, I update the whole row.
This is taking a lot of time. I tried using HASHBYTES to compare rows, like this:
HASHBYTES('sha',CONCAT(a.[account_no],a.[transaction_id], ...))
<>
HASHBYTES('sha',CONCAT(b.[account_no],b.[transaction_id], ...))
But this is taking even more time.
Is there any other approach that would be faster and save time?
This is a problem that's pretty common.
First, do not try to do the updates directly in SQL; the performance will be terrible and will bring the database server to its knees.
For context, TS1 will be the table on Server 1 and TS2 will be the table on Server 2.
Using SSIS - create two steps within the job:
First, find the deleted rows: scan TS2 by ID, and delete any TS2 row whose ID does not exist in TS1.
Second, scan TS1; if the ID exists in TS2, you will need to update that record. If memory serves, SSIS can inspect for differences and update only when needed; otherwise, just execute the update statement.
While scanning TS1, if the ID does not exist in TS2, then insert the record.
I can't speak to performance on this due to variations in schemas and servers, but it will be compute intensive to analyze the 200 million records. It WILL take a long time.
For on-going execution, you will need to add a "last modified date" timestamp to each record and a trigger to update the field on any legitimate change. Then use that to filter out your problem space. The first scan will not be terrible, as it ONLY looks at the IDs. The insert/update phase will actually benefit from the last modified date filter, assuming the number of records being modified is small (< 5%?) relative to the overall dataset. You will also need to add an index to that column to aid in the filtering.
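A rough sketch of that bookkeeping in T-SQL, assuming TS1 lives in the dbo schema and has an integer key column named id (both assumptions):

-- Last-modified column with a default for new rows, plus an index to support the filter.
ALTER TABLE dbo.TS1
    ADD last_modified DATETIME2 NOT NULL
        CONSTRAINT DF_TS1_last_modified DEFAULT (SYSUTCDATETIME());

CREATE NONCLUSTERED INDEX IX_TS1_last_modified ON dbo.TS1 (last_modified);
GO

-- Trigger that touches the column on any legitimate change.
CREATE TRIGGER trg_TS1_last_modified ON dbo.TS1
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET    last_modified = SYSUTCDATETIME()
    FROM   dbo.TS1 AS t
    JOIN   inserted AS i ON i.id = t.id;   -- only the rows that actually changed
END;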
The other option is to perform a burn and load each time: disable any constraints around TS2, truncate TS2, copy the data into TS2 from TS1, and finally re-enable the constraints and rebuild any indexes.
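A hedged sketch of that burn-and-load sequence (TRUNCATE only works if no foreign keys reference TS2; fall back to DELETE otherwise, and the bulk copy itself would be done by SSIS or bcp rather than the placeholder comment here):

ALTER TABLE dbo.TS2 NOCHECK CONSTRAINT ALL;            -- disable check/FK constraints on TS2

TRUNCATE TABLE dbo.TS2;

-- ... bulk copy TS1 -> TS2 here (SSIS data flow, bcp, or INSERT ... SELECT over a linked server) ...

ALTER TABLE dbo.TS2 WITH CHECK CHECK CONSTRAINT ALL;   -- re-enable and re-validate constraints
ALTER INDEX ALL ON dbo.TS2 REBUILD;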
Best of luck to you.
I have a huge text file with 1 million rows; each row contains only a 28-character number stored as text.
I want to import them into SQL Server, into a table with one corresponding column, so that a million values will be inserted into a single-column table.
I used SSIS, but it is quite slow (1 million rows take 4.5 hours or more to insert). Are there any other ways to do this much faster?
You can use the BCP utility for fast imports. See the official documentation here: DOC
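The same bulk-load engine is also exposed from T-SQL as BULK INSERT; a minimal sketch, with a hypothetical file path and table name (one 28-character value per line):

BULK INSERT dbo.Numbers
FROM 'C:\data\numbers.txt'
WITH (
    ROWTERMINATOR = '\n',     -- one value per line, no field delimiter needed
    BATCHSIZE     = 100000,   -- commit in batches instead of one huge transaction
    TABLOCK                   -- allows minimal logging and a faster load
);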
In the end, I decided to split the huge file into parts and run several SSIS packages at the same time, all inserting into the same table. There is no locking problem on the inserts. I hope 6 SSIS packages will finish this job in about an hour.
Thanks.
Dilemma:
I am about to populate data on MS SQL Server (2012 Developer Edition). The data is based on production data, and the volume is around 4 TB (around 250 million items).
Purpose:
To test performance of full-text search as well as of a regular index. The target should be around 300 million items of around 500K each.
Question:
What should I do beforehand to speed up the process, and what consequences should I worry about?
For example:
Switching off statistics?
Should I do a bulk insert of 1k items per transaction instead of single transaction?
Simple recovery model?
Log truncation?
Important:
I will use a sample of 2K production items to create every random item that will be inserted into the database. I will use nearly unique samples generated in C#. It will be one table:
table
(
long[id],
nvarchar(50)[index],
nvarchar(50)[index],
int[index],
float,
nvarchar(50)[index],
text[full text search index]
)
Almost invariably, in a situation like this (and I've had several of them), I've used SSIS. SSIS is the fastest way I know to import large amounts of data into a SQL Server database. You have complete control over the batch (transaction) size, and it will perform bulk inserts. In addition, if you have transformation requirements, SSIS will handle them with ease.
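To tie this back to the checklist in the question, the database-level settings can be sketched in T-SQL (database and table names here are placeholders; the batching side is what SSIS exposes as "Rows per batch" / "Maximum insert commit size" on the OLE DB destination):

-- Simple recovery keeps the log from ballooning during the load.
ALTER DATABASE LoadTest SET RECOVERY SIMPLE;

-- TABLOCK on the insert allows minimal logging; load one generated batch at a time.
INSERT INTO dbo.Items WITH (TABLOCK) (name, category, score, weight, tag, body)
SELECT name, category, score, weight, tag, body
FROM   stg.GeneratedBatch;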
I am using JasperReports to generate reports from SQL Server on a daily basis. The problem is that every day the report reads the data from the beginning, but I want it to exclude records read earlier and include only new rows. The database is old and doesn't have timestamp columns in the table, so there is no way to identify which records are 'new' and which are 'old'.
I am not allowed to modify it either.
Please suggest any other way if possible.
You can create a new table, and every time you print records in your report, insert those records into it. Then you can query the original table with a NOT EXISTS condition against the new table.
The obvious drawbacks of this approach are the space consumed in the DB and the extra work of inserting records into the new table, but if you cannot modify the original table, it's the only solution.
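A minimal sketch of that idea, with hypothetical table and column names (dbo.reported_keys is the new tracking table):

-- Tracking table of keys that have already appeared on a report.
CREATE TABLE dbo.reported_keys (record_id INT NOT NULL PRIMARY KEY);

-- Report query: only rows not yet tracked.
SELECT o.record_id, o.some_column
FROM   dbo.original_table AS o
WHERE  NOT EXISTS (SELECT 1 FROM dbo.reported_keys AS r WHERE r.record_id = o.record_id);

-- After the report runs, remember what was printed.
INSERT INTO dbo.reported_keys (record_id)
SELECT o.record_id
FROM   dbo.original_table AS o
WHERE  NOT EXISTS (SELECT 1 FROM dbo.reported_keys AS r WHERE r.record_id = o.record_id);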
Otherwise, Alex K's suggestion is very good.
I am looking into creating a persisted computed column holding an MD5 checksum in SQL Server 2008. There are gigabytes of data in this particular table (not my design), and I would like to know whether creating the new computed column will lock the entire table until the computation has finished.
Will the new column be computed for all rows in the table, or only when rows are selected/updated via a SQL command?
What is the recommended practice for creating MD5 computed values in SQL Server?
After spinning up an exact copy of the database on a virtual machine, I experimented with adding the persisted column.
There were about 1 million rows, and it took 2 hours to compute the MD5 hash column. During this time the entire table was locked for selects, updates, and inserts. On a production server you would have to factor in other issues as well.
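For reference, a persisted MD5 computed column generally looks like the sketch below; the table and column names are placeholders, and on SQL Server 2008 the HASHBYTES input is limited to 8000 bytes:

-- Concatenating with a delimiter avoids accidental collisions such as ('ab','c') vs ('a','bc').
-- HASHBYTES is deterministic, so the column can be PERSISTED (and therefore indexed).
ALTER TABLE dbo.BigTable
    ADD row_md5 AS HASHBYTES('MD5', ISNULL(col1, '') + '|' + ISNULL(col2, '')) PERSISTED;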