I have a simple SSIS package that transfer data between source and destination from one server to another.
If its new records - it inserts, otherwise it checks HashByteValue column and if it different its update record.
Table contains approx 1.5 million rows, and updates around 50 columns.
When I start debug the package, for around 2 minutes nothing happens, I cant even see the green check-mark. After that I can see data starts flowing through, but sometimes it stops, then flowing again, then stops again and so on.
The whole package looks like this:
But if I do just INSERT part (without update) then it works perfectly, 1 min and all 1.5 million records in a destination table.
So why adding another LOOKUP transformation to the package that updates records slows down performance so significantly.
Is it something to do with memory? I am using FULL CACHE option in both lookups.
what would be the way to increase performance?
Can the reason be in Auto Growth File size:
Besides changing AutoGrowth size to 100MB, your Database Log file is 29GB. That means you most likely are not doing Transaction Log backups.
If you're not, and only do Full Backups nightly or periodically. Change the Recovery Model of your Database from Full to Simple.
Database Properties > Options > Recovery Model
Then Shrink your Log file down to 100MB using:
DBCC SHRINKFILE(Catalytic_Log, 100)
I don't think that your problem is in the lookup. The OLE DB Command is realy slow on SSIS and I don't think it is meant for a massive update of rows. Look at this answer in the MSDN: https://social.msdn.microsoft.com/Forums/sqlserver/en-US/4f1a62e2-50c7-4d22-9ce9-a9b3d12fd7ce/improve-data-load-perfomance-in-oledb-command?forum=sqlintegrationservices
To verify that the error is not the lookup, try disabling the "OLE DB Command" and rerun the process and see how long it takes.
In my personal experience it is always better to create a Stored procedure to do the whole "dataflow" when you have to update or insert based on certain conditions. To do that you would need a Staging table and a Destination table (where you are going to load the transformed data).
Hope it helps.
Related
I am encountering this weird problem and any help will be really appreciated.
I have a single container in which I have 2 data flow task in my SSIS package, data transfer is very huge. Breakdown of problem.
First Container is transferring from oracle to SQL around 130 million rows and it ran just fine and transfer the rows in about 40 to 60 mins which is very much acceptable.
Now come the second part another data flow task is there that is transferring around 86 million rows from SQL server to SQL server(one table) only, the data transfer flies very fast till 60 70 million and after that it just dies out or crawls just like anything for next 10 million rows it took 15 hours, I am not able to understand why is it happening so?
Table get truncated and then it gets loaded, I have tried increasing DataBuffer proeprties etc but with no avail.
Thanks in advance for any help.
You are creating a single transaction and the transaction log is filling up. You can get 10-100x faster speeds if you move 10000 rows at a time. You may also try setting Maximum Insert Commit Size to 0 or try 5000 and go up to see the impact on performance. This is on the OLE DB Destination component. In my experience 10000 rows is the current magic number that seems to be the sweet spot but of course it is very dependent on how large the rows are, version of SQL Server and the hardware setup.
You should also look if there are indexes on the target table you can try dropping the indexes, loading the table and recreating the indexes.
What is your destination recovery model? Full/Simple, etc...
Are there any transformations between the source and destination? Try sending the source to a RowCount to determine the maximum speed your source can send data. You may be seeing a slowdown on the source side as well.
Is there any difference in content of the rows once you notice the slow down? For example, maybe the more recent rows have lots of text in a varchar(max) column that the early rows did not make use of.
Is your destination running on a VM? If yes, have you pre-allocated the CPU and RAM? SSIS is multi-threaded, but it won't necessarily use 100% of each core. VM hosts may share the resources with other VMs because the SSIS VM is not reporting full usage of all of the resources.
I need to perform a task in which we have a table who has 19 columns with text data type. I want to delete these columns from this source table and move those columns to a new table with data type as varchar(max). The source table has currently 30k rows (with text data type data). This will increase eventually as client will use the database for record storage. For transferring this old data i tried to use "insert into..select.." query but it is taking around 25-30 mins to transfer these much rows(30k). Same is the case with "Select from..insert.." query. I have also tried creating data flow task of SSIS for transferring with OLE DB as source and destination as well. But still it's taking same amount of time. I'm really confused as all posts over internet suggests that SSIS is fastest way for data transfer. Can you please suggests me better way to improve performance of data transfer using any technique?
Thanks
SSIS probably isn't faster if the source and the destination are in the same database and the SSIS process is on the same box.
One approach might be to figure out where you are spending the time and optimise that. If you set Management Studio to "discard results after execution" and run just the select part of your query, how long does that take? If this is a substantial part of the 25-30 minutes then work on optimising that.
If the select statement turns out to be really fast, then all the time is being spent on the insert and you need to look at improving that part of the process instead. There are a couple of things you can try here before you go hardware shopping; are there any indexes or constraints (or triggers!) on the target table that you can drop for the duration of the insert and put back again at the end? Can you put the database in simple mode?
I'm copying 99 million rows from one SQL Server instance to another using the right-click "Tasks" > "Import Data" method. It's just a straight copy into a new, empty table on a new and empty NDF file. I'm using the identity insert when doing the copy so that the IDs will stay in tact. It was going very slowly (30 million records after 12 hours), so my boss told me to cancel it, remove all indexes from the new empty table, then run again.
Will removing indexes on the new table really speed up the transfer of records, and why? I imagine I can create indexes after the table is filled.
What is the underlying process behind right-click "Import Data"? Is it using SqlBulkCopy, is it logging tons of stuff? I know it's not in a transaction because cancelling it stopped it immediately and the already inserted rows were there.
My file growth on the NDF file that holds the table is 20MB. Will increasing this speed up the process when using the above records on 99 million records? It's just an idea I had.
Yes, it should. Each new row being inserted will cause each index to be updated with the new data. It's worth noting that if you remove the indexes, import, then re-add the indexes, those indexes will take a very long time to build anyway.
It essentially runs as a very simple SSIS package. It reads rows from the source and inserts in chunks as a transaction. If your recovery model is set to Full, you could switch it to Bulk Logged for the import. This should be done if you're bulk-moving data when other updates to the database won't be happening, though.
I would try to size the MDF/NDF close to what you'd expect the end result to be. The autogrowth can take time, especially if you have it set low.
I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is it deletes all previous records for a given user, then inserts the new set. There is no PK on the table, only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is I cannot do this in batch because the way the script works, it's examining one user at a time, so it looks at a user, deletes the previous 10-20 records, and inserts a new set of 10-20 records. It does this over and over. I am worried that the transaction log will run out of space or other performance issues could occur. I would like to configure the table to now worry about data preservation or other items that could slow it down. I cannot drop the indexes and all that because people are accessing the table concurrently to it being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes it will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
This is called Bulk Insert, you have to drop all indexes in destination table and send insert commands in large packs (hundreds of insert statements) separated by ;
Another way is to use BULK INSERT statement http://msdn.microsoft.com/en-us/library/ms188365.aspx
but it involves dumping data to file.
See also: Bulk Insert Sql Server millions of record
It really depends upon many things
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
Partitioning the table may see performance benefits if you partition by active and inactive users (although the data set may be a little small for this)
Ensure you test how long each tweak adds or reduces your load and work from there.
I've inherited an SSIS package which loads 500K rows (about 30 columns) into a staging table.
It's been cooking now for about 120 minutes and it's not done --- this suggests it's running at less than 70 rows per second. I know that everybody's environment is different but I think this is a couple orders of magnitude off from "typical".
Oddly enough the staging table has a PK constraint on an INT (identity) column -- and now I'm thinking that it may be hampering the load performance. There are no other constraints, indexes, or triggers on the staging table.
Any suggestions?
---- Additional information ------
The source is a tab delimited file which connects to two separate Data Flow Components that add some static data (the run date, and batch ID) to the stream, which then connects to an OLE DB Destination Adapter
Access mode is OpenRowset using FastLoad
FastLoadOptions are TABLOCK,CHECK_CONSTRAINTS
Maximum insert commit size: 0
I’m not sure about the etiquette of answering my own question -- so sorry in advance if this is better suited for a comment.
The issue was the datatype of the input columns from the text file: They were all declared as “text stream [DT_TEXT]” and when I changed that to “String [DT_STR]” 2 million rows loaded in 58 seconds which is now in the realm of “typical” -- I'm not sure what the Text file source is doing when columns are declared that way, but it's behind me now!
I'd say there is a problem of some sort, I bulk insert a staging table from a file with 20 million records and more columns and an identity field in far less time than that and SSIS is supposed to be faster than SQL Server 2000 bulk insert.
Have you checked for blocking issues?
If it is running in one big transaction, that may explain things. Make sure that a commit is done every now and then.
You may also want to check processor load, memory and IO to rule out resource issues.
This is hard to say.
I there was complex ETL, I would check the max number of threads allowed in the data flows, see if some things can run in parallel.
But it sounds like it's a simple transfer.
With 500,000 rows, batching is an option, but I wouldn't think it necessary for that few rows.
The PK identity should not be an issue. Do you have any complex constraints or persisted calculated columns on the destination?
Is this pulling or pushing over a slow network link? Is it pulling or pushing from a complex SP or view? What is the data source?