I've inherited an SSIS package which loads 500K rows (about 30 columns) into a staging table.
It's been cooking now for about 120 minutes and it's not done --- this suggests it's running at less than 70 rows per second. I know that everybody's environment is different but I think this is a couple orders of magnitude off from "typical".
Oddly enough the staging table has a PK constraint on an INT (identity) column -- and now I'm thinking that it may be hampering the load performance. There are no other constraints, indexes, or triggers on the staging table.
Any suggestions?
---- Additional information ------
The source is a tab delimited file which connects to two separate Data Flow Components that add some static data (the run date, and batch ID) to the stream, which then connects to an OLE DB Destination Adapter
Access mode is OpenRowset using FastLoad
FastLoadOptions are TABLOCK,CHECK_CONSTRAINTS
Maximum insert commit size: 0
I’m not sure about the etiquette of answering my own question -- so sorry in advance if this is better suited for a comment.
The issue was the datatype of the input columns from the text file: They were all declared as “text stream [DT_TEXT]” and when I changed that to “String [DT_STR]” 2 million rows loaded in 58 seconds which is now in the realm of “typical” -- I'm not sure what the Text file source is doing when columns are declared that way, but it's behind me now!
I'd say there is a problem of some sort, I bulk insert a staging table from a file with 20 million records and more columns and an identity field in far less time than that and SSIS is supposed to be faster than SQL Server 2000 bulk insert.
Have you checked for blocking issues?
If it is running in one big transaction, that may explain things. Make sure that a commit is done every now and then.
You may also want to check processor load, memory and IO to rule out resource issues.
This is hard to say.
I there was complex ETL, I would check the max number of threads allowed in the data flows, see if some things can run in parallel.
But it sounds like it's a simple transfer.
With 500,000 rows, batching is an option, but I wouldn't think it necessary for that few rows.
The PK identity should not be an issue. Do you have any complex constraints or persisted calculated columns on the destination?
Is this pulling or pushing over a slow network link? Is it pulling or pushing from a complex SP or view? What is the data source?
Related
I am encountering this weird problem and any help will be really appreciated.
I have a single container in which I have 2 data flow task in my SSIS package, data transfer is very huge. Breakdown of problem.
First Container is transferring from oracle to SQL around 130 million rows and it ran just fine and transfer the rows in about 40 to 60 mins which is very much acceptable.
Now come the second part another data flow task is there that is transferring around 86 million rows from SQL server to SQL server(one table) only, the data transfer flies very fast till 60 70 million and after that it just dies out or crawls just like anything for next 10 million rows it took 15 hours, I am not able to understand why is it happening so?
Table get truncated and then it gets loaded, I have tried increasing DataBuffer proeprties etc but with no avail.
Thanks in advance for any help.
You are creating a single transaction and the transaction log is filling up. You can get 10-100x faster speeds if you move 10000 rows at a time. You may also try setting Maximum Insert Commit Size to 0 or try 5000 and go up to see the impact on performance. This is on the OLE DB Destination component. In my experience 10000 rows is the current magic number that seems to be the sweet spot but of course it is very dependent on how large the rows are, version of SQL Server and the hardware setup.
You should also look if there are indexes on the target table you can try dropping the indexes, loading the table and recreating the indexes.
What is your destination recovery model? Full/Simple, etc...
Are there any transformations between the source and destination? Try sending the source to a RowCount to determine the maximum speed your source can send data. You may be seeing a slowdown on the source side as well.
Is there any difference in content of the rows once you notice the slow down? For example, maybe the more recent rows have lots of text in a varchar(max) column that the early rows did not make use of.
Is your destination running on a VM? If yes, have you pre-allocated the CPU and RAM? SSIS is multi-threaded, but it won't necessarily use 100% of each core. VM hosts may share the resources with other VMs because the SSIS VM is not reporting full usage of all of the resources.
I need to perform a task in which we have a table who has 19 columns with text data type. I want to delete these columns from this source table and move those columns to a new table with data type as varchar(max). The source table has currently 30k rows (with text data type data). This will increase eventually as client will use the database for record storage. For transferring this old data i tried to use "insert into..select.." query but it is taking around 25-30 mins to transfer these much rows(30k). Same is the case with "Select from..insert.." query. I have also tried creating data flow task of SSIS for transferring with OLE DB as source and destination as well. But still it's taking same amount of time. I'm really confused as all posts over internet suggests that SSIS is fastest way for data transfer. Can you please suggests me better way to improve performance of data transfer using any technique?
Thanks
SSIS probably isn't faster if the source and the destination are in the same database and the SSIS process is on the same box.
One approach might be to figure out where you are spending the time and optimise that. If you set Management Studio to "discard results after execution" and run just the select part of your query, how long does that take? If this is a substantial part of the 25-30 minutes then work on optimising that.
If the select statement turns out to be really fast, then all the time is being spent on the insert and you need to look at improving that part of the process instead. There are a couple of things you can try here before you go hardware shopping; are there any indexes or constraints (or triggers!) on the target table that you can drop for the duration of the insert and put back again at the end? Can you put the database in simple mode?
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never seems to complete.
I spoke with a former developer that was more familiar with Oracle than I, and they told me you would normally create a procedure that breaks this down into chunks of data to be commited (roughly 1000 records or so). This procedure would iterate over the 13 million records and commit 1000 records, then commit the next 1000...normally breaking the data up based on the primary key.
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table, or linking to another database. I'm simply copying data from one column to another. I don't consider 13 million records to be large considering there are systems out there in the orders of billions of records. I can't imagine it takes a computer hours and hours (only to fail) at copying a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE run for 4 hours and fail (which simply copies one column into another column), but the create table which copies the entire table takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you are updating the table, transaction logs must be created in case a rollback is needed. Creating a table, that isn't necessary.
I am currently using MS SQL Server 2008 but I'm not sure it it is the best system for this particular task.
I have a single table like so:
PK_ptA PK_ptB DateInserted LookupColA LookupColB ... LookupColF DataCol (ntext)
A common query is
SELECT TOP(1000000) DataCol FROM table
WHERE LookupColA=x AND LookupColD=y AND LookupColE=z
ORDER BY DateInserted DESC
The table has about a billion rows with 5 million inserted per day.
My main problem with SQL Server is that it isn't too easy to shard or spread out the datafiles. Also, exporting seems to max out at 1000rows per second (about 1MB/s) which seems very slow.
Another problem I have is, with SQL Server, if I want to add a new LookupCol the log file grows enormously requiring a large amount of rarely used free space on tap.
Are there any obvious better solutions for this problem?
You have a problem, and it is not SQL Server. let me also ignore that you seem to ahve a bad table design.
Spreading data files is actually pretty easy. REORGANIZING later is not that easy, but also doable. How is your table, filegroup and file layout?
export 1mb per second is a joke. Seriously. I have been handling 150 million row files in minutes - that runs down to a LOT more than 60.000 rows per minute. Something is freaking out. Temp space? Did you do a performance analysis? How does the hardware look?
Nothing will work for the log usage. Basically like most pro databases the log contains all changed database pages during a transaction. Adding a field changes - ALL pages.
You should:
Redesign the database (use a view to keep the same old table in place if you ahve to) so that it does not ahve "LookupColA" etc., but is normalized (LookupValue, and a LookuPTable that is coded by "column"). This way you get instant additional fields. This turns into a data warehouse like star schema.
Do a performance analysis. Looks like you ahve some problems.
Definitely tell us abou your hardware ;)
This problem here is definitely NOT SQL Server, it is related to bad table design AND - possibly - insufficient - badly utilized hardware.
Ok, the table design (separate answer). Lokup are bassically lookup tables.
So....
LookupTable
pk (int)
TableType
Value
as vields
ValueTable
pk
ValueLookupMap table
pk of ValueTable entry
pk of LookupTable entry
So, basically, if you add a lookup "field" then you just create a set of entries in the LookupTable then add entries in the ValueLookupMap.
I have a requirement to take a "snapshot" of a current database and clone it into the same database, with new Primary Keys.
The schema in question consists of about 10 tables, but a few of the tables will potentially contain hundreds of thousands to 1 million records that need to be duplicated.
What are my options here?
I'm afraid that writing a SPROC will require a locking of the database rows in question (for concurrency) for the entire duration of the operation, which is quite annoying to other users. How long would such an operation take, assuming that we can optimize it to the full extent sqlserver allows? Is it going to be 30 seconds to 1 minute to perform this many inserts? I'm not able to lock the whole table(s) and do a bulk insert, because there are other users under other accounts that are using the same tables independently.
Depending on performance expectations, an alternative would be to dump the current db into an xml file and then asynchronously clone the db from this xml file at leisure in the background. The obvious advantage of this is that the db is only locked for the time it takes to do the xml dump, and the inserts can run in the background.
If a good DBA can get the "clone" operation to execute start to finish in under 10 seconds, then it's probably not worth the complexity of the xmldump/webservice solution. But if it's a lost cause, and inserting potentially millions of rows is likely to balloon out in time, then I'd rather start out with the xml approach right away.
Or maybe there's an entirely better approach altogether??
Thanks a lot for any insights you can provide.
I would suggest backing the up database, and then restoring it as new db on your server. You can use that new DB as your source.
I will definitely recommend against the xml dump idea..
Does it need to be in the exact same tables? You could make a set of "snapshots" tables where all these records go, you would only need a single insert + select, like
insert into snapshots_source1 (user,col1, col2, ..., colN)
select 'john', col1, col2, ..., colN from source1
and so on.
You can make snapshots_* to have an IDENTITY column that will create the 'new PK' and that can also preserve the old one if you so wished.
This has (almost) no locking issues and looks a lot saner.
It does require a change in the code, but shouldn't be too hard to make the app to point to the snapshots table when appropriate.
This also eases cleaning and maintenance issues
---8<------8<------8<---outdated answer---8<---8<------8<------8<------8<---
Why don't you just take a live backup and do the data manipulation (key changing) on the destination clone?
Now, in general, this snapshot with new primary keys idea sounds suspect. If you want a replica, you have log shipping and cluster service, if you want a copy of the data to generate a 'new app instance' a backup/restore/manipulate process should be enough.
You don't say how much your DB will occupy, but you can certainly backup 20 million rows (800MB?) in about 10 seconds depending on how fast your disk subsystem is...