My team is using an SSIS package that runs in an ADF SSIS IR (D4v3) through a self-hosted integration runtime on an Azure VM.
The SSIS package is run through an ADF pipeline (Execute SSIS Package activity) and connects to Azure SSAS to pull 5M rows of data using a DAX query (about 3 GB total).
On a manual query, it takes about 60 seconds to pull 1M rows of data. However, the SSIS package takes 10 hours to pull the data. The SSAS memory usage averaged around 35 GB, below the 50 GB limit. Verbose logging shows multiple lines saying "Data Flow Task: Rows were provided to a data flow component as input." and others saying the data flow engine will call a component method. I see 50k rows loading incrementally into the database periodically (every 10-20 minutes or so).
I am using an ADO.NET source and an OLE DB destination to an Azure SQL Database (Standard, 200 DTU, S4 tier). The source access mode is SQL Command with the DAX query, and the destination access mode is Table or view - fast load. CPU utilization is 20-40% and memory usage is about 4-12 GB out of 16 GB on the ADF SSIS IR.
My question here is:
What could cause the slowness here? The manual query takes only 60 seconds or less for 1/5 of the data. No spilling to disk should occur since memory usage is under the limit. Also, sometimes there is a log entry "Data Flow Task: PrimeOutput will be called on a component. PrimeOutput will be called on a component: ADO NET Source." and the package is stuck. What could be the issue?
What is the indicator that all the data has been pulled from the source? If the data is loading into the database, does that mean the connection and the pull of the data into the SSIS package's memory have finished?
Related
I am doing an ETL in an SSIS package using an Oracle source and a SQL Server destination.
The source contains about 17 million rows. The data is transferred in batches of 10,000 rows, and each 10,000-row batch is committed to the SQL Server destination when done.
When we observe the time taken for the transfer, we notice that the first 10,000 rows took 2 seconds, the duration gradually increased throughout the load, and the last 10,000 rows took 132 seconds. What could be the reason for the load time increasing so steeply?
We are using a script in the SSIS package to pull the data from the Oracle source.
There are 2 indexes on the destination SQL Server table: one on the primary key (clustered) and the other on a date column (non-unique, non-clustered). But we first load into a temp table on SQL Server which doesn't have any indexes; after the transfer to the temp table completes, we transfer the data into the actual table with the indexes.
The issue was with the Oracle OLE DB driver. We were previously using the 19.0 driver, which gave us this performance issue. After changing to the 12.1 driver, the 10,000-row sets took around 3-5 seconds throughout the load.
We have an SSIS package that the development team apparently calls 'slow'. Since they do not have a person with SSIS ETL experience, as the DBA I tried digging into it. Below is the information I found:
SQL Server was 2014 and was upgraded in-place to 2017, so it has SSIS of both versions.
They load a SQL Server table of size 200 GB into SSIS and then zip the data into a flat file using command-line zip functionality.
The data flow task simply hits a select * from a view - the view contains nothing but the table, with no other fancy joins.
While troubleshooting I found that there is hardly any load on SQL Server, possibly because the select command runs single-threaded and does not utilize the SQL Server cores.
When I run the same select * command myself (only for 5 seconds, since it is a 200 GB table), it is also single-threaded.
The package has a configuration file, which the SQL job points to (this is how the package runs), with some connection settings.
Opening the package in BIDS shows DefaultBufferMaxRows as 10000 only (possibly the default value); since neither the configuration file nor any variables hold a custom value, I guess this is what the package is using too.
Both SQL and SSIS are on the same server. SQL has been allocated max memory, leaving around 100 GB for SSIS and the OS.
Kindly share any ideas on how I can force SQL Server to run this select command using multiple threads so that the entire table gets into the SSIS buffers faster.
Edit: I am aware that bcp can read data faster than any process and save it to a flat file, but at this point changes to the SSIS package have to be kept to a minimum, and I am exploring options that can be incorporated within the SSIS package.
Edit 2: Parallelism works perfectly on my SQL Server, as I verified with a lot of other queries. The table in question is 200 GB. It is something with SSIS only that is not hammering my DB as hard as it should.
Edit 3: I have made some progress. I adjusted the buffer size to 100 MB and max rows to 100,000, and now the package seems to be doing better. When I run this package on the server directly using the dtexec utility, it generates a good load of 40-50 MB per second, but through the SQL job it never generates a load of more than 10 MB, so I am trying to figure out this behavior.
Edit 4: I found that when I run the package directly by logging on to the server and invoking the dtexec utility, it runs well because it generates a good load on the DB, keeping data I/O steady between 30-50 MB/sec.
The same thing run from the SQL job never exceeds 10 MB/sec of I/O.
I even tried running the package via the Agent while opting for the cmdline operation, but nothing changed. The Agent is clearly the culprit here; any pointers on what could be wrong?
Final try:
I am stumped by what I finally observed:
1) The same package runs 3x faster when run from a command prompt on the Windows node by invoking the dtexec utility.
2) The exact same package runs 3x slower than the above when invoked by the SQL Agent, which has sysadmin permissions on Windows as well as on SQL Server.
In both cases I checked which version of dtexec they invoke, and they both invoke the same version. So why one would be so slow is beyond my understanding.
I don't think there is a general solution to this issue, since it is a particular case for which you haven't provided much information. Since there are two components in your data flow task (OLE DB Source and Flat File Destination), I will give some suggestions related to each component.
Before giving suggestions for each component, it is worth mentioning the following:
If no transformations are applied within the data flow task, it is not recommended to use this task; it is preferable to use the bcp utility.
Check the TempDB and database log sizes.
If a clustered index exists, try to rebuild it. If not, try to create one (see the sketch after this list).
To check which component is slowing the package execution, open the package in Visual Studio, remove the Flat File Destination and replace it with a dummy Script Component (write any useless code, for example: string s = "";). Then run the package; if it is fast enough, the problem is caused by the Flat File Destination, otherwise you need to troubleshoot the OLE DB Source.
Try executing the query in SQL Server Management Studio and check the execution plan.
Check the package's TargetServerVersion property within the package configuration and make sure it is correct.
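As a rough illustration of the clustered-index suggestion above, a minimal T-SQL sketch (the table and index names are hypothetical, not from the original question):

-- check fragmentation of the existing indexes on the destination table
SELECT i.name, ips.index_type_desc, ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.DestinationTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

-- rebuild an existing clustered index, or create one if the table is currently a heap
ALTER INDEX PK_DestinationTable ON dbo.DestinationTable REBUILD;
-- CREATE CLUSTERED INDEX CX_DestinationTable_Id ON dbo.DestinationTable (Id);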
OLE DB Source
As you mentioned, you are using a Select * from view query where the data is stored in a table that contains a considerable amount of data. The SQL Server query optimizer may find that reading data using a table scan is more efficient than reading from indexes, especially if your table does not have a clustered index (rowstore or columnstore).
There are many things you can try to improve the data load:
Try replacing the Select * from view with the original query used to create the view.
Try changing the data provider used in the OLE DB Connection Manager: SQL Server Native Client or the Microsoft OLE DB Driver for SQL Server (not the old SQLOLEDB provider).
Try increasing the DefaultBufferMaxRows and DefaultBufferSize properties. more info
Try using a SQL Command with specific column names instead of selecting the view name (Table or View data access mode). more info
Try to load data in chunks, as sketched below.
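One way to load in chunks, as suggested in the last item above, is to drive the source query with a key range. A minimal sketch, assuming the source table has an increasing integer key named Id (both names are hypothetical):

-- hypothetical chunked read: each iteration would feed one execution of the data flow
DECLARE @ChunkSize int = 500000, @FromId int = 0, @MaxId int;
SELECT @MaxId = MAX(Id) FROM dbo.SourceTable;

WHILE @FromId < @MaxId
BEGIN
    SELECT *
    FROM dbo.SourceTable
    WHERE Id > @FromId AND Id <= @FromId + @ChunkSize;   -- parameterize this range in the OLE DB Source

    SET @FromId = @FromId + @ChunkSize;
END;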
Flat File Destination
Check that the flat file directory is not located on the same drive where the SQL Server instance is installed.
Check that the flat file is not located on a busy drive.
Try to export data into multiple flat files instead of one huge file (split the data into smaller files), since as the exported data in a single file grows, writing to that file becomes slower and the package slows down with it. (See the fifth suggestion above.)
Any indexes on the table could slow the loading. If there are any indexes, try dropping them before the load and recreating them afterwards. This would also update the index statistics, which would otherwise be skewed by the bulk insert.
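A minimal sketch of that approach, using a hypothetical non-clustered index name (a clustered index, if any, must stay in place or the table becomes inaccessible):

-- before the load: disable the non-clustered index so the bulk insert does not have to maintain it
ALTER INDEX IX_Destination_Date ON dbo.Destination DISABLE;

-- ... run the SSIS load here ...

-- after the load: rebuild it, which also refreshes its statistics
ALTER INDEX IX_Destination_Date ON dbo.Destination REBUILD;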
Is SQL Server utilizing other cores for other queries? If not, maybe someone has played with the following settings:
Check these under the server configuration settings (a query to inspect them follows below):
Maximum Degree of Parallelism
Cost Threshold for Parallelism
Whether processors are affinitized to specific CPUs.
Also, a MAXDOP query hint could cause this too, but you said there is nothing fancy in the view.
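To check those server-level settings quickly, a query along these lines should work:

-- current server-level parallelism settings
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name IN ('max degree of parallelism', 'cost threshold for parallelism');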
Also, it seems you have enough memory on the server, so why not increase DefaultBufferMaxRows to a much larger number so that SQL Server doesn't get slowed down waiting for the buffer to empty? Remember, SQL Server and SSIS are using the same disk and will have to wait for each other to use it, which causes extra wait time for both. It's better for SQL Server to read the data into the buffer first, and then let SSIS start processing and writing it to disk.
DefaultBufferSize: the default is 10 MB; the maximum possible is 2^31-1 bytes.
DefaultBufferMaxRows: the default is 10,000.
You can set AutoAdjustBufferSize so that DefaultBufferSize is automatically calculated based on DefaultBufferMaxRows.
See other performance troubleshooting ideas here: https://learn.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features?view=sql-server-ver15
Edit 1: Some other properties you can check out. These are explained in the above link as well
MaxConcurrentExecutables (package property): This defines how many threads a package can use.
EngineThreads (Data Flow property): how many threads the data flow engine can use
Also try running dtexec under the same proxy user used by the SQL Agent to see if you get a different result with that account versus your own. You can use runas /user:... cmd to open a command window under that user and then execute dtexec.
Try changing the proxy user used in the SQL Agent to a new one and see if it helps, or try giving it elevated permissions on the directories it needs access to.
Try keeping the package in the file system and executing it through dtexec from the SQL Agent directly instead of using catalog.start_execution.
Not your case, but for other readers: if you have an Execute Package Task, make sure the child packages are set to run in-process via the ExecuteOutOfProcess property (leave it False). This just reduces the overhead of spawning extra processes.
Not your case, but for other readers: if you're testing in BIDS, it will run in debug mode by default and thus run slowly. Use Ctrl-F5 (Start Without Debugging). Best of all is to use dtexec directly to test performance.
A data flow task may not be the best choice to move this data. SSIS data flow tasks are an ETL tool where you can do transformations, lookups, redirect invalid rows, add derived columns and a lot more. If the data flow task is simple and only moves data with no manipulation or redirection of rows, then ditch the data flow task and use a simple Execute SQL Task with OPENROWSET to import the flat file that was generated from the command line and zipped up. Assuming the flat file is a .csv file, here is a working example that queries a .csv and inserts the data into a table.
You need the [Ad Hoc Distributed Queries] run_value set to 1.
INSERT INTO dbo.Destination
SELECT *
FROM OPENROWSET('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};
DefaultDir=D:\;Extensions=csv;', 'SELECT * FROM YourCsv.csv') AS [File];
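To flip that run_value, a standard sp_configure sequence (requires sysadmin-level permission):

-- enable ad hoc distributed queries so OPENROWSET against MSDASQL is allowed
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;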
Here are some additional examples: https://sqlpowershell.blog/2015/02/09/t-sql-read-csv-files-using-openrowset/
There are suggestions in this MSDN article: MSDN DataFlow performance features
Key ones appear to be:
Check the EngineThreads property of the Data Flow task, which tells SSIS how many source and worker threads it should use.
If using an OLE DB Source to select data from a view, use the "SQL Command" access mode and write SELECT * FROM View, rather than using the Table or View access mode.
Let us know how you get on
You may be facing an I/O bottleneck while writing the 200 GB to the flat file. I don't see any problem with the SQL query.
If possible, create multiple files and split the data (either by modifying SSIS or changing the select query), as sketched below.
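A minimal sketch of splitting via the select query, assuming a hypothetical table dbo.BigTable with a stable key column OrderId and four parallel data flows, each writing its own flat file:

-- data flow 1 (table and column names are hypothetical)
SELECT * FROM dbo.BigTable WHERE ABS(CHECKSUM(OrderId)) % 4 = 0;
-- data flow 2
SELECT * FROM dbo.BigTable WHERE ABS(CHECKSUM(OrderId)) % 4 = 1;
-- data flows 3 and 4 use % 4 = 2 and % 4 = 3 respectively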
We are transferring data to an Oracle database from two different sources and it's extremely slow.
Please see the notes below. Any suggestions?
Notes:
We're using the Microsoft OLE DB Provider for Oracle.
One data source is SQL Server and includes about 5M records.
The second data source is Oracle and includes about 700M records.
When trying to transfer the SQL Server data, we broke it up into five Data Flow Tasks in the Control Flow. Each Data Flow Task in turn uses an OLE DB Source that internally uses a SQL command which effectively selects 1M of the 5M records. When we ran this package, the first data flow task ran for about 3 hours and had only transferred about 50,000 records by the time we ended the process.
We had similar experience with the Oracle data as well.
For some reason, saving to an Oracle destination is extremely slow.
Interestingly, we once transferred the same 700M records from Oracle to SQL Server (the opposite direction) and it worked as expected, in about 4.5 to 5 hours.
On the Oracle side you can examine v$session to see where the time is being spent (if AWR is licensed on the Oracle instance you can use DBA_HIST_ACTIVE_SESS_HISTORY or v$active_session_history).
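For example, a rough query against v$active_session_history to see which wait events the loading session is spending time on (requires the Diagnostics Pack license; adjust the time window as needed):

-- top wait events sampled in the last hour on the Oracle side
SELECT event, wait_class, COUNT(*) AS samples
FROM   v$active_session_history
WHERE  sample_time > SYSDATE - 1/24
GROUP  BY event, wait_class
ORDER  BY samples DESC;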
I work on Oracle performance problems every day (over 300 production Oracle instances), so I feel qualified to say that I can't give you a specific answer to your question, but I can point you in the right direction.
Typical process mistakes that make inserts slow:
not using array inserts
connecting to the DB for each insert (sounds strange? believe me, I've seen DataStage and other ETL tools set up this way)
app server/client not on the same local area network as the Oracle instance
indexes on the table(s) being inserted into (especially problematic with bitmapped indexes); each statement requires an index update as well as a table update
redo log files too small on the Oracle instance (driving up redo log file switching)
log_buffer parameter on the DB side too small
not enough DB writers (see the db_writer_processes initialization parameter)
committing too often
Not an answer, just a bunch of observations and questions...
Any one of the components in the data pipeline could be the bottleneck.
You first need to observe the row counts when running interactively in SSIS and see if there is any obvious clogging going on - i.e. do you have a large row count right before your Data Conversion transformation and a low one after? Or is it at the Oracle destination? Or is it just taking a long time to come out of SQL Server? A quick way to check the SQL side is to dump it to a local file instead - that mostly measures the SQL select performance without any blocking from Oracle.
When you run your source query in SQL Server, how long does it take to return all rows?
Your data conversion can be performed in the source query (see the sketch after this list). Every transformation requires setting up buffers, memory, etc., and can slow down and block your data flow. Avoid these and do the conversion in the source query instead.
Various buffers and config settings exist in the Oracle driver; these are already addressed in detail by @RogerCornejo. For read performance out of Oracle, I have found that altering FetchBufferSize made a huge difference, but you are doing writes here, so that's not the case.
Lastly, where are the two database servers and the SSIS client tool situated network-wise? If you are running this across three different servers, then you have network throughput to consider.
If you use a linked server as suggested, note that SSIS doesn't do any processing at all so you take that whole piece out of the equation
And if you're just looking for the fastest way to transfer data, you might find that dumping to a file and bulk inserting is the fastest
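For the point above about doing the data conversion in the source query: a minimal sketch on the SQL Server side, with hypothetical column names, casting to the types the destination expects so the Data Conversion transformation can be dropped:

-- convert in the source query instead of in a Data Conversion transformation
SELECT CAST(OrderId   AS int)            AS OrderId,
       CAST(OrderDate AS datetime2(0))   AS OrderDate,
       CAST(Amount    AS decimal(18, 2)) AS Amount
FROM   dbo.SourceOrders;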
Thank you all for your suggestions. For those who may run into a similar problem in the future, I'm posting what finally worked for me. The answer was ... switching the provider. The ODBC or Attunity providers were much faster, by a factor of almost 800X.
Remember that my goal was to move data from a SQL Server Database to an Oracle database. I originally used an OLE DB provider for both the source and destination. This provider works fine if you are moving data from SQL Server to SQL Server because it allows you to use the "Fast Load" option on the destination which in turn allows you to use batch processing.
However, the OLE DB provider doesn't allow the "Fast Load" option with an Oracle DB as the destination (I couldn't get it to work and read elsewhere that it doesn't work). Because I couldn't use the "Fast Load" option, I couldn't batch and instead was inserting records row by row, which was extremely slow.
A colleague suggested trying ODBC, and others suggested trying Microsoft's Attunity Connectors for Oracle. I didn't think the difference would be so great because in my experience ODBC had similar (and sometimes worse) performance than OLE DB (I hadn't tried Attunity). BUT... that was when moving data from and to a SQL Server database, or otherwise staying in the Microsoft world.
When moving data from a SQL Server database to an Oracle database, there was a huge difference! Both ODBC and Attunity outperformed OLE DB dramatically.
Here are my summarized performance test results for inserting 5.4M records from a SQL Server database into an Oracle database.
When doing all the work on one local computer.
OLE DB source and destination inserted 12 thousand records per minute which would have taken approx. 7 hours to complete.
ODBC source and destination inserted 9 Million records per minute which only took approx. 30 seconds to complete.
When moving data from one network/remote computer to another network/remote computer.
OLE DB source and destination inserted 115 records per minute which would have taken approx. 32 days to complete.
ODBC source and destination inserted 1 Million records per minute which only took approx. 5 minutes to complete.
Big difference!
Now, why it took only 30 seconds locally but 5 minutes remotely is another issue for another day; for now I have something workable (it should be slower over the network, but I'm surprised it's that much slower).
Thanks again to everyone!
Extra notes:
My OLE DB results were similar with either Microsoft's or Oracle OLE DB providers for Oracle databases.
Attunity was a little faster than ODBC. I didn't get to test on remote servers or on a larger data set, but locally it was consistently about 2 to 3 seconds faster than ODBC. Those seconds could add up on a large data set, so take note.
Loading data from my OLTP database (as part of ETL) via OPENQUERY or an SSIS Data Flow to another SQL Server database (the warehouse, which runs this SSIS package / OPENQUERY statement) kills it. As I checked in Performance Monitor, the resources being used are on the source database, not the destination. Is it possible to reverse this resource utilization (using SQL Server 2016 or SSIS)?
The problem here is in your destination write operation. If you are using an OLE DB Destination with the fast load access mode, try setting the Rows per batch value to a non-zero value and reduce the Maximum insert commit size to a value that is easy on your memory and CPU. SSIS will then not have to wait for the default of 2147483647 rows before committing to the destination table, which can have a large impact on your log file and slow your process down. Please refer to this article for more info on setting these values. All the best.
What does your export query look like? Is it just a simple data dump or does it contain some complex logic (e.g. denormalization/aggregation as part of the export)?
If it's just a simple export, check which server your SSIS package runs on and what resources it uses. In any case, you need to read the data from your source system, so expect some disk read operations.
In general, it is better to get the data out of the OLTP system as quickly as possible and then apply further operations in later steps of your ETL process on your ETL/data warehouse server, in order to reduce the impact on your transactional system.
Hope it helps.
Any tips for speeding up the import processes? There are a lot of joins in the DB.
Also, when an SSIS task is completed, would the best way to handle the next functions be by code, or by using the emailing notification SSIS has?
Here is a sample that I have used to illustrate loading 1 million rows in under 3 minutes from a text file into a SQL Server database. The package in the sample was created using SSIS 2008 R2 and was executed on a single-core Xeon 2.5 GHz CPU with 3.00 GB RAM.
Import records on SSIS after lookup
One of the main bottlenecks when importing a large number of rows will be the destination component. The faster the destination component can insert the rows, the faster the preceding source and transformation components can process them. Again, if you happen to have components like the Sort transformation, that will be different, because the Sort transformation holds up all the data for sorting before sending it down the pipeline.
Sending email depends on what you would like to do.
If you need a simple success or failure notification, you could simply use the Send Mail task. Another option is to enable alert notifications on the SQL jobs from which you might schedule the package to run on a regular basis.
If you need more information to be added to the email, then you might need to use a Script Task to formulate the message body. After creating the message body, you can send the mail from within Script Task or use Send Mail task.
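If the package is scheduled from SQL Server Agent, another option is to send the richer message from T-SQL (for example in an Execute SQL Task or a job step) via Database Mail. A minimal sketch, assuming a Database Mail profile named 'ETLMail' already exists; the profile name and recipient are hypothetical:

-- send a status mail through Database Mail
EXEC msdb.dbo.sp_send_dbmail
     @profile_name = 'ETLMail',
     @recipients   = 'etl-team@example.com',
     @subject      = 'Import package completed',
     @body         = 'The SSIS import finished; see the job history for row counts.';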
Hope that example, along with the article @Nikhil S provided, helps you fine-tune your package.
This SimpleTalk article discusses ways to optimize your data flow task
Horizontally partition your data-to-be-transferred into N data flows, where N is the number of CPU cores available on the server where SSIS is installed.
Play with the SSIS buffer size property to figure out the optimal setting for your kind of data.