SSIS Package Full Table Load Slow - sql-server

We have an SSIS package that the development team describes as 'slow'. Since they have nobody with SSIS/ETL experience, I, as the DBA, tried digging into it. Here is what I found:
SQL Server 2014 was upgraded in place to 2017, so the server has both versions of SSIS installed.
They load a 200 GB SQL Server table into SSIS, write the data to a flat file, and then zip it using a command-line zip utility.
The data flow task simply runs select * from view. The view just wraps the table, with no joins or other logic.
While troubleshooting, I found that there is hardly any load on SQL Server, possibly because the select command runs on a single thread and does not use the available cores.
When I run the same select * command myself (only for 5 seconds, since it is a 200 GB table), it is also single-threaded.
The package runs from a SQL Agent job, and the job references a configuration file that holds some connection settings.
Opening the package in BIDS shows DefaultBufferMaxRows as only 10000 (probably the default). Since neither the configuration file nor any variable sets a custom value, I assume this is what the package uses at run time too.
Both SQL Server and SSIS are on the same server. SQL Server's max memory has been set so that around 100 GB is left for SSIS and the OS.
Please share any ideas on how I can force SQL Server to run this select command on multiple threads so that the entire table gets into the SSIS buffers faster.
Edit: I am aware that bcp can read the data and save it to a flat file faster than any other process, but at this point changes to the SSIS package have to be kept to a minimum, so I am exploring options that can be incorporated within the package.
Edit2: Parallelism works perfectly on my SQL Server, as I verified with a lot of other queries. The table in question is 200 GB. It is only SSIS that is not hammering my DB as hard as it should.
Edit3: I have made some progress. I raised the buffer size to 100 MB and max rows to 100000, and the package now seems to be doing better. When I run the package directly on the server using the dtexec utility, it generates a good load of 40-50 MB per second, but through the SQL job it never generates more than 10 MB per second, so I am trying to figure out this behavior.
Edit4: When I log on to the server and run the package by invoking the dtexec utility directly, it runs well: it generates a good load on the DB, and data I/O stays steady between 30 and 50 MB/sec.
The same package run from the SQL job never exceeds 10 MB/sec of I/O.
I even tried running the package from Agent as a command-line step, but nothing changed. Agent literally sucks here. Any pointers on what could be wrong?
Final Try:
I am stumped by what I finally observed:
1) The package runs 3x faster when run from a command prompt on the Windows node by invoking the dtexec utility.
2) The exact same package runs 3 times slower when invoked by SQL Agent, which has sysadmin permissions on Windows as well as SQL Server.
In both cases I checked which version of DTEXEC gets invoked, and it is the same version. So why one would be so slow is beyond my understanding.
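For scale, the gap between the two observed rates translates into hours of runtime. A rough sketch, assuming the full 200 GB has to be read at a steady rate and taking 1 GB as 1024 MB; the function name transfer_hours is made up for illustration:

```python
def transfer_hours(size_gb: float, rate_mb_per_sec: float) -> float:
    """Hours needed to move size_gb at a steady rate (1 GB = 1024 MB)."""
    seconds = size_gb * 1024 / rate_mb_per_sec
    return seconds / 3600


# Rates taken from the observations above: SQL Agent job vs. dtexec run by hand.
for rate in (10, 40):
    print(f"{rate} MB/s -> {transfer_hours(200, rate):.1f} h")
```

Under these assumptions, the read alone takes well over five hours at 10 MB/s and under an hour and a half at 40 MB/s.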

I don't think there is a general solution to this issue, since it is a specific case and you didn't provide much information. Since there are two components in your data flow task (OLE DB Source and Flat File Destination), I will give some suggestions for each.
Before that, it is worth mentioning the following:
If no transformations are applied within the data flow task, using this task is not recommended. The bcp utility is preferable.
Check the TempDb and the database log size.
If a clustered index exists, try to rebuild it. If not, try to create a clustered index.
To find the component that is slowing the package down, open the package in Visual Studio, remove the Flat File Destination, and replace it with a dummy Script Component (write any useless code, for example: string s = "";). Then run the package. If it is fast enough, the problem is caused by the Flat File Destination; otherwise you need to troubleshoot the OLE DB Source.
Try executing the query in SQL Server Management Studio and look at the execution plan.
Check the package TargetServerVersion property within the package configuration and make sure it is correct.
OLE DB Source
As you mentioned, you are using a Select * from view query where data is stored in a table that contains a considerable amount of data. The SQL Server query optimizer may find that reading data using Table Scan is more efficient than reading from indexes, especially if your table does not have a clustered index (row store or column store).
There are many things you may try to improve data load:
Try replacing the Select * from view with the original query used to create the view.
Try changing the data provider used in the OLE DB Connection Manager: SQL Server Native Client, Microsoft OLE DB provider for SQL Server (not the old one).
Try increasing the DefaultBufferMaxRows and DefaultBufferSize properties.
Try using the SQL Command data access mode with specific column names instead of selecting the view by name (Table or View data access mode).
Try to load data in chunks
Flat File Destination
Check that the flat file directory is not located on the same drive where SQL Server instance is installed
Check that the flat file is not located on a busy drive
Try exporting the data into multiple flat files instead of one huge file (split the data into smaller files). As a single file grows, writing to it becomes slower, and the package slows down with it. (See the fifth suggestion above.)
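One way to combine "load data in chunks" with "multiple flat files" is to split the table on an integer key and give each range its own source query and output file. A minimal sketch, assuming the table has a reasonably dense integer key; the names key_ranges, Id and export_N.txt are hypothetical and not taken from the package:

```python
def key_ranges(min_id: int, max_id: int, chunks: int) -> list[tuple[int, int]]:
    """Split the inclusive key range [min_id, max_id] into at most `chunks` contiguous ranges."""
    if chunks < 1 or max_id < min_id:
        raise ValueError("need chunks >= 1 and min_id <= max_id")
    total = max_id - min_id + 1
    size = -(-total // chunks)  # ceiling division: keys per chunk
    return [(lo, min(lo + size - 1, max_id))
            for lo in range(min_id, max_id + 1, size)]


# Hypothetical example: four range predicates, one per output file.
for n, (lo, hi) in enumerate(key_ranges(1, 1_000_000, 4), start=1):
    print(f"export_{n}.txt: WHERE Id BETWEEN {lo} AND {hi}")
```

Each range could feed its own OLE DB Source and Flat File Destination. Whether running them in parallel helps depends on how busy the target drive already is.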

Any indexes on the table could slow loading. If there are any indexes, try dropping them before the load and then recreating them after. This would also update the index statistics, which would be skewed by the bulk insert.

Are you seeing SQL server utilizing other cores too for other queries? If not, maybe someone played with the following settings:
Check these server configuration settings:
Maximum Degree of Parallelism
Cost Threshold for Parallelism
Whether processor affinity has been set (processors affinitized to specific CPUs).
A MAXDOP query hint can cause this too, but you said there is no fancy stuff in the view.
Also, it seems you have enough memory on the server, so why not increase DefaultBufferMaxRows to a very large number so that SQL Server isn't slowed down waiting for the buffer to empty? Remember, SQL Server and SSIS are using the same disk, so each has to wait for the other, which adds wait time for both. It's better for SQL Server to read the data and put it into the buffer, and then have SSIS process it and write it to disk.
DefaultBufferSize : default is 10MB, max possible 2^31-1 bytes
DefaultBufferMaxRows : default is 10000
You can set AutoAdjustBufferSize (available from SQL Server 2016) so that the buffer size is calculated automatically from DefaultBufferMaxRows.
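How the two limits interact can be sketched with simple arithmetic. This assumes the simplified rule that a buffer holds at most DefaultBufferMaxRows rows and at most DefaultBufferSize bytes, whichever limit is reached first; the row widths are made up, and the real engine applies further rules when sizing buffers:

```python
def rows_per_buffer(row_bytes: int,
                    max_rows: int = 10_000,
                    buffer_bytes: int = 10 * 1024 * 1024) -> int:
    """Rows that fit in one buffer: capped by the row limit and by the byte limit."""
    return min(max_rows, buffer_bytes // row_bytes)


# Made-up row widths, for illustration only.
print(rows_per_buffer(500))     # defaults: the row limit is reached first
print(rows_per_buffer(2_000))   # defaults: the byte limit is reached first
print(rows_per_buffer(500, max_rows=100_000, buffer_bytes=100 * 1024 * 1024))
```

With narrow rows the default row limit leaves much of the 10 MB buffer unused, so raising DefaultBufferMaxRows alone may already help. With wide rows, DefaultBufferSize has to grow as well.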
See other performance troubleshooting ideas here
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features?view=sql-server-ver15
Edit 1: Some other properties you can check out. These are explained in the above link as well
MaxConcurrentExecutables (package property): defines how many executables (tasks) in the package can run concurrently.
EngineThreads (Data Flow property): how many threads the data flow engine can use.
Also try running dtexec under the same proxy user that SQL Agent uses, to see whether that account gives a different result from your own account. You can use runas /user:... cmd to open a command window under that user and then execute dtexec.
Try changing the proxy user used in SQL Agent to a new one and see if it will help. Or try giving elevated permissions in the directories it needs access to.
Try keeping the package in file-system and execute through dtexec from the SQL Agent directly instead of using catalog.start_execution.
Not your case, but for other readers: if you have an "Execute Package Task", make sure the child packages are set to run in-process (ExecuteOutOfProcess set to False). This reduces the overhead of spawning more processes.
Not your case, but for other readers: if you're testing in BIDS, the package runs in debug mode by default and is therefore slow. Use CTRL-F5 (start without debugging). Best of all, use dtexec directly to test performance.
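When comparing dtexec runs under different accounts, it helps to launch and time them the same way each time. A minimal sketch, assuming dtexec is on the PATH and the package is stored in the file system; the paths and the helper names dtexec_command and timed_run are hypothetical:

```python
import subprocess
import time
from typing import Optional


def dtexec_command(package_path: str, config_path: Optional[str] = None) -> list[str]:
    """Build a dtexec command line for a package stored in the file system."""
    cmd = ["dtexec", "/File", package_path]
    if config_path:
        cmd += ["/ConfigFile", config_path]
    return cmd


def timed_run(cmd: list[str]) -> tuple[int, float]:
    """Run a command and return (exit code, elapsed seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(cmd)
    return completed.returncode, time.perf_counter() - start


# Hypothetical paths; pass the result to timed_run(...) on the server to get a timing.
print(dtexec_command(r"D:\Packages\Export.dtsx", r"D:\Packages\Export.dtsConfig"))
```

Running the same script once from your own session and once from a command window opened with runas gives two timings that differ only in the account used.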

A data flow task may not be the best choice to move this data. SSIS Data Flow tasks are an ETL tool where you can do transformations, lookups, redirect invalid rows, add derived columns and a lot more. If the data flow is simple and only moves data with no manipulation or redirection of rows, then ditch the Data Flow task and use a simple Execute SQL Task with OPENROWSET to import the flat file that was generated from the command line and zipped up. Assuming the flat file is a .csv file, here is a working example that queries a .csv and inserts the data into a table.
This requires the [Ad Hoc Distributed Queries] server option to have a run_value of 1.
-- Load every row of YourCsv.csv (located in the folder D:\) into dbo.Destination
INSERT INTO dbo.Destination
SELECT *
FROM OPENROWSET('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=D:\;Extensions=csv;',
    'SELECT * FROM YourCsv.csv') AS CsvFile;
Here are some additional examples: https://sqlpowershell.blog/2015/02/09/t-sql-read-csv-files-using-openrowset/

There are suggestions in this MSDN article: MSDN DataFlow performance features
Key ones appear to be:
Check the EngineThreads property of the Data Flow task, which tells SSIS how many source and worker threads it should use.
If you are using an OLE DB Source to select data from a view, use the "SQL Command" data access mode and write a SELECT * FROM View statement rather than using "Table or View".
Let us know how you get on

You may be facing an I/O bottleneck while writing the 200 GB to the flat file. I don't see any problem with the SQL query.
If possible, create multiple files and split the data between them (either by modifying the SSIS package or by changing the select query).

Related

BULK INSERT Task Issues

First, I am new to SSIS so I am still getting the hang of things.
I am using Visual Studio 19 and SSMS 19
Regardless, I have set up an OLE DB package that loads a .TSV file into a table in SSMS. The issue is that it took 1 hour and 11 minutes to execute for 500,000 rows.
The data is extremely variable, so I have set up a staging table in SSMS that is essentially all varchar(max) columns. Once all the data is inserted, I was going to look at some aggregations like max(len(<column_name>)) in order to better optimize the table and the SSIS package.
Anyway, there are 10 of these files, so I need to create a ForEach File loop. This would take at minimum (about 1.18 hours)*10 = about 11.8 hours of total runtime.
I thought this was a bit long and created a Bulk Insert Task, but I am having some issues.
It seems very straightforward to set up.
I added the Bulk Insert Task to the Control Flow tab and went into the Bulk Insert Task Editor Dialogue Box.
From here, I configured the Source and Destination connections, both of which went very smoothly. I only have one local instance of SQL Server on my machine, so I used localhost.<database_name> and the table name for the Destination Connection.
I run the package and it executes just fine without any errors or warnings. It takes less than a minute for a roughly 600 MB .TSV file to load into an SSMS table with about 300 varchar(max) columns.
I thought this was too quick and it was. Nothing loaded, but the package executed!!!
I have tried searching for this issue with no success. I checked my connections too.
Do I need Data Flow Tasks for Bulk Insert Tasks? Do I need any connection managers? I had to configure Data Flow Tasks and connection managers for the OLE DB package, but the articles I have referenced do not do this for Bulk Insert Tasks.
What am I doing wrong?
Any advice from someone more well-versed in SSIS would be much appreciated.
Regarding my comment about using a derived column in place of a real destination, it would look like 1 in the image below. You can do this in a couple of steps:
Run the read task only and see how long this takes. Limit the total read to a sample size so your test does not take an hour.
Run the read task with a derived column as a destination. This will test the total read time, plus the amount of time to load the data into memory.
If 1) takes a long time, it could indicate slow read times on the disk where the file is, or a network bottleneck if the file is on a shared drive on another server. If 2) adds a lot more time, it could indicate a memory bottleneck on the server where SSIS is running. Please note that testing this on the server is the best way to test performance, because it removes a lot of issues that probably won't exist there, such as network bottlenecks and memory constraints.
Lastly, please turn on the feature noted as 2) below, AutoAdjustBufferSize. This will change the settings for DefaultBufferSize (max memory in the buffer) and DefaultBufferMaxRows (total rows allowed in each buffer, these are the numbers that you see next to the arrows in the dataflow when you run the package interactively). Because your column sizes are so large, this will give a hint to the server to maximize the buffer size which gives you a bigger and faster pipeline to push the data through.
One final note: if adding the real destination has a significant impact on time, look into issues with the target table. Make sure there are no indexes (including a clustered index), that TABLOCK is on, and that there are no constraints or triggers.

Sql Server JDBC driver pagination

I am using the latest (sqljdbc42) SQL Server JDBC driver available for official download. The SQL Server source DBs can vary from 2008 to 2016.
I flipped through several existing threads about setFetchSize(...) and it seems that:
If there are 1000 rows and the fetch size is 100, the result set will hold only 100 records in memory at a time and will make 10 network trips, fetching the next 100 records as rs.next() is called
It's unclear whether the SQL Server JDBC driver honors the fetch size
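The behavior described in the first point can be pictured as batch-wise iteration. The following is a language-neutral sketch in Python, not JDBC code: the helper fetch_in_batches is hypothetical and only models the idea of holding one batch in memory at a time. It says nothing about whether the SQL Server driver actually honors the fetch size.

```python
from itertools import islice
from typing import Iterable, Iterator


def fetch_in_batches(rows: Iterable, fetch_size: int) -> Iterator[list]:
    """Yield rows in lists of at most fetch_size, holding one batch at a time."""
    it = iter(rows)
    while batch := list(islice(it, fetch_size)):
        yield batch


# 1000 rows with a fetch size of 100: each batch stands for one network trip.
round_trips = sum(1 for _ in fetch_in_batches(range(1000), 100))
print(round_trips)
```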
I am writing a custom library for our internal use. It will select all the data from specific tables, iterate over the results and write them to a stream. I get an OutOfMemoryError when I run it for several tables (each with thousands to hundreds of thousands of rows), but not when I iterate over a single table with several thousand rows. I suspect that somewhere a large table is causing the issue.
While I will continue to debug my code, I wish to know whether setFetchSize(...) really works with the latest SQL Server JDBC driver.
Note: The table does NOT have any incremental columns/surrogate keys on which I can manually paginate, so I need something out of the box.
*****Edit-1*****
As suggested in the comment by #a_horse_with_no_name, I think 'responseBuffering=adaptive' should be specified explicitly, and the documentation describes some ways to do it. The challenge is:
I receive a custom DB connection pool instance as an argument and I have to use it to get a Connection object, so there is no way I can specify 'responseBuffering=adaptive' in the DB connection URL or elsewhere. All I get is a Connection object to use.
The only relevant (?) method I can use is setClientInfo(...), but I am unsure whether that would help.

Pulling instead pushing data from database

Loading data from my OLTP database (as part of ETL) via OPENQUERY or an SSIS Data Flow into another SQL Server database (the warehouse, which runs this SSIS package / OPENQUERY statement) kills the source. As I saw in Performance Monitor, the load uses resources on the source database server, not on the destination. Is it possible to reverse this resource utilization (using SQL Server 2016 or SSIS)?
The problem here is in your destination write operation. If you are using an OLE DB Destination with the fast load access mode, try setting the rows per batch to a non-zero value and reducing the maximum insert commit size to a value that will be easy on your memory and CPU. SSIS will then not have to wait for the default of 2147483647 before writing to the destination table, which can have a large impact on your log file and slow your process down. Please refer to this article for more info on setting these values. All the best
What does your export query look like? Is it just a simple data dump, or does it contain some complex logic (e.g. denormalization/aggregation as part of the export)?
If it's just a simple export, check on which server your SSIS package runs and what resources it uses. In any case, you need to read the data from your source system, so expect some disk reads there.
In general, to reduce the impact on your transactional system, it is better to get the data out of the OLTP database as quickly as possible and then apply other operations in later steps of your ETL process on your ETL/data warehouse server.
Hope it helps.

Data streams in case of Merge

We are seeing enormous amounts of data traffic to and from our SSIS server, and we cannot find the culprit. Is there any way to find out which package is causing all the traffic? We suspect that all the merges we do may be the cause. Our SSIS machine gets data from several production SQL Servers and merges it with data in our warehouses. Does that mean that
a) new data is transferred to the SSIS machine,
b) existing data is transferred to the SSIS machine,
c) the merge is done and then all data is transferred to the warehouse?
Then how would you go about limiting all the data moved from and to?
The answer to your questions a, b and c (if you're using SSIS transformation components in SSIS) is essentially “yes, all new data and existing data required for transformation will flow into SSIS instance, and the resulting merged data will flow out of SSIS instance to the target server”. More detailed explanation is below.
Assuming that you are using SQL Server 2012 and above, you would be able to enable Verbose logging to capture the number of rows transferred. The details are captured in [catalog].[execution_data_statistics]. If you are looking for the size in bytes, you would need to calculate that based on the columns that are being extracted and transformed against the number of rows. The [catalog].[execution_data_statistics] captures package name, task name, data flow path and source/destination component name, the time of execution and execution path, which is great for diagnosing.
SSIS is an in-memory pipeline. If you have 3 separate servers (Source, SSIS and Target), the amount of data traffic will vary. For example, if the Data Flow Tasks require transformation and use components such as Merge, Merge Join or Lookup, you can expect data to flow from the Source Server to the SSIS Server and then on to the Target Server.
On the other hand if you are running a simple Data Flow Task with SQL Server Destination for the Target between 2 databases with the same source and target, SSIS will issue a BULK INSERT statement on the target (= source = SSIS server) instance. In this case, there will be very low data traffic across the network (at least not related to the BULK INSERT statement).
If your package contains an “Execute SQL Task” component that invoke MERGE t-sql statements, this would not cause data traffic into/out of SSIS Server. The activity will be done on the SQL Server instance that the MERGE statement is executed on. If you are using Linked Servers, then the data will flow into/out of linked server as required by the MERGE statement just the same way as if you're invoking the statement on the instance.
My recommendation for limiting the amount of data moved from and to, is to be selective at the source level. For example, if you know that you are only going to be using ColumnA, ColumnB, ColumnC in dbo.Customer, then use
SELECT [ColumnA], [ColumnB], [ColumnC] FROM [dbo].[Customer] -- Better!
instead of the following statement which potentially can retrieve more than those 3 columns:
SELECT *
FROM [dbo].[Customer] -- Do Not Use
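To gauge how much traffic a narrower column list saves, the rows-times-column-widths estimate mentioned earlier can be sketched as follows. All widths here are hypothetical averages in bytes, ColumnD stands in for any wide column that SELECT * would drag along, and estimated_bytes is an illustrative helper rather than anything SSIS provides:

```python
def estimated_bytes(rows: int, column_bytes: dict[str, int], columns=None) -> int:
    """Rough transfer size: row count times the summed width of the selected columns."""
    selected = column_bytes if columns is None else {c: column_bytes[c] for c in columns}
    return rows * sum(selected.values())


# Hypothetical average widths in bytes for dbo.Customer.
widths = {"ColumnA": 4, "ColumnB": 50, "ColumnC": 8, "ColumnD": 2000}
rows = 1_000_000

print(estimated_bytes(rows, widths))                                     # SELECT *
print(estimated_bytes(rows, widths, ["ColumnA", "ColumnB", "ColumnC"]))  # three columns only
```

The row count can come from [catalog].[execution_data_statistics]. The widths have to be estimated from the column definitions or from sampled data.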
There are also a number of best practices to optimize SSIS including reducing bandwidth and optimizing the amount of data transferred, that you can follow. Please have a read here: http://blogs.msdn.com/b/sqlcat/archive/2013/09/16/top-10-sql-server-integration-services-best-practices.aspx.
If you are working on Hybrid platform, you may also be interested in reading "SSIS for Azure and Hybrid Data Movement" white paper (https://msdn.microsoft.com/en-us/library/jj901708.aspx). This white paper has an additional link to "SSIS Operational and Tuning Guide" that would be useful as well.
In addition, you may also be interested in having a look at SSIS Reporting Pack available on CodePlex to get more visualization of SSIS executions on the server.
Hope this helps.
Julie

DataFlow task in SSIS is very slow as compared to writing the sql query in Execute SQL task

I am new to SSIS and have a pair of questions
I want to transfer 1,25,000 rows from one table to another in the same database. But when I use a Data Flow Task, it takes too much time. I tried using an ADO NET Destination as well as an OLE DB Destination, but the performance was unacceptable. When I wrote the equivalent query inside an Execute SQL Task, it provided acceptable performance. Why is there such a difference in performance?
INSERT INTO table1 select * from table2
Based on the first observation, I changed my package. It is now composed exclusively of Execute SQL Tasks, each with either a direct query or a stored procedure. If I can solve my problem using only the Execute SQL Task, then why would one use SSIS as so many documents and articles recommend? I have read that it's reliable, easy to maintain and comparatively fast.
Difference in performance
There are many things that could cause a difference in performance between a "straight" data flow task and the equivalent Execute SQL Task.
Network latency. You are performing insert into table a from table b on the same server and instance. In an Execute SQL Task, that work would be performed entirely on the same machine. However, I could run a package on server B that queries 1.25M rows from server A, which are then streamed over the network to server B. That data will then be streamed back to server A for the corresponding INSERT operation. If you have a poor network, wide data (especially binary types), or simply a great distance between servers (server A is in the US, server B is in India), there will be poor performance.
Memory starvation. Assuming the package executes on the same server as the target/source database, it can still be slow as the Data Flow Task is an in-memory engine. Meaning, all of the data that is going to flow from the source to the destination will get into memory. The more memory SSIS can get, the faster it's going to go. However, it's going to have to fight the OS for memory allocations as well as SQL Server itself. Even though SSIS is SQL Server Integration Services, it does not run in the same memory space as the SQL Server database. If your server has 10GB of memory allocated to it and the OS uses 2GB and SQL Server has claimed 8GB, there is little room for SSIS to operate. It cannot ask SQL Server to give up some of its memory so the OS will have to page out while trickles of data move through a constricted data pipeline.
Shoddy destination. Depending on which version of SSIS you are using, the default access mode for an OLE DB Destination was "Table or View." This was a nice setting to try to prevent a low-level lock from escalating to a table lock. However, it results in row by agonizing row inserts (1.25M unique insert statements being sent). Contrast that with the set-based approach of the Execute SQL Task's INSERT INTO. More recent versions of SSIS default the access mode to the "fast load" version of the destination, which behaves much more like the set-based equivalent and yields better performance.
OLE DB Command Transformation. There is an OLE DB Destination and some folks confuse that with the OLE DB Command Transformation. Those are two very different components with different uses. The former is a destination and consumes all the data. It can go very fast. The latter is always RBAR. It will perform singleton operations for each row that flows through it.
Debugging. There is overhead in running a package in BIDS/SSDT. The package execution gets wrapped in the DTS Debugging Host, which can cause a "not insignificant" slowdown. There's not much the debugger can do about an Execute SQL Task: it runs or it doesn't. With a data flow, there's a lot of memory it can inspect, monitor, etc., which reduces the amount of memory available (see pt 2) and slows things down because of the assorted checks it performs. To get a more accurate comparison, always run packages from the command line (dtexec.exe /file MyPackage.dtsx) or schedule them from SQL Server Agent.
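The difference between row-by-row and set-based inserts (the "Shoddy destination" and "OLE DB Command Transformation" points) can be illustrated with a small stand-in. This sketch uses Python's built-in sqlite3 module with an in-memory database purely as a substitute for SQL Server; table1 and table2 come from the question, while the id and payload columns are hypothetical. Absolute timings are not representative, since there is no network or server round trip here; the point is only one statement per row versus one statement in total.

```python
import sqlite3
import time


def make_db(n: int) -> sqlite3.Connection:
    """Create an in-memory database with an empty table1 and n rows in table2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (id INTEGER, payload TEXT)")
    conn.execute("CREATE TABLE table2 (id INTEGER, payload TEXT)")
    conn.executemany("INSERT INTO table2 VALUES (?, ?)",
                     ((i, "x" * 50) for i in range(n)))
    conn.commit()
    return conn


def copy_row_by_row(conn: sqlite3.Connection) -> None:
    """One INSERT statement per row, like an RBAR destination."""
    rows = conn.execute("SELECT id, payload FROM table2").fetchall()
    for row in rows:
        conn.execute("INSERT INTO table1 (id, payload) VALUES (?, ?)", row)
    conn.commit()


def copy_set_based(conn: sqlite3.Connection) -> None:
    """A single set-based statement, like the Execute SQL Task in the question."""
    conn.execute("INSERT INTO table1 SELECT * FROM table2")
    conn.commit()


for copy in (copy_row_by_row, copy_set_based):
    conn = make_db(125_000)
    start = time.perf_counter()
    copy(conn)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]
    print(f"{copy.__name__}: {count} rows in {elapsed:.3f}s")
```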
Package design
There is nothing inherently wrong with an SSIS package that is just Execute SQL Tasks. If the problem is easily solved by running queries, then I'd forgo SSIS entirely and write the appropriate stored procedure(s) and schedule it with SQL Agent and be done.
Maybe. What I still like about using SSIS even for "simple" cases like this is it can ensure a consistent deliverable. That may not sound like much, but from a maintenance perspective, it can be nice to know that everything that is mucking with the data is contained in these source controlled SSIS packages. I don't have to remember or train the new person that tasks A-C are "simple" so they are stored procs called from a SQL Agent job. Tasks D-J, or was it K, are even simpler than that so it's just "in line" queries in the Agent jobs to load data and then we have packages for the rest of stuff. Except for the Service Broker thing and some web services, those too update the database. The older I get and the more places I get exposed to, the more I can find value in a consistent, even if overkill, approach to solution delivery.
Performance isn't everything, but the SSIS team did set the ETL benchmarks using SSIS so it definitely has the capability to push some data in a hurry.
As this answer grows long, I'd simply leave it as the advantages of SSIS and the Data Flow over straight TSQL are native, out of the box
logging
error handling
configuration
parallelization
It's hard to beat those for my money.
If you are passing SSIS variables as parameters in the Parameter Mapping tab and assigning values to these variables by expression, then your Execute SQL Task spends a lot of time evaluating that expression.
Use a separate Expression Task to assign the variables instead of setting an expression on the variable in the Variables tab.

Resources