Parallel Processing in SSIS - sql-server

I have the following scenario in SSIS. There are two packages, Outer.dtsx and Inner.dtsx. The Inner package is called from the Outer package as part of the workflow. Because the workload is heavy, I added a Sequence Container and, instead of running only one Inner package, placed several calls inside the container, so that multiple instances (10, to be exact) of the Inner package run in parallel. There is only one version of the Inner package; it is just called several times.
How does this scenario differ from having multiple copies of Inner (Inner_1, Inner_2, ..., Inner_10) and running them inside the Sequence Container? Does having multiple packages with the same definition improve performance compared to one package called several times? Which scenario is more efficient?

By definition, an SSIS package is a description of tasks and transformations, stored in XML format and executed by the SSIS engine. The engine can execute several instances of the same package simultaneously.
In practice, I performed the following experiment. I created a dummy package, InnerPkg, that loads a CSV file into an MSSQL table, with the file name and table name as parameters. Then I created a copy of it, InnerPkg1. I also created two copies of the source file and of the destination SQL table.
Note: I used different sources and destinations to avoid resource locking.
OuterPkg_Parallel calls two instances of InnerPkg, passing a different source file name and destination table name in each Execute Package Task.
OuterPkg_Copies calls InnerPkg and InnerPkg1 with the appropriate parameters.
Results (average of 5 runs):
OuterPkg_Parallel - 12.72 seconds
OuterPkg_Copies - 12.77 seconds
So the difference is negligible, as far as I can tell.
The tests were conducted on SQL Server 2016 (SSIS 2016) running on Windows Server 2016.
Bottom line: call a single package several times, as it has no visible performance penalty and greatly simplifies support.
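To put "negligible" into numbers, here is a small Python sketch that only redoes the arithmetic on the two averages reported above. The function name is made up for illustration, and it says nothing about run-to-run variance, which was not reported.

```python
# Average run times in seconds, taken from the results above.
parallel_s = 12.72  # OuterPkg_Parallel: one package, two instances
copies_s = 12.77    # OuterPkg_Copies: two separate copies of the package


def relative_difference(a: float, b: float) -> float:
    """Return |a - b| as a fraction of the smaller of the two values."""
    return abs(a - b) / min(a, b)


diff = relative_difference(parallel_s, copies_s)
print(f"absolute difference: {abs(parallel_s - copies_s):.2f} s")
print(f"relative difference: {diff:.2%}")
```

The gap is 0.05 s, i.e. well under one percent of the run time.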

Related

SSIS Package Full Table Load Slow

We have an SSIS package that the development team describes as 'slow'. Since they do not have anyone with SSIS ETL experience, I, as a DBA, tried digging into it. Below is what I found:
SQL Server 2014 was upgraded in place to 2017, so the server has SSIS of both versions.
They load a 200 GB SQL Server table into SSIS and then zip the data into a flat file using command-line zip functionality.
The data flow task simply runs a select * from view; the view contains nothing but the table, with no joins.
While troubleshooting I found that there is hardly any load on SQL Server, possibly because the select command is running single-threaded and not utilizing the SQL Server cores.
When I run the same select * command myself (only for 5 seconds, since it is a 200 GB table), my command is single-threaded too.
The package has a configuration file with some connection settings; the SQL job, which is how the package runs, references it.
Opening the package in BIDS shows DefaultBufferMaxRows as 10000 (possibly the default value). Since neither the configuration file nor any variable sets a custom value, I guess this is what the package is using too.
Both SQL Server and SSIS are on the same server. SQL Server has been allocated max memory, leaving around 100 GB for SSIS and the OS.
Kindly share any ideas on how I can force SQL Server to run this select command using multiple threads so that the entire table gets into the SSIS buffers faster.
Edit: I am aware that bcp can read data and save it to a flat file faster than other approaches, but at this point changes to the SSIS package have to be kept to a minimum, so I am exploring options that can be incorporated within the package.
Edit2: Parallelism works perfectly on my SQL Server, as I verified with a lot of other queries. The table in question is 200 GB. It is something specific to SSIS, which is not hitting my DB as hard as it should.
Edit3: I have made some progress: I adjusted the buffer size to 100 MB and max rows to 100000, and now the package seems to be doing better. When I run this package on the server directly using the dtexec utility, it generates a good load of 40-50 MB per second, but through the SQL job it never generates a load of more than 10 MB per second. So I am trying to figure out this behavior.
Edit4: I found that when I log on to the server and run the package directly with the dtexec utility, it runs well: it generates a good load on the DB, with data I/O remaining steady between 30 and 50 MB/sec.
The same thing from the SQL job never exceeds 10 MB/sec of I/O.
I even tried running the package from the Agent as a command-line operation, but nothing changed. The Agent performs really badly here; any pointers on what could be wrong?
Final Try:
I am stumped by what I finally observed:
1) The package runs 3x faster when run from a command prompt on the Windows node by invoking the dtexec utility.
2) The exact same package runs 3 times slower when invoked by SQL Agent, which has sysadmin permissions on Windows as well as on SQL Server.
In both cases I checked which version of dtexec gets invoked, and it is the same version. So why one would be so much slower is beyond my understanding.
I don't think there is a general solution to this issue, since it is a particular case and you didn't provide much information. Since there are two components in your data flow task (OLE DB Source and Flat File Destination), I will try to give some suggestions for each component.
Before giving suggestions for each component, it is worth mentioning the following:
If no transformations are applied within the data flow task, using this task is not recommended. It is preferable to use the bcp utility.
Check the TempDb and the database log size.
If a clustered index exists, try to rebuild it. If not, try to create one.
To find the component that is slowing the package execution, open the package in Visual Studio, remove the Flat File Destination, and replace it with a dummy Script Component (write any useless code, for example: string s = "";). Then run the package; if it is fast enough, the problem is caused by the Flat File Destination, otherwise you need to troubleshoot the OLE DB Source.
Try executing the query in SQL Server Management Studio and check the execution plan.
Check the TargetServerVersion property in the project configuration properties and make sure it is correct.
OLE DB Source
As you mentioned, you are using a Select * from view query where the data is stored in a table that contains a considerable amount of data. The SQL Server query optimizer may find that reading the data with a Table Scan is more efficient than reading from indexes, especially if your table does not have a clustered index (rowstore or columnstore).
There are many things you can try to improve the data load:
Try replacing the Select * from view with the original query used to create the view.
Try changing the data provider used in the OLE DB Connection Manager: SQL Server Native Client or the Microsoft OLE DB Driver for SQL Server (not the old provider).
Try increasing the DefaultBufferMaxRows and DefaultBufferSize properties.
Try using the SQL Command data access mode with specific column names instead of selecting the view name (Table or View data access mode).
Try to load the data in chunks.
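One way to read the "load data in chunks" suggestion is to split the table on an integer key into contiguous ranges and run one query per range, so that each range can feed its own data flow or output file. Below is a minimal Python sketch that only generates the query text. The view name dbo.BigView, the key column Id, the column list, and the key bounds are all hypothetical, and it assumes a reasonably dense integer key.

```python
from typing import List, Sequence, Tuple


def key_ranges(min_key: int, max_key: int, chunks: int) -> List[Tuple[int, int]]:
    """Split the inclusive range [min_key, max_key] into contiguous inclusive ranges."""
    if chunks < 1 or max_key < min_key:
        raise ValueError("need chunks >= 1 and max_key >= min_key")
    total = max_key - min_key + 1
    chunks = min(chunks, total)          # never create empty ranges
    base, extra = divmod(total, chunks)  # the first `extra` ranges get one more key
    ranges = []
    start = min_key
    for i in range(chunks):
        end = start + base + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def chunk_queries(source: str, key_column: str, columns: Sequence[str],
                  min_key: int, max_key: int, chunks: int) -> List[str]:
    """Build one SELECT per key range, listing the columns explicitly."""
    column_list = ", ".join(columns)
    return [
        f"SELECT {column_list} FROM {source} "
        f"WHERE {key_column} BETWEEN {lo} AND {hi}"
        for lo, hi in key_ranges(min_key, max_key, chunks)
    ]


# Hypothetical names and bounds, for illustration only.
for query in chunk_queries("dbo.BigView", "Id", ["Id", "Col1", "Col2"], 1, 10, 3):
    print(query)
```

Each generated statement could then be used as the SQL Command of a separate OLE DB Source. With a sparse or skewed key the ranges will not contain equal numbers of rows, so this is only a starting point.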
Flat File Destination
Check that the flat file directory is not located on the same drive where the SQL Server instance is installed.
Check that the flat file is not located on a busy drive.
Try to export the data into multiple flat files instead of one huge file (split the data into smaller files): as the amount of data exported to a single file grows, writing to that file becomes slower, and the package slows down with it. (See the 5th suggestion above.)
Any indexes on the table could slow loading. If there are any indexes, try dropping them before the load and recreating them afterwards. This would also update the index statistics, which would otherwise be skewed by the bulk insert.
Are you seeing SQL Server utilizing multiple cores for other queries? If not, maybe someone changed the following settings.
Check these under the server configuration settings:
Maximum Degree of Parallelism
Cost Threshold for Parallelism
Processor affinity (whether SQL Server is affinitized to specific CPUs)
A MAXDOP query hint can cause this too, but you said there is no fancy stuff in the view.
Also, it seems you have enough memory available, so why not increase DefaultBufferMaxRows to a very large number so that SQL Server isn't slowed down waiting for the buffer to be emptied? Remember, SQL Server and SSIS are using the same disk and have to wait for each other, which causes extra wait time for both. It is better for SQL Server to read the data and put it into the buffer, and then for SSIS to process it and write it to disk.
DefaultBufferSize : default is 10MB, max possible 2^31-1 bytes
DefaultBufferMaxRows : default is 10000
You can set AutoAdjustBufferSize to True so that DefaultBufferSize is calculated automatically based on DefaultBufferMaxRows.
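To see how the two properties interact, here is a simplified Python sketch of the sizing rule: a buffer holds at most DefaultBufferMaxRows rows and at most DefaultBufferSize bytes, whichever limit is hit first. The row sizes used are hypothetical (the real row size of the 200 GB table is not given), and the sketch ignores the engine's internal minimum buffer size and the AutoAdjustBufferSize option.

```python
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024  # 10 MB default, in bytes
DEFAULT_BUFFER_MAX_ROWS = 10_000        # default row limit


def rows_per_buffer(est_row_bytes: int,
                    max_rows: int = DEFAULT_BUFFER_MAX_ROWS,
                    buffer_bytes: int = DEFAULT_BUFFER_SIZE) -> int:
    """Rows that fit in one buffer, capped by both the row and the byte limit."""
    if est_row_bytes <= 0:
        raise ValueError("estimated row size must be positive")
    return min(max_rows, buffer_bytes // est_row_bytes)


# Hypothetical 2000-byte rows: the byte limit applies before the row limit.
print(rows_per_buffer(2000))
# Same rows with a 100 MB buffer and a 100000-row limit.
print(rows_per_buffer(2000, max_rows=100_000, buffer_bytes=100 * 1024 * 1024))
# Hypothetical 500-byte rows with the defaults: the row limit applies.
print(rows_per_buffer(500))
```

The point of the sketch is that raising only one of the two properties may change nothing if the other one is the binding limit.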
See other performance troubleshooting ideas here
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features?view=sql-server-ver15
Edit 1: Some other properties you can check out. These are explained in the link above as well.
MaxConcurrentExecutables (package property): defines how many executables (tasks) the package can run concurrently.
EngineThreads (Data Flow property): how many threads the data flow engine can use.
Also try running dtexec under the same proxy user that SQL Agent uses, to see whether you get a different result with this account than with your own. You can use runas /user:... cmd to open a command window as that user and then execute dtexec.
Try changing the proxy user used in SQL Agent to a new one and see whether that helps. Or try granting elevated permissions on the directories it needs to access.
Try keeping the package in the file system and executing it through dtexec directly from SQL Agent instead of using catalog.start_execution.
Not your case, but for other readers: if you have an Execute Package Task, make sure the child packages are set to run in-process by setting the ExecuteOutOfProcess property to False. This reduces the overhead of using more processes.
Not your case, but for other readers: if you're testing in BIDS, the package runs in debug mode by default and is therefore slow. Use CTRL-F5 (start without debugging). Best of all, use dtexec directly to test performance.
A data flow task may not be the best choice to move this data. SSIS Data Flow tasks are an ETL tool where you can do transformations and lookups, redirect invalid rows, add derived columns, and a lot more. If the data flow is simple and only moves data with no manipulation or redirection of rows, then ditch the Data Flow task and use a simple Execute SQL Task with OPENROWSET to import the flat file that was generated from the command line and zipped up. Assuming the flat file is a .csv file, here is an example that queries a .csv file and inserts the data into a table.
You need [Ad Hoc Distributed Queries] run_value set to 1.
-- DefaultDir is the folder that contains the file; the file itself is named in the inner query.
INSERT INTO dbo.Destination
SELECT *
FROM OPENROWSET('MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=D:\;Extensions=csv;',
    'select * from YourCsv.csv') AS CsvFile;
Here are some additional examples: https://sqlpowershell.blog/2015/02/09/t-sql-read-csv-files-using-openrowset/
There are suggestions in this MSDN article: MSDN DataFlow performance features
The key ones appear to be:
Check the EngineThreads property of the Data Flow task, which tells SSIS how many source and worker threads it should use.
If you use an OLE DB Source to select data from a view, use the "SQL Command" data access mode and write a SELECT * FROM View rather than using "Table or View".
Let us know how you get on.
You may be facing an I/O bottleneck while writing the 200 GB to the flat file. I don't see any problem with the SQL query.
If possible, create multiple files and split the data (either by modifying the SSIS package or by changing the select query).
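To put the I/O rates reported in the question in perspective, here is a back-of-the-envelope Python estimate of how long 200 GB takes at each rate. It assumes 1 GB = 1024 MB and that the quoted rate is sustained for the whole transfer, which is a simplification.

```python
def hours_to_transfer(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move size_gb at a constant rate (1 GB taken as 1024 MB)."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1024 / rate_mb_per_s / 3600


# Rates quoted in the question: about 10 MB/s via the SQL job, 30-50 MB/s via dtexec.
for rate in (10, 30, 50):
    print(f"{rate} MB/s -> {hours_to_transfer(200, rate):.1f} h")
```

At 10 MB/s the export takes roughly 5.7 hours, against about 1.1 to 1.9 hours at 30-50 MB/s, so the Agent-versus-dtexec gap matters more than any single buffer setting.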

Utilising multiple SQL Servers simultaneously via SSIS

I recently discovered, from a brilliant thread on this forum, how to utilise the processing resources of multiple SQL Servers simultaneously through SSMS: one registers multiple servers under View --> Registered Servers.
My question is: is it possible to encapsulate SQL statements in an Execute SQL Task that then utilises the resources of multiple servers simultaneously in SSIS, just as can be done within SSMS?
SSIS can certainly execute tasks against multiple servers at the same time, but you can't use multiple servers to share the execution of a single task. If you want the same SQL to execute against multiple servers simultaneously, the way multiserver execution works in SSMS, you must create a separate Execute SQL Task for each server; they can't share one task. Changing the executed SQL statement would then mean editing all of the tasks containing that statement, but you can avoid this by sourcing the statement from an SSIS variable. That way, you only need to change a single variable.
To execute multiple tasks at the same time, simply drag multiple precedence constraints (arrows) out of the parent task. If there is no parent task, just drop the Execute SQL Tasks on the design surface with no connection between them. James Serra wrote a quick blog entry about controlling parallel execution in SSIS quite a while ago, but the information is still current.

Using temporary tables in SSIS flow fails

I have an ETL process which extracts ~40 tables from a source database (Oracle 10g) to a SQL Server (2014 Developer Edition) staging environment. My process for extraction:
Determine the newest row in staging
Select all newer rows from the source
Insert the results into #TEMPTABLE
Merge the results from #TEMPTABLE into staging
This works on a package-by-package basis, both when run locally from Visual Studio and when executed from SSISDB on the SQL Server.
However, I am grouping my extract jobs into one master package for ease of execution and flow to the transform stage. Only about 5 of my packages use temporary tables; the others are all truncate-and-load, but I wanted to move some more to this method. When I run the master package, anything using a temporary table fails. Because of the pretty large log files it is hard to pinpoint the actual error, but so far all it tells me is that #TEMPTABLE can't be found and/or the status is VS_ISBROKEN.
Things I have tried:
Set DelayValidation = false on all relevant components
Master package has ExecuteOutOfProcess = true
Increased my tempdb capacity far beyond my needs
A thought I had was the RetainSameConnection = true setting on my staging database connection: could this be the cause? I would try to create separate connections for each package, but I assumed ExecuteOutOfProcess would take care of this for me.
EDIT
I created the following scenario:
Package A (Master package containing Execute Package Task references only)
Package B (Uses temp tables)
Package C (No temp tables)
Executing Package B on its own completes successfully. All temp table usage is contained within this package; there is no requirement for Package C to see the temp table created by Package B.
Executing Package C completes successfully.
Executing Package A: C completes successfully, B fails.
UPDATE
The workaround was to create a package-level connection for each package that uses temporary tables, thus ensuring that each package holds its own connection. I have raised a Connect issue with Microsoft, as I believe that when the parent package opens the connection, the child packages should inherit it and retain it throughout.
Several suggestions for your case.
Set RetainSameConnection=true. This will allow you to work safely with temp tables in SSIS packages.
I would not use ExecuteOutOfProcess: it increases your RAM footprint, since every child package starts in its own process, and decreases performance by adding process start-up lag. It was used in 32-bit environments to overcome the 2 GB limit, but on x64 it is no longer necessary.
Child package execution does not inherit connection object instances from its parent, so the same connection will not span all of your child packages.
SSIS packages with temp table operations are more difficult to debug (less obvious), so pay attention to testing.

SSIS Package Hangs Randomly on Execution

I'm working with an SSIS package that itself calls multiple SSIS packages and hangs periodically during execution.
This is a once-a-day package that runs every evening and collects new and changed records from our census databases and migrates them into the staging tables of our data warehouse. Each dimension has its own package that we call through this package.
So, the package looks like
Get current change version
Load last change version
Identify changed values
a-z - Move changed records to staging tables (Separate packages)
Save change version for future use
All of those are Execute SQL Tasks except for the record-moving tasks, which are twenty-some Execute Package Tasks (data move tasks) executed somewhat in parallel (max four at a time).
The strange part is that it almost always fails when executed by SQL Agent (using a proxy user) or dtexec, but never fails when I run the package through Visual Studio. I've added logging so that I can see where it stops, but it's inconsistent.
We didn't see any of this while working in our development / training environments, but the volume of data there is considerably smaller. I wonder if we're just doing too much at once.
I may, as a test, execute the tasks serially through SQL Server Agent to see whether the problem is a package calling a package, but I'd rather not, because we have a relatively short window in the evening to do this for seven database servers.
I'm slightly new to SSIS, so any advice would be appreciated.
Justin

SSIS Parallelism - Microsoft HPC Cluster?

I am new to SSIS and am trying to use its parallelism features to import data from a database.
My job is this: import a multi-terabyte database into a set of flat files as quickly as possible.
I was thinking of this:
I have a Microsoft Server 2008 HPC cluster (3 nodes) at my disposal. I was thinking of writing an HPC SOA job so that all three compute nodes can make independent connections to the SQL Server and import a portion of the data in parallel. Of course, this would have nothing to do with SSIS and would be an independent utility.
Then I came across SSIS and its parallel import features. My SSIS server is not very high-end: only a 4 GB machine. I am somewhat inclined to use SSIS because it is the standard Microsoft way of doing data import, and I wouldn't have to rewrite a lot of stuff and could possibly use existing transformations etc.
What is the best way to use custom tasks (or the available ones) to do this import in parallel?
Gitmo, I may misunderstand your question but will give it a shot. You need to move data from a SQL Server instance to multiple files, correct? You want to leverage the parallelised data movement functionality provided by SSIS. That means multiple simultaneously running Data Flow Tasks (DFTs). For each target file you can have only one DFT because of problems with concurrent writes.
To get multiple simultaneously running Data Flow Tasks where your source is a SQL Server database and your target is a set of files, you can possibly try the following ways (please note there are upper limits on the parallelization you can get out of SSIS based upon many factors including your CPU Core count, whether you are running in BIDS/Visual Studio or not, and various settings in your packages, your server(s), your SQL Server instance, and many other considerations):
1) The Multiple Simultaneous DFT Solution: A single SSIS package with one Connection Manager pointed to the source SQL Server database and many Connection Managers each pointed to a separate target file, plus one DFT for each target file. The DFTs are all disconnected from one another (no precedence constraints or green/red/blue lines/arrows). If there are pre- or post-ETL steps that need to run, a great way to parallelize these DFTs is to drop them all in a Sequence Container that is connected to the earlier and later tasks through precedence constraints/arrows. These disconnected DFTs in their own Sequence Container will all try to run simultaneously.
2) The Multiple Simultaneous DTEXEC Solution: Multiple SSIS packages, each with its own target-file-specific DFT. You manually run separate DTEXEC processes, either through separate CMD windows or through the GUI. #3 below is a variation on this solution and possibly a better one.
3) The Parent Master Package Running Multiple Children Packages Solution: Wrap the per-target-file packages developed in #2 above in a single Parent Master package. In the Parent package, have multiple simultaneously running Execute Package Tasks. Again, these Execute Package Tasks would be disconnected from other tasks. A good way to do this is to drop the multiple Execute Package Tasks in their own Sequence Container. As before, if the Execute Package Tasks are disconnected (no precedence constraints/arrows), they will all try to run simultaneously.
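As a rough illustration of the Multiple Simultaneous DTEXEC Solution, the separate dtexec processes can also be started from a small script instead of separate CMD windows. The Python sketch below assumes dtexec is on the PATH and that the packages live in the file system (dtexec's /File option). The package paths in the usage comment are hypothetical, and the runner is passed in as a parameter so the launcher can be tried without SSIS installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def dtexec_command(package_path: str) -> List[str]:
    """Build a dtexec command line that runs one package from the file system."""
    return ["dtexec", "/File", package_path]


def run_command(command: Sequence[str]) -> int:
    """Run one command, wait for it, and return its exit code."""
    return subprocess.run(list(command), check=False).returncode


def run_packages(package_paths: Sequence[str],
                 max_parallel: int = 4,
                 runner: Callable[[Sequence[str]], int] = run_command) -> List[int]:
    """Run one dtexec process per package, at most max_parallel at a time.

    Returns the exit codes in the same order as package_paths.
    """
    commands = [dtexec_command(path) for path in package_paths]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, commands))


# Example call (hypothetical paths; requires dtexec to be installed):
# exit_codes = run_packages([r"D:\Packages\Export_1.dtsx", r"D:\Packages\Export_2.dtsx"])
```

Threads are sufficient here because each worker only waits on an external process. The max_parallel cap is there for the same reason as the limits mentioned above: past some point, more simultaneous packages just compete for CPU, memory, and disk.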
Take a look at this excellent article from the Microsoft SQLCAT Team for some more ideas/insight: Top 10 SQL Server Integration Services Best Practices
There are likely variations on these same ideas and possibly other solutions available both inside and outside of SSIS. Good luck!
Please look at this post on using multithreading outside SSIS to achieve parallelism (multithreaded serial execution) without modifying much of the package:
http://sqljunkieshare.com/2011/12/21/parallelism-in-etl-process-ssis-2008-and-ssis-2012/

Resources