BULK INSERT Task Issues - sql-server

First, I am new to SSIS so I am still getting the hang of things.
I am using Visual Studio 19 and SSMS 19
Regardless, I have set-up an OLE DB Package from .TSV file to table in SSMS. The issue is that it took 1 hour and 11 minutes to execute for 500,000 rows.
The data is extremely variable so I have set-up a staging table in SSMS that is essentially all varchar(max) columns. Once all the data is inserted, then I was going to look at some aggregations like max(len(<column_name>)) in order to better optimize the table and the SSIS package.
Anyways, there are 10 of these files so I need to create a ForEach File loop. This would take at minimum (1.17 hours)*10=11.70 hours of total runtime.
I thought this was a bit long and created a BULK INSERT Task, but I am having some issues.
It seems very straightforward to set-up.
I added the Bulk Insert Task to the Control Flow tab and went into the Bulk Insert Task Editor Dialogue Box.
From here, I configured the Source and Destination connections. Both of which went very smoothly. I only have one local instance of SQL Server on my machine so I used localhost.<database_name> and the table name for the Destination Connection.
I run the package and it executes just fine without any errors or warnings. It takes less than a minute for a roughly 600 MB .TSV file to load into a SSMS table with about 300 columns of varchar(max).
I thought this was too quick and it was. Nothing loaded, but the package executed!!!
I have tried searching for this issue with no success. I checked my connections too.
Do I need Data Flow Tasks for Bulk Insert Tasks? Do I need any connection managers? I had to configure Data Flow Tasks and connection managers for the OLE DB package, but the articles I have referenced do not do this for Bulk Insert Tasks.
What am I doing wrong?
Any advice from someone more well-versed in SSIS would be much appreciated.

Regarding my comment about using a derived column in place of a real destination, it would look like 1 in the image below. You can do this in a couple of steps:
Run the read task only and see how long this takes. Limit the total read to a sample size so your test does not take an hour.
Run the read task with a derived column as a destination. This will test the total read time, plus the amount of time to load the data into memory.
If 1) takes a long time, it could indicate a bottleneck with slow read times on the disk where the file is or a network bottleneck if the file is on another server on a shared drive. If 2) adds a lot more time, it could indicate a memory bottleneck on the server that SSIS is running. Please note that you testing this on a server is the best way to test performance, because it removes a lot of issues that probably won't exist there such as network bottlenecks and memory constraints.
Lastly, please turn on the feature noted as 2) below, AutoAdjustBufferSize. This will change the settings for DefaultBufferSize (max memory in the buffer) and DefaultBufferMaxRows (total rows allowed in each buffer, these are the numbers that you see next to the arrows in the dataflow when you run the package interactively). Because your column sizes are so large, this will give a hint to the server to maximize the buffer size which gives you a bigger and faster pipeline to push the data through.
One final note, if you add the real destination and that has a significant impact on time, you can look into issues with the target table. Make sure there are no indexes including a cluster index, make sure tablock is on, make sure there are no constraints or triggers.

Related

Is there a way I can fast load data with SSIS?

I'm moving data from ODBC to OLE Destination, records get inserted everyday on the ODBC in different tables. The packages gets slower and slower it take about a day for million records sometimes more. The tables can have new data inserted or new updated data and the loading and looking up of new data slows the processs. Is the anyway i can fast track the ETL process or is there any open source platform i can use to load the data faster
Tried to count the number of rows in the OLE Destination to check and only insert new records that are greater than the ones in the ODBC Source, but to my surprise the ROW_NUMBER() function isn't supported in Openedge ODBC
Based on the limited information in your question, I'd design your packages like the following
SEQC PG to SQL
The point of these operations is to transfer data from our source system verbatim to the target. The target table should be brand new and the SQL Server equivalent of the PG table from a data type perspective. Clustered Key if one exists, otherwise, see how a heap performs. I am going to reference this as a staging table.
The Data Flow itself is going to be bang simple
By default, the destination will perform a fast load and lock the table.
Run the package and observe times.
Edit the OLE DB Destination and change the Maximum Commit Size to something less than 2147483647. Try 100000 - is it better, worse? Move up/down an order of magnitude until you have an idea of what it looks like will be the fastest the package can move data.
There are a ton of variables at this stage of the game - how busy is the source PG database, what are the data types involved, how far does the data need to travel from the Source, to your computer, to the Destination but this can at least help you understand "can I pull (insert large number here) rows from the source system within the expected tolerance" If you can get the data moved from PG to SQL within the expected SLA and you still have processing time left, then move on to the next section.
Otherwise, you have to rethink your strategy for what data gets brought over. Maybe there's reliable (system generated) insert/update times associated to the rows. Maybe it's a financial-like system where rows aren't updated, just new versions of the row are insert and the net values are all that matters. Too many possibilities here but you'll likely need to find a Subject Matter Expert on the system - someone who knows the logical business process the database models as well as how the data is stored in the database. Buy that person some tasty snacks because they are worth their weight in gold.
Now what?
At this point, we have transferred the data from PG to SQL Server and we need to figure out what to do with it. 4 possibilities exist
The data is brand new. We need to add the row into the target table
The data is unchanged. Do nothing
The data exists but is different. We need to change the existing row in the target table
There is data in the target table that isn't in the staging table. We're not going to do anything about this case either.
Adding data, inserts, are easy and can be fast - it depends on table design.
Changing data, updates, are less easy in SSIS and are slower than adding new rows. Slower because behind the scenes, the database will delete and add the row back in.
Non-Clustered indexes are also potential bottlenecks here, but they can also be beneficial. Welcome to the world of "it depends"
Option 1 is to just write the SQL statements to handle the insert and update. Yes, you have a lovely GUI tool for creating data flows but you need speed and this is how you get it (especially since we've already moved all the data from the external system to a central repository)
Option 2 is to use a Data Flow and potentially an Execute SQL Task to move the data. The idea being, the Data Flow will segment your data into New which will use an OLE DB Destination to write the inserts. The updates - it depends on volume what makes the most sense from an efficiency perspective. If it's tens, hundreds, thousands of rows to update, eh take the performance penalty and use an OLE DB Command to update the row. Maybe it's hundreds of thousands and the package runs good enough, then keep it.
Otherwise, route your changed rows to yet another staging table and then do a mass update from the staged updates to the target table. But at this point, you just wrote half the query you needed for the first option so just write the Insert and be done (and speed up performance because now everything is just SQL Engine "stuff")
You might want to investigate Progress' Change Data Capture feature. If you have a modern release of OpenEdge (11.7 or better) and the proper licenses you can enable CDC policies to track changes. Your ETL process could then use that information to target its efforts.
Warning: it's complicated. There is a lot more to actually doing it than marketing would have you believe. But if your use-case is straight-forward it might not be too terrible.
Or you could implement Progress "Pro2" product to do all the dirty work for you. (That's an extra cost option.)

Optimalizations of SSIS package

I'm looking for some tips on optimalizing SSIS packages. I run a lot of them but the execution time is high. Some packages run on thousands or milions(2-15) of records. As the night time is not enough anymore they overlap and run 3-4 at once sometimes and that makes it even more difficult.
I did some testings.
I found that vievs are really bad in SSIS... They run faster when I run a SQL query and select into the view into a table and then go with the OleDB SOurce/Destination work.
As I look on the executions the source rows are selected in 5-10 min with over 5milin records but it takes 10 times more time to insert that data into destination table. As it goes I get this informations.
[SSIS.Pipeline] Information: The buffer manager detected that the system was low on virtual memory, but was unable to swap out any buffers. 4 buffers were considered and 4 were locked. Either not enough memory is available to the pipeline because not enough is installed, other processes are using it, or too many buffers are locked.
Information: Buffer manager allocated 0 megabyte(s) in 0 physical buffer(s).
Are there any options to make things move faster? Do I need to invrease DefaultBufferSize and DefaultBufferMaxRows at the same time? It goes on default 10k and 10mil.
A few notes to get you going in the right direction:
About this error below, this indicates that the server running your ssis packages ran out of memory. As suggested, this could be caused because you have other processed consuming that memory, i.e. other packages running at the same time.
[SSIS.Pipeline] Information: The buffer manager detected that the
system was low on virtual memory, but was unable to swap out any
buffers. 4 buffers were considered and 4 were locked. Either not
enough memory is available to the pipeline because not enough is
installed, other processes are using it, or too many buffers are
locked.
SSIS uses memory buffers to process data through a dataflow. There will be a max of 5 used at any given time. In this case, SSIS was unable to allocate the buffers needed to run the dataflow because there was not enough memory available. Increasing the size of the memory buffer will only make things worse if the server is low on memory.
Here's more details about how SSIS runs stuff so you can consider how memory is being used. If you have multiple dataflows running in parallel in the same package, by default SSIS will run as many as the number of processors on your server +2. So if you have 4 cores, you'll have 6 dataflows running at the same time. In a dataflow, parallel execution is managed by the EngineThreads property. The important thing to note here is just that the number of things that a data flow is doing between sources and transformations is limited. So the less complex a dataflow, the more efficient it will run.
What this means for you in a memory constrained system:
Limit the number of packages running at the the same time
Limit the number of dataflows running concurrently in a package
As a best practice, only have one source and destination in a dataflow
Keep the dataflows simple. Use them to move data from one server to another. Use SQL to do transformation work. Use staging tables to land your data and then act on it before it is loaded into the final table. Even if you are not doing any transformations, it will still be more efficient to load the table using SQL and a WHERE NOT EXISTS clause.
Once you've simplified the dataflow, but you are still seeing poor performance, isolate the bottleneck. Run the dataflow with just the source and see how long it takes. If that's slow, tune the source query. Add subsequent tasks to the dataflow and measure the performance. If you're only step that is slow is the destination, tune that.
For tuning a destination, try using a staging table with no indexes on it. Use TABLOCK which will tell SQL to use minimal logging. Once you've implemented the staging pattern, tune the SQL to load the data into the final table.
For tuning the source, avoid views because they may contain unnecessary joins and columns that you are not using, or use a view that is dedicated to your process so it can be written for exactly what you need. A proc can be a better choice overall if there are a lot of tables involved - performance can be improved by using temp tables to stage data before joining it to the final largest table for the select.
Do not use the WITH (NOLOCK) hint in production. If the data is actually being updated at the time that you are reading it, you'd want your data to have those updates before you load it, nevermind phantom reads and the other terrible things that can happen.

SSIS Package Full Table Load Slow

We have an SSIS package that is apparently termed as 'slow' by the development team. Since they do not have a person with SSIS ETL, as a DBA I tried digging into it. Below is the information I found:
SQL Server was 2014 version upgraded -inplace to 2017 so it has SSIS of both versions.
They load a SQL Server table of size 200 GB into SSIS and then zip the data into flatfile using command line zip functionality.
The data flow task simple hits a select * from view - the view is nothing but containing the table with no other fancy joins.
While troubleshooting I found that on SQL Server, there is hardly any load coming, possibly because the select command is running in single thread and not utilizing SQL server cores.
When I run the same select * command (only for 5 seconds, since it is 200 GB table), even my command is single threaded.
The package has a configuration file that the SQL job shows (this is how the package runs) with some connection settings.
Opening the package in BIDS show defaultBufferMaxRows as 10000 only (possibly default value) (since configuration file or any variables does not has a customer value, I guess this is what the package is using too).
Both SQL and SSIS are on same server. SQL has been allocated max memory leaving around 100 GB for SSIS and OS.
Kindly share any ideas on how can I force the SQL Server to run this select command using multiple threads so that entire table gets inside SSIS buffer pool faster.
Edit: I am aware that bcp can read data faster than any process and save it to flatfile but at this point changes to the SSIS package has to be kept minimum and exploring options that can be incorporated within SSIS package.
Edit2: Parallelism works perfectly for my SQL Server as I verified for a lot of other queries.The table in question is 200 GB. It is something with SSIS only which is not hammering my DB as hard as it should.
Edit3: I have made some progress, adjusted the buffer value to 100 MB and max rows to 100000 and now the package seem to be doing better. when I run this package on the server directly using dtexec utility, it generates good load of 40- 50 MB per second but through SQL job it never generates lod more than 10 MB. so I am trying to figure out this behavior.
Edit4: I found that when I run the package directly from logging to the server and invoking dtexec utility, it runs good because it generates good load on the DB causing data I\O to remain steady between 30-50 MB\sec.
The same thing from SQL job never exceeds the I\O more than 10 MB\sec.
I even tried to run the package using agent and opting for cmdline operation but no changes. Agent literally sucks here, any pointers on what could be wrong here?
Final Try:
I am stumped at the observation I have finally:
1)Same package runs 3x faster when run from command prompt from windows node by invoking dtexc utility
2) Exact same package runs 3 times slower than above when involked by SQL agent which has sysadmin permissions on windows as well as SQL Server
In both cases, I tried to see the version of DTEXEC they invoke, and they both invoke the same version. So why one would be so slow is out of my understanding.
I don't think that there is a general solution to this issue since it is a particular case that you didn't provide much information. Since there are two components in your data flow task (OLE DB Source and Flat File Destination), I will try to give some suggestions related to each component.
Before giving suggestions for each component, it is good to mention the following:
If no transformations are applied within the data flow task, It is not recommended to use this task. It is preferable to use bcp utility
Check the TempDb and the database log size.
If a clustered index exists, try to rebuild it. If not, try to create a clustered index.
To check the component that is slowing the package execution, open the package in Visual Studio and try to remove the flat file destination and replace it with a dummy Script Component (write any useless code, for example: string s = "";). And then run the package; if it is fast enough, then the problem is caused by the Flat File Destination, else you need to troubleshoot the OLE DB Source.
Try executing the query in the SQL Server management studio and shows the execution plan.
Check the package TargetServerVersion property within the package configuration and make sure it is correct.
OLE DB Source
As you mentioned, you are using a Select * from view query where data is stored in a table that contains a considerable amount of data. The SQL Server query optimizer may find that reading data using Table Scan is more efficient than reading from indexes, especially if your table does not have a clustered index (row store or column store).
There are many things you may try to improve data load:
Try replacing the Select * from view with the original query used to create the view.
Try changing the data provider used in the OLE DB Connection Manager: SQL Server Native Client, Microsoft OLE DB provider for SQL Server (not the old one).
Try increasing the DefaultBufferMaxRows and DefaultBufferSize properties. more info
Try replacing using SQL Command with specific column names instead of selecting the view name (Table of View data access mode). more info
Try to load data in chunks
Flat File Destination
Check that the flat file directory is not located on the same drive where SQL Server instance is installed
Check that the flat file is not located on a busy drive
Try to export data into multiple flat files instead of one huge file (split data into smaller files) , since when the exported data size increase in a single file, writing to this file become slower, then the package will become slower. (Check the 5th suggestion above)
Any indexes on the table could slow loading. If there are any indexes, try dropping them before the load and then recreating them after. This would also update the index statistics, which would be skewed by the bulk insert.
Are you seeing SQL server utilizing other cores too for other queries? If not, maybe someone played with the following settings:
Check these under server configuration setting:
Maximum Degree of Parallelism
Cost Threshold for Parallelism (server configuration setting).
Does processors affinitized to a CPU.
Also, MaxDOP query hint can cause this too but you said there is no fancy stuff in the view.
Also, it seems you have enough memory on error, why not increase defaultBufferMaxRows to an extremely large number so that SQL server doesn't get slowed down waiting for the buffer to get empty. Remember, they are using the same disk and they will have to wait for each other to use the disk, which will cause extra wait times for the both. It's better SQL server uses it, put into the buffer, and then SSIS starts processing and writing it into disk.
DefaultBufferSize : default is 10MB, max possible 2^31-1 bytes
DefaultBufferMaxRows : default is 10000
you can set AutoAdjustBufferSize so that DefaultBufferSize is automatically calculated based on DefaultBufferMaxRows
See other performance troubleshooting ideas here
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features?view=sql-server-ver15
Edit 1: Some other properties you can check out. These are explained in the above link as well
MaxConcurrentExecutables (package property): This defines how many threads a package can use.
EngineThreads (Data Flow property): how many threads the data flow engine can use
Also try running dtsexec under the same proxy user used by SQL agent to see if you get different result with this account versus your account. You can use runas /user:... cmd to open a command window under that user and then execute dtexec.
Try changing the proxy user used in SQL Agent to a new one and see if it will help. Or try giving elevated permissions in the directories it needs access to.
Try keeping the package in file-system and execute through dtexec from the SQL Agent directly instead of using catalog.start_execution.
Not your case but for other readers: if you have "Execute Package Task", make sure the child packages to be executed are set to run in-process via ExecuteOutOfProcess property. This just reduces overhead of using more processes.
Not your case but for other readers: if you're testing in BIDS, it will run in debug mode by default and thus run slow. Use CTRL-F5 (start without debugging). The best is to use dtexec directly to test the performance
A data flow task may not be the best choice to move this data. SSIS Data Flow tasks are an ETL tool where you can do transformations, look ups, redirect invalid rows, add derived columns and a lot more. If the data flow task is simple and only moves data with no manipulation or redirection of rows then ditch the Data Flow task and use a simple Execute SQL Task and OPENROWSET to import the flat file that was generated from command line and zipped up. Assuming the flat file is a .csv file here are some working examples to query a .csv and insert the data to a table.
You need [Ad Hoc Distributed Queries] run_value set to 1
into dbo.Destination
SELECT *
from openrowset('MSDASQL', 'Driver={Microsoft Text Driver (*.txt; *.csv)};
DefaultDir=D:\YourCsv.csv;Extensions=csv;','select * from YourCsv.csv') File;
Here is some additional examples https://sqlpowershell.blog/2015/02/09/t-sql-read-csv-files-using-openrowset/
There are suggestions in this MSDN article: MSDN DataFlow performance features
Key ones appear to be:
Check the EngineThreads property of the DataFlow task, which tells SSIS how may source and worker threads it should use
If using OLE DB Source to select data from a view uses "SQL Command" and write a SELECT * From View rather than Table or View
Let us know how you get on
You may be facing I/O bottleneck while writing the 200GB to the flat file. I don't see any problem with SQL Query.
If possible create multiple files and split the data (either by modifying SSIS or changing the select query)

SSIS stuck in validation when running package (SQL Server 2016)

I've got a very basic SSIS package with a task in it that has a simple Dataflow in it. Source is a table in a database in one SQL Server instance and destination is an identical table in a different instances database. I've done this kind of thing tons of times, but this time, I'm getting a weird occurance.
When I run the package, it freezes on the validation of the destination after 50% has been reached. I've turned off referential integrity for the table, and i'm doing a fast load keeping identity fields and nulls, with table lock and check contraints ticked. The table only contains about 200,000 records and should be fast to copy.
Anyone know what could be causing this?
As an extra bit of information, If I execute the task directly it runs fine..
Well, that was very odd. It turns out that the previous Task (an Execute SQL) was messing things up. I was turning off referential integrity for the table, which should not have adversely affected the dataflow. Removing this sorted out the problem. To be honest I don't need it in this case, it was just copied from a previous package that did need it.
Weird.

SQL server Instance hanging randomly

I have a SQL Server agent job running every 5 minutes with SSIS package from SSIS Catalog, that package does:
DELETE all existing data ON OLTP_DB
Extract data from Production DB
DELETE all existing data on OLAP_DB and then
Extract data transformed from OLTP_DB into OLAP_DB ...
PROBLEM
That job I mentioned above is hanging randomly for some reason that I don't know,
I just realize using the activity monitor, every time it hangs it shows something like:
and if I try to run any query against that database it does not response just say executing.... and nothing happen until I stop the job.
The average running time for that job is 5 or 6 minutes, but when it hangs it can stay running for days if I donĀ“t stop it. :(
WHAT I HAVE DONE
Set delayValidation : True
Improved my queries
No transactions running
No locking or blocking (I guess)
Rebuild and organize index
Ran DBCC FREEPROCCACHE
Ran DBCC FREESESSIONCACHE
ETC.....
My settings:
Recovery Mode Simple
SSIS OLE DB Destination
1-Keep Identity (checked)
2-Keep Nulls (checked)
3-Table lock (checked)
4-Check constraints (unchecked)
rows per batch (blank)
Maximum insert commit size(2147483647)
Note:
I have another job running a ssis package as well (small) in the same instance but different databases and when the main ETL mentioned above hangs then this small one sometimes, that is why I think the problem is with the instance (I guess).
I'm open to provide more information as need it.
Any assistance or help would be really appreciated!
As Joeren Mostert said, it's showing CXPACKET which means that it's executing some work in parallel. (cxpacket)
It's also showing ASYNC_NETWORK_IO (async_network_io) which means it's also transfering data to the network.
There could be many reasons. Just a few more hints:
- Have you checked if network connection is slow? - What is the size of the data being transfered vs the speed of the network? - Is there an antivirus running that could slow the data transfer?
My guess is that there is lots of data to transfer and that it's taking a long time. I would think either I/O or network but since you have an asyn_network_io that takes most of the cumulative wait time, I would go for network.
As #Jeroen Mostert and #Danielle Paquette-Harvey Said, By doing right click using the activity monitor I could figure out that I had an object that was executing in parallel (for some reason in the past), to fix the problem I remove the parallel structure and put everything to run in one batch.
Now it is working like a charm!!
Before:
After:

Resources