I have created an SSIS package that exports several rows to Excel, usually fewer than 200 (including the header). When I run it in VS2015 Debug everything turns green.
I even wait like this question says.
Still, nothing but the header ever gets to the file.
I know it's not much data, but I'm trying to automate it as the data will eventually grow and I don't want to manually run this all the time.
Any ideas? I'm using SQL Server 2012 and wrote this SSIS package with VS2015.
Something that occasionally happens with Excel destinations is that hidden formatting will cause the data flow connector to begin writing data at a row other than 1.
If this happens, you'll need to recreate your template from a fresh Excel document, and reconstruct the header as needed.
It depends on the buffer size that the underlying process uses. I monitored the free space on the C: drive while the SSIS package was writing to the Excel destination and found that the drive was filling up; as soon as the space was exhausted, the package ended with success without writing any rows to the Excel destination. After I cleared enough space on my C: drive (around 2 GB), everything started working fine.
Also found the following useful thread that might be helpful for someone.
First, I am new to SSIS so I am still getting the hang of things.
I am using Visual Studio 19 and SSMS 19
Regardless, I have set up an OLE DB package to load a .TSV file into a table in SSMS. The issue is that it took 1 hour and 11 minutes to execute for 500,000 rows.
The data is extremely variable, so I have set up a staging table in SSMS that is essentially all varchar(max) columns. Once all the data is inserted, I was going to look at some aggregations like max(len(<column_name>)) in order to better optimize the table and the SSIS package.
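Those maximum lengths can also be estimated from the file itself before anything is loaded. The following is a minimal Python sketch, not part of the original setup: it assumes a tab-delimited file with a header row and no quoted fields, and the function name max_field_lengths is made up for illustration. It counts characters rather than bytes, so treat the result as a sizing hint only.

```python
import csv
import io


def max_field_lengths(lines, delimiter="\t"):
    """Return {column_name: longest value length} for a delimited text stream."""
    # QUOTE_NONE: assume fields are not quoted (an assumption about the file).
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader)
    widths = [0] * len(header)
    for row in reader:
        # Ignore any extra trailing fields beyond the header width.
        for i, value in enumerate(row[: len(header)]):
            widths[i] = max(widths[i], len(value))
    return dict(zip(header, widths))


# Small in-memory sample standing in for a real .TSV file.
sample = io.StringIO("id\tname\tnotes\n1\tAnn\t\n22\tRobert\tfirst order\n")
lengths = max_field_lengths(sample)
print(lengths)
```

For a real file, the same function can be given an open file object (opened with newline="" and whatever encoding the file actually uses) instead of the in-memory sample.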
Anyway, there are 10 of these files, so I need to create a ForEach File loop. This would take at minimum (1.18 hours) * 10 = about 11.8 hours of total runtime.
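The estimate can be checked with a few lines of arithmetic. This is only a sketch of the calculation, assuming all 10 files are about the same size and load at the same rate as the one that was measured; it comes out at roughly 117 rows per second and just under 12 hours, in line with the rough figure above.

```python
minutes_per_file = 60 + 11   # 1 hour 11 minutes, as measured for one file
rows_per_file = 500_000
file_count = 10              # assumes every file behaves like the first one

rows_per_second = rows_per_file / (minutes_per_file * 60)
total_hours = minutes_per_file * file_count / 60

print(f"{rows_per_second:.0f} rows/s, {total_hours:.2f} hours for {file_count} files")
```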
I thought this was a bit long and created a BULK INSERT Task, but I am having some issues.
It seems very straightforward to set up.
I added the Bulk Insert Task to the Control Flow tab and went into the Bulk Insert Task Editor Dialogue Box.
From here, I configured the Source and Destination connections, both of which went very smoothly. I only have one local instance of SQL Server on my machine so I used localhost.<database_name> and the table name for the Destination Connection.
I run the package and it executes just fine without any errors or warnings. It takes less than a minute for a roughly 600 MB .TSV file to load into an SSMS table with about 300 columns of varchar(max).
I thought this was too quick and it was. Nothing loaded, but the package executed!!!
I have tried searching for this issue with no success. I checked my connections too.
Do I need Data Flow Tasks for Bulk Insert Tasks? Do I need any connection managers? I had to configure Data Flow Tasks and connection managers for the OLE DB package, but the articles I have referenced do not do this for Bulk Insert Tasks.
What am I doing wrong?
Any advice from someone more well-versed in SSIS would be much appreciated.
Regarding my comment about using a derived column in place of a real destination, it would look like 1 in the image below. You can do this in a couple of steps:
1) Run the read task only and see how long this takes. Limit the total read to a sample size so your test does not take an hour.
2) Run the read task with a derived column as a destination. This will test the total read time, plus the amount of time to load the data into memory.
If 1) takes a long time, it could indicate a bottleneck with slow read times on the disk where the file is, or a network bottleneck if the file is on another server on a shared drive. If 2) adds a lot more time, it could indicate a memory bottleneck on the server where SSIS is running. Please note that testing this on a server is the best way to test performance, because it removes a lot of issues that probably won't exist there, such as network bottlenecks and memory constraints.
Lastly, please turn on the feature noted as 2) below, AutoAdjustBufferSize. When it is on, the engine sizes each buffer automatically from the estimated row size and DefaultBufferMaxRows (total rows allowed in each buffer; these are the numbers that you see next to the arrows in the data flow when you run the package interactively) instead of capping it at DefaultBufferSize (max memory in the buffer). Because your column sizes are so large, this lets the server maximize the buffer size, which gives you a bigger and faster pipeline to push the data through.
One final note: if you add the real destination and that has a significant impact on time, you can look into issues with the target table. Make sure there are no indexes (including a clustered index), make sure TABLOCK is on, and make sure there are no constraints or triggers.
We have an SSIS package that the development team describes as 'slow'. Since they do not have anyone with SSIS ETL experience, as a DBA I tried digging into it. Below is the information I found:
SQL Server was upgraded in place from version 2014 to 2017, so it has SSIS of both versions.
They load a 200 GB SQL Server table into SSIS and then zip the data into a flat file using command-line zip functionality.
The data flow task simply runs a select * from view; the view contains nothing but the table, with no fancy joins.
While troubleshooting I found that on SQL Server, there is hardly any load coming, possibly because the select command is running in single thread and not utilizing SQL server cores.
When I run the same select * command (only for 5 seconds, since it is 200 GB table), even my command is single threaded.
The package has a configuration file that the SQL job shows (this is how the package runs) with some connection settings.
Opening the package in BIDS shows DefaultBufferMaxRows as only 10000 (possibly the default value). Since neither the configuration file nor any variable has a custom value, I guess this is what the package is using too.
Both SQL and SSIS are on same server. SQL has been allocated max memory leaving around 100 GB for SSIS and OS.
Kindly share any ideas on how can I force the SQL Server to run this select command using multiple threads so that entire table gets inside SSIS buffer pool faster.
Edit: I am aware that bcp can read data faster than any process and save it to flatfile but at this point changes to the SSIS package has to be kept minimum and exploring options that can be incorporated within SSIS package.
Edit2: Parallelism works perfectly on my SQL Server, as I verified with a lot of other queries. The table in question is 200 GB. It is something specific to SSIS that is not hitting my DB as hard as it should.
Edit3: I have made some progress: I adjusted the buffer size to 100 MB and max rows to 100000, and now the package seems to be doing better. When I run this package on the server directly using the dtexec utility, it generates a good load of 40-50 MB per second, but through the SQL job it never generates a load of more than 10 MB per second, so I am trying to figure out this behavior.
Edit4: I found that when I log on to the server and run the package directly by invoking the dtexec utility, it runs well: it generates a good load on the DB, with data I/O remaining steady between 30-50 MB/sec.
The same thing from the SQL job never exceeds 10 MB/sec of I/O.
I even tried to run the package using the agent with the cmdline operation, but nothing changed. Agent literally sucks here; any pointers on what could be wrong?
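To put those rates in perspective, here is a rough back-of-the-envelope calculation in Python. It is a sketch only: it assumes the full 200 GB has to be read once at a steady rate, and the function name hours_to_read is made up for illustration.

```python
def hours_to_read(size_gb, rate_mb_per_s):
    """Hours needed to move size_gb gigabytes at a steady rate_mb_per_s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_gb * 1024 / rate_mb_per_s / 3600


table_gb = 200  # table size quoted above
for rate in (10, 30, 50):
    print(f"{rate} MB/s -> {hours_to_read(table_gb, rate):.1f} h")
```

Under that assumption the SQL job rate of 10 MB/sec means well over five hours of reading, against one to two hours at the rates seen from the command prompt.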
Final Try:
I am stumped at the observation I have finally:
1) The same package runs 3x faster when run from a command prompt on the Windows node by invoking the dtexec utility.
2) The exact same package runs 3 times slower than above when invoked by SQL Agent, which has sysadmin permissions on Windows as well as SQL Server.
In both cases, I tried to see the version of DTEXEC they invoke, and they both invoke the same version. So why one would be so slow is out of my understanding.
I don't think that there is a general solution to this issue, since it is a particular case about which you didn't provide much information. Since there are two components in your data flow task (OLE DB Source and Flat File Destination), I will try to give some suggestions related to each component.
Before giving suggestions for each component, it is good to mention the following:
If no transformations are applied within the data flow task, it is not recommended to use this task. It is preferable to use the bcp utility.
Check the TempDb and the database log size.
If a clustered index exists, try to rebuild it. If not, try to create a clustered index.
To check the component that is slowing the package execution, open the package in Visual Studio and try to remove the flat file destination and replace it with a dummy Script Component (write any useless code, for example: string s = "";). And then run the package; if it is fast enough, then the problem is caused by the Flat File Destination, else you need to troubleshoot the OLE DB Source.
Try executing the query in SQL Server Management Studio and check the execution plan.
Check the package TargetServerVersion property within the package configuration and make sure it is correct.
OLE DB Source
As you mentioned, you are using a Select * from view query where data is stored in a table that contains a considerable amount of data. The SQL Server query optimizer may find that reading data using Table Scan is more efficient than reading from indexes, especially if your table does not have a clustered index (row store or column store).
There are many things you may try to improve data load:
Try replacing the Select * from view with the original query used to create the view.
Try changing the data provider used in the OLE DB Connection Manager: SQL Server Native Client, Microsoft OLE DB provider for SQL Server (not the old one).
Try increasing the DefaultBufferMaxRows and DefaultBufferSize properties. more info
Try using the SQL Command data access mode with specific column names instead of selecting the view name (Table or View data access mode). more info
Try to load data in chunks.
Flat File Destination
Check that the flat file directory is not located on the same drive where SQL Server instance is installed
Check that the flat file is not located on a busy drive
Try to export data into multiple flat files instead of one huge file (split the data into smaller files): as the size of a single exported file increases, writing to it becomes slower, and so does the package. (Check the 5th suggestion above)
Any indexes on the table could slow loading. If there are any indexes, try dropping them before the load and then recreating them after. This would also update the index statistics, which would be skewed by the bulk insert.
Are you seeing SQL server utilizing other cores too for other queries? If not, maybe someone played with the following settings:
Check these under server configuration setting:
Maximum Degree of Parallelism
Cost Threshold for Parallelism (server configuration setting).
Whether processors are affinitized to specific CPUs (processor affinity).
Also, MaxDOP query hint can cause this too but you said there is no fancy stuff in the view.
Also, it seems you have enough memory on the server, so why not increase DefaultBufferMaxRows to an extremely large number so that SQL Server doesn't get slowed down waiting for the buffer to empty? Remember, they are using the same disk and will have to wait for each other to use it, which causes extra wait time for both. It's better if SQL Server uses the disk first and puts the data into the buffer, and then SSIS starts processing it and writing it to disk.
DefaultBufferSize : default is 10MB, max possible 2^31-1 bytes
DefaultBufferMaxRows : default is 10000
you can set AutoAdjustBufferSize so that DefaultBufferSize is automatically calculated based on DefaultBufferMaxRows
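How the two settings interact can be sketched roughly as follows. This is a simplified model, not the engine's actual algorithm: it assumes the rows per buffer are capped both by DefaultBufferMaxRows and by how many rows of the estimated size fit in DefaultBufferSize, and it ignores the engine's internal minimum buffer size and its separate handling of large-object columns. The function name rows_per_buffer and the row sizes are made up for illustration.

```python
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024   # 10 MB, the documented default
DEFAULT_BUFFER_MAX_ROWS = 10_000         # the documented default


def rows_per_buffer(row_bytes,
                    buffer_bytes=DEFAULT_BUFFER_SIZE,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS):
    """Approximate rows per buffer under the simplified model described above."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    # Whichever limit is hit first wins; always allow at least one row.
    return max(1, min(max_rows, buffer_bytes // row_bytes))


print(rows_per_buffer(500))                               # narrow rows, defaults
print(rows_per_buffer(4000))                              # wide rows, defaults
print(rows_per_buffer(4000, 100 * 1024 * 1024, 100_000))  # 100 MB / 100000 rows
```

In this model, wide rows hit the memory cap long before the row cap, which is why raising only DefaultBufferMaxRows may not change much unless DefaultBufferSize is raised as well.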
See other performance troubleshooting ideas here
https://learn.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features?view=sql-server-ver15
Edit 1: Some other properties you can check out. These are explained in the above link as well
MaxConcurrentExecutables (package property): This defines how many executables (tasks) the package can run concurrently.
EngineThreads (Data Flow property): how many threads the data flow engine can use
Also try running dtexec under the same proxy user used by SQL Agent to see if you get a different result with this account versus your account. You can use the runas /user:... command to open a command window under that user and then execute dtexec.
Try changing the proxy user used in SQL Agent to a new one and see if it will help. Or try giving elevated permissions in the directories it needs access to.
Try keeping the package in file-system and execute through dtexec from the SQL Agent directly instead of using catalog.start_execution.
Not your case but for other readers: if you have "Execute Package Task", make sure the child packages to be executed are set to run in-process via ExecuteOutOfProcess property. This just reduces overhead of using more processes.
Not your case but for other readers: if you're testing in BIDS, it will run in debug mode by default and thus run slow. Use CTRL-F5 (start without debugging). The best is to use dtexec directly to test the performance
A data flow task may not be the best choice to move this data. SSIS Data Flow tasks are an ETL tool where you can do transformations, look ups, redirect invalid rows, add derived columns and a lot more. If the data flow task is simple and only moves data with no manipulation or redirection of rows then ditch the Data Flow task and use a simple Execute SQL Task and OPENROWSET to import the flat file that was generated from command line and zipped up. Assuming the flat file is a .csv file here are some working examples to query a .csv and insert the data to a table.
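You need the [Ad Hoc Distributed Queries] option enabled (run_value set to 1 in sp_configure).
-- DefaultDir is the folder that holds the file; the file name goes in the inner query.
INSERT INTO dbo.Destination
SELECT *
FROM OPENROWSET('MSDASQL',
    'Driver={Microsoft Text Driver (*.txt; *.csv)};DefaultDir=D:\;Extensions=csv;',
    'SELECT * FROM YourCsv.csv') AS CsvFile;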
Here are some additional examples: https://sqlpowershell.blog/2015/02/09/t-sql-read-csv-files-using-openrowset/
There are suggestions in this MSDN article: MSDN DataFlow performance features
Key ones appear to be:
Check the EngineThreads property of the DataFlow task, which tells SSIS how many source and worker threads it should use.
If using an OLE DB Source to select data from a view, use "SQL Command" and write a SELECT * FROM view rather than using the Table or View access mode.
Let us know how you get on
You may be facing an I/O bottleneck while writing the 200 GB to the flat file. I don't see any problem with the SQL query.
If possible, create multiple files and split the data (either by modifying the SSIS package or by changing the select query).
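One way to split by changing the select query is to cut a numeric key into contiguous ranges and export each range to its own file. The Python sketch below only generates the ranges and prints example queries; the table name dbo.BigTable, the key column Id, the key bounds and the output file names are all hypothetical, and it assumes the key values are spread fairly evenly.

```python
def split_ranges(min_id, max_id, parts):
    """Split the inclusive range [min_id, max_id] into up to `parts` contiguous ranges."""
    if parts <= 0 or max_id < min_id:
        raise ValueError("need parts > 0 and max_id >= min_id")
    total = max_id - min_id + 1
    base, extra = divmod(total, parts)
    ranges, start = [], min_id
    for i in range(parts):
        # The first `extra` ranges take one additional key each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges


# Hypothetical table, key column and bounds, for illustration only.
for n, (lo, hi) in enumerate(split_ranges(1, 1_000_000, 4), start=1):
    print(f"export_{n}.txt: SELECT * FROM dbo.BigTable WHERE Id BETWEEN {lo} AND {hi}")
```

Each generated query could then feed its own source and flat file destination, so that no single file has to hold the whole 200 GB.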
I've been searching all over for an example of doing this but I'm not finding it. I know it is possible because we had it working at one point, but the resource who developed the process isn't currently available to fix it, and the process is now corrupted beyond repair. In fact, it is corrupted so badly that we can't even get in to take a look at what was there in order to build a copy of the process over again.
What we have is a 'Production_DB' and a 'Test_DB', which are essentially the same. What was taking place is that an SSIS task was firing at the end of each work day and refreshing 'Test_DB' with the data that is in 'Production_DB'. In this way testing can take place and changes can be made to the test bed without any concern that it will get too far afield of the live data, because each evening this data is brought back to exactly what is in production. Meanwhile, all testing is being measured against actual real-life data examples, so when processes are pointed at the production data set there is less chance of issues.
The problem we have is several months back we didn't realize it but the SSIS package and source files became corrupted beyond readability. So, now we are looking for a way to replace the package to restore the process, but as of yet I have not been able to find an example that I can use to build from.
We are on SQL Server 2008 R2.
If anyone has some references they can point me to it would be greatly appreciated!
Depending on the number of tables and the SQL Server version, you can use the Import and Export Wizard to identify prod as the source and test as the destination. Use that wizard to create a task and SAVE the resulting task (it should save as an SSIS package, I believe). This will get you a quick way of making the SSIS package to copy the data over, and you can even overwrite the destination data if you would like.
Right click the database > tasks > import data
The same data flow in a SSIS package runs 5 times slower on Production Server.
On Dev, the data flow shifts data from a Development database to a text file on a network folder. On the Development Server this process runs in 1 second per file. So it is an ADO.NET Source to a Flat File Destination with nothing else.
On Production (exactly the same data), the data flow shifts data from a Production database to a text file on the same network folder. On Production the same process runs in 5 seconds per file. Again, it is an ADO.NET Source to a Flat File Destination.
Now obvious difference is the database. Nothing else is different apart from the server the SSIS package is running from.
So what is the best way to determine the bottleneck? Should I separate the source and destination in the data flow to determine which part has the problem? Can I increase the packet size or use fast parse on the flat file to speed things up? Is there a quicker way to work out the problem? On Prod I am limited in what I can test, and this is the only place the problem happens. Will performance counters help me? Is there a special diagnostics package someone has that may help?
A lot of ideas come flowing to my head. The question to focus on is how to work out what the bottleneck is in less than 5 minutes ?
I have a VB.NET Windows application that pulls information from an MS Access database. The primary role of the application is to extract information from Excel files in various formats, standardize the file layout and write that out to csv files. The application uses MS Access as the source for the keys and cross reference files.
The Windows app uses typed datasets for much of the interaction between the user and the database. The standardization is done on each client's machine. The application is not... how can I say this...FAST :-).
Question: What is the best way to migrate the DB and application to SQL Server 2005? I am thinking it might be a good idea to write the code for the standardization in SSIS packages.
What is the appropriate way to go about this migration?
The application pulls data from 250 Excel files each week and approximately 800 files each month, with an average of about 5000 rows per file. There are 13 different file formats that are standardized and output into 3 different standard formats. The application takes between 25 min. and 40 min. to run, depending on which data run we are talking about. 95% of the application is the standardization process. All the user does is pick a few parameters then start the run.
Microsoft provides a free tool to migrate an Access database to SQL Server. Once you've upgraded you should be able to change your connection string to point at SQL Server.
You might want to run your app through a profiler to ensure that the Access DB is really what's slowing down your app, and not something else. It would be a shame to go through all the work to convert it over to SQL server, and have nothing to show for it.
The Access upsizing wizard can be used as a starting point.
You may be able to change the backend to be SQL Server with linked tables in Access without changing your front end. Then, you can modify the front end to go directly to SQL Server at will.
Unless you are hitting Access very heavily, I doubt that it is your bottleneck.
As far as reading the Excel files, SSIS can do it, but it might not be as reliable as the mechanism you are using in VB.NET right now, if your VB.NET code has a lot of smart logic to deal with a degree of variation in the input files.
As far as writing data out to CSV, SSIS is fine, and I've found SSIS to be a pretty good performer.
If you could give more details about the workflow and how much the user interacts with the database versus the program pulling configuration, it might be easier to help with your architecture.
SSIS is very configurable on the fly (package configuring itself somewhat while it is running), and in many cases it could be programmed to read a variety of Excel files and convert them to CSV, but it's not as configurable on the fly as a hand-coded system. It is also possible to use the SSIS object model to generate packages programmatically and then execute them - this does not have some of the limitations of a package configuring itself, but the object model is pretty complex.
Making sure the scope is clear:
Use a .NET program to
drive an Access database front-end which enables you to
Extract data from a number of Excel spreadsheets,
Massage the data appropriately, and
Save the result in a CSV file.
What sorts of volumes are we talking about? How many clients, how many spreadsheets per client, how many rows per spreadsheet (I think it would be 32767 max for a single spreadsheet, right?) And how much time are we talking about?
Seems like a lot of moving parts. And Access usually is a pretty good tool (with VBA) to do this sort of thing by itself.
It doesn't seem like enough volume to provide a major time sink for a well-designed Access database front-ending Excel to accomplish the whole process using VBA. If your alternative involves installing and operating SQL Server (in place of Access) on each client, I would be surprised if the admin and operational overhead doesn't increase.
So Weekly, per client:
250 files at 25 minutes
= 10 files / minute
or 6 seconds per file.
Monthly, per client:
800 files at 40 minutes
= 20 files/minute
or 3 seconds per file.
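The same arithmetic as a small Python sketch, using only the file counts and run times quoted in the question (the function name per_file_stats is made up for illustration):

```python
def per_file_stats(files, minutes):
    """Return (files per minute, seconds per file) for one run."""
    files_per_minute = files / minutes
    seconds_per_file = minutes * 60 / files
    return files_per_minute, seconds_per_file


weekly = per_file_stats(250, 25)    # weekly run: 250 files in 25 minutes
monthly = per_file_stats(800, 40)   # monthly run: 800 files in 40 minutes
print("weekly:", weekly)
print("monthly:", monthly)
```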
My expectation would be less than 1 sec. per file (5000 rows) round trip including:
a. Import or attach xls to mdb,
b. Transform via Access SQL
c. Export to csv
The only explanation that comes to mind is that perhaps the .NET app is reading, transforming, and saving a row at a time. Is that possibly the case?
If you convert to SSIS, then that probably obsoletes the .NET app, because SSIS will want to handle the ETL (and save) itself. So you will basically be rewriting the software. You may have better resources for SSIS than for Access, but it seems to me like overkill. Then again, .NET rather than VBA is maybe overkill too, and rewriting in VBA is work as well. The least effort would, I think, be to see if you can do the entire ETL (and save) using Access SQL for most of it, and using VBA just for scripting, to iterate through input files in a directory or some such.
I think you could at least prototype the basic use cases and find out if you can find out pretty quickly where the time is being spent now (as suggested by earlier responses.) But that would be worth finding out before committing redevelopment resources aimed at the wrong part of the problem. If you can expand a bit in those areas, I could probably direct you further. But Access is pretty well suited for this sort of thing, at (IMHO) a lower TCO than SQL Server + SSIS + .NET.
Not to mention that I'd be surprised if the csv files are the true end point, which may play a role in the decision. Isn't the Excel data really ending up further down the path?
Finally, how objectionable is a 25-40 minute process that presumably is unattended, can run over lunch break, and maybe basically works ok?
Notes:
              Per week   Per month
Excel files   250        800
Minutes       25         40
Minutes/file  0.1        0.05
Sec/file      6          3