The same data flow in an SSIS package runs 5 times slower on the Production server.
On Dev, the data flow shifts data from a Development database to a text file on a network folder. On the Development server this process runs in 1 second per file. So: ADO.NET Source to Flat File destination, with nothing else.
On Production (exactly the same data), the data flow shifts data from a Production database to a text file on the same network folder. On Production the same process runs in 5 seconds per file. Again, ADO.NET Source to Flat File destination.
Now, the obvious difference is the database. Nothing else is different apart from the server the SSIS package is running from.
So what is the best way to determine the bottleneck? Should I separate the source and destination in the data flow to determine which part has the problem? Can I increase the packet size or use fast parse on the flat file to speed things up? Is there a quicker way to work out the problem? On Prod I am limited in what I can test, and this is the only place the problem happens. Will performance counters help me? Is there a special diagnostics package someone has that may help?
A lot of ideas come flowing to my head. The question to focus on is: how do I work out what the bottleneck is in less than 5 minutes?
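One quick check I can think of is to time the source query by itself on both servers, to see whether the database read or the file write is the slow part. A minimal sketch, assuming a placeholder source table (swap in the real query from the ADO.NET Source):

    -- Run in SSMS against both Dev and Prod and compare the elapsed times.
    -- dbo.SourceTable stands in for the real source query.
    SET STATISTICS TIME ON;
    SELECT * FROM dbo.SourceTable;
    SET STATISTICS TIME OFF;

If the query times are comparable on both servers, the difference is more likely in the file write or the server running the package.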
First, I am new to SSIS, so I am still getting the hang of things.
I am using Visual Studio 2019 and SSMS 19.
Regardless, I have set up an OLE DB package that loads a .TSV file into a SQL Server table. The issue is that it took 1 hour and 11 minutes to execute for 500,000 rows.
The data is extremely variable, so I have set up a staging table that is essentially all varchar(max) columns. Once all the data is inserted, I was going to look at some aggregations like max(len(<column_name>)) in order to better optimize the table and the SSIS package.
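For reference, the kind of profiling query I have in mind looks roughly like this (table and column names are placeholders for my real schema):

    -- Find the longest value actually stored in each staging column.
    SELECT
        MAX(LEN(Column1)) AS MaxLen_Column1,
        MAX(LEN(Column2)) AS MaxLen_Column2,
        MAX(LEN(Column3)) AS MaxLen_Column3
    FROM dbo.StagingTable;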
Anyway, there are 10 of these files, so I need to create a ForEach File loop. This would take at minimum (1.17 hours) * 10 = 11.7 hours of total runtime.
I thought this was a bit long, so I created a Bulk Insert Task, but I am having some issues.
It seems very straightforward to set up.
I added the Bulk Insert Task to the Control Flow tab and went into the Bulk Insert Task Editor Dialogue Box.
From here, I configured the Source and Destination connections, both of which went very smoothly. I only have one local instance of SQL Server on my machine, so I used localhost.<database_name> and the table name for the Destination Connection.
I run the package and it executes just fine, without any errors or warnings. It takes less than a minute for a roughly 600 MB .TSV file to load into a SQL Server table with about 300 varchar(max) columns.
I thought this was too quick, and it was. Nothing loaded, but the package executed!
I have tried searching for this issue with no success. I checked my connections too.
Do I need Data Flow Tasks for Bulk Insert Tasks? Do I need any connection managers? I had to configure Data Flow Tasks and connection managers for the OLE DB package, but the articles I have referenced do not do this for Bulk Insert Tasks.
What am I doing wrong?
Any advice from someone more well-versed in SSIS would be much appreciated.
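For reference, I believe the Bulk Insert Task just wraps a plain BULK INSERT statement, so one sanity check I could run directly in SSMS is something like the sketch below (the file path, table name, and FIRSTROW setting are placeholders for my real setup):

    -- Load one of the tab-separated files straight into the staging table.
    BULK INSERT dbo.StagingTable
    FROM 'C:\data\file1.tsv'
    WITH (
        FIELDTERMINATOR = '\t',  -- tab-separated columns
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,     -- skip the header row, if there is one
        TABLOCK
    );

If that loads rows but the SSIS task still does not, the problem is in the task configuration rather than in the file or the table.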
Regarding my comment about using a derived column in place of a real destination, it would look like item 1 in the image below. You can do this in a couple of steps:
1. Run the read task only and see how long this takes. Limit the total read to a sample size so your test does not take an hour.
2. Run the read task with a derived column as a destination. This will test the total read time, plus the amount of time to load the data into memory.
If 1) takes a long time, it could indicate a bottleneck from slow read times on the disk where the file lives, or a network bottleneck if the file is on a shared drive on another server. If 2) adds a lot more time, it could indicate a memory bottleneck on the server where SSIS is running. Please note that testing this on the server itself is the best way to measure performance, because it removes a lot of issues, such as network bottlenecks and memory constraints, that probably won't exist there.
Lastly, please turn on the feature noted as 2) below, AutoAdjustBufferSize. This will adjust the settings for DefaultBufferSize (the maximum memory in the buffer) and DefaultBufferMaxRows (the total rows allowed in each buffer; these are the numbers you see next to the arrows in the data flow when you run the package interactively). Because your column sizes are so large, this gives the server a hint to maximize the buffer size, which gives you a bigger and faster pipeline to push the data through.
One final note: if you add the real destination and that has a significant impact on time, you can look into issues with the target table. Make sure there are no indexes, including a clustered index, make sure TABLOCK is on, and make sure there are no constraints or triggers.
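A quick way to double-check that the target is a bare heap, assuming a placeholder table name:

    -- A plain heap returns a single row with type_desc = 'HEAP' and no other indexes.
    SELECT i.name, i.type_desc
    FROM sys.indexes AS i
    WHERE i.object_id = OBJECT_ID('dbo.StagingTable');

    -- Any rows here mean triggers will fire on every insert.
    SELECT t.name
    FROM sys.triggers AS t
    WHERE t.parent_id = OBJECT_ID('dbo.StagingTable');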
We have a flat file with a billion records to insert into SQL Server. We tried bcp, but it takes too long to complete or fails in the middle of the process. Could you please advise on this?
Some things to consider when moving large amounts of data to/from SQL Server (any database server, really):
If you can, split the file into pieces and load the pieces. Odds are you are not utilizing, or anywhere near stressing, the SQL Server resources overall, and the long duration of the BCP copy opens you up to any interruption in communication. So split the file up and load the pieces one at a time (safest), or load 2, 4 or 10 file pieces at once... see if you can get the DBA out of their chair :).
If you can, place the file on a drive local to the SQL Server, to avoid potential network/storage interruptions. BCP is not forgiving of a break in communication.
Eliminate any indexing on the destination table. BCP into a dumb, empty, boring heap table. Avoid any extras, and use char/varchar columns if you can (avoiding the conversion CPU costs as well).
You can tweak the batch size that BCP will queue up before committing to SQL Server. You can crank this up to 100,000 rows or more; do some testing to see what works best for you. With a larger batch you'll save some time and touch the physical disk less often (this depends on a lot of other things too, though).
If the file must be pulled across a network, you can also tweak the network packet size. Search for help on calculating your ideal packet size. A sketch of a bcp command using both switches follows this list.
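As a rough sketch (the server, database, table, and file names are placeholders, and the exact values are something to test), a bcp invocation using a larger batch size (-b) and packet size (-a) might look like this:

    rem Load one file piece in character mode with tab delimiters,
    rem committing every 100,000 rows and using a 16 KB network packet.
    bcp MyDb.dbo.StagingTable in "D:\load\part01.tsv" -S MyServer -T -c -t "\t" -b 100000 -a 16384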
There are other options available, but as others have stated, without more detail you can't get a targeted answer.
Share your bcp command. Share the structure of the file. Share details on the table you are bcp'ing into. Is it a distributed environment (SQL Server and the bcp file on different machines), with any network involved? How many records per second are you getting? Is the file wide or narrow? How wide is the file? 1 billion records of 5 integer columns isn't that much data at all. But 1 billion records that are 2,000 bytes wide... that's a monster!
I have created an SSIS package that exports several rows to Excel, usually fewer than 200 (including the header). When I run it in VS2015 Debug, everything turns green.
I even wait like this question says.
Still, nothing but the header ever gets to the file.
I know it's not much data, but I'm trying to automate it as the data will eventually grow and I don't want to manually run this all the time.
Any ideas? I'm using SQL Server 2012 and wrote this SSIS package with VS2015.
Something that occasionally happens with Excel destinations is that hidden formatting will cause the data flow connector to begin writing data at a row other than 1.
If this happens, you'll need to recreate your template from a fresh Excel document, and reconstruct the header as needed.
It can also depend on the buffer size the underlying process uses. I monitored C: drive consumption while the SSIS package was writing to the Excel destination and found that the free space was filling up; as soon as the whole space was used, the package ended with success without writing any rows to the Excel destination. After I cleared enough space on my C: drive (around 2 GB), everything started working fine.
I also found the following thread that might be helpful for someone.
I am confused as to what problems SSIS packages solve. I need to create an application to copy content from our local network to our live servers across a dedicated line that may be unreliable. From our live server the content needs to be replicated across all other servers. The database also needs to be updated with all the files that arrived successfully, so they may be available to the user.
I was told that SSIS can do this, but my question is: is this the right thing to use? SSIS is for data transformation, not for copying files from one network to the other. Can SSIS really do this?
My rule of thumb is: if no transformation, no aggregation, no data mapping and no disparate sources then no SSIS.
You may want to explore Transactional Replication:
http://technet.microsoft.com/en-us/library/ms151176.aspx
and if you are on SQL Server 2012 you can also take a look at Availability Groups: http://technet.microsoft.com/en-us/library/ff877884.aspx
I would use SSIS for this scenario. It has built-in restart functionality ("checkpoints"), which I would use to manage partial retries when your line fails; a sketch of the relevant package properties is below. It is also easy to configure the Control Flow so tasks can run in parallel, e.g. Site 2 isn't left waiting for data if Site 1 is slow.
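As a sketch of what enabling checkpoints involves (the file path is hypothetical, and these are set in the package and task property panes rather than in code):

    Package-level properties:
        SaveCheckpoints    = True
        CheckpointUsage    = IfExists
        CheckpointFileName = \\share\ssis\CopyContent.chk
    On each task that should act as a restart point:
        FailPackageOnFailure = True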
I'm using SymmetricDS to migrate data from a SQL Server database to a MySQL database. My test of moving the data of a database that was not in use worked correctly, though it took around 48 hours to migrate all the existing data. I configured dead triggers to move all current data and triggers to move the newly added data.
When moving data from a live database that is in use, the migration is too slow. In the log file I keep getting the message:
[corp-000] - DataExtractorService - Reached one sync byte threshold after 1 batches at 105391240 bytes. Data will continue to be synchronized on the next sync
I have around 180 tables, and I have created 15 channels for the dead triggers and 6 channels for the triggers. In the configuration file I have:
job.routing.period.time.ms=2000
job.push.period.time.ms=5000
job.pull.period.time.ms=5000
I have no foreign key configuration, so there won't be an issue with that. What I would like to know is how to make this process faster. Should I reduce the number of channels?
I do not know what the issue could be, since the first test I ran went very well. Is there a reason why the threshold is not being cleared?
Any help will be appreciated.
Thanks.
How large are your tables? How much memory does the SymmetricDS instance have?
I've used SymmetricDS for a while, and without having done any profiling on it, I believe that reloading large databases went quicker once I increased the available memory (I usually run it in a Tomcat container); a sketch of how to do that is below.
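Assuming a Tomcat deployment, one way to raise the JVM heap is via a setenv script; the heap sizes below are placeholders to tune for your machine:

    # $CATALINA_BASE/bin/setenv.sh -- picked up by Tomcat's startup scripts.
    export CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx4g"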
That being said, SymmetricDS isn't nearly as quick as some other tools when it comes to the initial replication.
Have you had a look at the tmp folder? Can you see any progress in file size, that is, in the files which SymmetricDS temporarily writes locally before sending the batch off to the remote side? Have you tried turning on more fine-grained logging to get more details? What about database timeouts? Could it be that the extraction queries are running too long and the database just cuts them off?
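If the byte threshold from that log message is what is capping each batch, it may also be worth reviewing the parameter that controls it. I believe it is transport.max.bytes.to.sync (please verify the name against your SymmetricDS version's documentation); you could raise it alongside the job periods you already have, for example:

    # Assumed parameter name; allow roughly 500 MB per sync before deferring to the next run.
    transport.max.bytes.to.sync=524288000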