I know 2008 is outdated, but a need is a need regardless of the tooling.
I am using SSIS 2008 and SQL Server 2008 R2.
My requirements are:
Load data from multiple source files using bulk load in the control flow (each .txt file is about 20 GB; roughly 150 GB in total across 5 files).
The destination is a single SQL table (Tabledata, with 243 columns).
I tried bulk loading the 5 files with SSIS, but the loads blocked each other for a long time. Any help is appreciated.
To get the best performance when loading a table, we want to shovel the data in by the truckload (a table lock). The problem is, there's only room for one truck in the bay. If you want multiple feeds into a table at one time, you're looking at throwing data in by the shovelful instead - that way, 5 workers can be there and their loading won't block each other, but the throughput is lower.
If you're on Enterprise Edition, or you want to go old school with a partitioned view on Standard Edition, then you could have each partition/individual table loaded in parallel. You then have N worker processes pouring the data in as fast as the disk subsystem allows, with none of the contention you're currently experiencing (see the sketch below).
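A minimal sketch of the partitioned-view route, assuming one member table per source file. The table and column names here are hypothetical; your real table carries 243 columns.

-- One member table per input file; each is loaded by its own data flow in parallel.
CREATE TABLE dbo.Tabledata_File1 (LoadFileId INT NOT NULL CHECK (LoadFileId = 1), Col1 VARCHAR(50) /* ...remaining columns... */);
CREATE TABLE dbo.Tabledata_File2 (LoadFileId INT NOT NULL CHECK (LoadFileId = 2), Col1 VARCHAR(50) /* ...remaining columns... */);
-- ...one table per file...
GO
-- Query them as a single set through a view.
CREATE VIEW dbo.Tabledata_All
AS
SELECT LoadFileId, Col1 FROM dbo.Tabledata_File1
UNION ALL
SELECT LoadFileId, Col1 FROM dbo.Tabledata_File2;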
As @David Browne points out, SQL Server supports parallel bulk loads into unindexed heap tables.
To get a BU lock, you need to specify the TABLOCK option on each bulk import stream; the streams then don't block one another.
In your OLE DB Destination, this means using the Fast Load access mode (the default) and checking the Table lock checkbox. As the MSDN article calls out, the destination table needs to be empty, otherwise the lock will be IX-Tab and not BU-Tab.
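For reference, the T-SQL equivalent of one such stream looks roughly like this (the path and delimiters below are assumptions); run one statement per file, each from its own connection, against an empty heap with no indexes:

BULK INSERT dbo.Tabledata
FROM 'D:\loads\file1.txt'      -- hypothetical path
WITH (
    FIELDTERMINATOR = '\t',    -- adjust to the actual delimiter
    ROWTERMINATOR = '\n',
    TABLOCK                    -- lets parallel streams take BU locks instead of blocking
);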
Yes, you can load from multiple sources. I read from multiple sources at the same time, and write using the fast load option.
First, I am new to SSIS so I am still getting the hang of things.
I am using Visual Studio 2019 and SSMS 19.
Regardless, I have set up an OLE DB package that loads a .TSV file into a table in SQL Server (managed via SSMS). The issue is that it took 1 hour and 11 minutes to execute for 500,000 rows.
The data is extremely variable, so I have set up a staging table that is essentially all varchar(max) columns. Once all the data is inserted, I was going to look at some aggregations like max(len(<column_name>)) in order to better optimize the table and the SSIS package (see the query sketch below).
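A minimal sketch of that profiling step, assuming a staging table named dbo.StagingTable (hypothetical name); it generates one MAX(LEN(...)) query per column from the metadata rather than typing out ~300 columns by hand:

-- Generates one profiling query per column of the staging table; run the generated statements afterwards.
SELECT 'SELECT ''' + COLUMN_NAME + ''' AS column_name, MAX(LEN([' + COLUMN_NAME + '])) AS max_len FROM [dbo].[StagingTable];'
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'StagingTable';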
Anyways, there are 10 of these files, so I need to create a ForEach File loop. That would take at minimum (about 1.18 hours) * 10, or roughly 11.8 hours of total runtime.
I thought this was a bit long and created a BULK INSERT Task, but I am having some issues.
It seems very straightforward to set-up.
I added the Bulk Insert Task to the Control Flow tab and went into the Bulk Insert Task Editor dialog box.
From here, I configured the Source and Destination connections. Both of which went very smoothly. I only have one local instance of SQL Server on my machine so I used localhost.<database_name> and the table name for the Destination Connection.
I run the package and it executes just fine without any errors or warnings. It takes less than a minute for a roughly 600 MB .TSV file to load into a SQL Server table with about 300 columns of varchar(max).
I thought this was too quick and it was. Nothing loaded, but the package executed!!!
I have tried searching for this issue with no success. I checked my connections too.
Do I need Data Flow Tasks for Bulk Insert Tasks? Do I need any connection managers? I had to configure Data Flow Tasks and connection managers for the OLE DB package, but the articles I have referenced do not do this for Bulk Insert Tasks.
What am I doing wrong?
Any advice from someone more well-versed in SSIS would be much appreciated.
Regarding my comment about using a derived column in place of a real destination, it would look like 1 in the image below. You can do this in a couple of steps:
Run the read task only and see how long this takes. Limit the total read to a sample size so your test does not take an hour.
Run the read task with a derived column as a destination. This will test the total read time, plus the amount of time to load the data into memory.
If 1) takes a long time, it could indicate a bottleneck from slow read times on the disk where the file is, or a network bottleneck if the file is on a shared drive on another server. If 2) adds a lot more time, it could indicate a memory bottleneck on the machine running SSIS. Note that testing this on the server itself is the best way to measure performance, because it removes a lot of issues that probably won't exist there, such as network bottlenecks and local memory constraints.
Lastly, please turn on the feature noted as 2) below, AutoAdjustBufferSize. This adjusts the settings for DefaultBufferSize (the maximum memory in each buffer) and DefaultBufferMaxRows (the total rows allowed in each buffer; these are the numbers you see next to the arrows in the data flow when you run the package interactively). Because your column sizes are so large, this hints to the server to maximize the buffer size, which gives you a bigger and faster pipeline to push the data through.
One final note: if you add the real destination and that has a significant impact on time, you can look into issues with the target table. Make sure there are no indexes (including a clustered index), make sure TABLOCK is on, and make sure there are no constraints or triggers. The queries below are one way to check.
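A minimal set of checks along those lines, assuming the target table is dbo.StagingTable (hypothetical name):

-- Any index other than the heap itself disables the fastest load path.
SELECT name, type_desc FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.StagingTable') AND type <> 0;

-- Triggers force per-row/per-batch firing during the load.
SELECT name FROM sys.triggers
WHERE parent_id = OBJECT_ID('dbo.StagingTable');

-- Foreign keys and check constraints get validated during the insert.
SELECT name FROM sys.foreign_keys WHERE parent_object_id = OBJECT_ID('dbo.StagingTable');
SELECT name FROM sys.check_constraints WHERE parent_object_id = OBJECT_ID('dbo.StagingTable');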
Need advice here: using Alteryx Designer, I'm pulling a large dataset from SQL Server (10M rows) and need to move it into a Greenplum DB.
I tried both connecting with Input Data (SQL Server) and Output Data (GP), and also Connect In-DB (SQL Server) with Write Data In-DB (GP).
Either approach takes forever, to the point that I have to cancel the process (to give an idea, over the weekend it ran for 18 hours and got no further than 1%).
Any good advice or trick to speed up this sort of massive bulk data load would be very highly appreciated!
I can control or make modifications on the SQL Server and Alteryx side to increase performance, but not in Greenplum.
Thanks in advance.
Regards,
Erick
I'll break down the approaches that you're taking.
You won't be able to use the In-DB tools because the databases are different, so you can't push the processing down to the DB...
Using the standard Alteryx tools, you are bringing the whole table onto your machine and then pushing it out again. There are multiple ways this could be done, depending on where your bottleneck is.
Looking first at the extract from SQL Server, 10M rows isn't that much, so you could split the process and write it out as a .yxdb first. If that fails or takes several hours, then you will need to look at the connection to the SQL Server or the resources available on the SQL Server.
Then for the push into Greenplum, there is no PostgreSQL bulk loader at present, so you can either just try to write the whole table, or you can write segments of the table into temp tables in Greenplum and then execute a command to combine those tables (see the sketch below).
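A rough sketch of that combine step on the Greenplum side, assuming Alteryx has already written hypothetical segment tables temp_chunk_1 through temp_chunk_3 into a staging schema:

-- Combine the segment tables written by Alteryx into the final table.
INSERT INTO target_schema.target_table
SELECT * FROM staging.temp_chunk_1
UNION ALL SELECT * FROM staging.temp_chunk_2
UNION ALL SELECT * FROM staging.temp_chunk_3;

-- Optionally drop the segments afterwards.
DROP TABLE staging.temp_chunk_1, staging.temp_chunk_2, staging.temp_chunk_3;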
We pull millions of rows daily from SQL Server into Greenplum using an open-source tool called Outsourcer. It's a great tool and takes care of cleansing and more. We have been using it for the past 3.5 years with no issues so far. It handles all the parallelism, and millions of rows are loaded within minutes.
It supports incremental or full loads. If you need support, Jon Robert, the owner of Outsourcer, will respond to your email within minutes. Here is the link for the tool:
https://www.pivotalguru.com/
Loading data from my OLTP database (it's part of an ETL process) via OPENQUERY or an SSIS Data Flow to another SQL Server database (the warehouse, which runs this SSIS package / OPENQUERY statement) kills it. As I checked in Performance Monitor, I am using resources from the source database, not from the destination. Is it possible to reverse this resource utilization (using SQL Server 2016 or SSIS)?
The problem here is in your destination write operation. If you are using an OLE DB Destination with the fast load access mode, try setting the Rows per batch value to a non-zero value and reduce the Maximum insert commit size to a value that is easy on your memory and CPU. SSIS will then not have to wait for the default of 2147483647 rows before committing to the destination table, which can otherwise have a large impact on your log file and slow your process down. Please refer to this article for more info on setting these values. All the best.
What does your export query look like? Is it just a simple data dump, or do you have some complex logic in it (e.g. doing some denormalization/aggregation with the export)?
If it's just a simple export, check which server your SSIS package runs on and what resources it uses. In any case, you need to read the data from your source system, so expect some disk read operations there.
In general it is better to get the data out of the OLTP system as quickly as possible and then apply further operations in later steps of your ETL process on your ETL/data warehouse server, in order to reduce the impact on your transactional system.
Hope it helps.
We are seeing enormous amounts of data traffic to and from our SSIS server, and we cannot find the culprit. Is there any way to find out which package is causing all the traffic? Any advice on that? We are thinking that maybe all the merges we do cause the traffic. Our SSIS machine gets data from several production SQL servers and merges that with data in our warehouses. Does that mean that
a) new data is transferred to the SSIS machine,
b) existing data is transferred to the SSIS machine,
c) the merge is done and then all data is transferred to the warehouse?
Then how would you go about limiting all the data moved back and forth?
The answer to your questions a, b and c (if you're using transformation components in SSIS) is essentially “yes: all new data and existing data required for the transformation will flow into the SSIS instance, and the resulting merged data will flow out of the SSIS instance to the target server”. A more detailed explanation is below.
Assuming that you are using SQL Server 2012 and above, you would be able to enable Verbose logging to capture the number of rows transferred. The details are captured in [catalog].[execution_data_statistics]. If you are looking for the size in bytes, you would need to calculate that based on the columns that are being extracted and transformed against the number of rows. The [catalog].[execution_data_statistics] captures package name, task name, data flow path and source/destination component name, the time of execution and execution path, which is great for diagnosing.
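A sketch of how you might aggregate those statistics per package, assuming the project runs in the SSIS catalog (SSISDB) with Verbose logging enabled:

-- Rows moved per package, task and data flow path, summed from the verbose-logging statistics.
SELECT  package_name,
        task_name,
        dataflow_path_name,
        source_component_name,
        destination_component_name,
        SUM(rows_sent) AS total_rows_sent
FROM    SSISDB.[catalog].[execution_data_statistics]
GROUP BY package_name, task_name, dataflow_path_name,
        source_component_name, destination_component_name
ORDER BY total_rows_sent DESC;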
SSIS is an in-memory pipeline. If you have 3 separate servers (Source, SSIS and Target), the amount of data/traffic will vary. As an example, if the Data Flow Tasks require transformation and use components such as Merge, Merge Join, Lookup etc., you can expect data flowing between the Source server, the SSIS server and the Target server.
On the other hand if you are running a simple Data Flow Task with SQL Server Destination for the Target between 2 databases with the same source and target, SSIS will issue a BULK INSERT statement on the target (= source = SSIS server) instance. In this case, there will be very low data traffic across the network (at least not related to the BULK INSERT statement).
If your package contains an “Execute SQL Task” component that invokes MERGE T-SQL statements, this would not cause data traffic into/out of the SSIS server. The activity is done on the SQL Server instance that the MERGE statement is executed on. If you are using linked servers, then the data will flow into/out of the linked server as required by the MERGE statement, just the same as if you were invoking the statement directly on that instance.
My recommendation for limiting the amount of data moved from and to, is to be selective at the source level. For example, if you know that you are only going to be using ColumnA, ColumnB, ColumnC in dbo.Customer, then use
SELECT [ColumnA], [ColumnB], [ColumnC] FROM [dbo].[Customer] -- Better!
instead of the following statement which potentially can retrieve more than those 3 columns:
SELECT * FROM [dbo].[Customer] -- Do Not Use
There are also a number of best practices to optimize SSIS including reducing bandwidth and optimizing the amount of data transferred, that you can follow. Please have a read here: http://blogs.msdn.com/b/sqlcat/archive/2013/09/16/top-10-sql-server-integration-services-best-practices.aspx.
If you are working on Hybrid platform, you may also be interested in reading "SSIS for Azure and Hybrid Data Movement" white paper (https://msdn.microsoft.com/en-us/library/jj901708.aspx). This white paper has an additional link to "SSIS Operational and Tuning Guide" that would be useful as well.
In addition, you may also be interested in having a look at SSIS Reporting Pack available on CodePlex to get more visualization of SSIS executions on the server.
Hope this helps.
Julie
I need to copy the tables and data (about 5 years of data, 6,200 tables) stored in SQL Server. I am using DataStage with an ODBC connection, and DataStage automatically creates each table with its data, but it's taking 2-3 hours per table because the tables are very large (0.5 GB, 300+ columns and about 400k rows).
How can I do this as fast as possible? At this rate I can only copy about 5 tables per day, but I need to move all of these 6,200 tables within 30 days.
6,000 tables at 0.5 GB each would be about 3 terabytes. Plus indexes.
I probably wouldn't go for an ODBC connection, but the question is where is the bottleneck.
You have an extract stage from SQL Server. You have the transport from the SQL Server box to the Oracle box. You have the load.
If the network is the limiting factor, you are probably best off extracting to a file, compressing it, transferring the compressed file, uncompressing it, and then loading it. External tables in Oracle are the fastest way to load data from flat files (delimited or fixed length), preferably spread over multiple physical disks to spread the load, and loaded without logging.
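A rough sketch of that Oracle-side load, with hypothetical directory, file, table and column names (the real external table would carry all of the source columns):

-- External table over the uncompressed extract files; load_dir is an Oracle directory object.
CREATE TABLE stage_customer_ext (
    customer_id   NUMBER,
    customer_name VARCHAR2(200)
    -- ...remaining columns...
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY load_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY '|'
    )
    LOCATION ('customer_1.dat', 'customer_2.dat')   -- files spread across disks
);

-- Direct-path insert into the real (NOLOGGING) table.
INSERT /*+ APPEND */ INTO customer SELECT * FROM stage_customer_ext;
COMMIT;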
Unless there's a significant transformation happening, I'd forget datastage. Anything that isn't extracting or loading is excess to be minimised.
Can you do the transfer of separate tables simultaneously in parallel?
We regularly transfer large flat files into SQL Server and I run them in parallel - it uses more bandwidth on the network and SQL Server, but they complete together faster than in series.
Have you thought about scripting out the table schemas, creating them in Oracle, and then using SSIS to bulk-copy the data into Oracle? Another alternative would be to use linked servers and a series of "SELECT * INTO xxx" style statements to copy the schema and data over (minus key constraints), but I think the performance would be quite pitiful with 6,000 tables.
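For what it's worth, a minimal sketch of the linked-server route, assuming a linked server named ORA_DEST pointing at the Oracle instance and a pre-created target table (SELECT ... INTO only creates local tables, so against a remote Oracle table it becomes an INSERT ... SELECT):

-- Push one table's rows across the linked server; repeat (or generate) one statement per table.
INSERT INTO [ORA_DEST]..[APP_SCHEMA].[CUSTOMER]
SELECT ColumnA, ColumnB, ColumnC
FROM   dbo.Customer;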