SSIS 2014 ADO.NET connection to Oracle slow for 1 table (the rest is fine) - sql-server

We're in the middle of doing a new data warehouse roll-out using SQL Server 2014. One of my data sources is Oracle, and unfortunately the recommended Attunity component for quick data access is not available for SSIS 2014 just yet.
I've tried to avoid OLE DB, as that requires the installation of specific Oracle client tools that have caused me a lot of frustration before, and with the Attunity components supposedly in the works (Microsoft had promised they would arrive back in August), I'm reluctant to go through that ordeal again.
Therefore, I'm using ADO.NET. All things considered, performance is acceptable for the time being, with the exception of one particular table.
This particular table in Oracle has a bunch of varchar columns, and I've come to the conclusion that the width of the selected rows is what makes this table perform particularly slowly. To prove that, rather than selecting all columns at their defined widths (which is what my original package did), I truncated each width to the maximum length of the values actually stored (e.g. CAST(column AS varchar(46))). This reduced the run time of the same package to 17 minutes (still well short of what I'd call acceptable, and not something I'd put in production because it would open up a world of future pain, but it proves the column widths are definitely a factor).
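For illustration, the narrowed Oracle-side query looks roughly like the sketch below; the table and column names here are made up, and the widths would have to match the longest values actually stored:

    -- Hypothetical sketch: shrink oversized VARCHAR2 columns in the Oracle source query
    -- so each row transferred over ADO.NET is as narrow as possible.
    SELECT
        order_id,
        CAST(customer_name AS VARCHAR2(46))  AS customer_name,
        CAST(shipping_note AS VARCHAR2(120)) AS shipping_note,
        order_date
    FROM sales.order_history;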
I increased the network packet size in SQL Server, but that did not seem to help much. I have not managed to figure out a good way to alter the packet size on the ADO.NET connector for Oracle (SQL Server does have that option). I tried adding Packet Size=32000; to the connection string for the Oracle connector, but that just threw an error, indicating it simply won't be accepted. The same applies to FetchSize.
Eventually, I came up with a compromise where I split the load into three different parts, dividing the varchar columns between them, and used two MERGE JOIN components to, well, merge the data back into a single combined dataset. Running that and doing some extrapolation leads me to think that method would take roughly 30 minutes to complete (but without the potential data loss of the CAST solution above). However, that's still not acceptable.
I'm currently in the process of trying some other options (not using MERGE JOIN but dumping into three different tables and then merging those on the SQL Server itself, and splitting the package up into even more separate loads in an attempt to further speed up the individual parts), but surely there must be something easier.
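For reference, the "merge on the SQL Server side" variant I'm trying looks roughly like the sketch below; the table and column names are placeholders, with each staging table carrying the business key plus its share of the varchar columns:

    -- Hypothetical sketch: recombine the three partial loads on SQL Server
    -- instead of using MERGE JOIN inside the data flow.
    INSERT INTO dbo.OrderHistory (OrderId, ColGroupA1, ColGroupA2, ColGroupB1, ColGroupC1)
    SELECT a.OrderId, a.ColGroupA1, a.ColGroupA2, b.ColGroupB1, c.ColGroupC1
    FROM staging.OrderPartA AS a
    JOIN staging.OrderPartB AS b ON b.OrderId = a.OrderId
    JOIN staging.OrderPartC AS c ON c.OrderId = a.OrderId;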
Does anyone have experience with how to load data from Oracle through ADO.NET, where wide rows would cause delays? If so, are there any particular guidelines I should be aware of, or any additional tricks you might have come across that could help me reduce load time while the Attunity component is unavailable?
Thanks!

The updated Attunity drivers have just been released by Microsoft:
Hi all, I am pleased to inform you that the Oracle and TeraData connector V3.0 for SQL14 SSIS is now available for download!!!!!
Microsoft SSIS Connectors by Attunity Version 3.0 is a minor release. It supports SQL Server 2014 Integration Services and includes bug fixes and support for updated Oracle and Teradata product releases.
For details, please look at the download page.
http://www.microsoft.com/en-us/download/details.aspx?id=44582
Source: https://connect.microsoft.com/SQLServer/feedbackdetail/view/917247/when-will-attunity-ssis-connector-support-sql-server-2014

Related

Alteryx - bulk copy from SQL Server to Greenplum - need tips to increase performance

Need advice here: using Alteryx Designer, I'm pulling a large dataset from SQL Server (10M rows) and need to move it into a Greenplum DB.
I tried both connecting with Input Data (SQL Server) and Output Data (GP), and also Connect In-DB (SQL Server) and Write Data In-DB (GP).
Either approach takes forever to complete, to the point that I have to cancel the process (to give an idea, over the weekend it ran for 18 hours and advanced no further than 1%).
Any good advice or tricks to speed up this sort of massive bulk data load would be very highly appreciated!
I can control or make modifications on the SQL Server and Alteryx side to increase performance, but not in Greenplum.
Thanks in advance.
Regards,
Erick
I'll break down the approaches that you're taking.
You won't be able to use the In-DB tools, as the databases are different, so you can't push the processing down to the DB.
Using the standard Alteryx tools, you are bringing the whole table onto your machine and then pushing it out again; there are multiple ways this could be done, depending on where your blockage is.
Looking first at the extract from SQL Server: 10M rows isn't that much, so you could split the process and write it out as a .yxdb. If that fails or takes several hours, then you will need to look at the connection to the SQL Server or the resources available on the SQL Server.
Then for the push into Greenplum: there is no PostgreSQL bulk loader at present, so you can either just try to write the whole table, or you can write segments of the table into temp tables in Greenplum and then execute a command to combine those tables.
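As a rough sketch of that last option (all table names here are hypothetical), the combine step in Greenplum could be a plain INSERT ... SELECT over the per-segment staging tables once they have landed:

    -- Hypothetical sketch: combine per-segment staging tables in Greenplum
    INSERT INTO target_schema.big_table
    SELECT * FROM staging.big_table_part_1
    UNION ALL
    SELECT * FROM staging.big_table_part_2
    UNION ALL
    SELECT * FROM staging.big_table_part_3;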
We are pulling millions of rows daily from SQL Server to Greenplum, and we use an open source tool called Outsourcer. It's a great tool and takes care of cleansing and more. We have been using this tool for the past 3.5 years with no issues so far. It takes care of all the parallelism, and millions of rows are loaded within minutes.
It supports incremental or full loads. If you need support, Jon Robert, the owner of Outsourcer, will respond to your email within minutes. Here is the link for the tool:
https://www.pivotalguru.com/

SSIS Transferring Data to an Oracle DB is Extremely Slow

We are transferring data to an Oracle Database from two different sources and it's extremely slow.
Please see the notes below. Any suggestions?
Notes:
We're using the Microsoft OLE DB Provider for Oracle.
One data source is SQL Server and includes about 5M records.
The second data source is Oracle and includes about 700M records.
When trying to transfer the SQL Server data, we broke it up into five "Data Flow Tasks" in the "Control Flow". Each "Data Flow Task" in turn uses an "OLE DB Source", which internally uses a "SQL command" that effectively selects 1M of the 5M records. When we ran this package, the first data flow task ran for about 3 hours and had only transferred about 50,000 records by the time we ended the process.
We had similar experience with the Oracle data as well.
For some reason saving to an Oracle Destination is extremely slow.
Interestingly, we once transferred the same 700M records from Oracle to SQL Server (so the opposite direction) and it worked as expected in about 4.5 to 5 hours.
On the Oracle side you can examine v$session to see where the time is being spent (if AWR is licensed on the Oracle instance you can use DBA_HIST_ACTIVE_SESS_HISTORY or v$active_session_history).
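For example, queries along these lines (standard v$ views; adjust the filters to pick out your loading session) show where the time is going:

    -- Sketch: what are the active sessions waiting on right now?
    SELECT sid, username, program, sql_id, event, wait_class, seconds_in_wait
    FROM   v$session
    WHERE  status = 'ACTIVE'
    AND    username IS NOT NULL;

    -- If AWR/ASH is licensed, look at recent history instead:
    SELECT sample_time, session_id, sql_id, event, wait_class
    FROM   v$active_session_history
    WHERE  sample_time > SYSDATE - 1/24;  -- last hour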
I work on Oracle performance problems every day (over 300 production Oracle instances), so I feel qualified to say that I can't give you a specific answer to your question, but I can point you in the right direction.
Typical process mistakes that make inserts slow:
not using array inserts
connecting to the DB for each insert (sounds strange? believe me, I've seen DataStage and other ETL tools set up this way)
app server/client not on the same local area network as the Oracle instance
indexes on the table(s) being inserted into (especially problematic with bitmap indexes); each statement then requires both an index update and a table update
redo log files too small on the Oracle instance, driving up redo log file switching (see the queries sketched after this list)
log_buffer parameter on the DB side too small
not enough DB writers (see the db_writer_processes initialization parameter)
committing too often
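As referenced in the list above, here are two quick diagnostic queries for the redo-log and index points; the schema and table names are placeholders:

    -- How often are the redo logs switching? Many switches per hour
    -- during the load usually means the logs are too small.
    SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS log_hour, COUNT(*) AS switches
    FROM   v$log_history
    GROUP  BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
    ORDER  BY log_hour;

    -- What indexes (bitmap ones especially) sit on the table being loaded?
    SELECT index_name, index_type, status
    FROM   all_indexes
    WHERE  table_owner = 'MY_SCHEMA'   -- hypothetical owner
    AND    table_name  = 'MY_TABLE';   -- hypothetical table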
Not an answer, just a bunch of observations and questions...
Any one of the components in the data pipeline could be the bottleneck.
You first need to observe the row counts when running interactively in SSIS and see if there is any obvious clogging going on - i.e. do you have a large rowcount right before your Data conversion transformation and a low one after? Or is it at the Oracle destination? Or is it just taking a long time to come out of SQL? A quick way to check the SQL side is to dump it to a local file instead - that mostly measures the SQL select performance without any blocking from Oracle.
When you run your source query in SQL Server, how long does it take to return all rows?
Your data conversion transformation can be performed in the source query. Every transformation requires setting up buffers, memory, etc., and can slow down and block your dataflow. Avoid these and do the conversion in the source query instead.
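For example (hypothetical column names), the casts can live directly in the OLE DB Source's SQL command rather than in a Data Conversion transformation:

    -- Sketch: do type conversions in the source query instead of
    -- in a Data Conversion transformation inside the data flow.
    SELECT
        CAST(OrderId AS INT)              AS OrderId,
        CONVERT(NVARCHAR(50), CustomerNo) AS CustomerNo,
        CAST(Amount AS DECIMAL(18, 2))    AS Amount
    FROM dbo.SourceOrders;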
There are various buffers and config settings in the Oracle driver, already addressed in detail by @RogerCornejo. For read performance out of Oracle, I have found that altering FetchBufferSize made a huge difference, but you are doing writes here, so that does not apply.
Lastly, where are the two database servers and the SSIS client tool situated, network-wise? If you are running this across three different servers, then you have network throughput to consider.
If you use a linked server as suggested, note that SSIS doesn't do any processing at all, so you take that whole piece out of the equation.
And if you're just looking for the fastest way to transfer data, you might find that dumping to a file and bulk inserting is the fastest.
Thank you all for your suggestions. For those who may run into a similar problem in the future, I'm posting what finally worked for me. The answer was ... switching the provider. The ODBC and Attunity providers were much faster, by a factor of almost 800x.
Remember that my goal was to move data from a SQL Server Database to an Oracle database. I originally used an OLE DB provider for both the source and destination. This provider works fine if you are moving data from SQL Server to SQL Server because it allows you to use the "Fast Load" option on the destination which in turn allows you to use batch processing.
However, the OLE DB provider doesn't allow the "Fast Load" option with an Oracle DB as the destination (I couldn't get it to work and read elsewhere that it doesn't work). Because I couldn't use the "Fast Load" option I couldn't batch, and instead was inserting records row by row, which was extremely slow.
A colleague suggested trying ODBC, and others suggested trying Microsoft's Attunity Connectors for Oracle. I didn't think the difference would be so great, because in my experience ODBC had similar (and sometimes worse) performance than OLE DB (I hadn't tried Attunity). BUT... that was when moving data from and to a SQL Server database, i.e. staying in the Microsoft world.
When moving data from a SQL Server database to an Oracle database, there was a huge difference! Both ODBC and Attunity out performed OLE DB dramatically.
Here were my summarized performance test results inserting 5.4M records from a SQL Server database to an Oracle Database.
When doing all the work on one local computer.
OLE DB source and destination inserted 12 thousand records per minute which would have taken approx. 7 hours to complete.
ODBC source and destination inserted 9 Million records per minute which only took approx. 30 seconds to complete.
When moving data from one network/remote computer to another network/remote computer.
OLE DB source and destination inserted 115 records per minute which would have taken approx. 32 days to complete.
ODBC source and destination inserted 1 Million records per minute which only took approx. 5 minutes to complete.
Big difference!
Now, why it took only 30 seconds locally but 5 minutes remotely is another issue for another day; for now I have something workable (I expected it to be slower over the network, but I'm surprised it's that much slower).
Thanks again to everyone!
Extra notes:
My OLE DB results were similar with either Microsoft's or Oracle's OLE DB provider for Oracle databases.
Attunity was a little faster than ODBC. I didn't get to test on remote servers or on a larger data set, but locally it was consistently about 2 to 3 seconds faster than ODBC. Those seconds could add up on a large data set, so take note.

SSIS Stored Procedure uses Temp Table 2008 and 2014

I'm currently writing an SSIS package that retrieves data from a stored procedure via an OLE DB Source. The stored procedure contains a rather nasty query that I've been able to improve with the use of temp tables. If I switch these temp tables to table variables, the logical reads jump from about 1.3 million to about 56 million. I'm uncomfortable enough with the 1.3 million, but there is no way that I can be satisfied with the 56 million logical reads. Thus, I can't really convert the temp tables to table variables.
However, SSIS (or rather SQL Server) can't resolve the metadata for this query, so the package won't run. I've found a few different solutions online, but none of them seem to work for both SQL Server 2008 and SQL Server 2014. We are currently in the process of upgrading all of our servers to 2014, and this particular package runs against 2008 in DEV, 2014 in QA, and 2008 in production currently. By the fall, the PROD tier will be 2014, and the DEV tier will be promoted sometime after that. Unfortunately, I can't wait until these upgrades happen to release this SSIS package. The data needs to start moving by next week. Thus, I need to figure out a way to get the metadata resolved for both environments. Here's what I've tried so far:
Add a dummy select in an IF 1=0 block which returns the proper metadata (sketched after this list). This works in 2008, but not 2014.
Use SET FMTONLY OFF at the beginning of the stored procedure. This works in 2008, but not 2014. Furthermore, it causes the stored procedure to run once for each column returned (over 30 in this case), which is a deal-breaker even if it did work.
Use EXEC ... WITH RESULT SETS (( ... )); (also sketched below). This works in 2014, but not in 2008.
Deploy a stored procedure which returns the proper metadata, build and deploy the SSIS package, then modify the stored procedure to the proper version. This hasn't seemed to work in either environment, and it would complicate any other ETL applications developed within our ETL framework.
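For clarity, here is roughly what the first and third attempts look like; the procedure and column names are invented for the sake of the example:

    -- Attempt 1: dummy SELECT inside the procedure that only exists to expose metadata
    -- (works on 2008, not 2014)
    IF 1 = 0
    BEGIN
        SELECT CAST(NULL AS INT)          AS OrderId,
               CAST(NULL AS NVARCHAR(50)) AS CustomerNo,
               CAST(NULL AS DATETIME)     AS OrderDate;
    END;

    -- Attempt 3: describe the result shape at the call site
    -- (works on 2014, not 2008)
    EXEC dbo.usp_GetOrders
    WITH RESULT SETS
    ((
        OrderId    INT,
        CustomerNo NVARCHAR(50),
        OrderDate  DATETIME
    ));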
If I can't figure anything out, I could deploy different stored procedures and packages to the different tiers, but I would very much prefer not to. For one, this would complicate future releases, and I would also need to make sure I don't forget to update the stored procedure and package once we upgrade the servers.
I could also make real tables in the database which would take the place of these temp tables. I don't really like this solution, but it's something that I could tolerate. If I end up doing this, I would probably switch to using the WITH RESULT SETS in the future.
However, I personally don't care much for either of these solutions, so I was wondering if there is any workaround that I missed that might work a bit better.
Despite your reluctance, I think you've made the right choice and a dedicated staging area is the right way to go. Most of the production ETLs I've worked with have a dedicated staging database, never mind tables. You then have the benefit of being able to control the storage more explicitly, which makes performance more reliable and the whole thing generally more maintainable. For example, you can create a dedicated contiguous block of fast disk space for these tables with their own file group etc. I'd certainly rather see 2 separate SPs relying on a few physical tables than a really gnarly single one.
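As a rough illustration of controlling the storage explicitly (every name and path below is invented), the staging filegroup setup could look something like this:

    -- Sketch: dedicated filegroup and staging table (hypothetical names and paths)
    ALTER DATABASE MyWarehouse ADD FILEGROUP Staging;
    ALTER DATABASE MyWarehouse
        ADD FILE (NAME = N'MyWarehouse_Staging01',
                  FILENAME = N'S:\Data\MyWarehouse_Staging01.ndf',
                  SIZE = 4GB)
        TO FILEGROUP Staging;

    CREATE TABLE dbo.OrderStaging
    (
        OrderId    INT          NOT NULL,
        CustomerNo NVARCHAR(50) NULL,
        OrderDate  DATETIME     NULL
    ) ON Staging;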
That said, without knowing any specifics this is just my experience, so a caveat for future readers: As with all things database, be sure to measure the actual performance of your scenario (before and after) rather than making any assumptions based on the query plan - it might be misleading you.

SQL Server Using TableDiff on large tables

We have a process which uses SQL Server's amazing tableDiff via:
Microsoft SQL Server\100\COM\Tablediff.exe
It's SQL Server 2008 R2. It connects from one instance to another identical instance. It works very well!
I have a situation where a table which now has 10,767,594 records is taking 2.5 hours to complete, and the job only contains this one table. How can I improve this?
The process is triggered by a Windows Scheduled Task, which calls a .bat file; the .bat file contains the recommended code and has no issues. We have a couple of these in place and have had for some time. It's just the one job that deals with the big table, instance to instance, that is taking too long.
I have realised that the source table does have an index but the destination table does not. I will put an index on this table; what else can I do?
Does table diff run better with indexes?
Is there a way to use table diff more effectively?
E.g. if I capture the lastProcessedID can I run tableDiff next time for all records where id > lastProcessedID?
Any advice would be great. Thank you in advance
EDITED:
MY SOLUTION - This was a very big surprise. As I mentioned above, the 10 million+ record table was identical on the source and destination except for 2 indexes, which existed only on the source. After waiting for an out-of-hours window (this is an internal production server), I applied the matching indexes to the destination. Now the tableDiff job, which has not been changed at all, completes in under 2 minutes. From 2.5 hours to 2 minutes!
I have accepted the answer below because it was very helpful. I did go down the Merge Replication path; however, after setting up replication and publishing, I found out that the production instance was not able to be a subscriber because the replication feature had not been selected during installation. As Jason says, it's a reasonable amount of research, learning and setting up. Since I am not a DBA and had not looked at this before, it was a worthwhile experience.
The performance issue is because the remote queries pull every record from each place to do the comparison to generate the output. Indexes can help slightly to make the pull a little faster from each location, but it's not likely to be significant.
An incremental approach is definitely better. I don't believe tablediff directly supports comparing 2 queries. If it did, you could do something like EXCEPT or INTERSECT to do the comparisons. If you're trying to keep these databases in sync, why not consider other solutions, like log shipping, mirroring, SSIS, replication, clustering, etc.
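If a query-based, incremental comparison is an option outside of tablediff, a rough sketch over a linked server might look like this (the server, table, and column names are placeholders, and @LastProcessedId is the watermark idea from the question):

    -- Sketch: rows present on the source but missing or different on the destination,
    -- restricted to IDs past the last processed one for an incremental pass.
    DECLARE @LastProcessedId BIGINT = 0;  -- hypothetical watermark from the previous run

    SELECT Id, Col1, Col2
    FROM   dbo.BigTable
    WHERE  Id > @LastProcessedId
    EXCEPT
    SELECT Id, Col1, Col2
    FROM   [RemoteInstance].DestDb.dbo.BigTable
    WHERE  Id > @LastProcessedId;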

SQL Server vs. Access insert performance, in particular when using GUID

I'm interested to know how I could improve the insert performance of SQL Server when using sequential GUIDs, with Access 2007 as a front end to SQL Server 2008 (please note this is the only context I'm interested in).
I have made some tests (and gotten some fairly surprising results), in particular for SQL Server with a sequential GUID: the insert performance degrades very quickly, and it doesn't seem right to me that it should degrade so quickly.
Basically, the test is as follows:
From the Access front-end, using VBA only, insert 100,000 records in batches of 1,000, sequentially.
I tried it both with an Identity column and with a sequential GUID as the PK.
I tried it against SQL Server 2008 Standard (no special tweaking, just a default install) and against an Access 2007 database as the back-end. All tables are linked into the front-end.
Some of the results (more, with raw data, are available on my blog entry about the test):
It's clear that, as the database grows, the insert performance is reduced but SQL Server isn't performing very well at all here.
http://blog.nkadesign.com/wp-content/uploads/2009/04/chart02.png
Expanded view of the results for SQL Server:
http://blog.nkadesign.com/wp-content/uploads/2009/04/chart03.png
Edit 13APR2009
I've found an issue with my server configuration and I updated the tests on my blog.
Thanks to all for your replies, they helped me a lot.
There are two things at play here. First, it's important to point out that SQL Server doesn't necessarily work very well for a specific use case out of the box. It is a professional product designed to be tuned by a person who knows what they're doing.
By comparison, Access is designed to work very well for most use cases without any configuration. The downside of this trade-off is covered in the second point:
SQL Server is designed for scalability. Notice how Access degrades severely with only 100,000 records; it would probably drop steeply below SQL Server's line well before a million. By comparison, SQL Server holds almost perfectly steady, with the variation stabilizing after about 45,000 records, and will continue to hold at many millions.
Edit: I think there may also be something else at play here that we're not seeing. I thought your SQL Server numbers looked bad, so I ran a test of my own. On my desktop running Windows Vista (3.6 GHz, 2 GB of RAM), inserts with a sequential GUID on SQL Server performed as follows:
Average of 1382 inserts per second at 0 records
Average of 1426 inserts per second at 500k records
Averaging 1609.6 inserts per second from 0 to 500k with an average floor of 992 inserts/sec and an average ceiling of 1989 inserts/sec.
So accounting for the normal variance incurred by running this on an in-use desktop, I'd say SQL Server inserts basically scale linearly from 0 records to half a million. On a dedicated, tuned server I'd expect even more consistency (not to mention far better performance):
Excel chart, inserts per second http://img24.imageshack.us/img24/9485/insertspersecond.jpg
My question is whether your test setup represents the reality of your application or not. In short, are you testing the right thing?
Is your app going to be appending large numbers of records one at a time?
Or is it going to be appending batches of records based on a SQL SELECT?
If the latter, you might look at trying to do it all server-side, particularly if the source table(s) in the SELECT are on the server. It's important to realize that with ODBC, a batch append is going to be sent to the SQL Server as a single insert for every single row (very similar to the recordset-based approach in your test code). If you move the same process entirely server-side, it can be done as a set-based batch operation.
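In other words (with hypothetical table names), instead of looping over a recordset in VBA, the whole batch can be appended in one server-side statement:

    -- Sketch: set-based append done entirely on the server,
    -- replacing a VBA loop that inserts one row at a time through ODBC.
    INSERT INTO dbo.TargetOrders (OrderId, CustomerNo, OrderDate)
    SELECT s.OrderId, s.CustomerNo, s.OrderDate
    FROM   dbo.SourceOrders AS s
    WHERE  s.OrderDate >= '20090101';   -- whatever filter the original SELECT used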
Also, you should test again using ADO instead of DAO. It may optimize the operation completely differently.
Last of all, someone brought to my attention just this past week this fascinating article by Andy Baron:
Optimizing Microsoft Office Access Applications Linked to SQL Server
I'm still absorbing the contents of that very useful article, and it discusses several issues in regard to non-GUID-specific topics that may help you optimize your process for maximum efficiency.
You realize at least part of the decreasing performance is the log filling up, and that a GUID is, what, 12 bytes longer than an int (16 bytes vs. 4)?
But I'm not quibbling; it's good to see someone taking actual metrics rather than just handwaving. Modded up.
Where are you getting the data from?
Does it change the numbers if you use the Access Export menu options rather than record-at-a-time-in-a-loop?
VBA is really sensitive to the connection parameters too, and there are lots of options that aren't necessarily intuitive.
If an identity column is acceptable, why are you even considering a sequential GUID (which is something of a tacked-on facility in MSSQL, last I checked)?
EDIT:
Looking at your code and briefly reviewing the Recordset docs on MSDN, I see you may be able to use more efficient parameters, e.g. dbSeeChanges and dbOpenDynaset, which are appropriate if you need to allow for other users modifying the same rows (or need to get back the inserted IDENTITY value, or probably the GUID), but I don't think you need those. In essence, after every INSERT or UPDATE, you're reading the record back from the database into VBA. I'd read through those connection config settings carefully, and I bet you'll come up with something a lot more satisfactory.
The last time I saw something like that (really slow insertion with a GUID PK) was because the log file was filling up. Insertion performance was dropping like a stone, pretty fast (no hard measurements, just watching live traces, but it sure looked like it was kind of logarithmic). This was pre-loading of historical data.
Moving over to an identity PK and taking care of actually cleaning up the log file made everything go much better afterwards (a couple of hours, where the first version took several hours and still wasn't finished).
Also, just a thought: are there any transactions involved? Maybe SQL Server transactions create a big performance hit that Access does not have (given that Access is not really geared towards concurrent access).
