optimizing a lookup task in ssis - sql-server

I got this doubt about this kind of queries. I am migrating an ETL from Access to SSIS. One query involves an Inner Join with a table in an Oracle Database:
SELECT
SQL_TABLE.COLUMN1,
SQL_TABLE.COLUMN2,
ORACLE_TABLE.COLUMN5,
ORACLE_TABLE.COLUMN6
FROM
SQL_TABLE INNER JOIN ORACLE_TABLE ON
SQL_TABLE.ID_PPAL = ORACLE_TABLE.IDENTIF
WHERE
(((ORACLE_TABLE.COLUMN6) Is Not Null));
The issue is, the Oracle table has more than 18 million registers and the sql table has less than 300 records. The Inner Join should gives something like 2500 records as a result.
First I tried using a merge join task as you can see in the picture, but this is not efficient at all because of the characteristics of the tables, but looking for a possible situation someone proposed me using a look up task, but this only gives me one record for every match it founds, and this is not useful for me, I can not lose any record.
I wonder if is there another way to perform this query, because I can not believe that access would be more efficient than SSIS in this aspect.

In my experience SQL Server will not optimize queries involving Oracle. The fastest approach I found was 1) Use Oracle Drivers to access data from SSIS. 2) Use fast load (with table lock) to load the Oracle table (with a where condition if appropriate) into a SQL Server Work Table. 3) Create a clustered index the table. 4) Do the join. If you are going to reuse the package you will want to truncate the work table and drop the index as the first two steps of the package.

You should check any filters or try to do joins in Oracle database and thus leaking a little. If the result is incorrect, try using variables to store data and create scripts.
This can serve you:
http://www.bidn.com/blogs/ShawnHarrison/ssis/4579/looping-through-variable-values-with-a-foreach-loop-container

Related

what would be the best way to copy table from server A to server B in sql 2012?

I have table A that is on server 1 and table B that is on server 2.
Table contain around 1.5 million rows.
What would be the fastest way to copy table A to server B? On nightly basis.
Or what would be the fastest way to bring only records that changed in table A and bring it to table B?
So far I tried MERGE along with HASHBYTES function to only capture records that changed. It works perfectly if target and source tables are on the same server. (takes approx 1 min).
But if target is on server B but the source is on server A - than it takes more than 15 min.
What is on your opinion the best and fastest technique for such operations?
Some sorts of replications? Or maybe SSIS would be the best for that?
My 2 cents. Since you qualified your question with "On nightly basis", I'd say do this in SSIS.
I would use SSIS, it is designed to do fast large data copies between servers.
Also, if you can drop table B then you could try using SELECT INTO rather than INSERT INTO.
SELECT INTO is much faster as it is minimally logged but note that table B will be locked while the insert is running.
You could also try disabling indexes on Table B before you insert and re enabling them later.

SSIS - How to improve a work flow

Previously I have asked for a possible solution for a situation that I had to face in order to implement a sql query (which is implementing originally in access). I have reach a solution (after asking a lot) but I would like to know if anyone has another way to execute this query.
I have got two different tables, one in sql and another in oracle (S and O)
O(A, B, C) => PK=(A,B) and S(D,E,F) => PK = (D,E)
The query looks like this
SELECT A,B,C,E,F
FROM S INNER JOIN O ON
S.D = O.A (Only one attribute of the PK in O)
S has over 10.000 registers and O more than 700 millions. Given this, is not logic to implement a merge join, or a look up because I will have only the first match between D and A.
So I thought that it will be better to assemble the query in the Oracle side. To do this I have implemented an scheme like this.
With the sql I have executed this query:
with tmp(A) as ( select distinct D as A from S
)
select cast( select concat(' or A = ', A)
from tmp
for xml path('')) as nvarchar(max)) as ID
I am getting a string with the values that I gonna search on oracle.
Finally in the data flow, I am creating an expression like this:
select A, B, C
from O
where A= '' + #ID
I downnload this values to sql server and then I am able to manipulate them as I wish.
The use of the foreach loop was necessary because I am storing the string of sql inside an object variable. I found that SSIS has some troubles with the nvarchar(max) variables.
Some considerations:
1) The Oracle database is administered for another area of the company and they only gives reading permissions over the tables.
2) The DBA of the sql server does not allow to download the O table on a staging area. Not possibilities of negotiations with him, besides, this tabla is updated every day with more registers. He only manages this server and does not have any authority over Oracle.
3) The solution that was given for some members of my team was to create a query in oracle between different tables that can give me the attributes of O that I need, as a result I could get more than 3 millions of register and not all of the attributes A are presented in S. Even more, some the values of D has been manipulated, so possibly they are not going to be present in O.
With this implementation I am getting more than 150.000 registers from Oracle. But I would like to know if another solution can be implemented or if there are other components that I can use to reach the same results. Believe me when I say that I have read, asked and searched a lot before to implement this flow.
EDITED:
Option 1 (You say that you cannot use this solution – but it would be the first one – the best)
Use a DBLink to let Oracle access S table (you must use Oracle Database Gateway). Create a view in Oracle joining O and S. And finally use linked server to let SQL Server access the Oracle Joining view and get the results.
The process is as follow:
You must convince your Oracle DBA to configure the Oracle Database Gateway for SQL Server (see
http://docs.oracle.com/cd/B28359_01/gateways.111/b31043/conf_sql.htm#CIHGADGB)
. When it is properly configured then you can create a DBLink from SQL Server to Oracle. With the DBLink Oracle will have have a direct
access to S table.
Now create a view V just joining O and S table.
As you want the result back in your SQL Server and you cannot use
SSIS then you can proceed as described in:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/111df59c-b309-4d59-b56c-9cd5574ee181/how-to-access-oracle-table-from-sql-server-?forum=transactsql
Option 2 (You say that you cannot use this solution – but it would be the second one)
As your Oracle admins seem to be monsters that will kill you if they get their paws on you. Then you can try (if they let you create a table in oracle):
Create a linked server in SQL Server (to access Oracle from SQL
Server). As I mentioned in the "normal case".
And Create a (temporary) table in Oracle schema with only 1 column (it will store D values from SQL Server)
Everytime you need to evaluate your query execute in SQL Server:
INSERT INTO ORACLE_LINKED_SERVER.ORACLE_OWNER.TEMP_TABLE
SELECT DISTINCT D FROM S;
SELECT * FROM OPENQUERY('SELECT * FROM ORACLE_OWNER.O WHERE A IN (SELECT D FROM ORACLE_OWNER.TEMP_TABLE)');
And finally don't forget to delete the Oracle's temp table:
DELETE * FROM ORACLE_LINKED_SERVER.ORACLE_OWNER.TEMP_TABLE;
Option 3 (If you have an Oracle license and one available host)
You can install your own Oracle server in your host and use Option 2.
Option 4
If your solution is really the only way out, then let's try to improve it a little bit.
As you know, your solution works but it is a little bit aggressive (you are transforming a relational algebra semijoin operator into a relational algebra selection operator with a monster condition). You say that the Oracle table is updated everyday with more register, but if the update rate of your tables are lower than your query rate then you can create a result cache that you can use while the tables S or O are not changed.
Proceed as follows:
Create a table in your SQL Server to store the Oracle result of your monster query. And before build and launch your query execute this:
SELECT last_user_update
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID( 'YourDatabaseName')
AND OBJECT_ID=OBJECT_ID('S')
This returns the most recent time when your table S was update. Store this value in a table (create a new table or store this value in a typical parameter table).
Create your monster query. But before launch it, send this query to Oracle:
SELECT MAX(ORA_ROWSCN)
FROM O;
It returns the last SCN (System Change Number) that cause a change in the table. Store this value in a table (create a new table or store this value in a typical parameter table).
Launch the big query and store its result into the cache table.
Finally, when you need to repeat the big query, first execute in your SQL Server:
SELECT last_user_update
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID( 'YourDatabaseName')
AND OBJECT_ID=OBJECT_ID('S')
And execute in Oracle:
SELECT MAX(ORA_ROWSCN)
FROM O;
If one or both values have changed with respect the one you have stored in your parameter table, then you must store them in the parameters table (updating the old values) and launch again the big query. But if none of the values have changed, then your cache is up to date, and you can use it.
Note that
SCN is not absolutely precise, but it is a good approximation (see: http://docs.oracle.com/cd/B19306_01/server.102/b14200/pseudocolumns007.htm)
The greater is your query rate with respect to your update rate, the better is this solution.
If you can tolerate working with old values, then you can improve the cache with expiration time.

Optimization problems with View using Clustered Index Insert on tempdb on SQL Server 2008

I am creating a Java function that needs to use a SQL query with a lot of joins before doing a full scan of its result. Instead of hard-coding a lot of joins I decided to create a view with this complex query. Then the Java function just uses the following query to get this result:
SELECT * FROM VW_####
So the program is working fine but I want to make it faster since this SELECT command is taking a lot of time. After taking a look on its plan execution plan I created some indexes and made it +-30% faster but I want to make it faster.
The problem is that every operation in the execution plan have cost between 0% and 4% except one operation, a clustered-index insert that has +-50% of the execution cost. I think that the system is using a temporary table to store the view's data, but an index in this view isn't useful for me because I need all rows from it.
So what can I do to optimize that insert in the CWT_PrimaryKey? I think that I can't turn off that index because it seems to be part of the SQL Server's internals. I read somewhere that this operation could appear when you use cursors but I think that I am not using (or does the view use it?).
The command to create the view is something simple (no T-SQL, no OPTION, etc) like:
create view VW_#### as SELECTS AND JOINS HERE
And here is a picture of the problematic part from the execution plan: http://imgur.com/PO0ZnBU
EDIT: More details:
Well the query to create the problematic view is a big query that join a lot of tables. Based on a single parameter the Java-Client modifies the query string before creating it. This view represents a "data unit" from a legacy Database migrated to the SQLServer that didn't had any Foreign or Primary Key, so our team choose to follow this strategy. Because of that the view have more than 50 columns and it is made from the join of other seven views.
Main view's query (with a lot of Portuguese words): http://pastebin.com/Jh5vQxzA
The other views (from VW_Sintese1 until VW_Sintese7) are created like this one but without using extra views, they just use joins with the tables that contain the data requested by the main view.
Then the Java Client create a prepared Statement with the query "Select * from VW_Sintese####" and execute it using the function "ExecuteQuery", something like:
String query = "Select * from VW_Sintese####";
PreparedStatement ps = myConn.prepareStatement(query,ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = ps.executeQuery();
And then the program goes on until the end.
Thanks for the attention.
First: you should post the code of the view along with whatever is using the views because of the rest of this answer.
Second: the definition of a view in SQL Server is later used to substitute in querying. In other words, you created a view, but since (I'm assuming) it isn't an indexed view, it is the same as writing the original, long SELECT statement. SQL Server kind of just swaps it out in the DML statement.
From Microsoft's 'Querying Microsoft SQL Server 2012': T-SQL supports the following table expressions: derived tables, common table expressions (CTEs), views, inline table-valued functions.
And a direct quote:
It’s important to note that, from a performance standpoint, when SQL Server optimizes
queries involving table expressions, it first unnests the table expression’s logic, and therefore interacts with the underlying tables directly. It does not somehow persist the table expression’s result in an internal work table and then interact with that work table. This means that table expressions don’t have a performance side to them—neither good nor
bad—just no side.
This is a long way of reinforcing the first statement: please include the SQL code in the view and what you're actually using as the SELECT statement. Otherwise, we can't help much :) Cheers!
Edit: Okay, so you've created a view (no performance gain there) that does 4-5 LEFT JOIN on to the main view (again, you're not helping yourself out much here by eliminating rows, etc.). If there are search arguments you can use to filter down the resultset to fewer rows, you should have those in here. And lastly, you're ordering all of this at the top, so your query engine will have to get those views, join them up to a massive SELECT statement, figure out the correct order, and (I'm guessing here) the result count is HUGE and SQL's db engine is ordering it in some kind of temporary table.
The short answer: get less data (fewer columns and only the rows you need); don't order the results if the resultset is very large, just get the data to the client and then sort it there.
Again, if you want more help, you'll need to post table schemas and index strategies for all tables that are in the query (including the views that are joined) and you'll need to include all view definitions (including the views that are joined).

SQL Query Optimization

I am using SQL Server 2008 and I need to optimize my queries.For that purpose I am using Database Engine Tuning Advisor.
My question is can I check the performance of only one SQL query at a time or more than one suing new session?
To analyze one query at a time right click it in the SSMS script window and choose the option "Analyze Query in DTA" For this workload select the option "keep all existing PDS" to avoid loads of drop recommendations for indexes not used by the query under examination.
To do more than one first capture a trace file with a representative workload sample then you can analyse that with the DTA.
There are simple steps that must follow when writes SQL Query:-
1-Take the name of the columns in the select query instead of *
2-Avoid sub queries
3-Avoid to use operator IN operator
4-Use having as a filter in in Group By
5-Don not save image in database instead of this save the image
Path in database Ex: saving image in the DB takes large space and each
time needs to serialization when saving or retrieving images in the database.
6-Each table should have a primary key
7-Each table should have a minimum of one clustered index
8-Each table should have an appropriate amount of non-clustered index Non-clustered index should be created on columns of table based on query which is running
9-Following priority orders should be followed when any index is
created a) WHERE clause, b) JOIN clause, c) ORDER BY clause, d)SELECT clause
10-Do not to use Views or replace views with original source table
11-Triggers should not be used if possible, incorporate
the logic of trigger in stored procedure
12-Remove any adhoc queries and use Stored Procedure instead
13-Check if there is atleast 30% HHD is empty it will be improves the performance a bit
14-If possible move the logic of UDF to SP as well
15-Remove any unnecessary joins from the table
16-If there is cursor used in a query, see if there is any other way to avoid the use of this
(either by SELECT … INTO or INSERT … INTO, etc)

Large Data Service Architecture

Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table - can be costly if its tuned for selects. I believe with merge its a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: turns out you cannot merge (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again I've seen in production a copy of the current database used to perform a long running action (reporting and aggregation of data in this instance), however this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT val1, val2, val3 FROM #Tempo
WHERE NOT EXISTS
(
SELECT *
FROM MyTable t
WHERE t.val1 = val1 AND t.val2 = val2 AND t.val3 = val3
)
We do much larger imports than that all the time. Create an SSIS pacakge to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.

Resources