In a stored procedure running on server A, is it faster to join a table on server B to a table on server A or to the same table on server B?

I am working on building a data store database in SQL Server for reporting purposes. Let's call this server A. I am creating scripts that will pull in the necessary data from several linked servers (B and C). I am trying to make them as performant as possible.
I have one script that pulls data using a complex query with many joins, but all from tables on server B. I already have some of the data required for these joins on server A from a previous load script.
So here is my question - which is faster:
1. Join the tables on server B to the tables which have the required data on server A
2. Just do all the joins on server B
I think number 2 would be faster, but I know that doing things over the network via linked servers is slower, so I am not sure.

Remote joins are typically much slower than local joins, so you should try rather hard to join tables in only one place.
To join all the tables on B and return the results to A, run a "pass-through" query on B using OPENQUERY or EXEC (@sql) AT B.
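A minimal sketch of both approaches, assuming a linked server named ServerB and hypothetical database/table names; either way the whole join executes on B, and only the result set crosses the network:

-- Option 1: OPENQUERY sends the literal query text to ServerB for execution
SELECT OrderId, CustomerName
FROM OPENQUERY(ServerB, '
    SELECT o.OrderId, c.CustomerName
    FROM SalesDb.dbo.Orders o
    JOIN SalesDb.dbo.Customers c ON c.CustomerId = o.CustomerId');

-- Option 2: EXEC (@sql) AT runs a dynamic SQL string on ServerB
-- (requires the RPC Out option to be enabled on the linked server)
DECLARE @sql nvarchar(max) = N'
    SELECT o.OrderId, c.CustomerName
    FROM SalesDb.dbo.Orders o
    JOIN SalesDb.dbo.Customers c ON c.CustomerId = o.CustomerId;';
EXEC (@sql) AT ServerB;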

Related

SSIS, query Oracle table using ID's from SQL Server?

Here's the basic idea of what I want to do in SSIS:
I have a large query against a production Oracle database, and it needs a where clause that brings in a long list of ids from SQL Server. From there, the results are sent elsewhere.
select ...
from Oracle_table(s) --multi-join
where id in ([select distinct id from SQL_SERVER_table])
Alternatively, I could write the query this way:
select ...
from Oracle_table(s) --multi-join
...
join SQL_SERVER_table sst on sst.ID = Oracle_table.ID
Here are my limitations:
The Oracle query is large and cannot be run without the where id in (...) clause
This means I cannot run the Oracle query and then join it against the ids in another step. I tried this, and the DBAs killed the temp table after it grew to 3 TB in size.
I have 160k ids
This means it is not practical to iterate through the ids one by one. In the past, I have run against ~1000 IDs using a comma-separated list. It runs relatively fast - a few minutes.
The main query is in Oracle, but the ids are in SQL Server
I do not have the ability to write to Oracle
I've found many questions like this, but none of the answers offer a solution that fits these limitations.
Similar question:
Query a database based on result of query from another database
To prevent loading all rows from the Oracle table, the only way is to apply the filter in the Oracle database engine. I don't think this can be achieved using SSIS, since you have more than 160,000 ids in the SQL Server table, which cannot be efficiently loaded and passed to the Oracle SQL command:
Using Lookups and Merge Join will require loading all data from the Oracle database
Retrieving data from SQL Server, building a comma-separated string, and passing it to the Oracle SQL command cannot be done with that many IDs (160K).
The same issue using a Script Task.
Creating a Linked Server in SQL Server and Joining both tables will load all data from the Oracle database.
To solve your problem, you should search for a way to create a link to the SQL Server database from the Oracle engine.
Oracle Heterogeneous Services
I don't have much experience with Oracle databases, but after some research I found an Oracle equivalent to SQL Server's "Linked Servers", called "heterogeneous connectivity".
The query syntax should look like this:
select *
from Oracle_table
where id in (select distinct id from SQL_SERVER_table@sqlserverdsn)
You can refer to the following step-by-step guides to read more on how to connect to SQL Server tables from Oracle:
What is Oracle equivalent for Linked Server and can you join with SQL Server?
Making a Connection from Oracle to SQL Server - 1
Making a Connection from Oracle to SQL Server - 2
Heterogeneous Database connections - Oracle to SQL Server
Importing Data from SQL Server to a staging table in Oracle
Another approach is to use a Data Flow Task that imports IDs from SQL Server to a staging table in Oracle. Then use the staging table in your Oracle query. It would be better to create an index on the staging table. (If you do not have permission to write to the Oracle database, try to get permission to a separate staging database.)
Example of exporting data from SQL Server to Oracle:
Export SQL Server Data to Oracle using SSIS
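Once the ids are staged, the Oracle query can join against them directly. A rough sketch of the Oracle-side usage, assuming a hypothetical staging table STG_IDS filled by the Data Flow Task:

-- run on Oracle; STG_IDS is the staging table loaded from SQL Server
CREATE INDEX stg_ids_ix ON stg_ids (id);

SELECT o.*
FROM Oracle_table o
JOIN stg_ids s ON s.id = o.id;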
Minimizing the data load from the Oracle table
If none of the solutions above solves your issue, you can try minimizing the data loaded from the Oracle database as much as possible.
As an example, you can get the Minimum and Maximum IDs from the SQL Server table and store both values in two variables. Then use both variables in the SQL command that loads the data from the Oracle table, like the following:
SELECT * FROM Oracle_Table WHERE ID >= @MinID AND ID <= @MaxID
This removes a bunch of useless rows from your operation. If your ID column is a string, you can use other measures to filter data, such as the string length or the first character.
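A minimal sketch of computing the bounds on the SQL Server side (in SSIS these results would typically be mapped to package variables):

-- hypothetical T-SQL source for the two SSIS variables
SELECT MIN(id) AS MinID, MAX(id) AS MaxID
FROM SQL_SERVER_table;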

Alternate to SQL Server Linked Server

I am trying to build a program that compares two database servers that have the same tables, but some tables have additional columns. I am using a linked server to connect these two database servers.
But I found a problem: when I try to compare some data, the connection mostly times out. When I check Activity Monitor and the execution plan, more than 90% of the cost is in the remote query - this makes comparing one record that has 5 child entries run for 5-7 minutes.
This is a sample query that I am trying to run:
Select pol.PO_TXN_ID, pol.Pol_Num
From ServerA.InstanceA.dbo.POLine pol
Where not exists (Select 1
                  From ServerB.InstanceA.dbo.POLine pol2
                  where pol.PO_TXN_ID = pol2.PO_TXN_ID
                  and pol.Pol_Num = pol2.Pol_Num)
I tried using OPENROWSET, but our administrator does not permit enabling it on the production server.
Is there any alternative that I can use to optimize my query instead of using linked servers?
Options:
OpenQuery() / 4 part naming with temp tables.
ETL (eg: SQL Server Integration Services)
The problem with linked servers, especially with 4-part naming as in your example:
The query engine doesn't know how to optimize it, because it can't access statistics on the linked server.
This results in full table scans, pulling all the data to the source SQL Server and then processing it there (high network IO, bad execution plans, long-running queries).
Option 1
Create a temp table (preferably with indexes)
Query the linked server with OPENQUERY, preferably with a filter condition, e.g.:
CREATE TABLE #MyTempTable(Id INT NOT NULL PRIMARY KEY /*, other columns*/)
INSERT INTO #MyTempTable(Id /*, other columns*/)
SELECT *
FROM OPENQUERY(ServerA, 'SELECT Id /*, other columns*/ FROM Table WHERE /*Condition*/')
Use the temp table(s) to do your calculation, as sketched below.
Still needs at least 1 linked server.
OPENQUERY has better performance when the linked server is not SQL Server (e.g. Postgres, MySQL, Oracle, ...), as the query is executed on the linked server instead of pulling all the data to the source server.
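A rough sketch of the whole pattern for your comparison, with hypothetical column types; each side is pulled once through OPENQUERY into an indexed temp table, and the NOT EXISTS comparison then runs entirely locally:

CREATE TABLE #A(PO_TXN_ID INT NOT NULL, Pol_Num VARCHAR(50) NOT NULL,
                PRIMARY KEY (PO_TXN_ID, Pol_Num))
CREATE TABLE #B(PO_TXN_ID INT NOT NULL, Pol_Num VARCHAR(50) NOT NULL,
                PRIMARY KEY (PO_TXN_ID, Pol_Num))

INSERT INTO #A(PO_TXN_ID, Pol_Num)
SELECT PO_TXN_ID, Pol_Num
FROM OPENQUERY(ServerA, 'SELECT PO_TXN_ID, Pol_Num FROM InstanceA.dbo.POLine')

INSERT INTO #B(PO_TXN_ID, Pol_Num)
SELECT PO_TXN_ID, Pol_Num
FROM OPENQUERY(ServerB, 'SELECT PO_TXN_ID, Pol_Num FROM InstanceA.dbo.POLine')

-- rows on A that are missing on B; this comparison runs locally
SELECT a.PO_TXN_ID, a.Pol_Num
FROM #A a
WHERE NOT EXISTS (SELECT 1 FROM #B b
                  WHERE b.PO_TXN_ID = a.PO_TXN_ID
                    AND b.Pol_Num = a.Pol_Num)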
Option 2
You can use an ETL tool like SQL Server Integration Services (SSIS)
Load the data from the 2 servers
Use a Slowly changing dimension or lookup component to determine the differences.
Insert/update what you want/need
No linked servers are needed, SSIS can connect to the databases directly

Very slow queries in MS Access with joined MS SQL table via ODBC

What is the best solution when I would like to use an Access front-end application with some linked tables (via ODBC) from MSSQL Server?
The difficulty for me is that I have to use complex queries with many joins (and functions called from queries).
It is very slow because of the joins between the two DBs (and there is a lot of data in some tables; the 2 GB Access mdb limit is the reason for the move to MSSQL).
A pass-through query doesn't help because of the joined Access tables.
With OPENDATASOURCE('Microsoft.ACE.OLEDB.12.0', ...) it is still slow on the SQL Server side too. I tried an ODBC linked view with a WHERE clause from MSSQL, but it seems as slow as the full table.
Do I have to move all of the joined Access tables to the MSSQL DB and convert all queries to Pass-Through? Is there any other solution?
Do I have to move all of the joined Access tables to the MSSQL DB
Yes, definitely.
and convert all queries to Pass-Through?
Not necessarily, only those that are still slow.
"Normal" INNER JOIN queries, using only linked tables from one server database, are handled by Access and the ODBC driver in a way that everything is processed on the server. They should be (more or less) as fast as when run on the server (or as Pass-Through query).
Only "complex" queries, especially involving multiple INNER and OUTER JOINs, won't work like that. You'll notice that they are still very slow when running on linked tables. These need to be changed to Pass-Through queries.
Edit: I just noticed
functions called from queries
You can't call VBA functions from PT queries, and they will again kill performance when called from Access queries running on linked MSSQL tables (because every row has to be processed locally).
You'll need to learn to create views in MSSQL, and probably also user-defined functions and/or stored procedures.
In the long run, you'll find that views are actually easier to manage than PT queries.
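As a minimal illustration (table and column names are hypothetical), the join logic moves into a server-side view, and Access links to the view like any other table:

CREATE VIEW dbo.vOrderSummary AS
SELECT o.OrderId, o.OrderDate, c.CustomerName,
       d.Qty * d.UnitPrice AS LineTotal
FROM dbo.Orders o
JOIN dbo.Customers c ON c.CustomerId = o.CustomerId
JOIN dbo.OrderDetails d ON d.OrderId = o.OrderId;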

What is the reasoning for using OPENQUERY within a tsql stored procedure?

I am currently reviewing some jobs that run stored procedures on a database. All of these stored procedures connect to linked server(s). I am not too familiar with this functionality, and I am trying to determine why OPENQUERY was used instead of a normal distributed query, since the plain queries I run seem to pull in the data just as well.
I read MSDN's explanation of OPENQUERY:
http://technet.microsoft.com/en-us/library/ms188427.aspx
I also read this Stack Overflow question about why not to use it on a local server:
Why is using OPENQUERY on a local server bad?
My question is do you basically just use this when the stored procedure requires the embedded credentials of the linked server? Or are there more reasons for using OpenQuery that I am not aware of?
Two advantages of OPENQUERY come to mind: it can reduce the amount of data you need to transfer by doing the necessary filtering on the remote server, and it can let the query optimizer on the remote server choose the optimal execution plan when joining tables.
The other alternative is the REMOTE JOIN hint. I've had some luck using it, and Aaron Bertrand has a nice write-up about it here: http://www.mssqltips.com/sqlservertip/2765/revisit-your-use-of-the-sql-server-remote-join-hint/
Here is the MS documentation
REMOTE
Specifies that the join operation is performed on the site of the right table. This is useful when the left table is a local table and the right table is a remote table. REMOTE should be used only when the left table has fewer rows than the right table.
If the right table is local, the join is performed locally. If both tables are remote but from different data sources, REMOTE causes the join to be performed on the site of the right table. If both tables are remote tables from the same data source, REMOTE is not required.
REMOTE cannot be used when one of the values being compared in the join predicate is cast to a different collation using the COLLATE clause.
REMOTE can be used only for INNER JOIN operations.
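A minimal sketch of the hint's syntax, with hypothetical local and linked-server table names (the local table is assumed to be the smaller one):

SELECT l.OrderId, r.CustomerName
FROM dbo.LocalOrders AS l
INNER REMOTE JOIN LinkedSrv.SalesDb.dbo.Customers AS r
    ON l.CustomerId = r.CustomerId;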

What are the drawbacks of using linked servers in SQL Server?

Are there any huge performance issues or security concerns?
Using SQL Server 2005 and higher
Server migrations are more convoluted
Security can be tricky to set up for multi-hop
Non-SQL Server ones require a local driver installed (Sybase, DB2, etc.)
Clusters, off-site DR: registry entries + drivers
Non-SQL Server x64 woes. 'Nuff said
Non-SQL Server ones don't play well (how many places to enter the password?)
Performance (in other answers)
I've set up linked servers to Access, DB2, Oracle, Sybase and the odd proprietary ODBC driver. I'd prefer SSIS or .net code now...
Yes - Queries which join two datasets in different physical databases perform poorly.
e.g. if you run a query between table A on the current server and table B on a linked server:
Select A.Field1, B.Field2 FROM A INNER JOIN B on A.Id = B.Id
WHERE B.Id = @InputId
you may find that all the records for table B are retrieved - effectively Select * from Table B - into the working server.
What you'd want to do instead is have a usp on the linked server which takes an Id as a parameter and returns a filtered recordset from Table B.
Then rewrite the query above to join Table A to the usp's output instead (see the sketch below).
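A rough sketch of that pattern, using hypothetical names; the procedure runs on the remote server so only the filtered rows cross the network, captured locally with INSERT ... EXEC:

-- on the linked server
CREATE PROCEDURE dbo.usp_GetBById @Id INT
AS
SELECT Id, Field2 FROM dbo.B WHERE Id = @Id;

-- on the local server (requires RPC Out enabled on the linked server)
DECLARE @InputId INT = 42;
CREATE TABLE #B(Id INT NOT NULL PRIMARY KEY, Field2 VARCHAR(50));
INSERT INTO #B(Id, Field2)
EXEC LinkedSrv.RemoteDb.dbo.usp_GetBById @Id = @InputId;

SELECT A.Field1, B.Field2
FROM A INNER JOIN #B B ON A.Id = B.Id;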
Having one (or many) set up on the server isn't the issue - the performance hit comes when you actually query them.
I have a linked SQL Server 2005 set up in the same physical building (on the same network), and it's not a problem - fast as you like.
I also have another linked (Oracle) server on the other side of the world, and querying it is like walking through treacle: it times out and drops connections.
Sorry to be vague, but... it depends!
