Joining tables from two Oracle databases in SAS

I am joining two tables that are located in two separate Oracle databases.
I am currently doing this in SAS by creating a libname connection to each database and then using something like the below.
libname dbase_a oracle user= etc... ;
libname dbase_b oracle user= etc... ;
proc sql;
create table t1 as
select a.*, b.*
from dbase_a.table1 a inner join dbase_b.table2 b
on a.id = b.id;
quit;
However, the query is painfully slow. Can you suggest any better options to speed up such a query (short of going down the path of creating a database link)?
Many thanks for looking at this.

If those two databases are on the same server and you are able to execute cross-database queries in Oracle, you could try using SQL pass-through:
proc sql;
connect to oracle (user= password= <...>);
create table t1 as
select * from connection to oracle (
select a.*, b.*
from dbase_a.schema_a.table1 a
inner join dbase_b.schema_b.table2 b
on a.id = b.id
);
disconnect from oracle;
quit;
I think that, in most cases, SAS attempts as much as possible to have the query executed on the database server, even if pass-through was not explicitly specified. However, when the query references tables on different servers, or in different databases on a system that does not allow cross-database queries, or when it contains SAS-specific functions that SAS cannot translate into something valid for the DBMS, SAS will indeed resort to 'downloading' the complete tables and processing the query locally, which can be painfully inefficient.

The select is for all columns from each table, and the inner join is on the id values only. Because the join criteria evaluation is for data coming from disparate sources, the baggage of all columns could be a big factor in the timing because even non-match rows must be downloaded (by the libname engine, within the SQL execution context) during the ON evaluation.
One approach would be to:
Select only the id from each table
Find the intersection
Upload the intersection to each server (as a scratch table)
Utilize the intersection on each server as pass through selection criteria within the final join in SAS
There are a couple of variations depending on the expected number of id matches, the number of different ids in each table, or knowing which of table-1 and table-2 is SMALL and which is BIG. For a large number of id matches that need to be transferred back to a server you will probably want to use some form of bulk copy. For a relatively small number of ids in the intersection you might get away with enumerating them directly in a SQL statement using the construct IN (). The size of a SQL statement could be limited by the database, the SAS/ACCESS to ORACLE engine, or the SAS macro system.
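The small-intersection variation can be sketched outside SAS as follows. This is only an illustration in Python, with two in-memory SQLite databases standing in for the two Oracle servers; the helper names (make_db, fetch_matching, join_via_intersection) and the payload column are made up for the example and are not part of SAS or Oracle.

```python
import sqlite3

def make_db(table, rows):
    """Create an in-memory database holding one table of (id, payload) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn

def fetch_matching(conn, table, ids):
    """Fetch full rows only for the given ids, using an IN () list."""
    if not ids:
        return {}
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, payload FROM {table} WHERE id IN ({placeholders})"
    return {row[0]: row[1] for row in conn.execute(sql, ids)}

def join_via_intersection(conn_a, conn_b):
    # Step 1: transfer only the id column from each source.
    ids_a = {r[0] for r in conn_a.execute("SELECT id FROM table1")}
    ids_b = {r[0] for r in conn_b.execute("SELECT id FROM table2")}
    # Step 2: find the intersection on the client.
    common = sorted(ids_a & ids_b)
    # Steps 3-4: send the ids back as selection criteria, so that
    # only matching rows (with all their columns) are transferred.
    rows_a = fetch_matching(conn_a, "table1", common)
    rows_b = fetch_matching(conn_b, "table2", common)
    # Final inner join on the client.
    return [(i, rows_a[i], rows_b[i]) for i in common]

db_a = make_db("table1", [(1, "a1"), (2, "a2"), (3, "a3")])
db_b = make_db("table2", [(2, "b2"), (3, "b3"), (4, "b4")])
result = join_via_intersection(db_a, db_b)
print(result)
```

The point of the design is that wide rows cross the network only for ids that actually match; the IN () list is what runs into the statement-size limits mentioned above once the intersection grows.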
Consider a data scenario in which it has been determined that the potential number of matching ids would be too large for a construct in (id-1,...id-n). In such a case the list of matching ids is dealt with in a tabular manner:
libname SOURCE1 ORACLE ....;
libname SOURCE2 ORACLE ....;
libname SCRATCH1 ORACLE ... must specify a scratch schema ...;
libname SCRATCH2 ORACLE ... must specify a scratch schema ...;
proc sql;
connect using SOURCE1 as PASS1;
connect using SOURCE2 as PASS2;
* compute intersection from only id data sent to SAS;
create table INTERSECTION as
(select id from connection to PASS1 (select id from table1))
intersect
(select id from connection to PASS2 (select id from table2))
;
* upload intersection to each server;
create table SCRATCH1.ids as select id from INTERSECTION;
create table SCRATCH2.ids as select id from INTERSECTION;
* compute inner join from only data that matches intersection;
create table INNERJOIN as select ONE.*, TWO.* from
(select * from connection to PASS1 (
select * from oracle-path-to-schema.table1
where id in (select id from oracle-path-to-scratch.ids)
)) as ONE
JOIN
(select * from connection to PASS2 (
select * from oracle-path-to-schema.table2
where id in (select id from oracle-path-to-scratch.ids)
)) as TWO
on ONE.id = TWO.id;
...
If both table-1 and table-2 have so many ids that they exceed the resource capacity of your SAS platform, you will also have to iterate the approach over ranges of ids. Techniques for determining the range criteria for each iteration are a tale for another day.

Related

Compare 3 SQL tables from 3 different databases in Microsoft SQL Server

I need to compare three tables from three different databases in SQL Server. Is this even possible?
I have 3 different databases: prod, test1, test2. Each database has a table of definitions called DEFINITIONS. The values in each table differ depending on the database. My job is to compare all 3 tables and point out the differences.
I was thinking about using the EXCEPT or INTERSECT operators to show the differences or similarities between these 3 tables but I cannot find any information how to merge these 3 databases.
Thanks for any tips!
You can do it by using except / intersect...
Main idea:
-- Rows that exist in db1 but not in db2
(select * from db1.dbo.table1 t
except
select * from db2.dbo.table2 t)
union
-- Rows that exist in db2 but not in db1
(select * from db2.dbo.table2 t
except
select * from db1.dbo.table1 t)
-- Etc...
To get the similarities, change EXCEPT to INTERSECT.
The problem with this solution is that a difference in a single column will generate two missing rows, one from db1 and one from db2.
This can be solved by using a FULL OUTER JOIN on the primary keys of both tables and just displaying the row values.
Something like:
select CASE WHEN t.ID IS NULL THEN 'Missing in 1' WHEN t2.ID IS NULL THEN 'Missing in 2' ELSE 'Both exist' END AS match_status
, t.*, t2.*
from db1.dbo.table1 t
FULL OUTER JOIN db2.dbo.table2 t2
ON t2.ID = t.ID
Then you just need to format data for your usage.
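As a rough illustration of the key-based comparison, here is a small Python sketch that classifies each primary key once instead of reporting a changed row twice. The function name diff_by_key and the sample data are invented for the example; it assumes each table has already been loaded into a dict keyed by primary key.

```python
def diff_by_key(rows1, rows2):
    """Compare two tables given as {primary_key: row_tuple} dicts."""
    report = {}
    for key in sorted(rows1.keys() | rows2.keys()):
        if key not in rows1:
            report[key] = "missing in 1"
        elif key not in rows2:
            report[key] = "missing in 2"
        elif rows1[key] != rows2[key]:
            # Same key on both sides, but at least one column differs.
            report[key] = "differs"
        else:
            report[key] = "identical"
    return report

prod = {1: ("alpha", 10), 2: ("beta", 20), 3: ("gamma", 30)}
test1 = {2: ("beta", 20), 3: ("gamma", 99), 4: ("delta", 40)}
report = diff_by_key(prod, test1)
print(report)
```

With three databases you would run the same comparison pairwise (for example prod against test1 and prod against test2).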
A couple of caveats of these approaches:
All tables must have the same number and types of columns for EXCEPT with SELECT * to work. Otherwise you need to choose which columns to match.
Collations of varchar fields should match between the two database tables, otherwise EXCEPT / INTERSECT will fail with a collation conflict error. You can solve it by "re-collating" the columns by using: SELECT ..., somevarcharcolumn COLLATE DATABASE_DEFAULT
There are also tools for this in Visual Studio and probably other clients (schema and data compare) etc.
Excel has some nice functions for this too: if you load data with matching rows from each table, you can color the differing fields by using VLOOKUP etc.

How can I reference tables from multiple databases within the same server in a common table expression (CTE)?

I have a SQL Server 2014 Express with multiple databases. One of them has general tables with information common to the remaining databases (let's call this database UniversalData).
The other databases have information that is pertinent to a specific site (let's call one of these databases Site01Data). The universal data may change and I don't want to replicate it regularly to the other site-specific databases, so I want to include the UniversalData table in some queries, some of which involve CTEs.
What I am trying to accomplish:
WITH CTE1 AS
(
SELECT *
FROM UniversalData.dbo.someTable
),
CTE2 AS
(
SELECT *
FROM Site01Data.dbo.anotherTable
),
CTE3 AS
(
SELECT CTE1.field1, CTE2.field2
FROM CTE1
JOIN CTE2 ON CTE1.idx = CTE2.idx
)
SELECT *
FROM CTE3;
This doesn't generate an error, but I seem to get no data from the CTE1 in my final query (null result set). Intuitively, does this mean it is saving a temp table in the UniversalData database that is not accessible from the Site01Data database?
How can I use a CTE with tables from different databases on the same server?
There are lots of ways to do this.
You could read the tables in one database into a temp table in the second database and then join to it, or join both of them on the fly.
But first, refrain from doing select *; specify the columns.
You could write:
select t1.column1,t2.column2
from UniversalData.dbo.someTable t1
inner join Site01Data.dbo.anotherTable t2
on t1.idx = t2.idx
and so on. It depends on how you want to specify the join and what sort of join you want to use.
This assumes that both databases are on the same instance; otherwise you will need linked servers.
In that case, specify servername.site1data.dbo.table etc. and use linked servers as appropriate across the different server names.
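To check the general idea that a CTE is only a named subquery and does not care which database its tables live in, here is a small Python sketch using SQLite's ATTACH as a stand-in for two databases on one SQL Server instance. The schema names main and site01 and the sample rows are assumptions for the example; on SQL Server you would use the three-part names from the question instead.

```python
import sqlite3

# "main" plays the role of UniversalData, "site01" the role of Site01Data.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS site01")

conn.execute("CREATE TABLE main.someTable (idx INTEGER, field1 TEXT)")
conn.execute("CREATE TABLE site01.anotherTable (idx INTEGER, field2 TEXT)")
conn.executemany("INSERT INTO main.someTable VALUES (?, ?)", [(1, "u1"), (2, "u2")])
conn.executemany("INSERT INTO site01.anotherTable VALUES (?, ?)", [(2, "s2"), (3, "s3")])

# Same shape as the query in the question, with explicit column lists.
query = """
WITH cte1 AS (SELECT idx, field1 FROM main.someTable),
     cte2 AS (SELECT idx, field2 FROM site01.anotherTable),
     cte3 AS (
         SELECT cte1.field1, cte2.field2
         FROM cte1
         JOIN cte2 ON cte1.idx = cte2.idx
     )
SELECT field1, field2 FROM cte3
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only idx 2 exists in both tables, so the inner join returns a single row. If the real query returns nothing, it is worth checking whether the join keys actually match across the two databases.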

Nested pass-through queries?

I have an ODBC connection to a SQL Server database, and because I'm returning large record sets with my queries, I've found that it's faster to run pass-through queries than native Access queries.
But I'm finding it hard to write and organize my queries because, as far as I know, I can't save several different pass-through queries and join them in another pass-through query. I have read-only access to this database, so I can't save stored procedures in SQL Server and then reference them in the pass-through.
For example, suppose I want to get only those entries with the maximum value of o_version from the following query:
select d.o_filename,d.o_version,parent.o_projectname
from dms_doc d
left join
dms_proj p
on
d.o_projectno=p.o_projectno
left join
dms_proj parent
on
p.o_parentno=parent.o_projectno
where
p.o_projectname='ABC'
and
lower(left(right(d.o_filename,4),3))='xls'
and
charindex('xyz',lower(d.o_filename))=0
I want to get only those entries with the maximum value of d.o_version. Ordinarily I would save this as a query called, e.g., abc, and then write another query abcMax:
select * from abc
inner join
(select o_filename,o_projectname,max(o_version) as maxVersion from abc
group by o_filename,o_projectname) abc2
on
abc.o_filename=abc2.o_filename
and
abc.o_projectname=abc2.o_projectname
where
abc.o_version=abc2.maxVersion
But if I can't store abc as a query that can be used in the pass-through query abcMax, then not only do I have to copy the entire body of abc into abcMax several times, but if I make any changes to the content of abc, then I need to make them to every copy that's embedded in abcMax.
The alternative is to write abcMax as a regular Access query that calls abc, but that will reduce the performance because the query is now being handled by ACE instead of SQL Server.
Is there any way to nest stored pass-through queries in Access? Or is creating stored procedures in SQL Server the only way to accomplish this?
If you have (or can get) permission to create temporary tables on the SQL Server then you might be able to use them to some advantage. For example, you could run one pass-through query to create a temporary table with the results from the first query (vastly simplified, in this example):
CREATE TABLE #abc (o_filename NVARCHAR(50), o_version INT, o_projectname NVARCHAR(50));
INSERT INTO #abc SELECT o_filename, o_version, o_projectname FROM dms_doc;
and then your second pass-through query could just reference the temporary table
select * from #abc
inner join
(select o_filename,o_projectname,max(o_version) as maxVersion from #abc
group by o_filename,o_projectname) abc2
on
#abc.o_filename=abc2.o_filename
and
#abc.o_projectname=abc2.o_projectname
where
#abc.o_version=abc2.maxVersion
When you're finished you can run a pass-through query to explicitly delete the temporary table
DROP TABLE #abc
or SQL Server will delete it for you automatically when your connection to the SQL Server closes.
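The pattern above (materialize once, reuse by name) can be tried out with a small Python sketch against an in-memory SQLite database. SQLite is only a stand-in here: it uses CREATE TEMP TABLE rather than the #abc naming of SQL Server, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dms_doc (o_filename TEXT, o_version INTEGER, o_projectname TEXT)")
conn.executemany(
    "INSERT INTO dms_doc VALUES (?, ?, ?)",
    [("a.xls", 1, "ABC"), ("a.xls", 2, "ABC"), ("b.xls", 1, "ABC")],
)

# "Query 1": materialize the base result once as a temporary table.
conn.execute(
    "CREATE TEMP TABLE abc AS "
    "SELECT o_filename, o_version, o_projectname FROM dms_doc"
)

# "Query 2": reference the temporary table twice without repeating its definition.
latest = conn.execute(
    """
    SELECT abc.o_filename, abc.o_version
    FROM abc
    JOIN (SELECT o_filename, o_projectname, MAX(o_version) AS maxVersion
          FROM abc
          GROUP BY o_filename, o_projectname) abc2
      ON abc.o_filename = abc2.o_filename
     AND abc.o_projectname = abc2.o_projectname
    WHERE abc.o_version = abc2.maxVersion
    ORDER BY abc.o_filename
    """
).fetchall()
print(latest)

# Explicit cleanup, mirroring the DROP TABLE step.
conn.execute("DROP TABLE abc")
```

Because a temporary table belongs to the connection that created it, both queries have to run over the same connection for this to work.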
For anyone still needing this info:
Pass-through queries allow the use of CTEs (common table expressions), just as in Oracle SQL. This is similar to chaining multiple saved select queries, but faster and more efficient, and without the clutter and confusion of "stacked" select queries, since you can see all the underlying queries in one view.
Example:
With Prep AS (
SELECT A.name,A.city
FROM Customers AS A
)
SELECT P.city, COUNT(P.name) AS clients_per_city
FROM Prep AS P
GROUP BY P.city

sql server linked server to oracle returns no data found when data exists

I have a linked server setup in SQL Server to hit an Oracle database. I have a query in SQL Server that joins on the Oracle table using dot notation. I am getting a “No Data Found” error from Oracle. On the Oracle side, I am hitting a table (not a view) and no stored procedure is involved.
First, when there is no data I should just get zero rows and not an error.
Second, there should actually be data in this case.
Third, I have only seen the ORA-01403 error in PL/SQL code; never in SQL.
This is the full error message:
OLE DB provider "OraOLEDB.Oracle" for linked server "OM_ORACLE" returned message "ORA-01403: no data found".
Msg 7346, Level 16, State 2, Line 1
Cannot get the data of the row from the OLE DB provider "OraOLEDB.Oracle" for linked server "OM_ORACLE".
Here are some more details, but it probably does not mean anything since you don’t have my tables and data.
This is the query with the problem:
select *
from eopf.Batch b join eopf.BatchFile bf
on b.BatchID = bf.BatchID
left outer join [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
on bf.ReferenceID = du.documentUploadID;
I can’t understand why I get a “no data found” error. The query below uses the same Oracle table and returns no data but I don’t get an error - I just get no rows returned.
select * from [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] where documentUploadID = -1
The query below returns data. I just removed one of the SQL Server tables from the join. But removing the batch table does not change the rows returned from batchFile (271 rows in both cases – all rows in batchFile have a batch entry). It should still be joining the same batchFile rows to the same Oracle rows.
select *
from eopf.BatchFile bf
left outer join [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
on bf.ReferenceID = du.documentUploadID;
And this query returns 5 rows. It should be the same 5 from the original query. ( I can’t use this because I need data from the batch and batchFile table).
select *
from [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
where du.documentUploadId
in
(
select bf.ReferenceID
from eopf.Batch b join eopf.BatchFile bf
on b.BatchID = bf.BatchID);
Has anyone experienced this error?
Today I experienced the same problem with an inner join. Creating a table-valued function (suggested by codechurn), using a temporary table (suggested by user1935511), or changing the join types (suggested by cymorg) were not options for me, so I'd like to share my solution.
I used join hints to steer the query optimizer in the right direction, as the problem seems to arise from a nested loops join strategy that processes the remote table locally. For me the HASH, MERGE and REMOTE join hints all worked.
For you, REMOTE will not be an option because it can be used only for inner join operations. So something like the following should work.
select *
from eopf.Batch b
join eopf.BatchFile bf
on b.BatchID = bf.BatchID
left outer merge join [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] du
on bf.ReferenceID = du.documentUploadID;
I've had the same problem.
Solution 1: load the data from the Oracle database into a temp table, then join to that temp table instead.
According to another post, the problem can be caused by using a left join.
I checked this against my own case, and changing the join in my query solved the problem.
In my case I had a complex view made from a linked table, 3 views based on the linked table and a local table. I was using Inner Joins throughout and this problem manifested. Changing the joins to Left and Right Outer Joins (as appropriate) resolved the issue.
Another way to work around the problem is to pull the Oracle data back into a Table Valued Function. This will cause SQL Server to go out and retrieve all of the data from Oracle and put it into a resultant table variable. For all intents and purposes, the Oracle data is now "local" to SQL Server if you use the resultant Table Valued Function in a query.
I believe the original problem is that SQL Server is trying to optimize the execution of your compound query which includes the remote Oracle query results in-line. By using a Table Valued Function to wrap the Oracle call, SQL Server will optimize the compound query on the resultant table variable returned from the function and not the results from the remote query execution.
CREATE function [dbo].[documents]()
returns @results TABLE (
DOCUMENT_ID INT NOT NULL,
TITLE VARCHAR(6) NOT NULL,
LEGALNAME VARCHAR(50) NOT NULL,
AUTHOR_ID INT NOT NULL,
DOCUMENT_TYPE VARCHAR(1) NOT NULL,
LAST_UPDATE DATETIME
) AS
BEGIN
-- Copy the remote Oracle rows into the local table variable
INSERT INTO @results
SELECT CAST(DOCUMENT_ID AS INT) AS DOCUMENT_ID, TITLE, LEGALNAME, CAST(AUTHOR_ID AS INT) AS AUTHOR_ID, DOCUMENT_TYPE, LAST_UPDATE
FROM OPENQUERY(ORACLE_SERVER,
'select DOCUMENT_ID, TITLE, LEGALNAME, AUTHOR_ID, DOCUMENT_TYPE, FUNDTYPE, LAST_UPDATE
from documents')
return
END
You can then use the Table Valued Function as if it were a table in your SQL queries:
SELECT * FROM DOCUMENTS()
I resolved it by avoiding the = operator. Try using this instead:
select * from [OM_ORACLE]..[OM].[DOCUMENT_UPLOAD] where documentUploadID < 0

How to get a list of all tables in two different databases

I'm trying to create a little SQL script (in SQL Server Management Studio) to get a list of all tables in two different databases. The goal is to find out which tables exist in both databases and which ones only exist in one of them.
I have found various scripts on SO to list all the tables of one database, but so far I wasn't able to get a list of tables of multiple databases.
So: is there a way to query SQL Server for all tables in a specific database, e.g. SELECT * FROM ... WHERE databaseName='first_db' so that I can join this with the result for another database?
SELECT * FROM database1.INFORMATION_SCHEMA.TABLES
UNION ALL
SELECT * FROM database2.INFORMATION_SCHEMA.TABLES
UPDATE
In order to compare the two lists, you can use FULL OUTER JOIN, which will show you the tables that are present in both databases as well as those that are only present in one of them:
SELECT *
FROM database1.INFORMATION_SCHEMA.TABLES db1
FULL JOIN database2.INFORMATION_SCHEMA.TABLES db2
ON db1.TABLE_NAME = db2.TABLE_NAME
ORDER BY COALESCE(db1.TABLE_NAME, db2.TABLE_NAME)
You can also add WHERE db1.TABLE_NAME IS NULL OR db2.TABLE_NAME IS NULL to see only the differences between the databases.
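The same comparison can also be done client-side with plain set operations once both table lists are fetched. Below is a Python sketch that uses SQLite's sqlite_master catalog as a stand-in for INFORMATION_SCHEMA.TABLES; the database alias db2, the helper table_names and the table names are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS db2")
for name in ("customers", "orders"):
    conn.execute(f"CREATE TABLE main.{name} (id INTEGER)")
for name in ("orders", "invoices"):
    conn.execute(f"CREATE TABLE db2.{name} (id INTEGER)")

def table_names(conn, schema):
    """Return the set of table names in one attached database."""
    sql = f"SELECT name FROM {schema}.sqlite_master WHERE type = 'table'"
    return {row[0] for row in conn.execute(sql)}

db1_tables = table_names(conn, "main")
db2_tables = table_names(conn, "db2")

in_both = sorted(db1_tables & db2_tables)    # present in both databases
only_db1 = sorted(db1_tables - db2_tables)   # present only in the first
only_db2 = sorted(db2_tables - db1_tables)   # present only in the second
print(in_both, only_db1, only_db2)
```

The three sets correspond to the matched rows and the two NULL-padded sides of the FULL JOIN result.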
As far as I know, you can only query tables for the active database. But you could store them in a temporary table, and join the result:
use db1
insert #TableList select (...) from sys.tables
use db2
insert #TableList2 select (...) from sys.tables
select * from #TableList tl1 join #TableList2 tl2 on ...
Just for completeness, this is the query I finally used (based on Andriy M's answer):
SELECT * FROM DB1.INFORMATION_SCHEMA.Tables db1
LEFT OUTER JOIN DB2.INFORMATION_SCHEMA.Tables db2
ON db1.TABLE_NAME = db2.TABLE_NAME
ORDER BY db1.TABLE_NAME
To find out which tables exist in db2, but not in db1, replace the LEFT OUTER JOIN with a RIGHT OUTER JOIN.
