SAS Merge conversion in SQL with condition - sql-server

I am trying to convert the SAS MERGE code below to SQL; this is my first time converting a SAS script to SQL.
proc sort data=tb208;
by rsn_cde;
run;
proc sort data=dclhcl;
by rsn_cde;
run;
data dclhcl;
merge tb208 (in=t) dclhcl (in=d);
by rsn_cde;
if d;
run;
My conversion of the above merge in SQL:
SELECT t.*
,d.*
FROM tb208 t
JOIN dclhcl d ON t.rsn_cde = d.rsn_cde
Beyond the JOIN condition, I am unable to convert the IF condition in SAS to SQL. Can anyone help me with a solution for the above SAS merge conversion?

The MERGE statement here maps to a SQL right outer join.
SELECT t.*
,d.*
FROM tb208 t
RIGHT OUTER JOIN dclhcl d ON t.rsn_cde = d.rsn_cde
Pay attention to the following differences between SAS and RDBMS:
Missing values (NULL in SQL). In SAS a missing value is just another value that sorts to the beginning of the merged data set, so missing BY values match each other. In databases, NULLs never compare equal in a join predicate, so your mileage may vary.
If you have multiple records with the same rsn_cde in both the left and right tables (a many-to-many merge), you CANNOT reproduce the SAS semantics using SQL: a SQL join returns the Cartesian product of each matching group, whereas SAS pairs the rows within the group one by one.
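To check whether that many-to-many caveat applies, a quick duplicate-key scan of each input helps; a minimal T-SQL sketch using the table and column names from the question:
-- Keys that occur more than once on the tb208 side
SELECT rsn_cde, COUNT(*) AS cnt
FROM tb208
GROUP BY rsn_cde
HAVING COUNT(*) > 1;

-- Keys that occur more than once on the dclhcl side
SELECT rsn_cde, COUNT(*) AS cnt
FROM dclhcl
GROUP BY rsn_cde
HAVING COUNT(*) > 1;
Any rsn_cde value that shows up in both result sets marks a group whose join output will not match the SAS merge.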

Related

Select from one table where not in another in SQL Server

I have a SQL Server database on my computer, and there are two tables in it.
This is the first one:
SELECT
[ParticipantID]
,[ParticipantName]
,[ParticipantNumber]
,[PhoneNumber]
,[Mobile]
,[Email]
,[Address]
,[Notes]
,[IsDeleted]
,[Gender]
,[DOB]
FROM
[Gym].[dbo].[Participant]
and this is the second one
SELECT
[ParticipationID]
,[ParticipationNumber]
,[ParticpationTypeID]
,[AddedByEmployeeID]
,[AddDate]
,[ParticipantID]
,[TrainerID]
,[ParticipationDate]
,[EndDate]
,[Fees]
,[PaidFees]
,[RemainingFees]
,[IsPeriodParticipation]
,[NoOfVisits]
,[Notes]
,[IsDeleted]
FROM
[Gym].[dbo].[Participation]
Now I need to write a T-SQL query that can return
SELECT
Participant.ParticipantNumber,
Participation.ParticipationDate,
Participation.EndDate
FROM
Participant, Participation
WHERE
Participant.ParticipantID = Participation.ParticipantID;
and I'd be thankful for any help.
SQL Server performs sort, intersect, union, and difference operations using in-memory sorting and hash join technology. Using this type of query plan, SQL Server supports vertical table partitioning, sometimes called columnar storage.
SQL Server employs three types of join operations:
Nested Loops joins
Merge joins
Hash joins
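Normally the optimizer picks the physical join type on its own; purely as an illustration (using the Participant and Participation tables from the question above), T-SQL join hints can force a specific one:
-- Force a hash join with a join hint placed between INNER and JOIN
SELECT p.ParticipantNumber, pa.ParticipationDate
FROM [Gym].[dbo].[Participant] AS p
INNER HASH JOIN [Gym].[dbo].[Participation] AS pa
ON p.ParticipantID = pa.ParticipantID;

-- The same idea as a query-level hint (LOOP JOIN, MERGE JOIN or HASH JOIN)
SELECT p.ParticipantNumber, pa.ParticipationDate
FROM [Gym].[dbo].[Participant] AS p
INNER JOIN [Gym].[dbo].[Participation] AS pa
ON p.ParticipantID = pa.ParticipantID
OPTION (MERGE JOIN);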
Join Fundamentals
By using joins, you can retrieve data from two or more tables based on logical relationships between the tables. Joins indicate how Microsoft SQL Server should use data from one table to select the rows in another table.
A join condition defines the way two tables are related in a query by:
Specifying the column from each table to be used for the join. A typical join condition specifies a foreign key from one table and its associated key in the other table.
Specifying a logical operator (for example, = or <>) to be used in comparing values from the columns.
Inner joins can be specified in either the FROM or WHERE clauses. Outer joins can be specified in the FROM clause only. The join conditions combine with the WHERE and HAVING search conditions to control the rows that are selected from the base tables referenced in the FROM clause.
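Applied to the two tables in the question, the same inner join can be written in either place; a minimal sketch using the columns shown above:
-- Join condition in the FROM clause (the preferred ANSI form)
SELECT p.ParticipantNumber, pa.ParticipationDate, pa.EndDate
FROM [Gym].[dbo].[Participant] AS p
INNER JOIN [Gym].[dbo].[Participation] AS pa
ON p.ParticipantID = pa.ParticipantID;

-- The same inner join expressed in the WHERE clause (older style)
SELECT p.ParticipantNumber, pa.ParticipationDate, pa.EndDate
FROM [Gym].[dbo].[Participant] AS p, [Gym].[dbo].[Participation] AS pa
WHERE p.ParticipantID = pa.ParticipantID;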
Follow this link to help you understand joins better in MSSQL:
link to joins

Joining tables from two Oracle databases in SAS

I am joining two tables together that are located in two separate oracle databases.
I am currently doing this in SAS by creating two LIBNAME connections, one to each database, and then simply using something like the below.
libname dbase_a oracle user= etc... ;
libname dbase_b oracle user= etc... ;
proc sql;
create table t1 as
select a.*, b.*
from dbase_a.table1 a inner join dbase_b.table2 b
on a.id = b.id;
quit;
However, the query is painfully slow. Can you suggest any better options to speed up such a query (short of going down the path of creating a database link)?
Many thanks for looking at this.
If those two databases are on the same server and you are able to execute cross-database queries in Oracle, you could try using SQL pass-through:
proc sql;
connect to oracle (user= password= <...>);
create table t1 as
select * from connection to oracle (
select a.*, b.*
from dbase_a.schema_a.table1 a
inner join dbase_b.schema_b.table2 b
on a.id = b.id
);
disconnect from oracle;
quit;
I think that, in most cases, SAS attempts as much as possible to have the query executed on the database server, even if pass-through was not explicitly specified. However, when the query references tables that are on different servers, or different databases on a system that does not allow cross-database queries, or when the query contains SAS-specific functions that SAS cannot translate into something valid on the DBMS, then SAS will indeed resort to 'downloading' the complete tables and processing the query locally, which can evidently be painfully inefficient.
The select is for all columns from each table, and the inner join is on the id values only. Because the join criteria evaluation is for data coming from disparate sources, the baggage of all columns could be a big factor in the timing because even non-match rows must be downloaded (by the libname engine, within the SQL execution context) during the ON evaluation.
One approach would be to:
Select only the id from each table
Find the intersection
Upload the intersection to each server (as a scratch table)
Utilize the intersection on each server as pass through selection criteria within the final join in SAS
There are a couple of variations depending on the expected number of id matches, the number of distinct ids in each table, or knowing which of table-1 and table-2 is SMALL and which is BIG. For a large number of id matches that need to be transferred back to a server you will probably want to use some form of bulk copy. For a relatively small number of ids in the intersection you might get away with enumerating them directly in a SQL statement using the construct IN (). The size of such a SQL statement could be limited by the database, the SAS/ACCESS to Oracle engine, or the SAS macro system.
Consider a data scenario in which it has been determined that the potential number of matching ids would be too large for an in (id-1, ..., id-n) construct. In such a case the list of matching ids is dealt with in a tabular manner:
libname SOURCE1 ORACLE ....;
libname SOURCE2 ORACLE ....;
libname SCRATCH1 ORACLE ... must specify a scratch schema ...;
libname SCRATCH2 ORACLE ... must specify a scratch schema ...;
proc sql;
connect using SOURCE1 as PASS1;
connect using SOURCE2 as PASS2;
* compute intersection from only id data sent to SAS;
create table INTERSECTION as
(select id from connection to PASS1 (select id from table1))
intersect
(select id from connection to PASS2 (select id from table2))
;
* upload intersection to each server;
create table SCRATCH1.ids as select id from INTERSECTION;
create table SCRATCH2.ids as select id from INTERSECTION;
* compute inner join from only data that matches intersection;
create table INNERJOIN as select ONE.*, TWO.* from
(select * from connection to PASS1 (
select * from oracle-path-to-schema.table1
where id in (select id from oracle-path-to-scratch.ids)
)) as ONE
inner join
(select * from connection to PASS2 (
select * from oracle-path-to-schema.table2
where id in (select id from oracle-path-to-scratch.ids)
)) as TWO
on ONE.id = TWO.id;
...
For the case where both table-1 and table-2 have so many ids that they exceed the resource capacity of your SAS platform, you will also have to iterate the approach over ranges of id values. Techniques for determining the range criteria for each iteration are a tale for another day.

Sql code into kdb code

I am taking a query in SQL Server and trying to write it in KDB. I have done this before without issue, but this time the KDB query seems to be returning a different number of rows than SQL when it should return the same. My SQL code is:
SELECT
*
FROM tbl_Master
INNER JOIN map_2012
ON tbl_Master.RM_ID = map_2012.RMID
LEFT JOIN src_CQ
ON map_2012.INST = src_CQ.INST
My kdb code is
Y:map_2012 ij `RMID xkey select RMID:RM_ID, RG_ID, RM_Name from tbl_Master
Y:src_CQ lj `INST xkey Y
It is simple code, but I cannot figure out why the returned row counts are so different.
In the SQL query you left join src_CQ to map_2012. To do the same in KDB, map_2012 has to be on the left and src_CQ on the right:
Y:Y lj `INST xkey src_CQ
Also note that if map_2012 has duplicate values in the RMID column, or src_CQ has duplicate values in the INST column, keying the table in KDB will remove the rows with the duplicates. Under those circumstances another approach is required; I believe the equi-join ej will help in this case.

SQL Server CTE referred in self joins slow

I have written a table-valued UDF that starts with a CTE to return a subset of the rows from a large table.
There are several joins in the CTE: a couple of inner joins and one left join to other tables, which don't contain many rows.
The CTE has a WHERE clause that restricts the rows to a date range, in order to return only the rows needed.
I'm then referencing this CTE in 4 self left joins, in order to build subtotals using different criteria.
The query is quite complex but here is a simplified pseudo-version of it
WITH DataCTE as
(
SELECT [columns] FROM table
INNER JOIN table2
ON [...]
INNER JOIN table3
ON [...]
LEFT JOIN table3
ON [...]
)
SELECT [aggregates_columns of each subset] FROM DataCTE Main
LEFT JOIN DataCTE BananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality = 100
LEFT JOIN DataCTE DamagedBananasSubset
ON [...]
AND Product = 'Bananas'
AND Quality < 20
LEFT JOIN DataCTE MangosSubset
ON [...]
GROUP BY [...]
I have the feeling that SQL Server gets confused and re-evaluates the CTE for each self join, which seems to be confirmed by looking at the execution plan, although I confess to not being an expert at reading those.
I would have assumed SQL Server to be smart enough to perform the data retrieval from the CTE only once, rather than several times.
I have tried the same approach, but rather than using a CTE to get the subset of the data, I used the same SELECT query as in the CTE and made it output to a temp table instead.
The CTE version takes 40 seconds. The temp table version takes between 1 and 2 seconds.
Why isn't SQL Server smart enough to keep the CTE results in memory?
I like CTEs, especially in this case as my UDF is a table-valued one, so it allowed me to keep everything in a single statement.
To use a temp table, I would need to write a multi-statement table valued UDF, which I find a slightly less elegant solution.
Have any of you had this kind of performance issue with CTEs, and if so, how did you get it sorted?
Thanks,
Kharlos
I believe that the CTE's results are retrieved every time it is referenced. With a temp table the results are stored until the table is dropped. This would seem to explain the performance gains you saw when you switched to a temp table.
Another benefit is that you can create indexes on a temporary table, which you can't do with a CTE. Not sure if there would be a benefit in your situation, but it's good to know.
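A minimal sketch of the temp-table variant, with hypothetical table and column names standing in for the real CTE query:
-- Materialize the date-filtered subset once instead of re-evaluating the CTE
SELECT s.SaleID, s.Product, s.Quality, s.Amount
INTO #Data
FROM dbo.Sales AS s          -- stands in for the CTE's joined tables
WHERE s.SaleDate >= '20240101'
AND s.SaleDate < '20240201';

-- Unlike a CTE, a temp table can be indexed to support the self joins
CREATE INDEX IX_Data_Product_Quality ON #Data (Product, Quality);

SELECT SUM(CASE WHEN b.SaleID IS NOT NULL THEN m.Amount END) AS BananaTotal
FROM #Data AS m
LEFT JOIN #Data AS b
ON b.SaleID = m.SaleID
AND b.Product = 'Bananas'
AND b.Quality = 100;

DROP TABLE #Data;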
Related reading:
Which are more performant, CTE or temporary tables?
SQL 2005 CTE vs TEMP table Performance when used in joins of other tables
http://msdn.microsoft.com/en-us/magazine/cc163346.aspx#S3
Quote from the last link:
The CTE's underlying query will be called each time it is referenced in the immediately following query.
I'd say go with the temp table. Unfortunately elegant isn't always the best solution.
UPDATE:
Hmmm, that makes things more difficult. It's hard for me to say without looking at your whole environment.
Some thoughts:
Can you use a stored procedure instead of a UDF (instead, not from within)?
This may not be possible, but if you can remove the left join from your CTE you could move that into an indexed view. If you are able to do this you may see performance gains over even the temp table.
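If the left join really can be removed, here is a sketch of the indexed-view idea with hypothetical names (indexed views need SCHEMABINDING, two-part names and no outer joins):
CREATE VIEW dbo.vDataSubset
WITH SCHEMABINDING
AS
SELECT s.SaleID, s.Product, s.Quality, s.Amount
FROM dbo.Sales AS s          -- the CTE's inner-join portion only
JOIN dbo.Products AS p
ON p.ProductID = s.ProductID;
GO
-- The unique clustered index is what actually materializes (persists) the view
CREATE UNIQUE CLUSTERED INDEX IX_vDataSubset ON dbo.vDataSubset (SaleID);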

Using the results of a query in OPENQUERY

I have a SQL Server 2005 database that is linked to an Oracle database. What I want to do is run a query to pull some ID numbers out of it, then find out which ones are in Oracle.
So I want to take the results of this query:
SELECT pidm
FROM sql_server_table
And do something like this to query the Oracle database (assuming that the results of the previous query are stored in #pidms):
OPENQUERY(oracledb,
'
SELECT pidm
FROM table
WHERE pidm IN (' +
#pidms + ')')
GO
But I'm having trouble thinking of a good way to do this. I suppose that I could do an inner join of queries similar to these two. Unfortunately, there are a lot of records to pull within a limited timeframe so I don't think that will be a very performant option to choose.
Any suggestions? I'd ideally like to do this with as little Dynamic SQL as possible.
Ahhhh, pidms. Brings back bad memories! :)
You could do the join, but you would do it like this:
select sql.pidm, sql.field2
from sqltable as sql
inner join (select pidm, field2 from oracledb..schema.table) as orcl
on sql.pidm = orcl.pidm
I'm not sure if you could write a PL/SQL procedure that would take a table variable from sql...but maybe.....no, I doubt it.
Store openquery results in a temp table, then do an inner join between the SQL table and the temp table.
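A minimal sketch of that approach, using the oracledb linked server and sql_server_table from the question (the Oracle schema and table names below are placeholders):
-- Pull the Oracle ids once into a temp table via the linked server
SELECT pidm
INTO #oracle_pidms
FROM OPENQUERY(oracledb, 'SELECT pidm FROM some_schema.some_table');

-- Then join locally against the SQL Server table
SELECT s.pidm
FROM sql_server_table AS s
INNER JOIN #oracle_pidms AS o
ON o.pidm = s.pidm;

DROP TABLE #oracle_pidms;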
I don't think you can do a join since OPENQUERY requires a pure string (as you wrote above).
BG: Actually, a JOIN in SQL Server to Oracle via OPENQUERY does work, avoiding a #tmp table and allowing a JOIN to the SQL side without parameters - for example:
[SQL SP] LEFT JOIN OPENQUERY(ORADB,
'SELECT COUNT(DISTINCT O.ORD_NUM) LCNT,
O.ORD_MAIN_NUM
FROM CUSTOMER.CUST_FILE C
JOIN CUSTOMER.ORDER_NEW O
ON C.ID = O.ORD_ID
WHERE C.CUS_ID NOT IN (''2'',''3'')
GROUP BY O.ORD_MAIN_NUM') LC
ON T.ID = LC.ORD_MAIN_NUM
Cheers, Bill Gibbs
