Using PySpark in a Microsoft SQL Server using JDBC for connection - sql-server

I'm using PySpark to query a Microsoft SQL Server database over a JDBC connection.
query = """(
WITH table_1 AS (
SELECT
code_1,
a
FROM my_database_table_1
),
table_2 AS (
SELECT
code_2,
b
FROM my_database_table_2
)
SELECT
table_1.code_1 AS tb1_code_1,
table_2.code_2 AS tb2_code_2
FROM table_1
INNER JOIN table_2
ON table_1.code_1 = table_2.code_2
) AS _
"""
df_python = spark.read.jdbc(url=jdbc_url, table=query, properties=properties)
I'm getting the following error:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'WITH'.
Does anyone know why I'm getting this error?
Edit 1:
I replaced == by = in the INNER JOIN clause.
I didn't include a , after the closing parenthesis in table_2, as it's not necessary.
( and ) as _ is required by JDBC.
To simplify, this is another query that returns the same error as the query above:
query = """(
WITH table_1 (code_1)
AS
(
SELECT code_1
FROM my_database_table_1
)
SELECT code_1
FROM table_1
) as _
"""
And this is a query that works:
query = """(
SELECT code_1
FROM my_database_table_1
) as _
"""
I'm starting to think that the ( and ) as _ clauses, required by JDBC, may be causing problems with the WITH clause.
Edit 2:
Well, apparently CTEs simply don't work with this driver, so I'll have to find another way out without using WITH.
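A likely explanation, offered as a sketch rather than a confirmed diagnosis: Spark places the `table` string in the FROM clause of a query it generates itself, and SQL Server does not accept a WITH clause inside a derived table. Under that assumption, one way out is to turn each CTE into an inline derived table. The join column `code_2` below is an assumption, since `table_2` in the question only selects `code_2` and `b`; the `load` helper is hypothetical and is not called here.

```python
def cte_free_dbtable() -> str:
    """Return the question's join as an aliased subquery that contains no WITH clause."""
    return """(
    SELECT
        table_1.code_1 AS tb1_code_1,
        table_2.code_2 AS tb2_code_2
    FROM (SELECT code_1, a FROM my_database_table_1) AS table_1
    INNER JOIN (SELECT code_2, b FROM my_database_table_2) AS table_2
        ON table_1.code_1 = table_2.code_2  -- assumed join column
) AS _"""


query = cte_free_dbtable()


def load(spark, jdbc_url, properties):
    """Hypothetical helper: read the subquery through the same call used in the question."""
    return spark.read.jdbc(url=jdbc_url, table=query, properties=properties)
```

With a live session this would be used as `df_python = load(spark, jdbc_url, properties)`, where `jdbc_url` and `properties` are the values from the question.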

In SQL the equality operator is =, not ==. You used == in the JOIN, which may be the cause of the error.
Try this code:
from pyspark.sql import SparkSession
import os
driver = "/home/romerito/Documents/apache-spark-3.1.2/spark-3.1.2-bin-hadoop3.2/jars/mssql-jdbc-9.2.1.jre11.jar"
spark = (
SparkSession
.builder
.appName("load-sample-jdbc")
.master("local[2]")
.config("spark.driver.extraClassPath", driver)
.getOrCreate()
)
# readCredentials is the answerer's own helper (not shown); it is assumed to return a dict
credentials = readCredentials(os.getcwd() + "/others/credentials-mssql.txt")
server = credentials['server']
port = credentials['port']
database = credentials['database']
user = credentials['user']
password = credentials['password']
connection = f"jdbc:sqlserver://{server}:{port};databaseName={database}"
query = """(
WITH table_1 AS (
SELECT
code_1,
a
FROM my_database_table_1
),
table_2 AS (
SELECT
code_2,
b
FROM my_database_table_2
)
SELECT
table_1.code_1 AS tb1_code_1,
table_2.code_2 AS tb2_code_2
FROM table_1
INNER JOIN table_2
ON table_1.code_1 = table_2.code_2
) AS _
"""
df = spark.read \
.format('jdbc') \
.option('url', connection) \
.option('user', user) \
.option('password', password) \
.option('dbtable', query) \
.load()
df.show()

Related

Replacement for rowid in SQL Server

I have an Oracle select that I need to execute in SQL Server (the table is exported from an Oracle database to a SQL Server database). I can replace nvl with isnull and decode with case I guess, but how to deal with the rowid in this specific case?
select sum(
nvl(
(select sum(b.restsaldo) from reskontro.erkrysskid b
where 1=1
and b.fakturanr = a.fakturanr
and b.kundenr = a.kundenr
and b.resknr = b.resknr
and a.rowid = decode(a.reskfunknr,31,a.rowid,b.rowid)
and nvl(b.restsaldo,0) <> 0
and b.krysskidid <= a.krysskidid
and not exists (select * from reskontro.erkrysskid c
where b.kundenr = c.kundenr
and b.resknr = c.resknr
and a.resklinr < c.resklinr
and a.krysskidid < c.krysskidid
and b.fakturanr = c.fakturanr
and c.reskfunknr in (31,75)
and nvl(c.attfort,-1) = -1)
),0
)
) as restsaldo from reskontro.erkrysskid a
where 1=1
and a.kundenr = 1
and a.resknr = 1
SQL Server doesn't have a ROWID pseudo column. In Oracle this is being used in the context of a self-join to determine if the two rows being joined are the same row. In SQL Server simply compare the table's key columns instead.
For example, if the table has a key on an Id column, use:
and a.Id = case when a.reskfunknr = 31 then a.Id else b.Id end
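The effect of that CASE predicate can be checked on a toy self-join. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, with a made-up three-row table and an assumed `Id` key column, so it only illustrates the logic: rows with `reskfunknr = 31` match every row, and all other rows match only themselves.

```python
import sqlite3

# In-memory SQLite database standing in for SQL Server; table contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE erkrysskid (Id INTEGER PRIMARY KEY, reskfunknr INTEGER)")
conn.executemany(
    "INSERT INTO erkrysskid (Id, reskfunknr) VALUES (?, ?)",
    [(1, 31), (2, 10), (3, 10)],
)

# Key comparison replacing Oracle's a.rowid = decode(a.reskfunknr, 31, a.rowid, b.rowid)
rows = conn.execute(
    """
    SELECT a.Id, b.Id
    FROM erkrysskid a
    JOIN erkrysskid b
      ON a.Id = CASE WHEN a.reskfunknr = 31 THEN a.Id ELSE b.Id END
    ORDER BY a.Id, b.Id
    """
).fetchall()

print(rows)
```

If the real table has a composite key, the same comparison would need to be repeated for each key column.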

Select with WITH

I am new to MySQL. I have a DB2 select that I would like to rewrite for MS SQL using a WITH clause.
DB2
First SQL statement:
SELECT
SQLUSER.TT_VALUTRAZ.SIFRA3,
SQLUSER.TT_VALUTRAZ.SIFVAL32,
SQLUSER.TT_VALUTRAZ.DATUM,
SQLUSER.TT_VALUTRAZ.RAZMERJE
FROM
SQLUSER.TT_VALUTRAZ
WHERE
(
(SQLUSER.TT_VALUTRAZ.DATUM >= '1.5.2007')
) ---> this go to DW.TEMP_PFPC_TT_VALUTRAZ
Second SQL statement:
SELECT
'705' AS SIFRA3,
'891' AS SIFVAL32,
A.DATUM,
A.RAZMERJE AS RAZMERJE
FROM
DW.TEMP_PFPC_TT_VALUTRAZ A
WHERE
A.DATUM >= '1.5.2007' AND
A.SIFRA3 = '891' AND
A.SIFVAL32 = '978' AND
('705', '891', A.DATUM) NOT IN
(
SELECT
SIFRA3,
SIFVAL32,
DATUM
FROM
DW.TEMP_PFPC_TT_VALUTRAZ
WHERE
SIFRA3 = '705' AND
SIFVAL32 = '891'
)
Now I would like to join these two SQL statements using one WITH clause and MS SQL syntax.
There are many ways of doing this; this is only one of them. A few places have been changed because of syntax issues. Let me know if this answer is useful.
; WITH CTE_1 AS ( SELECT
SQLUSER.TT_VALUTRAZ.SIFRA3,
SQLUSER.TT_VALUTRAZ.SIFVAL32,
SQLUSER.TT_VALUTRAZ.DATUM,
SQLUSER.TT_VALUTRAZ.RAZMERJE
FROM
SQLUSER.TT_VALUTRAZ
WHERE
(
(SQLUSER.TT_VALUTRAZ.DATUM >= '1.5.2007')
) ---> this go to DW.TEMP_PFPC_TT_VALUTRAZ
)
, CTE_2 AS (
SELECT * FROM
(
SELECT
'705' AS SIFRA3,
'891' AS SIFVAL32,
A.DATUM,
A.RAZMERJE AS RAZMERJE
FROM
DW.TEMP_PFPC_TT_VALUTRAZ A
) AS B
WHERE
B.DATUM >= '1.5.2007' AND
B.SIFRA3 = 891 AND
B.SIFVAL32 = 978 AND
B.SIFRA3 NOT IN (705) --use of sub queries in where clause has been fixed.
AND B.SIFVAL32 NOT IN (891) --use of sub queries in where clause has been fixed.
AND B.DATUM NOT IN
(
SELECT
DATUM
FROM
DW.TEMP_PFPC_TT_VALUTRAZ
)
)
SELECT * FROM
CTE_1 AS [C]
INNER JOIN CTE_2 AS [CT]
ON [C].[SIFRA3] = [CT].[SIFRA3]
AND [C].[SIFVAL32] = [CT].[SIFVAL32]
AND [C].[DATUM] = [CT].[DATUM]
AND [C].[RAZMERJE] = [CT].[RAZMERJE]
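A different rewrite from the one in the answer above, offered only as a sketch: SQL Server does not accept a row-value list such as `('705', '891', A.DATUM) NOT IN (...)`, and a common substitute is a correlated NOT EXISTS. The example runs on Python's built-in sqlite3 module as a stand-in engine, with a hypothetical `valutraz` table, made-up rows, and ISO date strings instead of the `'1.5.2007'` literals.

```python
import sqlite3

# In-memory SQLite database as a stand-in; table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE valutraz (SIFRA3 TEXT, SIFVAL32 TEXT, DATUM TEXT, RAZMERJE REAL)"
)
conn.executemany(
    "INSERT INTO valutraz VALUES (?, ?, ?, ?)",
    [
        ("891", "978", "2007-05-01", 1.5),
        ("891", "978", "2007-05-02", 1.6),
        ("705", "891", "2007-05-01", 2.0),  # already present, so this date is excluded
    ],
)

sql = """
WITH src AS (
    SELECT SIFRA3, SIFVAL32, DATUM, RAZMERJE
    FROM valutraz
    WHERE DATUM >= '2007-05-01'
)
SELECT '705' AS SIFRA3, '891' AS SIFVAL32, a.DATUM, a.RAZMERJE
FROM src a
WHERE a.SIFRA3 = '891'
  AND a.SIFVAL32 = '978'
  AND NOT EXISTS (          -- replaces ('705', '891', a.DATUM) NOT IN (...)
      SELECT 1
      FROM src x
      WHERE x.SIFRA3 = '705'
        AND x.SIFVAL32 = '891'
        AND x.DATUM = a.DATUM
  )
"""
rows = conn.execute(sql).fetchall()
print(rows)
```

The single CTE `src` plays the role of `DW.TEMP_PFPC_TT_VALUTRAZ`, so both original statements collapse into one query.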

Incorrect syntax near "Exists" … - Convert Query from MySQL to SQL Server

My application uses a query that is working fine using a MySQL/MariaDB-Database.
I did modify my application to be more flexible and work with Microsoft SQL Server too.
Unfortunately the following query does NOT work using a SQL Server database:
select
p.PrinterGUID,
(exists (select 1
from computerdefaultprinter cdp
where cdp.PrinterGUID = p.PrinterGUID and
cdp.ComputerGUID = '5bec3779-b002-46ba-97c4-19158c13001f')
) as is_computer_default,
(exists (select 1
from userdefaultprinter udp
where udp.PrinterGUID = p.PrinterGUID and
udp.UserGUID = 'd3cf699b-8d71-4dbc-92f3-402950042054')
) as is_user_default
from
((select cm.PrinterGUID
from computermapping cm
where cm.ComputerGUID = '5bec3779-b002-46ba-97c4-19158c13001f'
) union -- to remove duplicates
(select PrinterGUID
from usermapping um
where um.UserGUID = 'd3cf699b-8d71-4dbc-92f3-402950042054')) p;
Running this query throws an error
Incorrect syntax near the keyword 'exists'
Microsoft SQL Server Management Studio returns the following in German:
I have created a SQL Fiddle with some example data: SQL-Fiddle
If necessary, more background-information is available here: UNION 2 Select-queries with computed columns
Is it possible to modify this query to work in both MySQL and SQL Server?
Thank you very much!
The literal translation of your query would be the following:
select
p.PrinterGUID,
CASE WHEN exists (select 1
from computerdefaultprinter cdp
where cdp.PrinterGUID = p.PrinterGUID and
cdp.ComputerGUID = '5bec3779-b002-46ba-97c4-19158c13001f')
THEN 1 ELSE 0 END as is_computer_default,
CASE WHEN exists (select 1
from userdefaultprinter udp
where udp.PrinterGUID = p.PrinterGUID and
udp.UserGUID = 'd3cf699b-8d71-4dbc-92f3-402950042054')
THEN 1 ELSE 0 END as is_user_default
from (select cm.PrinterGUID
from computermapping cm
where cm.ComputerGUID = '5bec3779-b002-46ba-97c4-19158c13001f'
union -- to remove duplicates
select PrinterGUID
from usermapping um
where um.UserGUID = 'd3cf699b-8d71-4dbc-92f3-402950042054') p;
Notice the use of CASE expressions to set each column to 1 or 0 depending on whether the EXISTS evaluates to true.
Try the following query. You don't need to use EXISTS like this in SQL Server: each scalar subquery returns 1 when a matching row exists and NULL otherwise, so you can filter the rows at the end using is_computer_default or is_user_default.
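The CASE WHEN EXISTS form can be tried quickly against a small dataset. The sketch below uses Python's built-in sqlite3 module as a stand-in engine, with hypothetical short GUIDs (`c1`, `u1`, `p1` to `p3`) and bound parameters in place of the literal GUIDs; it shows the shape of the result, not SQL Server or MySQL behavior as such.

```python
import sqlite3

# In-memory SQLite database as a stand-in; GUID values are shortened, made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE computermapping (ComputerGUID TEXT, PrinterGUID TEXT);
    CREATE TABLE usermapping (UserGUID TEXT, PrinterGUID TEXT);
    CREATE TABLE computerdefaultprinter (ComputerGUID TEXT, PrinterGUID TEXT);
    CREATE TABLE userdefaultprinter (UserGUID TEXT, PrinterGUID TEXT);
    INSERT INTO computermapping VALUES ('c1', 'p1'), ('c1', 'p2');
    INSERT INTO usermapping VALUES ('u1', 'p2'), ('u1', 'p3');
    INSERT INTO computerdefaultprinter VALUES ('c1', 'p1');
    INSERT INTO userdefaultprinter VALUES ('u1', 'p3');
    """
)

query = """
SELECT p.PrinterGUID,
       CASE WHEN EXISTS (SELECT 1 FROM computerdefaultprinter cdp
                         WHERE cdp.PrinterGUID = p.PrinterGUID
                           AND cdp.ComputerGUID = ?)
            THEN 1 ELSE 0 END AS is_computer_default,
       CASE WHEN EXISTS (SELECT 1 FROM userdefaultprinter udp
                         WHERE udp.PrinterGUID = p.PrinterGUID
                           AND udp.UserGUID = ?)
            THEN 1 ELSE 0 END AS is_user_default
FROM (SELECT cm.PrinterGUID FROM computermapping cm WHERE cm.ComputerGUID = ?
      UNION  -- removes duplicates
      SELECT um.PrinterGUID FROM usermapping um WHERE um.UserGUID = ?) p
ORDER BY p.PrinterGUID
"""
# Parameters follow the order of the ? placeholders in the query text.
rows = conn.execute(query, ("c1", "u1", "c1", "u1")).fetchall()
print(rows)
```

Each printer appears once, flagged with 1 or 0 for the computer default and the user default.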
select p.PrinterGUID,
(select 1
from computerdefaultprinter cdp
where cdp.PrinterGUID = p.PrinterGUID and
cdp.ComputerGUID = '5bec3779-b002-46ba-97c4-19158c13001f')
as is_computer_default,
(select 1
from userdefaultprinter udp
where udp.PrinterGUID = p.PrinterGUID AND
udp.UserGUID = 'd3cf699b-8d71-4dbc-92f3-402950042054'
) as is_user_default
from ((select cm.PrinterGUID
from computermapping cm
where cm.ComputerGUID = '5bec3779-b002-46ba-97c4-19158c13001f'
) union -- to remove duplicates
(select PrinterGUID
from usermapping um
where um.UserGUID = 'd3cf699b-8d71-4dbc-92f3-402950042054'
)
) p;

Correctly specify dbtable in SqlContext [duplicate]

I think I am missing something but can't figure out what.
I want to load data using SQLContext and JDBC with a particular SQL statement, like:
select top 1000 text from table1 with (nolock)
where threadid in (
select distinct id from table2 with (nolock)
where flag=2 and date >= '1/1/2015' and userid in (1, 2, 3)
)
Which method of SQLContext should I use? The examples I have seen always specify a table name and lower and upper bounds.
Thanks in advance.
You should pass a valid subquery as a dbtable argument. For example in Scala:
val query = """(SELECT TOP 1000
-- and the rest of your query
-- ...
) AS tmp -- alias is mandatory*"""
val url: String = ???
val jdbcDF = sqlContext.read.format("jdbc")
.options(Map("url" -> url, "dbtable" -> query))
.load()
* Hive Language Manual SubQueries: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
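import java.sql.DriverManager
val driver = "org.postgresql.Driver" // assumed: the standard PostgreSQL JDBC driver class
val url = "jdbc:postgresql://localhost/scala_db?user=scala_user"
Class.forName(driver) // make sure the driver class is loaded
val connection = DriverManager.getConnection(url)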
val df2 = spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", "(select id,last_name from emps) e")
.option("user", "scala_user")
.load()
The key is "(select id,last_name from emps) e", here you can write a subquery in place of table_name.
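The same idea in PySpark might look like the sketch below. `jdbc_options` and `load` are hypothetical helpers, not part of Spark; the alias `tmp` and the example URL are assumptions. The helper strips a trailing semicolon, then wraps the statement in parentheses with an alias so that it is valid wherever a table name is expected.

```python
def jdbc_options(url: str, sql: str, alias: str = "tmp") -> dict:
    """Build reader options with the SQL wrapped as an aliased subquery for dbtable."""
    body = sql.strip().rstrip(";")
    return {"url": url, "dbtable": f"({body}) AS {alias}"}


def load(spark, url: str, sql: str):
    """Hypothetical helper: load the subquery through the generic JDBC reader."""
    return spark.read.format("jdbc").options(**jdbc_options(url, sql)).load()


# Example options for a shortened version of the question's statement (URL is made up).
opts = jdbc_options(
    "jdbc:sqlserver://host;databaseName=db",
    "select top 1000 text from table1;",
)
print(opts["dbtable"])
```

With a live session, `load(spark, url, sql)` would return a DataFrame; credentials can be supplied through additional `user` and `password` options.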

SQL server UPDATE OPENQUERY

I am trying to link two servers (the host is SQL Server 2008 and the destination is ORACLE) so that a table on the oracle server is populated by data held in multiple tables on the SQL server.
I have set up an INSERT OPENQUERY, which works perfectly fine. I also want to setup another query to update those records.
So, my insert query, selects records created in the last 24 hours, and inserts them, no problem. But I am struggling to create an UPDATE OPENQUERY script, that will run every 24 hours to make the relevant changes.
Here is my INSERT
INSERT OPENQUERY(ukdev662, 'SELECT DOCKET_NUMBER,
APP_NUMBER,
ATT,
LIT_H,
REVIEW,
COUNTRY,
NOTICE,
COUNTRY_CODE,
FORMALITY_NAME
FROM GWDMDV29.LOOKUPS__IPMANAGER')
SELECT PMasters.DocketNumber + ISNULL(Codes_RelationType.Code, '') + ISNULL(PMasters.FilingNumber, '') + 'PTE'
,ISNULL(PMasters.AppNumber, '[null]')
,ISNULL(PartyDetails_Att.Party, 'Att not assigned')
,0
,CONVERT(DATETIME, '20130128')
,uktst1633.dbo.Countries.Description
,'Preserve'
,uktst1633.dbo.Countries.WIPO
,ISNULL(PartyDetails_Associate.Party, '[null]')
FROM uktst1633.dbo.Countries
RIGHT OUTER JOIN uktst1633.dbo.PatentMasters PMasters
ON ( PMasters.Country = uktst1633.dbo.Countries.CountryID )
INNER JOIN uktst1633.dbo.Codes Codes_CType
ON ( PMasters.CType = Codes_CType.CodeID
AND Codes_CType.CodeTypeID = 'CSP'
)
INNER JOIN uktst1633.dbo.Codes Codes_RelationType
ON ( Codes_RelationType.CodeID = PMasters.RelationType
AND Codes_RelationType.CodeTypeID = 'RLP'
)
LEFT OUTER JOIN uktst1633.dbo.Parties Parties_Att
ON ( Parties_Att.PartyID = PMasters.Att
AND ( Parties_Att.PartyTypeID = 'ATP'
OR Parties_Att.PartyTypeID IS NULL
)
)
LEFT OUTER JOIN uktst1633.dbo.PartyDetails PartyDetails_Att
ON ( PartyDetails_Att.PartyDetailID = Parties_Att.PartyDetailID )
LEFT OUTER JOIN uktst1633.dbo.Parties Parties_Associate
ON ( Parties_Associate.PartyID = PMasters.Associate
AND ( Parties_Associate.PartyTypeID = 'AGP'
OR Parties_Associate.PartyTypeID IS NULL
)
)
LEFT OUTER JOIN uktst1633.dbo.PartyDetails PartyDetails_Associate
ON ( PartyDetails_Associate.PartyDetailID = Parties_Associate.PartyDetailID )
WHERE ( ( Codes_RelationType.Description = 'CONTINUATION'
OR Codes_RelationType.Description = 'DIVISION'
)
AND Codes_CType.Description = 'Regular'
AND PMasters.CreateDate >= DATEADD(day, -1, GETDATE())
)
Thanks for any help. I have searched around and cannot find much on OPENQUERY; what I did find covers only the basics. I tried following the syntax in the example below, but it got me nowhere...
UPDATE
Table
SET
Table.col1 = other_table.col1,
Table.col2 = other_table.col2
FROM
Table
INNER JOIN
other_table
ON
Table.id = other_table.id
