Spark JDBC SQL connector "SELECT INTO" Statement - sql-server

I am trying to read data from SQL Server. Because of some requirements I need to create a temp table using a SELECT INTO statement, which will be used later in the query. But when I run the query I get the following error:
com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near the keyword 'INTO'
My question is, is the SELECT INTO statement allowed with Spark SQL Connector?
Here is a sample query and code
drivers = {"mssql": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
sparkDf = spark.read.format("jdbc") \
.option("url", connectionString) \
.option("query", "SELECT * INTO #TempTable FROM Table1") \
.option("user", username) \
.option("password", password) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()

It's because Spark wraps your query in extra SQL, and that wrapper is getting in your way. The behaviour is described in the Spark JDBC data source documentation. Here are the docs for the "query" option you're using:
A query that will be used to read data into Spark. The specified query
will be parenthesized and used as a subquery in the FROM clause. Spark
will also assign an alias to the subquery clause. As an example, spark
will issue a query of the following form to the JDBC Source.
SELECT <columns> FROM (<user_specified_query>) spark_gen_alias
Below are a couple of restrictions while using this option.
It is not allowed to specify dbtable and query options at the same time.
It is not allowed to specify query and partitionColumn options at the same time. When specifying partitionColumn option is required, the
subquery can be specified using dbtable option instead and partition
columns can be qualified using the subquery alias provided as part of
dbtable.
Example:
spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
Essentially, the wrapper that Spark puts around your query is causing the problem. Because the wrapper starts with SELECT * FROM ( and ends with ) spark_generated_alias, you're very limited in how you can "break out" of it and still execute the statements you want.
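To see why SELECT INTO fails, it can help to write the wrapping out by hand. The sketch below is only an approximation of what the docs describe: the helper name wrap_like_spark is made up for illustration, and the real column list and alias that Spark generates may differ.

```python
def wrap_like_spark(user_query: str, alias: str = "spark_gen_alias") -> str:
    """Approximate the statement sent to the database for the `query` option."""
    return f"SELECT * FROM ({user_query}) {alias}"


# A plain SELECT is still a valid subquery after wrapping.
ok = wrap_like_spark("select c1, c2 from t1")

# SELECT ... INTO ends up inside a FROM (...) subquery, which SQL Server
# rejects with "Incorrect syntax near the keyword 'INTO'".
bad = wrap_like_spark("SELECT * INTO #TempTable FROM Table1")

print(ok)
print(bad)
```

Anything passed as "query" has to be valid as a derived table, which rules out SELECT INTO, DDL, and multi-statement batches unless you escape the wrapper.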
Here's how I do it.
I break it into 3 separate queries, because normal (#) and global (##) temp-tables don't work (the driver disconnects after each query).
Query 1:
SELECT 1 AS col) AS tbl; --terminates the "SELECT * FROM (" the driver prepends
--Write whatever Sql you want, then select into a new "real" table.
--E.g. here's your example, but with a "real" table.
SELECT * INTO _TempTable FROM Table1;
SELECT 1 FROM (SELECT 1 AS col --and the driver will append ") spark_generated_alias". The driver ignores all result-sets but the first.
Query 2:
SELECT * FROM _TempTable;
Query 3 (you cannot run this until after you're done with the DataFrame):
SELECT 1 AS col) AS tbl; --terminates the "SELECT * FROM (" the driver prepends
DROP TABLE IF EXISTS _TempTable;
SELECT 1 FROM (SELECT 1 AS col --and the driver will append ") spark_generated_alias". The driver ignores all result-sets but the first.
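If you use this trick more than once, a small helper keeps the prefix and suffix in one place. This is a sketch under assumptions: break_out and run_statements are hypothetical names, the alias shown is only illustrative, and Spark may send the wrapped statement more than once (for example while resolving the schema), so the statements should be safe to repeat.

```python
PREFIX = "SELECT 1 AS col) AS tbl;"        # closes the "SELECT * FROM (" that is prepended
SUFFIX = "SELECT 1 FROM (SELECT 1 AS col"  # left open so the appended ") alias" closes it


def break_out(*statements: str) -> str:
    """Build a `query` option value that smuggles arbitrary statements through."""
    body = " ".join(s.strip().rstrip(";") + ";" for s in statements)
    return f"{PREFIX} {body} {SUFFIX}"


def run_statements(spark, url, user, password, *statements):
    """Send the statements via spark.read; the DataFrame returned is a dummy."""
    return (
        spark.read.format("jdbc")
        .option("url", url)
        .option("query", break_out(*statements))
        .option("user", user)
        .option("password", password)
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .load()
    )


payload = break_out("SELECT * INTO _TempTable FROM Table1")
# Roughly what the server receives once the wrapper is added around the payload:
sent = f"SELECT * FROM ({payload}) spark_generated_alias"
print(sent)
```

Query 1 and Query 3 above then become run_statements(spark, ..., "SELECT * INTO _TempTable FROM Table1") and run_statements(spark, ..., "DROP TABLE IF EXISTS _TempTable"), with the normal read of _TempTable in between.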


Delta table linked to sql table

I'm new to Databricks and Spark. We create Delta tables using data from SQL. These tables are kind of mirrored: basically, if I insert a new row in SQL it shows up in Delta, and I can even insert from Databricks and have SQL updated, but deleting is allowed only from SQL.
By the way, I don't understand how it works. If I create a Delta table with this command, the Delta and SQL tables are linked:
spark.sql("""
create table IF NOT EXISTS dbname.delta_table
using org.apache.spark.sql.jdbc
OPTIONS (
url '""" + sql_url + """',
dbtable 'dbname.sql_table',
user '""" + sql_user + """',
password '""" + sql_password + """',
TRUNCATE true
)
""");
But if I try with PySpark, there's no link between the tables:
spark.read \
.format("jdbc") \
.option("url", url_sql) \
.option("dbtable", sql_table) \
.option("user", sql_user) \
.option("password", sql_password) \
.option("truncate", True) \
.load() \
.write \
.saveAsTable(delta_table)
I would like to know how to get the same result with PySpark and where to find more documentation about it. I didn't find what I was looking for, and I don't know what kind of relationship exists between the tables or which keyword describes it.
Thanks for help
Sergio
I've been looking online all day to find the correct topic but I didn't find anything
You're doing different things:
The first SQL statement creates a metadata entry in the Hive metastore that points to the SQL database. So when you read from it, Spark connects via JDBC under the hood and loads the data.
In the second approach you're actually loading the data from the database and creating a managed table that is stored in Delta format (the default format). This table is a snapshot of the SQL Server table at the time of execution.
Really, if you want to create a table as in your first case, you just need to continue to use spark.sql.
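If the goal is just to avoid gluing the statement together by hand, you can build the DDL in Python and still hand it to spark.sql. This is a minimal sketch: jdbc_table_ddl is a hypothetical helper, the URL and credentials are placeholders, and it refuses values containing single quotes rather than guessing at quoting rules.

```python
def jdbc_table_ddl(table: str, options: dict) -> str:
    """Build a CREATE TABLE ... USING jdbc statement from an options mapping."""
    rendered = []
    for key, value in options.items():
        text = str(value)
        if "'" in text:
            # Quoting rules are out of scope for this sketch, so refuse instead.
            raise ValueError(f"option {key!r} contains a single quote")
        rendered.append(f"  {key} '{text}'")
    return (
        f"CREATE TABLE IF NOT EXISTS {table}\n"
        "USING org.apache.spark.sql.jdbc\n"
        "OPTIONS (\n" + ",\n".join(rendered) + "\n)"
    )


ddl = jdbc_table_ddl(
    "dbname.delta_table",
    {
        "url": "jdbc:sqlserver://example-host:1433;databaseName=dbname",  # placeholder
        "dbtable": "dbname.sql_table",
        "user": "sql_user",          # placeholder
        "password": "sql_password",  # placeholder
    },
)
print(ddl)
# With a live session this would be run as: spark.sql(ddl)
```

The resulting table is still only a pointer to the SQL table, so every read goes back to the database over JDBC.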

How to Use OpenQuery to do Create Alias (IBM DB2) in SQL Server

I use linked server to connect AS400 DB2.
For example: select query can work
select *
from openquery([DB2], 'select t1.* from lib.table01 t1
fetch first 1 rows only')
But I want to use query
Create Alias Library.T_temp For Library.T1 (MemberName)
in SQL Server.
It returns an error because the statement returns no rows.
Both of the following fail:
Select * from OpenQuery([DB2],' Create Alias...')
Update OpenQuery([DB2],' Create Alias...')
Is there any method to do that?
Thanks
Don't try.
Your openquery() is the preferred solution.
By using openquery(), the SQL statement is passed to Db2 and run there. Since you've included fetch first 1 rows only, only 1 row is returned.
The query form
select TOP 1 t1.*
from db2.myibmi.lib.table01 t1
offset 0 rows
fetch first 1 row only
will actually pull all rows back to SQL Server and then filter them there.
(At least I know that's how it used to work when a WHERE clause was included. I assume TOP isn't any better.)

SSIS: Variable from SQL to Data Flow Task

Pretty new to BI and SQL in general, but a few months ago I didn't even know what a model is and now here I am...trying to build a package that runs daily.
Currently I'm running this in Excel via Power Query, but because there is so much data, I have to manually change the query every month. I decided to move it into SSIS.
Required outcome: Pull the last date in my Database and use it as a variable in the model (as I have millions of rows, I only want to load lines with dates greater than what I have in my table already).
Here is my Execute SQL Task:
I set up a variable for the SQL query
and trying to use it in my OLE DB query like this
Execute SQL Task: results, are fine - returns date as "dd/mm/yyyy hh24:mi:ss"
SELECT MAX (CONVACCT_CREATE_DATE) AS Expr1 FROM GOMSDailySales
Variable for OLE DB SQL Query:
"SELECT fin_booking_code, FIN_DEPT_CODE, FIN_ACCT_NO, FIN_PROD_CODE, FIN_PROG_CODE, FIN_OPEN_CODE, DEBIT_AMT, CREDIT_AMT, CURRENCY_CODE, PART_NO, FIN_DOC_NO, CREATE_DATE
FROM cuown.converted_accounts
WHERE (CREATE_DATE > TO_DATE(@[User::GetMaxDate],'yyyy/mm/dd hh24:mi:ss'))
AND (FIN_ACCT_NO LIKE '1%')"
Currently I'm getting a "missing expression" error. If I add single quotes around @[User::GetMaxDate], I get a "year must be between 0 and xxxx" error.
What am I doing wrong / is there a cleaner way to get this done?
In the OLE DB source, change the data access mode to SQL command and use the following command:
SELECT fin_booking_code, FIN_DEPT_CODE, FIN_ACCT_NO, FIN_PROD_CODE, FIN_PROG_CODE, FIN_OPEN_CODE, DEBIT_AMT, CREDIT_AMT, CURRENCY_CODE, PART_NO, FIN_DOC_NO, CREATE_DATE
FROM cuown.converted_accounts
WHERE (CREATE_DATE > TO_DATE(?,'yyyy/mm/dd hh24:mi:ss'))
AND (FIN_ACCT_NO LIKE '1%')
Then click the Parameters button and map @[User::GetMaxDate] to the first parameter.
For more information, check the following answer: Parameterized OLEDB source query
Alternative method
If parameters are not supported in the OLE DB provider you are using, create a variable of type string and evaluate this variable as the following expression:
"SELECT fin_booking_code, FIN_DEPT_CODE, FIN_ACCT_NO, FIN_PROD_CODE, FIN_PROG_CODE, FIN_OPEN_CODE, DEBIT_AMT, CREDIT_AMT, CURRENCY_CODE, PART_NO, FIN_DOC_NO, CREATE_DATE
FROM cuown.converted_accounts
WHERE CREATE_DATE > TO_DATE('" + (DT_WSTR, 50)@[User::GetMaxDate] +
"' ,'yyyy/mm/dd hh24:mi:ss') AND FIN_ACCT_NO LIKE '1%'"
Then in the OLE DB source, change the data access mode to SQL command from variable and select the string variable you created.
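One likely cause of the "year must be between" error is that the text of the date does not match the TO_DATE mask: the question reports the value as dd/mm/yyyy, while the mask expects yyyy/mm/dd. The Python sketch below is only an analogue of the SSIS expression, not SSIS syntax; build_query is a hypothetical name and the column list is shortened to keep it readable.

```python
from datetime import datetime


def build_query(max_date: datetime) -> str:
    """Mirror the string-variable expression with the date rendered to fit the mask."""
    # Same layout as the mask 'yyyy/mm/dd hh24:mi:ss' used in TO_DATE.
    stamp = max_date.strftime("%Y/%m/%d %H:%M:%S")
    return (
        "SELECT fin_booking_code, CREATE_DATE "
        "FROM cuown.converted_accounts "
        f"WHERE CREATE_DATE > TO_DATE('{stamp}', 'yyyy/mm/dd hh24:mi:ss') "
        "AND FIN_ACCT_NO LIKE '1%'"
    )


query = build_query(datetime(2019, 3, 31, 23, 59, 59))
print(query)
```

In SSIS the same idea means making sure the string that ends up between the quotes is already in the mask's layout before it is concatenated.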
You're trying to use the SSIS variable like a variable in the query. When constructing a SQL query in a string variable, you simply need to concatenate the strings together. The expression for your query string variable should look like this:
"SELECT fin_booking_code, FIN_DEPT_CODE, FIN_ACCT_NO, FIN_PROD_CODE, FIN_PROG_CODE, FIN_OPEN_CODE, DEBIT_AMT, CREDIT_AMT, CURRENCY_CODE, PART_NO, FIN_DOC_NO, CREATE_DATE
FROM cuown.converted_accounts
WHERE CREATE_DATE > " + @[User::GetMaxDate] +
" AND (FIN_ACCT_NO LIKE '1%')"

Removing query decorations from psql output in PostgreSQL

I have an SQL file which extracts data from a PostgreSQL database in a particular format. However, I'm getting query "decoration", specifically a "SELECT N" output when creating my temporary tables.
This is an example SQL file.
create temporary table a as
select 1 as the_code, 'Test'::text as the_name;
create temporary table b as
select 2 as the_code, 'Foo'::text as the_name;
select * from a union all
select * from b;
And this is the example output, produced with psql --tuples-only --no-align --file query.sql:
SELECT 1
SELECT 1
1|Test
2|Foo
Note that this question is separate from How to hide result set decoration in Psql output in that it is in relation to different query "decoration". That is, my question is how to remove the "SELECT N", while the existing question is how to remove the column headers and "(n rows)" footer.
EDIT
As a workaround, I could use common table expressions (CTEs) from which the select is run, but it'd be nice if there were a solution to the above so I don't have to rework my existing SQL.
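For reference, the CTE workaround mentioned above could look roughly like this. To keep the sketch runnable without a PostgreSQL server it uses Python's sqlite3 module as a stand-in, so the ::text casts are dropped; the WITH clause itself is standard SQL. Separately, psql has a --quiet option that suppresses informational output, which may cover the "SELECT N" command tags and is worth testing before reworking anything.

```python
import sqlite3

# Same data as the two temporary tables, expressed as CTEs in one statement,
# so no CREATE TABLE AS runs and no command tag is produced for it.
QUERY = """
WITH a AS (SELECT 1 AS the_code, 'Test' AS the_name),
     b AS (SELECT 2 AS the_code, 'Foo' AS the_name)
SELECT * FROM a
UNION ALL
SELECT * FROM b
"""

conn = sqlite3.connect(":memory:")
rows = conn.execute(QUERY).fetchall()
conn.close()

# Print in the same pipe-separated layout as psql --tuples-only --no-align.
for code, name in rows:
    print(f"{code}|{name}")
```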

SSIS SQL TASK MAX(DATE) to Variable in DATA FLOW

OK, this seems like it should be insanely easy, but I cannot figure it out. Everywhere I look online says to create temp tables and VB scripts, and I cannot believe I have to do that. My goal is to insert all the records in a table with a date later than the max date in the destination table.
UPDATE: The two tables are in two different, non-linked SQL databases.
So:
Select @[User::Dated] = MAX(Dateof) from Table2
Insert into Table2
Select *
From Table1
Where DateOf > @[User::Dated]
I am trying to do this in SSIS. I declared a variable, and the Execute SQL step looks like it is assigning the single-row output to it. But when I go into the data flow it gives me no parameters to choose from, and when I force the known parameter, which is in the project scope, it says no parameter exists.
Create two OLE DB data sources, each pointing at one of your two databases.
Create a variable called max_date and make its data type String.
Place an Execute SQL Task on the Control Flow, change its connection type to OLE DB and for the connection select the name of the data source that contains Table2. Set the ResultSet to Single Row. Add the following for the SQLStatement:
SELECT CAST(MAX(Dateof) AS VARCHAR) AS max_date FROM Table2
Go to the Result Set pane, click Add and enter the following:
Result Name: max_date
Variable Name: User::max_date
You can now use the max_date variable in an expression to create a SQL statement, for example you could use it in another Execute SQL Task which would use the second Data Connection like so:
"INSERT INTO Table2
SELECT *
FROM Table1
WHERE DateOf > '" + @[User::max_date] + "'"
Or in an OLE DB Source in a data flow like so:
"SELECT *
FROM Table1
WHERE DateOf > '" + @[User::max_date] + "'"
You can do this in a single SQL Task if you want:
Insert into Table2
Select *
From Table1
Where DateOf > (Select MAX(Dateof) from Table2)
If you want to use multiple Execute SQL Task items in the control flow, or want to make use of the parameter in a data flow instead, you have to change the General > Result Set option for your MAX() query to Single Row, then move from General to Result Set and Add a new variable for your result set to occupy.
To use that variable in your INSERT INTO.... query via Execute SQL Task, you'll construct your query with a ? for each parameter and map them in the parameter mapping section. If a variable is used multiple times in a query it's easiest to use a stored procedure, so you can simply pass the relevant parameters in SSIS.
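Outside SSIS, the same pattern (read the max date from the destination, then pull only newer rows from the source with a ? parameter) can be sketched in a few lines. This is an illustration under assumptions: two in-memory SQLite databases stand in for the two unlinked SQL Server databases, and the Id column and sample dates are made up.

```python
import sqlite3

# Two separate in-memory databases stand in for the two unlinked servers.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")

source.execute("CREATE TABLE Table1 (Id INTEGER, DateOf TEXT)")
dest.execute("CREATE TABLE Table2 (Id INTEGER, DateOf TEXT)")
source.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")],
)
dest.execute("INSERT INTO Table2 VALUES (1, '2024-01-01')")

# Step 1 (Execute SQL Task, Single Row result set): read the high-water mark.
(max_date,) = dest.execute("SELECT MAX(DateOf) FROM Table2").fetchone()

# Step 2 (data flow source): fetch only newer rows, binding the value to "?".
new_rows = source.execute(
    "SELECT Id, DateOf FROM Table1 WHERE DateOf > ?", (max_date,)
).fetchall()

# Step 3 (data flow destination): append the new rows to the destination table.
dest.executemany("INSERT INTO Table2 VALUES (?, ?)", new_rows)
dest.commit()

loaded = dest.execute("SELECT Id FROM Table2 ORDER BY Id").fetchall()
print(loaded)
```

Note that if Table2 is empty, MAX returns NULL and the comparison matches nothing, so a first run needs a fallback date.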
