Send SQL queries from DataBricks to a SQL Server using Pyspark [duplicate] - sql-server

It is very straightforward to send custom SQL queries to a SQL database from Python:
import mysql.connector

connection = mysql.connector.connect(host='localhost',
                                     database='Electronics',
                                     user='pynative',
                                     password='pynative##29')
# Any custom SQL statement, not necessarily a SELECT
sql_select_Query = "select * from Laptop"
cursor = connection.cursor()
cursor.execute(sql_select_Query)
records = cursor.fetchall()
However, I have scoured the internet for a way to do the same on Databricks and haven't found a solution. I can already read from and write to a SQL Server database using JDBC, but I want to send a custom SQL statement, for example a "bulk insert" statement, to be executed within the SQL Server database.
Here is how I read data from SQL Server using JDBC.
table_name="dbo.myTable"
spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)

Please refer to this document, SQL Databases using JDBC:
Databricks Runtime contains JDBC drivers for Microsoft SQL Server and Azure SQL Database. See the Databricks runtime release notes for the complete list of JDBC libraries included in Databricks Runtime.
This article covers how to use the DataFrame API to connect to SQL
databases using JDBC and how to control the parallelism of reads
through the JDBC interface. This article provides detailed examples
using the Scala API, with abbreviated Python and Spark SQL examples
at the end. For all of the supported arguments for connecting to SQL
databases using JDBC, see JDBC To Other Databases.
Python example:
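jdbcHostname = "<hostname>"
jdbcDatabase = "employees"
jdbcPort = 1433
# Placeholders: supply your own credentials (ideally from a secret scope)
username = "<username>"
password = "<password>"
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, username, password)
connectionProperties = {"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}
# A parenthesized subquery with an alias is pushed down to SQL Server
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)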
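The two strings in the example above can be built with small helpers. This is only a sketch: `jdbc_url` and `as_pushdown` are hypothetical names, not part of Spark or Databricks, and the URL deliberately leaves credentials out on the assumption that they are passed through the connection properties instead.

```python
def jdbc_url(host, port, database):
    """Build a SQL Server JDBC URL without embedding credentials."""
    return "jdbc:sqlserver://{0}:{1};database={2}".format(host, port, database)


def as_pushdown(query, alias="q"):
    """Wrap a SELECT so it can be passed as the `table` argument of spark.read.jdbc."""
    # A trailing semicolon would be invalid inside the parenthesized subquery
    return "({0}) {1}".format(query.strip().rstrip(";"), alias)


url = jdbc_url("<hostname>", 1433, "employees")
table = as_pushdown("select * from employees where emp_no < 10008;", "emp_alias")
print(url)
print(table)
# On a cluster: spark.read.jdbc(url=url, table=table, properties=connectionProperties)
```

Note that this route only works for statements that return rows, because Spark wraps the subquery in its own SELECT.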
However, the traditional JDBC connector writes data into your database using row-by-row insertion. You can use the Spark connector to write data to Azure SQL and SQL Server using bulk insert. This significantly improves write performance when loading large data sets or loading data into tables that use a columnstore index.
Scala example:
import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

/**
  Add column metadata.
  If not specified, metadata is automatically added
  from the destination table, which may hurt performance.
*/
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)
// df.bulkCopyToSqlDB(bulkCopyConfig) if no metadata is specified.
Ref: Use Spark Connector
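For a statement that returns no rows (such as the "bulk insert" the question mentions), one commonly used workaround is to open a plain JDBC connection through the JVM that Spark already runs in. The sketch below is an assumption-laden illustration, not a documented Databricks API: `execute_statement` is a hypothetical helper, and reaching the gateway via `spark._sc._gateway.jvm` relies on private PySpark attributes that may be unavailable on some cluster configurations.

```python
def execute_statement(jvm, url, user, password, sql):
    """Run one SQL statement over a plain JDBC connection, then close everything."""
    conn = jvm.java.sql.DriverManager.getConnection(url, user, password)
    try:
        stmt = conn.createStatement()
        try:
            # Statement.execute returns True only when a result set was produced
            return stmt.execute(sql)
        finally:
            stmt.close()
    finally:
        conn.close()


# In a notebook (assumption: the py4j gateway is reachable through these private attributes):
# jvm = spark._sc._gateway.jvm
# execute_statement(jvm, jdbcUrl, username, password, "<your SQL statement>")
```

The statement runs on the driver node only, so any file path it references must be visible to SQL Server itself, not to the Spark cluster.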
HTH.

Related

Connect python-polars to SQL server (no support currently)

How can I directly connect MS SQL Server to polars?
The documentation does not list any supported connections but recommends the use of pandas.
Update:
SQL Server authentication works as described in the answer, but Windows domain authentication does not. See the linked issue.
Ahh, actually MsSQL is supported for loading directly into polars (via the underlying library that does the work, which is connectorx); the documentation is just slightly out of date - I'll take a look and refresh it accordingly.
Here is how you can connect to MS SQL Server with Polars (connectorx under the hood). Just use a connection string:
import polars as pl
# usually don't store sensitive info in plain text
username = 'my_username'
password = '1234'
server = 'SERVER1'
database = 'db1'
trusted_conn = 'no' # or yes
conn = f'mssql://{username}:{password}@{server}/{database}?driver=SQL+Server&trusted_connection={trusted_conn}'
query = "SELECT * FROM table1"
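# pl.read_sql is the name used by older Polars releases; newer ones expose this as pl.read_database_uri(query, conn)
df = pl.read_sql(query, conn)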
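Because the credentials are embedded in a URI, characters such as `@`, `/` or `#` in a password would break the connection string unless they are percent-encoded. A minimal sketch of building the string safely follows; `mssql_uri` is a hypothetical helper, and it assumes the underlying connector decodes percent-encoded credentials.

```python
from urllib.parse import quote_plus


def mssql_uri(username, password, server, database, trusted_connection=False):
    """Build an mssql:// connection string with percent-encoded credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    trusted = "yes" if trusted_connection else "no"
    return (
        f"mssql://{user}:{pwd}@{server}/{database}"
        f"?driver=SQL+Server&trusted_connection={trusted}"
    )


conn = mssql_uri("my_username", "p@ss/word#1", "SERVER1", "db1")
print(conn)
```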

Pandas df to SQL Server, connecting without user/password

I'm currently trying to write a pandas data frame into a new SQL Server table, and I'm having trouble figuring out how to connect WITHOUT USING USER/PASSWORD.
The pandas documentation states that an engine must be created via sqlalchemy, and the company only gave me sample code (not using pandas, for other tasks) for the connection via pymssql:
    server = "server name"
    conn = pymssql.connect(server, database='TestDatabase')
    cursor = conn.cursor()
    cursor.execute(instruction)
    conn.close()
Now I must pass a connection to sqlalchemy, which the sqlalchemy documentation states would be something like
engine = create_engine("mssql+pymssql://<username>:<password>@<freetds_name>/?charset=utf8",
                       encoding='latin1', echo=True)
but our SQL Server instance uses local authentication, and I found no model for this.
How can I create this connection string?
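One possible direction, sketched under several assumptions: that "local authentication" here means Windows (trusted) authentication, that the pyodbc driver can be used in place of pymssql, and that an ODBC driver named "ODBC Driver 17 for SQL Server" is installed. SQLAlchemy accepts a raw ODBC connection string through the `odbc_connect` query parameter, so no user name or password has to appear in the URL. `trusted_engine_url` is a hypothetical helper name.

```python
from urllib.parse import quote_plus


def trusted_engine_url(server, database, driver="ODBC Driver 17 for SQL Server"):
    """Build a SQLAlchemy URL that passes a credential-free ODBC string to pyodbc."""
    odbc = (
        f"DRIVER={{{driver}}};SERVER={server};"
        f"DATABASE={database};Trusted_Connection=yes"
    )
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


url = trusted_engine_url("server name", "TestDatabase")
print(url)
# With SQLAlchemy and pyodbc installed:
# engine = sqlalchemy.create_engine(url)
# df.to_sql("new_table", engine)
```

If pymssql has to stay, a URL with the credentials simply left out (for example `mssql+pymssql://<server>/TestDatabase`) mirrors the sample `pymssql.connect` call, though whether that authenticates depends on the environment.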

Can I work with both local and ODBC linked tables in an Access database from Python?

How can pypyodbc connect to linked tables in the .accdb database? Is this possible at all, or is this a limitation of pyodbc?
I need to get data from an MS Access .accdb database into Python. This works perfectly, and I can use pypyodbc to access tables and queries defined inside the .accdb database. However, the database also has tables linked to an external SQL Server. When accessing such linked tables, pypyodbc complains that it cannot connect to the SQL Server.
test.accdb contains two tables: Test (local table) and cidb_ain (linked SQL table)
The following Python 3 code is my attempt to access the data:
import pypyodbc as pyodbc

cnxn = pyodbc.connect(driver='Microsoft Access Driver (*.mdb, *.accdb)',
                      dbq='test.accdb',
                      readonly=True)
cursor = cnxn.cursor()
# access to the local table works
for row in cursor.execute("select * from Test"):
    print(row)
print('----')
# access to the linked table fails
for row in cursor.execute("select * from cidb_ain"):
    print(row)
Output:
(1, 'eins', 1)
(2, 'zwei', 2)
(3, 'drei', 3)
----
Traceback (most recent call last):
  File "test_02_accdb.py", line 14, in <module>
    for row in cursor.execute("select * from cidb_ain"):
  File "C:\software\installed\miniconda3\lib\site-packages\pypyodbc.py", line 1605, in execute
    self.execdirect(query_string)
  File "C:\software\installed\miniconda3\lib\site-packages\pypyodbc.py", line 1631, in execdirect
    check_success(self, ret)
  File "C:\software\installed\miniconda3\lib\site-packages\pypyodbc.py", line 986, in check_success
    ctrl_err(SQL_HANDLE_STMT, ODBC_obj.stmt_h, ret, ODBC_obj.ansi)
  File "C:\software\installed\miniconda3\lib\site-packages\pypyodbc.py", line 964, in ctrl_err
    raise Error(state,err_text)
pypyodbc.Error: ('HY000', "[HY000] [Microsoft][ODBC-Treiber für Microsoft Access] ODBC-Verbindung zu 'SQL Server Native Client 11.0SQLHOST' fehlgeschlagen.")
The error message roughly translates to "ODBC connection to 'SQL Server Native Client 11.0SQLHOST' failed".
I cannot access the SQL Server through the .accdb database with pypyodbc, but querying the cidb_ain table from within MS Access is possible. Furthermore, I can connect to the SQL Server directly:
cnxn = pyodbc.connect(driver='SQL Server Native Client 11.0',
                      server='SQLHOST',
                      trusted_connection='yes',
                      database='stuffdb')
Considering that (1) MS Access (and Matlab too) can use the information contained in the .accdb file to query the linked tables, and (2) the SQL Server is accessible, I assume the problem is related to pypyodbc. (The way driver name and host name are mangled into 'SQL Server Native Client 11.0SQLHOST' in the error message seems somewhat suspicious, too.)
I have no previous experience with Access, so please be patient and let me know if I omitted important information that seemed unnecessary to me...
First, MS Access is a unique type of database application that differs somewhat from other RDBMSs (e.g., SQLite, MySQL, PostgreSQL, Oracle, DB2): it ships with both a default back-end Jet/ACE SQL relational engine (which, by the way, is not an Access-restricted component but a general Microsoft technology) and a front-end GUI and report generator. In essence, Access is a collection of objects.
Linked tables are a feature of the front-end side of MS Access, used to replace the default Jet/ACE database (i.e., local tables) with another back-end database, in your case SQL Server. Moreover, linked tables are ODBC/OLEDB connections themselves! You must have used a DSN, driver, or provider to establish and create the linked tables in the MS Access file.
Hence, any external client, here your Python script, that connects to the MS Access database [driver='Microsoft Access Driver (*.mdb, *.accdb)'] is actually connecting to the back-end Jet/ACE database. The client script never interacts with front-end objects. In your error, Python reads the ODBC connection of the linked table, and since the SQL Server driver/provider [SQL Server Native Client 11.0SQLHOST] is never called in the script, the script fails.
Altogether, to resolve your situation you must connect Python directly to the SQL Server database (and not use MS Access as a medium) to query the tables that live there, here cidb_ain. Simply use the connection string of the Access linked table:
import pypyodbc

# (USING DSN)
db = pypyodbc.connect('DSN=dsn name;')
cur = db.cursor()
cur.execute("SELECT * FROM dbo.cidb_ain")
for row in cur.fetchall():
    print(row)
cur.close()
db.close()

# (USING DRIVER)
# With Trusted_Connection=yes the driver uses Windows authentication and ignores UID/PWD
constr = 'Trusted_Connection=yes;DRIVER={SQL Server};SERVER=servername;' \
         'DATABASE=database name;UID=username;PWD=password'
db = pypyodbc.connect(constr)
cur = db.cursor()
cur.execute("SELECT * FROM dbo.cidb_ain")
for row in cur.fetchall():
    print(row)
cur.close()
db.close()
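Access usually stores a linked table's connection string with a leading `ODBC;` marker, which is not part of what an ODBC client expects. A small sketch of reusing such a value from Python: `linked_table_to_odbc` is a hypothetical helper, and the sample string is made up from the driver, server and database names mentioned in the question.

```python
def linked_table_to_odbc(connect):
    """Turn an Access linked-table Connect value into a plain ODBC connection string."""
    prefix = "ODBC;"
    if connect.upper().startswith(prefix):
        connect = connect[len(prefix):]
    # Drop empty segments left behind by doubled or trailing semicolons
    parts = [p.strip() for p in connect.split(";") if p.strip()]
    return ";".join(parts) + ";"


sample = ("ODBC;DRIVER=SQL Server Native Client 11.0;SERVER=SQLHOST;"
          "Trusted_Connection=Yes;DATABASE=stuffdb;")
print(linked_table_to_odbc(sample))
# The result can then be passed to pypyodbc.connect(...)
```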
Update:
It turns out that the solution to this problem is as simple as setting pyodbc.pooling = False before establishing the connection to the Access database:
import pyodbc
# ... also works with `import pypyodbc as pyodbc`, too
pyodbc.pooling = False # this prevents the error
cnxn = pyodbc.connect(r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ= ... ")
(previous answer)
It appears that neither pypyodbc nor pyodbc can read a SQL Server linked table from an Access database. However, System.Data.Odbc in .NET can do it so IronPython can, too.
To verify, I created a table named [Foods] in SQL Server
id  guestId  food
--  -------  ----
 1        1  pie
 2        2  soup
I created an ODBC linked table named [dbo_Foods] in Access which pointed to that table on SQL Server.
I also created a local Access table named [Guests] ...
id  firstName
--  ---------
 1  Gord
 2  Jenn
... and a saved Access query named [qryGuestPreferences] ...
SELECT Guests.firstName, dbo_Foods.food
FROM Guests INNER JOIN dbo_Foods ON Guests.id = dbo_Foods.guestId;
Running the following script in IronPython ...
import clr
import System
clr.AddReference("System.Data")
from System.Data.Odbc import OdbcConnection, OdbcCommand

connectString = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Users\Public\Database1.accdb;"
    )
conn = OdbcConnection(connectString)
conn.Open()
query = """\
SELECT firstName, food
FROM qryGuestPreferences
"""
cmd = OdbcCommand(query, conn)
rdr = cmd.ExecuteReader()
while rdr.Read():
    print("{0} likes {1}.".format(rdr["firstName"], rdr["food"]))
conn.Close()
... returns
Gord likes pie.
Jenn likes soup.

Matching the schema of a SQL Server DB to an Oracle DB

Are there known products that can perform schema matching on any level between SQL Server and Oracle as described here? If not a product, would there be a documented methodology on how to perform maybe some semantic search and comparison of db tables, fields, and even data?
I have an existing SQL Server database which currently experiences a lot of trouble updating its data as it uses a lot of undocumented and unreadable legacy code to extract information from various external data sources. Fortunately, there exists an Oracle database that, based on the nature of the business, seems to contain all the information required by the SQL Server DB. The problem is, the schemas between the two environments are vastly different. They don't follow a common naming convention, and may not even follow the same normalization (some tables may be flat on one and normalized on the other).
The naive approach of trying to go through each table and column in SQL Server and then manually and visually searching for possible matches on the Oracle one seems quite impractical, given that there are hundreds of tables between the two databases.
There are commercial solutions on the market (e.g., Database Compare Suite, Cross Database Studio) that can compare both homogeneous and heterogeneous database environments, but it may not be a good idea to spend money on those tools for schema comparison alone.
There are several methods, tools, and solutions, each with limitations depending on its scope. Here is an approach for matching schemas between SQL Server and Oracle. How deep the matching goes (table contents, data structure) depends on your implementation.
In my approach, I suggest the following steps to fulfill your requirements:
1. Establish a SQL Server linked server to the Oracle database. The steps are described further down.
2. Access the SQL Server and Oracle tables directly from the SQL Server environment. Table data can then be compared as follows (the statement runs in SQL Server, which uses EXCEPT where Oracle uses MINUS):
SQL Server:
select field1, field2, field_N from openquery(DEV, 'select * from oracle_owner_schema.testtable')
except
select field1, field2, field_N from sqlserver_database.schema_name.testtable
3. Compare the data structure (field length, data type, default value, etc.) in the same way, using the dictionary tables of SQL Server and Oracle.
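The structure comparison in the last step can also be done outside the database once the column metadata has been fetched, for example from INFORMATION_SCHEMA.COLUMNS on SQL Server and ALL_TAB_COLUMNS on Oracle. The following is a rough sketch with made-up sample data; `diff_columns` and the dictionary layout are assumptions chosen for illustration.

```python
def diff_columns(sqlserver_cols, oracle_cols):
    """Compare {(table, column): (type, length)} maps, ignoring identifier case."""
    left = {(t.upper(), c.upper()): meta for (t, c), meta in sqlserver_cols.items()}
    right = {(t.upper(), c.upper()): meta for (t, c), meta in oracle_cols.items()}
    shared = left.keys() & right.keys()
    return {
        "only_sqlserver": sorted(left.keys() - right.keys()),
        "only_oracle": sorted(right.keys() - left.keys()),
        "different": sorted(k for k in shared if left[k] != right[k]),
    }


# Hypothetical metadata rows, as they might come back from the dictionary views
sqlserver_cols = {("testtable", "field1"): ("varchar", 50),
                  ("testtable", "field2"): ("int", None)}
oracle_cols = {("TESTTABLE", "FIELD1"): ("VARCHAR2", 50),
               ("TESTTABLE", "FIELD3"): ("NUMBER", 22)}
report = diff_columns(sqlserver_cols, oracle_cols)
print(report)
```

Type names differ between the two systems (varchar versus VARCHAR2 here), so in practice the "different" list needs a type-mapping step before it becomes useful.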
Establish a SQL Server linked server to the Oracle database:
Set up ODAC.
Copy missing DLLs from Instant Client to c:\oracle\product...\client_1\BIN (the location in which ODAC was installed).
Add the hostname, port number, and service name to c:\oracle\product...\client_1\network\admin\tnsnames.ora, for example:
LOCALSERVER =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (ADDRESS = (PROTOCOL = TCP)(HOST = IP_Hostname_Oracle_Server)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = orcl)
    )
  )
Configuration in SQL Server:
a. Connect to SQL Server.
b. Database > Server Objects > Linked Servers > Providers > OraOLEDB.Oracle > right-click > Properties > check "Enable" for "Allow inprocess", then save.
c. Database > Server Objects > Linked Servers > right-click > New Linked Server
General tab:
Linked server: any name for the linked server in MSSQL
Provider: Oracle Provider
Product name: Oracle
Data source: the name you added in tnsnames.ora
Security tab:
Choose "Be made using this security context:"
Remote login: username for the remote database (e.g. guest)
With password: password (e.g. guest)
How to use:
select * from openquery(Linked_server_Name, 'select * from schema_name.table_name');
e.g. select * from openquery(Any_Name_for_linked_server_in_MSSQL, 'select * from oracle_owner_schema.testtable');
Note: The versions of ODAC and the Oracle client should be the same.
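With hundreds of tables, writing each comparison by hand gets tedious, so the statements can be generated. This sketch produces both directions of the set difference for one table pair; `compare_queries` is a hypothetical helper, and it interpolates identifiers without any escaping, so it should only be fed names you trust.

```python
def compare_queries(linked_server, oracle_table, sqlserver_table, columns):
    """Return two T-SQL statements: rows only in Oracle, then rows only in SQL Server."""
    cols = ", ".join(columns)
    oracle_side = (
        f"select {cols} from openquery({linked_server}, "
        f"'select {cols} from {oracle_table}')"
    )
    local_side = f"select {cols} from {sqlserver_table}"
    return (
        f"{oracle_side}\nexcept\n{local_side}",
        f"{local_side}\nexcept\n{oracle_side}",
    )


only_in_oracle, only_in_sqlserver = compare_queries(
    "DEV",
    "oracle_owner_schema.testtable",
    "sqlserver_database.schema_name.testtable",
    ["field1", "field2"],
)
print(only_in_oracle)
```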

How can I connect database(Microsoft SQL server 2012) with Mathematica?

I installed Microsoft SQL Server 2012 and created new database, some new tables & also inserted some values into that table.
I want to access that data from Mathematica. I read the documentation about OpenSQLConnection[] and JDBC[] but did not understand it. I haven't set up any drivers on my system.
I installed database in my system & I want to connect database with Mathematica.
Can anyone help me?
Here's my recommendation:
Bring in the DatabaseLink package:
Needs["DatabaseLink`"];
Open a connection to the database:
conn = OpenSQLConnection[JDBC["Microsoft SQL Server(jTDS)", "/"], "Username" -> "", "Password" -> ""];
Start using the database. Here is an example query on table "Names"
bunchOfNames = SQLSelect[conn, {"Names"}]
Needs["DatabaseLink`"]
(* SQL security *)
conn = OpenSQLConnection[
  JDBC["Microsoft SQL Server(jTDS)", "serverName:1433/"],
  "Username" -> "domain\\username", "Password" -> "1234",
  "Catalog" -> "MathematicaTestDB", "instance" -> "I2"]
(* Windows integrated *)
conn = OpenSQLConnection[
  JDBC["Microsoft SQL Server(jTDS)", "serverName:1433/"],
  "Catalog" -> "MathematicaTestDB", "instance" -> "Instance2"]
d1 = SQLExecute[conn, "SELECT * FROM DUMMYDATA"]
For Windows integrated authentication you need to download the jTDS distribution and extract the ntlmauth.dll file. jTDS must be able to load the native SSPI library (ntlmauth.dll). Place this DLL anywhere in the system path (defined by the PATH system variable) and you're all set.
