Working with Python in Azure Databricks to Write DF to SQL Server

We just switched from Scala to Python. I've got a DataFrame that I need to push into SQL Server. I've done this many times before, using the Scala code below.
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)
That's documented here.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector
I'm looking for an equivalent Python script to do the same job. I've searched but haven't come across anything. Does someone here have something that would do the job? Thanks.

Please refer to the official PySpark documentation, JDBC To Other Databases, to write a PySpark DataFrame directly to SQL Server via the MS SQL Server JDBC driver.
Here is the sample code.
spark_jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://yourserver.database.windows.net:1433;databaseName=<your database name>") \
    .option("dbtable", "<your table name>") \
    .option("user", "username") \
    .option("password", "password") \
    .save()
Or
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
spark_jdbcDF.write \
.jdbc(url=jdbcUrl, table="<your table anem>",
properties=connectionProperties ).save()
Hope it helps.

Here is the complete PySpark code to write a Spark DataFrame to a SQL Server database, including where to supply the database name and schema name:
df.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<servername>:1433;databaseName=<databasename>") \
    .option("dbtable", "[<optional_schema_name>].<table_name>") \
    .option("user", "<user_name>") \
    .option("password", "<password>") \
    .save()
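By default this write fails if the target table already exists. A minimal sketch of adding a save mode (standard Spark API, not part of the original answer): "append" adds rows to the existing table, while "overwrite" drops and recreates it unless the truncate option is also set.
# "append" adds rows to an existing table instead of failing the write
df.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<servername>:1433;databaseName=<databasename>") \
    .option("dbtable", "[<optional_schema_name>].<table_name>") \
    .option("user", "<user_name>") \
    .option("password", "<password>") \
    .mode("append") \
    .save()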

Related

How to connect MS SQL Database with Azure Databricks and run command

I want to connect to an Azure MS SQL Database from Azure Databricks via PySpark. I can do this with a pushdown query if I run a SELECT * FROM ..., but I need to run ALTER DATABASE to scale up/down. I must change this part
spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
otherwise I get the error Incorrect syntax near the keyword 'ALTER'. Can anyone help? Much appreciated.
jdbcHostname = "xxx.database.windows.net"
jdbcDatabase = "abc"
jdbcPort = 1433
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": "..............",
    "password": "............",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
pushdown_query = "(ALTER DATABASE [DBNAME] MODIFY (SERVICE_OBJECTIVE = 'S0')) dual_down"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
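spark.read.jdbc always wraps its table argument in a SELECT, which is why the DDL fails to parse. A minimal sketch of one common workaround, assuming the jdbcUrl and credentials above: open a raw JDBC connection through the JVM behind the Spark session and execute the statement directly.
# Hypothetical workaround: run the DDL over a plain JDBC connection
driver_manager = spark.sparkContext._jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, "<user>", "<password>")
statement = connection.createStatement()
try:
    # ALTER DATABASE cannot be wrapped in a SELECT, so execute it directly
    statement.executeUpdate("ALTER DATABASE [DBNAME] MODIFY (SERVICE_OBJECTIVE = 'S0')")
finally:
    statement.close()
    connection.close()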

How to connect to SQL Server using Pyodbc in Databricks?

I'm trying to connect to a database on an on-prem SQL Server, but I'm getting an error that I don't quite understand. Apparently, when I run this in Databricks, it can't find the driver I'm specifying.
Try it like this.
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "mssql-jdbc-6.4.0.jre8.jar") \
    .appName("Python Spark Data Source Example") \
    .getOrCreate()

dbDF = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://your_server_name.database.windows.net:1433;databaseName=your_database_name") \
    .option("dbtable", "dbo.ml_securitymastersample") \
    .option("user", "your_user_name") \
    .option("nullValue", "NULL") \
    .option("password", "your_password").load()
# Note: nullValue accepts a single string, not a list.

# display the dimensions of a Spark DataFrame
def shape(df):
    print("Shape: ", df.count(), ",", len(df.columns))

# count missing values across all columns
dbDF.select([count(when(col(c).isNull(), c)).alias(c) for c in dbDF.columns]).show()
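For completeness, a trivial usage of the shape helper above (not in the original answer):
shape(dbDF)  # prints e.g. Shape:  1000 , 25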
Does that work for you?

Pyspark connection to the Microsoft SQL server?

I have a huge dataset in SQL Server. I want to connect to SQL Server from Python and then use PySpark to run the query.
I've seen the JDBC driver, but I can't find a way to do it; I managed it with pyodbc but not with Spark.
Any help would be appreciated.
Please use the following to connect to Microsoft SQL Server:
def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    # SQL Server JDBC URLs take the database as ";databaseName=...", not "/..."
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(
        jdbc_hostname, jdbc_port, database
    )
    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }
    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df
spark is a SparkSession object, and the rest are pretty clear.
You can also pass pushdown queries to read.jdbc, as sketched below.
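A minimal sketch of such a pushdown query using the function above; the table name and filter are hypothetical, and JDBC requires the subquery to be parenthesized and aliased:
# Push the filtering down to SQL Server instead of loading the whole table
pushdown_query = "(SELECT * FROM dbo.my_table WHERE amount > 0) AS q"
df = connect_to_sql(spark, "<host>", 1433, "<database>", pushdown_query, "<user>", "<password>")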
I use pissall's function (connect_to_sql) but I modified it a little.
from pyspark.sql import SparkSession

def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    jdbc_url = "jdbc:mysql://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)
    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.mysql.jdbc.Driver",
    }
    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df

if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('test') \
        .master('local[*]') \
        .enableHiveSupport() \
        .config("spark.driver.extraClassPath", <path to mysql-connector-java-5.1.49-bin.jar>) \
        .getOrCreate()

    df = connect_to_sql(spark, 'localhost', <port>, <database_name>, <table_name>, <user>, <password>)
Or you can use the SparkSession.read method:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost/<database_name>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", <table_name>) \
    .option("user", <user>) \
    .option("password", <password>) \
    .load()
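Since the question mentions a huge dataset, a hedged sketch of a partitioned JDBC read may also help (standard Spark options; the column name and bounds are placeholders):
# Range-partition the read on a numeric column so Spark pulls partitions in parallel;
# "id", 1, and 1000000 are illustrative values only.
df = spark.read.jdbc(
    url=jdbc_url,                 # as built in connect_to_sql above
    table="<table_name>",
    column="id",
    lowerBound=1,
    upperBound=1000000,
    numPartitions=8,
    properties=connection_details,
)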

MS SQL with Spark Scala

Please provide a solution for the following issue.
I am using com.microsoft.sqlserver.jdbc.SQLServerDriver to access data in a Spark project with the following code:
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://127.0.0.1/sparkDB")
  .option("dbtable", " SELECT * FROM test_table")
  .option("user", "sa")
  .option("password", "User#123456")
  .load()
But I get the following error:
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 127.0.0.1/sparkDB, port 1433 has failed. Error: "127.0.0.1/sparkDB. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall."
However, my firewall is off.
The SQL Server JDBC URL doesn't accept an instance or database name after a slash, and the dbtable option must be a table name or an aliased subquery. Try using
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://127.0.0.1\\sparkDB")
  .option("dbtable", "(SELECT * FROM test_table) AS Q")
  .option("user", "sa")
  .option("password", "User#123456")
  .load()
or
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://127.0.0.1;instancename=sparkDB")
  .option("dbtable", "(SELECT * FROM test_table) AS Q")
  .option("user", "sa")
  .option("password", "User#123456")
  .load()

Connect to a Highly-Available SQL Server from R

We've recently upgraded to SQL Server 2012, which is highly available and DR-enabled. When connecting using SSMS, we need to specify the additional connection option MultiSubnetFailover=True and increase timeouts.
How can we replicate this in R? Without it, we observe sporadic connectivity/timeout issues.
Related, but for Python
> packageVersion('RODBC')
[1] '1.3.6'
> packageVersion('Base')
[1] '2.15.2'
If you are using a Data Source Name, you can add extra arguments to odbcConnect
odbcConnect(DSN, uid = "user_name", pwd = "password", MultiSubnetFailover = "True")
If you are using a connection string, you just need to add the arguments to your string.
odbcDriverConnect("driver=DRIVER; server=SERVER; database=DATABASE; uid=user_name; pwd=password; MultiSubnetFailover=True")
