MS SQL with Spark Scala - sql-server

Please provide a solution for the following issue.
I am using com.microsoft.sqlserver.jdbc.SQLServerDriver to access data in a Spark project with the following code:
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://127.0.0.1/sparkDB")
  .option("dbtable", " SELECT * FROM test_table")
  .option("user", "sa")
  .option("password", "User#123456")
  .load()
But I get the following error:
Exception in thread "main" com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host 127.0.0.1/sparkDB, port 1433 has failed. Error: "127.0.0.1/sparkDB. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".
whereas my firewall is off.

The driver is interpreting 127.0.0.1/sparkDB as the host name: the Microsoft JDBC driver does not accept a database after a slash. A named instance goes after a backslash (or in the instancename property), and a database goes in the databaseName property. Also, the dbtable option must be a table name or a parenthesised subquery with an alias. Try using
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://127.0.0.1\\sparkDB")
  .option("dbtable", "(SELECT * FROM test_table) AS Q")
  .option("user", "sa")
  .option("password", "User#123456")
  .load()
or
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", "jdbc:sqlserver://127.0.0.1;instancename=sparkDB")
  .option("dbtable", "(SELECT * FROM test_table) AS Q")
  .option("user", "sa")
  .option("password", "User#123456")
  .load()

Related

Databricks and SQL server issue with token

I need your help to create a "permanent" connection from Databricks to a SQL Server database in Azure.
I have PySpark code that connects to the database using the driver "com.microsoft.sqlserver.jdbc.spark" and the JAR spark_mssql_connector_2_12_3_0_1_0_0_alpha.jar.
I have created a class that connects to the DB via a token:
import logging

import adal

logger = logging.getLogger(__name__)  # logger used by the class below


class SQLSpark():
    database_name: str = ""
    sql_service_name: str = ""
    service_principal_id: str = ""
    service_principal_secret: str = ""
    tenant_id: str = ""
    authority: str = ""
    state = None
    except_error = None

    def __init__(self, database_name, service_principal_id, service_principal_secret, tenant_id,
                 authority, spark, sql_service_name=None):
        self.database_name = database_name
        self.sql_service_name = sql_service_name
        self.service_principal_id = service_principal_id
        self.service_principal_secret = service_principal_secret
        self.tenant_id = tenant_id
        self.authority = authority
        self.state = True
        self.except_error = ""
        self._spark_session = spark

        # Token is acquired once, when the object is constructed.
        context = adal.AuthenticationContext(self.authority)
        token = context.acquire_token_with_client_credentials(
            "https://database.windows.net", self.service_principal_id, self.service_principal_secret)
        self._access_token = token["accessToken"]

        server_name = "jdbc:sqlserver://" + self.sql_service_name + ".database.windows.net"
        self._url = server_name + ";" + "databaseName=" + self.database_name + ";"

    def select_table(self, table, sql_query):
        try:
            logger.info(f"Reading table {table} in DB {self.database_name}")
            df = self._spark_session.read.format("com.microsoft.sqlserver.jdbc.spark") \
                .options(
                    url=self._url,
                    databaseName=self.database_name,
                    accessToken=self._access_token,
                    hostNameInCertificate="*.database.windows.net",
                    query=sql_query) \
                .load()
            logger.info(f"Table {table} in database {self.database_name} has been read")
            return df
        except Exception as ex:
            logger.error(f"Failed to read table {table}")
            logger.error(ex)
The problem is that I have to process huge amounts of data, the processing takes more than an hour, and the database token expires. Is there a way to refresh the token when I call the select_table method?
The error given is:
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user '<token-identified principal>'. Token is expired.
Full error:
Py4JJavaError: An error occurred while calling o9092.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 59.0 failed 4 times, most recent failure: Lost task 0.3 in stage 59.0 (TID 2611, 10.139.64.5, executor 0): com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user '<token-identified principal>'. Token is expired. ClientConnectionId:009909b8-d779-4df2-b077-59cf4c4b3c73
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:283)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:129)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:37)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:5173)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:3810)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:94)
at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:3754)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7225)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3053)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:2562)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2216)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2067)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1204)
at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:825)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:655)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:658)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Login failed for user '<token-identified principal>'. Token is expired. ClientConnectionId:009909b8-d779-4df2-b077-59cf4c4b3c73
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:262)
at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onEOF(tdsparser.java:283)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:129)
at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:37)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:5173)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:3810)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.access$000(SQLServerConnection.java:94)
at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:3754)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7225)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3053)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:2562)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:2216)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:2067)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1204)
at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:825)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$createConnectionFactory$1(JdbcUtils.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:320)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:655)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:658)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
A couple of things I can think of.
Check if there is an option to provide a refresh URL to Spark, so it can get a new token. Similar to this but for your SQL Server instead of ADLS. You'll probably have to use some other API, like acquire_token_with_refresh_token(), to create the token.
I know some token generator implementations allow you to request an expiry period when creating a new token. If yours does, then create a token valid for however many hours you need (2, 3, 6, ...) instead of letting it default to a one-hour expiry.
The other option assumes your code is NOT correct, i.e. there is NOT a good reason to create the token in __init__(). You should create the token near where you use it, i.e.:
class SQLSpark():
    # ...
    def __init__(self, database_name, service_principal_id, service_principal_secret, tenant_id,
                 authority, spark, sql_service_name=None):
        # same as OP, except no token is created and stored on self
        ...

    def select_table(self, table, sql_query):
        # ...
        # Generate the token closer to its use.
        token = adal.AuthenticationContext(self.authority).acquire_token_with_client_credentials(
            "https://database.windows.net", self.service_principal_id, self.service_principal_secret)

        df = self._spark_session.read.format("com.microsoft.sqlserver.jdbc.spark") \
            .options(
                # ... other options unchanged
                accessToken=token["accessToken"],
                query=sql_query) \
            .load()
        # ...
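If you do prefer to keep a cached token on the instance, here is a minimal sketch of refreshing it when it is close to expiry, so long-running jobs do not hit "Token is expired". The class name is a placeholder, and it assumes adal's token dict exposes an 'expiresOn' local-time timestamp string:
from datetime import datetime, timedelta
import adal

class TokenProvider:
    """Sketch: holds an AAD token and re-acquires it shortly before expiry.
    Assumes adal returns 'expiresOn' as a string like '2021-05-05 12:34:56.789000'."""

    def __init__(self, authority, client_id, client_secret,
                 resource="https://database.windows.net"):
        self._authority = authority
        self._client_id = client_id
        self._client_secret = client_secret
        self._resource = resource
        self._token = None

    def access_token(self):
        # Re-acquire the token if none is cached or it is about to expire.
        if self._token is None or self._expires_soon():
            context = adal.AuthenticationContext(self._authority)
            self._token = context.acquire_token_with_client_credentials(
                self._resource, self._client_id, self._client_secret)
        return self._token["accessToken"]

    def _expires_soon(self, margin_minutes=5):
        expires_on = datetime.strptime(self._token["expiresOn"],
                                       "%Y-%m-%d %H:%M:%S.%f")
        return expires_on - timedelta(minutes=margin_minutes) <= datetime.now()
select_table would then call access_token() on every read instead of reusing self._access_token.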

How to connect to SQL Server using Pyodbc in Databricks?

I'm trying to connect to a database on an on-prem SQL Server but I'm getting an error that I don't quite understand. Apparently, when I run this in Databricks, it can't find the driver I'm specifying.
Try it like this.
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import col, count, when

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", "mssql-jdbc-6.4.0.jre8.jar") \
    .appName("Python Spark Data Source Example") \
    .getOrCreate()

dbDF = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://your_server_name.database.windows.net:1433;databaseName=your_database_name") \
    .option("dbtable", "dbo.ml_securitymastersample") \
    .option("user", "your_user_name") \
    .option("nullValue", ["", "N.A.", "NULL"]) \
    .option("password", "your_password") \
    .load()

# display the dimensions of a Spark DataFrame
def shape(df):
    print("Shape: ", df.count(), ",", len(df.columns))

# check % of missing values across all columns
dbDF.select([count(when(col(c).isNull(), c)).alias(c) for c in dbDF.columns]).show()
Does that work for you?

Working with Python in Azure Databricks to Write DF to SQL Server

We just switched away from Scala and moved over to Python. I've got a dataframe that I need to push into SQL Server. I did this multiple times before, using the Scala code below.
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url" -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "user" -> "username",
  "password" -> "*********",
  "dbTable" -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout" -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)
That's documented here.
https://learn.microsoft.com/en-us/azure/sql-database/sql-database-spark-connector
I'm looking for an equivalent Python script to do the same job. I searched for the same, but didn't come across anything. Does someone here have something that would do the job? Thanks.
Please refer to the official PySpark document JDBC To Other Databases to directly write a PySpark DataFrame to SQL Server via the MS SQL Server JDBC driver.
Here is the sample code.
spark_jdbcDF.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://yourserver.database.windows.net:1433") \
    .option("dbtable", "<your table name>") \
    .option("user", "username") \
    .option("password", "password") \
    .save()
Or
jdbcUrl = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

# DataFrameWriter.jdbc() performs the write itself, so no trailing .save() is needed.
spark_jdbcDF.write \
    .jdbc(url=jdbcUrl, table="<your table name>",
          properties=connectionProperties)
Hope it helps.
Here is the complete PySpark code to write a Spark DataFrame to a SQL Server database, including where to put the database name and schema name:
df.write \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://<servername>:1433;databaseName=<databasename>") \
    .option("dbtable", "[<optional_schema_name>].<table_name>") \
    .option("user", "<user_name>") \
    .option("password", "<password>") \
    .save()
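If the target table already exists, the default save mode (errorifexists) makes this write fail, so it is usually worth setting the mode explicitly. A minimal sketch, assuming rows should be appended to an existing table (same placeholder names as above):
# Same write as above, but with an explicit save mode so an existing table
# is appended to rather than causing an "already exists" error.
df.write \
    .format("jdbc") \
    .mode("append") \
    .option("url", "jdbc:sqlserver://<servername>:1433;databaseName=<databasename>") \
    .option("dbtable", "[<optional_schema_name>].<table_name>") \
    .option("user", "<user_name>") \
    .option("password", "<password>") \
    .save()
Use "overwrite" instead of "append" if the table should be dropped and recreated on every run.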

Pyspark connection to the Microsoft SQL server?

I have a huge dataset in SQL Server. I want to connect to SQL Server with Python and then use PySpark to run the query.
I've seen the JDBC driver, but I can't find the way to do it. I did it with pyodbc but not with Spark.
Any help would be appreciated.
Please use the following to connect to Microsoft SQL:
def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    # The MS SQL JDBC driver expects the database as the databaseName property,
    # not as a path segment after the port.
    jdbc_url = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df
spark is a SparkSession object, and the rest are pretty clear.
You can also pass pushdown queries to read.jdbc.
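For example, a pushdown query is just a parenthesised subquery with an alias, passed in place of the table name, so the filtering runs on the SQL Server side. A small sketch (the table and column names are made up):
# Only the rows matching the WHERE clause are transferred over JDBC,
# because the subquery is executed by SQL Server.
pushdown_query = "(SELECT id, amount FROM dbo.sales WHERE amount > 100) AS q"

df = spark.read.jdbc(url=jdbc_url, table=pushdown_query,
                     properties=connection_details)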
I use pissall's function (connect_to_sql) but I modified it a little.
from pyspark.sql import SparkSession


def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    jdbc_url = "jdbc:mysql://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.mysql.jdbc.Driver",
    }

    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df


if __name__ == '__main__':
    spark = SparkSession \
        .builder \
        .appName('test') \
        .master('local[*]') \
        .enableHiveSupport() \
        .config("spark.driver.extraClassPath", <path to mysql-connector-java-5.1.49-bin.jar>) \
        .getOrCreate()

    df = connect_to_sql(spark, 'localhost', <port>, <database_name>, <table_name>, <user>, <password>)
Or you can use the SparkSession.read method:
df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://localhost/<database_name>") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", <table_name>) \
    .option("user", <user>) \
    .option("password", <password>) \
    .load()

[ISQL]ERROR: Could not SQLConnect: [IM004][unixODBC][Driver Manager]Driver's SQLAllocHandle on SQL_HANDLE_HENV failed

I have defined the connection steps below but I get an error. I tried most of the available drivers in /usr/lib64/ but still get the same error.
/usr/local/etc/freetds.conf
#Server defined by Victor
[THESERVER]
host = 172.xx.xxx.xx
port = 1433
tds version = 7.2
/etc/odbc.ini
# Defined by victor
[CRMCONNECT]
Description = "CRMConnect"
Driver = msSQL
Trace = Yes
Servername = THESERVER
hostname = 172.xx.xxx.xx
Database = "THE_DB"
UserName = "user"
Password = "password"
PROTOCOL = TCPIP
/etc/odbcinst.ini
[msSQL]
Description = msSQL Driver
Driver = /usr/lib64/libodbc.so.2.0.0
fileusage=1
dontdlclose=1
Command:
[root@darubini ~]# isql -v "CRMCONNECT" "user" "password"
Error message:
[IM004][unixODBC][Driver Manager]Driver's SQLAllocHandle on SQL_HANDLE_HENV failed
[ISQL]ERROR: Could not SQLConnect
Below is a snapshot of the available libodbc and libtds drivers. Where would I be going wrong?
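One thing stands out in /etc/odbcinst.ini: the Driver line points at /usr/lib64/libodbc.so.2.0.0, which is the unixODBC driver manager library itself rather than an actual ODBC driver, and IM004 on SQL_HANDLE_HENV is the typical symptom of that. Since the server is defined through FreeTDS, the entry would normally point at the FreeTDS ODBC driver instead. A sketch (the library path is an assumption; locate it with something like find / -name 'libtdsodbc.so*'):
[msSQL]
Description = FreeTDS driver for MS SQL Server
Driver      = /usr/lib64/libtdsodbc.so
fileusage   = 1
dontdlclose = 1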
