SQL Server through JDBC in PySpark

os.environ.get("PYSPARK_SUBMIT_ARGS", "--master yarn-client --conf spark.yarn.executor.memoryOverhead=6144 \
--executor-memory 1G –jars /mssql/jre8/sqljdbc42.jar --driver-class-path /mssql/jre8/sqljdbc42.jar")
source_df = sqlContext.read.format('jdbc').options(
url='dbc:sqlserver://xxxx.xxxxx.com',
database = "mydbname",
dbtable=mytable,
user=username,
password=pwd,
driver='com.microsoft.jdbc.sqlserver.SQLServerDriver'
).load()
I am trying to load a SQL Server table using the Spark context,
but I am running into the following error.
Py4JJavaError: An error occurred while calling o59.load.
: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I have the jar file in that location. Is that the correct jar file?
Is there a problem with the code?
I am not sure what the problem is.
Scala error
scala> classOf[com.microsoft.sqlserver.jdbc.SQLServerDriver]
<console>:27: error: object sqlserver is not a member of package com.microsoft
classOf[com.microsoft.sqlserver.jdbc.SQLServerDriver]
scala> classOf[com.microsoft.jdbc.sqlserver.SQLServerDriver]
<console>:27: error: object jdbc is not a member of package com.microsoft
classOf[com.microsoft.jdbc.sqlserver.SQLServerDriver]

The configuration is similar to the Spark-Oracle configuration.
Here is my Spark-SQL Server configuration:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.master('local[*]')\
.appName('Connection-Test')\
.config('spark.driver.extraClassPath', '/your/jar/folder/sqljdbc42.jar')\
.config('spark.executor.extraClassPath', '/your/jar/folder/sqljdbc42.jar')\
.getOrCreate()
sqlsUrl = 'jdbc:sqlserver://your.sql.server.ip:1433;database=YourSQLDB'
qryStr = """ (
SELECT *
FROM yourtable
) t """
spark.read.format('jdbc')\
.option('url',sqlsUrl)\
.option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')\
.option('dbtable', qryStr )\
.option("user", "yourID") \
.option("password", "yourPasswd") \
.load().show()
1) Set the location of the jar file you downloaded, e.g. "/your/jar/folder/sqljdbc42.jar". The jar file can be downloaded from https://www.microsoft.com/en-us/download/details.aspx?id=54671 (search for sqljdbc42.jar if the link does not work).
2) Set the correct JDBC URL, e.g. 'jdbc:sqlserver://your.sql.server.ip:1433;database=YourSQLDB' (change the port number if your setup differs).
3) Set the correct driver name: .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
Enjoy
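The three settings above can be collected in one place. The following is only a minimal sketch: the helper names sqlserver_jdbc_options and pyspark_submit_args are hypothetical, not part of PySpark. It also reflects two details worth checking in the question's code: os.environ.get only reads a variable and never sets it, and PYSPARK_SUBMIT_ARGS is generally expected to end with pyspark-shell.

```python
def sqlserver_jdbc_options(host, database, table, user, password, port=1433):
    """Build the option dict for spark.read.format('jdbc') (hypothetical helper)."""
    return {
        # The URL must start with "jdbc:sqlserver://"; the database is part of the URL.
        "url": f"jdbc:sqlserver://{host}:{port};database={database}",
        # The package order is com.microsoft.sqlserver.jdbc,
        # not com.microsoft.jdbc.sqlserver.
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "dbtable": table,
        "user": user,
        "password": password,
    }


def pyspark_submit_args(jar_path):
    """Build a PYSPARK_SUBMIT_ARGS value that puts the driver jar on the classpath."""
    return f"--jars {jar_path} --driver-class-path {jar_path} pyspark-shell"


# Usage (assumes pyspark is installed and `spark` is an existing SparkSession):
#   import os
#   os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args("/mssql/jre8/sqljdbc42.jar")
#   # ...set the variable before the first SparkContext is created...
#   opts = sqlserver_jdbc_options("your.sql.server.ip", "YourSQLDB",
#                                 "yourtable", "yourID", "yourPasswd")
#   spark.read.format("jdbc").options(**opts).load().show()
```

Keeping the URL and driver class in a single function makes typos such as a missing "j" in "jdbc:" easier to spot.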

I installed Spark on Windows and got the same error while connecting to SQL Server following the steps described at https://docs.azuredatabricks.net/spark/latest/data-sources/sql-databases.html#python-example. I solved it as follows:
1) Download SQL Server JDBC driver from here https://www.microsoft.com/en-us/download/details.aspx?id=11774.
2) Unzip as "Microsoft JDBC Driver 6.0 for SQL Server"
3) Find the JDBC jar file (like sqljdbc42.jar) in folder "Microsoft JDBC Driver 6.0 for SQL Server".
4) Copy the jar file (like sqljdbc42.jar) to "jars" folder under Spark home folder. In my case, I copied it and pasted it to "D:\spark-2.3.1-bin-hadoop2.6\jars"
5) Restart pyspark.
This is how I solved it on Windows.
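To answer the "is that the correct jar file?" part of the question before copying anything, one option is to look inside the jar, since a jar is a zip archive. This is a small sketch using only the standard library; the function name jar_contains_class is hypothetical.

```python
import zipfile

DRIVER_CLASS = "com.microsoft.sqlserver.jdbc.SQLServerDriver"


def jar_contains_class(jar_path, class_name=DRIVER_CLASS):
    """Return True if the jar (a zip archive) contains the compiled class."""
    # A class a.b.C is stored in the archive as a/b/C.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Usage (the path is taken from the answer above and may differ on your machine):
#   jar_contains_class(r"D:\spark-2.3.1-bin-hadoop2.6\jars\sqljdbc42.jar")
```

If this returns True and Spark still raises ClassNotFoundException, the jar is probably fine and the classpath settings are the more likely culprit.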

Related

How to get data from VerticaDB with Pyspark

I am trying to get data from Vertica with PySpark, but I get a ClassNotFoundException.
Error: Py4JJavaError: An error occurred while calling o165.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.VerticaSource.
My code is here :
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark import sql
# Create the spark session
spark = SparkSession \
.builder \
.appName("Vertica Connector Pyspark Example") \
.getOrCreate()
spark_context = spark.sparkContext
sql_context = sql.SQLContext(spark_context)
# The name of our connector for Spark to look up
format = "com.vertica.spark.datasource.VerticaSource"
# Set connector options based on our Docker setup
table = "*****"
db = "*****"
user = "********"
password = "********"
host = "******"
part = "1";
staging_fs_url="****"
#spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opt).load()
readDf = spark.read.load(
# Spark format
format=format,
# Connector specific options
host=host,
user=user,
password=password,
db=db,
table=table)
# Print the DataFrame contents
readDf.show()
Thanks
This is from the official documentation on how to enable Vertica as a data source in Spark:
The Vertica Connector for Apache Spark is packaged as a JAR file. You install this file on your Spark cluster to enable Spark and Vertica to exchange data. In addition to the Connector JAR file, you also need the Vertica JDBC client library. The Connector uses this library to connect to the Vertica database.
Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:
The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib.
The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.
Make sure the Vertica JDBC jar is copied to the Spark library path.
Getting the Spark Connector
Deploying the Vertica Connector for Apache Spark
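As an alternative to copying the jars into Spark's library folder, they can be passed through the spark.jars setting, which takes a comma-separated list. The sketch below builds that list; the helper name spark_jars_value is hypothetical, and it assumes the two locations quoted above have been made available on the machine running Spark. If the connector directory holds builds for several Spark versions, it is probably better to pick the matching jar than to pass them all.

```python
import glob
import os


def spark_jars_value(connector_dir, jdbc_jar):
    """Return a comma-separated jar list suitable for the `spark.jars` setting."""
    # Collect every jar in the connector directory, then add the JDBC client jar.
    jars = sorted(glob.glob(os.path.join(connector_dir, "*.jar")))
    jars.append(jdbc_jar)
    missing = [j for j in jars if not os.path.isfile(j)]
    if missing:
        raise FileNotFoundError(f"jar(s) not found: {missing}")
    return ",".join(jars)


# Usage (assumes pyspark is installed and the paths exist locally):
#   from pyspark.sql import SparkSession
#   jars = spark_jars_value("/opt/vertica/packages/SparkConnector/lib",
#                           "/opt/vertica/java/vertica-jdbc.jar")
#   spark = (SparkSession.builder
#            .appName("Vertica Connector Pyspark Example")
#            .config("spark.jars", jars)
#            .getOrCreate())
```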

Connect to SQL Server database from RHEL with Windows Authentication

I have spent two weeks installing Superset (from Airbnb) for data visualization on a virtual RHEL machine and trying to connect it to a SQL Server database. I still cannot connect to the database, I guess because of a driver problem. I have tried many things and would like to know whether you have a solution: a driver I need, changes to my configuration, etc.
Someone told me about the jTDS driver. Maybe I need something like that, but for Python. If you have any idea, here is what I have already done.
1) I tried to connect to the database from Superset :
SQL Alchemy URI : mssql://user:password#fr0-iacls-190.eu.company.corp:10001/dbname
ERROR : {"error": "Connection failed!\n\n
The error message returned was:\n(pyodbc.Error) ('IM002', '[IM002] [unixODBC][DriverManager]
Data source name not found, and no default driver specified (0) (SQLDriverConnect)')"}
2) I tried almost the same with mssql+pymssql :
SQL Alchemy URI : mssql+pymssql://user:password#fr0-iacls-190.eu.company.corp:10001/dbname
ERROR:{"error":"Connection failed!\n\n
The error message returned was:\n(pymssql.OperationalError) (18456, 'DB-Lib error message 20018,
severity 14:\\nGeneral SQL Server error: Check messages from the SQL Server\\n
DB-Lib error message 20002, severity 9:\\n Adaptive Server connection failed (fr0-iacls-190.eu.company.corp:10001)\\n')"}
3) I tried to connect to the database from my terminal on virtual RHEL machine :
# tsql -S fr0-iacls-190.eu.company.corp -U user
Password:
locale is "en_US.UTF-8"
locale charset is "UTF-8"
using default charset "UTF-8"
20^C
Here a counter increases every second. I stopped the command after 20 seconds.
4) Finally I tried a python script like this one :
import pyodbc
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=fr0-iacls-190.eu.company.corp;DATABASE=dbname;UID=;PWD=password')
The empty UID is a tip used in another StackOverflow post.
# python connect.py
Traceback (most recent call last):
File "connect.py", line 2, in <module>
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=fr0-iacls-190.eu.company.corp;DATABASE=dbname;UID=;PWD=password')
pyodbc.Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'SQL Server' : file not found (0) (SQLDriverConnect)")
Finally, I read about two files, odbc.ini and odbcinst.ini, but I don't know whether they are in the right directory (/etc). I am not working in the root home directory (~) but in its parent directory (cd ~/..).
Here are the two files if necessary :
odbc.ini
;
; odbc.ini
;
[ODBC Data Sources]
JDBC = Sybase JDBC Server
[JDBC]
Driver = /usr/local/lib/libtdsodbc.so
Description = Sybase JDBC Server
Trace = No
Servername = JDBC
Database = pubs2
UID = guest
[Default]
Driver = /usr/local/lib/libtdsodbc.so
odbcinst.ini
[PostgreSQL]
Description=ODBC for PostgreSQL
Driver=/usr/lib/psqlodbcw.so
Setup=/usr/lib/libodbcpsqlS.so
Driver64=/usr/lib64/psqlodbcw.so
Setup64=/usr/lib64/libodbcpsqlS.so
FileUsage=1
[MySQL]
Description=ODBC for MySQL
Driver=/usr/lib/libmyodbc5.so
Setup=/usr/lib/libodbcmyS.so
Driver64=/usr/lib64/libmyodbc5.so
Setup64=/usr/lib64/libodbcmyS.so
FileUsage=1
[MSSQLTest]
Driver = ODBC Driver 13 for SQL Server
Server = [http:]fr0-iacls-190.eu.company.corp[,10001]
#
# Note:
# Port is not a valid keyword in the ~/.odbc.ini file
# for the Microsoft ODBC driver on Linux
#
[ODBC Driver 13 for SQL Server]
Description=Microsoft ODBC Driver 13 for SQL Server
Driver=/opt/microsoft/msodbcsql/lib64/libmsodbcsql-13.1.so.4.0
UsageCount=1
Many thanks for your time and your help.
A few things to try:
Correct the python connection string:
You have registered the MS SQL driver in /etc/odbcinst.ini as [ODBC Driver 13 for SQL Server], so your Python code should use that name rather than SQL Server:
import pyodbc
cnxn = pyodbc.connect('Driver={ODBC Driver 13 for SQL Server};Server=fr0-iacls-190.eu.company.corp,10001;Database=dbname;UID=user;PWD=password')
replacing user and password with correct credentials.
Use isql to test a connection using odbc.ini:
Generally you use odbcinst.ini to set up driver configuration and odbc.ini for database instances (referencing the drivers), so a valid entry in your odbc.ini could be:
[friendly_database_name]
Description=A description to help you remember what this connection is for
Driver=ODBC Driver 13 for SQL Server
Server=fr0-iacls-190.eu.company.corp,10001
Database=dbname
UID=user
PWD=password
If you have isql installed (it ships with the unixODBC package), you can test with:
$ isql -3 -v friendly_database_name
Ensure there's not a firewall blocking you:
I'm not sure why the tsql command is failing for you; it should return the 1> prompt. Check that you can establish a telnet connection to the database server:
$ telnet fr0-iacls-190.eu.company.corp 10001
which should give you something like:
Trying 12.34.56.78...
Connected to fr0-iacls-190.eu.company.corp.
Escape character is '^]'
Then press Ctrl + ] and type quit to exit (or press Ctrl + C to cancel if the telnet test fails).
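If telnet is not installed on the machine, the same reachability test can be approximated from Python with the standard socket module. This is a minimal sketch; the function name can_connect is hypothetical, and a successful TCP connection only shows that the port is open, not that SQL Server is the service listening on it.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Usage (host and port taken from the question):
#   can_connect("fr0-iacls-190.eu.company.corp", 10001)
```

A False result here points at the network (firewall, wrong port, DNS) rather than at the ODBC configuration.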

Connect to a SQL Server database through RODBC

My question is related to
Trying to connect to an ODBC server using RODBC in ubuntu
and
How to specify include and lib directories when locally installing RODBC?
but I could not find suitable answers to my case.
I want to connect to a SQL Server database on a remote server using RODBC.
I have installed unixodbc and freetds, and can connect in the terminal with tsql, so the connection exists.
But when trying to connect in R with (all sensitive info has been replaced by ***):
odbcConnect(dsn="TESTSQL", uid=***, pwd=***)
I get:
Warning messages:
1: In RODBC::odbcDriverConnect("DSN=TESTSQL;UID=***;PWD=***") : [RODBC] ERROR: state 01000, code 0, message [unixODBC][Driver Manager]Can't open lib '/usr/local/Cellar/freetds/0.95.18/lib/libtdsodbc.so' : file not found
2: In RODBC::odbcDriverConnect("DSN=TESTSQL;UID=***;PWD=***") :
ODBC connection failed
The odbc.ini file being:
[ODBC Data Sources]
TESTSQL = Test database
[TESTSQL]
Driver = MSSQL
Servername = ***.**.**.**
Port = **
Database = ****
TDS_Version = 8.0
I installed the latest version of freetds, 1.00.27, so I am surprised that the library libtdsodbc.so is missing.
Is that normal? Would you recommend installing version 0.95.18, or staying with 1.00.27 and looking for the missing library?
I had to remove freetds:
brew remove freetds
and then reinstall it, specifying --with-unixodbc so that libtdsodbc.so is created:
brew install freetds --with-unixodbc
In odbc.ini, I then had to take care not to confuse "Server" and "Servername", and to point the driver at libtdsodbc.so, so that my odbc.ini looks like:
[ODBC Data Sources]
TESTSQL = Test database
[TESTSQL]
Driver = /usr/local/lib/libtdsodbc.so
Server = ***.**.**.**
Port = **
Database = ****
TDS_Version = 8.0
and connected using the RODBC package
ch1 <- odbcConnect(dsn="TESTSQL", uid=***, pwd=***)
> ch1
RODBC Connection 5
Details:
case=nochange
DSN=TESTSQL
UID=****
PWD=******
It works!
Further details are available on this page:
http://eriqande.github.io/2014/12/19/setting-up-rodbc.html
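The "Can't open lib ... file not found" message means the Driver path recorded for the DSN does not exist on disk. A quick way to spot such entries is to parse odbc.ini and test each Driver path. The sketch below is in Python rather than R, uses only the standard library, and the function name missing_driver_libs is hypothetical.

```python
import configparser
import os


def missing_driver_libs(ini_path):
    """Map DSN name -> Driver value for entries whose Driver path does not exist."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    with open(ini_path) as handle:
        parser.read_file(handle)
    missing = {}
    for section in parser.sections():
        # Option names are lower-cased by configparser, so "Driver" becomes "driver".
        driver = parser.get(section, "driver", fallback=None)
        # Only values that look like file paths can be checked on disk;
        # a bare name such as "MSSQL" refers to a section in odbcinst.ini.
        if driver and os.sep in driver and not os.path.exists(driver):
            missing[section] = driver
    return missing


# Usage (the location of odbc.ini varies; `odbcinst -j` prints the one in use):
#   missing_driver_libs("/usr/local/etc/odbc.ini")
```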

Linux Open Suse "pyodbc.Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'SQL Server' : file not found (0) (SQLDriverConnect)")"

I know this question has been asked before, but I never got a proper answer that solves my problem. I am trying to connect to a SQL Server instance on a Windows machine from a Linux openSUSE 12.4 machine.
pyodbc.connect('DRIVER={SQL Server};SERVER=servername;DATABASE=dbname;UID=userid;PWD=password')
The exact error I got was:
pyodbc.Error: ('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'SQL Server' : file not found (0) (SQLDriverConnect)")
and below is my odbcinst.ini file :
[Easysoft ODBC-SQL Server]
Driver=/usr/local/easysoft/sqlserver/lib/libessqlsrv.so
Setup=/usr/local/easysoft/sqlserver/lib/libessqlsrvS.so
Threading=0
FileUsage=1
DontDLClose=1
UsageCount=2
[Easysoft ODBC-SQL Server SSL]
Driver=/usr/local/easysoft/sqlserver/lib/libessqlsrv_ssl.so
Setup=/usr/local/easysoft/sqlserver/lib/libessqlsrvS.so
Threading=0
FileUsage=1
DontDLClose=1
UsageCount=2
This post helped me pinpoint my issue. I had installed the ODBC driver following https://github.com/mkleehammer/pyodbc/wiki/Connecting-to-SQL-Server-from-RHEL-6-or-Centos-7, and it turned out that a driver named "SQL Server" did not exist in my ini file. I changed the driver in the connection string to
cnxn = pyodbc.connect("Driver={ODBC Driver 13 for SQL Server};Server=XXXXX;Database=XXX;Uid=XXX;Pwd=XXX;")
and it works.
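The name inside DRIVER={...} has to match a section name in odbcinst.ini. One way to see which names are registered is to parse that file. This is a rough sketch; the function name registered_odbc_drivers is hypothetical and the default path is an assumption, since the file may live elsewhere on your system.

```python
import configparser


def registered_odbc_drivers(odbcinst_path="/etc/odbcinst.ini"):
    """Return the section names of odbcinst.ini, i.e. the registered driver names."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    with open(odbcinst_path) as handle:
        parser.read_file(handle)
    return parser.sections()


# Usage:
#   names = registered_odbc_drivers()
#   if "SQL Server" not in names:
#       print("Use one of these in DRIVER={...}:", names)
```

With the odbcinst.ini from the question, this would list the two Easysoft entries and no "SQL Server" entry. If pyodbc imports correctly, pyodbc.drivers() should report similar information.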
If you are using an offline RHEL server, follow the method below to set up a connection to Microsoft SQL Server.
Download the unixODBC and mssql-tools packages (e.g., unixODBC-2.3.7-1.rh.x86_64.rpm and mssql-tools-17.9.1.1-1.x86_64.rpm) from the Microsoft repo, matching your RHEL version.
Place the downloaded files on the RHEL server via WinSCP or any SSH client.
Install the two files in the order given below:
yum localinstall unixODBC-2.3.7-1.rh.x86_64.rpm
yum localinstall mssql-tools-17.9.1.1-1.x86_64.rpm
Go to the installation folder and copy the path to the driver library, e.g.:
/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.9.so.1.1
Put this path in your code:
driverpath = r"/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.9.so.1.1"
This should solve the problem.
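How the driver path ends up in the connection string is not shown above, so here is one possible sketch. The helper name sqlserver_odbc_connection_string is hypothetical, and it assumes the driver manager accepts a library path inside DRIVER={...} in the same place as a registered driver name.

```python
def sqlserver_odbc_connection_string(driver, server, database, uid, pwd, port=None):
    """Build an ODBC connection string for SQL Server (hypothetical helper).

    `driver` is either a name registered in odbcinst.ini or a full path
    to the driver's shared library.
    """
    # With the Microsoft driver the port is appended to the server after a comma.
    server_part = f"{server},{port}" if port else server
    return (
        f"DRIVER={{{driver}}};SERVER={server_part};"
        f"DATABASE={database};UID={uid};PWD={pwd}"
    )


# Usage (assumes pyodbc is installed; server and credentials are placeholders):
#   import pyodbc
#   driverpath = r"/opt/microsoft/msodbcsql17/lib64/libmsodbcsql-17.9.so.1.1"
#   cnxn = pyodbc.connect(
#       sqlserver_odbc_connection_string(driverpath, "servername", "dbname",
#                                        "userid", "password"))
```

This sketch does no escaping, so a password containing ";" or "}" would need extra handling.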
pyodbc cannot locate the driver named in Driver={SQL Server}. In my case, this was mainly because the name I had given in odbcinst.ini and related files was not correct.
Using the library path directly in the connection URI instead,
Driver=/usr/local/lib/libmsodbcsql.13.dylib;
let me connect, which showed that my configuration files were incorrect.
I also had several different libraries for connecting to SQL Server installed, which caused a conflict.
I corrected that and was able to connect.

Eclipse SQL Server Connection

I am trying to establish a connection to SQL Server from Eclipse.
I added the jar sqlserverjdbc.jar to the project build path.
This is my code:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
//Connection m_Connection = DriverManager.getConnection(
// "jdbc:microsoft:sqlserver://localhost:1433;DatabaseName=TMO", "****", "****");
String url = "jdbc:sqlserver://10.25.50.14;databaseName=TMO;integratedSecurity=true";
//Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
Connection conn = DriverManager.getConnection(url);
Statement m_Statement = conn.createStatement();
When I call getConnection(url) I get an error:
Nov 17, 2014 1:08:43 AM com.microsoft.sqlserver.jdbc.SQLServerConnection
SEVERE: Java Runtime Environment (JRE) version 1.7 is not supported by this driver. Use the sqljdbc4.jar class library, which provides support for JDBC 4.0.
Exception in thread "main" java.lang.UnsupportedOperationException: Java Runtime Environment (JRE) version 1.7 is not supported by this driver. Use the sqljdbc4.jar class library, which provides support for JDBC 4.0.
at com.microsoft.sqlserver.jdbc.SQLServerConnection.<init>(SQLServerConnection.java:238)
at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:841)
at java.sql.DriverManager.getConnection(Unknown Source)
at java.sql.DriverManager.getConnection(Unknown Source)
at com.ccih.analytics.clustering.tmo.DexRunner.getKeyPhrasesFromNewNlpDB_W_SAMPLE(DexRunner.java:394)
at com.ccih.analytics.clustering.tmo.DexRunner.main(DexRunner.java:59)
ERROR: JDWP Unable to get JNI 1.2 environment, jvm->GetEnv() return code = -2
JDWP exit error AGENT_ERROR_NO_JNI_ENV(183): [../../../src/share/back/util.c:838]
Am I using the wrong jar? I checked 'Referenced Libraries' in my Eclipse project and saw the entry for the jar with all its .class files.
Yes, you are using the wrong jar; it is not compatible with your JRE. I had the same problem, tried the sqljdbc4.jar library, and it worked fine for me. You can download the .jar from here.
I hope this helps.
