I am trying to get data from a Vertica database with PySpark, but I get a ClassNotFoundException:
Error: Py4JJavaError: An error occurred while calling o165.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.vertica.spark.datasource.VerticaSource.
Here is my code:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession
from pyspark import sql

# Create the Spark session
spark = SparkSession \
    .builder \
    .appName("Vertica Connector Pyspark Example") \
    .getOrCreate()
spark_context = spark.sparkContext
sql_context = sql.SQLContext(spark_context)

# The name of our connector for Spark to look up
format = "com.vertica.spark.datasource.VerticaSource"

# Set connector options based on our Docker setup
table = "*****"
db = "*****"
user = "********"
password = "********"
host = "******"
part = "1"
staging_fs_url = "****"

# spark.read.format("com.vertica.spark.datasource.VerticaSource").options(opt).load()
readDf = spark.read.load(
    # Spark format
    format=format,
    # Connector specific options
    host=host,
    user=user,
    password=password,
    db=db,
    table=table)

# Print the DataFrame contents
readDf.show()
Thanks
This is from the official documentation on how to enable Vertica as a data source in Spark:
The Vertica Connector for Apache Spark is packaged as a JAR file. You install this file on your Spark cluster to enable Spark and Vertica to exchange data. In addition to the Connector JAR file, you also need the Vertica JDBC client library. The Connector uses this library to connect to the Vertica database.
Both of these libraries are installed with the Vertica server and are available on all nodes in the Vertica cluster in the following locations:
The Spark Connector files are located in /opt/vertica/packages/SparkConnector/lib.
The JDBC client library is /opt/vertica/java/vertica-jdbc.jar.
Make sure the Vertica JDBC jar is copied to the Spark library path.
Getting the Spark Connector
Deploying the Vertica Connector for Apache Spark
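If copying the jars into Spark's library path is not convenient, you can also hand them to the session directly. The sketch below is only an illustration: the connector jar file name is a placeholder (use whichever jar you find under /opt/vertica/packages/SparkConnector/lib), and the connection variables (host, user, password, db, table, staging_fs_url) are the ones defined in the question.
from pyspark.sql import SparkSession

# Placeholder jar paths -- substitute the actual files from your Vertica install.
jars = ",".join([
    "/opt/vertica/packages/SparkConnector/lib/vertica-spark-connector.jar",
    "/opt/vertica/java/vertica-jdbc.jar",
])

spark = SparkSession.builder \
    .appName("Vertica Connector Pyspark Example") \
    .config("spark.jars", jars) \
    .getOrCreate()

readDf = spark.read.format("com.vertica.spark.datasource.VerticaSource") \
    .option("host", host) \
    .option("user", user) \
    .option("password", password) \
    .option("db", db) \
    .option("table", table) \
    .option("staging_fs_url", staging_fs_url) \
    .load()
readDf.show()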
I am trying to run the PyFlink walkthrough, but instead of sinking data to Elasticsearch, I want to use InfluxDB.
Note: the code in the walkthrough (link above) works as expected.
In order for this to work, we need to put the InfluxDB connector inside the Docker container.
The other Flink connectors are placed inside the container with these commands in the Dockerfile:
# Download connector libraries
RUN wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/${FLINK_VERSION}/flink-json-${FLINK_VERSION}.jar; \
wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.12/${FLINK_VERSION}/flink-sql-connector-kafka_2.12-${FLINK_VERSION}.jar; \
wget -P /opt/flink/lib/ https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-elasticsearch7_2.12/${FLINK_VERSION}/flink-sql-connector-elasticsearch7_2.12-${FLINK_VERSION}.jar;
I need help in order to:
Put an InfluxDB connector into the container
Modify the CREATE TABLE statement below so that it works for InfluxDB
CREATE TABLE es_sink (
    id VARCHAR,
    value DOUBLE
) with (
    'connector' = 'elasticsearch-7',
    'hosts' = 'http://elasticsearch:9200',
    'index' = 'platform_measurements_1',
    'format' = 'json'
)
From the documentation:
The Table and SQL APIs currently (14/06/2022) do not support InfluxDB - a SQL/Table connector does not exist.
Here are the known connectors that you can use:
From Maven Apache Flink
From Apache Bahir
You can:
Use the Flink streaming connector for InfluxDB from Apache Bahir (DataStream API only)
or
Implement your own sink (see the sketch after this list)
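For the second option, here is a minimal sketch of a hand-rolled sink written against the DataStream API; it is not from the walkthrough. It assumes the influxdb-client Python package is available in the Flink image, and the URL, token, org and bucket names are placeholders for your own InfluxDB setup.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import MapFunction

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS


class InfluxDBSink(MapFunction):
    """Writes each (id, value) record to InfluxDB as a side effect of map()."""

    def open(self, runtime_context):
        # Placeholder connection settings -- adjust to your InfluxDB container.
        self.client = InfluxDBClient(url="http://influxdb:8086",
                                     token="my-token", org="my-org")
        self.write_api = self.client.write_api(write_options=SYNCHRONOUS)

    def map(self, record):
        record_id, value = record
        point = Point("platform_measurements").tag("id", record_id).field("value", value)
        self.write_api.write(bucket="my-bucket", record=point)
        return record

    def close(self):
        self.client.close()


env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([("sensor-1", 1.0), ("sensor-2", 2.5)]) \
   .map(InfluxDBSink()) \
   .print()
env.execute("influxdb-sink-sketch")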
I wanted to do a realtime deployment of my model on Azure, so I plan to create an image that first queries an ID in an Azure SQL DB to get the required features, then predicts using my model and returns the predictions. The error I get from the pyodbc library is that the drivers are not installed.
I tried it in the Azure ML Jupyter notebook to establish the connection and found that no drivers are installed in the environment itself. After some research I found that I should create a Docker image and deploy the model there, but I still got the same result.
driver= '{ODBC Driver 13 for SQL Server}'
cnxn = pyodbc.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+ password+';Encrypt=yes'+';TrustServerCertificate=no'+';Connection Timeout=30;')
('01000', "[01000] [unixODBC][Driver Manager]Can't open lib 'ODBC
Driver 13 for SQL Server' : file not found (0) (SQLDriverConnect)")
I want a result from the query; instead I get this message.
Alternatively, you could use pymssql==2.1.1 if you add the following Docker steps to the deployment configuration (using either Environments or ContainerImages - preferred is Environments):
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package('pymssql==2.1.1')
myenv = Environment(name="mssqlenv")
myenv.python.conda_dependencies=conda_dep
myenv.docker.enabled = True
myenv.docker.base_dockerfile = 'FROM mcr.microsoft.com/azureml/base:latest\nRUN apt-get update && apt-get -y install freetds-dev freetds-bin vim gcc'
myenv.docker.base_image = None
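Once myenv is defined, you can attach it to the deployment. The snippet below is only an illustrative sketch, not part of the original answer: the model name "my-model", the service name and score.py are placeholders for your own registered model and entry script.
from azureml.core import Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
# Hypothetical registered model name -- use your own.
model = Model(ws, name="my-model")

inference_config = InferenceConfig(entry_script="score.py", environment=myenv)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name="sql-scoring-service",
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)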
Or, if you're using the ContainerImage class, you could add these Docker steps:
from azureml.core.image import Image, ContainerImage
image_config = ContainerImage.image_configuration(runtime= "python", execution_script="score.py", conda_file="myenv.yml", docker_file="Dockerfile.steps")
# Assuming this :
# RUN apt-get update && apt-get -y install freetds-dev freetds-bin vim gcc
# is in a file called Dockerfile.steps, it should produce the same result.
See this answer for more details on how I've done it using an Estimator step and a custom Docker container. You could use this Dockerfile to locally create a Docker container for that Estimator step (no need to do that if you're just using an Estimator run outside of a pipeline):
FROM continuumio/miniconda3:4.4.10
RUN apt-get update && apt-get -y install freetds-dev freetds-bin gcc
RUN pip install Cython
For more details see this post: using estimator in pipeline with custom docker images. Hope that helps!
In my experience, the comment from @DavidBrowne-Microsoft is right.
There is a similar SO thread, I am getting an error while connecting to an sql DB in Jupyter Notebook, which I answered; I think it will help you install the latest msodbcsql driver for Linux on a Microsoft Azure Notebook or in Docker.
Meanwhile, there is one detail about the connection string for Azure SQL Database which you need to note carefully: you should use {ODBC Driver 17 for SQL Server} instead of {ODBC Driver 13 for SQL Server} if your Azure SQL Database was created recently (ignore the connection string shown in the Azure portal).
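As a rough sketch, the only change from the connection code in the question is the driver name (the server, database and credentials below are placeholders for the ones you already use):
import pyodbc

# Placeholders -- reuse the server, database and credentials from your own setup.
server = 'your-server.database.windows.net'
database = 'your-database'
username = 'your-username'
password = 'your-password'

driver = '{ODBC Driver 17 for SQL Server}'  # 17 instead of 13
cnxn = pyodbc.connect(
    'DRIVER=' + driver +
    ';SERVER=' + server + ';PORT=1433;DATABASE=' + database +
    ';UID=' + username + ';PWD=' + password +
    ';Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;'
)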
You can use the AzureML built-in dataset support to connect to your SQL Server.
To do so, first create an azure_sql_database datastore. Reference here.
Then create a dataset by passing the datastore you created and the query you want to run.
Reference here.
Sample code:
from azureml.core import Dataset, Datastore, Workspace

workspace = Workspace.from_config()
sql_datastore = Datastore.register_azure_sql_database(workspace=workspace,
                                                      datastore_name='sql_dstore',
                                                      server_name='your SQL server name',
                                                      database_name='your SQL database name',
                                                      tenant_id='your directory ID/tenant ID of the service principal',
                                                      client_id='the Client ID/Application ID of the service principal',
                                                      client_secret='the secret of the service principal')
sql_dataset = Dataset.Tabular.from_sql_query((sql_datastore, 'SELECT * FROM my_table'))
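Once the dataset is created you can materialize the query result, for example as a pandas DataFrame (a small usage sketch, not part of the original answer):
# Pull the query result into pandas for inspection.
df = sql_dataset.to_pandas_dataframe()
print(df.head())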
You can also do this via the UI at ml.azure.com, where you can register an Azure SQL datastore using your user name and password.
I need to configure Apache Ignite in Redash for a BI dashboard but couldn't figure out how to do it, since there is no direct support for Ignite in Redash.
It is possible using Python query runner, which is available for stand-alone installs only. It allows you to run arbitrary Python code, in which you can query Apache Ignite via JDBC.
First, add the redash.query_runner.python query runner to settings.py:
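(A rough sketch of what that can look like; the exact variable name differs between Redash versions, so treat this as an assumption and check your own settings.py. Some versions instead read the REDASH_ADDITIONAL_QUERY_RUNNERS environment variable.)
# settings.py -- sketch only; verify the list name in your Redash version.
default_query_runners = [
    # ... existing query runners ...
    'redash.query_runner.python',
]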
Then install the Python JDBC bridge module together with its dependencies:
sudo apt-get install python-setuptools
sudo apt-get install python-jpype
sudo easy_install JayDeBeApi
Then, after a VM restart, you should add a Python data source (you might need to tweak the module path):
And then you can actually run the query (you will need to provide the Apache Ignite core JAR and a JDBC connection string as well):
import jaydebeapi

conn = jaydebeapi.connect('org.apache.ignite.IgniteJdbcThinDriver',
                          'jdbc:ignite:thin://localhost',
                          {},
                          '/home/ubuntu/.m2/repository/org/apache/ignite/ignite-core/2.3.2/ignite-core-2.3.2.jar')
curs = conn.cursor()
curs.execute("select c.Id, c.CreDtTm from TABLE.Table c")
data = curs.fetchall()

result = {"columns": [], "rows": []}
add_result_column(result, "Id", "Identifier", "string")
add_result_column(result, "CreDtTm", "Create Date-Time", "integer")

for d in data:
    add_result_row(result, {"Id": d[0], "CreDtTm": d[1]})
Unfortunately there's no direct support for JDBC in Redash (that I'm aware of) so all that boilerplate is needed.
os.environ.get("PYSPARK_SUBMIT_ARGS", "--master yarn-client --conf spark.yarn.executor.memoryOverhead=6144 \
--executor-memory 1G --jars /mssql/jre8/sqljdbc42.jar --driver-class-path /mssql/jre8/sqljdbc42.jar")
source_df = sqlContext.read.format('jdbc').options(
url='dbc:sqlserver://xxxx.xxxxx.com',
database = "mydbname",
dbtable=mytable,
user=username,
password=pwd,
driver='com.microsoft.jdbc.sqlserver.SQLServerDriver'
).load()
I am trying to load a SQL Server table using the Spark context.
But I am running into the following error:
Py4JJavaError: An error occurred while calling o59.load.
: java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
I have the jar file in that location. Is that the correct jar file?
Is there a problem with the code?
I am not sure what the problem is.
Scala error:
scala> classOf[com.microsoft.sqlserver.jdbc.SQLServerDriver]
<console>:27: error: object sqlserver is not a member of package com.microsoft
classOf[com.microsoft.sqlserver.jdbc.SQLServerDriver]
scala> classOf[com.microsoft.jdbc.sqlserver.SQLServerDriver]
<console>:27: error: object jdbc is not a member of package com.microsoft
classOf[com.microsoft.jdbc.sqlserver.SQLServerDriver]
The configuration is similar to the Spark-Oracle configuration.
Here is my Spark-SQL Server configuration:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .master('local[*]')\
    .appName('Connection-Test')\
    .config('spark.driver.extraClassPath', '/your/jar/folder/sqljdbc42.jar')\
    .config('spark.executor.extraClassPath', '/your/jar/folder/sqljdbc42.jar')\
    .getOrCreate()

sqlsUrl = 'jdbc:sqlserver://your.sql.server.ip:1433;database=YourSQLDB'

qryStr = """ (
    SELECT *
    FROM yourtable
) t """

spark.read.format('jdbc')\
    .option('url', sqlsUrl)\
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')\
    .option('dbtable', qryStr)\
    .option("user", "yourID") \
    .option("password", "yourPasswd") \
    .load().show()
Set the location of the jar file you downloaded => "/your/jar/folder/sqljdbc42.jar". The jar file can be downloaded from https://www.microsoft.com/en-us/download/details.aspx?id=54671 (search for sqljdbc42.jar if the link does not work).
Set the correct jdbc url => 'jdbc:sqlserver://your.sql.server.ip:1433;database=YourSQLDB' (change the port number if you have a different setting)
Set the correct driver name => .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
Enjoy
I installed Spark on Windows and got the same error while connecting to SQL Server, following the steps described here: https://docs.azuredatabricks.net/spark/latest/data-sources/sql-databases.html#python-example. I solved it as follows:
1) Download the SQL Server JDBC driver from https://www.microsoft.com/en-us/download/details.aspx?id=11774.
2) Unzip it as "Microsoft JDBC Driver 6.0 for SQL Server".
3) Find the JDBC jar file (like sqljdbc42.jar) in the folder "Microsoft JDBC Driver 6.0 for SQL Server".
4) Copy the jar file (like sqljdbc42.jar) to the "jars" folder under the Spark home folder. In my case, I copied it into "D:\spark-2.3.1-bin-hadoop2.6\jars".
5) Restart pyspark.
That is how I solved it on Windows Server.
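To check from PySpark that the copied jar is actually visible (mirroring the classOf test in the question), you can ask the JVM to load the driver class. This is a small diagnostic sketch using the py4j gateway (the _jvm attribute is an internal handle), not part of the original answer:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-check").getOrCreate()

# Raises a py4j error wrapping ClassNotFoundException if the jar is not on the classpath.
spark.sparkContext._jvm.java.lang.Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
print("SQLServerDriver found on the driver classpath")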
I am new to Hadoop HDFS and Sqoop.
I installed Hadoop HDFS on one machine with a single node. I am able to put files to and get files from HDFS.
Requirement [do not want to install Hadoop and Sqoop on the client machine]:
I want to access HDFS from a different machine using WebHDFS without installing Hadoop on the client machine, and that part is working fine.
To access HDFS, I am using the WebHDFS Java client jar.
Now I want to execute Sqoop export/import commands against the remote HDFS.
Case: export to the local file system where neither Hadoop nor Sqoop is installed; we are using only the Hadoop and Sqoop client jars.
public int importToHdfs(String tablename, String stmpPath) {
    int resultcode = -1;
    try {
        String s = File.separator;
        String outdir = stmpPath.substring(0, stmpPath.lastIndexOf(s));
        String[] str = {"import"
                , "--connect", jdbcURL
                , "--table", tablename
                , "--username", sUserName, "--password", sPassword
                , "-m", "1"
                , "--target-dir", "/tmp/user/hduser"
                , "--outdir", outdir
        };
        Configuration conf = new Configuration();
        resultcode = Sqoop.runTool(str, conf);
    } catch (Exception ex) {
        System.out.print(ex.toString());
    }
    return resultcode;
}