Cannot Insert into SQL using PySpark, but works in SQL - sql-server

I have created a table in SQL Server using the following:
CREATE TABLE [dbo].[Validation](
[RuleId] [int] IDENTITY(1,1) NOT NULL,
[AppId] [varchar](255) NOT NULL,
[Date] [date] NOT NULL,
[RuleName] [varchar](255) NOT NULL,
[Value] [nvarchar](4000) NOT NULL
)
Note the identity column (RuleId).
When inserting values into the table directly in SQL, as below, it works:
Note: I am not inserting the primary key, since the identity column auto-fills and increments on its own.
INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')
However, when I create a temp view on Databricks and run the same query through PySpark as below:
%python
driver = "<Driver>"
url = "jdbc:sqlserver://<URL>"
database = "<db>"
table = "dbo.Validation"
user = "<user>"
password = "<pass>"
#import the data
remote_table = spark.read.format("jdbc")\
.option("driver", driver)\
.option("url", url)\
.option("database", database)\
.option("dbtable", table)\
.option("user", user)\
.option("password", password)\
.load()
remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAMES")
sqlContext.sql("INSERT INTO YOUR_TEMP_VIEW_NAMES VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
I get the error below:
AnalysisException: 'unknown requires that the data to be inserted have the same number of columns as the target table: target table has 5 column(s) but the inserted data has 4 column(s), including 0 partition column(s) having constant value(s).;'
Why does it work in SQL but not when passing the query through Databricks? How can I insert through PySpark without getting this error?

Spark SQL's INSERT INTO ... VALUES expects a value for every column of the target table; unlike T-SQL, it does not skip the identity column for you, which is why the four-value insert fails against the five-column table. The most straightforward workaround here is to use JDBC from a Scala cell, e.g.:
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
val jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')"
stmt.execute(sql)
connection.close()
You could use pyodbc too, but the SQL Server ODBC drivers aren't installed on Databricks clusters by default, whereas the JDBC drivers are.
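If you do go the pyodbc route, a minimal sketch might look like the following (assuming the Microsoft ODBC driver and the pyodbc package have been installed on the cluster; the server, database and credential values are placeholders):
%python
import pyodbc

# Placeholder connection values; install the ODBC driver and pyodbc first.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=xxxx.database.windows.net,1433;"
    "DATABASE=AdventureWorks;"
    "UID=<user>;PWD=<pass>"
)
cursor = conn.cursor()
cursor.execute("INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
conn.commit()
conn.close()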
A Spark solution would be to create a view in SQL Server that excludes the identity column and insert against that, e.g.
create view Validation2 as
select AppId,Date,RuleName,Value
from Validation
then
tableName = "Validation2"
df = spark.read.jdbc(url=jdbcUrl, table=tableName, properties=connectionProperties)
df.createOrReplaceTempView(tableName)
sqlContext.sql("INSERT INTO Validation2 VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
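Another option that stays entirely in PySpark is to append a DataFrame over JDBC; Spark generates an INSERT that lists only the DataFrame's columns, so the identity column is left for SQL Server to fill. A minimal sketch, assuming the url, user and password variables from the question:
%python
import datetime
from pyspark.sql import Row

# Only the four non-identity columns of dbo.Validation.
new_rows = spark.createDataFrame([
    Row(AppId="TestApp", Date=datetime.date(2020, 5, 15),
        RuleName="MemoryUsageAnomaly", Value="2300MB")
])

(new_rows.write
    .format("jdbc")
    .option("url", url)
    .option("dbtable", "dbo.Validation")
    .option("user", user)
    .option("password", password)
    .mode("append")
    .save())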
If you want to encapsulate the Scala and call it from another language (like Python), you can use a Scala package cell, e.g.:
%scala
package example
import java.util.Properties
import java.sql.DriverManager
object JDBCFacade
{
def runStatement(url : String, sql : String, userName : String, password: String): Unit =
{
val connection = DriverManager.getConnection(url, userName, password)
val stmt = connection.createStatement()
try
{
stmt.execute(sql)
}
finally
{
connection.close()
}
}
}
and then you can call it from a Python cell like this:
jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
jdbcUrl = "jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
sql = "select 1 a into #foo from sys.objects"
sc._jvm.example.JDBCFacade.runStatement(jdbcUrl,sql, jdbcUsername, jdbcPassword)

Related

NIFI - upload binary.zip to SQL Server as varbinary

I am trying to upload a binary.zip to SQL Server as varbinary type column content.
Target Table:
CREATE TABLE myTable ( zipFile varbinary(MAX) );
My NIFI Flow is very simple:
-> GetFile:
filter:binary.zip
-> UpdateAttribute:
sql.args.1.type = -3 # i.e. VARBINARY, per the JDBC types enumeration
sql.args.1.value = ??? # I don't know what to put here! (I've tried everything!)
sql.args.1.format = ??? # Is it required? I tried 'hex'
-> PutSQL:
SQL Statement = INSERT INTO myTable (zip_file) VALUES (?);
What should I put in sql.args.1.value?
I think it should be the flow file payload, but will that work as part of the INSERT in PutSQL? It hasn't so far!
Thanks!
SOLUTION UPDATE:
Based on https://issues.apache.org/jira/browse/NIFI-8052
(Note: I'm passing some of the data as flow file attributes.)
import java.nio.charset.StandardCharsets
import org.apache.nifi.controller.ControllerService
import groovy.sql.Sql
def flowFile = session.get()
if (!flowFile) return

def lookup = context.controllerServiceLookup
def dbServiceName = flowFile.getAttribute('DatabaseConnectionPoolName')
def tableName = flowFile.getAttribute('table_name')
def fieldName = flowFile.getAttribute('field_name')

// Look up the DBCP controller service by name and borrow a connection from it.
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find { cs -> lookup.getControllerServiceName(cs) == dbServiceName }
def conn = lookup.getControllerService(dbcpServiceId)?.getConnection()
def sql = new Sql(conn)

// Stream the flow file content straight into the varbinary parameter.
flowFile.read { rawIn ->
    def parms = [rawIn]
    sql.executeInsert("INSERT INTO " + tableName + " (date, " + fieldName + ") VALUES (CAST(GETDATE() AS date), ?)", parms)
}
conn?.close()

session.transfer(flowFile, REL_SUCCESS)
session.commit()
There may be a NiFi-native way to insert a blob; however, you can use ExecuteGroovyScript instead of UpdateAttribute and PutSQL:
Add an SQL.mydb parameter at the processor level and link it to the required DBCP pool.
Use the following script body:
def ff=session.get()
if(!ff)return
def statement = "INSERT INTO myTable (zip_file) VALUES (:p_zip_file)"
def params = [
p_zip_file: SQL.mydb.BLOB(ff.read()) //cast flow file content as BLOB sql type
]
SQL.mydb.executeInsert(params, statement) //committed automatically on flow file success
//transfer to success without changes
REL_SUCCESS << ff
Inside the script, SQL.mydb is a reference to a groovy.sql.Sql object.

sqlalchemy.orm.exc.UnmappedInstanceError: Class 'builtins.dict' is not mapped AND using marshmallow-sqlalchemy

I don't get it. I'm trying to start a brand new table in MS SQL Server 2012 with the following:
In SQL Server:
CREATE TABLE [dbo].[Inventory](
[Index_No] [bigint] IDENTITY(1,1) NOT NULL,
[Part_No] [varchar](150) NOT NULL,
[Shelf] [int] NOT NULL,
[Bin] [int] NOT NULL,
PRIMARY KEY CLUSTERED
(
[Index_No] ASC
),
UNIQUE NONCLUSTERED
(
[Part_No] ASC
)
)
GO
NOTE: This is a BRAND NEW TABLE! There is no data in it at all
Next, this is the Database.py file:
import pymssql
from sqlalchemy import (create_engine, Table, MetaData, select, Column, Integer, Float,
    String, text, func, desc, and_, or_, Date, insert)
from marshmallow_sqlalchemy import SQLAlchemyAutoSchema
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
USERNAME = "name"
PSSWD = "none_of_your_business"
SERVERNAME = "MYSERVER"
INSTANCENAME = r"\SQLSERVER2012"
DB = "Inventory"
engine = create_engine(f"mssql+pymssql://{USERNAME}:{PSSWD}@{SERVERNAME}{INSTANCENAME}/{DB}")
class Inventory(Base):
__tablename__ = "Inventory"
Index_No = Column('Index_No', Integer, primary_key=True, autoincrement=True)
Part_No = Column("Part_No", String, unique=True)
Shelf = Column("Shelf", Integer)
Bin = Column("Bin", Integer)
def __repr__(self):
return f'Drawing(Index_No={self.Index_No!r},Part_No={self.Part_No!r}, Shelf={self.Shelf!r}, ' \
f'Bin={self.Bin!r})'
class InventorySchema(SQLAlchemyAutoSchema):
class Meta:
model = Inventory
load_instance = True
It's also worth noting that I'm using SQLAlchemy 1.4.3, if that helps.
and in the main.py
import Database as db
db.Base.metadata.create_all(db.engine)
data_list = [{'Part_No': '123A', 'Shelf': 1, 'Bin': 5},
             {'Part_No': '456B', 'Shelf': 1, 'Bin': 7},
             {'Part_No': '789C', 'Shelf': 2, 'Bin': 1}]
with db.Session(db.engine, future=True) as session:
try:
session.add_all(data_list) #<--- FAILS HERE AND THROWS AN EXCEPTION
session.commit()
except Exception as e:
session.rollback()
print(f"Error! {e!r}")
raise
finally:
session.close()
Now, everything I've googled about "Class 'builtins.dict' is not mapped" points me to the marshmallow-sqlalchemy package, which I've tried, but I'm still getting the same error. I've also tried moving Base.metadata.create_all(engine) from Database.py into main.py, implementing an __init__ function in the Inventory class, and calling super().__init__(), none of which works.
So what's going on?? Why is it failing and is there a better solution to this problem?
Session.add_all() expects mapped ORM instances, not plain dictionaries, so try creating Inventory objects:
data_list = [
Inventory(Part_No='123A', Shelf=1, Bin=5),
Inventory(Part_No='456B', Shelf=1, Bin=7),
Inventory(Part_No='789C', Shelf=2, Bin=1)
]
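If you would rather keep the plain dictionaries, the Session can map them onto the model for you; a minimal sketch (assuming the Database.py module above) using bulk_insert_mappings:
import Database as db
from sqlalchemy.orm import Session

db.Base.metadata.create_all(db.engine)

data_list = [
    {'Part_No': '123A', 'Shelf': 1, 'Bin': 5},
    {'Part_No': '456B', 'Shelf': 1, 'Bin': 7},
    {'Part_No': '789C', 'Shelf': 2, 'Bin': 1},
]

with Session(db.engine) as session:
    # bulk_insert_mappings accepts plain dicts keyed by mapped attribute names,
    # so no Inventory() instances are needed; Index_No comes from the identity column.
    session.bulk_insert_mappings(db.Inventory, data_list)
    session.commit()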

Read error with spark.read against SQL Server table (via JDBC Connection)

I have a problem in Zeppelin when I try to create a DataFrame by reading directly from a SQL table. The problem is that I don't know how to read a SQL column of the geography type.
SQL table
This is the code that I am using, and the error that I obtain.
Create JDBC connection
import org.apache.spark.sql.SaveMode
import java.util.Properties
val jdbcHostname = "XX.XX.XX.XX"
val jdbcDatabase = "databasename"
val jdbcUsername = "user"
val jdbcPassword = "XXXXXXXX"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
Read from SQL
import spark.implicits._
val table = "tablename"
val postcode_polygons = spark.
read.
jdbc(jdbcUrl, table, connectionProperties)
Error
import spark.implicits._
table: String = Lookup.Postcode50m_Lookup
java.sql.SQLException: Unsupported type -158
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:233)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:289)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
Adding to thebluephantom's answer: Spark's JDBC reader cannot map the geography column (the "Unsupported type -158" in the error), so have you tried converting it to a string, as below, and then loading the table?
val jdbcDF = spark.read.format("jdbc")
  .option("dbtable", "(select toString(SData) as s_sdata, toString(CentroidSData) as s_centroidSdata from table) t")
  .option("user", "user_name")
  // ...plus the url, password and driver options as in the question
  .load()
This is the final solution in my case. The idea from moasifk is correct, but in my code I cannot use the function "toString", so I applied the same idea with a different syntax:
import spark.implicits._
val tablename = "Lookup.Postcode50m_Lookup"
val postcode_polygons = spark.
read.
jdbc(jdbcUrl, table=s"(select PostcodeNoSpaces, cast(SData as nvarchar(4000)) as SData from $tablename) as postcode_table", connectionProperties)
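If you prefer PySpark, the same cast-in-a-subquery idea works there as well; a minimal sketch with placeholder connection values:
query = """(select PostcodeNoSpaces, cast(SData as nvarchar(4000)) as SData
            from Lookup.Postcode50m_Lookup) as postcode_table"""

# The cast happens on the SQL Server side, so Spark's JDBC reader never sees
# the unsupported geography type.
postcode_polygons = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>;database=<db>")
    .option("dbtable", query)
    .option("user", "<user>")
    .option("password", "<pass>")
    .load())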

RevoScaleR RxSqlServerData rxImport uniqueidentifier column failed

I'm trying to import data from SQL Server, but I'm having issues importing a table that contains a uniqueidentifier column.
I'm using R Client 3.3.2.0 to query database.
Database table:
Code:
sqlConnString = "DRIVER=ODBC Driver 11 for SQL Server;SERVER=JDIMKO;DATABASE=Test;UID=sa;PWD=***;"
colClasses = c("id" = "integer", "ui" = "character")
sqlServerData <- RxSqlServerData(
sqlQuery = "select * from tbl1",
connectionString = sqlConnString, colClasses = colClasses)
custData = rxImport(sqlServerData)
Error:
Unhandled SQL data type!!!
Unhandled SQL data type!!!
Could not open data source.
Error in doTryCatch(return(expr), name, parentenv, handler) :
Could not open data source.
RxSqlServerData does not support the UNIQUEIDENTIFIER data type; you should convert it to varchar:
sqlServerData <- RxSqlServerData(
sqlQuery = "select id, CONVERT(VARCHAR(36), ui) ui from tbl1",
connectionString = sqlConnString, colClasses = colClasses)

Field.types not working when using dbWriteTable from R DBI package in SQL Server database

I'm using the DBI package along with the odbc package to connect to a SQL Server database. I'm trying to write a table with column types specified by the field.types argument. For some reason this isn't working, and R chooses its own data types when writing.
A reproducible example:
table <- data.frame(
col1 = 1:2,
col2 = c("a", "b")
)
con <- dbConnect(
odbc::odbc(),
dsn = "dsn",
UID = login,
PWD = password,
Port = 1433
)
dbWriteTable(
conn = con,
value = table,
name = "tableName",
row.names = FALSE,
field.types = c(
col1 = "varchar(50)",
col2 = "varchar(50)"
)
)
The result: a table called "tableName" with columns
[col1] [int] NULL,
[col2] [varchar](255) NULL
My questions:
How can I correct my example above, so that the column types on the database will be varchar(50) for both columns?
How can I use the field.types argument correctly for other examples?
What I would like to know is, what "types" should I use: do I need "int" or "integer" or "INT" (R is case sensitive so it might matter)? And then, where can I find a list of these data types? I have tried using dbDataType, but using the types this function returns doesn't work either. Or am I doing something else wrong?
Thanks in advance for any help.
I tried your example (slightly modified):
library(RMySQL)
library(lubridate)
conn = dbConnect(MySQL(), user='root', password='', dbname='DB', host='localhost')
table <- data.frame(
col1 = 1:2,
col2 = c("a", "b"),
col3 = c(now(), now()+seconds(1))
)
dbWriteTable(
conn = conn,
name='test_DB',
value = table,
row.names = FALSE,
field.types = c(
col1 = "varchar(50)",
col2 = "varchar(50)",
col3 = 'timestamp'
)
)
and it worked smoothly:
At least the following data types are accepted (from DBI documentation):
integer
numeric
logical for Boolean values (some backends may return an integer)
NA
character
factor (bound as character, with warning)
Date
POSIXct timestamps
POSIXlt timestamps
lists of raw for blobs (with NULL entries for SQL NULL values)
objects of type blob::blob
