Received an invalid column length from the bcp client in spark job - sql-server

I am playing around with Spark and want to store a data frame in a SQL database. It works, but not when saving a datetime column:
from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType, TimestampType, StructType, StructField, StringType
from datetime import datetime
...
spark = SparkSession.builder \
    ...
    .getOrCreate()

# Create DataFrame
rdd = spark.sparkContext.parallelize([
    Row(id=1, title='string1', created_at=datetime.now())
])
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("created_at", TimestampType(), True)
])
df = spark.createDataFrame(rdd, schema)
df.show()

try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Schema:
Error:
com.microsoft.sqlserver.jdbc.SQLServerException: Received an invalid column length from the bcp client for colid 3
From my understanding, the error states that datetime.now() has an invalid length. But how can that be if it is a standard datetime? Any ideas what the issue is?

There are problems with the code to create the dataframe. You are missing libraries. The code below creates the dataframe correctly.
#
# 1 - Make test dataframe
#

# libraries
from pyspark.sql import Row
from datetime import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType

# create rdd
rdd = spark.sparkContext.parallelize([Row(id=1, title='string1', created_at=datetime.now())])

# define structure
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("created_at", TimestampType(), True)
])

# create df
df = spark.createDataFrame(rdd, schema)

# show df
display(df)
The output is shown above. We need to create a table that matches the nullability and the data types of the dataframe.
The code below creates a table called stack_overflow.
-- drop table
drop table if exists stack_overflow
go

-- create table
create table stack_overflow
(
    id int not null,
    title varchar(100) not null,
    created_at datetime2 null
)
go

-- show data
select * from stack_overflow
go
Next, we need to define our connection properties.
#
# 2 - Set connection properties
#
server_name = "jdbc:sqlserver://svr4tips2030.database.windows.net"
database_name = "dbs4tips2020"
url = server_name + ";" + "databaseName=" + database_name + ";"
user_name = "jminer"
password = "<your password here>"
table_name = "stack_overflow"
Last, we want to execute the code to write the dataframe.
#
# 3 - Write test dataframe
#
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", user_name) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Executing a select query shows that the data was written correctly.
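If you want to verify from the Spark side as well, a minimal read-back sketch with the same connector (re-using url, table_name, user_name and password from above) could look like this:

check_df = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", table_name) \
    .option("user", user_name) \
    .option("password", password) \
    .load()
display(check_df)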
In short, look at the documentation for Spark SQL Types. I found out that datetime2 works nicely.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.types?view=spark-dotnet
One word of caution: this code does not handle datetime offsets, and there is no Spark data type that maps to SQL Server's datetimeoffset.
# Sample date time offset value
import pytz
from datetime import datetime, timezone, timedelta
user_timezone_setting = 'US/Pacific'
user_timezone = pytz.timezone(user_timezone_setting)
the_event = datetime.now()
localized_event = user_timezone.localize(the_event)
print(localized_event)
The code above creates a timezone-aware value with an explicit UTC offset (for US/Pacific, -08:00 or -07:00 depending on daylight saving time).
But once we put it into a dataframe, it loses the offset, since the value is converted to UTC. If the UTC offset is important, you will have to pass that information as a separate integer.
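A minimal sketch of that idea, assuming the localized_event value from the snippet above and the Row import from earlier:

# keep the UTC offset (in minutes) as an extra integer column,
# since Spark's TimestampType normalizes everything to UTC
offset_minutes = int(localized_event.utcoffset().total_seconds() // 60)  # e.g. -480 (PST) or -420 (PDT)
row = Row(id=1, title='string1',
          created_at=localized_event,          # stored as UTC by Spark
          utc_offset_minutes=offset_minutes)   # the offset travels separately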

The SQL Server datetime data type has a time range of 00:00:00 through 23:59:59.997, i.e. it only keeps about three fractional-second digits, while the output of datetime.now() carries microsecond precision, so it will not fit into a datetime column. You need to change the data type of the column in the SQL Server table to datetime2.
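To see the precision mismatch, a quick check (the printed value is just an example):

from datetime import datetime

now = datetime.now()
print(now)              # e.g. 2022-09-01 10:15:30.123456 -> six fractional digits (microseconds)
print(now.microsecond)  # SQL Server datetime keeps only ~3 fractional digits (rounded to .000/.003/.007); datetime2 keeps up to 7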

Related

NIFI - upload binary.zip to SQL Server as varbinary

I am trying to upload a binary.zip to SQL Server as varbinary type column content.
Target Table:
CREATE TABLE myTable ( zipFile varbinary(MAX) );
My NIFI Flow is very simple:
-> GetFile:
    filter: binary.zip
-> UpdateAttribute:
    sql.args.1.type = -3    # varbinary, according to the JDBC types enumeration
    sql.args.1.value = ???  # I don't know what to put here! (I've tried everything!)
    sql.args.1.format = ??? # Is it required? I tried 'hex'
-> PutSQL:
    SQL Statement = INSERT INTO myTable (zip_file) VALUES (?);
What should I put in sql.args.1.value?
I think it should be the flowfile payload, but would it work as part of the INSERT in PutSQL? It hasn't so far!
Thanks!
SOLUTION UPDATE:
Based on https://issues.apache.org/jira/browse/NIFI-8052
(Note that I'm also sending some data as attribute parameters.)
import java.nio.charset.StandardCharsets
import org.apache.nifi.controller.ControllerService
import groovy.sql.Sql

def flowFile = session.get()
if (!flowFile) return  // bail out early if there is no flow file

def lookup = context.controllerServiceLookup
def dbServiceName = flowFile.getAttribute('DatabaseConnectionPoolName')
def tableName = flowFile.getAttribute('table_name')
def fieldName = flowFile.getAttribute('field_name')
def dbcpServiceId = lookup.getControllerServiceIdentifiers(ControllerService).find
    { cs -> lookup.getControllerServiceName(cs) == dbServiceName }
def conn = lookup.getControllerService(dbcpServiceId)?.getConnection()
def sql = new Sql(conn)

// stream the flow file content straight into the parameterized insert
flowFile.read { rawIn ->
    def parms = [rawIn]
    sql.executeInsert "INSERT INTO " + tableName + " (date, " + fieldName + ") VALUES (CAST( GETDATE() AS Date ), ?)", parms
}
conn?.close()

session.transfer(flowFile, REL_SUCCESS)
session.commit()
Maybe there is a NiFi-native way to insert a blob; however, you could use ExecuteGroovyScript instead of UpdateAttribute and PutSQL.
Add a SQL.mydb parameter at the processor level and link it to the required DBCP pool.
Use the following script body:
def ff = session.get()
if (!ff) return

def statement = "INSERT INTO myTable (zip_file) VALUES (:p_zip_file)"
def params = [
    p_zip_file: SQL.mydb.BLOB(ff.read())  // cast flow file content as BLOB sql type
]
SQL.mydb.executeInsert(params, statement) // committed automatically on flow file success

// transfer to success without changes
REL_SUCCESS << ff
Inside the script, SQL.mydb is a reference to a groovy.sql.Sql object.

Cannot Insert into SQL using PySpark, but works in SQL

I have created a table below in SQL using the following:
CREATE TABLE [dbo].[Validation](
    [RuleId] [int] IDENTITY(1,1) NOT NULL,
    [AppId] [varchar](255) NOT NULL,
    [Date] [date] NOT NULL,
    [RuleName] [varchar](255) NOT NULL,
    [Value] [nvarchar](4000) NOT NULL
)
NOTE the identity key (RuleId)
When inserting values into the table in SQL as below, it works:
Note: I am not inserting the primary key, as it will auto-fill and increment on its own.
INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')
However, when I create a temp view on Databricks and run the same insert through PySpark as below:
%python
driver = <Driver>
url = "jdbc:sqlserver:<URL>"
database = "<db>"
table = "dbo.Validation"
user = "<user>"
password = "<pass>"

# import the data
remote_table = spark.read.format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("database", database) \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()

remote_table.createOrReplaceTempView("YOUR_TEMP_VIEW_NAMES")

sqlcontext.sql("INSERT INTO YOUR_TEMP_VIEW_NAMES VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
I get the error below:
AnalysisException: 'unknown requires that the data to be inserted have the same number of columns as the target table: target table has 5 column(s) but the inserted data has 4 column(s), including 0 partition column(s) having constant value(s).;'
Why does it work in SQL but not when passing the query through Databricks? How can I insert through PySpark without getting this error?
The most straightforward solution here is to use JDBC from a Scala cell, e.g.:
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
val jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "INSERT INTO dbo.Validation VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')"
stmt.execute(sql)
connection.close()
You could use pyodbc too, but the SQL Server ODBC drivers aren't installed by default, and the JDBC drivers are.
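For completeness, a rough pyodbc sketch of the same insert (the Microsoft ODBC driver would have to be installed on the cluster first; the server, database, secret scope and key names below are placeholders):

import pyodbc

jdbc_user = dbutils.secrets.get(scope = "kv", key = "sqluser")
jdbc_password = dbutils.secrets.get(scope = "kv", key = "sqlpassword")

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=xxxx.database.windows.net,1433;"
    "DATABASE=AdventureWorks;"
    "UID=" + jdbc_user + ";PWD=" + jdbc_password
)
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO dbo.Validation VALUES (?, ?, ?, ?)",
    "TestApp", "2020-05-15", "MemoryUsageAnomaly", "2300MB"
)
conn.commit()
conn.close()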
A Spark solution would be to create a view in SQL Server and insert against that, e.g.:
create view Validation2 as
select AppId,Date,RuleName,Value
from Validation
then
tableName = "Validation2"
df = spark.read.jdbc(url=jdbcUrl, table=tableName, properties=connectionProperties)
df.createOrReplaceTempView(tableName)
sqlContext.sql("INSERT INTO Validation2 VALUES ('TestApp','2020-05-15','MemoryUsageAnomaly','2300MB')")
If you want to encapsulate the Scala and call it from another language (like Python), you can use a Scala package cell, e.g.:
%scala
package example

import java.util.Properties
import java.sql.DriverManager

object JDBCFacade
{
  def runStatement(url: String, sql: String, userName: String, password: String): Unit =
  {
    val connection = DriverManager.getConnection(url, userName, password)
    val stmt = connection.createStatement()
    try
    {
      stmt.execute(sql)
    }
    finally
    {
      connection.close()
    }
  }
}
and then you can call it like this:
jdbcUsername = dbutils.secrets.get(scope = "kv", key = "sqluser")
jdbcPassword = dbutils.secrets.get(scope = "kv", key = "sqlpassword")
jdbcUrl = "jdbc:sqlserver://xxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
sql = "select 1 a into #foo from sys.objects"
sc._jvm.example.JDBCFacade.runStatement(jdbcUrl,sql, jdbcUsername, jdbcPassword)

Read error with spark.read against SQL Server table (via JDBC Connection)

I have a problem in Zeppelin when I try to create a dataframe by reading directly from a SQL table. The problem is that I don't know how to read a SQL column with the geography type.
SQL table
This is the code that I am using, and the error that I obtain.
Create JDBC connection
import org.apache.spark.sql.SaveMode
import java.util.Properties
val jdbcHostname = "XX.XX.XX.XX"
val jdbcDatabase = "databasename"
val jdbcUsername = "user"
val jdbcPassword = "XXXXXXXX"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
Read from SQL
import spark.implicits._
val table = "tablename"
val postcode_polygons = spark.
  read.
  jdbc(jdbcUrl, table, connectionProperties)
Error
import spark.implicits._
table: String = Lookup.Postcode50m_Lookup
java.sql.SQLException: Unsupported type -158
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:233)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:289)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
Adding to thebluephantom's answer: have you tried changing the type to string as below and loading the table?
val jdbcDF = spark.read.format("jdbc")
  .option("dbtable", "(select toString(SData) as s_sdata, toString(CentroidSData) as s_centroidSdata from table) t")
  .option("user", "user_name")
  // ... other connection options ...
  .load()
This is the final solution in my case. The idea from moasifk is correct, but in my code I cannot use the function "toString". I have applied the same idea with a different syntax.
import spark.implicits._
val tablename = "Lookup.Postcode50m_Lookup"
val postcode_polygons = spark.
  read.
  jdbc(jdbcUrl, table = s"(select PostcodeNoSpaces, cast(SData as nvarchar(4000)) as SData from $tablename) as postcode_table", connectionProperties)
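If you are working from PySpark instead of Scala, the same cast-to-nvarchar pushdown could look roughly like this (jdbcUrl, jdbcUsername and jdbcPassword are assumed to be defined as in the connection snippet above):

query = "(select PostcodeNoSpaces, cast(SData as nvarchar(4000)) as SData from Lookup.Postcode50m_Lookup) as postcode_table"

postcode_polygons = spark.read.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", query) \
    .option("user", jdbcUsername) \
    .option("password", jdbcPassword) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()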

com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'apply'

I am trying to use cross apply in an Azure Databricks notebook, and it throws the error "incorrect syntax near apply", while the same query works fine in SQL Server. Does the Azure Databricks notebook not support apply operators?
## Connection parameters ##
sqlserver = 'servername'
port = '1433'
database = 'dbname'
user = 'username'
pswd = "pwd"

query = "(select * from repair_detail a \
    cross apply (select top 1 * from order c where a.RO_NO=c.RO_NO) b \
    ) AS CustSales"

## Load Data Frame ##
df1 = spark.read \
    .option('user', user) \
    .option('password', pswd) \
    .jdbc('jdbc:sqlserver://' + sqlserver + ':' + port + ';database=' + database,
          query)

# Show the resulting DataFrame
df1.show()

Behavior of Dates when using SQLServer vs. odbc R packages to connect to the database

I am working on an application that tracks tasks and dates using a SQL database. My application currently utilizes the RSQLServer package to connect to the database, because I can get that working on Windows and Linux. I've noticed that it does not store dates consistently.
Here is my MWE (ish - database connections will have to be set up yourself).
library(DBI)
library(dplyr)
library(lubridate)

# --- Connect to the database via RSQLServer and odbc --------------------------
db <- "your SQL database"
server <- "your SQL server"

conn <- dbConnect(RSQLServer::SQLServer(), server = server, database = db,
                  properties = list(user = "", password = "",
                                    useNTLMv2 = TRUE, domain = "")
)
conn2 <- dbConnect(odbc::odbc(), dsn = "")

# --- Create the test table ----------------------------------------------------
dplyr::db_drop_table(conn, "TestTable")

if (!dbExistsTable(conn, "TestTable")) {
  TestProcessStr <- "
  CREATE TABLE TestTable(
    Process_ID INT NOT NULL IDENTITY(1,1),
    Start_Dt DATE NOT NULL,
    End_Dt DATE DEFAULT '9999-12-31',
    Comment VARCHAR(30),
    PRIMARY KEY( Process_ID )
  );"
  dbExecute(conn, TestProcessStr)
} else {
  message("TestTable exists")
}

# --- Write to test table using different connections --------------------------
rowadd <- data_frame(Start_Dt = Sys.Date(), End_Dt = Sys.Date() + months(3),
                     Comment = "SQLServer, Date as Date")
write_res <- dbWriteTable(conn, name = "TestTable",
                          value = rowadd, append = T)

# Convert all dates to character
rowadd <- rowadd %>% mutate_if(is.Date, as.character) %>%
  mutate(Comment = "SQLServer, Date as Char")
write_res <- dbWriteTable(conn, name = "TestTable",
                          value = rowadd, append = T)

rowadd <- data_frame(Start_Dt = Sys.Date(), End_Dt = Sys.Date() + months(3)) %>%
  mutate(Comment = "ODBC, Date as Date")
write_res <- dbWriteTable(conn2, name = "TestTable",
                          value = rowadd, append = T)

# Convert all dates to character
rowadd <- rowadd %>% mutate_if(is.Date, as.character) %>%
  mutate(Comment = "ODBC, Date as Character")
write_res <- dbWriteTable(conn2, name = "TestTable",
                          value = rowadd, append = T)

# View database status
ttab <- dbReadTable(conn, "TestTable")
ttab

# --- Disconnect ---------------------------------------------------------------
dbDisconnect(conn)
dbDisconnect(conn2)
This produces the following output on my machine:
> ttab
  Process_ID   Start_Dt     End_Dt                 Comment
1          1 2018-01-14 2018-04-14 SQLServer, Date as Date
2          2 2018-01-15 2018-04-15 SQLServer, Date as Char
3          3 2018-01-15 2018-04-15      ODBC, Date as Date
4          4 2018-01-15 2018-04-15 ODBC, Date as Character
Why does RSQLServer treat Date objects differently from dates-as-character objects? odbc::odbc() doesn't show the same behavior; is there something different about the Java-based methods used by RSQLServer?
Ideally, I'd like to not have to convert all dates to strings before each and every write operation. I can't switch to the odbc package because I can't get the DSN properly configured on Linux without getting my IT department involved.
