How to read binary column data into a Spark dataframe? - sql-server

I am trying to read a table from SQL Server which has a column travel_CDE of datatype binary(8).
This is what the source data looks like:
select * from sourcetable =>
location  type_code   family   travel_CDE
Asia      Landlocked  Terrain  0xD9F21933D5346766
Below is my read statement:
val dataframe = spark.read.format("jdbc")
.option("url", s"jdbc:sqlserver://url:port;DatabaseName=databasename")
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
.option("user", "username")
.option("password", "password")
.option("dbtable", tablename)
.option("partitionColumn", partitionColumn)
.option("numPartitions", numPartitions)
.option("lowerBound", 1)
.option("upperBound", upperBound)
.option("fetchsize", 100)
.load()
When I print the schema of the dataframe, I see that the column is read as binary type:
scala> dataframe.printSchema()
root
|-- location: string (nullable = true)
|-- type_code: string (nullable = true)
|-- family: string (nullable = true)
|-- travel_CDE: binary (nullable = true)
But when I look at the data in my Spark dataframe, the column travel_CDE is displayed in a different format.
Example:
scala> dataframe.select("travel_CDE").take(1)
res11: Array[org.apache.spark.sql.Row] = Array([[B#a4a0ce])
So I assumed Spark was reading the data in its own format; I took that column out and re-applied a schema with binary datatype as below.
import org.apache.spark.sql.types.{StructType, StructField, BinaryType}
val schema = StructType(Array(StructField("bintype", BinaryType, true)))
val bincolDF = dataframe.select("travel_CDE")
val bindColtypeDF = spark.createDataFrame(bincolDF.rdd, schema)
But even after applying BinaryType on that column, I still see the same format of data as earlier.
scala> bindColtypeDF.take(1)
res9: Array[org.apache.spark.sql.Row] = Array([[B#1d48ff1])
I am saving this dataframe into BigQuery and I see the same (wrong) format there as well.
Below is how I am doing it.
dataframe.write.format("bigquery").option("table", s"Bigquery_table_name").mode("overwrite").save()
Could anyone let me know what I should do so that Spark reads the source data in the same format as the source?
Is it even possible to do this while reading the data, or should I convert the column after reading the data into the dataframe?
Any help is much appreciated.
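For what it's worth, the [B#... text is just the JVM's default toString of a byte array, so the bytes themselves are read correctly; they only print unhelpfully. A minimal sketch (not from the original question) of rendering them as the 0x... hex literal SQL Server displays, using Spark's built-in hex function:

import org.apache.spark.sql.functions.{col, concat, hex, lit}

// hex() converts the binary bytes to their hexadecimal string form;
// the "0x" literal just mimics SQL Server's display convention
val readableDF = dataframe.withColumn(
  "travel_CDE_hex",
  concat(lit("0x"), hex(col("travel_CDE")))
)

readableDF.select("travel_CDE", "travel_CDE_hex").show(1, truncate = false)

Since travel_CDE_hex is an ordinary string column, it should also arrive in BigQuery as readable text.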

Related

Received an invalid column length from the bcp client in spark job

I am playing around with spark and wanted to store a data frame in a sql database. It works but not when saving a datetime column:
from pyspark.sql import SparkSession,Row
from pyspark.sql.types import IntegerType,TimestampType,StructType,StructField,StringType
from datetime import datetime
...
spark = SparkSession.builder \
...
.getOrCreate()
# Create DataFrame
rdd = spark.sparkContext.parallelize([
    Row(id=1, title='string1', created_at=datetime.now())
])
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("created_at", TimestampType(), True)
])
df = spark.createDataFrame(rdd, schema)
df.show()
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", username) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Schema:
Error:
com.microsoft.sqlserver.jdbc.SQLServerException: Received an invalid column length from the bcp client for colid 3
From my understanding the error states that datetime.now() has invalid length. But how can that be, if it is a standard datetime? Any ideas what the issue is?
There are problems with the code to create the dataframe. You are missing libraries. The code below creates the dataframe correctly.
#
# 1 - Make test dataframe
#
# libraries
from pyspark.sql import Row
from datetime import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, TimestampType
# create rdd
rdd = spark.sparkContext.parallelize([Row(id=1, title='string1', created_at=datetime.now())])
# define structure
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("title", StringType(), False),
    StructField("created_at", TimestampType(), True)
])
# create df
df = spark.createDataFrame(rdd, schema)
# show df
display(df)
The display(df) output confirms the schema. Next, we need to create a table that matches these data types and nullability settings.
The code below creates a table called stack_overflow.
-- drop table
drop table if exists stack_overflow
go
-- create table
create table stack_overflow
(
id int not null,
title varchar(100) not null,
created_at datetime2 null
)
go
-- show data
select * from stack_overflow
go
Next, we need to define our connection properties.
#
# 2 - Set connection properties
#
server_name = "jdbc:sqlserver://svr4tips2030.database.windows.net"
database_name = "dbs4tips2020"
url = server_name + ";" + "databaseName=" + database_name + ";"
user_name = "jminer"
password = "<your password here>"
table_name = "stack_overflow"
Last, we want to execute the code to write the dataframe.
#
# 3 - Write test dataframe
#
try:
    df.write \
        .format("com.microsoft.sqlserver.jdbc.spark") \
        .mode("overwrite") \
        .option("truncate", True) \
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
        .option("url", url) \
        .option("dbtable", table_name) \
        .option("user", user_name) \
        .option("password", password) \
        .save()
except ValueError as error:
    print("Connector write failed", error)
Executing a select query shows that the data was written correctly.
In short, look at the documentation for Spark SQL Types. I found out that datetime2 works nicely.
https://learn.microsoft.com/en-us/dotnet/api/microsoft.spark.sql.types?view=spark-dotnet
One word of caution: this code does not handle datetimeoffset. Also, there is no Spark data type that preserves the offset in such a mapping.
# Sample date time offset value
import pytz
from datetime import datetime, timezone, timedelta
user_timezone_setting = 'US/Pacific'
user_timezone = pytz.timezone(user_timezone_setting)
the_event = datetime.now()
localized_event = user_timezone.localize(the_event)
print(localized_event)
The code above creates a timezone-aware datetime value.
But once we put it into a dataframe, it loses the offset, since it is converted to UTC time. If the UTC offset is important, you will have to pass that information as a separate integer column.
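A rough Scala sketch of that idea (the column names are made up for illustration): keep the instant, which Spark normalizes to UTC, plus the offset in minutes as its own integer column.

import java.time.{ZoneId, ZonedDateTime}
import spark.implicits._

// capture the event in its local zone, then split it into a UTC timestamp
// plus the numeric offset (in minutes) that would otherwise be lost
val event = ZonedDateTime.now(ZoneId.of("US/Pacific"))
val offsetMinutes = event.getOffset.getTotalSeconds / 60

val eventsDF = Seq(
  (1, java.sql.Timestamp.from(event.toInstant), offsetMinutes)
).toDF("id", "created_at_utc", "utc_offset_minutes")

eventsDF.show(truncate = false)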
SQL Server's datetime datatype has a time range of 00:00:00 through 23:59:59.997, i.e. fractional seconds are limited to roughly 3 ms precision. The output of datetime.now() will not fit into datetime; you need to change the column datatype on the SQL Server table to datetime2.

Adding file names from array in dataframe column in spark scala

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val status = fs.listStatus(new Path("/home/Test/")).map(_.getPath().toString)
val df = spark.read.format("json").load(status : _*)
How to add the file name in a new column in df?
I tried:
val dfWithCol = df.withColumn("filename",input_file_name())
But it adds the same file name for all the records?
Can anyone suggest a better approach?
This is expected behaviour, because your json files have more than one record in them.
Spark adds the filename for each record; if you want to check all the unique filenames, do a distinct on the filename column:
//to get unique filenames
df.select("filename").distinct().show()
Example:
#source data
hadoop fs -cat /user/shu/json/*.json
{"id":1,"name":"a"}
{"id":1,"name":"a"}
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
val status = fs.listStatus(new Path("/user/shu/json")).map(_.getPath().toString)
val df = spark.read.format("json").load(status : _*)
df.withColumn("filename",input_file_name()).show(false)
//unique filenames for each record
+---+----+------------------------------------+
|id |name|filename                            |
+---+----+------------------------------------+
|1  |a   |hdfs://nn:8020/user/shu/json/i.json |
|1  |a   |hdfs://nn:8020/user/shu/json/i1.json|
+---+----+------------------------------------+
In the above example you can see unique filenames for each record (as I have one record in each json file).
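If only the bare file name is wanted rather than the full hdfs path, one option (not part of the original answer) is to strip everything up to the last slash with substring_index:

import org.apache.spark.sql.functions.{input_file_name, substring_index}

// keep only the text after the last "/" of the full path, e.g. "i.json"
df.withColumn("filename", substring_index(input_file_name(), "/", -1)).show(false)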

Read error with spark.read against SQL Server table (via JDBC Connection)

I have a problem in Zeppelin when I try to create a dataframe reading directly from a SQL table. The problem is that I don't know how to read a SQL column with the geography type.
SQL table
This is the code that I am using, and the error that I obtain.
Create JDBC connection
import org.apache.spark.sql.SaveMode
import java.util.Properties
val jdbcHostname = "XX.XX.XX.XX"
val jdbcDatabase = "databasename"
val jdbcUsername = "user"
val jdbcPassword = "XXXXXXXX"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
Read from SQL
import spark.implicits._
val table = "tablename"
val postcode_polygons = spark.
read.
jdbc(jdbcUrl, table, connectionProperties)
Error
import spark.implicits._
table: String = Lookup.Postcode50m_Lookup
java.sql.SQLException: Unsupported type -158
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:233)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$8.apply(JdbcUtils.scala:290)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:289)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:64)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:114)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:307)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:193)
Adding to thebluephantom's answer: have you tried changing the type to string as below and loading the table?
val jdbcDF = spark.read.format("jdbc")
.option("dbtable" -> "(select toString(SData) as s_sdata,toString(CentroidSData) as s_centroidSdata from table) t")
.option("user", "user_name")
.option("other options")
.load()
This is the final solution in my case. The idea from moasifk is correct, but in my code I cannot use the function "toString", so I have applied the same idea with a different syntax.
import spark.implicits._
val tablename = "Lookup.Postcode50m_Lookup"
val postcode_polygons = spark.
read.
jdbc(jdbcUrl, table=s"(select PostcodeNoSpaces, cast(SData as nvarchar(4000)) as SData from $tablename) as postcode_table", connectionProperties)

Spark: how to change dataframe Array[String] to RDD[Array[String]]

I have transactions as a DataFrame array<string>:
transactions: org.apache.spark.sql.DataFrame = [collect_set(b): array<string>]
I want to change it to RDD[Array[string]], but when I am changing it to RDD, it's getting changed to org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]:
val sam: RDD[Array[String]] = transactions.rdd
<console>:42: error: type mismatch;
found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
required: org.apache.spark.rdd.RDD[Array[String]]
val sam: RDD[Array[String]] = transactions.rdd
transactions.rdd will return RDD[Row], as the error message says.
You can manually convert each Row to an Array:
val sam = transactions.rdd.map(x => x.getList(0).toArray.map(_.toString))
In more Spark 2.0 style it would be:
val sam = transactions.select("columnName").as[Array[String]].rdd
Replace columnName with the proper column name from the DataFrame - you should probably rename collect_set(b) to a more user-friendly name.
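For illustration, assuming the aggregate column is renamed to items first (the name is arbitrary), the whole round trip could look like this:

import org.apache.spark.rdd.RDD
import spark.implicits._ // supplies the Encoder[Array[String]] needed by .as[...]

// rename collect_set(b), then turn the single-column Dataset into an RDD[Array[String]]
val renamed = transactions.withColumnRenamed("collect_set(b)", "items")
val sam: RDD[Array[String]] = renamed.select("items").as[Array[String]].rdd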
A DataFrame is essentially a collection of Row objects, so collecting a dataframe gives you an Array[Row], and converting it to an RDD gives you an RDD[Row].
So if you want RDD[Array[String]] you can do it this way:
val sam = transactions.rdd.map(x => x.toString().stripPrefix("[").stripSuffix("]").split(fieldSeperator))

Primary keys with Apache Spark

I have a JDBC connection between Apache Spark and PostgreSQL, and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need is unique numbers you can use zipWithUniqueId and recreate the DataFrame. First, some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType
row = Row("foo", "bar")
row_with_index = Row(*["id"] + df.columns)
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

df_with_pk = (df.rdd
              .zipWithUniqueId()
              .map(lambda x: f(*x))
              .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive numbers you can replace zipWithUniqueId with zipWithIndex, but it is a little bit more expensive.
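For comparison, a minimal variant of the Scala snippet above using zipWithIndex, reusing the df, schema and imports already defined (the extra cost comes from the additional Spark job it runs to count the elements per partition):

// same rows as before, but with consecutive ids starting at 0
val rowsWithIndex = df.rdd.zipWithIndex.map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}

val dfWithConsecutivePK = sqlContext.createDataFrame(
  rowsWithIndex, StructType(StructField("id", LongType, false) +: schema.fields))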
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I had missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
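If such a partitioning column does exist, a Scala sketch of the partitioned variant (reusing the foo/bar columns of the dummy data above, and the newer row_number name) would look like this; note that the ids restart within each foo value, so they identify a row only in combination with the partition key:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// partitioning the window avoids moving everything to a single partition,
// but the generated ids are only unique within each "foo" group
val w = Window.partitionBy("foo").orderBy("bar")
df.withColumn("id", row_number().over(w)).show()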
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the 2nd argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desiring consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
from pyspark.sql.types import StructType, StructField, IntegerType

# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer
# First create a new schema with uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)
# zip with the index, map it to a dictionary which includes the new field
df = dfNoIndex.rdd.zipWithIndex()\
    .map(lambda row_id: {k: v
                         for k, v
                         in list(row_id[0].asDict().items()) + [("uuid", row_id[1])]})\
    .toDF(newSchema)
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf
unique_id_sub_cols = ["a", "b", "c"]
df = df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
                    sf.col(sub_col)
                )
                for sub_col in unique_id_sub_cols
            ]
        )
    ),
)
