Spark streaming nested execution serialization issues - database

I am trying to connect to a DB2 database in a Spark Streaming application, and the database query execution is causing "org.apache.spark.SparkException: Task not serializable" errors. Please advise. Below is the sample code I have for reference.
dataLines.foreachRDD { rdd =>
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  val dataRows = rdd.map(rs => rs.value).map(row =>
    row.split(",")(1) -> (row.split(",")(0), row.split(",")(1), row.split(",")(2),
      "cvflds_" + row.split(",")(3).toLowerCase, row.split(",")(4), row.split(",")(5), row.split(",")(6))
  )
  val db2Conn = getDB2Connection(spark, db2ConParams)
  dataRows.foreach { case (k, v) =>
    val table = v._4
    val dbQuery = s"(SELECT * FROM $table ) tblResult"
    val df = getTableData(db2Conn, dbQuery)
    df.show(2)
  }
}
Below are the other functions used:
private def getDB2Connection(spark: SparkSession, db2ConParams: scala.collection.immutable.Map[String, String]): DataFrameReader = {
  spark.read.format("jdbc").options(db2ConParams)
}

private def getTableData(db2Con: DataFrameReader, tableName: String): DataFrame = {
  db2Con.option("dbtable", tableName).load()
}
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _
  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}
Below is the error log:
2018-03-28 22:12:21,487 [JobScheduler] ERROR org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1522289540000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:915)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:915)
at ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1.apply(DB2DataLoadToKudu.scala:139)
at ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1.apply(DB2DataLoadToKudu.scala:128)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.DataFrameReader
Serialization stack:
- object not serializable (class: org.apache.spark.sql.DataFrameReader, value: org.apache.spark.sql.DataFrameReader@15fdb01)
- field (class: ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1$$anonfun$apply$2, name: db2Conn$1, type: class org.apache.spark.sql.DataFrameReader)
- object (class ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1$$anonfun$apply$2, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 30 more

Ideally you should keep the closure in dataRows.foreach clear of any connection objects, since the closure is meant to be serialized to the executors and run there. This concept is covered in depth in the official Spark programming guide.
In your case, the line below inside that closure is causing the issue:
val df=getTableData(db2Conn,dbQuery)
So, instead of using Spark to load the DB2 table, which in your case (after combining the methods) becomes:
spark.read.format("jdbc").options(db2ConParams).option("dbtable",tableName).load()
use plain JDBC in the closure to achieve this. You can use db2ConParams in the JDBC code (I assume it is a simple Map and therefore serializable). The documentation also suggests using rdd.foreachPartition and a connection pool to further optimize this.
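As an illustration, here is a minimal sketch of that approach using foreachPartition so one connection is opened per partition. It assumes db2ConParams contains plain "url", "user", "password" (and optionally "driver") entries; the key names and the row handling are assumptions, so adapt them to your setup:
import java.sql.DriverManager

dataRows.foreachPartition { partition =>
  // Load the DB2 JDBC driver explicitly if it is not auto-registered (the "driver" key is an assumption).
  // Class.forName(db2ConParams("driver"))
  val conn = DriverManager.getConnection(
    db2ConParams("url"), db2ConParams("user"), db2ConParams("password"))
  try {
    partition.foreach { case (k, v) =>
      val table = v._4
      val stmt = conn.prepareStatement(s"SELECT * FROM $table")
      val rs = stmt.executeQuery()
      try {
        // Do here whatever df.show(2) stood for, e.g. read and process a few rows.
        while (rs.next()) { /* process the row */ }
      } finally {
        rs.close()
        stmt.close()
      }
    }
  } finally {
    conn.close()
  }
}
Only db2ConParams (a serializable Map) is captured by this closure, so the Task not serializable error should no longer occur.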
You have not mentioned what you are doing with the table data other than df.show(2). If the tables are huge, you may want to describe your use case in more detail; you might need to consider a different design in that case.

Related

java.lang.IllegalStateException: The Kryo Output still contains data from a previous serialize call on Flink KeyedProcessFunction

I am using a KeyedProcessFunction on Flink 1.16.0 with a state defined as
private lazy val state: ValueState[Feature] = {
  val stateDescriptor = new ValueStateDescriptor[Feature]("CollectFeatureProcessState", createTypeInformation[Feature])
  getRuntimeContext.getState(stateDescriptor)
}
which is used in my process function as follows:
override def processElement(value: Feature, ctx: KeyedProcessFunction[String, Feature, Feature]#Context, out: Collector[Feature]): Unit = {
  val current: Feature = state.value match {
    case null => value
    case exists => combine(value, exists)
  }
  if (checkForCompleteness(current)) {
    out.collect(current)
    state.clear()
  } else {
    state.update(current)
  }
}
Feature is a protobuf class that I registered with Kryo as follows (using chill-protobuf 0.7.6):
env.getConfig.registerTypeWithKryoSerializer(classOf[Feature], classOf[ProtobufSerializer])
Within the first few seconds of running the app, I get this exception:
2023-02-07 09:17:04,246 WARN org.apache.flink.runtime.taskmanager.Task [] - KeyedProcess -> (Map -> Sink: signalSink, Map -> Flat Map -> Sink: FeatureSink, Sink: logsink) (2/2)#0 (fa4aae8fb7d2a7a94eafb36fe5470851_6760a9723a5626620871f040128bad1b_1_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkRuntimeException: Error while adding data to RocksDB
at org.apache.flink.contrib.streaming.state.RocksDBValueState.update(RocksDBValueState.java:109)
at com.grab.grabdefence.acorn.app.functions.stream.CollectFeatureProcessFunction$.processElement(CollectFeatureProcessFunction.scala:69)
at com.grab.grabdefence.acorn.app.functions.stream.CollectFeatureProcessFunction$.processElement(CollectFeatureProcessFunction.scala:18)
at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: The Kryo Output still contains data from a previous serialize call. It has to be flushed or cleared at the end of the serialize call.
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:358)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.serializeValueInternal(AbstractRocksDBState.java:158)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.serializeValue(AbstractRocksDBState.java:180)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.serializeValue(AbstractRocksDBState.java:168)
at org.apache.flink.contrib.streaming.state.RocksDBValueState.update(RocksDBValueState.java:107)
... 16 more
I checked KryoSerializer.serialize and I do not understand why this exception is thrown: AbstractRocksDBState.serializeValue always calls clear() before passing the DataOutputView to the KryoSerializer, so it baffles me how output.position() != 0 could ever be true at the beginning of a serialization.

I am getting an error in Kapt Debug Kotlin. I have updated the versions of the dependencies in the Gradle file, but am still facing this issue

My app was running smoothly, but now I am getting an error in Kapt Debug Kotlin. I have updated the versions of the dependencies in the Gradle file but am still facing this issue. How can it be resolved? I saw somewhere that I should check my Room database, DAO and data class, but I am still not able to figure out what the issue is.
The error points to this file.
ROOM DATABASE
@Database(entities = [Transaction::class], version = 1, exportSchema = false)
abstract class MoneyDatabase : RoomDatabase() {
    abstract fun transactionListDao(): transactionDetailDao

    companion object {
        // Singleton prevents multiple instances of the database opening at the
        // same time.
        @Volatile
        private var INSTANCE: MoneyDatabase? = null

        fun getDatabase(context: Context): MoneyDatabase {
            // if the INSTANCE is not null, then return it,
            // if it is, then create the database
            return INSTANCE ?: synchronized(this) {
                val instance = Room.databaseBuilder(
                    context.applicationContext,
                    MoneyDatabase::class.java,
                    "transaction_database"
                ).build()
                INSTANCE = instance
                // return instance
                instance
            }
        }
    }
}
DAO
@Dao
interface transactionDetailDao {
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    suspend fun insert(transaction: Transaction)

    @Delete
    suspend fun delete(transaction: Transaction)

    @Update
    suspend fun update(transaction: Transaction)

    @Query("SELECT * FROM transaction_table ORDER BY id ASC")
    fun getalltransaction(): LiveData<List<Transaction>>
}
DATA CLASS
enum class Transaction_type() {
    Cash, debit, Credit
}

enum class Type() {
    Income, Expense
}

@Entity(tableName = "transaction_table")
data class Transaction(
    val name: String,
    val amount: Float,
    val day: Int,
    val month: Int,
    val year: Int,
    val comment: String,
    val datePicker: String,
    val transaction_type: String,
    val category: String,
    val recurring_from: String,
    val recurring_to: String
) {
    @PrimaryKey(autoGenerate = true) var id: Long = 0
}
The error is resolved. I was using Kotlin version 1.6.0 and downgraded it to 1.4.32. As far as I understood, the latest version of Kotlin does not work well with Room and coroutines.
I believe that your issue is due to an incorrect class inadvertently being used, a likely culprit being Transaction, as that is also a Room class.
Perhaps in transactionDetailDao (although it might be elsewhere).
See if you have an import of androidx.room.Transaction (or any other import ending in Transaction).
If so, delete or comment out that import.
As an example, the error appears with that import in place and goes away once the import is commented out.
I imported the project from GitHub and had a play; the issue definitely appears to be with coroutines. I commented out the suspend modifiers in the DAO:
@Dao
interface transactionDetailDao {
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    /*suspend*/ fun insert(transaction: Transaction)

    @Delete
    /*suspend*/ fun delete(transaction: Transaction)

    @Update
    /*suspend*/ fun update(transaction: Transaction)

    @Query("SELECT * FROM transaction_table ORDER BY id ASC")
    fun getalltransaction(): LiveData<List<Transaction>>
}
It then compiled OK and ran.

super confused with table and dataset or datastream conversion

I am using Flink 1.12, and I am super confused about when Table and DataSet/DataStream conversions can be performed.
In the following code, I want to print the table content to the console, and I tried the following three ways, all of which throw exceptions:
table.toDataSet[Row].print()
table.toAppendStream[Row].print()
table.print()
I would like to ask how to print the table content to the console, e.g. using the print method.
import org.apache.flink.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.table.api.{DataTypes, EnvironmentSettings, TableEnvironment, TableResult}
import org.apache.flink.table.descriptors.{Csv, FileSystem, Schema}
import org.apache.flink.types.Row

object Sql021_PlannerOldBatchTest {
  def main(args: Array[String]): Unit = {
    val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
    val env = TableEnvironment.create(settings)
    val fmt = new Csv().fieldDelimiter(',').deriveSchema()
    val schema = new Schema()
      .field("a", DataTypes.STRING())
      .field("b", DataTypes.STRING())
      .field("c", DataTypes.DOUBLE())
    env.connect(new FileSystem().path("D:/stock.csv")).withSchema(schema).withFormat(fmt).createTemporaryTable("sourceTable")
    val table = env.sqlQuery("select * from sourceTable")

    // ERROR: Only tables that originate from Scala DataSets can be converted to Scala DataSets.
    // table.toDataSet[Row].print()

    // ERROR: Only tables that originate from Scala DataStreams can be converted to Scala DataStreams.
    table.toAppendStream[Row].print()

    // ERROR: table doesn't have the print method
    // table.print()
  }
}
In the streaming case, this will work
tenv.toAppendStream(table, classOf[Row]).print()
env.execute()
and in the batch case you can do:
val tableResult: TableResult = env.executeSql("select * from sourceTable")
tableResult.print()
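For reference, here is a minimal sketch of the streaming setup that the first snippet assumes, where tenv is a StreamTableEnvironment created from a StreamExecutionEnvironment; the source registration itself is omitted and only assumed here:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.types.Row

// Streaming case: Table <-> DataStream conversion requires a StreamTableEnvironment.
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tenv = StreamTableEnvironment.create(env)
// ... register a streaming source table named "sourceTable" on tenv here ...
val table = tenv.sqlQuery("select * from sourceTable")
tenv.toAppendStream[Row](table).print() // convert back to a DataStream, then print
env.execute()
A table created on a plain batch TableEnvironment (as in the question) cannot be converted this way, which is why the toDataSet/toAppendStream calls fail there; in batch mode, executeSql(...).print() is the direct route to the console.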

Spark Scala Typesafe config: safely iterate over the values of a specific column name

I have found a similar post on Stack Overflow. However, I could not solve my issue, which is why I am writing this post.
Aim
The aim is to perform a column projection [projection = filter columns] while loading a SQL table (I use SQL Server).
According to the Scala Cookbook, this is the way to filter columns [using an Array]:
sqlContext.read.jdbc(url,"person",Array("gender='M'"),prop)
However, I do not want to hardcode Array("col1", "col2", ...) inside my Scala code; this is why I am using a Typesafe config file (see below).
Config file
dataset {
  type = sql
  sql {
    url = "jdbc://host:port:user:name:password"
    tablename = "ClientShampooBusinesLimited"
    driver = "driver"
    other = "i have a lot of other single string elements in the config file..."
    columnList = [
      {
        colname = "id"
        colAlias = "identifient"
      }
      {
        colname = "name"
        colAlias = "nom client"
      }
      {
        colname = "age"
        colAlias = "âge client"
      }
    ]
  }
}
Let's focus on 'columnList': the name of the SQL column corresponds exactly to 'colname'. 'colAlias' is a field that I will use later.
data.scala file
lazy val columnList = configFromFile.getList("dataset.sql.columnList")
lazy val dbUrl = configFromFile.getString("dataset.sql.url")
lazy val DbTableName = configFromFile.getString("dataset.sql.tablename")
lazy val DriverName = configFromFile.getString("dataset.sql.driver")
configFromFile is created by myself in another custom class, but this does not matter. The type of columnList is ConfigList; this type comes from Typesafe.
main file
def loadDataSQL(): DataFrame = {
  val url = datasetConfig.dbUrl
  val dbTablename = datasetConfig.DbTableName
  val dbDriver = datasetConfig.DriverName
  val columns = // I need help to solve this

  /* EDIT 2 March 2017
     This code should not be used. Have a look at the accepted answer.
  */
  sparkSession.read.format("jdbc").options(
    Map("url" -> url,
      "dbtable" -> dbTablename,
      "predicates" -> columns,
      "driver" -> dbDriver))
    .load()
}
So my whole issue is to extract the 'colname' values and put them into a suitable array. Can someone help me write the right-hand side of 'val columns'?
Thanks
If you're looking for a way to read the list of colname values into a Scala Array - I think this does it:
import scala.collection.JavaConverters._
val columnList = configFromFile.getConfigList("dataset.sql.columnList")
val colNames: Array[String] = columnList.asScala.map(_.getString("colname")).toArray
With the supplied file this would result in Array(id, name, age)
EDIT:
As to your actual goal, I don't actually know of any option named predicates (nor can I find evidence for one in the sources, using Spark 2.0.2).
The JDBC data source performs "projection pushdown" based on the actual columns selected in the query used. In other words, only the selected columns will be read from the DB, so you can use the colNames array in a select immediately following the DataFrame creation, e.g.:
import org.apache.spark.sql.functions._

sparkSession.read
  .format("jdbc")
  .options(Map("url" -> url, "dbtable" -> dbTablename, "driver" -> dbDriver))
  .load()
  .select(colNames.map(col): _*) // selecting only desired columns
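Putting both pieces together inside your loadDataSQL, a rough sketch could look like this; configFromFile, sparkSession and the connection values are assumed to be in scope exactly as in your own code:
import scala.collection.JavaConverters._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def loadDataSQL(): DataFrame = {
  val url = datasetConfig.dbUrl
  val dbTablename = datasetConfig.DbTableName
  val dbDriver = datasetConfig.DriverName

  // Extract the colname values from the config list.
  val colNames: Array[String] = configFromFile
    .getConfigList("dataset.sql.columnList").asScala
    .map(_.getString("colname"))
    .toArray

  sparkSession.read
    .format("jdbc")
    .options(Map("url" -> url, "dbtable" -> dbTablename, "driver" -> dbDriver))
    .load()
    .select(colNames.map(col): _*) // only the configured columns are read (projection pushdown)
}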

Scala, Slick connect to MSSQL server

I am trying to connect to an MSSQL database using the Slick framework. The following code shows my first attempt, but I can't figure out what is wrong.
This error occurs when leaving it as shown below:
[1] value create is not a member of scala.slick.lifted.DDL
Now I delete that line, because I do not necessarily need to create the table within my Scala code. But then another error arises:
[2] value map is not a member of object asd.asd.App.Coffees
package asd.asd

import scala.slick.driver.SQLServerDriver._
import scala.slick.session.Database.threadLocalSession

object App {

  object Coffees extends Table[(String, Int, Double)]("COFFEES") {
    def name = column[String]("COF_NAME", O.PrimaryKey)
    def supID = column[Int]("SUP_ID")
    def price = column[Double]("PRICE")
    def * = name ~ supID ~ price
  }

  def main(args: Array[String]) {
    println("Hello World!")
    val db = slick.session.Database.forURL(url = "jdbc:jtds:sqlserver", user = "test", password = "test", driver = "scala.slick.driver.SQLServerDriver")
    db withSession {
      Coffees.ddl.create // [1]
      // Coffees.insertAll(
      //   ("Colombian", 101, 7.99),
      //   ("Colombian_Decaf", 101, 8.99),
      //   ("French_Roast_Decaf", 49, 9.99)
      // )
      val q = for {
        c <- Coffees // [2]
      } yield (c.name, c.price, c.supID)
      println(q.selectStatement)
      q.foreach { case (n, p, s) => println(n + ": " + p) }
    }
  }
}
Problem solved. What I did was the following: update to the latest Slick version and adjust the code to the new API. Afterwards you need to exchange the line
import scala.slick.driver.H2Driver.simple._
with
import scala.slick.driver.SQLServerDriver.simple._
and modify the connection string to:
[...]
Database.forURL("jdbc:jtds:sqlserver://localhost:1433/<DB>;instance=<INSTANCE>", driver = "scala.slick.driver.SQLServerDriver") withSession {
[...]
After this worked out, I decided to use a c3p0 pooled connection (which makes Slick very much faster and thus actually usable; I highly recommend using connection pooling!). This left me with the following database object.
package utils

import scala.slick.driver.SQLServerDriver.simple._
import com.mchange.v2.c3p0.ComboPooledDataSource

object DatabaseUtils {

  private val ds = new ComboPooledDataSource
  ds.setDriverClass("scala.slick.driver.SQLServerDriver")
  ds.setUser("supervisor")
  ds.setPassword("password1")
  ds.setMaxPoolSize(20)
  ds.setMinPoolSize(3)
  ds.setTestConnectionOnCheckin(true)
  ds.setIdleConnectionTestPeriod(300)
  ds.setMaxIdleTimeExcessConnections(240)
  ds.setAcquireIncrement(1)
  ds.setJdbcUrl("jdbc:jtds:sqlserver://localhost:1433/db_test;instance=SQLEXPRESS")
  ds.setPreferredTestQuery("SELECT 1")

  private val _database = Database.forDataSource(ds)

  def database = _database
}
You can use this as displayed below:
DatabaseUtils.database withSession { implicit session =>
  [...]
}
Lastly, the Maven dependencies for c3p0 and the latest Slick version:
<dependency>
    <groupId>com.typesafe.slick</groupId>
    <artifactId>slick_2.10</artifactId>
    <version>2.0.0-M3</version>
</dependency>
<dependency>
    <groupId>com.mchange</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.2.1</version>
</dependency>
