java.lang.IllegalStateException: The Kryo Output still contains data from a previous serialize call on Flink KeyedProcessFunction - apache-flink

I am using a KeyedProcessFunction on Flink 1.16.0 with a keyed ValueState declared as
private lazy val state: ValueState[Feature] = {
  val stateDescriptor = new ValueStateDescriptor[Feature]("CollectFeatureProcessState", createTypeInformation[Feature])
  getRuntimeContext.getState(stateDescriptor)
}
which is used in my process function as follows
override def processElement(value: Feature, ctx: KeyedProcessFunction[String, Feature, Feature]#Context, out: Collector[Feature]): Unit = {
  val current: Feature = state.value match {
    case null   => value
    case exists => combine(value, exists)
  }
  if (checkForCompleteness(current)) {
    out.collect(current)
    state.clear()
  } else {
    state.update(current)
  }
}
Feature is a protobuf class that I registered with Kryo as follows (using chill-protobuf 0.7.6):
env.getConfig.registerTypeWithKryoSerializer(classOf[Feature], classOf[ProtobufSerializer])
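(For completeness: Flink's ExecutionConfig also offers addDefaultKryoSerializer, which, unlike registerTypeWithKryoSerializer, applies the serializer to subclasses as well; a one-line sketch, in case the registration style turns out to matter here:)
// Alternative registration: attaches ProtobufSerializer to Feature and its subclasses at Kryo.
env.getConfig.addDefaultKryoSerializer(classOf[Feature], classOf[ProtobufSerializer])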
Within the first few seconds of running the app, I get this exception:
2023-02-07 09:17:04,246 WARN org.apache.flink.runtime.taskmanager.Task [] - KeyedProcess -> (Map -> Sink: signalSink, Map -> Flat Map -> Sink: FeatureSink, Sink: logsink) (2/2)#0 (fa4aae8fb7d2a7a94eafb36fe5470851_6760a9723a5626620871f040128bad1b_1_0) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkRuntimeException: Error while adding data to RocksDB
at org.apache.flink.contrib.streaming.state.RocksDBValueState.update(RocksDBValueState.java:109)
at com.grab.grabdefence.acorn.app.functions.stream.CollectFeatureProcessFunction$.processElement(CollectFeatureProcessFunction.scala:69)
at com.grab.grabdefence.acorn.app.functions.stream.CollectFeatureProcessFunction$.processElement(CollectFeatureProcessFunction.scala:18)
at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:542)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:231)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:831)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:780)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:935)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:914)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:728)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalStateException: The Kryo Output still contains data from a previous serialize call. It has to be flushed or cleared at the end of the serialize call.
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:358)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.serializeValueInternal(AbstractRocksDBState.java:158)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.serializeValue(AbstractRocksDBState.java:180)
at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.serializeValue(AbstractRocksDBState.java:168)
at org.apache.flink.contrib.streaming.state.RocksDBValueState.update(RocksDBValueState.java:107)
... 16 more
I checked KryoSerializer.serialize and I do not understand why this exception is thrown: AbstractRocksDBState.serializeValue always calls clear() before passing the DataOutputView to the KryoSerializer, so it baffles me how output.position() != 0 could ever be true at the beginning of a serialization.
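For reference, the serializer itself can be exercised outside Flink with a plain Kryo round trip; a minimal sketch (someFeature stands for any populated Feature instance, and the buffer sizes are arbitrary):
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.{Input, Output}
import com.twitter.chill.protobuf.ProtobufSerializer

// Round-trip a Feature through the same chill-protobuf serializer, outside of Flink/RocksDB.
val kryo = new Kryo()
kryo.register(classOf[Feature], new ProtobufSerializer())

val output = new Output(4096, -1) // growable buffer
kryo.writeClassAndObject(output, someFeature)
output.flush()

val input = new Input(output.toBytes)
val copy = kryo.readClassAndObject(input).asInstanceOf[Feature]
assert(copy == someFeature)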

Related

RESTEASY002020: Unhandled asynchronous exception with Quarkus 1.0 Final and RESTEasy JAX-RS Resource

I have this code, which performs parallel execution of a function that makes HTTP calls. It is located inside a @Singleton service that is called from the RESTEasy JAX-RS resource:
final Flowable<Map<String, List<Data>>> relatedMaps = Flowable.range(0, requestList.size())
    .concatMapEager(index ->
            fetchByHttp(requestList.get(index))
                .subscribeOn(Schedulers.io())
                .toFlowable(),
        requestList.size(),
        1
    );
where fetchByHttp is:
Single fetchByHttp(request) {
    return Single.fromCallable( () -> {
        ...restClient.getData(request)
        ...
requestList.size() is about 100 or less.
Sometimes I get this issue:
14:28:35 ERROR [or.jb.re.re.i18n] (RxCachedThreadScheduler-232) RESTEASY002020: Unhandled asynchronous exception, sending back 500: java.lang.NullPointerException
at org.jboss.resteasy.core.ServerResponseWriter.writeNomapResponse(ServerResponseWriter.java:91)
at org.jboss.resteasy.core.AsyncResponseConsumer.sendBuiltResponse(AsyncResponseConsumer.java:148)
at org.jboss.resteasy.core.AsyncResponseConsumer.internalResume(AsyncResponseConsumer.java:115)
at org.jboss.resteasy.core.AsyncResponseConsumer$CompletionStageResponseConsumer.accept(AsyncResponseConsumer.java:237)
at org.jboss.resteasy.core.AsyncResponseConsumer$CompletionStageResponseConsumer.accept(AsyncResponseConsumer.java:216)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
at io.reactivex.internal.observers.ConsumerSingleObserver.onSuccess(ConsumerSingleObserver.java:62)
at io.smallrye.context.propagators.rxjava2.ContextPropagatorOnSingleCreateAction$ContextCapturerSingle.lambda$onSuccess$2(ContextPropagatorOnSingleCreateAction.java:50)
at io.smallrye.context.SmallRyeThreadContext.lambda$withContext$0(SmallRyeThreadContext.java:215)
at io.smallrye.context.propagators.rxjava2.ContextPropagatorOnSingleCreateAction$ContextCapturerSingle.onSuccess(ContextPropagatorOnSingleCreateAction.java:50)
at io.smallrye.context.propagators.rxjava2.ContextPropagatorOnSingleCreateAction$ContextCapturerSingle.lambda$onSuccess$2(ContextPropagatorOnSingleCreateAction.java:50)
at io.smallrye.context.SmallRyeThreadContext.lambda$withContext$0(SmallRyeThreadContext.java:215)
On the resource side:
@Timeout(20000)
@GET
@Path("/data")
@Produces(MediaType.APPLICATION_JSON)
public CompletionStage<Data> getData(){
    final Single<Data> dataSingle = service.getData();
    final CompletableFuture<Data> dataFuture = new CompletableFuture<>();
    dataSingle
        //.subscribeOn(Schedulers.io())
        .subscribe(dataFuture::complete);
    return dataFuture;
}
Found this. If it is fixed, then the question is: what's going on and how do I handle it?
Quarkus 1.0 Final.
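For reference, one thing that stands out is that subscribe is given only a success consumer. A minimal sketch (in Scala, using RxJava 2's two-argument subscribe overload) of propagating both success and error into the CompletableFuture; this is not necessarily the cause of the 500:
import java.util.concurrent.CompletableFuture
import io.reactivex.Single
import io.reactivex.functions.Consumer

// Complete the future on success AND on error, so a failing Single does not leave
// the CompletableFuture (and hence the suspended JAX-RS request) hanging.
def toCompletableFuture[A](single: Single[A]): CompletableFuture[A] = {
  val future = new CompletableFuture[A]()
  single.subscribe(
    new Consumer[A] { def accept(a: A): Unit = future.complete(a) },
    new Consumer[Throwable] { def accept(e: Throwable): Unit = future.completeExceptionally(e) }
  )
  future
}
The resource method would then return something like toCompletableFuture(service.getData()) instead of wiring the future by hand.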

buffer pool is destroyed on POJO type

I have a custom source that emits a user-defined data type, BaseEvent. The following code works fine when BaseEvent is not a POJO.
But when I changed it to a POJO by adding a default constructor, I get a “Buffer pool is destroyed” runtime exception in the collect method. I'm running Flink 1.7.0.
DataStream<BaseEvent> eventStream = see.addSource(new AgoraSource(configFile, instance));
DataStream<Tuple4<String, Long, Double, String>> result_order = eventStream
    .filter(e -> e instanceof OrderEvent)
    .map(e -> (OrderEvent) e)
    .map(e -> new Tuple3<>(e.SecurityID, Long.valueOf(1), Double.valueOf(e.OriginalQuantity))).returns(info_tuple3)
    .keyBy(e -> e.f0)
    .timeWindow(Time.seconds(5))
    .reduce((a, b) -> new Tuple3<>(a.f0, a.f1 + b.f1, a.f2 + b.f2))
    .map(e -> new Tuple4<>(e.f0, e.f1, e.f2, "Order")).returns(info_tuple4);

AKKA HTTP + AKKA stream 100% CPU utilization

I have a web API exposing one GET endpoint using Akka HTTP. The logic takes a parameter from the requester, calls an external web service using Akka Streams, and, based on the response, queries another endpoint, also using Akka Streams.
The first external endpoint call looks like this:
def poolFlow(uri: String): Flow[(HttpRequest, T), (Try[HttpResponse], T), HostConnectionPool] =
  Http().cachedHostConnectionPool[T](host = uri, 80)

def parseResponse(parallelism: Int): Flow[(Try[HttpResponse], T), (ByteString, T), NotUsed] =
  Flow[(Try[HttpResponse], T)].mapAsync(parallelism) {
    case (Success(HttpResponse(_, _, entity, _)), t) =>
      entity.dataBytes.alsoTo(Sink.ignore)
        .runFold(ByteString.empty)(_ ++ _)
        .map(e => e -> t)
    case (Failure(ex), _) => throw ex
  }

def parse(result: String, data: RequestShape): (Coord, Coord, String) =
  (data.src, data.dst, result)

val parseEntity: Flow[(ByteString, RequestShape), (Coord, Coord, String), NotUsed] =
  Flow[(ByteString, RequestShape)] map {
    case (entity, request) => parse(entity.utf8String, request)
  }
and the stream consumer:
val routerResponse = httpRequests
  .map(buildHttpRequest)
  .via(RouterRequestProcessor.poolFlow(uri)).async
  .via(RouterRequestProcessor.parseResponse(2))
  .via(RouterRequestProcessor.parseEntity)
  .alsoTo(Sink.ignore)
  .runFold(Vector[(Coord, Coord, String)]()) {
    (acc, res) => acc :+ res
  }
routerResponse
Then I do some calculations on routerResponse and create a POST to the other external web service.
Second external stream consumer:
def poolFlow(uri: String): Flow[(HttpRequest, Unit), (Try[HttpResponse], Unit), Http.HostConnectionPool] =
  Http().cachedHostConnectionPoolHttps[Unit](host = uri)

val parseEntity: Flow[(ByteString, Unit), (Unit.type, String), NotUsed] = Flow[(ByteString, Unit)] map {
  case (entity, _) => parse(entity.utf8String)
}

def parse(result: String): (Unit.type, String) = (Unit, result)

val res = Source.single(httpRequest)
  .via(DataRobotRequestProcessor.poolFlow(uri))
  .via(DataRobotRequestProcessor.parseResponse(1))
  .via(DataRobotRequestProcessor.parseEntity)
  .alsoTo(Sink.ignore)
  .runFold(List[String]()) {
    (acc, res) => acc :+ res._2
  }
The GET endpoint consumes the first stream and then builds the second request based on the first response.
Notes:
the first external service is fast (1-2 seconds response time), and the second external service is slow (3-4 seconds response time)
the first endpoint is queried with parallelism=2 and the second endpoint with parallelism=1
the service runs on an AWS ECS cluster, and for test purposes it runs on a single node
The problem:
The web service works for some time, but CPU utilization gets higher as it handles more requests. I would assume something related to backpressure is being triggered, and strangely the CPU stays highly utilized even after no more requests are being sent.
Does anybody have a clue what's going on?
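One pattern the Akka HTTP docs recommend for the host-level API is a single long-lived pool materialization fed through a queue, rather than materializing a new stream per incoming request; a rough sketch (the uri value and all names not taken from the snippets above are assumed, and this is not a confirmed diagnosis of the CPU issue):
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Promise
import scala.util.{Failure, Success}

implicit val system: ActorSystem = ActorSystem()
implicit val mat: ActorMaterializer = ActorMaterializer()

val uri = "first-service.example.com" // placeholder host

// One pool and one queue for the whole service; every incoming GET only offers to the queue.
val poolFlow = Http().cachedHostConnectionPool[Promise[HttpResponse]](host = uri, port = 80)

val requestQueue = Source
  .queue[(HttpRequest, Promise[HttpResponse])](bufferSize = 256, OverflowStrategy.backpressure)
  .via(poolFlow)
  .to(Sink.foreach {
    // The caller is responsible for consuming (or discarding) the response entity.
    case (Success(response), promise) => promise.success(response)
    case (Failure(ex), promise)       => promise.failure(ex)
  })
  .run()

// Per request: val p = Promise[HttpResponse](); requestQueue.offer(request -> p); p.future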

Spark streaming nested execution serialization issues

I am trying to connect to a DB2 database in a Spark Streaming application, and the database query execution statement is causing "org.apache.spark.SparkException: Task not serializable" issues. Please advise. Below is the sample code I have for reference.
dataLines.foreachRDD { rdd =>
  val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
  val dataRows = rdd.map(rs => rs.value).map(row =>
    row.split(",")(1) -> (row.split(",")(0), row.split(",")(1), row.split(",")(2),
      "cvflds_" + row.split(",")(3).toLowerCase, row.split(",")(4), row.split(",")(5), row.split(",")(6))
  )
  val db2Conn = getDB2Connection(spark, db2ConParams)
  dataRows.foreach { case (k, v) =>
    val table = v._4
    val dbQuery = s"(SELECT * FROM $table ) tblResult"
    val df = getTableData(db2Conn, dbQuery)
    df.show(2)
  }
}
Below is the other function code:
private def getDB2Connection(spark: SparkSession, db2ConParams: scala.collection.immutable.Map[String, String]): DataFrameReader = {
  spark.read.format("jdbc").options(db2ConParams)
}

private def getTableData(db2Con: DataFrameReader, tableName: String): DataFrame = {
  db2Con.option("dbtable", tableName).load()
}

object SparkSessionSingleton {

  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession
        .builder
        .config(sparkConf)
        .getOrCreate()
    }
    instance
  }
}
Below is the error log:
2018-03-28 22:12:21,487 [JobScheduler] ERROR org.apache.spark.streaming.scheduler.JobScheduler - Error running job streaming job 1522289540000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:915)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:915)
at ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1.apply(DB2DataLoadToKudu.scala:139)
at ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1.apply(DB2DataLoadToKudu.scala:128)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:254)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:254)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:253)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.spark.sql.DataFrameReader
Serialization stack:
- object not serializable (class: org.apache.spark.sql.DataFrameReader, value: org.apache.spark.sql.DataFrameReader#15fdb01)
- field (class: ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1$$anonfun$apply$2, name: db2Conn$1, type: class org.apache.spark.sql.DataFrameReader)
- object (class ncc.org.civil.receiver.DB2DataLoadToKudu$$anonfun$createSparkContext$1$$anonfun$apply$2, )
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 30 more
Ideally you should keep the closure in dataRows.foreach clear of any connection objects, since the closure is meant to be serialized to executors and run there. This concept is covered in depth at this official link.
In your case, the line below is the part of the closure that is causing the issue:
val df = getTableData(db2Conn, dbQuery)
So, instead of using Spark to load the DB2 table, which in your case becomes (after combining the methods):
spark.read.format("jdbc").options(db2ConParams).option("dbtable", tableName).load()
use plain JDBC in the closure to achieve this instead. You can use db2ConParams in the JDBC code (I assume it is simple enough to be serializable). The link also suggests using rdd.foreachPartition and a connection pool to optimize further.
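A rough sketch of what that could look like (assuming db2ConParams carries "url", "user" and "password" entries and that the DB2 JDBC driver is on the executor classpath; adapt to your actual keys):
import java.sql.DriverManager

dataRows.foreachPartition { partition =>
  // Opened on the executor, so nothing non-serializable is captured by the closure.
  val conn = DriverManager.getConnection(
    db2ConParams("url"), db2ConParams("user"), db2ConParams("password"))
  try {
    partition.foreach { case (_, v) =>
      val table = v._4
      val stmt = conn.createStatement()
      // FETCH FIRST 2 ROWS ONLY mirrors the df.show(2) in the question.
      val rs = stmt.executeQuery(s"SELECT * FROM $table FETCH FIRST 2 ROWS ONLY")
      while (rs.next()) {
        // consume the rows as needed
      }
      rs.close()
      stmt.close()
    }
  } finally {
    conn.close()
  }
}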
You have not mentioned what you are doing with the table data other than df.show(2). If the rows are huge, you may want to share more about your use case; perhaps you need to consider a different design.

Akka stream stops after one element

My Akka stream is stopping after a single element. Here's my stream:
val firehoseSource = Source.actorPublisher[FirehoseActor.RawTweet](
  FirehoseActor.props(
    auth = ...
  )
)

val ref = Flow[FirehoseActor.RawTweet]
  .map(r => ResponseParser.parseTweet(r.payload))
  .map { t => println("Received: " + t); t }
  .to(Sink.onComplete({
    case Success(_) => logger.info("Stream completed")
    case Failure(x) => logger.error(s"Stream failed: ${x.getMessage}")
  }))
  .runWith(firehoseSource)
FirehoseActor connects to the Twitter firehose and buffers messages to a queue. When the actor receives a Request message, it takes the next element and returns it:
def receive = {
  case Request(_) =>
    logger.info("Received request for next firehose element")
    onNext(RawTweet(queue.take()))
}
The problem is that only a single tweet is being printed to the console. The program doesn't quit or throw any errors, and I've sprinkled logging statements around, but none are printed.
I thought the sink would keep signalling demand to pull elements through, but that doesn't seem to be the case, since neither of the messages in Sink.onComplete gets printed. I also tried Sink.ignore, but that only printed a single element as well. The log message in the actor only gets printed once, too.
What sink do I need to use to make it pull elements through the flow indefinitely?
Ah, I should have respected totalDemand in my actor. This fixes the issue:
def receive = {
  case Request(_) =>
    logger.info("Received request for next firehose element")
    while (totalDemand > 0) {
      onNext(RawTweet(queue.take()))
    }
}
I was expecting to receive a Request for each element in the stream, but apparently a single Request can carry demand for several elements, so the actor has to keep calling onNext while totalDemand > 0.
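(As a side note, ActorPublisher was deprecated in later Akka releases; a Source.queue-based equivalent of the same pipeline, sketched with the names from the question, looks roughly like this:)
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, OverflowStrategy}
import akka.stream.scaladsl.{Keep, Sink, Source}

implicit val system: ActorSystem = ActorSystem()
implicit val mat: ActorMaterializer = ActorMaterializer()

// The firehose client offers elements to the queue instead of implementing
// onNext/totalDemand itself; backpressure is handled by the queue.
val (tweetQueue, done) = Source
  .queue[FirehoseActor.RawTweet](bufferSize = 1024, OverflowStrategy.backpressure)
  .map(r => ResponseParser.parseTweet(r.payload))
  .toMat(Sink.foreach(t => println("Received: " + t)))(Keep.both)
  .run()

// Wherever a raw tweet arrives: tweetQueue.offer(FirehoseActor.RawTweet(payload))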
