apache beam application failing in Flink with NotSerializableException - apache-flink

my use case use beam generate sequence to create unbounded source and do further transforms. it is working fine in direct runner. As per docs, Flink serialize Java primitive types and boxed forms, but I am getting the following error.
Caused by: org.apache.flink.api.common.InvalidProgramException: [lastEmitted type:LONG pos:0, startTime type:RECORD pos:1] is not serializable. The object probably contains or references non serializable fields.
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:140)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.api.java.ClosureCleaner.clean(ClosureCleaner.java:115)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.clean(StreamExecutionEnvironment.java:1558)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1470)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1414)
at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.addSource(StreamExecutionEnvironment.java:1396)
at org.apache.beam.runners.flink.FlinkStreamingTransformTranslators$UnboundedReadSourceTranslator.translateNode(FlinkStreamingTransformTranslators.java:221)
... 31 more
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$Field
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at java.util.ArrayList.writeObject(ArrayList.java:768)
source
GenerateSequence.from(0).withRate(1, Duration.standardSeconds(10));
thanks for your help

Related

Flink Falling back to Kyro when using Custom Avro Deserializer

I have a very simple Flink Streaming job that basically consumes 2 records from a topic (r1 with schema s1 and r2 with schema s2, both s1 and s2 are backward+forward compatible).
Schema definition: s2 has 1 field extra than s1.
I'm trying to make the FlinkKafkaConsumer run using ConfluentRegistryAvroSerializationSchema.forGeneric(), which takes in s1 as reader schema and also schema registry URL, it doesn't work. The deserializer just ignores the additional column as the reader schema doesn't have it. How is schema registry being useful here?
Additionally, I tried writing a custom wrapper for KafkaAvroDeserializer referring this: Is it possible to deserialize Avro message(consuming message from Kafka) without giving Reader schema in ConfluentRegistryAvroDeserializationSchema. This solution explores using only the writer schema without reader schema provided - which is exactly what I need.
This gives me another runtime exception with Flink as follows:
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException
Serialization trace:
reserved (org.apache.avro.Schema$Field)
fieldMap (org.apache.avro.Schema$RecordSchema)
schema (org.apache.avro.generic.GenericData$Record)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:143)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:21)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:354)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:191)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:53)
at org.apache.flink.runtime.io.network.api.serialization.NonSpanningWrapper.readInto(NonSpanningWrapper.java:337)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNonSpanningRecord(SpillingAdaptiveSpanningRecordDeserializer.java:128)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:103)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:93)
at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:95)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:496)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761)
at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.UnsupportedOperationException
at java.util.Collections$UnmodifiableCollection.add(Collections.java:1057)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
... 30 more
How do I make this work and understand the core concept of schema registry better?

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kakfa, does stateful processing of the record, then writes the result back to Kafka.
After reading from Kafka topic, I choose to use reinterpretAsKeyedStream() and not keyBy() to avoid a shuffle, since the records are already partitioned in Kakfa. The key used to partition in Kakfa is a String field of the record (using the default kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
KeyedProcessFunction<String, Envelope, Envelope> {
...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource)
DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
With parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the follow error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when EnvelopeMapper class validates the record is sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EventMapper?
Thank you in advance,
Ahmed.
Update
After feedback from #David Anderson, replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource) // Line x
.map(statelessMapper1)
.flatMap(statelessMapper2);
messageStream.keyBy(Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kakfa (marked with "Line x") vs right before the stateful Mapper (EnvelopeMapper).
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function where you know that the output records are key partitioned in a particular way. To use it successfully with Kafka is can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the keyGroups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.

Flink kafka getting serialization issue for timestamp values

I have a flink Row with TypeInformation of Object class for all columns, I'm getting below error for timestamp columns while serializing Row to kafka. Flink version: 1.14.0
Caused by: java.lang.IllegalArgumentException: Java 8 date/time type `java.time.LocalDateTime` not supported by default: add Module "org.apache.flink.shaded.jackson2.com.fasterxml.jackson.datatype:jackson-datatype-jsr310" to enable handling
at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.valueToTree(ObjectMapper.java:3312)
at org.apache.flink.formats.json.JsonRowSerializationSchema.lambda$createFallbackConverter$8c80cf11$1(JsonRowSerializationSchema.java:266)
... 21 more

Flink: Key Group 91 does not belong to the local range

As title, the exception occurs in keyed windows,
java.lang.IllegalArgumentException: Key Group 91 does not belong to the local range.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:139)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.getIndexForKeyGroup(HeapInternalTimerService.java:431)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.getProcessingTimeTimerSetForKeyGroup(HeapInternalTimerService.java:412)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.getProcessingTimeTimerSetForTimer(HeapInternalTimerService.java:402)
at org.apache.flink.streaming.api.operators.HeapInternalTimerService.registerProcessingTimeTimer(HeapInternalTimerService.java:194)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.registerProcessingTimeTimer(WindowOperator.java:907)
at org.apache.flink.streaming.api.windowing.triggers.ProcessingTimeTrigger.onElement(ProcessingTimeTrigger.java:36)
at org.apache.flink.streaming.api.windowing.triggers.ProcessingTimeTrigger.onElement(ProcessingTimeTrigger.java:28)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.onElement(WindowOperator.java:926)
at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:393)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:207)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:745)
code as:
stream.keyBy(...).timeWindow(Time.minutes(5)).apply(...)
the implement of keyBy is String result. Is there any idea about it? I have seen code in HeapInternalTimerService, but what is the case that keyGroupId out of local range?
I see two possibilities that could lead to this error.
Your key extractor function is not deterministic, i.e., it might return different values.
There is a bug in Flink.
Please check that 1. is not the case. If you are sure that the key extractor is not the problem, please reach out to the Flink user mailing list or create a Jira issue.

Hector error inserting integer

I am running a single node Cassandra instance (for dev purposes) and am looking to insert an integer row into it. My Keyspace and columnfamily are already created on Cassandra.
I am using Cassandra 1.0 with Hector 1.0.5 (Jar version). My code is as follows:
Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "10.40.14.93:9160");
Keyspace keyspaceOperator = HFactory.createKeyspace("mykeyspace", cluster)
Mutator intM = HFactory.createMutator(keyspaceOperator, IntegerSerializer.get());
for each elem in my list {
intM.insert(doc.document_id ,
"mycolfamily",
me.prettyprint.hector.api.factory.HFactory.createColumn("numAdults", doc.numAdults))
}
I get TimedOutException on my client, and in the Cassandra logs, I see a bunch of the following:
ERROR [MutationStage:357] 2012-07-20 08:15:02,106 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[MutationStage:357,5,main]
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: ""
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:410)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.solr.schema.TrieField.createField(TrieField.java:508)
at org.apache.solr.schema.FieldType.createFields(FieldType.java:292)
at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:106)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.addFieldToDocument(SolrSecondaryIndex.java:382)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.populateDocument(SolrSecondaryIndex.java:280)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.applyIndexUpdates(SolrSecondaryIndex.java:164)
at org.apache.cassandra.db.index.SecondaryIndexManager.applyIndexUpdates(SecondaryIndexManager.java:419)
at org.apache.cassandra.db.Table.apply(Table.java:448)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:256)
at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:415)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1224)
... 3 more
ERROR [MutationStage:357] 2012-07-20 08:15:02,106 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[MutationStage:357,5,main]
java.lang.RuntimeException: java.lang.NumberFormatException: For input string: ""
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.NumberFormatException: For input string: ""
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:410)
at java.lang.Long.parseLong(Long.java:468)
at org.apache.solr.schema.TrieField.createField(TrieField.java:508)
at org.apache.solr.schema.FieldType.createFields(FieldType.java:292)
at org.apache.solr.schema.SchemaField.createFields(SchemaField.java:106)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.addFieldToDocument(SolrSecondaryIndex.java:382)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.populateDocument(SolrSecondaryIndex.java:280)
at com.datastax.bdp.cassandra.index.solr.SolrSecondaryIndex.applyIndexUpdates(SolrSecondaryIndex.java:164)
at org.apache.cassandra.db.index.SecondaryIndexManager.applyIndexUpdates(SecondaryIndexManager.java:419)
at org.apache.cassandra.db.Table.apply(Table.java:448)
at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:256)
at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:415)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1224)
}
I am trialling the Datastax Enterprise (DSE) which packages Cassandra, Hadoop, Solr etc. I have created my Cassandra CF via Solr Configuration (You can post Solr config and schema xmls to a Datastax instance to create the Keyspace and CF - its a feature of DSE)
Could someone please help?
Try adding an explicit serializer to your createColumn call...like so:
me.prettyprint.hector.api.factory.HFactory.createColumn("numAdults", doc.numAdults, StringSerializer.get(), IntegerSerializer.get()))
Also, on another note, I see you're doing inserts in a loop. Doing intM.addInsertion inside the loop and then intM.execute() once its done is more efficient.

Resources