Strange transactional id errors when using the kafka sink

Strange transactional id errors when using the kafka sink - apache-flink

I had a Flink 1.15.1 job configured with
execution.checkpointing.mode='EXACTLY_ONCE'
that was failing with the following error
Sink: Committer (2/2)#732 (36640a337c6ccdc733d176b18adab979) switched from INITIALIZING to FAILED with failure cause: java.lang.IllegalStateException: Failed to commit KafkaCommittable{producerId=4521984, epoch=0, transactionalId=}
...
Caused by: org.apache.kafka.common.config.ConfigException: Invalid value for configuration transactional.id: String must be non-empty
that happened after the first checkpoint was triggered. The strange thing about it is that the KafkaSinkBuilder was used without calling setDeliverGuarantee, and hence the default delivery guarantee was expected to be used, which is NONE 1.
Is that even possible to start with? Shouldn't kafka transactions be involved only when one follows this recipe in 2?
* <p>One can also configure different {#link DeliveryGuarantee} by using {#link
* #setDeliverGuarantee(DeliveryGuarantee)} but keep in mind when using {#link
* DeliveryGuarantee#EXACTLY_ONCE} one must set the transactionalIdPrefix {#link
* #setTransactionalIdPrefix(String)}.
So, in my case, without calling setDeliverGuarantee (nor setTransactionalIdPrefix), I cannot understand why I was seeing these errors. To avoid the problem, I temporarily relaxed the checkpointing settings to
execution.checkpointing.mode='AT_LEAST_ONCE'
but I'd like to understand what was happening.

Like the JavaDoc mentions, if you enable exactly-once, you must set a transactionalIdPrefix. A complete recipe on how-to configure exactly-once with Apache Kafka can be found in this recipe: https://www.docs.immerok.cloud/docs/cookbook/exactly-once-with-apache-kafka-and-apache-flink/
Disclaimer: I work for Immerok

Related

Apache Flink with Kinesis Analytics : java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0

Background :
I have been trying to setup BATCH + STREAMING in the same flink application which is deployed on kinesis analytics runtime. The STREAMING part works fine, but I'm having trouble adding support for BATCH.
Flink : Handling Keyed Streams with data older than application watermark
Apache Flink : Batch Mode failing for Datastream API's with exception `IllegalStateException: Checkpointing is not allowed with sorted inputs.`
The logic is something like this :
The logic is something like this :
streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
streamExecutionEnvironment.fromSource(FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
WatermarkStrategy.noWatermarks(),
"Text File")
.process(process function which transforms input)
.assignTimestampsAndWatermarks(WatermarkStrategy
.<DetectionEvent>forBoundedOutOfOrderness(orderness)
.withTimestampAssigner(
(SerializableTimestampAssigner<Event>) (event, l) -> event.getEventTime()))
.keyBy(keyFunction)
.window(TumblingEventWindows(Time.of(x days))
.process(processWindowFunction);
On doing this I'm getting the below exception :
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:254)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for WindowOperator_90bea66de1c231edf33913ecd54406c1_(1/1) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
... 10 more
Caused by: java.io.IOException: Failed to acquire shared cache resource for RocksDB
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.allocateSharedCachesIfConfigured(RocksDBOperationUtils.java:306)
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:426)
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:90)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
... 12 more
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:672)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:521)
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.allocateSharedCachesIfConfigured(RocksDBOperationUtils.java:302)
... 17 more
Seems like kinesis-analytics does not allow clients to define a flink-conf.yaml file to define taskmanager.memory.managed.consumer-weights. Is there any way around this ?

It's not clear to me what the underlying cause of this exception is, nor how to make batch processing work on KDA.
You can try this (but I'm not sure KDA will allow it):
Configuration conf = new Configuration();
conf.setString("taskmanager.memory.managed.consumer-weights", "put-the-value-here");
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment(conf);

[flink]Task manager initialization failed

I am new to flink. I am trying to run the flink example on my local PC(windows).
However, after I run the start-cluster.bat, I login to the dashboard, it shows the task manager is 0.
I checked the log and seems it fails to initialize:
2020-02-21 23:03:14,202 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager initialization failed.
org.apache.flink.configuration.IllegalConfigurationException: Failed to create TaskExecutorResourceSpec
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpec.FromConfig(TaskExecutorResourceUtils.java:72)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.startTaskManager(TaskManagerRunner.java:356)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.<init>(TaskManagerRunner.java:152)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:308)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerSecurely$2(TaskManagerRunner.java:322)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerSecurely(TaskManagerRunner.java:321)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:287)
Caused by: org.apache.flink.configuration.IllegalConfigurationException: The required configuration option Key: 'taskmanager.cpu.cores' , default: null (fallback keys: []) is not set
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkConfigOptionIsSet(TaskExecutorResourceUtils.java:90)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.lambda$checkTaskExecutorResourceConfigSet$0(TaskExecutorResourceUtils.java:84)
at java.util.Arrays$ArrayList.forEach(Arrays.java:3880)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.checkTaskExecutorResourceConfigSet(TaskExecutorResourceUtils.java:84)
at org.apache.flink.runtime.taskexecutor.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:70)
... 7 more
2020-02-21 23:03:14,217 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
Basically, it looks like a required option 'taskmanager.cpu.cores' is not set. However, I can't find this property in flink-conf.yaml and in the document(https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/config.html) either.
I am using flink 1.10.0. Any help would be highly appreciated!

That configuration option is intended for internal use only -- it shouldn't be user configured, which is why it isn't documented.
The windows start-cluster.bat is failing because of a bug introduced in Flink 1.10. See https://jira.apache.org/jira/browse/FLINK-15925.
One workaround is to use the bash script, start-cluster.sh, instead.
See also this mailing list thread: https://lists.apache.org/thread.html/r7693d0c06ac5ced9a34597c662bcf37b34ef8e799c32cc0edee373b2%40%3Cdev.flink.apache.org%3E

Flink1.5.4 exception: Corrupt stream, found tag: 105

My program wants to join two streams without Flink Window.
I connect two streams and define a class A extends RichCoFlatMapFunction to handle them.
In class A, I use a Guava cache to hold all the data from flatmap1/2 method, and join them by a tag from streams.
Then Guava cache has a remove listener to collect joined&expired data to next Flink Function.
private synchronized void collect(ReqFeatures features) {
feaCollector.collect(features);
}
Each time at the beginning, it runs well, but a few hours later, it's always dead because of this exception.
java.io.IOException: Corrupt stream, found tag: 105
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:220)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:49)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:106)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:172)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:104)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:712)
at java.lang.Thread.run(Thread.java:748)
And sometimes there's another error log:
java.lang.IllegalStateException: When there are multiple buffers, an unfinished bufferConsumer can not be at the head of the buffers queue.
at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
at org.apache.flink.runtime.io.network.partition.PipelinedSubpartition.pollBuffer(PipelinedSubpartition.java:158)
at org.apache.flink.runtime.io.network.partition.PipelinedSubpartitionView.getNextBuffer(PipelinedSubpartitionView.java:51)
at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.getNextBuffer(LocalInputChannel.java:186)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:551)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:508)
at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:94)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:104)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:306)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:712)
at java.lang.Thread.run(Thread.java:748)
If I use Flink Window Function instead, this exception doesn't occur.
Why does this exception occur, and how can I resolve it?

I can confirm this also happens in Flink 1.9.1 (albeit for us, it happens when we run flink stop <job-id>)

I fixed the same problem with getting checkpointing lock while collecting output. The users flatMap function already hold the checkpointing lock, so if u collect output in flatMap function could also fix this problem.
in flink's code:
synchronized (checkpointingLock) {
numRecordsIn.inc();
streamOperator.setKeyContextElement1(record);
streamOperator.processElement(record);
}

CamelExchangeException Invalid Correlation Key on exchange header

Apache Camel is throwing Invalid Correlation Key exception when trying to aggregate messages from my AWS SQS queue.
The messages were placed in the queue using ZipSplitter and they all appear in the queue with matching "parentId" values (which I added using a random uuid as part of the splitting -I've tried CamelSourceFile as well). I get the Exception repeatedly until the retries are exhausted.
My aggregate expression:
from(--queue--).aggregate(header("parentId"), customAggregationStrategy).completionTimeout(3000).processor(new Processor() {...}.to(--next queue--);
There is no logging emitted from my customAggregationStrategy nor from any of the subsequent processors. It fails to aggregate:
... DeadLetterChannel - Failed delivery for (MessageId: ...). On Delivery attempt: 0 caught ...CamelExchangeException: Invalid correlation key. Exchange[ID...]
The delivery attempt is 0 through 9 for my retry attempts.
The infuriating thing is that the code works everywhere but locally...which you think would narrow things down, but neither the exception nor anything else logged sheds any light onto what is going on here.

You could try to use Camel simple language when expressing the correlation key, ie:
.aggregate(simple("${headers.parentId}", customAggregationStrategy)
This way, the exceptions might be silently ignored ?
Did you activate Camel tracer (http://camel.apache.org/tracer.html) to analyse your exchanges and ease the debugging ?
I suspect you have an Exchange which does NOT have the "parentId" header. If you want to skip them, just activate the ignoreInvalidCorrelationKeys option (see http://camel.apache.org/aggregator2.html)

Camel ActiveMQ client blocking, temp storage usage immediately hits 100%

I'm seeing 100% utilisation of activemq's temp storage (configured to be 100mb), and the activemq client is blocking. This 100% usage remains permanently, and I have no idea what's going on
I have a camel route, which consumes from a queue (QUEUE.IN) using the JmsTransactionManager.
public final class RouteUnderTest extends RouteBuilder {
#Override
public void configure() throws Exception {
from("activemq-transacted:QUEUE.IN")
.bean(myBean)
.to("activemq:QUEUE.OUT");
}
}
While processing the message from this queue I'm invoking a spring-integration client (myBean) which is configured as follows
<int:gateway id="myBean" service-interface="MyBean">
<int:method name="request" request-channel="channel"/>
</int:gateway>
<int:chain input-channel="channel">
<int:transformer ref="transformedToJsonHere"/>
<jms:outbound-gateway request-destination-name="QUEUE.MYBEAN"
receive-timeout="5000"
explicit-qos-enabled="true"
time-to-live="5000"
delivery-persistent="false"/>
<int:transformer ref="transformedToAnObjectHere"/>
</int:chain>
My broker is configured to use LevelDB, and with the following usage limits:
<persistenceAdapter>
<levelDB directory="${activemq.data}/leveldb"/>
</persistenceAdapter>
<systemUsage>
<systemUsage>
<memoryUsage>
<memoryUsage percentOfJvmHeap="70"/>
</memoryUsage>
<storeUsage>
<storeUsage limit="500 mb"/>
</storeUsage>
<tempUsage>
<tempUsage limit="100 mb"/>
</tempUsage>
</systemUsage>
</systemUsage>
When my route consumes the message and then attempts to put a non-persistent message on QUEUE.OUT the client is blocked and my broker shows 100% usage of temp storage.
And I see the following activemq logs
2015-07-28 15:44:59,678 | INFO | Usage(default:temp:queue://QUEUE.MYBEAN:temp) percentUsage=0%, usage=104857600, limit=104857600, percentUsageMinDelta=1%;Parent:Usage(default:temp) percentUsage=100%, usage=104857600, limit=104857600, percentUsageMinDelta=1%: Temp Store is Full (0% of 104857600). Stopping producer (ID:orbit-vm-55561-1438094698190-1:1:3:1) to prevent flooding queue://QUEUE.MYBEAN. See http://activemq.apache.org/producer-flow-control.html for more info (blocking for: 1s) | org.apache.activemq.broker.region.Queue | ActiveMQ NIO Worker 6
The queues look like (You can see that the QUEUE.IN message has been not been dequeued because it's still being processed transactionally, and no message has gone to QUEUE.MYBEAN)
I can fix this problem with any one of the following approaches:
Use KahaDB instead of LevelDB
Increase temp storage limit (150MB seems to do it but I haven't experimented a great deal)
Configure tempDataStore in activemq.xml (see below)
When configuring the tempDataStore it looks like:
<tempDataStore>
<bean xmlns="http://www.springframework.org/schema/beans" class="org.apache.activemq.leveldb.LevelDBStore">
<property name="directory" value="${activemq.data}/tmp" />
</bean>
</tempDataStore>
I should add, we were using KahaDB previously and this worked fine, but the upgrade to LevelDB has exposed this issue. Reverting to KahaDB is not an option.
I'm hoping someone could explain what we're seeing here, as the results are really difficult to understand. Why does using LevelDB necessitate a higher temp usage limit?, and why does configuring the tempDataStore explicitly also fix the problem?
I don't fully understand what's going on here so I'm worried that simply increasing the temp usage limit a little will just hide the problem until a later date.
Versions:
ActiveMQ: 5.11.1
Camel: 2.14.0
Spring: 4.0.8.RELEASE
Spring Integration: 4.0.5.RELEASE

We ran into exactly the same issue with ActiveMQ 5.13.2
The solution when using LevelDB is to explicitly configure a dedicated tempDataStore as you did.
If not, the broker uses the same store (LevelDB) for both persistent (persistent usage) and non-persistent messages (temp usage). You may therefore end-up in situations where the broker doesn't accept any non-persistent message anymore just because the store already holds persistent ones up to the configured tempUsage limit. It will however accept persistent ones if your storeUsage limit is set higher...
When using KahaDB, the broker automatically uses another store for the non-persistent messages (created in the tmp directory). So you don't have the problem...
Look at the following code for more indepth information: https://github.com/apache/activemq/blob/activemq-5.13.2/activemq-broker/src/main/java/org/apache/activemq/broker/BrokerService.java#L1739
When reading that code, remember LevelDBStore implements PListStore, but KahaDBStore doesn't...