Flink: cannot cancel a running job (streaming) - apache-flink

I want to run a streaming job. When I run it locally using start-cluster.sh and the Flink web interface, I have no problem.
However, I am currently trying to run my job using Flink on YARN
(deployed on Google Dataproc), and when I try to cancel it, the
canceling state lasts forever and a slot remains occupied in the
TaskManager.
Here is the log I got:
2016-10-18 16:56:04,053 INFO org.apache.flink.runtime.taskmanager.Task -
Attempting to cancel task Source: pubSubMessageAcknowledgingSource ->
TrackingDisplayPushDeduplicater -> TrackingDisplayPushDeserializer ->
(Sink: TrackingDisplayPushErrorFlumeSink, Map -> Sink:
TrackingDisplayPushValidFlumeSink) (1/1)
2016-10-18 16:56:04,053 INFO org.apache.flink.runtime.taskmanager.Task -
Source: pubSubMessageAcknowledgingSource ->
TrackingDisplayPushDeduplicater -> TrackingDisplayPushDeserializer ->
(Sink: TrackingDisplayPushErrorFlumeSink, Map -> Sink:
TrackingDisplayPushValidFlumeSink) (1/1) switched to CANCELING
2016-10-18 16:56:04,053 INFO org.apache.flink.runtime.taskmanager.Task -
Triggering cancellation of task code Source:
pubSubMessageAcknowledgingSource -> TrackingDisplayPushDeduplicater ->
TrackingDisplayPushDeserializer -> (Sink:
TrackingDisplayPushErrorFlumeSink, Map -> Sink:
TrackingDisplayPushValidFlumeSink) (1/1) (38bf32d9199a0c9383a8b1e8d73a1f65).
2016-10-18 16:56:34,055 WARN org.apache.flink.runtime.taskmanager.Task -
Task 'Source: pubSubMessageAcknowledgingSource ->
TrackingDisplayPushDeduplicater -> TrackingDisplayPushDeserializer ->
(Sink: TrackingDisplayPushErrorFlumeSink, Map -> Sink:
TrackingDisplayPushValidFlumeSink) (1/1)' did not react to cancelling
signal, but is stuck in method:
java.net.PlainSocketImpl.socketConnect(Native Method)
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
java.net.Socket.connect(Socket.java:589)
java.net.Socket.connect(Socket.java:538)
sun.net.NetworkClient.doConnect(NetworkClient.java:180)
sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
sun.net.www.http.HttpClient.New(HttpClient.java:308)
sun.net.www.http.HttpClient.New(HttpClient.java:326)
sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1169)
sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1105)
sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:999)
sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:933)
sun.net.www.protocol.http.HttpURLConnection.getOutputStream0(HttpURLConnection.java:1283)
sun.net.www.protocol.http.HttpURLConnection.getOutputStream(HttpURLConnection.java:1258)
com.accengage.bigdata.flink.streaming.sinks.FlumeSink.flush(FlumeSink.java:107)
com.accengage.bigdata.flink.streaming.sinks.FlumeSink.invoke(FlumeSink.java:80)
com.accengage.bigdata.flink.streaming.sinks.FlumeSink.invoke(FlumeSink.java:25)
org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:39)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.collector.selector.DirectedOutput.collect(DirectedOutput.java:126)
org.apache.flink.streaming.api.collector.selector.DirectedOutput.collect(DirectedOutput.java:35)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamMap.processElement(StreamMap.java:39)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamFilter.processElement(StreamFilter.java:38)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:373)
org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:358)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:346)
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:329)
org.apache.flink.streaming.api.operators.StreamSource$NonTimestampContext.collect(StreamSource.java:160)
com.accengage.bigdata.flink.streaming.sources.PubSubAcknowledgingSource.run(PubSubAcknowledgingSource.java:148)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:80)
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:53)
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:56)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:266)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
java.lang.Thread.run(Thread.java:745)
Any idea what I am doing wrong? What could I do?
Thanks.

I assume you are using a custom sink (com.accengage.bigdata.flink.streaming.sinks.FlumeSink) which uses some HTTP library for communicating with Flume.
Most likely, the HTTP library got stuck in a blocking call or a loop when the interrupt was sent to the thread (this happens, for example, when InterruptedExceptions are ignored).
To resolve the issue, you can either use an HTTP library that handles interrupts properly, or call the library from a separate thread, so that the blocking call does not swallow the interrupt meant for the main task thread.
In Flink 1.2 there will be an additional mechanism to prevent the system from getting stuck in the cancel() call. See FLINK-4715.
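As a rough illustration of the second option, here is a minimal sketch of running the blocking call on a helper thread; the executor, the 30-second timeout, and the doHttpFlush() method are hypothetical placeholders, not the actual FlumeSink code:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Run the blocking HTTP flush on a helper thread so that the Flink task
// thread stays responsive to the cancellation interrupt.
ExecutorService httpExecutor = Executors.newSingleThreadExecutor();

void flush(byte[] payload) throws Exception {
    Future<?> result = httpExecutor.submit(() -> {
        doHttpFlush(payload); // the call that was stuck in socketConnect
        return null;
    });
    try {
        // If Flink interrupts the task thread, get() throws an
        // InterruptedException immediately instead of hanging in the
        // native connect call.
        result.get(30, TimeUnit.SECONDS);
    } catch (Exception e) {
        result.cancel(true); // best effort: interrupt the helper thread too
        throw e;
    }
}

// Placeholder for the existing Flume HTTP logic.
void doHttpFlush(byte[] payload) throws java.io.IOException { /* ... */ }
Independently of the threading, setting explicit timeouts on the HttpURLConnection (setConnectTimeout/setReadTimeout) bounds how long the native connect seen in the stack trace can block in the first place.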

Related

Flink checkpoint not replaying the kafka events which were in process during the savepoint/checkpoint

I want to test end-to-end exactly-once processing in Flink. My job is:
Kafka-source -> mapper1 -> mapper2 -> kafka-sink
I had put a Thread.sleep(100000) in mapper1 and then ran the job. I took a savepoint while stopping the job, and then I removed the Thread.sleep(100000) from mapper1. I expected the event to be replayed, as it had not reached the sink, but that didn't happen and the job is waiting for a new event.
My Kafka source:
KafkaSource.<String>builder()
        .setBootstrapServers(consumerConfig.getBrokers())
        .setTopics(consumerConfig.getTopic())
        .setGroupId(consumerConfig.getGroupId())
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .setProperty("commit.offsets.on.checkpoint", "true")
        .build();
My Kafka sink:
KafkaSink.<String>builder()
        .setBootstrapServers(producerConfig.getBootstrapServers())
        .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic(producerConfig.getTopic())
                .setValueSerializationSchema(new SimpleStringSchema())
                .build())
        .build();
My environment setup for the Flink job:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.enableCheckpointing(2000);
environment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
environment.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
environment.getCheckpointConfig().setCheckpointTimeout(60000);
environment.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
environment.getCheckpointConfig().setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
environment.getCheckpointConfig().setCheckpointTimeout(1000);
environment.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
environment.getCheckpointConfig().enableUnalignedCheckpoints();
environment.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
Configuration configuration = new Configuration();
configuration.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);
environment.configure(configuration);
What am I doing wrong here?
I want any event that is in process during the cancellation/stop of the job to be replayed on restart.
EDIT 1:
I observed that my Kafka was showing an offset lag for my Flink job's kafka-source consumer group. I am assuming that means my checkpointing is behaving correctly, is that right?
I also observed that when I restarted my job from the checkpoint, it didn't start consuming from the remaining offsets, even though I have the consumer offset set to EARLIEST. I had to send more events to trigger consumption on the kafka-source side, and then it consumed all the events.
For exactly-once, you must provide a TransactionalIdPrefix that is unique across all applications running against the same Kafka cluster (this is a change compared to the legacy FlinkKafkaProducer):
KafkaSink<T> sink =
        KafkaSink.<T>builder()
                .setBootstrapServers(...)
                .setKafkaProducerConfig(...)
                .setRecordSerializer(...)
                .setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("unique-id-for-your-app")
                .build();
When resuming from a checkpoint, Flink always uses the offsets stored in the checkpoint rather than those configured in the code or stored in the broker.
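A side note worth checking when verifying exactly-once end to end (an assumption on my part, not something visible in your snippets): with DeliveryGuarantee.EXACTLY_ONCE the sink writes records inside Kafka transactions, so any consumer you use to inspect the output topic must read with isolation.level=read_committed, otherwise it may see records from open or aborted transactions and the test will look wrong. A minimal sketch of such verification consumer properties (bootstrap servers and group id are placeholders):
import java.util.Properties;

Properties verifierProps = new Properties();
verifierProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
verifierProps.setProperty("group.id", "verification-consumer");   // placeholder
// Only committed records become visible; records from open or aborted
// transactions are skipped.
verifierProps.setProperty("isolation.level", "read_committed");
Also, when testing a restore, pass the checkpoint path explicitly (e.g. flink run -s <checkpoint-path> ...), so the job actually resumes from it instead of starting fresh.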

Flink 1.12.2 modifying savepoint locally - metadata has absolute paths

I'm trying to modify an existing savepoint, created with Flink 1.12.2 & Ververica 2.4.1, that was saved on S3.
The steps that I took are the following:
Copied the savepoint containing the '_metadata' and savepoint files from S3 to my local machine;
Opened the Flink state and read the state of the operator I'm interested in;
Created and amended the dataset that I want to replace the state of that operator with;
Tried to amend the state with the following code:
BootstrapTransformation<AccountRegistrationInformation> transformation = OperatorTransformation
        .bootstrapWith(accountDataSet)
        .keyBy(acc -> acc.getBrand() + "-" + acc.getAccountId())
        .transform(new AccountRegistrationBootstrapper());

Savepoint.load(executionEnvironment, "C:\\flinkState", new MemoryStateBackend())
        .removeOperator("registration-processor")
        .withOperator("registration-processor", transformation)
        .write("C:\\flinkState\\transformed");
executionEnvironment.execute();
When running the above code, it amends a subset of the dataset and then Flink throws the following exception:
Caused by: java.io.FileNotFoundException: \<redacted>\savepoint-c680a3-c178150a8b8d\32c44059-1f59-4091-bcb5-3e1efa369ec6 (The system cannot find the path specified)
When inspecting the _metadata, I noticed that it contains absolute S3 paths:
s3://<redacted>/savepoint-c680a3-c178150a8b8d/32c44059-1f59-4091-bcb5-3e1efa369ec6
What I want is to save the amended savepoint to my local machine and then move that savepoint over to S3 manually so that flink can start with the amended state.
Can anybody share their experience with this?
Full exception:
10:09:25,169 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackend [] - Initializing heap keyed state backend with stream factory.
10:09:25,170 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder [] - Finished to build heap keyed state-backend.
10:09:25,171 INFO org.apache.flink.runtime.state.heap.HeapKeyedStateBackend [] - Initializing heap keyed state backend with stream factory.
10:09:25,176 INFO org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate [] - Converting recovered input channels (1 channels)
10:09:25,178 ERROR org.apache.flink.runtime.operators.DataSinkTask [] - Error in user code: \<redacted>\savepoints\d18b311a-86e8-4406-93b5-f2b398c4257f\savepoint-c680a3-c178150a8b8d\32c44059-1f59-4091-bcb5-3e1efa369ec6 (The system cannot find the path specified): DataSink (org.apache.flink.state.api.output.FileCopyFunction#da28d03) (1/1)
java.io.FileNotFoundException: \<redacted>\savepoints\d18b311a-86e8-4406-93b5-f2b398c4257f\savepoint-c680a3-c178150a8b8d\32c44059-1f59-4091-bcb5-3e1efa369ec6 (The system cannot find the path specified)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_282]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_282]
at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_282]
at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-core-1.12.2.jar:1.12.2]
at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[flink-core-1.12.2.jar:1.12.2]
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87) ~[flink-core-1.12.2.jar:1.12.2]
at org.apache.flink.state.api.output.FileCopyFunction.writeRecord(FileCopyFunction.java:61) ~[flink-state-processor-api_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.state.api.output.FileCopyFunction.writeRecord(FileCopyFunction.java:34) ~[flink-state-processor-api_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:235) [flink-runtime_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755) [flink-runtime_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570) [flink-runtime_2.11-1.12.2.jar:1.12.2]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
10:09:25,223 WARN org.apache.flink.runtime.taskmanager.Task [] - DataSink (org.apache.flink.state.api.output.FileCopyFunction#da28d03) (1/1)#0 (d4b998c90a0fc21a64f463b6476e85aa) switched from RUNNING to FAILED.
java.io.FileNotFoundException: \<redacted>\savepoints\d18b311a-86e8-4406-93b5-f2b398c4257f\savepoint-c680a3-c178150a8b8d\32c44059-1f59-4091-bcb5-3e1efa369ec6 (The system cannot find the path specified)
at java.io.FileInputStream.open0(Native Method) ~[?:1.8.0_282]
at java.io.FileInputStream.open(FileInputStream.java:195) ~[?:1.8.0_282]
at java.io.FileInputStream.<init>(FileInputStream.java:138) ~[?:1.8.0_282]
at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50) ~[flink-core-1.12.2.jar:1.12.2]
at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:134) ~[flink-core-1.12.2.jar:1.12.2]
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.open(SafetyNetWrapperFileSystem.java:87) ~[flink-core-1.12.2.jar:1.12.2]
at org.apache.flink.state.api.output.FileCopyFunction.writeRecord(FileCopyFunction.java:61) ~[flink-state-processor-api_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.state.api.output.FileCopyFunction.writeRecord(FileCopyFunction.java:34) ~[flink-state-processor-api_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.runtime.operators.DataSinkTask.invoke(DataSinkTask.java:235) ~[flink-runtime_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:755) [flink-runtime_2.11-1.12.2.jar:1.12.2]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:570) [flink-runtime_2.11-1.12.2.jar:1.12.2]
10:09:25,224 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for DataSink (org.apache.flink.state.api.output.FileCopyFunction#da28d03) (1/1)#0 (d4b998c90a0fc21a64f463b6476e85aa).
10:09:25,255 INFO org.apache.flink.runtime.taskmanager.Task [] - MapPartition (2861c3d1e95af557df2962264aaf94ef) (6/8)#0 (ab4fcd08aa51c77eec1ac6d3c9fba2d3) switched from RUNNING to FINISHED.
10:09:25,255 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for MapPartition (2861c3d1e95af557df2962264aaf94ef) (6/8)#0 (ab4fcd08aa51c77eec1ac6d3c9fba2d3).
10:09:25,255 INFO org.apache.flink.runtime.taskmanager.Task [] - MapPartition (2861c3d1e95af557df2962264aaf94ef) (8/8)#0 (c0105262e4e271633df686c1b09476a9) switched from RUNNING to FINISHED.
10:09:25,256 INFO org.apache.flink.runtime.taskmanager.Task [] - Freeing task resources for MapPartition (2861c3d1e95af557df2962264aaf94ef) (8/8)#0 (c0105262e4e271633df686c1b09476a9).
The absolute path in the _metadata could be a pointer to inline state, i.e., state stored directly in the _metadata file. The state stored in data files should have relative paths.
What path do you pass as 'C:\flinkState' in your code, and what does '<redacted>' stand for in the FileNotFoundException? If they are sensitive, can you provide an example of their structure?
Also, did you try on a Linux machine?
Update:
The added stack trace is similar to the one in https://issues.apache.org/jira/browse/FLINK-23429. Could you try adding the State Processor API dependency from Flink 1.12.5 to your savepoint transformation job?
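If the job is built with Maven, that would mean pinning only the State Processor API artifact to the patched version, roughly like this (a sketch, assuming the Scala 2.11 build seen in your stack trace):
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-state-processor-api_2.11</artifactId>
    <version>1.12.5</version>
</dependency>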

Flink Application failure on savepoint

I am observing a failure whenever I trigger a savepoint on my Flink Application which otherwise runs without issues.
Job Details:
Deployment: AWS Kinesis Data Analytics (Kubernetes)
5 Task Managers
Backend: RocksDB
Kinesis Data Units: 256 KPU
[Screenshot: Flink job graph, parallelism indicated in brackets]
[Screenshot: Task Manager details]
Exception Root Cause on Flink UI:
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Error at remote task manager '142.151.130.161/142.151.130.161:6121'.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.decodeMsg(CreditBasedPartitionRequestClientHandler.java:294)
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelRead(CreditBasedPartitionRequestClientHandler.java:183)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelRead(NettyMessageClientDecoderDelegate.java:115)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1224)
at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1271)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:505)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1421)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:794)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:326)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.flink.runtime.io.network.partition.ProducerFailedException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.writeAndFlushNextMessageIfPossible(PartitionRequestQueue.java:224)
at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.enqueueAvailableReader(PartitionRequestQueue.java:108)
at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.userEventTriggered(PartitionRequestQueue.java:173)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:319)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:369)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:319)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:369)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:319)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1439)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:341)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:327)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireUserEventTriggered(DefaultChannelPipeline.java:924)
at org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.lambda$notifyReaderNonEmpty$0(PartitionRequestQueue.java:87)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:331)
... 3 more
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:798)
at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1354)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.finishAndReportAsync(SubtaskCheckpointCoordinatorImpl.java:451)
at org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:267)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:917)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:907)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:873)
at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:113)
at org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:198)
at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:93)
at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:158)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:346)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191)
at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:724)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:549)
... 1 more
Any help debugging this issue would be appreciated.

How to send static (cached) data, refreshed roughly every 2 hours, to operators in Flink

I am using a singleton class (ConfigurationUtil) in my main method to load the static data, and then using it inside my operator to process the events, but I am getting a NullPointerException. I can see that the instance is not getting initialized from the main class. So I tried initializing ConfigurationUtil in the AsyncDataStream operator; it got initialized, but after loading, the same NullPointerException occurs when accessing the instance data.
My main class, where I instantiate the class and define the operators:
ConfigurationUtil.getInstance().loadConfigurations(properties);

SingleOutputStreamOperator<Event> enrichedStream = AsyncDataStream
        .unorderedWait(eventStream, new AsyncExternalCalls(properties), 60, TimeUnit.SECONDS, 3)
        .name("External Enrichment")
        .uid("ExternalEnrichment");
This is inside my ConfigurationUtil:
private static ConfigurationUtil instance;
private ConfigurationDetails configurationDetails;

public void loadConfigurations(Properties properties) {
    // loading configuration data from db
    loadExternalEnrichCache(false, properties);
    loadDataEnrichCache(false, properties);
    loadDataEnrichUrlMangementCache(properties);
    loadErrorConfigCache(properties);
    saveProperties(properties);
    LOG.warn("configurationDetails: {}", this.configurationDetails);
}

public static ConfigurationUtil getInstance() {
    if (instance == null) {
        LOG.warn("Creating ConfigurationUtil Object");
        instance = new ConfigurationUtil();
    }
    return instance;
}
I am using that cache inside my AsyncDataStream operator:
ExternalEnrichmentDetailsCache extDetails = ConfigurationUtil.getInstance()
        .getConfigurationDetails()
        .getExternalEnrichmentDetails()
        .get(event.getEventId());
JSONArray extEnrichment = extDetails.getExternal();
Error Logs
2021-06-07 11:47:02,912 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher - Shutting down the shard consumer threads of subtask 0 ...
2021-06-07 11:47:02,914 WARN com.telstra.eov.enrichment.AsyncExternalCalls - Async External Enrichment - Close
2021-06-07 11:47:02,915 INFO org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher - Shutting down the shard consumer threads of subtask 0 ...
2021-06-07 11:47:02,916 INFO org.apache.flink.runtime.taskmanager.Task - Source: Kinesis Source -> Event Stream -> External Enrichment -> Update data enrichment details (1/1) (36afadaa2b9493386df71c09e46688b4) switched from RUNNING to FAILED.
java.lang.NullPointerException
at com.hide.hide.hide.ExternalEnrichment.processElement(ExternalEnrichment.java:110)
at com.hide.hide.hide.ExternalEnrichment.processElement(ExternalEnrichment.java:59)
at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:579)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at com.telstra.eov.enrichment.JsonToEventStream.processElement(JsonToEventStream.java:49)
at com.telstra.eov.enrichment.JsonToEventStream.processElement(JsonToEventStream.java:25)
at org.apache.flink.streaming.api.operators.ProcessOperator.processElement(ProcessOperator.java:66)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:579)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collectWithTimestamp(StreamSourceContexts.java:111)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.emitRecordAndUpdateState(KinesisDataFetcher.java:776)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.access$000(KinesisDataFetcher.java:92)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher$AsyncKinesisRecordEmitter.emit(KinesisDataFetcher.java:273)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher$SyncKinesisRecordEmitter$1.put(KinesisDataFetcher.java:288)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher$SyncKinesisRecordEmitter$1.put(KinesisDataFetcher.java:285)
at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.emitRecordAndUpdateState(KinesisDataFetcher.java:760)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.deserializeRecordForCollectionAndUpdateState(ShardConsumer.java:371)
at org.apache.flink.streaming.connectors.kinesis.internals.ShardConsumer.run(ShardConsumer.java:258)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2021-06-07 11:47:02,919 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Kinesis Source -> Event Stream -> External Enrichment -> Update data enrichment details (1/1) (36afadaa2b9493386df71c09e46688b4).
I see the close function is getting called before the NullPointerException. Not sure if that is the issue? Any help will be much appreciated. Let me know if any further information is required.
Flink distributes operators across multiple JVMs/servers, so using statics to "share" data doesn't work. Please see the Broadcast State Pattern page for how Flink supports sharing data.
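A minimal sketch of that pattern (the stream names, the String element types, and the "someKey" entry are placeholders): the periodically refreshed configuration is emitted on a control stream, broadcast to every parallel instance of the processing operator, stored in broadcast state, and read when processing events.
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

MapStateDescriptor<String, String> configDescriptor =
        new MapStateDescriptor<>("config", String.class, String.class);

// configStream is assumed to re-emit the configuration periodically,
// e.g. polled from the database every 2 hours.
BroadcastStream<String> configBroadcast = configStream.broadcast(configDescriptor);

DataStream<String> enriched = eventStream
        .connect(configBroadcast)
        .process(new BroadcastProcessFunction<String, String, String>() {
            @Override
            public void processElement(String event, ReadOnlyContext ctx, Collector<String> out)
                    throws Exception {
                // Read-only view of the broadcast state on the event side.
                String cfg = ctx.getBroadcastState(configDescriptor).get("someKey");
                out.collect(event + " / " + cfg);
            }

            @Override
            public void processBroadcastElement(String config, Context ctx, Collector<String> out)
                    throws Exception {
                // Every parallel instance receives each refreshed configuration.
                ctx.getBroadcastState(configDescriptor).put("someKey", config);
            }
        });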

"Could not complete snapshot" due to ConcurrentModificationException causes Flink pipeline to fail

After enabling checkpointing for our Flink pipeline, we regularly get the exception below, which causes the pipeline to fail.
The pipeline reads from Kafka, makes some stateless transformations (map) and then writes to HDFS via StreamingFileSink.
org.apache.flink.runtime.checkpoint.CheckpointException: Could not complete snapshot 1080 for operator foo -> bar -> Sink: Hadoop (1/2). Failure reason: Checkpoint was declined.
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:431)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1282)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1216)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:872)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:777)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:708)
at org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:88)
at org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:113)
at org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:155)
at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:102)
at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:47)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:135)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:279)
at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:301)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:406)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
at java.util.HashMap$EntryIterator.next(HashMap.java:1479)
at java.util.HashMap$EntryIterator.next(HashMap.java:1477)
at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:105)
at org.apache.flink.api.common.typeutils.base.MapSerializer.copy(MapSerializer.java:43)
at org.apache.flink.api.java.typeutils.runtime.PojoSerializer.copy(PojoSerializer.java:239)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.copy(StreamElementSerializer.java:105)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.copy(StreamElementSerializer.java:46)
at org.apache.flink.runtime.state.ArrayListSerializer.copy(ArrayListSerializer.java:73)
at org.apache.flink.runtime.state.PartitionableListState.<init>(PartitionableListState.java:68)
at org.apache.flink.runtime.state.PartitionableListState.deepCopy(PartitionableListState.java:80)
at org.apache.flink.runtime.state.DefaultOperatorStateBackendSnapshotStrategy.snapshot(DefaultOperatorStateBackendSnapshotStrategy.java:88)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend.snapshot(DefaultOperatorStateBackend.java:261)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:406)
... 17 more
Currently, there is just a single node, and checkpointing is configured to use the local filesystem:
state.backend: filesystem
state.checkpoints.dir: file://opt/flink/checkpoints
I am completely unsure how to deal with this error.
This is Flink 1.9.1.
