Flink "Encountered error while consuming partitions" + "Connection reset by peer" - apache-flink

I have a Flink streaming job running 24/7. Several times per day, I see it fail and restart with the following log messages:
10:02:08.524 [Flink Netty Server (0) Thread 0] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
10:02:08.524 [Flink Netty Server (0) Thread 1] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
10:02:08.537 [Flink Netty Server (0) Thread 0] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
10:02:08.560 [Flink Netty Server (0) Thread 0] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
10:02:08.537 [Flink Netty Server (0) Thread 1] ERROR org.apache.flink.runtime.io.network.netty.PartitionRequestQueue - Encountered error while consuming partitions
org.apache.flink.shaded.netty4.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
Flink automatically restarts and runs successfully for several hours before this happens again.
Ideally, I'd prefer to fix this and avoid the periodic application restarts, or at least understand what is causing them. However, the application is generally working and it does recover with the automatic restart, so my boss can live with this as-is if needed.
I'm using Flink 1.14.4, which is the latest version as of this writing. I'm using the newer KafkaSource API, if that matters.
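For reference, the source is set up roughly like this (a minimal sketch; the broker address, topic, and group id are placeholders rather than my real values):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// New-style KafkaSource (Flink 1.14 connector API); connection values are placeholders.
KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka-broker:9092")        // placeholder
        .setTopics("input-topic")                        // placeholder
        .setGroupId("my-streaming-job")                  // placeholder
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

DataStream<String> stream =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");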
I see two related SO questions. They both mention increasing TaskManager memory. My TaskManager memory is already at 10 GB (configured roughly as in the snippet below the two links), and I don't see any memory-related errors or warnings in the TaskManager/JobManager logs.
flink Connection reset by peer
Flink Job suddenly crashed with error: Encountered error while consuming partitions
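For context, the 10 GB of TaskManager memory mentioned above is set in flink-conf.yaml roughly along these lines (a sketch; the exact keys in use on the cluster may differ):

# flink-conf.yaml (sketch): total memory of the TaskManager process
taskmanager.memory.process.size: 10g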

Related

Flink 1.16 Restart Strategy - DLQ for valid message after the maximum restarts

We have a restart strategy in our cluster. We have a Kafka broker connection issue that is resolvable after some retries, but if the retry limit is exceeded before the connection recovers, the in-flight message is lost because the job manager restarts the job. Is there any way to push that message into a DLQ, so that we can re-consume the same valid message from the DLQ later?
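For reference, a restart strategy of the kind mentioned here is typically configured along these lines (a minimal sketch; the attempt count and delay are placeholders, not the cluster's actual settings):

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Retry the whole job a fixed number of times with a delay between attempts.
// If the broker connection recovers within these attempts, the job resumes
// (from the last checkpoint, if checkpointing is enabled) instead of failing.
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        3,                                // placeholder: restart attempts
        Time.of(30, TimeUnit.SECONDS)));  // placeholder: delay between attempts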

Connecting to Snowflake with DataGrip - frequent timeout errors

When working with Snowflake through DataGrip I often get the following timeout error:
JDBC driver encountered communication error. Message: Exception encountered for HTTP request: Operation timed out.
Usually on a second attempt it will be fine, but I always need to wait for this timeout to occur before I can do anything. If I continue to run queries after getting the initial connection it will be fine, but if I leave the application for a few minutes, it again struggles to reconnect when I come back for the next query.
My suspicion is that this occurs when the Snowflake warehouse is suspended. It seems like it is waiting on some acknowledgment that the warehouse has been resumed.
Has anyone else encountered this?

DMS replication Error executing source loop; Stream component failed at subtask 0

I have set up a DMS replication task with:
Source - SQL Server on-prem
Target - AWS MySQL
Class - dms.c5.4xlarge
Engine version - 3.4.4
The task ran both full load and incremental (ongoing) replication. After a couple of hours the task started failing and the logs started filling up on my source system.
Last Error Fatal error has occurred Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:] [] Error executing source loop; Stream component failed at subtask 0, component ; Stream component ' terminated [reptask/replicationtask.c: Stop Reason FATAL_ERROR Error Level FATAL
Has anyone encountered a similar issue?

Flink job failed when it encountered a DB connection exception

I'm new to Flink. My Flink job receives messages from MQ, does some rule checking and summary calculation, then writes the results to an RDBMS. Sometimes the job encounters a NullPointerException (due to my silly code) or an MQ connection exception (due to a non-existent topic); in those cases it just halts processing of the current message, the job keeps running, and subsequent messages keep triggering the exception.
But today I restarted the DB and the job failed. What's the difference?
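The difference usually comes down to whether the exception escapes the user function. A minimal sketch of the pattern described above, assuming a simple flatMap-style operator; the class name and the applyRules() helper are illustrative only:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.util.Collector;

// Exceptions caught inside the operator only skip the current record, so the
// job keeps running. An exception that propagates out of a user function
// (for example, a DB connection failure in the sink) fails the task and
// restarts or fails the job, depending on the restart strategy.
public class RuleCheckFlatMap implements FlatMapFunction<String, String> {

    @Override
    public void flatMap(String message, Collector<String> out) {
        try {
            out.collect(applyRules(message));   // hypothetical per-record logic
        } catch (Exception e) {
            // swallow per-record errors (e.g. a NullPointerException) and move on
        }
    }

    private String applyRules(String message) {
        // placeholder for the rule check / summary calculation described above
        return message;
    }
}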

Running a Flink built-in program sometimes raises an exception: java.io.IOException: Connecting the channel failed

I have set up a Flink standalone cluster with one master and three slaves, all SUSE Linux machines. In the master dashboard at http://flink-master:8081/ I can see 3 Task Managers and 3 task slots, since I have set taskmanager.numberOfTaskSlots: 1 in flink-conf.yaml on all of the slaves.
When I run a Flink built-in program, like examples/streaming/Iteration.jar, I often get this exception:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Caused by: java.net.ConnectException: Connection refused: ccr202/127.0.0.2:49651
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 6 more
It seems that the network causes the problem, but sometimes the Flink program finishes successfully. So what is the reason?
I also encounter this issue very frequently, especially when there are many TaskManagers. There are a few configs I have tried to solve this issue. It happens when a TaskManager reads a remote partition through a Netty connection and the connection request times out. I increased the config "taskmanager.network.netty.server.numThreads", and that solved the issue.
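For reference, that option is a flink-conf.yaml setting; the value below is only an example and the right number depends on the cluster:

# flink-conf.yaml (sketch): number of Netty server threads used to serve
# partition requests; 8 is an example value, not a recommendation
taskmanager.network.netty.server.numThreads: 8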
