Flink slot removed exception - apache-flink

I am getting the following exception
org.apache.flink.util.FlinkException: The assigned slot container_1546939492951_0001_01_003659_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:789)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:759)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:951)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:823)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:346)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
when running a batch process involving joining two very large datasets.
Here is what I can see in the overview: the failure happened on a task manager which did not receive any input. Weirdly, the previous set of operators (partition -> flat map -> map) did not send anything to that task manager despite having a rebalance in front of it.
I am running it on EMR. I see that there is a slot.idle.timeout setting; would that have an effect, and if so, how do I specify it for this job? Can it be done on the command line?

It's possible that this is a timeout issue, but usually when this happens to me it's because there was a failure (e.g. YARN killed the container because it was running beyond its pmem or vmem limits). I'd recommend carefully checking the JobManager and all TaskManager log files.
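If you do still want to experiment with slot.idle.timeout (the value is in milliseconds), it can typically be passed per job as a YARN dynamic property on the command line; a sketch, where the value and jar name are placeholders:
flink run -m yarn-cluster -yD slot.idle.timeout=60000 your-job.jar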

You can add the following line to your Java code:
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
so that the checkpoint is retained when the job is cancelled (or fails) and the job can be resumed from it.
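Note that the job is not resumed from the retained checkpoint automatically; as far as I know you restart it yourself from the checkpoint path, for example (the path is a placeholder):
flink run -s hdfs:///flink-checkpoints/<job-id>/chk-45 your-job.jar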

I had a similar issue which turned out to be excessive logging in our Flink job. I am guessing this resulted in Task Manager timeouts. Removing or reducing the amount of logging fixed the issue.

I just had a similar issue when running Flink on Kubernetes; it turned out that the TaskManager was OOMKilled and restarted. If you also run Flink on Kubernetes, you can check the status of your TaskManager pods:
kubectl describe pods <pod>
If you see that the container was previously OOMKilled, that could be the cause:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
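If that is the cause, one option (a sketch, assuming Flink's unified memory configuration and that the pod's memory limit is raised to match) is to give the TaskManager more memory in flink-conf.yaml:
taskmanager.memory.process.size: 4096m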

This issue is not always caused by OOM and the container being killed by YARN. If you have a log line like this before your error: "Closing TaskExecutor connection container_e86_1590402668190_3503_01_000015 because: Container released on a lost node", then my guess is that the problem was caused by the NodeManager going down. After about 10 minutes during which the Flink ResourceManager cannot communicate with the NodeManager, the ResourceManager will start to remove slots and restart the job (if you have a restart strategy configured).

Related

How do I understand why my Flink TaskManager quits shortly after starting my job?

I'm using the Flink 1.15 Docker images in Session mode, set up pretty much the same as in the Compose documentation. I have one Task Manager. A few minutes after starting my streaming job I get a stack dump log message from my Job Manager stating that the Task Manager is no longer reachable, and I see that my Task Manager Docker container has exited with code 137, which possibly indicates an out of memory error. However, docker inspect shows the OOMKilled flag as false, indicating some other sort of issue.
End of stack trace from Job Manager log:
Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 172.18.0.5:44333-7c7193 is no longer reachable.
The TaskManager Docker logs produce no error whatsoever before exiting. If I resurrect the dead Task Manager Docker container and have a look at the log file in /opt/flink/logs/ then the last messages state that the various components in my pipeline have switched from INITIALIZING to RUNNING.
I would have expected an out of memory stack dump from the task manager if my state had become too large. Also docker inspect shows that the container did not exit because of an out of memory error.
I have no idea what causes my Task Manager to die. Any ideas how I can figure out what is causing the issue? (This happens on 1.15.1 & 1.15.2. I haven't used any other version of Flink.)
This problem happened to me when a task manager ran out of memory and the GC took too much time trying to free some memory.
I know you said docker inspect doesn't show that it shut down because of memory issues, but still try to use more RAM or decrease the memory requirements of your tasks and see if it still crashes.
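If you are using the official Docker images, one way to try that (a sketch; the size is a placeholder, and FLINK_PROPERTIES is the mechanism the images provide for appending settings to flink-conf.yaml) is to raise the TaskManager memory in your Compose file:
FLINK_PROPERTIES="taskmanager.memory.process.size: 3072m"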
I ended up using nothing more sophisticated than trial and error with a variety of different test jobs. I'm not 100% sure I fixed the problem, as the issue of the Task Manager crashing without a stack dump occurred sporadically. However, the Task Manager hasn't crashed for several days.
The simplest job to recreate my issue was with a SourceFunction outputting a continuous stream of incrementing Longs straight to a DiscardingSink. With this setup the Task Manager would crash after a while on my Linux machine sporadically but never on my Mac.
If I added a Thread.sleep to the SourceFunction's run loop then the crash would still eventually occur, but it took a bit longer.
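For reference, roughly what that test job looked like (a minimal sketch; the class and names here are mine, not the exact code I ran):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CounterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new SourceFunction<Long>() {
            private volatile boolean running = true;

            @Override
            public void run(SourceContext<Long> ctx) throws Exception {
                long counter = 0L;
                while (running) {
                    ctx.collect(counter++);   // continuous stream of incrementing Longs
                    // Thread.sleep(1);       // optional throttle; only delayed the crash in my tests
                }
            }

            @Override
            public void cancel() {
                running = false;
            }
        }).addSink(new DiscardingSink<>());
        env.execute("incrementing-longs-to-discarding-sink");
    }
}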
I then tried the Source framework instead of SourceFunction, where a SingleThreadMultiplexSourceReaderBase repeatedly calls fetch on a SplitReader to output the Longs. There have been fewer crashes since I did this, but it didn't eliminate them entirely.
I presume my SourceFunction was overfilling some sort of buffer or making a task slot unresponsive as it never relinquished a slot once it started. (Or some other completely different explanation.)
I wish the Task Manager gave some sort of indication why it stopped running.

Flink application is running/alive always,but the job is gone

I am using detached and yarn-cluster mode to run the Flink application in job mode, as follows:
flink run -d -m yarn-cluster -yn 10 -ys 1 -yqu QueueA -c com.me.MyFlinkApplicaiton
The application starts up and the job in this application starts to consume messages from Kafka successfully.
After running smoothly for several hours, the Flink YARN application is still alive/running, but the job in this application disappears (there is no job/task running any more) and all the slots are freed.
My application is a simple read-from-Kafka-source -> sink-to-MongoDB application, and I have wrapped the whole sink function's invoke method in try/catch, so no exceptions are thrown from the sink function.
I didn't find any useful logs to investigate this problem, so I would like to ask what might cause this behavior.
OK, it looks like I have found the problem: I had specified the restart strategy in the code as
env.setRestartStrategy(RestartStrategies.noRestart())
When the TaskManager exits and the job is cancelled, Flink will not try to restart the job (or the TaskManager) because of this strategy.
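For comparison, a minimal sketch of configuring an automatic restart strategy instead (the attempt count and delay are placeholder values, not recommendations):
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import java.util.concurrent.TimeUnit;

env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        3,                                // number of restart attempts (placeholder)
        Time.of(10, TimeUnit.SECONDS)));  // delay between attempts (placeholder)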

What is Apache Flink's detached mode?

I saw this line in the Flink documentation but can't figure out what 'detached mode' means. Please help. Thanks.
Run example program in detached mode:
./bin/flink run -d ./examples/batch/WordCount.jar
The Flink CLI runs jobs either in blocking or detached mode. In blocking mode, the CliFrontend (client) process keeps running, blocked, waiting for the job to complete -- after which it will print out some information. In the example below I ran a streaming job, which I cancelled from the WebUI after a few seconds:
$ flink run target/oscon-1.0-SNAPSHOT.jar
Starting execution of program
Program execution finished
Job with JobID b02da01c30585bfbc86a23446559987f has finished.
Job Runtime: 8673 ms
If you run in blocking mode, you can kill the CliFrontend (e.g., with ctrl-C) if you like, and the job will be unaffected, so long as it has run far enough to have submitted the job to the cluster.
In detached mode, the CliFrontend submits the job to the cluster and then exits straight away.
That means that the application is not attached (or bound) to your shell session, so if you close your terminal the application will still keep running (until it finishes its work). For a batch job that might not be a big problem - it will process the given batch of data and end afterwards. As soon as you switch to a streaming approach, the operations take place on an "infinite stream of data" and have no defined end.
Hope that helps.

taskmanager killed on Flink cluster

I'm executing an Apache Flink program on a cluster of three nodes.
One of these works as jobmanager and taskmanager too. The other two are just taskmanager.
When I start my program (I do it on the jobmanager) I obtain the following error (after about a minute during which the program does not really execute):
java.lang.Exception: TaskManager was lost/killed: c4211322e77548b791c70d466c138a49 # giordano-2-2-100-1 (dataPort=37904)
at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
where giordano-2-2-100-1 is the address of the job-task manager.
I set the number of task slots equal to the number of machine cores (2) and the heap memory according to the available memory shown by meminfo.
During the execution (before the error appears) I watched CPU usage and noticed that the two cores of the job/task manager node are working (at least 50% each, sometimes even 100% for one of them), while the other two nodes (the pure task managers) are completely free, with CPU usage around 0%.
I correctly set the RPC address of the jobmanager and correctly filled the slaves file with:
giordano-2-2-100-1
giordano-2-2-100-2
giordano-2-2-100-3
Moreover, I used ping from the master node to verify that the other nodes are reachable, and telnet from the task managers to verify that the job manager is reachable; in both cases everything is OK.
Honestly I have no more ideas about what I'm doing wrong...
Furthermore, I tried to execute the program on my laptop (dual core), setting up a single-node cluster with the same configuration as the real cluster and the same jar. In this case everything works perfectly, so I'm quite sure the problem is in the job manager.
P.S. On Stack Overflow I found this reply to the same problem: TaskManager loss/killed, but I don't understand how to set a different garbage collector.
This problem happened to me when a task manager ran out of memory and the GC took too much time trying to free some memory.
Try to use more RAM or decrease the memory requirements of your tasks.
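On the garbage collector question: one common way (a sketch, not a recommendation specific to this error, and assuming a Flink version that supports the key) to pass JVM options such as a different GC to the TaskManagers is via flink-conf.yaml:
env.java.opts.taskmanager: -XX:+UseG1GC
On versions without the per-process key, env.java.opts applies the options to both the JobManager and the TaskManagers.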

Does a 'Broken pipe' exception cancel my job?

Currently I am running a Flink program on a remote cluster of 4 machines using 144 TaskSlots. After running for around 30 minutes I received the following error:
INFO org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet - Info server for jobmanager: Failed to write json updates for job b2eaff8539c8c9b696826e69fb40ca14, because
org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io.EofException
at org.eclipse.jetty.io.UncheckedPrintWriter.setError(UncheckedPrintWriter.java:107)
at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:280)
at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:295)
at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.writeJsonUpdatesForJob(JobManagerInfoServlet.java:588)
at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.doGet(JobManagerInfoServlet.java:209)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
at org.eclipse.jetty.server.Server.handle(Server.java:352)
at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:905)
at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:427)
at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:78)
at org.eclipse.jetty.server.HttpConnection$Output.flush(HttpConnection.java:1139)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:86)
at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:154)
at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:258)
at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:107)
at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:271)
... 24 more
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:185)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:256)
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:849)
... 33 more
I know that java.io.IOException: Broken pipe means that the JobManager lost some kind of connection, so I guess the whole job failed and I have to restart it. Although I think the process is not running anymore, the WebInterface still lists it as running. Additionally, the JobManager is still present when I use jps to identify my running processes on the cluster. So my question is whether my job is lost, and whether this error happens randomly sometimes or whether my program caused it.
EDIT: My TaskManagers still send Heartbeats every few seconds and seem to be running.
It's actually a problem in the JobManagerInfoServlet, Flink's web server, which could not send the latest JSON updates for the requested job to your browser because of the java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method). Thus, only the GET request to the server failed.
Such a failure should not affect the execution of the currently running Flink job. Simply refreshing your browser (with Flink's web UI) should send another GET request which then hopefully completes successfully.
