TaskManager killed on Flink cluster

I'm executing an Apache Flink program on a cluster of three nodes.
One of the nodes acts as both JobManager and TaskManager; the other two are TaskManagers only.
When I start my program (on the JobManager node) I get the following error, after about a minute in which the program makes no real progress:
java.lang.Exception: TaskManager was lost/killed: c4211322e77548b791c70d466c138a49 # giordano-2-2-100-1 (dataPort=37904)
at org.apache.flink.runtime.instance.SimpleSlot.releaseSlot(SimpleSlot.java:217)
at org.apache.flink.runtime.instance.SlotSharingGroupAssignment.releaseSharedSlot(SlotSharingGroupAssignment.java:533)
at org.apache.flink.runtime.instance.SharedSlot.releaseSlot(SharedSlot.java:192)
at org.apache.flink.runtime.instance.Instance.markDead(Instance.java:167)
at org.apache.flink.runtime.instance.InstanceManager.unregisterTaskManager(InstanceManager.java:212)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$handleTaskManagerTerminated(JobManager.scala:1228)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:1131)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:49)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:125)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:44)
at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
at akka.actor.ActorCell.invoke(ActorCell.scala:486)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
where giordano-2-2-100-1 is the hostname of the node running both the JobManager and a TaskManager.
I set the number of task slots equal to the number of machine cores (2) and the heap memory according to the available memory shown by meminfo.
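Concretely, this corresponds to something like the following in my flink-conf.yaml (the heap value here is illustrative; the key is taskmanager.heap.mb in older releases and taskmanager.heap.size from Flink 1.5 on):

taskmanager.numberOfTaskSlots: 2
# heap sized according to the memory reported by meminfo (illustrative value)
taskmanager.heap.mb: 2048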
During the execution (before the error appears) I watched the CPU usage and noticed that the two cores of the JobManager/TaskManager node are busy (at least 50% each, sometimes even 100% on one of them), while the other two nodes (TaskManagers only) are completely idle, with CPU usage around 0%.
I set the RPC address of the JobManager correctly and filled in the slaves file correctly with:
giordano-2-2-100-1
giordano-2-2-100-2
giordano-2-2-100-3
Moreover, I used ping from the master node to verify that the other nodes are reachable, and telnet from the TaskManagers to verify that the JobManager is reachable; in both cases everything is fine.
Honestly, I have no more ideas about what I'm doing wrong...
Furthermore, I tried to execute the program on my laptop (dual core), setting up a single-node cluster with the same configuration as the real cluster and the same jar. In this case everything works perfectly, so I'm fairly sure the problem is on the JobManager node.
P.S. On Stack Overflow I found this answer to the same problem: TaskManager loss/killed, but I don't understand how to set a different garbage collector.

This problem happened to me when a TaskManager ran out of memory and the GC took too much time trying to free some memory.
Try to use more RAM or decrease the memory requirements of your tasks.
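If it helps, both knobs live in flink-conf.yaml; a minimal sketch for a standalone cluster (the heap key and value depend on your Flink version and machines, and newer releases also offer env.java.opts.taskmanager to target only the TaskManagers):

# give each TaskManager more heap (taskmanager.heap.size from Flink 1.5 on)
taskmanager.heap.mb: 4096
# run the Flink JVMs with a different garbage collector, e.g. G1
env.java.opts: -XX:+UseG1GC

Restart the cluster afterwards so the new settings are picked up.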

Related

How do I understand why my Flink TaskManager quits shortly after starting my job?

I'm using the Flink 1.15 Docker images in Session mode, pretty much as in the Compose documentation, and I have one Task Manager. A few minutes after starting my streaming job I get a stack dump log message from my Job Manager stating that the Task Manager is no longer reachable, and I see that my Task Manager Docker container has exited with code 137, which possibly indicates an out-of-memory error. However, docker inspect shows the OOMKilled flag as false, indicating some other kind of issue.
End of stack trace from Job Manager log:
Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id 172.18.0.5:44333-7c7193 is no longer reachable.
The TaskManager Docker logs produce no error whatsoever before exiting. If I resurrect the dead Task Manager Docker container and have a look at the log file in /opt/flink/logs/ then the last messages state that the various components in my pipeline have switched from INITIALIZING to RUNNING.
I would have expected an out of memory stack dump from the task manager if my state had become too large. Also docker inspect shows that the container did not exit because of an out of memory error.
I have no idea what causes my Task Manager to die. Any ideas how I can figure out what is causing the issue? (This happens on 1.15.1 & 1.15.2. I haven't used any other version of Flink.)
This problem happened to me when a TaskManager ran out of memory and the GC took too much time trying to free some memory.
I know you said docker inspect doesn't show that it shuts down because of memory issues, but still try to use more RAM or decrease the memory requirements of your tasks and see if it still crashes.
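If you want to rule memory out on the Docker side, the official images read a FLINK_PROPERTIES environment variable, so you can raise the TaskManager's memory budget in docker-compose.yml; a hedged sketch (service name and size are illustrative, and any container memory limit you set must be at least as large as the process size):

  taskmanager:
    image: flink:1.15
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
        taskmanager.memory.process.size: 2g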
I ended up using nothing more sophisticated than trial and error with a variety of different test jobs. I'm not 100% sure I fixed the problem, as the issue of the Task Manager crashing without a stack dump only occurred sporadically. However, the Task Manager hasn't crashed for several days.
The simplest job that recreated my issue used a SourceFunction outputting a continuous stream of incrementing Longs straight to a DiscardingSink. With this setup the Task Manager would sporadically crash after a while on my Linux machine, but never on my Mac.
If I added a Thread.sleep to the SourceFunction's run loop, the crash would still eventually occur, just a bit later.
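In case anyone wants to reproduce it, the test job was essentially the following (a sketch, with illustrative class and job names):

// Minimal test job: a SourceFunction emitting incrementing Longs into a DiscardingSink.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class IncrementingLongsJob {

    // Emits an ever-increasing sequence of Longs until cancelled.
    public static class IncrementingLongSource implements SourceFunction<Long> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long value = 0L;
            while (running) {
                ctx.collect(value++);
                // Optional throttling; with a short sleep here the crash still
                // happened, just later.
                // Thread.sleep(1);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.addSource(new IncrementingLongSource())
           .addSink(new DiscardingSink<>());
        env.execute("incrementing-longs-test");
    }
}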
I also tried the new Source framework instead of SourceFunction, where a SingleThreadMultiplexSourceReaderBase repeatedly calls fetch on a SplitReader to output the Longs. There have been fewer crashes since I did this, so it didn't fix the problem 100%.
I presume my SourceFunction was overfilling some sort of buffer or making a task slot unresponsive as it never relinquished a slot once it started. (Or some other completely different explanation.)
I wish the Task Manager gave some sort of indication why it stopped running.

Embedded Infinispan heap memory keeps increasing and does not go down

I am using Infinispan 10.1.6.Final with JGroups over TCP, and data is replicated to all the remaining nodes.
Issue: after some span of time the heap memory grows and never comes back down. Because of that, the app server hangs, resulting in many restarts.
Before the restarts I observed some of the following errors:
ERROR org.infinispan.interceptors.impl.InvocationContextInterceptor- ISPN000136:
org.infinispan.remoting.RemoteException: ISPN000217:
(screenshot: heap usage)

Sage Maker Studio CPU Usage

I'm working in sage maker studio, and I have a single instance running one computationally intensive task:
It appears that the kernel running my task is maxed out, but the actual instance is only using a small amount of its resources. Is there some sort of throttling occurring? Can I configure this so that more of the instance is utilized?
Your ml.c5.xlarge instance comes with 4 vCPU. However, Python only uses a single CPU by default. (Source: Can I apply multithreading for computationally intensive task in python?)
As a result, the overall CPU utilization of your ml.c5.xlarge instance is low. To utilize all the vCPUs, you can try multiprocessing.
The examples below are performed using a 2 vCPU + 4 GiB instance.
In the first picture, multiprocessing is not set up. The instance CPU utilization peaks at around 50%.
single processing:
In the second picture, I created 50 processes to be run simultaneously. The instance CPU utilization rises to 100% immediately.
multiprocessing:
It might be that something is off with the stats you're seeing, or that they show different time spans, or that the kernel has been assigned only a portion of the instance's total resources.
I suggest opening a terminal and running top to see what's actually going on and which UI stat it matches (note that you're opening the instance's terminal, not the Jupyter UI instance terminal).

Flink slot removed exception

I am getting the following exception
org.apache.flink.util.FlinkException: The assigned slot container_1546939492951_0001_01_003659_0 was removed.
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:789)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:759)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:951)
at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:823)
at org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:346)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
when running a batch process involving joining two very large datasets.
Here is what I can see in the overview. The failure happened on a task manager that did not get any input. Weirdly, the previous stage (partition -> flat map -> map) did not send anything to that task manager despite having a rebalance in front of it.
I am running it on EMR. I see that there is a slot.idle.timeout setting; would that have an effect, and if so, how do I specify it for that job? Can it be done on the command line?
It's possible that this is a timeout issue, but usually when this happens to me it's because there's a failure (e.g. YARN kills the container because it's running beyond pmem or vmem limits). I'd recommend carefully checking the JobManager and all TaskManager log files.
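To answer the slot.idle.timeout part: it is a cluster-level option (in milliseconds), so the simplest place to set it is flink-conf.yaml; on YARN it should also be possible to pass it as a dynamic property at submission time. A sketch (the value is illustrative and the CLI flag for dynamic properties depends on your Flink version):

# flink-conf.yaml: keep idle slots around longer (milliseconds)
slot.idle.timeout: 300000

# or, on YARN, as a dynamic property when submitting (flag varies by version, e.g. -yD)
flink run -m yarn-cluster -yD slot.idle.timeout=300000 <your-jar>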
You can add the following line to your Java code (ExternalizedCheckpointCleanup is the nested enum CheckpointConfig.ExternalizedCheckpointCleanup):
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
Then externalized checkpoints are retained when the job is cancelled, so the job can be restarted from the last checkpoint instead of from scratch.
I had a similar issue, which turned out to be excessive logging in our Flink job. I am guessing this resulted in Task Manager timeouts. Removing or reducing the amount of logging fixed the issue.
I just had a similar issue when running Flink on Kubernetes; it turned out that the TaskManager was OOMKilled and restarted. If you also run Flink on Kubernetes, you can check the status of your TaskManager pods:
kubectl describe pods <pod>
If you see that the container was previously OOMKilled, that could be the cause:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
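If that is the case, the usual fix is to give the TaskManager container more memory and keep Flink's own memory budget in sync with it; a hedged sketch (sizes are illustrative; taskmanager.memory.process.size exists from Flink 1.10 on):

# Flink configuration for the TaskManagers
taskmanager.memory.process.size: 2g

# TaskManager container spec: the limit should be at least the process size above
resources:
  limits:
    memory: 2304Mi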
This issue is not always caused by OOM and the container being killed by YARN. If, before your error, you have a log line like this: "Closing TaskExecutor connection container_e86_1590402668190_3503_01_000015 because: Container released on a lost node", I am guessing the problem is caused by the NodeManager going down. After about 10 minutes during which the Flink ResourceManager cannot communicate with the NodeManager, the ResourceManager starts to remove the slots and restarts the job (if you have a restart strategy).

Does a 'Broken pipe' exception cancel my job?

Currently I am running a Flink program on a remote cluster of 4 machines using 144 TaskSlots. After running for around 30 minutes I received the following error:
INFO org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet - Info server for jobmanager: Failed to write json updates for job b2eaff8539c8c9b696826e69fb40ca14, because org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io.EofException
at org.eclipse.jetty.io.UncheckedPrintWriter.setError(UncheckedPrintWriter.java:107)
at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:280)
at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:295)
at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.writeJsonUpdatesForJob(JobManagerInfoServlet.java:588)
at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.doGet(JobManagerInfoServlet.java:209)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
at org.eclipse.jetty.server.Server.handle(Server.java:352)
at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:905)
at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:427)
at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:78)
at org.eclipse.jetty.server.HttpConnection$Output.flush(HttpConnection.java:1139)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:86)
at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:154)
at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:258)
at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:107)
at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:271)
... 24 more
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:51)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:185)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:256)
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:849)
... 33 more
I know that java.io.IOException: Broken pipe means that the JobManager lost some kind of connection, so I guess the whole job failed and I have to restart it. Although I think the process is not running anymore, the web interface still lists it as running. Additionally, the JobManager is still present when I use jps to list my running processes on the cluster. So my question is whether my job is lost, and whether this error happens randomly sometimes or whether my program caused it.
EDIT: My TaskManagers still send Heartbeats every few seconds and seem to be running.
It's actually a problem of the JobManagerInfoServlet, Flink's web server, which cannot send the latest JSON updates of the requested job to your browser because of the java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method). Thus, only the GET request to the server failed.
Such a failure should not affect the execution of the currently running Flink job. Simply refreshing your browser (with Flink's web UI) should send another GET request which then hopefully completes successfully.
