Flink TaskManager livenessProbe doesn't work - apache-flink

I'm following this doc to configure probes for JobManager and TaskManager on Kubernetes.
JobManager works perfectly, but the TaskManager does not. I noticed in the pod events that the liveness probe failed:
Normal Killing 3m36s kubelet, gke-dagang-test-default-pool-494df2ba-vhs5 Killing container with id docker://taskmanager:Container failed liveness probe.. Container will be killed and recreated.
Warning Unhealthy 37s (x8 over 7m37s) kubelet, gke-dagang-test-default-pool-494df2ba-vhs5 Liveness probe failed: dial tcp 10.20.1.54:6122: connect: connection refused
I'm wondering: does the TaskManager actually listen on port 6122?
Flink version: 1.9.0

Turns out it was because I didn't add taskmanager.rpc.port: 6122 in flink-config.yaml; after adding it, everything works perfectly.
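For reference, a minimal sketch of the relevant config entry (assuming the default RPC port 6122 that the probe above targets):

# flink-conf.yaml (or the matching key in the flink-config.yaml ConfigMap)
taskmanager.rpc.port: 6122  # pin the TaskManager RPC port so the tcpSocket livenessProbe has a stable target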

Related

TFX/Apache Beam -> Flink jobs hang when running on more than one task manager

When I try to run a TFX pipeline / Apache Beam job on a Flink runner, it works fine when using one task manager (on one node) with parallelism 2 (2 task slots per task manager), but it hangs when I try higher parallelism across more than one task manager, with this message constantly repeating on both task managers:
INFO org.apache.beam.runners.fnexecution.environment.ExternalEnvironmentFactory [] - Still waiting for startup of environment from a65a0c5f8f962428897aac40763e57b0-1334930809.eu-central-1.elb.amazonaws.com:50000 for worker id 1-1
The Flink cluster runs on a native Kubernetes deployment on an AWS EKS Kubernetes Cluster.
I use the following parameters:
"--runner=FlinkRunner",
"--parallelism=4",
f"--flink_master={flink_url}:8081",
"--environment_type=EXTERNAL",
f"--environment_config={beam_sdk_url}:50000",
"--flink_submit_uber_jar",
"--worker_harness_container_image=none",
EDIT: Adding additional info about the configuration.
I have configured the Beam workers to run as sidecars (at least this is my understanding of how it should work) by setting the Flink parameter kubernetes.pod-template-file.taskmanager to point to a template file with these contents:
kind: Pod
metadata:
  name: taskmanager-pod-template
spec:
  #hostNetwork: true
  containers:
    - name: flink-main-container
      #image: apache/flink:scala_2.12
      env:
        - name: AWS_REGION
          value: "eu-central-1"
        - name: S3_VERIFY_SSL
          value: "0"
        - name: PYTHONPATH
          value: "/data/flink/src"
      args: ["taskmanager"]
      ports:
        - containerPort: 6122 #22
          name: rpc
        - containerPort: 6125
          name: query-state
      livenessProbe:
        tcpSocket:
          port: 6122 #22
        initialDelaySeconds: 30
        periodSeconds: 60
    - name: beam-worker-pool
      env:
        - name: PYTHONPATH
          value: "/data/flink/src"
        - name: AWS_REGION
          value: "eu-central-1"
        - name: S3_VERIFY_SSL
          value: "0"
      image: 848221505146.dkr.ecr.eu-central-1.amazonaws.com/flink-workers
      imagePullPolicy: Always
      args: ["--worker_pool"]
      ports:
        - containerPort: 50000
          name: pool
      livenessProbe:
        tcpSocket:
          port: 50000
        initialDelaySeconds: 30
        periodSeconds: 60
I have also created a Kubernetes load balancer for the task managers so clients can connect on port 50000, and I use that address when configuring:
f"--environment_config={beam_sdk_url}:50000",
EDIT 2: Looks like the Beam SDK harness on one task manager wants to connect to the endpoint running on the other task manager, but looks for it on localhost:
Log from beam-worker-pool on TM 2:
2021/08/11 09:43:16 Failed to obtain provisioning information: failed to dial server at localhost:33705
caused by:
context deadline exceeded
The provision endpoint on TM 1 is the one actually listening on port 33705, but the harness looks for it on localhost, so it cannot connect.
EDIT 3: Showing how I test this:
TM 1:
========
$ kubectl logs my-first-flink-cluster-taskmanager-1-1 -c beam-worker-pool
2021/08/12 09:10:34 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:33383', '--artifact_endpoint=localhost:43477', '--provision_endpoint=localhost:40983', '--control_endpoint=localhost:34793']
2021/08/12 09:13:05 Failed to obtain provisioning information: failed to dial server at localhost:40983
caused by:
context deadline exceeded
TM 2:
=========
$ kubectl logs my-first-flink-cluster-taskmanager-1-2 -c beam-worker-pool
2021/08/12 09:10:33 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:40497', '--artifact_endpoint=localhost:36245', '--provision_endpoint=localhost:32907', '--control_endpoint=localhost:46083']
2021/08/12 09:13:09 Failed to obtain provisioning information: failed to dial server at localhost:32907
caused by:
context deadline exceeded
Testing:
TM 1:
============
$ kubectl exec -it my-first-flink-cluster-taskmanager-1-1 -c beam-worker-pool -- bash
root@my-first-flink-cluster-taskmanager-1-1:/# curl localhost:40983
curl: (7) Failed to connect to localhost port 40983: Connection refused
root@my-first-flink-cluster-taskmanager-1-1:/# curl localhost:32907
Warning: Binary output can mess up your terminal. Use "--output -" to ...
TM 2:
=============
root@my-first-flink-cluster-taskmanager-1-2:/# curl localhost:32907
curl: (7) Failed to connect to localhost port 32907: Connection refused
root@my-first-flink-cluster-taskmanager-1-2:/# curl localhost:40983
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Not sure how to fix this.
Thanks,
Gorjan
It's not recommended to try to connect to the same environment from different task managers. We usually recommend setting up the Beam workers as sidecars to the task managers so there's a 1:1 correspondence, then connecting via localhost. See the example configs at https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_flink_cluster.yaml and https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_wordcount_py.yaml
I was able to fix this by setting the Beam SDK address to localhost instead of using a load balancer. So the config I use now is:
"--runner=FlinkRunner",
"--parallelism=4",
f"--flink_master={flink_url}:8081",
"--environment_type=EXTERNAL",
"--environment_config=localhost:50000", # <--- Changed the address to localhost
"--flink_submit_uber_jar",
"--worker_harness_container_image=none",

What is the reason for the error below in vespa.ai?

We are facing the below error in Vespa; it appeared after restarting the cluster.
1600455444.680758 10.10.000.00 1030/1 container Container.com.yahoo.filedistribution.fileacquirer.FileAcquirerImpl info Retrying waitFor for file 'e0ce64d459828eb0': 103 -- Request timed out after 60.0 seconds.
1600455446.819853 10.10.000.00 32752/146 configproxy configproxy.com.yahoo.vespa.filedistribution.FileReferenceDownloader info Request failed. Req: request filedistribution.serveFile(e0ce64d459828eb0,0)\nSpec: tcp/10.10.000.00:19070, error code: 103, set error for connection and use another for next request
We have faced this issue a second time. Earlier we left the cluster idle and the issue resolved itself, but this time it is persistent.
Looks like the configproxy is unable to talk to the config server (which is listening on port 19070 on the same host: Spec: tcp/10.10.000.00:19070). Is the config server really running and listening on port 19070 on this host? Try running the vespa-config-status script to see if all is well with the config system.

Flink job .UnfulfillableSlotRequestException: Could not fulfill slot req. Req resource profile (ResourceProfile{UNKNOWN}) is unfulfillable

Flink job submission
$ ./bin/flink run -m 10.0.2.4:6123 /streaming/mvn-flinkstreaming-scala/mvn-flinkstreaming-scala-1.0.jar
Stream processing!!!!!!!!!!!!!!!!!
org.apache.flink.streaming.api.datastream.DataStreamSink@40ef3420
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 31 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
... 29 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not fulfill slot request
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
But when I checked the logs of the job in the UI, I see a different error:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 31 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
... 29 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not fulfill slot request ea
What should I check? My configuration parameters are as below.
A) Is -m ip_address:6123 the right option, or should 8081 be the port?
Config:
# Note this accounts for all memory usage within the TaskManager process, including JVM metaspace and other overhead.
taskmanager.memory.process.size: 1568m
# To exclude JVM metaspace and overhead, please, use total Flink memory size instead of 'taskmanager.memory.process.size'.
# It is not recommended to set both 'taskmanager.memory.process.size' and Flink memory.
#
# taskmanager.memory.flink.size: 1280m
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 2
# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 2
Starting the cluster:
$ bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host centos1.
Starting taskexecutor daemon on host centos2.
Starting taskexecutor daemon on host centos3.
I am able to grep the Flink process on the master node:
]$ ps -ef | grep flink
root 12300 1 10 07:22 pts/0 00:00:05 java -Xms16384m -Xmx16384m -Dlog.file=/storage/flink-1.10.0/log/
However, I am unable to find a Flink process for the task managers:
centos2 ~]$ psg flink
Is this the correct state?
I have faced the same problem in Flink 1.10.0, so please make sure you have enough memory for your data load.
The error I was getting:
java.lang.OutOfMemoryError: Metaspace
jobmanager.heap.size: 1024m (Default)
taskmanager.memory.flink.size: 1280m (Default)
taskmanager.memory.jvm-metaspace.size: 256m (Default)
So I increased taskmanager.memory.jvm-metaspace.size according to the data load, and it solved my issue.
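As a sketch, the change in flink-conf.yaml looks like this (512m is an illustrative value; size it to your actual data load):

# flink-conf.yaml
jobmanager.heap.size: 1024m
taskmanager.memory.flink.size: 1280m
taskmanager.memory.jvm-metaspace.size: 512m  # raised from the 256m default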
I've encountered this problem before; it is most likely an indicator of insufficient memory on your Flink cluster. The different error messages also make sense, as they relate to each other.
Check jobmanager.heap.size and taskmanager.heap.size in the config and increase them to a generously oversized amount; you should not see this error anymore. From there you can fine-tune the actual memory settings.

Failed to submit JobGraph Apache Flink

I am trying to run the simple code below after building everything from Flink's GitHub master branch (for various reasons). I get the exception below, and I wonder: what runs on port 9065, and how do I fix this exception?
val dataStream = senv.fromElements(1, 2, 3, 4)
dataStream.countWindowAll(2).sum(0).print()
senv.execute("My streaming program")
Below is the exception:
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$18(RestClusterClient.java:306)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$222(RestClient.java:196)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connection refused: localhost/127.0.0.1:9065
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 16 more
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:9065
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
I built it from source in the following way (just following the instructions on Flink's GitHub page):
git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests
cd build-target
./bin/start-scala-shell.sh local
The underlying distributed runtime is currently being heavily reworked in master. Starting from 1.5, the default runtime will be the one known as FLIP-6, so occasionally some parts might not work. I think it would be very beneficial if you could create a JIRA ticket for this.
Just to add what runs on port 9065: in the new architecture it is the default port of the Dispatcher.
I had the same exception. My issue was a port conflict when starting the cluster with a Docker image on my machine, so I had changed the REST port in the Flink config file to 8084 instead of 8081. With that change the cluster would start up properly, but I was unable to submit the job. When I killed the conflicting process and reverted the port back to 8081, I could submit jobs successfully.
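For reference, the REST port that job submission goes through is a single key in flink-conf.yaml (8084 was only the workaround value for the local port conflict described above):

# flink-conf.yaml
rest.port: 8081  # the client submits the JobGraph here; was temporarily 8084 to dodge the conflict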
I got the same error. Use JDK 1.8 for Flink 1.7.2.

Which ports should I open in the firewall on nodes running Apache Flink?

When I try to run my flow on Apache Flink standalone cluster I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d @ app-2 - 4 slots - URL: akka.tcp://flink@192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
It seems that port 46369 is blocked by the firewall. That is plausible, because I followed the configuration section and opened only these ports:
6121:
comment: Apache Flink TaskManager (Data Exchange)
6122:
comment: Apache Flink TaskManager (IPC)
6123:
comment: Apache Flink JobManager
6130:
comment: Apache Flink JobManager (BLOB Server)
8081:
comment: Apache Flink JobManager (Web UI)
The same ports described in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
Is this exception related to blocked ports?
Which ports should I open on firewall for standalone Apache Flink cluster?
UPDATE 1
I found a configuration problem in the masters and slaves files (I had skipped the newline separators between the hosts listed in these files). I fixed it, and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs, the app-1.stag.local task manager can't connect to the other task manager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local does have the port open:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So I think the problem is firewall-related, but I don't understand where I can configure this port (or range of ports) in Apache Flink.
I found the problem: the taskmanager.data.port parameter is set to 0 by default, which means a random port (although the documentation says it should be set to 6121).
So I set this port in flink-conf.yaml, and now everything works fine.
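As a sketch, the resulting flink-conf.yaml entry (6121 matches the "Data Exchange" firewall rule already opened above):

# flink-conf.yaml
taskmanager.data.port: 6121  # pin the data-exchange port instead of the random default (0)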
