Flink: all traffic goes to one Task Manager (in a cluster with 1 Job Manager + 2 Task Managers)

I have the following setup:
Installation type: Kubernetes 1.18
Flink version: 1.12
1 Job Manager
2 Task Managers
In the flink-conf.yaml of the task managers:
flink-conf.yaml: |
state.backend: rocksdb
blob.server.port: 6124
jobmanager.rpc.port: 6123
parallelism.default: 2
queryable-state.proxy.ports: 6125
taskmanager.numberOfTaskSlots: 2
taskmanager.rpc.port: 6122
jobmanager.memory.process.size: 2900m
taskmanager.memory.process.size: 2900m
jobmanager.web.address: 0.0.0.0
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
In the flink-conf.yaml of the job manager:
flink-conf.yaml: |
state.backend: rocksdb
blob.server.port: 6124
jobmanager.rpc.port: 6123
parallelism.default: 2
queryable-state.proxy.ports: 6125
taskmanager.rpc.port: 6122
jobmanager.memory.process.size: 2900m
taskmanager.memory.process.size: 2900m
jobmanager.web.address: 0.0.0.0
rest.address: 0.0.0.0
rest.bind-address: 0.0.0.0
rest.port: 8081
With the above configuration, only one task manager is active (i.e. receives traffic) while the other task manager remains idle, even when the number of events increases to an extreme level.
Please suggest if I am missing anything.

You have set the parallelism to 2, and given each task manager 2 slots. Thus a single task manager can provide the requested parallelism, and that's what will happen by default.
If you want the scheduler to behave differently, you could set
cluster.evenly-spread-out-slots: true
or you could reduce taskmanager.numberOfTaskSlots to 1.
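For reference, here is a minimal sketch of the first option as a flink-conf.yaml fragment (assuming the rest of the job manager configuration above stays unchanged):
# Spread slots across all registered task managers instead of
# filling up one task manager first.
cluster.evenly-spread-out-slots: true
# Alternative: give each task manager a single slot, so a job with
# parallelism 2 must use both task managers.
# taskmanager.numberOfTaskSlots: 1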

Related

Apache Flink and the Zookeeper high availability doesn't work as expected

I have deployed a standalone Flink (1.15.0) cluster with 3 masters and I am using ZooKeeper (3.5.0) to provide high availability. Here I share my flink.yml configuration:
high-availability: zookeeper
high-availability.storageDir: s3://bucket-name/flink
high-availability.zookeeper.quorum: zookeeper-dns:2181
state.checkpoints.dir: s3://bucket-name/flink/checkpoints
high-availability.cluster-id: flinkId
The problem is that when all 3 job managers fail for some reason (for example, the first one stops and then starts again, then the second one stops and starts again, and then the third one stops), the task managers can no longer connect to the job managers.
I can see this logs:
2022-09-01 23:22:50,616 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/resource_manager/connection_info'}.
2022-09-01 23:22:50,626 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/dispatcher/connection_info'}.
2022-09-01 23:22:50,698 WARN akka.remote.transport.netty.NettyTransport
2022-09-01 23:22:50,705 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@127.0.0.1:50505] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:50505]] Caused by: [java.net.ConnectException: Connection refused: /127.0.0.1:50505]

TFX/Apache Beam -> Flink jobs hang when running on more than one task manager

When I am trying to run a TFX pipeline/Apache Beam job on a Flink runner, it works fine when using 1 task manager (on one node) with parallelism 2 (2 task slots per task manager). But it hangs when I try it with higher parallelism on more than one task manager, with this message constantly repeating on both task managers:
INFO org.apache.beam.runners.fnexecution.environment.ExternalEnvironmentFactory [] - Still waiting for startup of environment from a65a0c5f8f962428897aac40763e57b0-1334930809.eu-central-1.elb.amazonaws.com:50000 for worker id 1-1
The Flink cluster runs on a native Kubernetes deployment on an AWS EKS Kubernetes Cluster.
I use the following parameters:
"--runner=FlinkRunner",
"--parallelism=4",
f"--flink_master={flink_url}:8081",
"--environment_type=EXTERNAL",
f"--environment_config={beam_sdk_url}:50000",
"--flink_submit_uber_jar",
"--worker_harness_container_image=none",
EDIT: Adding additional info about the configuration.
I have configured the Beam workers to run as side-cars (at least this is my understanding of how it should work) by setting the Flink parameter:
kubernetes.pod-template-file.taskmanager
It points to a template file with the following contents:
kind: Pod
metadata:
  name: taskmanager-pod-template
spec:
  #hostNetwork: true
  containers:
    - name: flink-main-container
      #image: apache/flink:scala_2.12
      env:
        - name: AWS_REGION
          value: "eu-central-1"
        - name: S3_VERIFY_SSL
          value: "0"
        - name: PYTHONPATH
          value: "/data/flink/src"
      args: ["taskmanager"]
      ports:
        - containerPort: 6122 #22
          name: rpc
        - containerPort: 6125
          name: query-state
      livenessProbe:
        tcpSocket:
          port: 6122 #22
        initialDelaySeconds: 30
        periodSeconds: 60
    - name: beam-worker-pool
      env:
        - name: PYTHONPATH
          value: "/data/flink/src"
        - name: AWS_REGION
          value: "eu-central-1"
        - name: S3_VERIFY_SSL
          value: "0"
      image: 848221505146.dkr.ecr.eu-central-1.amazonaws.com/flink-workers
      imagePullPolicy: Always
      args: ["--worker_pool"]
      ports:
        - containerPort: 50000
          name: pool
      livenessProbe:
        tcpSocket:
          port: 50000
        initialDelaySeconds: 30
        periodSeconds: 60
I have also created a Kubernetes load balancer for the task managers, so clients can connect on port 50000. So I use that address when configuring:
f"--environment_config={beam_sdk_url}:50000",
EDIT 2: Looks like the Beam SDK harness on one task manager wants to connect to the endpoint running on the other task manager, but looks for it on localhost:
Log from beam-worker-pool on TM 2:
2021/08/11 09:43:16 Failed to obtain provisioning information: failed to dial server at localhost:33705
caused by:
context deadline exceeded
The provision endpoint on TM 1 is the one actually listening on port 33705, but the worker looks for it on localhost, so it cannot connect to it.
EDIT 3: Showing how I test this:
...............
TM 1:
========
$ kubectl logs my-first-flink-cluster-taskmanager-1-1 -c beam-worker-pool
2021/08/12 09:10:34 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:33383', '--artifact_endpoint=localhost:43477', '--provision_endpoint=localhost:40983', '--control_endpoint=localhost:34793']
2021/08/12 09:13:05 Failed to obtain provisioning information: failed to dial server at localhost:40983
caused by:
context deadline exceeded
TM 2:
=========
$ kubectl logs my-first-flink-cluster-taskmanager-1-2 -c beam-worker-pool
2021/08/12 09:10:33 Starting worker pool 1: python -m apache_beam.runners.worker.worker_pool_main --service_port=50000 --container_executable=/opt/apache/beam/boot
Starting worker with command ['/opt/apache/beam/boot', '--id=1-1', '--logging_endpoint=localhost:40497', '--artifact_endpoint=localhost:36245', '--provision_endpoint=localhost:32907', '--control_endpoint=localhost:46083']
2021/08/12 09:13:09 Failed to obtain provisioning information: failed to dial server at localhost:32907
caused by:
context deadline exceeded
Testing:
.........................
TM 1:
============
$ kubectl exec -it my-first-flink-cluster-taskmanager-1-1 -c beam-worker-pool -- bash
root@my-first-flink-cluster-taskmanager-1-1:/# curl localhost:40983
curl: (7) Failed to connect to localhost port 40983: Connection refused
root@my-first-flink-cluster-taskmanager-1-1:/# curl localhost:32907
Warning: Binary output can mess up your terminal. Use "--output -" to ...
TM 2:
=============
root@my-first-flink-cluster-taskmanager-1-2:/# curl localhost:32907
curl: (7) Failed to connect to localhost port 32907: Connection refused
root@my-first-flink-cluster-taskmanager-1-2:/# curl localhost:40983
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Not sure how to fix this.
Thanks,
Gorjan
It's not recommended to try to connect to the same environment with different task managers. Usually we recommend setting up the Beam workers as side cars to the task managers so there's a 1:1 correspondence, then connecting via localhost. See the example config at https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_flink_cluster.yaml and https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/examples/beam/without_job_server/beam_wordcount_py.yaml
I was able to fix this by setting the Beam SDK address to localhost instead of using a load balancer. So the config I use now is:
"--runner=FlinkRunner",
"--parallelism=4",
f"--flink_master={flink_url}:8081",
"--environment_type=EXTERNAL",
"--environment_config=localhost:50000", # <--- Changed the address to localhost
"--flink_submit_uber_jar",
"--worker_harness_container_image=none",

Flink 1.10.0 - The heartbeat of ResourceManager with id xxxx timed out

I am running a standalone Flink HA cluster in Kubernetes. The same setup runs perfectly when using Flink 1.9, but I am getting the error below continuously when using Flink 1.10.
INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 783439e4ead380c60498e32a8e1c0ce3 timed out.
DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 783439e4ead380c60498e32a8e1c0ce3.
org.apache.flink.runtime.taskexecutor.exceptions.TaskManagerException: The heartbeat of ResourceManager with id 783439e4ead380c60498e32a8e1c0ce3 timed out.
at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:1842)
at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
flink-conf.yaml :
jobmanager.rpc.address: xx.xxx.xx.xxx
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1500m
taskmanager.memory.process.size: 4000m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
jobmanager.execution.failover-strategy: region
state.backend: filesystem
state.checkpoints.dir: file:///checkpoints
state.savepoints.dir: file:///savepoints
high-availability: zookeeper
high-availability.jobmanager.port: 50010
high-availability.zookeeper.quorum: xx.xx.xx.xx:xxxx
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /ABCD
high-availability.storageDir: file:///recovery
heartbeat.interval: 60000
heartbeat.timeout: 60000
taskmanager.debug.memory.log: true
taskmanager.debug.memory.log-interval: 10000
taskmanager.memory.managed.fraction: 0.1
blob.server.port: 6124
query.server.port: 6125

Flink TaskManager livenessProbe doesn't work

I'm following this doc to configure probes for JobManager and TaskManager on Kubernetes.
JobManager works perfectly, but TaskManager doesn't work. I noticed in the pod log that the liveness probe failed:
Normal Killing 3m36s kubelet, gke-dagang-test-default-pool-494df2ba-vhs5 Killing container with id docker://taskmanager:Container failed liveness probe.. Container will be killed and recreated.
Warning Unhealthy 37s (x8 over 7m37s) kubelet, gke-dagang-test-default-pool-494df2ba-vhs5 Liveness probe failed: dial tcp 10.20.1.54:6122: connect: connection refused
I'm wondering, does the TM actually listen on 6122?
Flink version: 1.9.0
Turns out it was because I didn't add taskmanager.rpc.port: 6122 to flink-conf.yaml; now it works perfectly.
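For anyone hitting the same issue, here is a minimal sketch of the two pieces that have to agree; the probe values are illustrative, not taken from the linked doc:
# flink-conf.yaml: pin the task manager RPC port so the probe has a fixed target
taskmanager.rpc.port: 6122

# task manager pod spec: probe that same port
livenessProbe:
  tcpSocket:
    port: 6122            # must match taskmanager.rpc.port
  initialDelaySeconds: 30
  periodSeconds: 60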

Which ports should I open in the firewall on nodes with Apache Flink?

When I try to run my flow on an Apache Flink standalone cluster, I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d @ app-2 - 4 slots - URL: akka.tcp://flink@192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
It seems like port 46369 is blocked by the firewall. That is plausible, because I read the configuration section and opened only these ports:
6121:
  comment: Apache Flink TaskManager (Data Exchange)
6122:
  comment: Apache Flink TaskManager (IPC)
6123:
  comment: Apache Flink JobManager
6130:
  comment: Apache Flink JobManager (BLOB Server)
8081:
  comment: Apache Flink JobManager (Web UI)
The same ports are described in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
Is this exception related to blocked ports?
Which ports should I open in the firewall for a standalone Apache Flink cluster?
UPDATE 1
I found a configuration problem in the masters and slaves files (I had skipped the newline separators between the hosts listed in these files). I fixed it and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs, the app-1.stag.local task manager can't connect to the other task manager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local has the port open:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So I think the problem is related to the firewall, but I don't understand where I can configure this port (or range of ports) in Apache Flink.
I have found the problem: the taskmanager.data.port parameter was set to 0 by default (but the documentation says it should be set to 6121).
So I set this port in flink-conf.yaml and now everything works fine.
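As a hedged sketch, here is a flink-conf.yaml fragment that pins every port mentioned in this question so fixed firewall rules can cover them (the values simply mirror the firewall list and configuration shown earlier):
# Pin ports that Flink would otherwise choose at random, so fixed
# firewall rules can be written for them.
taskmanager.data.port: 6121   # task manager data exchange
taskmanager.rpc.port: 6122    # task manager IPC
jobmanager.rpc.port: 6123     # job manager RPC
blob.server.port: 6130        # BLOB server
jobmanager.web.port: 8081     # web UI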
