I am working on an Apache Flink (1.5.0) based streaming application.
As part of this I have launched Flink in local mode on my Windows machine.
In order to run my job with a parallelism of 8, I need 8 task managers providing one task slot each.
I added a task manager with the following command:
'/cygdrive/b/Binaries Flink/flink-1.5.0/bin/taskmanager.sh' start
The first few times, a task manager was added successfully with the following message:
[INFO] 3 instance(s) of taskexecutor are already running on ... .
Starting taskexecutor daemon on host ... .
After 5 task managers were available, I got the same message:
[INFO] 5 instance(s) of taskexecutor are already running on ... .
Starting taskexecutor daemon on host ... .
The problem is that a sixth task manager is never created.
When I stop one task manager, the count goes down to 4 and I can add one additional task manager, but never more than 5 task managers in total.
Is there any limit on the number of task managers?
Did anyone experience a similar behaviour?
Thank you very much
There is no limit on how many TaskManagers you can start locally. The only limit is the resources available on your local machine.
If you are using standalone mode in Flink 1.5.0, then you can also set the number of slots per TaskManager to 7 by adding the following line to flink-conf.yaml:
taskmanager.numberOfTaskSlots: 7
Related
We are currently running a Flink application on EMR 6.2.1, which runs on YARN. The Flink version is 1.11.2.
We're running at a parallelism of 180 with 65 task nodes in EMR. When we start up the YARN application we run the following:
flink-yarn-session --name NameOfJob --jobManagerMemory 2048 --taskManagerMemory 32000 --slots 4 --detached
When the Flink job is deployed, it takes about 15 minutes to start up and for all the subtasks to start running. We see several restarts and the following exception:
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id X timed out
The job then starts up but after about 20 hours of running the checkpoints start failing and again we see the exception above.
I have seen similar questions suggesting that the job manager memory may need to be increased, but I'm struggling to find guidance on how much to increase it or what the recommendations are. The other issue is that by the time we can look at the logs of the task manager that fails, it has already been killed and the logs are gone.
Any guidance on whether increasing the job manager memory will help, and in what increments we should do it, would be appreciated.
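For reference, the two configuration knobs usually involved here are the JobManager memory and the heartbeat timeout. The values below are illustrative starting points, not measured recommendations; the key names are the Flink 1.11 ones:

jobmanager.memory.process.size: 4096m
heartbeat.timeout: 120000

A common approach is to double the JobManager memory (here from 2048 MB to 4096 MB) and watch GC behaviour and heartbeat logs before increasing further; heartbeat.timeout defaults to 50000 ms in Flink 1.11, so raising it only helps when TaskManagers are alive but slow to respond. For the lost TaskManager logs, enabling YARN log aggregation lets you retrieve container logs with yarn logs -applicationId even after the container is killed.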
I am using Apache Flink 1.10 to batch-compute my stream data. Today I moved my Apache Flink Kubernetes (v1.15.2) pod from machine 1 to machine 2 and found that all submitted task records and the task list had disappeared. What happened? Is the submission record kept only in memory? What should I do to keep my submission records and task list when the Kubernetes pod of Apache Flink restarts? I have only found checkpoint persistence, but nothing about tasks.
If I lose the running task history, I must upload my task jar and recreate every task, and there are many tasks to recreate if the history is lost. Is there any way to resume the tasks automatically?
The configurations that might not be set are:
Job Manager
jobmanager.archive.fs.dir: hdfs:///completed-jobs
History Server
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
For more details, please see: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/historyserver.html#configuration
I am running Flink in Amazon EMR. In flink-conf.yaml, I have metrics.reporter.prom.port: 9249-9250
Depending on whether the job manager and task manager are running on the same node, the task manager metrics are reported on port 9250 (if running on the same node as the job manager) or on port 9249 (if running on a different node).
Is there a way to configure so that the task manager metrics are always reported on port 9250?
I saw a post saying that we can "provide each *Manager with a separate configuration." How do I do that?
Thanks
You can configure different ports for the JM and TM by starting the processes with differently configured flink-conf.yaml files.
On Yarn, Flink currently uses the same flink-conf.yaml for all processes.
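In standalone mode, one way to do this is to point the TaskManager at its own configuration directory via the FLINK_CONF_DIR environment variable, which Flink's startup scripts honour. A rough sketch, where the paths are assumptions for illustration:

cp -r /usr/lib/flink/conf /tmp/flink-conf-tm

Then set metrics.reporter.prom.port: 9250 in /tmp/flink-conf-tm/flink-conf.yaml and start the TaskManager with that copy, leaving the JobManager on the original configuration:

FLINK_CONF_DIR=/tmp/flink-conf-tm /usr/lib/flink/bin/taskmanager.sh start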
I want to deploy my jobs on a local Flink cluster during development (i.e. JobManager and TaskManager running on my development laptop), and use remote debugging. I tried adding
"-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" to the flink-conf.yaml file. Since job and task manager are running on the same machine, the task manager throws exception stating that the socket is already in use and terminates. Is there any way I can get this running.
You are probably setting env.java.opts, which affects all JVMs started by Flink. Since the jobmanager gets started first, it grabs the port before the taskmanager is started.
You can use env.java.opts.taskmanager to pass parameters only for taskmanager JVMs.
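As a sketch, the debug agent can then go into flink-conf.yaml under the TaskManager-only key; if you also want to debug the JobManager, env.java.opts.jobmanager takes a separate agent on a different port. Using suspend=n avoids blocking startup until a debugger attaches (port numbers here are just examples):

env.java.opts.taskmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
env.java.opts.jobmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006"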
I have a Flink v1.2 setup with 1 JobManager and 2 TaskManagers, each in its own VM. I configured the state backend to filesystem and pointed it to a local location on each of the above hosts (state.backend.fs.checkpointdir: file:///home/ubuntu/Prototype/flink/flink-checkpoints). I have set parallelism to 1, and each TaskManager has 1 slot.
I then run an event processing job on the JobManager which assigns it to a TaskManager.
I kill the TaskManager running the job, and after a few unsuccessful attempts on the failed TaskManager, Flink tries to run the job on the remaining TaskManager. At this point it fails again because it cannot find the corresponding checkpoints / state: java.io.FileNotFoundException: /home/ubuntu/Prototype/flink/flink-checkpoints/56c409681baeaf205bc1ba6cbe9f8091/chk-14/46f6e71d-ebfe-4b49-bf35-23c2e7f97923 (No such file or directory)
The folder /home/ubuntu/Prototype/flink/flink-checkpoints/56c409681baeaf205bc1ba6cbe9f8091 only exists on the TaskManager that I killed and not on the other one.
My question is: am I supposed to set the same location for checkpointing / state on all the task managers if I want the above functionality?
Thanks!
The checkpoint directory you use needs to be shared across all machines that make up your Flink cluster. Typically this would be something like HDFS or S3 but can be any shared filesystem.
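For example, using the Flink 1.2 configuration keys from the question, pointing the checkpoint directory at HDFS instead of a node-local path would look roughly like this (the namenode address and path are placeholders):

state.backend: filesystem
state.backend.fs.checkpointdir: hdfs://namenode:9000/flink/flink-checkpoints

With a shared path like this, whichever TaskManager takes over the job can read the checkpoint files written before the failure.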