Is it possible to run Flink task managers on the Task nodes of AWS EMR? If yes, how different is it from running Task Managers on a core node?
Yes, you should be able to run TMs on task nodes. The only difference I'd expect is that EMR won't schedule the Flink Job Manager (JM) on a task node ("Amazon EMR ... allows application master processes to run only on core nodes").
If your workflow has sources that read from HDFS and/or sinks that write to HDFS, then subtasks of these operators running on task nodes might take longer, as task nodes don't run the Hadoop Data Node daemon, and thus all reads/writes are over the network.
Suppose all nodes running the Flink Job Manager are restarted at the same time. Is there any impact on the running Task Managers, which are left untouched?
Thanks!
The new job managers will restart all of the jobs from their latest checkpoints, using the information (job graphs, checkpoint metadata) they find in the HA service provider.
I'm reading the Flink official doc about Task Failure Recovery: https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html
As I understand it, this doc says that if a task fails for some reason, Flink can recover it with the help of the checkpoint mechanism.
So now I have two more questions:
What if a TaskManager fails? As I understand it, a task is assigned to one or more slots, and slots are located on one or more TaskManagers. After reading the doc above, I know that Flink can recover a failed task, but what happens if a TaskManager fails? Can Flink recover from that too? If a failed TaskManager is recovered, will the tasks assigned to it continue running automatically afterwards?
What if the JobManager fails? If the JobManager fails, will all of the TaskManagers fail too? If so, when I recover the JobManager with the help of Zookeeper, will all of the TaskManagers and their tasks continue running automatically?
In a purely standalone cluster, if a Task Manager dies, then if you had a standby task manager running, it will be used. Otherwise the Job Manager will wait for a new Task Manager to magically appear. Making that happen is up to you. On the other hand, if you are using YARN, Mesos, or Kubernetes, the cluster management framework will take care of making sure there are enough TMs.
As for Job Manager failures, in a standalone cluster you should run standby Job Managers, and configure Zookeeper to do leader election. With YARN, Mesos, and Kubernetes, you can let the cluster framework handle restarting the Job Manager, or run standbys, as you prefer, but in either case you will still need Zookeeper to provide HA storage for the Job Manager's metadata.
Task Managers can survive a Job Manager failure/recovery situation. The jobs don't have to be restarted.
https://ci.apache.org/projects/flink/flink-docs-stable/ops/jobmanager_high_availability.html
I run multiple jobs from my .jar file and want to share state between them. However, every job consumes all of the input (from Kafka) and generates duplicate output.
In the Flink dashboard, the 'records sent' count of every job is 3. I think this number should be split among my jobs.
I create each job with this command:
bin/flink run app.jar
How can I fix it?
Because of its focus on scalability and high performance, Flink state is local. Flink doesn't really provide a mechanism for sharing state between jobs.
However, Flink does support splitting up a large job among a fleet of workers. A Flink cluster is able to run a single job in parallel, using the resources of one or many multi-core CPUs. Some Flink jobs are running on thousands of cores, just to give an idea of its scalability.
When used with Kafka, each Kafka partition can be read by a different subtask in Flink, and processed by its own parallel instance of the pipeline.
You might begin by running a single instance of your job, with the desired parallelism, via
bin/flink run --parallelism <parallelism> app.jar
For this to succeed, your cluster will have to have at least as many free slots as the parallelism you request. The parallelism should be less than or equal to the number of partitions in the Kafka topic(s) being consumed. The Flink Kafka consumers will coordinate amongst themselves -- with each of them reading from one or more partitions.
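To see why the parallelism should not exceed the partition count, here is a rough sketch in Python. It is a toy model, not Flink's actual assignment logic: the helper names `assign_partitions` and `idle_subtasks` and the round-robin rule are assumptions made purely for illustration. It only shows that each partition is read by exactly one subtask, and that surplus subtasks have nothing to read.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, parallelism: int) -> Dict[int, List[int]]:
    """Toy model: spread Kafka partitions over consumer subtasks round-robin."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    assignment: Dict[int, List[int]] = {subtask: [] for subtask in range(parallelism)}
    for partition in range(num_partitions):
        # Each partition ends up with exactly one subtask (hypothetical rule).
        assignment[partition % parallelism].append(partition)
    return assignment


def idle_subtasks(assignment: Dict[int, List[int]]) -> List[int]:
    """Subtasks that received no partition and therefore read nothing."""
    return [subtask for subtask, parts in assignment.items() if not parts]


# 3 partitions, parallelism 2: every subtask has work.
print(assign_partitions(3, 2))                 # {0: [0, 2], 1: [1]}
# 3 partitions, parallelism 4: subtask 3 is idle.
print(idle_subtasks(assign_partitions(3, 4)))  # [3]
```

In this model, one job with parallelism 2 splits the partitions between its subtasks, whereas two separate jobs would each read every partition, which matches the duplicate output described in the question.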
In Apache Flink (e.g. v1.8), what is the difference between the Job Manager and the Job Master?
Job Manager and Job Master seem to be used analogously in the logs.
What is the difference between the Job Manager and the Job Master?
Thanks!
The JobManager is composed mainly of three components:
Dispatcher - provides the interface for submitting jobs and starts a JobMaster for each submitted job
Resource Manager - allocates the resources (task slots) required for the job
JobMaster - supervises and coordinates the tasks of a single Flink job
So, the JobMaster is part of the JobManager. As per the docs, a single JobManager is assigned to each individual Flink application, which can contain multiple Flink jobs.
For example, a Flink application with 2 jobs will instantiate 1 JobManager, which will contain 2 JobMasters.
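The containment relationship can be pictured with a small Python sketch. This is only a mental model under the description above, not Flink's real classes: the fields and the `submit` method are hypothetical and stand in for the actual components.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class JobMaster:
    """Stand-in for the component that supervises exactly one job."""
    job_name: str


@dataclass
class JobManager:
    """Stand-in for the process that hosts the three components."""
    dispatcher: str = "Dispatcher"
    resource_manager: str = "ResourceManager"
    job_masters: List[JobMaster] = field(default_factory=list)

    def submit(self, job_name: str) -> JobMaster:
        # Hypothetical: each submitted job gets its own JobMaster.
        master = JobMaster(job_name)
        self.job_masters.append(master)
        return master


# One application with two jobs: one JobManager, two JobMasters.
manager = JobManager()
manager.submit("job-a")
manager.submit("job-b")
print(len(manager.job_masters))  # 2
```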
JobManager and JobMaster have different roles.
For the JobManager, according to the JobManager Data Structures section of the documentation:
During job execution, the JobManager keeps track of distributed tasks, decides when to schedule the next task (or set of tasks), and reacts to finished tasks or execution failures.
The JobManager receives the JobGraph, which is a representation of the data flow consisting of operators (JobVertex) and intermediate results (IntermediateDataSet). Each operator has properties, like the parallelism and the code that it executes. In addition, the JobGraph has a set of attached libraries, that are necessary to execute the code of the operators.
The role of the JobMaster is more limited according to the Javadoc:
JobMaster implementation. The job master is responsible for the execution of a single JobGraph.
I have a Flink Standalone Cluster based on Flink 1.4.2 (1 job manager, 4 task slots) and want to submit two different Flink programs.
Not sure if this is possible at all, as some Flink archives say that a job manager can only run one job. If this is true, any ideas on how I can get around this issue? There is only one machine available for the Flink cluster and we don't want to use any resource manager such as Mesos or Yarn.
Any hints?
The Flink jobs (programs) run in task slots, which are located in a task manager. Assuming you have 4 task slots, you can run up to 4 Flink programs. Also, be careful with the parallelism of your Flink jobs: each parallel instance runs in a task slot. If you have set the parallelism to 2 in both jobs, then 2 is the maximum number of jobs you can run on 4 task slots.
Check this image: https://ci.apache.org/projects/flink/flink-docs-master/fig/slots_parallelism.svg
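The slot arithmetic can be sketched in Python as follows. This is a simplified model that assumes, as described above, that a job occupies as many slots as its parallelism. `jobs_that_fit` is a hypothetical helper and not part of any Flink API.

```python
from typing import List


def jobs_that_fit(total_slots: int, parallelisms: List[int]) -> List[int]:
    """Return the indexes of jobs that can start, in submission order.

    Toy model: a job with parallelism p needs p free task slots.
    """
    free = total_slots
    started = []
    for index, parallelism in enumerate(parallelisms):
        if parallelism <= free:
            free -= parallelism
            started.append(index)
    return started


# 4 slots, two jobs with parallelism 2: both run, and the cluster is full.
print(jobs_that_fit(4, [2, 2]))        # [0, 1]
# A third job would find no free slot.
print(jobs_that_fit(4, [2, 2, 1]))     # [0, 1]
# With parallelism 1, four jobs fit.
print(jobs_that_fit(4, [1, 1, 1, 1]))  # [0, 1, 2, 3]
```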