Flink: add Task/JobManagers to cluster

The procedure for adding new Task/JobManagers to an existing running cluster is documented here (https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cluster_setup.html#adding-jobmanagertaskmanager-instances-to-a-cluster).
However, if we shut down the cluster and start it again, the information about the added hosts is lost.
Is it safe practice, while adding a new host to the running cluster, to also update and save the "masters" and "slaves" configuration files on all nodes in parallel?

Yes, it is absolutely safe. The masters and slaves files are only read by the startup scripts.
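For concreteness, a minimal sketch of what that looks like (host names are placeholders): the files are plain lists of hosts, consulted by bin/start-cluster.sh and bin/stop-cluster.sh, while a new host is attached to the live cluster with the per-daemon script, as in the linked docs.

    # conf/masters -- one JobManager host[:webui-port] per line
    jobmanager-host:8081

    # conf/slaves -- one TaskManager host per line
    worker-1
    worker-2
    worker-3        # newly added host; saving it here survives a cluster restart

    # On the new host, attach a TaskManager to the running cluster:
    bin/taskmanager.sh start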

Related

Is it possible to add a new embedded worker while the cluster is running on StateFun?

Here is the deal:
I'm trying to add a new (embedded) worker to a running cluster (Flink StateFun 2.2.1).
As you can see, the new task manager registers with the cluster:
[Screenshot of the newly deployed taskmanager]
But it doesn't initialize (it doesn't deploy sources).
What am I missing here? (Do the master and the workers have to have the same jar files, or should it be enough to deploy the taskmanager with the jar file?)
Any help would be appreciated,
Thx.
Flink supports two different approaches to rescaling: active and reactive.
Reactive mode is new in Flink 1.13 (released just this week), and works as you expected: add (or remove) a task manager, and your application will adjust to the new parallelism. You can read about elastic scaling and reactive mode in the docs.
Reactive mode is currently a work in progress, but might meet your needs.
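For reference, a minimal sketch of enabling reactive mode, assuming a standalone application-mode deployment (the job class name is a placeholder):

    # conf/flink-conf.yaml -- enable reactive mode (Flink 1.13+, standalone application mode only)
    scheduler-mode: reactive

    # Launch the job in application mode; parallelism is then derived from the
    # task slots currently available rather than from a fixed -p setting:
    ./bin/standalone-job.sh start --job-classname com.example.MyStatefulJob
    ./bin/taskmanager.sh start    # run again later on new hosts to scale out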
In broad strokes, for active mode rescaling you need to:
1. Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
2. Relaunch with the new parallelism, using the savepoint as the starting point (see the sketch below).
The exact details depend on how your cluster is deployed.
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground.
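With the CLI, those two steps look roughly like this (the savepoint directory, job ID, and jar name are placeholders):

    # 1. Stop the job, taking a savepoint of its state (prints the savepoint path):
    ./bin/flink stop --savepointPath s3://my-bucket/savepoints <jobId>

    # 2. Relaunch at the new parallelism, resuming from that savepoint:
    ./bin/flink run -s s3://my-bucket/savepoints/savepoint-xxxx -p 8 my-statefun-job.jar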
The above applies to rescaling statefun embedded functions. Being stateless, remote functions can be rescaled more straightforwardly.

Flink EMR Installation

I am new to Flink and trying to deploy it on an EMR cluster. I have a 3-node cluster (1 master and 2 slaves) and have not made any configuration changes, sticking with the defaults.
I am curious to understand the following points:
How do the master and slaves communicate with each other, given that I have not listed any IPs in conf/slaves on the master node?
I can see a Flink library on the master node (path: /usr/lib/flink) but cannot find a Flink library on the slave nodes. How does my code get executed on the slave nodes?
If required, I will change some configuration in conf/flink-conf.yaml according to my requirements. Do I need to make any other changes on the master or slave nodes apart from this?
See the Running flink-crawler in EMR wiki page for details on how we run a Flink streaming job on top of EMR. Note that in this mode Flink is running via YARN, thus the Flink conf/slaves file isn't being used. You should also take a look at the YARN Setup documentation to better understand how Flink runs on top of YARN.
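In YARN mode, worker hosts don't need a pre-installed Flink; YARN ships the Flink binaries and your job jar to the containers it allocates, which is why the library only appears on the master. A rough sketch of the two usual entry points (memory sizes and the jar name are placeholders, and exact flags vary across Flink versions):

    # Start a long-running Flink session on YARN (run on the EMR master node):
    ./bin/yarn-session.sh -jm 1024 -tm 4096 -s 4

    # Or submit a single job directly to YARN in per-job mode:
    ./bin/flink run -m yarn-cluster -p 8 my-streaming-job.jar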

DC/OS: modifying Cluster name post installation

I forgot to update the cluster name (cluster_name) in my boot node's genconf/config.yaml before deploying the DC/OS cluster. I was wondering if there's a configuration/properties file on the nodes (or a way using dcos-cli or etcd) that I need to change to update the cluster name string (that appears in the DC/OS UI). I'd appreciate any help.
version: DC/OS 1.8
nodes running on CoreOS
size: 3 masters and 11 agents
The cluster name that appears in the DC/OS interface is taken from the Mesos cluster name. According to this configuration generation file, it's possible to change the name through an environment variable. Obviously you're going to have to restart the Mesos masters one by one.
Important note: I have not had the chance to test this; if you are in a production environment I highly recommend you not do it.
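If you do try it, the relevant knob is the Mesos master's --cluster flag, which can be set through the MESOS_CLUSTER environment variable. A rough sketch, with the DC/OS file path and unit name given as assumptions to verify on your 1.8 install:

    # On each master node, one at a time:
    # Edit the Mesos master env file (assumed path; check your installation):
    sudo sed -i 's/^MESOS_CLUSTER=.*/MESOS_CLUSTER=my-new-cluster-name/' /opt/mesosphere/etc/mesos-master
    # Restart the Mesos master unit and wait for it to rejoin before moving on:
    sudo systemctl restart dcos-mesos-master.service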

Flink dynamic scaling

I am currently studying scalability on Flink. Starting from version 1.2.0, dynamic rescaling was introduced. I am looking at scaling a long-running job which reads data from a Kafka source.
Questions regarding dynamic rescaling.
To scale out my Flink application, for example by adding new task managers, must I restart the job / YARN session to use the newly added resources?
I think it's possible to write a YARN client that deploys new task managers and makes them talk to the job manager; is that already available in the existing Flink YARN client application?
Pardon me if these questions are too basic; I did go through the documentation, but I have to admit I have not been able to put the concepts together, even after some test deployments on YARN recently.
Currently (Flink 1.2), dynamic scaling means the capability to update an operator's parallelism, either for keyed state or for non-keyed state.
To scale out my Flink application, for example by adding new task managers, must I restart the job / YARN session to use the newly added resources? - Yes, the job has to be stopped first, the parallelism updated, and the job restarted. You do not have to worry about the state; Flink will handle it, including repartitioning.
I think it's possible to write a YARN client that deploys new task managers and makes them talk to the job manager; is that already available in the existing Flink YARN client application? - No, you cannot. This feature may be added in the future; currently, we cannot do that.
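In the Flink 1.2 era, that stop-update-restart cycle is done with a savepoint, roughly like this (the savepoint directory, job ID, and jar name are placeholders):

    # Cancel the job while taking a savepoint of its state:
    ./bin/flink cancel -s hdfs:///savepoints <jobId>

    # Start extra TaskManagers (or YARN containers) to add resources, then
    # resubmit the job at the new parallelism, restoring from the savepoint:
    ./bin/flink run -s hdfs:///savepoints/savepoint-xxxx -p 8 my-kafka-job.jar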

How do I remove a folder from Windows Distributed File System?

We recently moved to a web farm and set up DFS, only to find a beta application was creating files like there was no tomorrow. 1.2 million files were replicated across the farm, and since then we have prevented the application from creating new files, but every time we try to remove the files, replication replaces them on each server. The process of replacing them actually causes the servers to run slowly and in some cases stall.
Is there any way we can stop replication at a folder level?
Then I guess something in your setup is fishy, as FRS also replicates deletes across servers.
