When using Apache Flink, we can configure values in flink-conf.yaml. Using CLI commands, we can also assign some values dynamically when starting or submitting a job or a task in Flink, for example:
bin/taskmanager.sh start-foreground -Dtaskmanager.numberOfTaskSlots=12
But some values, like jobmanager.memory.process.size and taskmanager.memory.process.size, cannot be set dynamically using "-D".
Is there a way to set those values dynamically when starting the JobManager and TaskManager from the CLI?
Maybe this will help:
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties"); // one can specify the properties defined in conf/flink-conf.yaml in this properties file
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
env.setParallelism(3);
System.out.println("Config Params : " + config.toMap());
Note that you should set the parallelism equal to the number of task slots per task manager, as per this link. By default, the number of task slots for a task manager is one.
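For illustration, here is a minimal Java sketch of the same idea built programmatically; the slot count of 3 is an arbitrary assumption, and the key is the standard taskmanager.numberOfTaskSlots from conf/flink-conf.yaml:
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration config = new Configuration();
config.setInteger("taskmanager.numberOfTaskSlots", 3); // same key you would put in conf/flink-conf.yaml or application.properties
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
env.setParallelism(3); // keep the parallelism equal to the slot count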
Related
I'm running Apache Flink version 1.12.7 and have configured the streaming execution environment with the number of task slots per task manager = 3 (just experimenting), but I am unable to see the output of a file read by the environment. Instead, as seen in the logs, the execution graph is stuck in the SCHEDULED state and does not get into the RUNNING state.
Note that if no configuration is passed in the properties file, everything works fine and the output is seen, since the execution graph switches to the RUNNING state and the environment is able to read the file.
The code is as follows:
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties");
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
System.out.println("Config Params : " + config.toMap());

DataStream<String> inputStream = env.readTextFile(FILEPATH);

DataStream<String> filteredData = inputStream.filter((String value) -> {
    String[] tokens = value.split(",");
    return Double.parseDouble(tokens[3]) >= 75.0;
});
filteredData.print(); // no o/p seen if the configuration object is set, otherwise everything works as expected

env.execute("Filter Country Details");
I need help understanding this behaviour and knowing what changes should be made so that the output gets printed while keeping the custom configuration. Thank you.
Okay, so I found the answer to the above puzzle by referring to the links mentioned below.
Solution: I set the parallelism (env.setParallelism) in the above code just after configuring the streaming execution environment, and the file was read with output generated as expected.
After that, I experimented with a few things:
parallelism equal to the number of task slots = everything worked
parallelism greater than the number of task slots = intermittent results
parallelism less than the number of task slots = intermittent results
As per this link on the Flink architecture,
A Flink cluster needs exactly as many task slots as the highest parallelism used in the job
So it's best to keep the number of task slots for a task manager equal to the configured parallelism.
We are currently running a Flink cluster in standalone mode on Kubernetes. We want to explore whether we could migrate over to managed Flink on AWS (KDA).
But I can't find any documentation or indication that it is possible to inject environment variables. Do these need to be provided as runtime arguments?
Relatedly, is it possible in managed Flink to override the default Flink configuration that we currently specify in our flink-conf.yml?
Thanks in advance!
I'll answer my own question: it seems there is no way to provide environment variables in the same fashion you would with ConfigMaps in Kubernetes. Instead, we need to use the runtime properties that can be defined in KDA. These can then be retrieved using KinesisAnalyticsRuntime.getApplicationProperties().
As an example:
val params: ParameterTool = ParameterTool.fromArgs(args)
val config = params.get("env", "") match {
  case "local" => AppConfiguration.initialize(sys.env)
  case _ => // KDA
    val kdaProperties = KinesisAnalyticsRuntime.getApplicationProperties()
    logger.error(
      s"kdaProperties $kdaProperties",
      Some(Map("kda" -> kdaProperties))
    )
    Option(kdaProperties.get("DevProperties")) match {
      case Some(kdaProperties) =>
        val kdaPropsToMap = kdaProperties.asScala.toMap
        AppConfiguration.initialize(kdaPropsToMap)
      case None =>
        logger.error(s"could not read KDA runtime properties", Some(Map("kda" -> kdaProperties)))
        throw new Error(
          "unable to read KDA runtime properties"
        ) // scalafix:ok
    }
}
Here, the grouping key defined in KDA for the runtime properties ("DevProperties" in this example) is used to fetch them.
This also means, as I understand it, that the settings we currently keep in flink-conf.yml can be provided as runtime properties, which then need to be applied at runtime.
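To illustrate that last point, here is a hedged Java sketch; the group name "FlinkApplicationProperties" and the key "parallelism" are assumptions chosen for this example, not anything KDA defines for you:
import java.util.Map;
import java.util.Properties;

import com.amazonaws.services.kinesisanalytics.runtime.KinesisAnalyticsRuntime;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Fetch all runtime property groups defined for the KDA application.
Map<String, Properties> groups = KinesisAnalyticsRuntime.getApplicationProperties();

// "FlinkApplicationProperties" is the grouping key we chose in the KDA console (assumption).
Properties flinkProps = groups.get("FlinkApplicationProperties");
if (flinkProps != null) {
    // Apply a setting that would otherwise live in flink-conf.yaml.
    env.setParallelism(Integer.parseInt(flinkProps.getProperty("parallelism", "1")));
}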
I am programmatically creating my CamelContext and successfully creating the ClusterServiceProvider using the KubernetesClusterService implementation. When running in my Kubernetes cluster, it elects a leader and responds appropriately to "master:" routes. All good.
However, I would like to limit the Pods/Deployments that are detected during cluster member negotiation/inspection. It currently has knowledge of (finds) every Pod in the cluster namespace, which includes completely unrelated deployments/instances.
The overall question is: how do I select which Pods/Deployments should be included in a particular Camel cluster?
I see that the KubernetesLockConfiguration has an attribute called clusterLabels, however it is unclear to me how or what it is used for. When I do set clusterLabels to something syntactically common in Kubernetes (e.g. app -> my-app), the cluster finds no members.
I mention that I am doing this programmatically because there is no Spring Boot or other commonly documented configuration of Camel involved. This is running in Play Framework with Scala.
First, I create a CamelContext
val context = new DefaultCamelContext()
val rb = new RouteBuilder() {
  def configure() = {
    val policy = ClusteredRoutePolicy.forNamespace("default")
    from(s"master:ip:timer://master-timer?fixedRate=true&period=60000")
      .routePolicy(policy)
      .bean(classOf[MasterTimer], "execute()")
      .log("Current Leader ${routeId}")
  }
}
Second, I create a ClusterServiceProvider
import org.apache.camel.component.kubernetes.cluster.KubernetesClusterService

val crc = new ClusteredRouteController()
val service = new KubernetesClusterService

service.setCamelContext(context)
service.setKubernetesNamespace("default")

//
// If I set clusterLabels here, no Camel cluster is realized/created.
// My assumption is that my syntax for Camel Kubernetes is wrong, however
// it is unclear from the documentation how to make this work.
//
// If I do not set clusterLabels, every pod in my Kubernetes cluster is
// part of a cluster member (CamelKubernetesLeaderNotifier logs that
// the list of cluster members has changed). So I get completely
// unrelated deployments in the context of something that I want
// to be specifically related, namely all pods with an "app" label of
// "my-app".
//
service.setClusterLabels(Map("app" -> "my-app").asJava)

crc.setNamespace("default")
crc.setClusterService(service)

context.addService(service)
context.setRouteController(crc)
context.start()
context.getRouteController().startAllRoutes()
I have a parameter object params that is serialized into my Flink streaming job:
class P extends Serializable { ... }

val params = new P(...)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.addSource(new MySource(params))
  .map(new MyMap(params))
  .addSink(new MySink(params))

env.setParallelism(1)
env.execute("My Job")
But params changes on the driver node, and I need to push the updated params to the executors while the job is running. Is this possible without stopping the Flink streaming job?
In short, the answer is no, because your UDF would need to [de]serialize the parameters every time a new record arrives, and this would slow down the execution.
However, you can implement your own stream operator by extending AbstractUdfStreamOperator and call it in a transform operation. I did one example here: "Implementing my own stream operator in Flink to deal with data skew".
Then you decide in the operator when to read the new parameters. Just create a new thread that is scheduled to run every 10 minutes, for instance. The parameter file has to be placed on all nodes where the operator will be running.
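For illustration only, here is a minimal Java sketch of that idea (not the linked example's code): it wraps an ordinary MapFunction, reloads a hypothetical parameter file at /tmp/params.properties on a background thread every 10 minutes, and keeps the latest values in a volatile field that the operator can consult:
import java.io.FileInputStream;
import java.util.Properties;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class ParamReloadingMapOperator<IN, OUT>
        extends AbstractUdfStreamOperator<OUT, MapFunction<IN, OUT>>
        implements OneInputStreamOperator<IN, OUT> {

    private transient ScheduledExecutorService scheduler;
    private volatile Properties params = new Properties();

    public ParamReloadingMapOperator(MapFunction<IN, OUT> mapper) {
        super(mapper);
    }

    @Override
    public void open() throws Exception {
        super.open();
        scheduler = Executors.newSingleThreadScheduledExecutor();
        // Re-read the parameter file every 10 minutes; the file must exist on every node running this operator.
        scheduler.scheduleAtFixedRate(this::reloadParams, 0, 10, TimeUnit.MINUTES);
    }

    private void reloadParams() {
        try (FileInputStream in = new FileInputStream("/tmp/params.properties")) { // hypothetical path
            Properties fresh = new Properties();
            fresh.load(in);
            params = fresh;
        } catch (Exception e) {
            // Keep the previous parameters if the reload fails.
        }
    }

    @Override
    public void processElement(StreamRecord<IN> element) throws Exception {
        // The wrapped user function does the per-record work; parameter-dependent
        // logic can read the latest 'params' here.
        output.collect(element.replace(userFunction.map(element.getValue())));
    }

    @Override
    public void close() throws Exception {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
        super.close();
    }
}
It can then be plugged in with something like inputStream.transform("param-reloading-map", outputTypeInfo, new ParamReloadingMapOperator<>(myMapFunction)), where outputTypeInfo is the TypeInformation of the mapped records; how 'params' feeds into your per-record logic is up to you.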
1. Is it possible to run a job from a savepoint with a direct main() + LocalExecutionEnvironment setup?
2. Is it possible to do that through a Remote*Environment?
3. Is it possible to do that, or to trigger a savepoint, via a ClusterClient?
4. Is the above possible through the REST API? The web UI (it doesn't look like it)?
5. Finally, is it possible to perform savepoint operations from a local ./bin/flink against a remote cluster (same version, but maybe a different OS)?
Thank you.
To partially answer (3): you can do that using a ClusterClient by doing something similar to:
final Configuration config = GlobalConfiguration.loadConfiguration("...");
final ClusterClient client = new StandaloneClusterClient(config);
final PackagedProgram packagedProgram = new PackagedProgram(new File(FLINK_JOB_JAR));
packagedProgram.setSavepointRestoreSettings(SavepointRestoreSettings.forPath("...", true));
client.run(packagedProgram, 1);