Flink Cluster Details,
Number of nodes : 4
Flink Version : 1.11
Flink Client : RestCluserClient
We are submitting Flink batch job from streaming job using PackagedProgram, but our requirement is to execute only one job at a time, let's say we got 2 events from source so idealy 2 batch job must be triggered(each per event) but only one at a time. To achieve this, we were using client.setDetached(false) (in previous version of flink), but once we have migrated it to 1.11 setDetached(false) API has been removed.
Do we have any idea how to implement this requirement ?
After analysis more on this, I found the solution.
Flink 1.11 API provided Utils class for submitting the job viz,ClientUtils and it has two methods,
ClientUtils.submitJob() -> this method works with detached mode as true
ClientUtils.submitJobAndWaitForExecutionResult() -> this works as detached mode as false.
Related
I did configure the standalone Debezium and tested the streaming. After that I created a pipeline as follows
pipeline.apply("Read from DebeziumIO",
DebeziumIO.<String>read()
.withConnectorConfiguration(
DebeziumIO.ConnectorConfiguration.create()
.withUsername("user")
.withPassword("password")
.withHostName("hostname")
.withPort("1433")
.withConnectorClass(SqlServerConnector.class)
.withConnectionProperty("database.server.name", "customer")
.withConnectionProperty("database.dbname", "test001")
.withConnectionProperty("database.include.list", "test002")
.withConnectionProperty("include.schema.changes", "true")
.withConnectionProperty("database.history.kafka.bootstrap.servers", "kafka:9092")
.withConnectionProperty("database.history.kafka.topic", "schema-changes.inventory")
.withConnectionProperty("connect.keep.alive", "false")
.withConnectionProperty("connect.keep.alive.interval.ms", "200")
).withFormatFunction(new SourceRecordJson.SourceRecordJsonMapper()).withCoder(StringUtf8Coder.of())
)
When I start the pipeline using DirectRunner, datastream is not captured by the pipeline. In my pipeline code I just added code to dump the data into console for the time being.
Also from the log I observe that the Debezium is being started and stopped frequently. Is that by design?
Also when there is a change made into the DB (INSERT/DELETE/UPDATE), I dont find it being reflected in the logs.
So my question is,
Configuration what I provided is that sufficient?
Why is the pipeline not being triggered when there is a change?
What additional steps I need to perform to get it working?
Restarting debezium multiple times can it cause performance impacts. Since it creates a jdbc connection.
Here is the deal;
I'm dealing with adding new worker (embbeded) to on running the cluster (flink statefun 2.2.1).
As you see the new task manager can be registered to the cluster;
Screenshot of new deployed taskmanager
But it doesn't initialize (it doesn't deploying sources);
What am I missing here?? (master and workers has to same jar files too? or it should be enough deploying taskmanager with jar file)
Any help would be appreciated,
Thx.
Flink supports two different approaches to rescaling: active and reactive.
Reactive mode is new in Flink 1.13 (released just this week), and works as you expected: add (or remove) a task manager, and your application will adjust to the new parallelism. You can read about elastic scaling and reactive mode in the docs.
Reactive mode is currently a work in progress, but might need your needs.
In broad strokes, for active mode rescaling you need to:
Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
Relaunch with the new parallelism, using the savepoint as the starting point.
The exact details depend on how your cluster is deployed.
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground.
The above applies to rescaling statefun embedded functions. Being stateless, remote functions can be rescaled more straightforwardly.
I am using Flink 1.2.1 running on Docker, with Task Managers distributed across different VMs as part of a Docker Swarm.
Uploading an Apache Beam application using the Flink Web UI and trying to set the parallelism at job submission point doesn't work. Neither does submit a job using the Flink CLI.
It seems like the parallelism doesn't get picked up at client level, it ends up defaulting to 1.
When I set the parallelism programmatically within the Apache Beam code, it works: flinkPipelineOptions.setParallelism(4);
I suspect the root of the problem may be in the org.apache.beam.runners.flink.DefaultParallelismFactory class, as it checks for Flink's GlobalConfiguration, which may not pick up runtime values passed to Flink.
Any ideas on how this could be fixed or worked around? I need to be able to change the parallelism dynamically, so the programmatic approach won't work, nor will setting the Flink configuration at system level.
I am using the following documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/parallel.html
https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/runners/flink/DefaultParallelismFactory.html
This should probably be fixed in the Beam Flink Runner but as a workaround you can try setting the parallelism to -1 programatically. This should make the translation pick up the parallelism that is specified when submitting the job.
I have a Java application to lunch a flink job to process Kafka streaming.
The application is pending here at the job submission at flinkEnv.execute("flink job name") since the job is running forever for streamings incoming from kafka.
In this case, how can I get job id returned from the execution? I see the jobid is printing in the console. Just wonder, how to get jobid is this case without flinkEnv.execute returning yet.
How I can cancel a flink job given job name from remote server in Java?
As far as I know there is currently no nice programmatic way to control Flink. But since Flink is written in Java everything you can do with the console can also be done with internal class org.apache.flink.client.CliFrontend which is invoked by the console scripts.
An alternative would be using the REST API of the Flink JobManager.
you can use rest api to consume flink job process.
check below link: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html.
maybe you can try to request http://host:port/jobs/overview to get all job's message that contains job's name and job's id. Such as
{"jobs":[{"jid":"d6e7b76f728d6d3715bd1b95883f8465","name":"Flink Streaming Job","state":"RUNNING","start-time":1628502261163,"end-time":-1,"duration":494208,"last-modification":1628502353963,"tasks":{"total":6,"created":0,"scheduled":0,"deploying":0,"running":6,"finished":0,"canceling":0,"canceled":0,"failed":0,"reconciling":0,"initializing":0}}]}
I really hope this will help you.
Can anybody explain me why "Configuration" section of running job in Apache Flink Dashboard is empty?
How to use this job configuration in my flow? Seems like this is not described in documentation.
The configuration tab of a running job shows the values of the ExecutionConfig. Depending on the version of Flink you might will experience a different behaviour.
Flink <= 1.0
The ExecutionConfig is only accessible for finished jobs. For running jobs, it is not possible to access it. Once the job has finished or has been stopped/cancelled, you should be able to see the ExecutionConfig.
Flink > 1.0
The ExecutionConfig can also be accessed for running jobs.