I'd like the latest checkpoint to be loaded in Flink, but it just isn't. I've written a word count application that is meant to resume counting where it left off after a restart. I am running it from my IDE, so I'm not starting a Flink cluster.
Here's the code I wrote: https://github.com/edu05/wordcount/tree/simple
It is inspired by the checkpointing example provided by the Flink creators: https://github.com/streaming-with-flink/examples-scala
What am I missing? How can I also avoid re-printing some of the word counts? I don't see many Apache Flink contributors on Stack Overflow; is there a more appropriate forum?
Checkpoints are not retained by default and are only used to resume a job after a failure.
If you need to start your job from a retained checkpoint, you have to do so manually, just as you would from a savepoint:
$ bin/flink run -s :checkpointMetaDataPath [:runArgs]
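For checkpoints to survive a restart at all, they first have to be retained (externalized). Below is a minimal sketch of what that could look like with the DataStream API; the class name, checkpoint interval and path are illustrative, not taken from the linked project.

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RetainedCheckpointSetup {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 10 seconds.
            env.enableCheckpointing(10_000);

            // Keep the latest checkpoint on disk even after the job is cancelled,
            // so it can later be passed to "flink run -s :checkpointMetaDataPath".
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // Illustrative path; use a location that survives restarts
            // (file://, hdfs://, s3://, ...).
            env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

            // ... build the word-count pipeline here and call env.execute(...).
        }
    }

Even with this in place, a fresh run will not pick up the old checkpoint on its own; the restore has to be requested explicitly, as shown above.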
I'm building a process with Apache Flink to handle millions of records in support of logistics data pipelines. I'm moving from Kinesis sources/sinks to Kafka sources/sinks.
However, in the Flink dashboard the job metrics are not updated in near real time. Do you know what could be wrong with the job/version?
By the way, once the job is closed it does show all the metrics, just not in near real time.
[Screenshot: job metrics not updating in the dashboard]
Fixed after cleaning up conflicting dependencies on the kafka-clients lib.
In my case, I was also using some Avro and CloudEvents libs that pulled in a higher kafka-clients version. Excluding kafka-clients from those libs and preferring the kafka-clients version that ships with Flink solved the issue.
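If someone hits the same symptom, a quick way to confirm which kafka-clients actually wins on the classpath is plain Java introspection of the jar the class was loaded from (the class name below is made up):

    import org.apache.kafka.clients.producer.KafkaProducer;

    public class KafkaClientsVersionCheck {

        public static void main(String[] args) {
            // Which jar does KafkaProducer actually come from?
            System.out.println(KafkaProducer.class
                    .getProtectionDomain()
                    .getCodeSource()
                    .getLocation());

            // Implementation version recorded in that jar's manifest (may be null).
            System.out.println(KafkaProducer.class.getPackage().getImplementationVersion());
        }
    }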
If I want to run a Flink job on YARN, the command is
./bin/flink run -m yarn-cluster ./examples/batch/WordCount.jar
but the command starts a default cluster with 2 TaskManagers.
If I am only submitting a single job, why is the default number of TaskManagers set to 2?
And when do I need multiple TaskManagers for a single job?
The basic idea of any distributed data processing framework is to run the same job across multiple compute nodes. That way, an application that processes too much data for one particular node simply scales out to multiple nodes and can, in theory, process arbitrarily large amounts of data. I suggest you read up on the basic concepts of Flink.
By the way, there is no particular reason for a default of 2; it could be any number, it just happens to be 2.
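To make the scale-out concrete, here is a minimal sketch (class name and numbers invented) of how a job's parallelism relates to the task slots, and therefore TaskManagers, it needs:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ParallelismExample {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Set the default parallelism to 4: the job then needs up to 4 task
            // slots, which can come from one TaskManager with 4 slots or from
            // 4 TaskManagers with 1 slot each.
            env.setParallelism(4);

            env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)
               .map(new MapFunction<Integer, Integer>() {
                   @Override
                   public Integer map(Integer value) {
                       return value * value;
                   }
               })
               .print();
        }
    }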
I want to run Flink programs on demand, submitting them when a certain condition occurs. How can I run Flink jobs from Java code with Flink 1.3.0?
You can use Flink's REST API to submit a job from another running Flink job. For more details see the REST API documentation.
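As an illustration, a job can be triggered from plain Java code by calling the cluster's REST endpoint. This is only a sketch: it assumes the endpoint is reachable on localhost:8081 and that the jar has already been uploaded (e.g. through the dashboard); the jar id and parallelism are made up.

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Scanner;

    public class SubmitFlinkJob {

        public static void main(String[] args) throws Exception {
            // Hypothetical values: adjust host, port and jar id to your setup.
            String jarId = "abc123_wordcount.jar";
            URL url = new URL("http://localhost:8081/jars/" + jarId + "/run?parallelism=2");

            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.getOutputStream().close();  // arguments are passed via query parameters here

            try (Scanner scanner = new Scanner(conn.getInputStream(), StandardCharsets.UTF_8.name())) {
                scanner.useDelimiter("\\A");
                // The response body contains the id of the started job as JSON.
                System.out.println(scanner.hasNext() ? scanner.next() : "(empty response)");
            }
        }
    }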
For example, I uploaded a JAR with my flow and ran it through the Apache Flink dashboard. Then I implemented some changes to the flow and want to deploy them.
Can anybody explain, step by step, how to deploy a new version of my flow to an Apache Flink cluster correctly (without downtime, losing state, etc.)? I didn't find a description of the deployment process in the official documentation.
What you want to use are Flink's savepoints.
The steps are as follows:
1. Prepare the new jar for your job
2. Save the state of the currently running job using flink savepoint <JobID>
3. Stop the job
4. Start the new jar from the just-created savepoint: flink run -s <pathToSavepoint> <jobJar> ... (a sketch on operator uids follows below)
See also: https://www.ververica.com/blog/how-apache-flink-enables-new-streaming-applications-part-1
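One detail that makes the restore in step 4 more robust after the flow has changed: stateful operators should carry stable uids so Flink can map the savepoint's state back onto the new version of the job. A minimal sketch; the source, uid string and job name are invented:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class UidExample {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            env.socketTextStream("localhost", 9999)
               .map(new MapFunction<String, String>() {
                   @Override
                   public String map(String value) {
                       return value.toLowerCase();
                   }
               })
               // A stable uid lets Flink match this operator's state in the
               // savepoint even if the surrounding topology changes; stateful
               // sinks and sources deserve uids as well.
               .uid("normalize-words")
               .print();

            env.execute("uid-example");
        }
    }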
Can anybody explain why the "Configuration" section of a running job in the Apache Flink Dashboard is empty?
How can I use this job configuration in my flow? It seems this is not described in the documentation.
The configuration tab of a running job shows the values of the ExecutionConfig. Depending on the version of Flink, you will experience different behaviour.
Flink <= 1.0
The ExecutionConfig is only accessible for finished jobs. For running jobs, it is not possible to access it. Once the job has finished or has been stopped/cancelled, you should be able to see the ExecutionConfig.
Flink > 1.0
The ExecutionConfig can also be accessed for running jobs.
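If the goal is also to see your own values in that tab and use them in the flow, the usual pattern is to register them as global job parameters. A sketch, assuming the ParameterTool utility that ships with Flink; argument names are illustrative:

    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class JobConfigExample {

        public static void main(String[] args) throws Exception {
            // e.g. --input /tmp/words.txt --window-size 10
            ParameterTool params = ParameterTool.fromArgs(args);

            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Registering the parameters makes them visible in the dashboard's
            // job configuration and available inside rich functions via
            // getRuntimeContext().getExecutionConfig().getGlobalJobParameters().
            env.getConfig().setGlobalJobParameters(params);

            // ... build the pipeline, reading values with params.get("input") etc.
        }
    }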