How to schedule a job in apache-flink - apache-flink

I want to write a task that is triggered by apache flink after every 24 hours and then processed by flink. What is the possible way to do this? Does flink provide any job scheduling functionality?

Apache Flink is not a job scheduler but an event processing engine which is a different paradigm, as Flink jobs are supposed to run continuously instead of being triggered by a schedule.
That said, you could achieve the functionality by simply using an off the shelve scheduler (i.e. cron) who is scheduled to start a job on your Flink cluster and then stop it after you receive some sort of notification that the job was done (i.e. through a Kafka topic) or simply use a timeout after which you would assume that the job is finished and you can stop the job. But again, especially because Flink is not designed for this kind of use cases, you would most certainly run into edge cases which Flink does not support.
Alternatively you can simply use a 24 hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details on that matter.

Related

Savepoint on Flink Job Finish

I have a usecase where I need to seed a Flink Application(both RocksDB state and Broadcast State) using Bounded S3 sources and then read other unbounded/bounded S3 sources after the seeding is complete.
I was trying to achieve this in 2 steps:
Seeding: Trigger a Flink job with only the seeding data bounded source and take a savepoint after the job finishes.
Regular Processing: Restore from seeded savepoint on a new Flink graph to process other unbounded/bounded S3 sources.
Questions:
For Step 1: Does Flink support taking savepoints automatically after Job Finishes in Streaming Mode.
If only manual savepoint trigger is supported, what can be used a done signal that all the seeding data is processed completely and all the task are finished processing?
Any other approaches to achieve the seeding usecase is appreciated as well.
Note: Approaches where we buffer the regular data until seeding data is processed is not feasible for my usecase
Thanks
Using unbounded sources you can make use of externalized checkpoint and you will be able to start/resume jobs from the checkpoint. Enabling this feature it is necessary to have a process to clean the checkpoints when the job is cancelled otherwise the checkpoints won't be deleted by Flink.
You can use the new feature available in Flink 1.15 (checkpoints with finished tasks) to do that.

How to reduce times between Flink intra-jobs and avoid repeated tasks

I have run a Flink bounded job in standalone cluster. Then Flink breaks it down into 3 jobs.
It takes around 10 secs to start the next job after one job finish. How to reduce the times between jobs? and when observing the details of the tasks flow, I notice that 2nd job did the same tasks that have been done by 1st job, plus new additional tasks, and so on with 3rb job. For example, it repeatedly reads the data from files in every job and then join it. Why does it happen? I am a new Flink user. AFAIK, we can't cache dataset in Flink. Really need help to understand how it works. Thank you.
Here is the code

Avoid running initialization code in Apache Flink job when resuming from savepoint

I have a Apache Flink Job, implemented with the DataStream API, which contains some initialization code before the definition and submission of the job graph. The initialization code should only run the first time the job is submitted and not when resuming the job from a checkpoint or when updating it using a savepoint.
It seems that when restarting the job during a failover from a checkpoint, the job is restarted from a job graph stored in the checkpoint - in particular, the initialization code is not run a second time (which is what I want).
Is the same possible when running a job from a savepoint? In other words, is there a way to execute code only when the job is not started from a savepoint?
If you implement the CheckpointedFunction interface, then initializeState(FunctionInitializationContext context) will be called during initialization. Then you can use context.isRestored() to determine whether the job is being started for the first time, or not.

how to deploy a new job without downtime

I have an Apache Flink application that reads from a single Kafka topic.
I would like to update the application from time to time without experiencing downtime. For now the Flink application executes some simple operators such as map and some synchronous IO to external systems via http rest APIs.
I have tried to use the stop command, but i get "Job termination (STOP) failed: This job is not stoppable.", I understand that the Kafka connector does not support the the stop behavior - a link!
A simple solution would be to cancel with savepoint and to redeploy the new jar with the savepoint, but then we get downtime.
Another solution would be to control the deployment from the outside, for example, by switching to a new topic.
what would be a good practice ?
If you don't need exactly-once output (i.e., can tolerate some duplicates) you can take a savepoint without cancelling the running job. Once the savepoint is completed, you start a second job. The second job could write to different topic but doesn't have to. When the second job is up, you can cancel the first job.

Flink batch continuous running

I have flink batch job. What is the best way to run continuously? (It needs to restart when it's finished because the streaming job can provide new data)
I want to restart the job immediately if it's finished.
Infinite cycle and inside call the tasks?
Make a bash script and always push the job into the jobmanager? (I think it's a really big resource waste)
Thanks
In a similar use-case where we run Flink job against same collection; we trigger new job at periodic intervals. [daily, hourly etc.] https://azkaban.github.io/ can be used for scheduling. This is NOT really what you mentioned. But, a close-match which might be sufficient to solve your use-case.

Resources