I have run a bounded Flink job on a standalone cluster, and Flink broke it down into 3 jobs. It takes around 10 seconds to start the next job after the previous one finishes. How can I reduce the time between jobs?

Also, when observing the details of the task flow, I noticed that the 2nd job repeated the same tasks that had already been done by the 1st job, plus new additional tasks, and likewise for the 3rd job. For example, every job re-reads the data from the files and then joins it. Why does this happen? I am a new Flink user. AFAIK, we can't cache a DataSet in Flink. I really need help understanding how this works. Thank you.
Here is the code
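(The code itself is not reproduced here.) For context, a common cause of this behavior in the DataSet API is eager actions such as count(), collect(), or print(): each one submits its own job, and every job re-executes the full lineage from the sources, because intermediate results are not cached between jobs. A hypothetical sketch of the pattern, with made-up paths and keys:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> orders = env.readTextFile("file:///data/orders"); // made-up path
DataSet<String> items  = env.readTextFile("file:///data/items");  // made-up path

// Job 1: count() is eager -- it submits a job that reads the orders file.
long orderCount = orders.count();

DataSet<Tuple2<String, String>> joined = orders
        .join(items)
        .where(line -> line.split(",")[0])    // made-up join key
        .equalTo(line -> line.split(",")[0]);

// Job 2: another eager action -- it re-reads BOTH files and redoes the
// join, because the work done for job 1 is not cached anywhere.
long joinedCount = joined.count();

// Job 3: the sink submitted by execute() reads and joins yet again.
joined.writeAsText("file:///data/out");
env.execute("write-job");
```

If this matches what the code does, the usual fix is to remove the intermediate eager actions (or write intermediate results to a sink and read them back) so everything runs as a single job.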
I have been reading up on scheduling policies, specifically the ones in YARN. To summarize the FAIR scheduler at a high level: it divides the resources almost equally among the jobs. In the Hadoop MapReduce case, it does this by reassigning resources to different jobs whenever a map or reduce task completes.
Explaining FAIR scheduling with an example: suppose a single Hadoop MapReduce job (job1) containing 5 map and 5 reduce tasks is scheduled on a cluster. The cluster has 2 cores in total and can provide at most 2 containers. Because there are no other jobs, both containers will be used by job1. When a new job (job2) arrives, the scheduler will wait for a current task of job1 to finish on one of the containers and give that resource to job2. From then on, the tasks of the two jobs will run on one container each.
Is my understanding above roughly correct? If yes, then what happens if the individual map and reduce tasks of job1 take a long time? Does YARN have to wait a long time for a task of job1 to complete before resources can be freed up for job2?
My other question is an extension of the above case: how is FAIR scheduling enforced for long-running streaming jobs? For example, suppose a Flink job (job1) with a map->reduce pipeline is scheduled on the cluster, with the parallelism of the job's map and reduce tasks initially set to 2. There will then be 2 parallel pipelines across the two containers (task managers), each pipeline containing a map and a reduce subtask. Now, if a new job (job2) arrives, YARN would have to wait for one of the pipelines to finish so that the resource can be given to job2. But since job1 is a long-running, continuous job, it may only stop after a very long time, or never. In this case, what will YARN do to enforce FAIR scheduling?
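On that last question: out of the box, the Fair Scheduler only waits for running containers to be released, but it can also be configured to preempt. With preemption enabled, YARN will kill containers of applications holding more than their fair share once newer applications have waited past a configured timeout, and that is how fairness is enforced against long-running jobs that never release resources on their own. A sketch of the relevant yarn-site.xml switches (property names as in the Hadoop 2.x Fair Scheduler docs; verify them against your version):

```xml
<!-- yarn-site.xml: enable the Fair Scheduler and allow it to preempt
     containers from applications running over their fair share. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
```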
I am working on a Flink project which writes a stream to a relational database.
In the current solution, we wrote a custom sink function which opens a transaction, executes a SQL insert statement, and closes the transaction. This worked well until the data volume increased and we started getting connection timeout issues. We tried a few connection pool configuration adjustments, but they did not help much.
We are thinking of trying batch inserts to decrease the number of writes to the database. We came across a few classes which do almost what we want: JDBCOutputFormat and JDBCSinkFunction. With JDBCOutputFormat, we can configure the batch size.
We would also like to force a batch insert every minute, in case the number of records does not reach the batch size. How would you normally deal with this kind of problem? My first thought is to extend JDBCOutputFormat to use a scheduled task that forces a flush every minute, but it is not obvious how that could be done.
Do we have to write our own sink altogether?
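For what it's worth, here is a minimal sketch of the scheduled-flush idea written as a standalone sink rather than an extension of JDBCOutputFormat (which does not appear to expose a public flush hook). The class name, SQL statement, batch size, and connection URL are all placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.types.Row;

// Hypothetical sink: batches inserts and force-flushes once a minute.
public class TimedBatchJdbcSink extends RichSinkFunction<Row> {

    private static final int BATCH_SIZE = 500;             // assumed threshold
    private static final long FLUSH_INTERVAL_MS = 60_000L; // forced flush interval

    private transient Connection connection;
    private transient PreparedStatement statement;
    private transient ScheduledExecutorService scheduler;
    private int batchCount = 0;

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection("jdbc:...");     // placeholder URL
        statement = connection.prepareStatement(
                "INSERT INTO my_table (a, b) VALUES (?, ?)");     // placeholder SQL
        scheduler = Executors.newSingleThreadScheduledExecutor();
        // Time-based flush: fires even when no new records arrive.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                flush();
            } catch (SQLException e) {
                throw new RuntimeException("Scheduled flush failed", e);
            }
        }, FLUSH_INTERVAL_MS, FLUSH_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    @Override
    public synchronized void invoke(Row row) throws Exception {
        statement.setObject(1, row.getField(0));
        statement.setObject(2, row.getField(1));
        statement.addBatch();
        if (++batchCount >= BATCH_SIZE) { // size-based flush
            flush();
        }
    }

    // Synchronized so the timer thread and the task thread don't race.
    private synchronized void flush() throws SQLException {
        if (batchCount > 0) {
            statement.executeBatch();
            batchCount = 0;
        }
    }

    @Override
    public void close() throws Exception {
        if (scheduler != null) {
            scheduler.shutdown();
        }
        flush();
        if (statement != null) {
            statement.close();
        }
        if (connection != null) {
            connection.close();
        }
        super.close();
    }
}
```

Note this sketch ignores fault tolerance: to survive failures you would also implement CheckpointedFunction and flush in snapshotState(), which, as the update below notes, is essentially what JDBCSinkFunction already does.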
Updated:
JDBCSinkFunction flushes and executes its batch each time Flink checkpoints, so as long as you are checkpointing, batches won't be held back longer than the checkpointing interval.
However, having read this mailing list thread, I see that JDBCSinkFunction does not support exactly-once output.
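So if flush-on-checkpoint is acceptable, bounding the checkpoint interval also bounds the batch latency. A sketch, assuming a one-minute target:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// JDBCSinkFunction flushes its batch on every checkpoint, so a 60 s
// checkpoint interval means batches are written at least once a minute.
env.enableCheckpointing(60_000L);
```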
I want to write a task that is triggered by Apache Flink every 24 hours and then processed by Flink. What is a possible way to do this? Does Flink provide any job scheduling functionality?
Apache Flink is not a job scheduler but an event processing engine, which is a different paradigm: Flink jobs are meant to run continuously rather than being triggered by a schedule.
That said, you could achieve the functionality with an off-the-shelf scheduler (e.g. cron) that starts a job on your Flink cluster and then stops it, either after you receive some sort of notification that the job is done (e.g. through a Kafka topic) or simply after a timeout, at which point you assume the job has finished. But again, precisely because Flink is not designed for this kind of use case, you would most likely run into edge cases that Flink does not support.
Alternatively, you can simply use a 24-hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details.
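A rough sketch of that alternative; the source, key extraction, and the per-window work are placeholders:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 9999)                     // placeholder source
   .keyBy(line -> line.split(",")[0])                       // hypothetical key
   .window(TumblingProcessingTimeWindows.of(Time.days(1)))  // fires every 24 hours
   .reduce((a, b) -> a + "\n" + b)                          // placeholder daily task
   .print();

env.execute("daily-window-job");
```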
I have a Flink batch job. What is the best way to run it continuously? (It needs to restart whenever it finishes, because the streaming job can provide new data.)
I want to restart the job immediately after it finishes.
An infinite loop that calls the tasks inside it?
Or a bash script that keeps pushing the job to the jobmanager? (I think that would be a really big waste of resources.)
Thanks
In a similar use case, where we run a Flink job against the same collection, we trigger a new job at periodic intervals (daily, hourly, etc.). https://azkaban.github.io/ can be used for scheduling. This is NOT exactly what you described, but a close match which might be sufficient to solve your use case.
I have written code that writes data to custom metrics in Cloud Monitoring from Google App Engine.
For that, I store the data for some amount of time (say, 15 minutes) in the datastore; then a cron job runs, fetches the data from there, and plots it on the Cloud Monitoring dashboard.
Now my problem is: while fetching a huge amount of data to plot from the datastore, the cron job may time out. I also wanted to know what happens when a cron job fails.
Can it fail if the number of records is high? If so, what alternatives do we have? How many records can a cron job safely process within the 10-minute timeout?
Please let me know if any other info is needed.
Thanks!
You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
A cron job is not retried; you need to implement that mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks as necessary to process the data and add them to the queue. That way you don't have to redo the whole job or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly, as opposed to one or a few tasks that take minutes. This way you minimize wasted resources if a task fails, and have a smaller impact on other processes running on the same instance.
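A minimal sketch of that fan-out pattern with App Engine's Java task queue API; the class, field names, and chunk size are hypothetical:

```java
import com.google.appengine.api.taskqueue.DeferredTask;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

// One small, independently retryable unit of work.
public class PlotChunkTask implements DeferredTask {

    private final long startId;
    private final long endId;

    public PlotChunkTask(long startId, long endId) {
        this.startId = startId;
        this.endId = endId;
    }

    @Override
    public void run() {
        // Fetch only the records in [startId, endId) from the datastore and
        // push them to the monitoring dashboard. If this throws, the task
        // queue retries just this chunk instead of the whole cron job.
    }

    // Called from the cron handler: fan the work out into many small tasks
    // instead of doing everything inside the cron request itself.
    public static void enqueueAll(long totalRecords) {
        long chunk = 1_000L; // assumed chunk size
        Queue queue = QueueFactory.getDefaultQueue();
        for (long start = 0; start < totalRecords; start += chunk) {
            queue.add(TaskOptions.Builder.withPayload(new PlotChunkTask(start, start + chunk)));
        }
    }
}
```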