Reuse Apache Beam Workflow Graph - apache-flink

I'm using Apache Beam to run Batch pipelines on Flink, running on AWS EMR.
Beam pipelines are being created and submitted to a long-running Flink cluster. However, I see that there is a cost associated with building the job graph and submitting it to EMR, consistently taking in excess of 2 minutes.
I want to reduce this time and was wondering if there is a way to cache the pipeline, or otherwise prevent Beam from rebuilding it on every run of the driver cron job.
The job graph is fairly complex, so the 2-minute build time is understandable; I am just looking for ways to reduce it or eliminate it from happening every time.

Related

Processing job vs. training job in SageMaker

What is the difference between a processing job and a training job? I am running a training job and did not launch a processing job, so why does my account have a processing job running?
A training job focuses on the process that trains a model: for example, it will use a PyTorch or TensorFlow container or a SageMaker built-in algorithm, and it may have a hyperparameter tuning job associated with it. It may include an MPI cluster for distributed training. Finally, it outputs a model artifact.
Processing jobs focus on pre/post processing, and have APIs and containers for ML data-processing tools like scikit-learn pipelines or Dask/Spark.
Why you see processing jobs: when you profile a training job (enabled by default), a matching processing job is created to process the profiler report. You can disable profiling by adding the disable_profiler=True parameter to the estimator object.
Generally speaking, training jobs have a richer API than processing jobs and allow more customization. The right choice will depend on your specific use case.
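For reference, a minimal sketch of disabling the profiler on an estimator, assuming the SageMaker Python SDK; the role ARN, entry point, and framework versions below are placeholders:

```python
# Minimal sketch: disable profiling so no companion processing job is created.
# Role ARN, entry point, instance type, and framework versions are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.13",
    py_version="py39",
    disable_profiler=True,  # prevents the profiler-report processing job
)

estimator.fit("s3://my-bucket/training-data")  # placeholder S3 path
```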

Flink-Kafka: Flink job reading Kafka records during startup and failing to start on AWS KDA

Running a Flink-Beam job on KDA (Kafka --> Flink (Beam) --> Elasticsearch), the simple job won't start on KDA and enters an infinite loop. AWS KDA support replied saying the job reads records during startup, which is the cause of the failure.
The dockerized version of the app runs smoothly with 3 TaskManagers in Kubernetes, but not on KDA, as KDA has a 2-minute timeout for starting a job.
My understanding is that Flink starts reading records once the job starts, so how do I get the startup time under 2 minutes? The job is very basic: it reads records from Kafka and stores them in ES.
I resolved the issue: Beam uses the direct runner by default.
It is important to set --runner=FlinkRunner so that your job starts as a Flink job.
Otherwise the job ends up in an infinite loop reading from the Kafka topic.
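For illustration, a minimal sketch of pinning the runner with the Beam Python SDK; the Flink master address, source, and sink below are placeholders:

```python
# Minimal sketch: explicitly select the FlinkRunner instead of the default DirectRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=FlinkRunner",
    "--flink_master=localhost:8081",  # placeholder; point at your JobManager
])

with beam.Pipeline(options=options) as p:
    (p
     | "Create" >> beam.Create(["a", "b", "c"])  # stand-in for the Kafka source
     | "Print" >> beam.Map(print))  # stand-in for the Elasticsearch sink
```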

Apache Flink, job with big graph - submission times out on cluster

We are trying to build a Flink job for price aggregation with quite complicated logic.
E.g. the previous version had a graph as below.
After another development iteration, I added even more complexity to the job.
The new version runs fine from the IDE; however, deployment to the cluster fails with:
Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out.
If I reconfigure the job (reduce the graph complexity), it gets deployed without any problem.
My questions are:
Are there any limitations on job graph size and complexity when submitting to a standalone cluster?
Is there any way to disable the graphical graph representation? (I suspect the problem is caused by the graph view, since locally my job works.)
Are there any debugging tools to understand what is happening during job submission, and why it times out?
Thanks in advance.
The solution was to use the latest Flink version (1.5 at the time of writing).

How to schedule a job in apache-flink

I want to write a task that is triggered by Apache Flink every 24 hours and then processed by Flink. What is a possible way to do this? Does Flink provide any job-scheduling functionality?
Apache Flink is not a job scheduler but an event-processing engine, which is a different paradigm: Flink jobs are meant to run continuously rather than being triggered by a schedule.
That said, you could achieve this functionality by using an off-the-shelf scheduler (e.g. cron) that starts a job on your Flink cluster and then stops it after you receive some sort of notification that the job is done (e.g. through a Kafka topic), or simply by using a timeout after which you assume the job has finished and stop it; see the sketch below. But again, especially because Flink is not designed for this kind of use case, you would most likely run into edge cases that Flink does not support.
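A minimal sketch of that cron-driven approach, assuming the flink CLI is on the PATH; the jar path and fixed timeout are placeholders:

```python
# Minimal sketch: submit a Flink job from a script scheduled by cron, then cancel it
# after a fixed timeout. The jar path and timeout are placeholders.
import re
import subprocess
import time

# Submit the job in detached mode; the CLI prints the assigned JobID.
out = subprocess.run(
    ["flink", "run", "-d", "/opt/jobs/daily-aggregation.jar"],
    capture_output=True, text=True, check=True,
).stdout

match = re.search(r"JobID ([0-9a-f]+)", out)  # parsing the CLI output is best-effort
job_id = match.group(1) if match else None

time.sleep(3600)  # crude stand-in for a real completion signal (e.g. a Kafka message)

if job_id:
    subprocess.run(["flink", "cancel", job_id], check=True)
```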
Alternatively, you can simply use a 24-hour tumbling window and run your task in the corresponding trigger function. See https://flink.apache.org/news/2015/12/04/Introducing-windows.html for details on that matter.
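As an illustration, a minimal sketch of a 24-hour tumbling window using the PyFlink DataStream API; the in-memory source and sum aggregation are placeholders:

```python
# Minimal sketch: aggregate keyed events in 24-hour tumbling windows.
# The in-memory source and the sum aggregation are placeholders.
from pyflink.common import Types
from pyflink.common.time import Time
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import TumblingProcessingTimeWindows

env = StreamExecutionEnvironment.get_execution_environment()

events = env.from_collection(
    [("sensor-1", 3), ("sensor-1", 5), ("sensor-2", 7)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]),
)

(events
 .key_by(lambda e: e[0])
 .window(TumblingProcessingTimeWindows.of(Time.hours(24)))  # one window per day
 .reduce(lambda a, b: (a[0], a[1] + b[1]))  # placeholder aggregation
 .print())

env.execute("daily-tumbling-window")
```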

Where is the JobManager on embedded Flink instances?

I am developing an application with multiple (micro)services.
I am using Flink (over Kafka) to stream messages between the services. Flink is embedded in the Java applications, each running in a separate Docker container.
This is the first time I'm trying Flink and after reading the docs I still have a feeling I'm missing something basic.
Who is managing the jobs?
Where is the JobManager running?
How do I monitor the processing?
Thanks,
Moshe
I would recommend this talk by Stephan Ewen at Flink Forward 2016. It explains the current Apache Flink architecture (10:45) for different deployments as well as future goals.
In general, the JobManager manages Flink jobs, while TaskManagers execute your job, which consists of multiple tasks. How the components are orchestrated depends on your deployment (local, Flink cluster, YARN, Mesos etc.).
The best tool for monitoring your processing is the Flink Web UI (on port 8081 by default); it offers various metrics for debugging and monitoring (e.g. checkpointing or back-pressure).
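Besides the Web UI, the same port serves Flink's REST API; a minimal sketch of polling it from Python, where the JobManager address is a placeholder:

```python
# Minimal sketch: list jobs via Flink's REST API (served on the Web UI port).
# The JobManager address is a placeholder.
import requests

resp = requests.get("http://localhost:8081/jobs/overview")
resp.raise_for_status()

for job in resp.json()["jobs"]:
    print(job["jid"], job["name"], job["state"])
```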
