We are using an AWS EMR cluster for submitting Flink jobs. Today we log into the EMR cluster and run the following steps to submit a job:
1. Stop the job with a savepoint.
2. Start the new job from the savepoint created in step 1.
Now I want to tie these steps into our pipeline deployments. How can I do that? Are there any existing software/tools for automating Flink deployments?
We would also like to add features such as automatically rolling a deployment back to the previous version in case of exceptions/errors.
Related
I am new to Flink and EMR cluster deployment. Currently we have a Flink job and we are manually deploying it on an AWS EMR cluster via the Flink CLI stop/start commands.
I want to automate this process (updating the Flink job jar on every deployment that happens via the pipelines, using savepoints) and need recommendations on possible approaches that could be explored.
One option is to automate this process via the Flink REST API, which supports all Flink job operations:
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/
A sample project that uses the same approach: https://github.com/ing-bank/flink-deployer
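For what it's worth, here is a minimal sketch of what such an automated deploy step could look like against the REST API (written in Python with requests; the endpoint, job id, jar path and savepoint directory are placeholders, not values from your setup):

import time
import requests

FLINK_REST = "http://<jobmanager-host>:8081"    # JobManager REST endpoint
JOB_ID = "<running-job-id>"                     # id of the job to redeploy
NEW_JAR = "target/my-job-1.2.0.jar"             # jar built by the pipeline
SAVEPOINT_DIR = "s3://my-bucket/flink/savepoints"

def stop_with_savepoint(job_id):
    """Stop the job with a savepoint and return the savepoint location."""
    resp = requests.post(
        f"{FLINK_REST}/jobs/{job_id}/stop",
        json={"drain": False, "targetDirectory": SAVEPOINT_DIR},
    )
    resp.raise_for_status()
    trigger_id = resp.json()["request-id"]
    # Stop-with-savepoint is asynchronous; poll until the savepoint completes.
    while True:
        status = requests.get(
            f"{FLINK_REST}/jobs/{job_id}/savepoints/{trigger_id}"
        ).json()
        if status["status"]["id"] == "COMPLETED":
            return status["operation"]["location"]
        time.sleep(2)

def submit_from_savepoint(savepoint_path):
    """Upload the new jar and start it from the given savepoint."""
    with open(NEW_JAR, "rb") as jar:
        upload = requests.post(
            f"{FLINK_REST}/jars/upload",
            files={"jarfile": (NEW_JAR.split("/")[-1], jar, "application/x-java-archive")},
        )
    upload.raise_for_status()
    jar_id = upload.json()["filename"].split("/")[-1]
    run = requests.post(
        f"{FLINK_REST}/jars/{jar_id}/run",
        json={"savepointPath": savepoint_path},
    )
    run.raise_for_status()
    return run.json()["jobid"]

if __name__ == "__main__":
    savepoint = stop_with_savepoint(JOB_ID)
    new_job_id = submit_from_savepoint(savepoint)
    print(f"started {new_job_id} from {savepoint}")

An automated rollback, as asked about above, could reuse submit_from_savepoint with the previous jar and the same savepoint if the new job fails its health checks.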
I'm running Flink on Kubernetes, and when I update the replicas of the TaskManager deployment, Kubernetes scales the number of TM pods up/down for me. However, the newly added TM comes up without being assigned any tasks, so I'm not sure that is all I need to do. Do I need to do anything else to make the job adapt to more/fewer TMs in Flink 1.11.3?
To get this to work the way you expected, upgrade to Flink 1.13 and use reactive mode. See https://flink.apache.org/2021/05/06/reactive-mode.html.
With Flink 1.11, you'll have to rescale manually by restarting from a checkpoint or savepoint while specifying the new parallelism. If you are using a native Kubernetes deployment, Flink uses its Kubernetes resource manager and will create the appropriate number of pods automatically. (Note that native Kubernetes deployments have also been improved since 1.11.) With a standalone Kubernetes deployment, on the other hand, Flink is unaware of Kubernetes, and you need to create the right number of pods yourself.
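If you do upgrade and switch to reactive mode as suggested above, the change is essentially one flink-conf.yaml entry (a minimal sketch; reactive mode assumes a standalone application-mode cluster, and the rest of your configuration stays as-is):

# flink-conf.yaml, Flink 1.13+
# In reactive mode the job always uses all available TaskManager slots,
# so scaling the TM deployment up or down rescales the job automatically.
scheduler-mode: reactive

After that, changing the replicas of the TaskManager deployment is enough; the job is restarted from the latest checkpoint with the new parallelism.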
Let's say all nodes running the Flink JobManagers are restarted at the same time; is there any impact on the running TaskManagers, which are untouched?
Thanks!
The new job managers will restart all of the jobs from their latest checkpoints, using the information (job graphs, checkpoint metadata) they find in the HA service provider.
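For reference, a sketch of the flink-conf.yaml entries that point Flink at that HA service provider (the hosts and paths are placeholders):

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
# durable storage for the job graphs and checkpoint metadata that the
# restarted JobManagers read on startup
high-availability.storageDir: hdfs:///flink/ha/

The quorum and the storage directory need to survive the JobManager restarts, since that is where the new JobManagers find the state described above.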
I am attempting to recover my jobs and state when my JobManager goes down, but I haven't been able to restart my jobs successfully.
From my understanding, TaskManager recovery is aided by the JobManager (this works as expected), and JobManager recovery is done through ZooKeeper.
I am wondering if there is a way to recover the JobManager without ZooKeeper?
I am using Docker for my setup, and all checkpoints and savepoints are persisted to mapped volumes.
Is Flink able to recover when all JobManagers go down? I can afford to wait for the single JobManager to restart.
When I restart the JobManager I get the following exception: org.apache.flink.runtime.rest.NotFoundException: Job 446f4392adc32f8e7ba405a474b49e32 not found
I have set the following in my flink-conf.yaml:
state.backend: filesystem
state.checkpoints.dir: file:///opt/flink/checkpoints
state.savepoints.dir: file:///opt/flink/savepoints
I think my issue may be that the JAR gets deleted when the JobManager is restarted, but I am not sure how to solve this.
At the moment, Flink only supports recovering from a JobManager fault if you are using ZooKeeper. However, you can theoretically make it work without it if you can guarantee that only a single JobManager is ever running. See this answer for more information.
You can also look into running your cluster as a "Flink Job Cluster". This will automatically start the job that you baked into the Docker image when the container comes up. You can read more here.
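A rough docker-compose sketch of such a job cluster, for illustration only (the image name, job class and mounted paths are placeholders, and details vary between Flink versions):

version: "3"
services:
  jobmanager:
    # image with the job jar baked in (e.g. copied into /opt/flink/usrlib)
    image: my-flink-job:latest
    command: standalone-job --job-classname com.example.MyJob
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - ./checkpoints:/opt/flink/checkpoints
      - ./savepoints:/opt/flink/savepoints
  taskmanager:
    image: my-flink-job:latest
    command: taskmanager
    environment:
      - |
        FLINK_PROPERTIES=
        jobmanager.rpc.address: jobmanager
    volumes:
      - ./checkpoints:/opt/flink/checkpoints
      - ./savepoints:/opt/flink/savepoints

The standalone-job entrypoint also accepts a --fromSavepoint argument, which is handy if you need to restore from a specific savepoint after bringing the containers back up.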
I'm using Apache Beam with the Flink runner and the Java SDK. It seems that deploying a job to Flink means building an 80-megabyte fat jar that gets uploaded to the Flink JobManager.
Is there a way to easily deploy a lightweight SQL job to run Beam SQL? Maybe have a job deployed that can somehow receive and run ad hoc queries?
I don't think it's possible at the moment, if I understand your question. Right now the Beam SDK will always build a fat jar which implements the pipeline and includes all pipeline dependencies, and it will not be able to accept lightweight ad hoc queries.
If you're interested in a more interactive experience in general, you can look at the ongoing efforts to make Beam more interactive, for example:
SQL shell: https://s.apache.org/beam-sql-packaging . This describes a work-in-progress Beam SQL shell, which should allow you to quickly execute small SQL queries locally in a REPL environment, so that you can interactively explore your data and design the pipeline before submitting a long-running job. This does not change how the job gets submitted to Flink (or any other runner), though. So after you submit the long-running job, you will likely still have to use the normal job management tools you currently have to control it.
Python: https://s.apache.org/interactive-beam . This describes an approach to wrapping an existing runner in an interactive wrapper.