I am trying to use the Flink REST API to automate the Flink job submission process via a pipeline. To call any Flink REST endpoint we need to know the JobManager web interface IP. For my POC I got the IP after running the flink-yarn-session command on the CLI, but what is the way to get it from code?
For automation, I am planning to call the following REST APIs in sequence:
requests.get('http://ip-10-0-127-59.ec2.internal:8081/jobs/overview')  # Get running job ID
requests.post('http://ip-10-0-127-59.ec2.internal:8081/jobs/:jobId/savepoints/')  # Cancel job with savepoint
requests.get('http://ip-10-0-127-59.ec2.internal:8081/jobs/:jobId/savepoints/:savepointId')  # Get savepoint status
requests.post('http://ip-10-0-127-59.ec2.internal:8081/jars/upload')  # Upload jar for new job
requests.post('http://ip-10-0-127-59.ec2.internal:8081/jars/de05ced9-03b7-4f8a-bff9-4d26542c853f_ATVPlaybackStateMachineFlinkJob-1.0-super-2.3.3.jar/run')  # Submit new job
requests.get('http://ip-10-0-116-99.ec2.internal:35497/jobs/:jobId')  # Get status of new job
If you have the flexibility to run on Kubernetes instead of YARN (it looks like you are on AWS judging by your hostnames, so you could use EKS), then I would recommend the official Flink Kubernetes Operator - it is built for exactly this purpose by the community.
If YARN is a given for your use case, then you can follow the code Flink itself uses to talk to the YARN ResourceManager in the flink-yarn module, especially the following:
https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L384
https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManagerDriver.java#L258
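If all you need from code is the JobManager web interface address for a running session, another option is to query the YARN ResourceManager REST API and read the Flink application's tracking URL, which points at the JobManager web UI. A minimal sketch, assuming a ResourceManager address you would adapt and the default "Apache Flink" application type; note that the tracking URL may go through the YARN web proxy depending on your setup:

import requests

# Assumed YARN ResourceManager web address; adjust to your cluster.
RM_ADDRESS = "http://<yarn-resourcemanager-host>:8088"

def get_flink_jobmanager_url():
    # List running applications of type "Apache Flink" via the YARN RM REST API.
    resp = requests.get(
        f"{RM_ADDRESS}/ws/v1/cluster/apps",
        params={"states": "RUNNING", "applicationTypes": "Apache Flink"},
    )
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app", [])
    if not apps:
        raise RuntimeError("No running Flink session application found")
    # trackingUrl points at (or proxies to) the Flink JobManager web interface.
    return apps[0]["trackingUrl"].rstrip("/")

base_url = get_flink_jobmanager_url()
print(requests.get(f"{base_url}/jobs/overview").json())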
Related
We are using the Flink REST API to submit jobs to Flink EMR clusters. These clusters are already running in session mode. We want to know if there is any way to pass the following Flink JobManager configuration params while submitting the job via a Flink REST API call.
s3.connection.maximum : 1000
state.backend.local-recovery: true
state.checkpoints.dir: hdfs://ha-nn-uri/flink/checkpoints
state.savepoints.dir : hdfs://ha-nn-uri/flink/savepoints
I figured out that the Flink submit-job request has a "programArgs" field and I tried using it, but the Flink JobManager configuration didn't pick up these settings:
"programArgs": f" --s3.connection.maximum 1000 state.backend.local-recovery true --stage '{ddb_config}' --cell-name '{cluster_name}'"
I want to set a job name for my Flink application written using the Table API, like I did with the Streaming API via env.execute(jobName).
I want to replace the default generated job name, but I can't find a way in the documentation except setting it while running a job from a jar:
bin/flink run -d -yD pipeline.name=MyPipelineName-v1.0 ...
flink: 1.14.5
env: Yarn
Update:
In case someone faces the same situation: Table API pipelines can be attached to the DataStream API (see "Adding Table API Pipelines to DataStream API" in the docs), which lets us set the desired job name programmatically.
Ex.:
val sinkDescriptor = TableDescriptor.forConnector("kafka")
  .option("topic", "topic_out")
  .option("properties.bootstrap.servers", "localhost:9092")
  .schema(schema)
  .format(FormatDescriptor.forFormat("avro").build())
  .build()

tEnv.createTemporaryTable("OutputTable", sinkDescriptor)

// createStatementSet() on a StreamTableEnvironment returns a StreamStatementSet
val statementSet = tEnv.createStatementSet()
statementSet.addInsert(sinkDescriptor, tA)
statementSet.attachAsDataStream()

env.execute(jobName)
Only StreamExecutionEnvironment calls setJobName on the stream graph.
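As an aside, and one I have not verified on 1.14.5: the same pipeline.name option used with -yD above can usually also be set programmatically through the table configuration, which should name jobs submitted purely through the Table API. A sketch of the idea in PyFlink syntax:

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# pipeline.name is the same option passed with -yD on the CLI; setting it in the
# table configuration is intended to name jobs submitted via the Table API directly.
t_env.get_config().get_configuration().set_string("pipeline.name", "MyPipelineName-v1.0")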
I have deployed my own Flink setup on AWS ECS: one service for the JobManager and one service for the TaskManagers. I am running one ECS task for the JobManager and 3 ECS tasks for the TaskManagers.
I have a kind of batch job which I upload via the Flink REST API every day with new arguments. Each time I submit, disk usage grows by ~600 MB. Checkpoints are configured to go to S3, and I have set historyserver.archive.clean-expired-jobs: true.
Since I am running on ECS, I am not able to find out why disk usage increases on every jar upload and execution.
What Flink config params should I look at to make sure disk usage does not keep shooting up on every new job upload?
Try these configuration options.
blob.service.cleanup.interval:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#blob-service-cleanup-interval
historyserver.archive.retained-jobs:
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#historyserver-archive-retained-jobs
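For example, in flink-conf.yaml (the values below are only illustrative):

blob.service.cleanup.interval: 3600        # seconds; cleanup interval for stale entries (job jars/blobs) in the blob caches
historyserver.archive.retained-jobs: 50    # max number of archived jobs kept per archive directory (default -1 = unlimited)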
Basically what the title says. The API and client docs state that a retry can be passed to create_task:
retry (Optional[google.api_core.retry.Retry]): A retry object used
to retry requests. If ``None`` is specified, requests will
be retried using a default configuration.
But this simply doesn't work. Passing a Retry instance does nothing and the queue-level settings are still used. For example:
from google.api_core.retry import Retry
from google.cloud.tasks_v2 import CloudTasksClient

client = CloudTasksClient()
retry = Retry(predicate=lambda _: False)  # a predicate that never retries
parent = client.queue_path("my-project", "us-central1", "my-queue")  # placeholder names
client.create_task(parent=parent, task={"app_engine_http_request": {"relative_uri": "/foo"}}, retry=retry)
This should create a task that is not retried. I've tried all sorts of different configurations, and every time it just uses whatever settings are set on the queue.
The Retry object's predicate only controls which exceptions cause the request to be retried; there is no formal indication that this parameter prevents the task itself from being retried. You may check the Retry page for details.
Google Cloud Support has confirmed that task-level retries are not currently supported. The documentation for this client library is incorrect. A feature request exists here https://issuetracker.google.com/issues/141314105.
Task-level retry parameters are available in the Google App Engine bundled service for task queuing, Task Queues. If your app is on GAE, which I'm guessing it is since your question is tagged with google-app-engine, you could switch from Cloud Tasks to GAE Task Queues.
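For illustration, this is roughly what per-task retry control looks like with the bundled taskqueue API on GAE (the handler URL and the retry limit are placeholders to adapt):

from google.appengine.api import taskqueue

# TaskRetryOptions applies to this individual task rather than the whole queue.
task = taskqueue.Task(
    url="/foo",  # placeholder worker handler
    retry_options=taskqueue.TaskRetryOptions(task_retry_limit=0),  # do not retry failed attempts
)
task.add(queue_name="default")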
Of course, if your app relies on something that is exclusive to Cloud Tasks, like the beta HTTP endpoints, the bundled service won't work (see the list of new features; don't worry about the "List Queues command", since you can always see that in the configuration you would use with the bundled service). Barring that, here are some things to consider before switching to Task Queues.
Considerations
Supplier preference - Google seems to be preferring Cloud Tasks. From the push queues migration guide intro: "Cloud Tasks is now the preferred way of working with App Engine push queues"
Lock in - even if your app is on GAE, moving your queue solution to the GAE bundled one increases your "lock in" to GAE hosting (i.e. it makes it even harder for you to leave GAE if you ever want to change where you run your app, because you'll lose your task queue solution and have to deal with that in addition to dealing with new hosting)
Queues by retry - the GAE Task Queues to Cloud Tasks migration guide section Retrying failed tasks suggests creating a dedicated queue for each set of retry parameters and then enqueuing tasks accordingly (see the sketch below). This might be a suitable way to continue using Cloud Tasks
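A minimal sketch of that per-queue approach with the Cloud Tasks Python client (the project, location, and queue names are placeholders):

from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = "projects/my-project/locations/us-central1"  # placeholder project/location

# One queue per retry policy; tasks that must not be retried go to this queue.
queue = {
    "name": f"{parent}/queues/no-retry-queue",
    "retry_config": {"max_attempts": 1},  # max_attempts includes the first attempt, so 1 = no retries
}
client.create_queue(parent=parent, queue=queue)

# Enqueue each task onto the queue whose retry policy it should follow.
task = {"app_engine_http_request": {"relative_uri": "/foo"}}
client.create_task(parent=f"{parent}/queues/no-retry-queue", task=task)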
Maybe you can help me with my problem.
I start a Spark job on google-dataproc through the API. This job writes its results to Google Cloud Storage.
When it is finished I want to get a callback to my application.
Do you know any way to get one? I don't want to poll the job status through the API each time.
Thanks in advance!
I'll agree that it would be nice if there were a way to either wait for, or get a callback for, when operations such as VM creation, cluster creation, job completion, etc. finish. Out of curiosity, are you using one of the API clients (like google-cloud-java), or are you using the REST API directly?
In the mean time, there are a couple of workarounds that come to mind:
1) Google Cloud Storage (GCS) callbacks
GCS can trigger callbacks (either Cloud Functions or Pub/Sub notifications) when you create files. You can create a marker file at the end of your Spark job, which will then trigger a notification. Or just add a trigger for when you put an output file on GCS.
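As a rough illustration of the Cloud Functions variant (the _SUCCESS marker convention and the callback URL are assumptions to adapt), a function triggered on object finalization in the output bucket could notify your application:

import requests

# Hypothetical endpoint in your application that receives the callback.
CALLBACK_URL = "https://my-app.example.com/spark-job-finished"

def on_gcs_object_finalized(event, context):
    """Background Cloud Function triggered by google.storage.object.finalize."""
    name = event["name"]
    # Only react to the marker/output file the Spark job writes when it is done.
    if not name.endswith("_SUCCESS"):
        return
    requests.post(CALLBACK_URL, json={"bucket": event["bucket"], "object": name})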
If you're modifying the job anyway, you could also just have the Spark job call back directly to your application when it's done.
2) Use the gcloud command line tool (probably not the best choice for web servers)
gcloud already waits for jobs to complete. You can either use gcloud dataproc jobs submit spark ... to submit and wait for a new job to finish, or gcloud dataproc jobs wait <jobid> to wait for an in-progress job to finish.
That being said, if you're purely looking for a callback for choosing whether to run another job, consider using Apache Airflow + Cloud Composer.
In general, the more you tell us about what you're trying to accomplish, the better we can help you :)