Setting dataflow job dependencies using Cloud function and App engine - google-app-engine

Just a thought: my first job will be triggered by a Cloud Function on any file arrival event. I will capture its job ID in the Cloud Function itself. Once I have the job ID, I will pass it to App Engine code, which will monitor the completion of the job with that ID. Once the App Engine code sees that the job has a 'success' status, it will trigger another job that depends on the successful completion of the prior job.
Is this possible? If so, any help with code samples would be highly appreciated.
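The monitoring step described above can be sketched as a polling loop. This is only a minimal sketch under stated assumptions: `get_job_state` and `launch_next_job` are hypothetical callables you would supply yourself (for example, wrappers around the Dataflow REST method `projects.locations.jobs.get` and around whatever launches the second job), and the poll interval and timeout are placeholder values. The state names are the Dataflow job states.

```python
import time

# Dataflow job states after which the job will not change state again.
TERMINAL_STATES = {
    "JOB_STATE_DONE",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
    "JOB_STATE_UPDATED",
    "JOB_STATE_DRAINED",
}


def wait_for_job(job_id, get_job_state, poll_seconds=30, timeout_seconds=3600,
                 sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return that state.

    get_job_state is a hypothetical callable: job_id -> state string.
    Raises TimeoutError if the job is not terminal after timeout_seconds.
    """
    waited = 0
    while True:
        state = get_job_state(job_id)
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError("job %s still in state %s" % (job_id, state))
        sleep(poll_seconds)
        waited += poll_seconds


def run_dependent_job(job_id, get_job_state, launch_next_job, **wait_kwargs):
    """Launch the next job only if the first one finished successfully."""
    state = wait_for_job(job_id, get_job_state, **wait_kwargs)
    if state == "JOB_STATE_DONE":
        return launch_next_job()
    return None  # first job failed or was cancelled: do not chain
```

On App Engine a long wait may not fit inside a single request, so one option is to run a single status check per invocation from cron or a task queue rather than looping in one request.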

Related

How to detect a Flink Batch Job finishes

Currently, I have a streaming job which is firing a batch job when it receives a specific trigger.
I want to track that batch job and, when it finishes, insert an entry into a database such as Elasticsearch.
Any ideas on how we can achieve this? How can we listen to that job?
Flink provides REST APIs to query job status; you could use this one to query the batch job state: https://ci.apache.org/projects/flink/flink-docs-release-1.12/ops/rest_api.html#jobs-jobid. While tasks are running, their status is reported to the JobManager (JM). Through this API, you can read the job state from the response to the request.
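As a rough illustration of polling that endpoint, the sketch below reads the `state` field from `GET /jobs/<jobid>` and calls a callback once the job reaches a terminal state. The base URL, the poll interval, and the `on_finished` callback (where the database insert would go) are assumptions, not something from the thread.

```python
import json
import time
import urllib.request

# Flink job states after which the job will not run again.
TERMINAL_STATES = {"FINISHED", "FAILED", "CANCELED"}


def fetch_job_state(base_url, job_id):
    """Return the "state" field of the JSON returned by GET /jobs/<jobid>."""
    url = "%s/jobs/%s" % (base_url.rstrip("/"), job_id)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["state"]


def wait_until_finished(base_url, job_id, on_finished, poll_seconds=10,
                        fetch=fetch_job_state, sleep=time.sleep):
    """Poll the job state and call on_finished(job_id, state) when terminal.

    on_finished is a hypothetical callback, e.g. one that writes a record
    to a database. fetch and sleep are injectable to make testing easy.
    """
    while True:
        state = fetch(base_url, job_id)
        if state in TERMINAL_STATES:
            on_finished(job_id, state)
            return state
        sleep(poll_seconds)
```

A call might look like `wait_until_finished("http://localhost:8081", job_id, record_result)`, where the address and `record_result` are placeholders for your own JobManager REST address and callback.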

Spark job callback

Maybe you can help me with my problem.
I start a Spark job on Google Dataproc through the API. The job writes its results to Google Cloud Storage.
When it finishes, I want to get a callback to my application.
Do you know any way to get it? I don't want to poll the job status through the API each time.
Thanks in advance!
I agree that it would be nice if there were a way to either wait for or get a callback when operations such as VM creation, cluster creation, or job completion finish. Out of curiosity, are you using one of the API clients (like google-cloud-java), or are you using the REST API directly?
In the meantime, there are a couple of workarounds that come to mind:
1) Google Cloud Storage (GCS) callbacks
GCS can trigger callbacks (either Cloud Functions or Pub/Sub notifications) when you create files. You can create a file at the end of your Spark job, which will then trigger a notification. Or just add a trigger for when you put an output file on GCS.
If you're modifying the job anyway, you could also just have the Spark job call back directly to your application when it's done.
2) Use the gcloud command line tool (probably not the best choice for web servers)
gcloud already waits for jobs to complete. You can either use gcloud dataproc jobs submit spark ... to submit and wait for a new job to finish, or gcloud dataproc jobs wait <jobid> to wait for an in-progress job to finish.
That being said, if you're purely looking for a callback for choosing whether to run another job, consider using Apache Airflow + Cloud Composer.
In general, the more you tell us about what you're trying to accomplish, the better we can help :)
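The first workaround can be sketched as a Cloud Function on the bucket's object-finalize event that reacts only to a marker file. This is a sketch under assumptions: the marker name `_SUCCESS` is assumed (your job may write a different file), `notify` is a hypothetical stand-in for whatever calls back into your application, and the `(event, context)` signature is that of a background function whose event carries `bucket` and `name`.

```python
def is_completion_marker(object_name, marker="_SUCCESS"):
    """True if the object's base name equals the (assumed) marker name."""
    return object_name.rsplit("/", 1)[-1] == marker


def on_gcs_finalize(event, context=None, notify=print):
    """Background-function style handler for a GCS object-finalize event.

    event is a dict with at least "bucket" and "name".
    notify is a hypothetical callback into your own application.
    Returns True if a notification was sent, False otherwise.
    """
    name = event.get("name", "")
    if not is_completion_marker(name):
        return False  # an ordinary output file, not the end-of-job marker
    notify("gs://%s/%s" % (event.get("bucket", ""), name))
    return True
```

Filtering on a single marker file keeps the callback from firing once per output part file.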

Asynchronous requests in AppEngine

I'm building an app that essentially does the following:
Get the user to enter certain parameters.
Pass those params to the backend and start a task based on those params.
When the task is complete redirect the user to another page showing the results of the task.
The problem here is that the task is expected to take quite long. I was thus hoping to make the request asynchronous. Does App Engine allow this?
If not, what are my options? I was looking at the documentation for task queues. While it satisfies part of what I'm trying to do, I'm not clear on how the queue notifies the client when the task is complete, so that the redirect can be initiated.
Also, what if the results of the task have to be returned to the calling client itself? Is that possible?
You can't (or really shouldn't) wait for completion; GAE is not designed for that. Just launch the task, get a task ID (unique, and persisted in the app) and send the ID back to the client in the response to the launch request.
The client can then check a status page for the task, either by polling (at a reasonable rate) or simply on demand; the ID is used to find the right task. You can even add progress/ETA info to that page down the road if you so desire.
After the task completes the next status check request from the client can be redirected to the results page as you mentioned.
This Q&A might help as well, it's a very similar scenario, only using the deferred library: How do I return data from a deferred task in Google App Engine
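The launch-then-poll flow described in this answer can be sketched as follows. It is framework-agnostic and uses an in-memory dict purely as a stand-in for a persistent store such as the datastore; the function names and the `/results/<id>` URL are hypothetical.

```python
import uuid

# In-memory stand-in for a persistent store (assumption for this sketch).
TASKS = {}


def launch_task(params):
    """Record a new task and return its unique ID to send to the client."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "running", "params": params, "result": None}
    return task_id


def complete_task(task_id, result):
    """Called by the background worker when the task finishes."""
    TASKS[task_id].update(status="done", result=result)


def check_status(task_id):
    """What the status page returns for a poll or on-demand check."""
    task = TASKS.get(task_id)
    if task is None:
        return {"status": "unknown"}
    if task["status"] == "done":
        # The client is told where to go for the results.
        return {"status": "done", "redirect": "/results/" + task_id}
    return {"status": task["status"]}
```

The client keeps only the task ID, so the results page can be reached later even if the original request is long gone.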
Update:
Task Queues are preferable to the deferred library. The deferred functionality is available using the optional countdown or eta arguments to taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.

How can Google App Engine be prevented from immediately rescheduling tasks after status code 500?

I have a Google App Engine servlet that is cron configured to run once a week. Since it will take more than 1 minute of execution time, it launches a task (i.e. another servlet task/clear) on the application's default push task queue.
Now what I'm observing is this: if the task causes an exception (e.g. NullPointerException inside its second servlet), this gets translated into HTTP status 500 (i.e. HttpURLConnection.HTTP_INTERNAL_ERROR) and Google App Engine apparently reacts by immediately relaunching the same task again. It announces this by printing:
Web hook at http://127.0.0.1:8888/task/clear returned status code 500. Rescheduling..
I can see how this can sometimes be a feature, but in my scenario it's inappropriate. Can I request that Google App Engine should not do such automatic rescheduling, or am I expected to use other status codes to indicate error conditions that would not cause rescheduling by its rules? Or is this something that happens only on the dev. server?
BTW, I am currently also running other tasks (with different frequencies) on the same task queue, so throttling reschedules on the level of task queue configuration would be inconvenient (so I hope there is another/better option too.)
As per https://developers.google.com/appengine/docs/java/taskqueue/overview-push#Java_Task_execution - the task must return a response code between 200 and 299.
You can either return a status code in that range, set the taskRetryLimit in RetryOptions, or check the X-AppEngine-TaskExecutionCount header when the task launches to see how many times it has been launched and act accordingly.
I think I've found a solution: in the Java API, there is a method RetryOptions#taskRetryLimit, which serves my case.
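For the header-based option, the decision logic might look like the sketch below. It is written in Python for illustration rather than as servlet code, and it rests on an assumption: that X-AppEngine-TaskExecutionCount holds the number of earlier failed executions, starting at 0 on the first run. The helper names and the `max_attempts` parameter are hypothetical.

```python
HEADER = "X-AppEngine-TaskExecutionCount"


def previous_attempts(headers):
    """Parse the execution-count header; treat missing or bad values as 0."""
    try:
        return int(headers.get(HEADER, "0"))
    except ValueError:
        return 0


def status_after_failure(headers, max_attempts=1):
    """Pick the status code to return when the task body has failed.

    Returning a 2xx code tells the queue not to reschedule the task;
    returning 500 asks for another attempt. Assumes the header counts
    earlier failed executions (0 on the first run).
    """
    if previous_attempts(headers) + 1 >= max_attempts:
        return 200  # give up quietly: no more retries wanted
    return 500  # allow the queue to retry
```

With `max_attempts=1` a failing task is never retried; the failure should then be logged separately, since the queue will see the task as successful.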

AppEngine TaskQueue Fires, but doesn't exist

I was trying out AppEngine TaskQueues with Python. I ran some code and it created a task on the default queue. I ran the code a few more times and it created a few more tasks, as expected.
I went into the Task Queue management section and manually deleted one of the tasks. All the other tasks completed ok.
I then removed the code that created the task from my AppEngine code.
But now the task I deleted keeps getting called every 20 minutes or so. Is there something I can do to stop this task? It doesn't show up in any of the task queues.
I have tried disabling and re-enabling the application, and uploading a clean queue.yaml file.
Any Ideas?
Thanks
If you are sure this is happening, please file a bug with the label Component-TaskQueue and include your app ID.
Did you use the deferred module? If so, remove the builtins: deferred entry from your app.yaml. Then you should make the URL the task calls available again and have it return an HTTP 200 status code. (For the deferred module this is /_ah/queue/deferred.) This should stop GAE from retrying.
