Spark job callback - google-app-engine

Maybe you can help me with my problem
I start spark job on google-dataproc through API. This job writes results on the google data storage.
When it will be finished I want to get a callback to my application.
Do you know any way to get it? I don't want to track job status through API each time.
Thanks in advance!

I'll agree that it would be nice if there was to either wait for or get a callback for when operations such as VM creation, cluster creation, job completion, etc finish. Out of curiosity, are you using one of the api clients (like google-cloud-java), or are you using the REST API directly?
In the mean time, there are a couple of workarounds that come to mind:
1) Google Cloud Storage (GCS) callbacks
GCS can trigger callbacks (either Cloud Functions or PubSub notifications) when you create files. You can create an file at the end of your Spark job, which will then trigger a notification. Or, just add a trigger for when you put an output file on GCS.
If you're modifying the job anyway, you could also just have the Spark job call back directly to your application when it's done.
2) Use the gcloud command line tool (probably not the best choice for web servers)
gcloud already waits for jobs to complete. You can either use gcloud dataproc jobs submit spark ... to submit and wait for a new job to finish, or gcloud dataproc jobs wait <jobid> to wait for an in-progress job to finish.
That being said, if you're purely looking for a callback for choosing whether to run another job, consider using Apache Airflow + Cloud Composer.
In general, the more you tell us about what you're trying to accomplish, we can help you better :)

Related

Deploying a python bot script on Google Cloud Run (GCR)

I have been racking my brains on this for a few weeks now, trying different variations from Google Cloud service offerings but can't seem to find the proper one.
I have a python script with dependencies etc, that I have containerized, pushed, and deploy to GCR.
The script is a bot that connects to an external websocket receiving signals perpetually to then do other processing via API against another external service.
What would be the best service offering from Google Cloud to run this?
So far, I've tried:
GCR Web Service - requires listening service (:8080) which I do not provide in this use case, and, it scales down your service when there is no traffic so no go.
GCR Job Service - Seems like the next ideal option (no HTTP port requirement) - however, since the script (my entry point), upon launch, doesn't 'return' unless it quits, the service launch just allows it to run for a minute or so, until the jobs API declares it as 'failed' - basically, it is launching it via my entry point which just executes the script as if it was running locally and my script isn't meant to return anything.
To try and get around this, I went the google's recommended way and built a main.py with they standard boilerplate, and built it as a wrapper to act as a launcher for the actual script. I did this via a simple subprocess.Popen using their sample main.py as shown below.
main.py
import json
import os
import sys
import subprocess
# Retrieve Job-defined env vars
TASK_INDEX = os.getenv("CLOUD_RUN_TASK_INDEX", 0)
TASK_ATTEMPT = os.getenv("CLOUD_RUN_TASK_ATTEMPT", 0)
# Define main script
def main():
print(f"Starting Task #{TASK_INDEX}, Attempt #{TASK_ATTEMPT}...")
subprocess.Popen(["python3", "myscript.py"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
print(f"Completed Task #{TASK_INDEX}.")
# Start script
if __name__ == "__main__":
try:
main()
except Exception as err:
message = f"Task #{TASK_INDEX}, " \
+ f"Attempt #{TASK_ATTEMPT} failed: {str(err)}"
print(json.dumps({"message": message, "severity": "ERROR"}))
sys.exit(1) # Retry Job Task by exiting the process
My thinking being, this would allow the job to execute my script and mark the job as completed, while the actual script remains running. Also, since subprocess.Popen sets its stdout and stderr to PIPE, my thinking is it would get caught by the google logging and I would see the output.
The job runs and marks it as succeed, however, I see no indication of the actual script executing anywhere.
I had similar issue with Google Cloud functions. Jobs seemed like an ideal option since I can run on their scheduler to make sure it is launching after saying, every hour (my script uses a lock file so it doesn't run again if running).
Am I just missing the point on how these cloud services run?
Do offerings like google cloud run jobs/functions, etc meant to execute jobs that return and quit until launched again by however scheduled?
Do I need to consider Google Computing engine as an option for this use case that is, a full running VM instead of stateless/serverless options?
I am trying to use this in a containerized, scalable as needed, fashion to both make my project portable and minimize costs as much as possible given the always running nature of the job.
Lastly, I know services like pythonanywhere as I am sure others, make this kinda stuff easier, but I would like to learn how to do this via standard cloud offerings like GCR, AWS, etc.
thanks for any insight / advice!
Cloud Run best fit is for HTTP Rest APIs serving (stateless services). There are also Jobs in beta.
One of the top feature of Run is that it scales to 0, when there are not requests to your service (your service instance gets totally destroyed).
If your bot needs to stay alive "for ever", Run is not for you... (Even if you can configure Run to keep at least one instance live).
I would consider instead AppEngine or Compute.

Query on automating Flink Job submission

I am trying to use Flink REST APIs to automate Flink job submission process via pipeline. To call any Flink Rest endpoint we should be aware about the Job Manager Web interface IP. For my POC, I got the IP after running flink-yarn-session command on CLI, but what is the way to get it from code?
Fo automation, I am planning to call following REST API in sequence
request. get('http://ip-10-0-127-59.ec2.internal:8081/jobs/overview') // Get Running job Id
requests.post('http://ip-10-0-127-59.ec2.internal:8081/jobs/:jobID/savepoints/') // Cancel job with savepoint
requests.get('http://ip-10-0-127-59.ec2.internal:8081/jobs/:JobId/savepoints/
:savepointId') // Get savepoint status
requests. Post("http://ip-10-0-127-59.ec2.internal:8081/jars/upload"). // Upload jar for new job
requests.post(
"http://ip-10-0-127-59.ec2.internal:8081/jars/de05ced9-03b7-4f8a-bff9-4d26542c853f_ATVPlaybackStateMachineFlinkJob-1.0-super-2.3.3.jar/run") // submit new job
requests.get('http://ip-10-0-116-99.ec2.internal:35497/jobs/:jobId') // Get status of new job
If you have the flexibility to run on Kubernetes instead on Yarn (looks like you are on AWS from your hostnames, so you could use EKS) then I would recommend using the official Flink Kubernetes Operator - it is built for exactly this purpose by the community.
If Yarn is a given for your use case then you may follow the code examples that Flink uses to talk to the Yarn ResourceManager in the flink-yarn package, especially the following:
https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterDescriptor.java#L384
https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManagerDriver.java#L258

Cloud Tasks client ignores retry configuration

Basically what the title says. The API and client docs state that a retry can be passed to create_task:
retry (Optional[google.api_core.retry.Retry]): A retry object used
to retry requests. If ``None`` is specified, requests will
be retried using a default configuration.
But this simply doesn't work. Passing a Retry instance does nothing and the queue-level settings are still used. For example:
from google.api_core.retry import Retry
from google.cloud.tasks_v2 import CloudTasksClient
client = CloudTasksClient()
retry = Retry(predicate=lambda _: False)
client.create_task('/foo', retry=retry)
This should create a task that is not retry. I've tried all sorts of different configurations and every time it just uses whatever settings are set on the queue.
You can pass a custom predicate to retry on different exceptions. There is no formal indication that this parameter prevents retrying. You may check the Retry page for details.
Google Cloud Support has confirmed that task-level retries are not currently supported. The documentation for this client library is incorrect. A feature request exists here https://issuetracker.google.com/issues/141314105.
Task-level retry parameters are available in the Google App Engine bundled service for task queuing, Task Queues. If your app is on GAE, which I'm guessing it is since your question is tagged with google-app-engine, you could switch from Cloud Tasks to GAE Task Queues.
Of course, if your app relies on something that is exclusive to Cloud Tasks like the beta HTTP endpoints, the bundled service won't work (see the list of new features, and don't worry about the "List Queues command" since you can always see that in the configuration you would use in the bundled service). Barring that, here are some things to consider before switching to Task Queues.
Considerations
Supplier preference - Google seems to be preferring Cloud Tasks. From the push queues migration guide intro: "Cloud Tasks is now the preferred way of working with App Engine push queues"
Lock in - even if your app is on GAE, moving your queue solution to the GAE bundled one increases your "lock in" to GAE hosting (i.e. it makes it even harder for you to leave GAE if you ever want to change where you run your app, because you'll lose your task queue solution and have to deal with that in addition to dealing with new hosting)
Queues by retry - the GAE Task Queues to Cloud Tasks migration guide section Retrying failed tasks suggests creating a dedicated queue for each set of retry parameters, and then enqueuing tasks accordingly. This might be a suitable way to continue using Cloud Tasks

How to auto-start an App-Engine backend when a pull-queue has tasks?

It looks like I can create a push-queue that will start backends to process tasks and that I can limit the number of workers to 1. However, is there a way to do this with a pull-queue?
Can App-Engine auto-start a named backend when a pull-queue has tasks and then let it expire when idle and the queue is empty?
It looks like I just need some way to call an arbitrary URL to "notify" it that there are tasks to process but I'm unable to find any documentation on how this can be done.
Use a cron task or a push queue to periodically start the backend. The backend can loop through the tasks (if any) in the pull queue, and then expire.
There isn't a notification system for pull queues, only inspection through queue statistics and empty/non-empty lease results.
First of all you need to decide scalability type you want to use for your module. I think that you should take a look to Basic Scaling (https://developers.google.com/appengine/docs/java/modules/)
Next, to process tasks from pull queue you can use Cron to check queues every several minutes. It will be important to request not basic scaling module, but frontend module, cause cron will start instances. The problem is that you will still need to pay for at least one instance of frontend cause your cron job will not allow it to shutdown.
So the solution could be the following:
Start cron every 1 or 5 minutes and call frontend
Check queue in frontend and issue URLFetch request to basic scaling module if there are tasks in pull queue
Process tasks in queue using basic scaling module
If you use F1 instances for frontend and b2 or greate for other modules it could save you some money.

How to use pull queue on dev server of Google App Engine

We have been using push queue for a very long time and have no problems in consuming the tasks from a dev server.
However during implementing a new service with pull queue, it became difficult to figure out how to do the same thing on the dev server.
Basically from the docs, what we can see is that you should use a REST api (we can't use the direct queue api as it is consumed by an external app) to lease/delete a task with the end point of
https://www.googleapis.com/taskqueue/v1beta1/projects/taskqueues
But obviously this will not work in local dev server, and it appears that no place have talking about this.
Just wondering if anyone had ever run into the same issue had can shed some light?
With Pull Queue, task consumer can be internal or external.
If you need it to work on dev server, then just create a handler (a servlet) and use internal API to add, lease and delete tasks.

Resources