Best way for running long Python scripts on GCP

Best way for running long Python scripts on GCP - google-app-engine

We are starting a new project in our company where we basically run few Python scripts for each client, twice a day.
So the idea is, twice a day a Cloud Function will be triggered where the function will trigger the Python script for each client creating new instances of App Engine / Cloud Run or any other serverless service Google's offer.
At the begining we though of using Cloud Functions, but very quickly we found out they are not suited for long running Python scripts, the scripts will eventually calculate and collect different information for each client and write them to Firebase.
The flow of the processes would be: Cloud Function triggered -> function trigger GCP instance for each client -> script running for each client -> out put is being saved to Firebase.
What would be the recommended way to do it without a dedicated server, which GCP serverless services would fit the most?

There is a lot of great answers! The key here is to decouple and to distribute the processing.
When you talk about decoupling you can use Cloud Task (where you can add flow control with rate limit or to postpone a task in the future) or PubSub (more simple message queueing solution).
And Cloud Run is a requirement to run up to 15 minutes processing. But you will have to fine tune it (see below my tips)
So, to summarize the process
You have to trigger a Cloud Functions twice a day. You can use Cloud Scheduler for that.
The triggered Cloud Functions get the list of clients (in database?) and for each client, create a task on Cloud Task(or a message in PubSub)
Each task (or message) call a HTTP endpoint on Cloud Run that perform the process for each client. Set the timeout to 30 minutes on Cloud Run.
However, if your processing is compute intensive, you have to tune Cloud Run. If the processing take 15 minutes for 1 client on 1vCPU, that mean you can't process more than 1 client per CPU if you don't want to reach the timeout (2 clients can lead you to take about 30 minutes for both on the same CPU and you can reach the timeout). For that, I recommend you to set the concurrency parameter of Cloud Run to 1, to process only one request at a time (of course, if you set 2 or 4 CPU on Cloud Run you can also increase the concurrency parameter to 2 or 4 to allow parallel processing on the same instance, but on different CPU).
If the processing is not CPU intensive (you perform API call and you wait the answer) it's harder to say. Try with a concurrency of 5, 10, 30,... and observe the behaviour/latency of the processed requests. No worries, with Cloud Task and PubSUb you can set retry policies in case of timeout.
Last things: is your processing idempotent? I mean, if you run 2 time the same process for the same client, is the result correct or is it a problem? Try to make the solution idempotent to overcome retry issues and globally issues that can happen on distributed computing (including the replays)

#NoCommandLine's answer is a best recommendation and Cloud Run is also a good option if you want to set longer running operations as timeout could be set between 5 minutes (as default) and 60 minutes. You can set or update request timeout through either Cloud Console, command line or YAML.
Meanwhile, execution time for Cloud Function only has 1 minute (by default) and could be set to 9 minutes maximum.
You can check out the full documentation below:
Requesting Timeout for Cloud Run
Requesting Timeout for Cloud Function
You can also check a related SO question through this link.

You can execute "long" running Google App Engine (GAE) Tasks using Cloud Tasks.
How long (which is why I have it in quotes) depends on the kind of scaling that you are using for your GAE Project Instance. Instances which are set to 'automatic scaling' are limited to a maximum of 10 minutes while instances which are set to 'manual' or 'basic' have up to 24 hours execution time.
From the earlier link
....all workers must send an HTTP response code (200-299) to the Cloud
Tasks service, in this instance before a deadline based on the
instance scaling type of the service: 10 minutes for automatic scaling
or up to 24 hours for manual scaling. If a different response is sent,
or no response, the task is retried....
Adding Update (there's seems to be some confusion between 30 mins vs 24 hours)
Standard HTTP Requests have a maximum execution time of 30 minutes (source) while GAE Endpoints can run for up to 24 hours if you're using manual scaling (source)

Related

Cloud Tasks with an automatically scaling App Engine that last longer than 10 minutes

I have a backend for an iOS app that I've built on App Engine and I'm looking to do potentially long running background tasks to add records to my Cloud SQL database. Is this possible without Compute Engine? I've seen Cloud Tasks can do asynchronous work and you can set the dispatchDeadline to basically anything you want, but I've also read in the documentation
For App Engine tasks, 0 indicates that the request has the default deadline. The default deadline depends on the scaling type of the service: 10 minutes for standard apps with automatic scaling, 24 hours for standard apps with manual and basic scaling, and 60 minutes for flex apps. If the request deadline is set, it must be in the interval [15 seconds, 24 hours 15 seconds]. Regardless of the task's dispatchDeadline, the app handler will not run for longer than than the service's timeout. We recommend setting the dispatchDeadline to at most a few seconds more than the app handler's timeout. For more information see Timeouts.
I don't particularly need the App Engine instance to care if the task completes or not... so I'm not sure why the recommendation is at most a few seconds more than the app handler's timeout ... can anyone shed any light on this? What am I missing? Adding a Compute Engine for these relatively simple tasks that will take at most a few ours to complete seems like a lot of overhead and I don't want this to dictate which scaling options I choose.
Thanks for your time.

The recommendation is only for logging purpose. If your task timeout is shorter that your app timeout, you never know if there is an error from your app, because you don't have the return.
If you have longer timeout on Cloud Task, you can catch and trace in Cloud Task logs the app return code and thus gracefully track the errors.
App Engine with a basic scaling mode is a great solution.
You have 9H free per days (B instance type)
App Engine scale to 0 automatically after a period of inactivity (that you can define in the basic scaling: idle_timeout parameter)
You have a regional available service. Not a zonal like a compute engine, or you need to have 9 computes engine, to cover the regional High Availability (3 per zone, over 3 zones)
You don't have server to manage: no update, no patching, no network/ip/firewall rule...
If you ask me about the overhead, I will anser Compute Engine and not App Engine (even if you need few configuration)

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see is the following screenshot, Cloud Tasks gives an ETA of about 3 mins:

The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent, etc.) and if you don't specify an execution time of a task (e.g., eta) then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.

I see from your screenshot that you have a task with an ETA of 2 min 49 sec which is the time until your task will be run. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you are using when queueing your tasks, especially the scheduled_time attribute. Check your code to see if you are adding a delay to your tasks, and make sure to tune it down.

Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, pubsub was much slower in my tests... about 40ms per message to queue a message.

Receiving email on GAE: still 60s to complete processing?

My application is doing some juggling of email attachments. So far it's clocking in around 20s and everything works fine. But if I send larger attachments and it passes 60s, is it going to break?

The App Engine doc does not say if the mail reception servlet have a timeout of 60s or 10 minutes, so it's hard to say.
In any case, I would recommend you perform the following in the servlet that handles /_ah/mail :
Store the mail content in Cloud Storage or blob store
Start a task to process this mail
That way you will take advantage of the retry capabilities of task, and you'll have 10 minutes to process your mail.
If you believe your task may take more than 10 minutes, you can either break up in smaller tasks (chained or parallel depending on your use case) or use modules to go beyond the 10 minutes limit. Note that modules will not stay up forever and you should not expect to perform 4 hour tasks on modules, for example.

Preparing for a flash crowd on Google App Engine

I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many is adequate. I expect a much larger burst in load this time - from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.

You must understand that Idle Instances are only used when new frontend instances are being spun-up. This means that they are only used during traffic increases. When traffic is steady they are not used.
Now if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic and you traffic INCREASE is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used during spinning up of new instances and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc..).
Fourth, number of frontend instances needed is VERY application specific. But since you already had traffic increase you should know how many requests your frontend instance can handle per second.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from at a regular schedule and without any user interaction. The total number of sites isn't going to static but they're all saved in the datastore [Table0] along side the interval at which they're read at. The interval may vary as regular as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.

Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.

GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.

I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs first item (URL) from datastore, executes HTTP request, operates on data, updates the URL record as having worked on it and the invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
You can see it in action one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight