DeadlineExceededError in GAE + BQ despite several steps taken to avoid it - google-app-engine

I have a BigQuery query that takes perhaps a minute several BigQuery queries that each take around 10-30 seconds to run that I have been trying to execute from Google App Engine. At one or more places in the call stack, an HTTP request is being killed with a DeadlineExceededError. Sometimes the DeadlineExceededError (unsure which kind) is raised as is, and sometimes it is translated to an HTTPException.
Following leads found in different SO posts, I have taken various steps to avoid the timeout:
Run the query in a task that is added to a GAE TaskQueue, setting the task_age_limit to 10m. (1)
Pass a timeoutMs flag to getQueryResults (called on a job object in Google's Python API) using a value of 599 * 1000 ~ 10 minutes. (2)
Just before the call to getQueryResults, call urlfetch.set_default_fetch_deadline(60), every time, in an attempt to ensure that the setting is local to the thread that is making the call. (3, 4, 5)
A gist of the relevant part of a typical stack trace can be found here. In a typical task execution, there will be a number of failures and then finally, perhaps, a success.
This answer seems to be saying that a urlfetch call will not be allowed to exceed 60 seconds on GAE, in any context (including a task). I doubt the queries are exceeding the hard limit in my case, so I'm probably missing an important step. Has anyone run into a similar situation and figured out what was going on?

Related

Best way for running long Python scripts on GCP

We are starting a new project in our company where we basically run few Python scripts for each client, twice a day.
So the idea is, twice a day a Cloud Function will be triggered where the function will trigger the Python script for each client creating new instances of App Engine / Cloud Run or any other serverless service Google's offer.
At the begining we though of using Cloud Functions, but very quickly we found out they are not suited for long running Python scripts, the scripts will eventually calculate and collect different information for each client and write them to Firebase.
The flow of the processes would be: Cloud Function triggered -> function trigger GCP instance for each client -> script running for each client -> out put is being saved to Firebase.
What would be the recommended way to do it without a dedicated server, which GCP serverless services would fit the most?
There is a lot of great answers! The key here is to decouple and to distribute the processing.
When you talk about decoupling you can use Cloud Task (where you can add flow control with rate limit or to postpone a task in the future) or PubSub (more simple message queueing solution).
And Cloud Run is a requirement to run up to 15 minutes processing. But you will have to fine tune it (see below my tips)
So, to summarize the process
You have to trigger a Cloud Functions twice a day. You can use Cloud Scheduler for that.
The triggered Cloud Functions get the list of clients (in database?) and for each client, create a task on Cloud Task(or a message in PubSub)
Each task (or message) call a HTTP endpoint on Cloud Run that perform the process for each client. Set the timeout to 30 minutes on Cloud Run.
However, if your processing is compute intensive, you have to tune Cloud Run. If the processing take 15 minutes for 1 client on 1vCPU, that mean you can't process more than 1 client per CPU if you don't want to reach the timeout (2 clients can lead you to take about 30 minutes for both on the same CPU and you can reach the timeout). For that, I recommend you to set the concurrency parameter of Cloud Run to 1, to process only one request at a time (of course, if you set 2 or 4 CPU on Cloud Run you can also increase the concurrency parameter to 2 or 4 to allow parallel processing on the same instance, but on different CPU).
If the processing is not CPU intensive (you perform API call and you wait the answer) it's harder to say. Try with a concurrency of 5, 10, 30,... and observe the behaviour/latency of the processed requests. No worries, with Cloud Task and PubSUb you can set retry policies in case of timeout.
Last things: is your processing idempotent? I mean, if you run 2 time the same process for the same client, is the result correct or is it a problem? Try to make the solution idempotent to overcome retry issues and globally issues that can happen on distributed computing (including the replays)
#NoCommandLine's answer is a best recommendation and Cloud Run is also a good option if you want to set longer running operations as timeout could be set between 5 minutes (as default) and 60 minutes. You can set or update request timeout through either Cloud Console, command line or YAML.
Meanwhile, execution time for Cloud Function only has 1 minute (by default) and could be set to 9 minutes maximum.
You can check out the full documentation below:
Requesting Timeout for Cloud Run
Requesting Timeout for Cloud Function
You can also check a related SO question through this link.
You can execute "long" running Google App Engine (GAE) Tasks using Cloud Tasks.
How long (which is why I have it in quotes) depends on the kind of scaling that you are using for your GAE Project Instance. Instances which are set to 'automatic scaling' are limited to a maximum of 10 minutes while instances which are set to 'manual' or 'basic' have up to 24 hours execution time.
From the earlier link
....all workers must send an HTTP response code (200-299) to the Cloud
Tasks service, in this instance before a deadline based on the
instance scaling type of the service: 10 minutes for automatic scaling
or up to 24 hours for manual scaling. If a different response is sent,
or no response, the task is retried....
Adding Update (there's seems to be some confusion between 30 mins vs 24 hours)
Standard HTTP Requests have a maximum execution time of 30 minutes (source) while GAE Endpoints can run for up to 24 hours if you're using manual scaling (source)

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see is the following screenshot, Cloud Tasks gives an ETA of about 3 mins:
The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent, etc.) and if you don't specify an execution time of a task (e.g., eta) then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.
I see from your screenshot that you have a task with an ETA of 2 min 49 sec which is the time until your task will be run. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you are using when queueing your tasks, especially the scheduled_time attribute. Check your code to see if you are adding a delay to your tasks, and make sure to tune it down.
Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, pubsub was much slower in my tests... about 40ms per message to queue a message.

What is the difference between appengine datastore timeout errors 5 and 11?

I'm trying to speed up a Google App Engine request handler that has a big datastore PutMulti call (500 entities) by splitting it into batches of entities and running concurrent goroutines to send smallerPutMulti calls (100 entities each).
Before this, I had often been getting the datastore error Call error 11: Deadline exceeded (timeout) from my PutMulti calls going over the deadline when I tested the handler on many concurrent requests. After the parallelization, the handler did speed up, but I still occasionally got that error and also another type of error, API error 5 (datastore_v3: TIMEOUT): The datastore operation timed out, or the data was temporarily unavailable.
Is this error 5 due to contention in the datastore, and what is the difference between errors 5 and 11?
These errors come from two different places, the first, the call error, is a local error that is caused by a timeout in the RPC client. It indicates that there was a timeout waiting for completion of an RPC. The default RPC timeout in google.golang.org/appengine is 60 seconds.
The second error comes from the service side. This error indicates that a timeout occurred performing operations within datastore. Some of these operations have timeouts much shorter than 60s, and typically this may indicate contention.
A possibly simpler way to understand the differences is that you will find that if you make a single multi operation with a very large number of changes, you can trigger the first timeout with ease. If you create a significant number of concurrent operations against a single key or small set of keys, you will more readily trigger the latter. As timeouts are general indicators of saturation of shared resources, there are of course many ways and combinations to generate them. In general, one will want to retry operations as appropriate, and also size operations appropriately, as well as aggregating operations on hot keys as best as possible to reduce the chance of contention related issues. As others have suggested, the python and java docs have some examples of this already.
You may wish to make use of https://godoc.org/google.golang.org/appengine#IsTimeoutError and if you need to increase the timeout for the first error class, you may be able to adjust the context deadline, see the methods here: https://godoc.org/golang.org/x/net/context#WithDeadline Note: you will not be able to extend the deadline beyond that of a request deadline, however, if you are running in tasks or VMs you can extend to long deadlines.
The first error you see may be just the timeout in normal operation, the 2nd is likely because of write contention. More on this: Handling Datastore Errors https://cloud.google.com/appengine/articles/handling_datastore_errors

How to create large number of entities in Cloud Datastore

My requirement is to create large number of entities in Google Cloud Datastore. I have csv files and in combine number of entities can be around 50k. I tried following:
1. Read a csv file line by line and create entity in the datstore.
Issues: It works well but it timed out and cannot create all the entities in one go.
2. Uploaded all files in Blobstore and red them to datastore
Issues: I tried Mapper function to read csv files uploaded in Blobstore and create Entities in datastore. Issues i have are, mapper does not work if file size go larger than 2Mb. Also I simply tried to read files in a servlet but again timedout issue.
I am looking for a way to create above(50k+) large number of entities in datastore all in one go.
Number of entities isn't the issue here (50K is relatively trivial). Finishing your request within the deadline is the issue.
It is unclear from your question where you are processing your CSVs, so I am guessing it is part of a user request - which means you have a 60 second deadline for task completion.
Task Queues
I would suggest you look into using Task Queues, where when you upload a CSV that needs processing, you push it into a queue for background processing.
When working with Tasks Queues, the tasks themselves still have a deadline, but one that is larger than 60 seconds (10 minutes when automatically scaled). You should read more about deadlines in the docs to make sure you understand how to handle them, including catching the DeadlineExceededError error so that you can save when you are up to in a CSV so that it can be resumed from that position when retried.
Caveat on catching DeadlineExceededError
Warning: The DeadlineExceededError can potentially be raised from anywhere in your program, including finally blocks, so it could leave your program in an invalid state. This can cause deadlocks or unexpected errors in threaded code (including the built-in threading library), because locks may not be released. Note that (unlike in Java) the runtime may not terminate the process, so this could cause problems for future requests to the same instance. To be safe, you should not rely on theDeadlineExceededError, and instead ensure that your requests complete well before the time limit.
If you are concerned about the above, and cannot ensure your task completes within the 10 min deadline, you have 2 options:
Switch to a manually scaled instance which gives you are 24 hour deadline.
Ensure your tasks saves progress and returns an error well before the 10 min deadline so that it can be resumed correctly without having to catch the error.

Backend "Process moved to a different machine" and fails withh error 500

I have a process that takes around five minutes to complete. It runs on a cron job every two hours in a backend instance.
Recently the process has started to fail; not every time but a few times a day. First thing that happens is that the memcache starts to throw exceptions:
04:21:13.640 com.google.appengine.api.memcache.LogAndContinueErrorHandler handleServiceError: Service error in memcache
com.google.appengine.api.memcache.MemcacheServiceException: Memcache get: exception getting 1 key (ItemFollowableCompleted:RegionUS:P8XD:0)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$RpcResponseHandler.handleApiProxyException(MemcacheServiceApiHelper.java:68)
at com.google.appengine.api.memcache.MemcacheServiceApiHelper$1.absorbParentException(MemcacheServiceApiHelper.java:109)
None of these are fatal exceptions but a few seconds later the process terminated without warning or shutdown message. Logs show
04:21:30.591 Process moved to a different machine.
and an error 500.
Is this a google infrastructure problem related to memcache or is there something in the app code that could be causing it?
No, it's not an error in Google infrastructure. Your process is expected to be moved among instances when needed (maintenance, more demand from your side, ...), and there's nothing you can do to prevent it.
Nonetheless there are a few things you could do to alleviate any effect this could have in your app.
Look [1] for some suggestions on how to keep track of your pending jobs when your instance is shut down and also have a look at the background threads.
I'm guessing you're using Python, if not, look for your corresponding language.
[1] https://developers.google.com/appengine/docs/python/backends/#Python_Backend_states
I have the same problem when I use ndb.putmulti() to load data. I tried a few things
1. increase my backends machine size, I moved to B4_1G
2. sleep between ndb.putmulti() (2 minutes for every 200 entities)
3. Dedicated memcache (1G)
1 and 2 were not very helpful, 3 seems to help.
I think rapid updates to ndb datastore affecting memcache is the root cause in my case. I could not find any other way besides paying for dedicated memcache.
I also met the issue "Process moved to a different machine" in the backend module too.
The issue context is as below:
Get the query result from one KIND
Iterating each entity in the query result, I will do some tasks and write new entities to different KINDs
The "Process moved to a different machine" happens during the half of iterating
After some experiments, I found it is due to "too many writing transactions in one request". Everything is fine when the size of query result is small, but cause problem when it becomes larger.
The final solution I took is to use Task Queue, the work should be done for a entity is looked as one task and be put into the PushQueue. So the issue is gone.
Hope this will help :)

Resources