Receiving email on GAE: still 60s to complete processing? - google-app-engine

My application is doing some juggling of email attachments. So far it's clocking in around 20s and everything works fine. But if I send larger attachments and it passes 60s, is it going to break?

The App Engine documentation does not say whether the mail reception servlet has a timeout of 60 seconds or 10 minutes, so it's hard to say.
In any case, I would recommend doing the following in the servlet that handles /_ah/mail:
Store the mail content in Cloud Storage or the Blobstore
Start a task to process this mail
That way you will take advantage of the retry capabilities of tasks, and you'll have 10 minutes to process your mail.
If you believe your task may take more than 10 minutes, you can either break it up into smaller tasks (chained or parallel, depending on your use case) or use modules to go beyond the 10-minute limit. Note that modules will not stay up forever, and you should not expect to run 4-hour tasks on modules, for example.
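The store-then-enqueue pattern described above can be sketched roughly as follows. This is only an illustration: BLOBS and TASKS are hypothetical in-memory stand-ins for Cloud Storage and the task queue, and the function names are made up for the example rather than taken from any App Engine API.

```python
import hashlib
from collections import deque

# Hypothetical in-memory stand-ins for Cloud Storage and a task queue.
BLOBS = {}
TASKS = deque()

MESSAGE = b"From: a@example.com\r\n\r\nbody"


def handle_incoming_mail(raw_message: bytes) -> str:
    """Fast path for the mail handler: persist the mail, enqueue a task, return."""
    key = hashlib.sha256(raw_message).hexdigest()
    BLOBS[key] = raw_message         # a real app would write to storage here
    TASKS.append({"blob_key": key})  # a real app would add a queue task here
    return key


def process_mail_task(task: dict) -> int:
    """Slow path, run later by the task queue with its longer deadline."""
    raw = BLOBS[task["blob_key"]]
    return len(raw)  # placeholder for the real attachment processing


key = handle_incoming_mail(MESSAGE)
result = process_mail_task(TASKS.popleft())
```

The handler itself does almost no work, so it stays well inside whichever deadline applies to mail reception, and a failed task can be retried because the raw message is still in storage.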

Related

Best way for running long Python scripts on GCP

We are starting a new project in our company where we basically run a few Python scripts for each client, twice a day.
So the idea is that, twice a day, a Cloud Function will be triggered, and the function will trigger the Python script for each client, creating new instances of App Engine / Cloud Run or any other serverless service Google offers.
At the beginning we thought of using Cloud Functions, but we quickly found out that they are not suited to long-running Python scripts. The scripts will eventually calculate and collect different information for each client and write it to Firebase.
The flow of the process would be: Cloud Function triggered -> function triggers a GCP instance for each client -> script runs for each client -> output is saved to Firebase.
What would be the recommended way to do this without a dedicated server? Which GCP serverless services would fit best?
There are a lot of great answers! The key here is to decouple and to distribute the processing.
For decoupling, you can use Cloud Tasks (where you can add flow control with a rate limit, or postpone a task to a later time) or Pub/Sub (a simpler message queueing solution).
Cloud Run is required if the processing can run for up to 15 minutes, but you will have to fine-tune it (see my tips below).
So, to summarize the process:
You have to trigger a Cloud Function twice a day. You can use Cloud Scheduler for that.
The triggered Cloud Function gets the list of clients (from a database?) and, for each client, creates a task in Cloud Tasks (or a message in Pub/Sub).
Each task (or message) calls an HTTP endpoint on Cloud Run that performs the processing for one client. Set the timeout to 15 minutes on Cloud Run.
However, if your processing is compute-intensive, you have to tune Cloud Run. If the processing takes 15 minutes for 1 client on 1 vCPU, you can't process more than 1 client per CPU without risking the timeout (2 clients on the same CPU can take about 30 minutes in total, and you can reach the timeout). For that reason, I recommend setting the concurrency parameter of Cloud Run to 1, so that each instance processes only one request at a time. (Of course, if you give Cloud Run 2 or 4 CPUs, you can also increase the concurrency to 2 or 4 to allow parallel processing on the same instance, but on different CPUs.)
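The reasoning above can be written as a small back-of-the-envelope calculation. This is only a sketch under a simplified model that I am assuming here: requests are single-threaded and CPU-bound, the vCPUs are shared evenly between concurrent requests, and the request timeout is 15 minutes. The function name is made up for the example.

```python
def max_safe_concurrency(cpu_minutes_per_request: int, vcpus: int,
                         timeout_minutes: int) -> int:
    """Largest concurrency at which every request should still finish in time.

    Simplified model (an assumption): requests are single-threaded and
    CPU-bound, and the vCPUs are shared evenly between concurrent requests.
    """
    if cpu_minutes_per_request > timeout_minutes:
        return 0  # even a request running alone would time out
    return (vcpus * timeout_minutes) // cpu_minutes_per_request


print(max_safe_concurrency(15, 1, 15))  # 1
print(max_safe_concurrency(15, 4, 15))  # 4
```

Real workloads rarely share CPU this evenly, so treat the result as a starting point and confirm it by observing request latency.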
If the processing is not CPU-intensive (you make API calls and wait for the answers), it's harder to say. Try a concurrency of 5, 10, 30, ... and observe the behaviour and latency of the processed requests. No worries: with Cloud Tasks and Pub/Sub you can set retry policies in case of a timeout.
One last thing: is your processing idempotent? I mean, if you run the same process twice for the same client, is the result still correct, or is it a problem? Try to make the solution idempotent so that it tolerates retries and, more generally, the issues that can happen in distributed computing (including replays).
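One common way to make per-client processing idempotent is to key each run by client and run identifier, and skip work that has already been recorded. The sketch below assumes a hypothetical in-memory `processed` store (a real system would use a database) and a made-up `run_id` format; it is an illustration, not a prescribed design.

```python
# Hypothetical store of finished work, keyed by (client_id, run_id).
processed = {}


def process_client(client_id: str, run_id: str) -> dict:
    """Process one client for one scheduled run; safe to call repeatedly."""
    key = (client_id, run_id)
    if key in processed:
        # A retry or replayed message: return the stored result, do no work.
        return processed[key]
    result = {"client": client_id, "run": run_id}  # placeholder for real work
    processed[key] = result
    return result


first = process_client("acme", "2024-01-01-am")
second = process_client("acme", "2024-01-01-am")  # simulated retry
```

With this shape, a task retried after a timeout writes nothing new for a run that has already completed.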
@NoCommandLine's answer is a good recommendation, and Cloud Run is also a good option if you need longer-running operations, as its request timeout can be set anywhere between 5 minutes (the default) and 60 minutes. You can set or update the request timeout through the Cloud Console, the command line, or YAML.
Meanwhile, a Cloud Function has an execution time of only 1 minute by default, which can be raised to a maximum of 9 minutes.
You can check out the full documentation below:
Requesting Timeout for Cloud Run
Requesting Timeout for Cloud Function
You can also check a related SO question through this link.
You can execute "long" running Google App Engine (GAE) Tasks using Cloud Tasks.
How long (which is why I have it in quotes) depends on the kind of scaling that you are using for your GAE Project Instance. Instances which are set to 'automatic scaling' are limited to a maximum of 10 minutes while instances which are set to 'manual' or 'basic' have up to 24 hours execution time.
From the earlier link
....all workers must send an HTTP response code (200-299) to the Cloud Tasks service, in this instance before a deadline based on the instance scaling type of the service: 10 minutes for automatic scaling or up to 24 hours for manual scaling. If a different response is sent, or no response, the task is retried....
Update (there seems to be some confusion between 30 minutes and 24 hours):
Standard HTTP Requests have a maximum execution time of 30 minutes (source) while GAE Endpoints can run for up to 24 hours if you're using manual scaling (source)

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see in the following screenshot, Cloud Tasks gives an ETA of about 3 mins:
The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent, etc.) and if you don't specify an execution time of a task (e.g., eta) then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.
I see from your screenshot that you have a task with an ETA of 2 min 49 sec, which is the time until your task will be run. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you use when queueing your tasks, especially the schedule_time attribute. Check your code to see whether you are adding a delay to your tasks, and make sure to tune it down.
Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, pubsub was much slower in my tests... about 40ms per message to queue a message.

Preparing for a flash crowd on Google App Engine

I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many is adequate. I expect a much larger burst in load this time - from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500/second. Is this the rate at which you can push tasks, and if not, then is there a cap on that? I'm guessing not, since these are probably just datastore writes, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that Idle Instances are only used when new frontend instances are being spun-up. This means that they are only used during traffic increases. When traffic is steady they are not used.
Now, if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic, and your traffic INCREASE is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
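The rule of thumb above is easy to turn into a small helper for trying out other numbers. The function and parameter names are my own; the formula is just the one from the worked example, rounded up to a whole instance.

```python
import math


def idle_instances_needed(startup_seconds: float, increase_rps: float,
                          per_instance_rps: float) -> int:
    """Idle instances needed to absorb a traffic increase during spin-up.

    Uses the rule of thumb: startup time * traffic increase / instance throughput.
    """
    return math.ceil(startup_seconds * increase_rps / per_instance_rps)


print(idle_instances_needed(20, 5, 10))  # 10
```

Plugging in your own measured startup time and per-instance throughput shows directly why shortening startup or raising throughput reduces the number of idle instances needed.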
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used during spinning up of new instances and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc..).
Finally, the number of frontend instances needed is VERY application-specific. But since you have already had a traffic increase, you should know how many requests per second one of your frontend instances can handle.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from on a regular schedule and without any user interaction. The total number of sites isn't going to be static, but they're all saved in the datastore [Table0] alongside the interval at which they're read. The interval may vary from as often as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (it will have a special flag), I need to issue an additional HTTP request to a 3rd-party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info in another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
The task grabs the first item (URL) from the datastore, executes the HTTP request, operates on the data, marks the URL record as having been worked on, and then invokes the task URL again.
So cron is basically waking up the taskqueue periodically, and the taskqueue runs recursively until it reaches some stopping point.
You can see it in action in one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.
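The cron-wakes-a-self-chaining-task pattern described above can be simulated in plain Python. This is a sketch only: `records` and `queue` are hypothetical in-memory stand-ins for the datastore and the task queue, and the example.com URLs are placeholders.

```python
from collections import deque

# Hypothetical stand-ins: URL records in a datastore, and a task queue.
records = [{"url": f"https://example.com/{i}", "done": False} for i in range(3)]
queue = deque()


def cron_handler():
    """Cron only kicks off the chain by enqueueing the first task."""
    queue.append("work")


def task_handler():
    """Process one pending record, then enqueue the next task in the chain."""
    pending = next((r for r in records if not r["done"]), None)
    if pending is None:
        return  # stopping point: nothing left, so the chain ends here
    pending["done"] = True  # placeholder for the HTTP fetch and processing
    queue.append("work")


cron_handler()
runs = 0
while queue:  # stands in for the task queue dispatching tasks
    queue.popleft()
    task_handler()
    runs += 1
```

Each task invocation handles a single record, so no individual request comes close to the deadline, however many records there are in total.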

crawler on appengine

I want to run a program continuously on App Engine. This program will automatically crawl some websites continuously and store the data in its database. Is it possible for the program to keep doing this continuously on App Engine, or will App Engine kill the process?
Note: The website to be crawled is not hosted on App Engine.
I want to run a program continuously on App Engine.
Can't.
The closest you can get is background-running scheduled tasks that last no more than 30 seconds:
Notably, this means that the lifetime of a single task's execution is limited to 30 seconds. If your task's execution nears the 30 second limit, App Engine will raise an exception which you may catch and then quickly save your work or log progress.
A friend of mine suggested the following:
Create a task queue.
Start the queue by passing it some data.
Use an exception handler to handle DeadlineExceededException.
In your handler, create a new queue for the same purpose.
This way you can run your job indefinitely. You only need to consider the CPU time and storage used.
You might want to consider Backends introduced in the newer version of GAE.
These run continuous processes
Is it possible? Yes. I have already built a solution on App Engine: wowprice.
Sharing all the details here would make my answer too long, so here is a summary.
Problem: Suppose I want to crawl walmart.com. I know that I can't crawl it in one shot (it has millions of products).
Solution: I designed my spider to break the job into smaller tasks.
Step 1: I submit a job for walmart.com, and the job scheduler creates a task.
Step 2: My spider picks up the job and notices that it is the index page. It then creates more jobs with the category pages as starting pages, which adds 20 more tasks.
Step 3: The spider then creates smaller jobs for the subcategories, and continues until it reaches a product list page, for which it creates a task.
Step 4: For product list pages, it extracts the products and makes a call to store the product data; if there is a next page, it creates one task to crawl it.
Advantages:
We can crawl without breaking the 30-second rule, and the crawling speed depends on the backend machine. It also provides parallel crawling for a single target.
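The step-by-step decomposition above amounts to a breadth-first expansion in which every page becomes its own short task. Here is a minimal simulation, assuming a hypothetical `SITE` dictionary in place of real HTTP fetches and a local deque in place of the task queue.

```python
from collections import deque

# Hypothetical site structure: page -> child pages. Pages with no entry
# are treated as product pages (leaves).
SITE = {
    "index": ["cat-a", "cat-b"],
    "cat-a": ["list-a1"],
    "cat-b": ["list-b1"],
    "list-a1": ["product-1", "product-2"],
    "list-b1": ["product-3"],
}


def crawl(start: str) -> list:
    """Expand one job into many small tasks, one page per task."""
    tasks = deque([start])
    products = []
    while tasks:
        page = tasks.popleft()      # each iteration stands for one short task
        children = SITE.get(page)
        if children is None:
            products.append(page)   # leaf: store the product data
        else:
            tasks.extend(children)  # enqueue one new task per child page
    return products


found = crawl("index")
```

On a real task queue the tasks at each level would run in parallel, which is where the parallel crawling of a single target comes from.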
They fixed it for you.
You can run background threads on a manually scaled instance.
Check https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.
Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
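The cron step of this approach (pick pages not scraped recently, enqueue only a subset) might look something like the sketch below. The function name, the `last_scraped` mapping and the batch size are all assumptions made for the illustration; a real app would query the datastore instead.

```python
from datetime import datetime, timedelta


def pick_stale_pages(last_scraped: dict, now: datetime,
                     max_age: timedelta, batch_size: int) -> list:
    """Return up to batch_size URLs last scraped before now - max_age, oldest first."""
    cutoff = now - max_age
    stale = [url for url, ts in last_scraped.items() if ts < cutoff]
    stale.sort(key=lambda url: last_scraped[url])  # oldest first
    return stale[:batch_size]


now = datetime(2024, 1, 10, 12, 0)
last_scraped = {
    "https://example.com/a": datetime(2024, 1, 1),
    "https://example.com/b": datetime(2024, 1, 9),
    "https://example.com/c": datetime(2024, 1, 5),
    "https://example.com/d": datetime(2024, 1, 3),
}
batch = pick_stale_pages(last_scraped, now, timedelta(days=2), batch_size=2)
```

Capping the batch size is what keeps each cron run short and avoids sending a burst of requests to the site being scraped.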
Try this:
Run any program on App Engine. Connect from a browser and trigger the start URL via Ajax. The Ajax call hits the server, which downloads some data from the internet and returns the next URL to your browser. This is not one request: each URL is a separate request. You only need to work out in JS how to make the Ajax calls loop over the URLs.
You can use the latest GAE service, called Backends. Check this: http://code.google.com/appengine/docs/java/backends/
Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continuously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.

Resources