Google App Engine Instance is abruptly shutting down on Standard Environment - google-app-engine

We are using Google App Engine to ingest a large amount of data into Google Cloud Firestore with the configuration below:
Basic scaling
instance_class: B4
basic_scaling:
  instances: 1
The overall ingestion of 20 GB takes around 1.5 hours. But we have noticed that, some time after the first hour, the instance abruptly shuts down with the error below:
Container terminated on signal 9.
According to this documentation, basic scaling can serve a request for up to 24 hours.
We cannot see any more details in the logs. We also checked the memory usage: B4 has 1024 MB and the app only uses up to 700 MB.
If anyone has faced this kind of error, your input would be valuable!

Although the instance has 1024 MB, the operating system also needs some of that space. I guess that's why it shuts down: it's out of memory.
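One way to check this hypothesis is to log the process's peak resident memory while ingesting in bounded batches. The sketch below assumes a Python runtime (the question does not say which language is used). `ingest_in_batches` and `write_batch` are hypothetical names, and `write_batch` stands in for whatever call writes to Firestore.

```python
import itertools
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def ingest_in_batches(records, write_batch, batch_size=500, log_every=10):
    """Write records in fixed-size batches so only one batch is held at a time.

    write_batch is a hypothetical callable that persists one list of records.
    Returns the number of batches written.
    """
    iterator = iter(records)
    batches = 0
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return batches
        write_batch(batch)
        batches += 1
        if batches % log_every == 0:
            print(f"batch {batches}: peak RSS {peak_rss_mb():.0f} MB")
```

The figure logged here covers this process only, so it may still sit below the point at which the container is killed.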

Related

CloudTasks not scaling GAE Backend instances

I was searching for a solution which would allow me to run tasks up to 24h long. The combination of Cloud Tasks and multiple AppEngine Backend instances seemed like a perfect way to go.
As the tasks are long-running, I would like to scale to max_instances as fast as possible, but I am having trouble doing so.
Here is my app.yaml
service: slow
runtime: python37
# --timeout=90000 (25h) -> AppEngine Backend Instance should raise TimeoutExceededError after 24h
entrypoint: gunicorn main:app --workers 1 --timeout=90000
instance_class: B2
basic_scaling:
  max_instances: 15
Here is a screenshot of my Cloud Tasks queue configuration.
My issue is that the tasks from Cloud Task Queue are not spawning new instances as I would expect (eg. 15 max_concurrent_tasks in queue settings should spawn 15 backend instances).
I somehow managed to overcome this issue by aggressively increasing max_concurrent_tasks in the queue configuration (200 max_concurrent_tasks will spawn 15 backend instances).
Unfortunately, as the number of tasks in queue decreases, the backend instances will start terminating.
Now there are 8 tasks left in the queue (out of several hundred) and only 1 backend instance, which is running only 1 task. I cannot trigger additional instances even by clicking the "RUN TASK" button in the Cloud Tasks web UI.
Has anyone come across a similar issue?
Do you have any hint why this might be happening?
Why doesn't cloud task hit /_ah/start endpoint to spin up a new instance to run on?
I am not sure if basic scaling is a good idea. According to the documentation:
If you use basic scaling, App Engine attempts to keep your cost low,
even though that may result in higher latency as the volume of
incoming requests increases
It seems that automatic scaling would be a better idea. If you take a look at the same document, you can find:
If you use automatic scaling, each instance in your app has its own
queue for incoming requests. Before the queues become long enough to
have a noticeable effect on your app's latency, App Engine
automatically creates one or more new instances to handle the
increasing load.
You can configure the settings for automatic scaling to achieve a
trade-off between the performance you want and the cost you can incur.
The documentation mentions 3 settings that can be used:
Target CPU Utilization
Target Throughput Utilization
Max Concurrent Requests
I think you should be able to find the configuration that serves you best with those 3.

Cloud Tasks with an automatically scaling App Engine that last longer than 10 minutes

I have a backend for an iOS app that I've built on App Engine and I'm looking to do potentially long running background tasks to add records to my Cloud SQL database. Is this possible without Compute Engine? I've seen Cloud Tasks can do asynchronous work and you can set the dispatchDeadline to basically anything you want, but I've also read in the documentation
For App Engine tasks, 0 indicates that the request has the default deadline. The default deadline depends on the scaling type of the service: 10 minutes for standard apps with automatic scaling, 24 hours for standard apps with manual and basic scaling, and 60 minutes for flex apps. If the request deadline is set, it must be in the interval [15 seconds, 24 hours 15 seconds]. Regardless of the task's dispatchDeadline, the app handler will not run for longer than the service's timeout. We recommend setting the dispatchDeadline to at most a few seconds more than the app handler's timeout. For more information see Timeouts.
I don't particularly need the App Engine instance to care if the task completes or not... so I'm not sure why the recommendation is at most a few seconds more than the app handler's timeout... Can anyone shed any light on this? What am I missing? Adding a Compute Engine instance for these relatively simple tasks that will take at most a few hours to complete seems like a lot of overhead, and I don't want this to dictate which scaling options I choose.
Thanks for your time.
The recommendation is only for logging purposes. If your task timeout is shorter than your app timeout, you never know whether your app returned an error, because you never receive its response.
If you have a longer timeout on Cloud Tasks, you can capture the app's return code in the Cloud Tasks logs and thus track errors gracefully.
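As an illustration of that recommendation, here is a minimal sketch that builds a task body in the shape used by the Cloud Tasks REST API, with the dispatch deadline set a few seconds above the handler's timeout. The function name, the `margin_s` default and the `/tasks/import` path are hypothetical. The range check uses the interval quoted from the documentation above.

```python
MIN_DEADLINE_S = 15
MAX_DEADLINE_S = 24 * 60 * 60 + 15  # 24 hours 15 seconds


def build_app_engine_task(relative_uri, handler_timeout_s, margin_s=15):
    """Build a REST-style Cloud Tasks body for an App Engine handler.

    The dispatch deadline is set slightly above the handler's own timeout
    so that the handler's response can still be recorded by Cloud Tasks.
    """
    deadline_s = handler_timeout_s + margin_s
    if not MIN_DEADLINE_S <= deadline_s <= MAX_DEADLINE_S:
        raise ValueError(f"dispatch deadline {deadline_s}s is out of range")
    return {
        "appEngineHttpRequest": {
            "httpMethod": "POST",
            "relativeUri": relative_uri,  # hypothetical handler path
        },
        # Durations are encoded as a number of seconds followed by "s".
        "dispatchDeadline": f"{deadline_s}s",
    }


task = build_app_engine_task("/tasks/import", handler_timeout_s=3600)
print(task["dispatchDeadline"])  # 3615s
```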
App Engine with basic scaling is a great solution:
You get 9 free instance hours per day (B instance types).
App Engine scales to 0 automatically after a period of inactivity (which you can define with the idle_timeout parameter under basic_scaling).
The service is regional, not zonal like a Compute Engine instance. To get the same regional high availability you would need 9 Compute Engine instances (3 per zone, over 3 zones).
You have no servers to manage: no updates, no patching, no network/IP/firewall rules...
If you ask me about overhead, I would say Compute Engine has more of it, not App Engine (even if App Engine needs a little configuration).

Initial requests to datastore and cloud tasks have higher latency, is that normal?

My App Engine service is written in Go. I have code that connects to Cloud Datastore before the server even listens on the port. There is a single projection query that takes about 500ms to read just 4 entities. Does the first interaction with Datastore have higher latency, potentially because a connection needs to be established? Is there any way to reduce this connection latency? Also, is there any difference between doing this DB call before listening on the port and doing it within the warmup request (this is an autoscaled instance)?
Similar to the high initial latency for Cloud Datastore, I see the same pattern for Cloud Tasks. Initial task creation can take as long as 500ms, but even subsequent ones take anywhere from 200 to 400ms. This is in us-central. I was actually considering moving a DB update to a background task, but in general I am seeing the latency of task creation to be more or less the same as doing a transaction to read and update the data, giving no net benefit.
Finally, instance startup time is typically 2.5 to 3 seconds, with main getting called after about 2 seconds. My app startup time is the above-mentioned projection query cost of 500ms and nothing else. So, no matter how much I optimize my app startup, should I assume an additional latency of about 2 seconds?
Note that the load on the system is very light so these issues can't be because of high volume.
Update: deployment files as requested by Miguel (this is for a test environment investigating performance characteristics. Prod deployment will be more generous for instances)
default app:
service: default
runtime: go112
instance_class: F1
automatic_scaling:
  min_instances: 0
  max_instances: 1
  min_idle_instances: 1
  max_idle_instances: 1
  min_pending_latency: 200ms
  max_pending_latency: 500ms
  max_concurrent_requests: 10
  target_cpu_utilization: 0.9
  target_throughput_utilization: 0.9
inbound_services:
- warmup
backend app:
service: backend-services
runtime: go112
instance_class: B1
basic_scaling:
  idle_timeout: 1m
  max_instances: 1
200-500ms to initialize a client seems reasonable because a remote connection is being established. A 1-2 second cold start for App Engine also seems normal.
As you mentioned, you can experiment with a warmup request to reduce cold starts and initialize clients.
I would also recommend looking into the mode you are running your Datastore in (Native vs. Datastore). There is increased latency when using Datastore mode; for more info see Cloud Datastore Best Practices.
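The idea of paying the connection cost once, during warmup, and then reusing the client can be sketched as follows. This is written in Python for illustration, although the question's service is in Go. `get_client`, `warmup` and `connect` are hypothetical names, with `connect` standing in for whatever constructs the real Datastore client.

```python
import threading

_client = None
_client_lock = threading.Lock()


def get_client(connect):
    """Return a shared client, creating it on first use only.

    connect is a hypothetical zero-argument factory that opens the remote
    connection (for example, one that constructs a Datastore client).
    """
    global _client
    if _client is None:
        with _client_lock:
            if _client is None:  # re-check after acquiring the lock
                _client = connect()
    return _client


def warmup(connect):
    """Body of a hypothetical /_ah/warmup handler: connect before real traffic."""
    get_client(connect)
    return "", 200
```

Later request handlers would call `get_client` with the same factory and get the already-connected client back, so only the first call pays the setup latency.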

Exceeded soft memory limit of 243 MB with 307 MB after servicing 4330 requests total. Consider setting a larger instance class in app.yaml

Situation:
My project is mostly automated tasks.
My GAE (standard environment) app has 40 cron jobs like this, all running on the default module (frontend):
- description: My cron job Nth
  url: /mycronjob_n/ ###### Please note n is the nth cron job.
  schedule: every 1 minutes
Each cron job looks like this:
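from google.appengine.api.taskqueue import TaskRetryOptions
from google.appengine.ext import deferred

@app.route('/mycronjob_n/')
def mycronjob_n():
    for i in range(0, 100):  # enqueue 100 deferred tasks per run
        pram = prams[i]
        # Note: options is created but never passed to deferred.defer.
        options = TaskRetryOptions(task_retry_limit=0, task_age_limit=0)
        deferred.defer(mytask, pram)
    return 'Enqueued'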
Where mytask is
def mytask(pram):
    # Do some loops, read and write datastore, call an API; I guess this takes less than 30 seconds.
    return 'Task finish'
Problem:
As the title of the question says, I am running out of RAM. Frontend instance hours are increasing to 100 hours.
My wrong thought?
A deferred task runs in the background because it is not triggered by a user visiting the website. Therefore, it will not be counted as a request.
I broke each mycronjob_n into small separate tasks because I thought it would reduce the running time of each mycronjob_n and so reduce the instance's RAM consumption.
My question: (purpose: keep the frontend/backend instance hours as low as possible, and I accept latency)
Is a deferred task counted as a request?
How many requests do I have in 1 minute?
40 request of mycronjob_n
or
40 requests of mycronjob_n x 100 mytask = 4000
If 3-4 instances cannot handle 4000 requests, why doesn't GAE add 10 to 20 more F1 instances and then shut them down when idle? I set automatic scaling in app.yaml. I don't see the point of GAE's autoscaling as advertised.
What is the best way to optimize my app?
If a deferred task is counted as a request, it is pointless to split mycronjob_n into smaller tasks, right? I mean, my current method is the same as:
@app.route('/mycronjob_n/')
def mycronjob_n():
    for i in range(0, 100):
        pram = prams[i]
        options = TaskRetryOptions(task_retry_limit=0, task_age_limit=0)
        mytask(pram)  # Call function mytask directly
Here, will my app have 40 requests per minute, each running for 100 x 30s = 3000s? Will this approach also run out of memory?
Should I create a backend service running on F1 instance and put all cron jobs on that backend service? I heard that a request can run for 24 hours.
If I change the default service instance class from F1 to F2 or F3, will I still get 28 hours free? I heard the free tier applies to F1 only. And will my backend service get 9 hours free if it runs on B2 instead of B1?
My regret:
- I rather regret choosing GAE for this project. I chose it because it has a free tier, but I realized that the free tier is just for hobby/testing purposes. If I run a real app, the cost increases so fast that it makes me think GAE is expensive. Datastore reads and writes are so expensive, even though I tried my best to optimize them. The frontend hours are also always high. I am paying 40 USD per month for GAE. With 40 USD per month, maybe I could get a better server if I chose Heroku or Digital Ocean? Do you think so?
Yes, task queue requests (deferred included) are also requests, they just can run longer than user requests. And they need instances to serve them, which count as instance hours. Since you have at least one cron job running every minute - you won't have any 15 minute idle interval allowing your instances to shut down - so you'll need at least one instance running at all times. If you use any instance class other than F1/B1 - you'll exceed the free instance hours quota. See Standard environment instances billing.
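Since deferred tasks count as requests, the per-minute load from the question can be worked out directly. This is a rough sketch using the asker's own numbers. The 30-second task duration is the asker's guess and is treated here as an upper bound.

```python
CRON_JOBS = 40            # cron entries, each scheduled every minute
TASKS_PER_CRON_RUN = 100  # deferred tasks each run is meant to enqueue
TASK_SECONDS = 30         # asker's guess at the longest task duration

cron_requests = CRON_JOBS
task_requests = CRON_JOBS * TASKS_PER_CRON_RUN
total_requests_per_minute = cron_requests + task_requests

# Average number of tasks in flight if every task took the full 30 seconds.
avg_tasks_in_flight = task_requests * TASK_SECONDS / 60

print(cron_requests, task_requests, total_requests_per_minute)  # 40 4000 4040
```

With work arriving every minute at this rate, no instance stays idle for long, which is consistent with the instance hours climbing.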
You seem to be under the impression that the number of requests is what's driving your costs up. It's not, at least not directly. The culprit is most likely the number of instances running.
If 3-4 instances cannot handle 4000 requests, why doesn't GAE add 10
to 20 more F1 instances and then shut them down when idle?
Most likely GAE does exactly that - spawns several instances. But you keep pumping requests every minute, they don't reach an idle state long enough, so they don't shut down. Which drives your instance hours up.
There are 2 things you can do about it:
stagger your deferred tasks so they don't all need to be handled at the same time. Fewer instances (maybe even a single one?) may be necessary to handle them in that case. See Combine cron jobs to reduce number of instances and Preventing Google App Engine Cron jobs from creating multiple instances (and thus burning through all my instance hours)
tune your app's scaling configuration (the range is limited though). See Scaling elements.
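A minimal sketch of the staggering idea, assuming the tasks are enqueued with the deferred library's `_countdown` argument so that they start at evenly spaced delays instead of all at once. `staggered_countdowns` is a hypothetical helper, and the 60-second window simply matches the one-minute cron schedule in the question.

```python
def staggered_countdowns(n_tasks, window_seconds=60):
    """Return one delay (in whole seconds) per task, spread over the window."""
    if n_tasks <= 0:
        return []
    step = window_seconds / n_tasks
    return [int(i * step) for i in range(n_tasks)]


print(staggered_countdowns(4))  # [0, 15, 30, 45]

# Hypothetical use inside the cron handler from the question:
#
#   for pram, delay in zip(prams, staggered_countdowns(len(prams))):
#       deferred.defer(mytask, pram, _countdown=delay)
```

Whether this reduces the instance count depends on how long each task actually runs, so it is worth measuring before and after.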
You should also carefully read How Instances are Managed.
Yes, you only pay for what exceeds the free quota, regardless of the instance class. Billing is in F1/B1 units anyway - from the above billing link:
Important: When you are billed for instance hours, you will not see any instance classes in your billing line items. Instead, you will
see the appropriate multiple of instance hours. For example, if you
use an F4 instance for one hour, you do not see "F4" listed, but you
see billing for four instance hours at the F1 rate.
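The effect of that multiplier on the free quota can be sketched with a little arithmetic. The F4 multiplier comes from the quoted note, and the 28 and 9 free hours are the daily figures mentioned in this thread. The other multipliers are assumptions that should be checked against the current pricing page. `billable_instance_hours` is a hypothetical helper.

```python
# Multipliers relative to one F1/B1 instance hour. Only F4 = 4 is confirmed
# by the quoted note; the remaining values are assumptions.
MULTIPLIER = {"F1": 1, "F2": 2, "F4": 4, "B1": 1, "B2": 2, "B4": 4}

# Daily free instance hours as mentioned in this thread.
FREE_HOURS = {"F": 28, "B": 9}


def billable_instance_hours(instance_class, hours_running):
    """Return the instance hours billed per day after the free quota."""
    used = hours_running * MULTIPLIER[instance_class]
    free = FREE_HOURS[instance_class[0]]
    return max(0, used - free)


print(billable_instance_hours("F1", 24))  # 0
print(billable_instance_hours("F2", 24))  # 20
```

Under these assumptions, a single F1 instance running all day stays within the free hours, while a single F2 instance running all day is billed for 20 instance hours.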
About the RAM usage, splitting the cron job in multiple tasks isn't necessarily helping, see App Engine Deferred: Tracking Down Memory Leaks
Finally, cost comparing GAE with Heroku, Digital Ocean isn't an apples-to-apples comparison: GAE is PaaS, not IaaS, it's IMHO expected to be more expensive. Choosing one or the other is really up to you.

Google App Engine Timeout running a Java Program

I am running an optimization program in GAE. The program runs on my laptop/eclipse and could potentially take up to 6-10 minutes to run. It seems that GAE has a timeout of 60 secs and throws a 500 Error.
How do you increase the memory requirement for GAE? How would you increase the timeout requirement to more than 10 min? Is there something that I can do in GAE settings in Eclipse or do I have to get in touch with Google.
I second @PatrickGray's answer.
Related note: the actual execution on GAE can often take longer than on the development server, so don't use the local execution time as a reference.
For increasing the memory requirement - you can configure the module's instance class as needed (with cost implications, of course).
I ran into this error in Google App Engine before. You can't increase the timeout; it's a hard limit. I recommend the deferred library or using a Task Queue. Here's how I solved my problem (in Python):
How do I return data from a deferred task in Google App Engine
I assume that you don't need an automatic scaling instance for this program. Instances with basic or manual scaling give you as much time as you need to complete your tasks.
