Today we suddenly had a problem with standard environment App Engine instances being restarted all the time (e.g. every 10s or so) with basic scaling. Switching to manual scaling "solved" the problem, so it most probably has nothing to do with crashes etc. Any idea how this could happen?
So it was actually an infrastructure problem at Google:
https://status.cloud.google.com/incident/cloud-datastore/17002
I have a GAE application that's set up as a flexible instance, which is expected to be restarted on a weekly basis (and a continually unhealthy instance can be restarted): https://cloud.google.com/appengine/docs/flexible/java/how-instances-are-managed
However, we're seeing this restart ("npm run build" command) several times per week! For example, in the past three weeks we've had 9 restarts, and I've confirmed that the log entries leading up to each one are successful 200 responses (no sign of trouble), all for the active version serving traffic (and not for the other versions that are stopped).
Has anyone seen this symptom before or know of something else that can cause frequent restarts?
Let me know if any other info would be helpful.
An instance restart in the Google App Engine flexible environment can occur for several reasons:
According to the GAE documentation, there is no guarantee that an instance runs indefinitely, it can be restarted due to hardware maintenance, software updates or unforeseen issues. Besides that, as you stated, all instances are restarted on a weekly basis.
An instance can also be restarted if it fails to respond to a specified number of consecutive health check requests.
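As a rough illustration of the "consecutive failed health checks" rule, here is a minimal Python sketch. It is not App Engine's implementation; the function name and the `unhealthy_threshold` parameter are hypothetical and only model the idea that a single successful check resets the failure count.

```python
def should_restart(check_results, unhealthy_threshold):
    """Return True once `unhealthy_threshold` consecutive checks have failed.

    check_results: iterable of booleans, True for a passed health check.
    unhealthy_threshold: hypothetical name for the configured failure limit.
    """
    consecutive_failures = 0
    for passed in check_results:
        if passed:
            consecutive_failures = 0  # any success resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= unhealthy_threshold:
                return True
    return False
```

Under this model, scattered failures between successful 200 responses would not trigger a restart; only an unbroken run of failures would.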
If you observe an unusual number of restarts, I recommend opening a ticket with Google Cloud Platform Support. They have internal tools that can check what is going on in the instance and figure out why the restarts are happening.
#DianeKaplan's comment:
Contacting GCP support has given me a few helpful nuggets so far:
The automatic weekly restart of an instance due to maintenance can occur around different times (so it may only be 5 days since the last one, for example)
our deployments (which result in new GAE versions) make Google Builds
In some cases, a VM was being created overnight and then immediately deleted, where it didn't look like autoscaling was needed. Still looking into this, but was pointed towards the Google Cloud Console section Home > Activity as a good place to find clues
I have come across a strange situation and do not know what or how to look for.
We are having a Silverlight project hosted in a web project. This Silverlight project communicates using REST services hosted by the web project.
Now when we run this in debug mode, everything runs fine as expected. So I thought of profiling it to check where I might be losing performance. Here is the interesting part.
I ran the VS2012 Profiler, and it collects all the information about the methods executed, their timings and so on. But this time my project is lightning fast. Queries that take about 1 sec to execute under normal debug now take less than 200ms. One very intensive query takes about 20 sec in normal mode, but under profiling it takes less than 600ms.
So what I make of this is that my code and project are capable of running this fast, but for some reason they are not that fast under normal debug scenarios.
Can somebody shed light on what is happening under the hood and how I can achieve this performance in normal scenarios?
I would also like to mention that I have also tried release mode and publishing to IIS but none of these give as good performance as when in profiling mode.
Technically, I had expected performance under profiling to be worse than normal, since VS2012 is also collecting other data at that point.
I am confused. Please help.
Thanks
I know you probably don't need help at this point, but for anyone else who stumbles upon this post, I'll give my two cents.
I had this same problem with an XNA project I'm working on. Debug and Release modes both saw MASSIVE slowdowns in certain situations. It pulled me down to less than 1 FPS. I was trying to profile the problem to solve it, but the issue never occurred during profiling.
I finally discovered the slowdowns were caused by a Console.WriteLine() I was calling in the situation. Commenting it out solved the issues on both Debug and Release build. Apparently, Console.WriteLine is just INCREDIBLY slow.
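To check whether per-call console output is the bottleneck, it can help to time it against a single batched write. The sketch below is a Python analogue, not the C#/XNA code from this answer; all function names are made up for illustration, and the actual difference depends on the console or stream being written to.

```python
import time


def write_each_line(lines, stream):
    """Write and flush every line individually, like an unbuffered console."""
    for line in lines:
        stream.write(line + "\n")
        stream.flush()


def write_batched(lines, stream):
    """Join the lines first and hand them to the stream in a single write."""
    stream.write("".join(line + "\n" for line in lines))
    stream.flush()


def time_call(func, *args):
    """Return the wall-clock seconds taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start
```

For example, calling `time_call(write_each_line, messages, sys.stdout)` and `time_call(write_batched, messages, sys.stdout)` with a few thousand messages gives a quick comparison on a real console.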
I developed an application for a client that uses Play framework 1.x and runs on GAE. The app works great, but sometimes it is crazy slow. It takes around 30 seconds to load a simple page, but sometimes it runs faster, with no code change whatsoever.
Is there any way to identify why it's running slow? I tried to contact support but I couldn't find any telephone number or email. There is also no response on the official Google group.
How would you approach this problem? Currently my customer is very angry because of the slow loading time, but switching to another provider is a last resort at the moment.
Use GAE Appstats to profile your remote procedure calls. All of the RPCs are slow (Google Cloud Storage, Google Cloud SQL, ...), so if you can reduce the number of RPCs or use caching data structures, do so, and your application will be much faster. Appstats shows you which parts are slow and whether they need attention :) .
For example, I've created a Google Cloud Storage cache for my application and decreased execution time from 2 minutes to under 30 seconds. The RPCs are a bottleneck in the GAE.
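A minimal sketch of the caching idea, assuming results may be reused for a short time. This is a plain in-memory cache, not the Google Cloud Storage cache described above, and `TimedCache`, `get_or_fetch` and the `fetch` callback are hypothetical names standing in for whatever slow RPC your app makes.

```python
import time


class TimedCache:
    """Tiny in-memory cache with a per-entry time-to-live."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch(key) only on a miss."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = fetch(key)  # the slow remote call happens only here
        self._entries[key] = (now + self._ttl, value)
        return value
```

Repeated lookups of the same key within the TTL then cost a dictionary access instead of an RPC. Note that an in-memory cache like this lives per instance and is lost on restart.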
Google does not usually provide a support contact for many of its services. The App Engine slowness described here is probably caused by cold starts. Google App Engine front-end instances sleep after about 15 minutes. You could write a cron job to ping the instances every 14 minutes to keep them up.
Combining some answers and adding a few things to check:
Debug using app stats. Look for "staircase" situations and RPC calls. Maybe something in your app is triggering RPC calls at certain points that don't happen in your logic all the time.
Tweak your instance settings. Add some permanent/resident instances and see if that makes a difference. If you are spinning up new instances, things will be slow, for probably around the time frame (30 seconds or more) you describe. It will seem random. It's not just how many instances, but what combinations of the sliders you are using (you can actually hurt yourself with too little/many).
Look at your app itself. Are you doing lots of memory allocations in the JVM? Allocating/freeing memory is inherently a slow operation and can cause freezes. Are you sure your freezing is not a JVM issue? Try replicating the problem locally and tweak the JVM xmx and xms settings and see if you find similar behavior. Also profile your application locally for memory/performance issues. You can cut down on allocations using pooling, DI containers, etc.
Are you running any sort of cron jobs/processing on your front-end servers? Try to move as much as you can to background tasks such as sending emails. The intervals may seem random, but it can be a result of things happening depending on your job settings. 9 am every day may not mean what you think depending on the cron/task options. A corollary - move things to back-end servers and pull queues.
It's tough to give you a good answer without more information. The best someone here can do is give you a starting point, which pretty much every answer here already has.
By making at least one instance permanent, you get a great improvement on first use. It takes about 15 sec to load the application into the instance, which is why you experience long request times when nobody has been using the application for a while.
I have a simple app running on App Engine but I'm having odd problems with latency. It's a Python 2.7 app and a loading request takes between 1.5 and 10 secs (I guess depending on how GAE is feeling). This is a low traffic site right now, so previously GAE was sitting with no idle instances and most requests were loading requests, resulting in a long wait time on the first page view.
I've tried configuring the minimum number of idle instances to "1" so that these infrequent page views can immediately hit a warm instance.
However, I've seen several cases now where even with one instance sitting unused, GAE will route an incoming request to a loading instance, leaving the warm instance untouched:
[Image: GAE dashboard showing odd scheduling]
How can I prevent this from happening? I feel I must be understanding something wrong, because I certainly don't expect this behavior.
Update: Also, what makes this even less comprehensible is that the app has threadsafe enabled, so I really don't understand why GAE would get flustered and spin up an instance for a single, lone request.
Actually, I believe this is normal behavior. Idle instances are supposed to guarantee a minimum number of instances always available (for spiky load).
So, when some requests start coming in, they are initially served by idle instances, but at the same time AE scheduler will start launching new instances to always guarantee the same amount of idle instances even during suddenly increased load. That is, to "cover" for those idle instances that became busy serving requests.
It is described in detail on the Adjusting Application Performance page.
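The behavior described in this answer can be pictured with a toy model. This is only a sketch of the reasoning, not the real scheduler, which weighs more signals than this; the function and parameter names are hypothetical.

```python
def instances_to_launch(total_instances, busy_instances, min_idle):
    """Return how many new instances are needed to keep `min_idle` spare.

    Toy model only: it ignores latency settings, pending requests and
    everything else the real scheduler may take into account.
    """
    idle = total_instances - busy_instances
    return max(0, min_idle - idle)
```

With one instance and a minimum of one idle instance, the model asks for nothing while the instance is idle (`instances_to_launch(1, 0, 1)`), but asks for one new instance as soon as a request makes it busy (`instances_to_launch(1, 1, 1)`).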
Arrrgh! Suffer from this myself. This topic-area has come up in several threads (GAE groups & SO). If someone can dial-in the settings for a low-traffic site (billing on/off), that would be a real benefit. IIRC, someone with what I think is deep GAE experience noted in one thread that the Scheduler does not do well with very low volume apps. I have also seen wildly different startup times within a relatively short period of time. Painful to see a spinup take 700ms then 7000ms just a few minutes later. Overall the issue is not so much the cost to me, but more so the waste of infrastructure resources. In testing I've had two instances running despite having pinged the app with an RPC once every few minutes. If 50k other developers are similarly testing, that could accumulate into a significant waste.
I am using the python app engine and finding that the log console on the local development server is terribly slow. Output to this window seems to show in chunks of about 5-15 lines every second. Is that typical? I find that it's so slow that it hinders my debugging time waiting for log data to appear.
I suppose this may be as good an answer as any under the circumstance. Basically, I closed and reopened google app engine launcher, and the outputting was back to being appropriately fast. If anyone has a suggestion why this happens, that would be great. For now, though, at least this makes the slowness go away.