I'm currently in the process of rewriting my java code to run it on Google App Engine. Since I cannot use Timer for programming timeouts (no thread creation allowed), I need to rely on system clock to mark the time of the timeout start so that I could compare it later in order to find out if the timeout has occurred.
Now, several people (even on Google payroll) have advised developers not to rely on system time due to the distributed nature of Google app servers and not being able to keep their clocks in sync. Some say the deviance of system clocks can be up to 10s or even more.
1s deviance would be very good for my app, 2 seconds can be tolerable, anything higher than that would cause a lot of grief for me and my app users, but 10 second difference would turn my app effectively unusable.
I don't know if anything has changed for the better since then (I hope yes), but if not, then what are my options other than shooting up a new separate request so that its handler would sleep the duration of the timeout (which cannot exceed 30 seconds due to request timeout limitation) in order to keep the timeout duration consistent.
Thanks!
More Specifically:
I'm trying to develop a poker game server, but for those who are not familiar how online poker works: I have a set of players attached to 1 game instance. Evey player has a certain amount of time to act before the timeout will occur so the next player can act. There is a countdown on each actor and every client has to see it. Only one player can act at a time. The timeout durations I need are 10s and 20s for now.
You should never be making your request handlers sleep or wait. App Engine will only automatically scale your app if request handlers complete in an average of 1000ms or less; deliberately waiting will ruin that. There's invariably a better option than sleeping/waiting - let us know what you're doing, and perhaps we can suggest one.
Related
It seems that when app engine taskqueue's get interrupted, they take 20 minutes or more to restart, is this behavior normal?
I am using the TaskQueue on Google Cloud's App Engine Flexible system. I regularly add tasks to the taskqueue and they get processed on the system. It appears that occasionally, the task gets interrupted in the middle of what it's doing. I don't know why this happens, but I assume it's probably because the instance that its on restarted itself.
My software is resilient to such restarts, but the problem is that it takes a full 20 minutes for the task to be restarted. Has anyone experienced this before?
I think you're right, an instance grabs the task and then goes down. Taskqueue doesn't realize it and waits for some kind of timeout.
This sounds very similar to an issue i experienced:
app engine instance dies instantly, locking up deferred tasks until they hit 10 minute timeout
So to answer your question, I would say yes this does happen. As for what to do, I guess it depends on what it is this task is doing, how often it runs, etc. If the 20 minute lag isnt a big deal I would just live with it, just because fixing it can be a bit of a wild goose chase, but here's what I would try:
When launching tasks, launch duplicates as well with a staggered value for countdown/eta
setup a separate microservice to handle/execute these tasks, hopefully this will make it's execution more predictable, you'll be able to tweak instance-size, & scaling settings to better suit it.
Starting about 1 week ago, my app will occasionally and randomly completely stop serving for 1-5 minutes. Requests during this time hang for the full timeout and then return a 500.
The System Status dashboard reads OK, I have no cron jobs or anything special that might cause this disruption (that I know of).
Has anyone experienced this, and is there a solution?
If you have 'threadsafe: false' in your app.yaml configuration, App Engine will not send concurrent requests to your app. If you have a request that's blocking for a really long time, all other requests coming in will line up (and possibly time out) before being serviced. If this is the cause of your problem, either make your app thread-safe or have a look in your logs to find requests that take a long time and fix them.
Alternatively, if your app gets very little traffic, your instances might be getting shut down after they've been idle for a while. If your app takes a long time to start up, that would explain the behavior you're seeing. In app.yaml, you can set 'min_idle_instances' to some value greater than zero to avoid this startup penalty.
We have an App Engine app that handles an average .5 requests per second, and seemingly all those requests can be handled by the same instance running a Go app as the main version.
However, sometimes App Engine kicks off a second instance (and sometimes even a third one), that doesn't seem to do anything past handling one or two requests. Here's an example.
Shutting down that instance manually doesn't seem to cause any harm, so my question is, why does App Engine not kill the instance after it did not get any requests for a while? (The above example had four requests in the past hour, often the requests/age ratio gets even lower).
Update:
A similar situation is when an instance is started on a different version. App Engine only seems to kill the instance after hours of not getting any requests.
Under Application Settings → Performance,
Idle Instances is set to Automatic – 20
Pending Latency is set to 150ms – 250ms
I wish I knew what controls if/when it kills idle instances, but I can't see any documentation of it.
To avoid excess instances starting, I think the main thing you can do here is increase the pending latency:
The Pending Latency slider controls how long requests spend in the pending queue before being served by an Instance of the default version of your application. If the minimum pending latency is high App Engine will allow requests to wait rather than start new Instances to process them. This can reduce the number of instance hours your application uses, but can result in more user-visible latency.
Even if you only average 4 requests/hour, if you happen to get two closely spaced I suppose it's possible it would start a new instance.
You can also see some small amount of information in the logs about why it started a new instance.
The "How Applications Scale" section of the Google App Engine documentation states:
Scaling in Instances
Each instance has its own queue for incoming requests. App Engine monitors the number of requests waiting in each instance's queue. If App Engine detects that queues for an application are getting too long due to increased load, it automatically creates a new instance of the application to handle that load.
App Engine also scales instances in reverse when request volumes decrease. This scaling helps ensure that all of your application's current instances are being used to optimal efficiency and cost effectiveness.
It also states you can "specify a minimum number of idle instances", and to "optimize for high performance or low cost" in the administration console.
Try setting the "Idle instances" field to something like 3 - 5, and "optimize for low cost" and see if that affects the instance kill time.
I have a simple app running on App Engine but I'm having odd problems with latency. It's a Python 2.7 app and a loading request takes between 1.5 and 10 secs (I guess depending on how GAE is feeling). This is a low traffic site right now, so previously GAE was sitting with no idle instances and most request were loading requests, resulting in a long wait time on the first page view.
I've tried configuring the minimum number of idle instances to "1" so that these infrequent page views can immediately hit a warm instance.
However, I've seen several cases now where even with one instance sitting unused, GAE will route an incoming request to a loading instance, leaving the warm instance untouched:
gae dashboard showing odd scheduling
How can I prevent this from happening? I feel I must be understanding something wrong, because I certainly don't expect this behavior.
Update: Also, what makes this even less comprehensible is that the app has threadsafe enabled, so I really don't understand why GAE would get flustered and spin up an instance for a single, lone request.
Actually, I believe this is normal behavior. Idle instances are supposed to guarantee a minimum number of instances always available (for spiky load).
So, when some requests start coming in, they are initially served by idle instances, but at the same time AE scheduler will start launching new instances to always guarantee the same amount of idle instances even during suddenly increased load. That is, to "cover" for those idle instances that became busy serving requests.
It is described in details on Adjusting Application Performance page.
Arrrgh! Suffer from this myself. This topic-area has come up in several threads (GAE groups & SO). If someone can dial-in the settings for a low-traffic site (billing on/off), that would be a real benefit. IIRC, someone with what I think is deep GAE experience noted in one thread that the Scheduler does not do well with very low volume apps. I have also seen wildly different startup times within a relatively short period of time. Painful to see a spinup take 700ms then 7000ms just a few minutes later. Overall the issue is not so much the cost to me, but more so the waste of infrastructure resources. In testing I've had two instances running despite having pinged the app with an RPC once every few minutes. If 50k other developers are similarly testing, that could accumulate into a significant waste.
I am using Task Queue in GAE for performing some background work for my application. I have come to know that there is a 10 minute time limit for a particular task. My concern is how do I test this thing in my local environment. I tried thread sleep but it didn't throw any exception as mentioned in google app engine docs. Also is this time limit is measured by CPU time or the actual time.
Thanks.
The time is measured in wall clock time. The development server doesn't enforce time limits, although it's unclear why you'd want to test it because it's unlikely your tests will perform the same as they will in production, so trying to guess how much you'll be able to accomplish in 10 minutes on the production servers by seeing how much you can accomplish in 10 minutes on the development server will fail horribly.
For your development server, start a timer when a task is initiated. keep checking in your code if you reached 10 mins wall clock time. When you reach, throw a DeadlineExceededError. It would be better to have the try and except statements in the class handlers which call a particular function of your code.