I have an application that requires really low latency (real time game).
Currently, in my solution, it takes less than 2 milliseconds for a message to route from the client-facing front-end server to the destination server.
Does anybody know how much time it will take Google Cloud Pub/Sub to route a message from one server to another?
Thank you!
While Cloud Pub/Sub's end-to-end latency at the 99.9th percentile is sufficient for many applications (including some that use it for real-time interaction), 2ms is lower than what the system can currently promise. We have thus far prioritized high throughput and strong delivery guarantees. End-to-end latency is also highly dependent on the rate at which a subscriber issues pull requests; a subscriber should always have at least a few open pull requests if throughput and/or latency are important. We do aim to significantly reduce our intra-region latencies, but at the moment Cloud Pub/Sub cannot guarantee 2ms intra-region latency at the 99.9th percentile.
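For example, with the current Python client library, a streaming-pull subscriber keeps the connection to the server busy on your behalf; the sketch below is illustrative only, and the project/subscription names and callback body are placeholders:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder names; substitute your own project and subscription.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def process(data):
    pass  # placeholder for the application's real handling

def callback(message):
    # Do the minimum work needed, then ack promptly so delivery stays fast.
    process(message.data)
    message.ack()

# Flow control caps how many messages may be outstanding at once; the client
# keeps pulling in the background for as long as this future is running.
flow_control = pubsub_v1.types.FlowControl(max_messages=100)
future = subscriber.subscribe(subscription_path, callback=callback,
                              flow_control=flow_control)
try:
    future.result()
except KeyboardInterrupt:
    future.cancel()
```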
Why would there be any latency in App Engine in the middle of processing a request? It only happens occasionally, at random points in the request handling, adding a latency of around 3 or more seconds after the request has started processing.
The usual suspect is your handler reaching out for resources, either from GAE APIs (datastore, memcache, etc.), other GCP APIs/infrastructure (Cloud Storage, Machine Learning, BigQuery, etc.), or an external/3rd-party service/URL.
Most, if not all, such interactions can occasionally encounter peak response times far longer than average, for various possible reasons (or combinations of reasons), for example:
temporary outages of the service being accessed or of the networking layer ensuring connectivity to it
retries at networking or application layers due to communication errors/packet loss
service VMs/instances needing to be launched from scratch during (re)starts or even during scaling up
normal operating conditions which require more time, like datastore transaction retries due to collisions
If the occurrence rate becomes unacceptable, you would need to investigate which of these external accesses is responsible and under what conditions, and then maybe find some solution to prevent or reduce the impact of the occurrences.
Of course, there may be other reasons as well.
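If it comes to that investigation, timing the suspect calls is a simple first step. The wrapper below is a hypothetical example around urlfetch and would need adapting for whichever API is under suspicion:

```python
import logging
import time

from google.appengine.api import urlfetch  # GAE Python runtime API

def fetch_with_timing(url, deadline=10):
    """Wrap an external call and log how long it took, so the occasional
    multi-second spike can be traced back to a specific dependency."""
    start = time.time()
    try:
        return urlfetch.fetch(url, deadline=deadline)
    finally:
        elapsed = time.time() - start
        if elapsed > 1.0:  # only log the slow outliers
            logging.warning("Slow external call to %s: %.2fs", url, elapsed)
```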
Is Google PubSub suitable for low-volume (10 msg/sec) but mission-critical messaging, where timely delivery of each message must be guaranteed within some fixed period of time?
Or, is it rather suited for high-throughput, where individual messages might be occasionally lost or delayed indefinitely?
Edit: To rephrase this question a bit: is it true that any particular message in PubSub, regardless of the volume of messages produced, can be indefinitely delayed?
Google Cloud Pub/Sub guarantees delivery of all messages, whether low throughput or high throughput, so there should be no concern about messages being lost.
Latency for message delivery from publisher to subscriber depends on many different factors. In particular, the rate at which the subscriber is able to process messages and request more messages is vitally important. For pull subscribers, this means always having several outstanding pull requests to the server. Push subscribers should return a successful HTTP response code as quickly as possible. You can read more about the difference between push and pull subscribers.
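For illustration, a push endpoint might look like this minimal Flask sketch (the route and helper below are assumptions, not a prescribed layout); the point is to ack by returning a 2xx quickly and do the heavy work elsewhere:

```python
import base64
from flask import Flask, request

app = Flask(__name__)

def enqueue_for_processing(data):
    pass  # placeholder: hand off to a queue or background worker

@app.route("/pubsub/push", methods=["POST"])
def pubsub_push():
    envelope = request.get_json()
    payload = base64.b64decode(envelope["message"].get("data", ""))
    enqueue_for_processing(payload)
    # Returning a 2xx immediately acknowledges the message; slow responses
    # here increase end-to-end latency and eventually trigger redelivery.
    return ("", 204)
```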
Google Cloud Pub/Sub tries to minimize latency as much as possible, though there are no guarantees made. Empirically, Cloud Pub/Sub consistently delivers messages in no more than a couple of seconds at the 99th percentile. Note that if your publishers or subscribers are not running on Google Cloud Platform, then network latency between your servers and Google servers could also be a factor.
I recently experienced a sharp, short-lived increase in the load of my service on Google App Engine. The load went from ~1-2 req/second to about 10 req/second for about a couple of hours. My number of dynamic instances scaled up pretty quickly but in the process I did get a number of "Request waited too long" timeout messages.
So the next time around, I would like to be prepared with enough idle instances to handle my load. But now the question is, how do I determine how many is adequate. I expect a much larger burst in load this time - from practically nothing to an average of 500 requests/second, possibly with a peak of 3000. This is to last between 15 minutes and 1 hour.
My main goal is to ensure that the information passed via HTTP Post is saved to the datastore by means of a single write.
Here are the steps I have taken to prepare for the burst:
I have pruned the fast path to disable analytics and other reporting, which typically generate 2 urlfetch requests.
The datastore write is to be deferred to a taskqueue via the deferred library
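Roughly, that deferred write looks like the sketch below (the Submission model and function names are just placeholders):

```python
from google.appengine.ext import deferred, ndb

class Submission(ndb.Model):
    payload = ndb.TextProperty()

def save_payload(raw_body):
    # The single datastore write, executed later by the task queue.
    Submission(payload=raw_body).put()

# In the request handler's fast path: enqueue the write and return immediately.
def handle_post(raw_body):
    deferred.defer(save_payload, raw_body)
```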
What I would like to know is:
1. Tips/insights into calculating how many idle instances one would need per N requests/second.
2. It seems that the maximum throughput of a task queue is 500 tasks/second. Is this the rate at which you can push tasks onto the queue, and if not, is there a cap on that? I'm guessing not, since pushing a task is probably just a datastore write, but I would like to be sure.
My fallback plan if I am not confident of saving all of the information for this flash mob is to set up a beefy Amazon EC2 instance, run a web server on it and make my clients send a backup request to this server.
You must understand that idle instances are only used while new frontend instances are being spun up. This means they are only used during traffic increases; when traffic is steady they are not used.
Now, if your instance needs 20 sec to spin up and can handle 10 req/sec of steady traffic, and your traffic increase is 5 req/sec, then you'll need 20 * 5 / 10 = 10 idle instances if you don't want any requests dropped.
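As a back-of-the-envelope helper (nothing GAE-specific, just the arithmetic above):

```python
def idle_instances_needed(spinup_seconds, per_instance_rps, increase_rps):
    """Idle instances must absorb the extra traffic that arrives while new
    instances are still spinning up."""
    return spinup_seconds * increase_rps / float(per_instance_rps)

# The example above: 20 s spin-up, 10 req/s per instance, +5 req/s growth.
print(idle_instances_needed(20, 10, 5))  # -> 10.0
```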
What you should do is:
Maximize instance throughput (number of requests it can handle): optimize code, use async db operations and enable Concurrent Requests.
Minimize your instance startup time. This is important because idle instances are used during spinning up of new instances and the time it takes to spin up a new instance directly relates to how many idle instances you need. If you use Java this means getting rid of any heavy frameworks that do classpath scanning (Spring, etc..).
Finally, the number of frontend instances needed is very application specific. But since you already had a traffic increase, you should know how many requests per second your frontend instance can handle.
Edit: There is one more obvious thing you should do: HTTP caching. GAE has a transparent HTTP cache which can be simply controlled via Cache-Control headers.
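For example, in a Python handler this is just a response header (a minimal webapp2 sketch; the path and max-age are arbitrary):

```python
import webapp2

class CachedPage(webapp2.RequestHandler):
    def get(self):
        # Cache-Control lets App Engine's edge cache and intermediaries serve
        # repeat requests without touching an instance at all.
        self.response.headers['Cache-Control'] = 'public, max-age=300'
        self.response.write('cacheable content')

app = webapp2.WSGIApplication([('/cached', CachedPage)])
```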
Also, if analytics has a big performance impact on your server, consider using client side analytics services (like Google Analytics). They also work for devices.
I am running a free application on GAE's Python runtime with a maximum of 1 idle instance.
According to http://code.google.com/appengine/docs/adminconsole/instances.html,
Your application's latency has the biggest impact on the number of instances needed to serve your traffic. If you service requests quickly, a single instance can handle a lot of requests.
This seems to suggest that adjusting the slider in 'Application Settings' to minimum latency would be best.
However, according to http://code.google.com/appengine/docs/adminconsole/performancesettings.html#Setting_the_Minimum_Pending_Latency,
it seems like having a high latency is good for preventing load spikes from spinning up new instances.
So is latency basically a tradeoff between ability to respond to request spikes (high latency) vs. number of requests handled over a given time period (low latency)?
"Pending latency" refers to how long a request can be sitting in the queue before App Engine decides to spin up another instance. If all of your app instances are busy when a request arrives, the request will wait in a queue to be handled by the next available instance. If it's there beyond the minimum, App Engine may decide to start up a new instance to handle the request. (There's also a maximum pending latency setting you can adjust.)
The minimum pending latency is configurable because starting up a new instance takes time and costs money. A larger minimum pending latency means App Engine will hold onto pending requests longer (and make them wait) before starting new instances, favoring lower instance cost over the ability to handle more traffic. A smaller minimum pending latency means App Engine will start new instances more often as traffic picks up.
The term "latency" simply refers to how long it takes for your app to respond to a request. The faster your app can respond to requests, the more requests a single instance can handle, and the shorter the request queue will typically be. Lower latency is always good, but it's up to the app to do what it needs to do quickly.
Could somebody give a good explanation, for a newbie, of what the following phrase means:
1) workload throttling within a single cluster and 2) workload balance across multiple clusters.
This is from an overview of the advantages of an ETL tool that helps perform ETL (Extract, Transform, Load) jobs on a Redshift database.
Many web services allocate a maximum amount of "interaction" that you can have with them. Once you exceed that amount, the service changes how it handles your requests.
Amazon imposes limitations on how much compute power you can consume within your nodes. The phrase "workload throttling" means that if you exceed the limits detailed in Amazon's documentation, Amazon Redshift Limits, your queries, jobs, tasks, or work items will be given lower priority or fail outright.
The idea is that Amazon doesn't want you to consume so much compute power that it prevents others from using the service, and, honestly, they don't want you to consume compute that costs them more to provide than you are paying for.
Workload throttling isn't an idea exclusive to this Amazon service, or to cloud services in general. The concept can be found in any system that needs to account for receiving more tasks than it can handle; different systems deal with being overburdened in different ways.
For example, a load balancer may redirect you to alternate services. 3rd-party data APIs will allot you a maximum amount of data per hour/minute and then either delay the responses you get back, charge you more money, or stop responding altogether.
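As an illustration of how a client typically copes with this kind of throttling (the make_request callable and the 429 status check below are generic assumptions, not specific to Redshift or any one API):

```python
import random
import time

def call_with_backoff(make_request, max_attempts=5):
    """Retry a throttled call with exponential backoff plus jitter, the usual
    client-side response when a service signals it is shedding load."""
    response = None
    for attempt in range(max_attempts):
        response = make_request()
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        time.sleep((2 ** attempt) + random.random())
    return response
```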
Another service that deals with throttling is the Google Maps Geocoding service. If you look at their documentation, Google Maps Geocoding API Usage Limits, you will see that:
Users of the standard API:
2,500 free requests per day, calculated as the sum of client-side and server-side queries.
50 requests per second, calculated as the sum of client-side and server-side queries.
If you exceed this and have billing enabled, Google will shift to:
$0.50 USD / 1000 additional requests, up to 100,000 daily.
I can't remember what the response looks like after you hit that daily limit, but once you hit it, you basically don't get responses back until the day resets.