We are using Cloud Tasks to call an "on-prem" API gateway (using HTTP requests). This API gateway (IBM API Connect) sits in front of an on-prem system (Oracle). This back-end system can at times be very slow (>5 s).
We are desperately trying to increase the throughput by "adjusting" the Cloud Tasks queue settings (like --max-dispatches-per-second etc.):
gcloud tasks queues update queue-1 --max-dispatches-per-second=8 --max-concurrent-dispatches=16
But all we see when we "crank up" the Cloud Tasks settings is that yellow triangle telling us that we are "enforced" to a lower rate due to "system resources".
My understanding is that the yellow triangle shows up due to "errors" from the API gateway we call. Basically, GCP/Cloud Tasks reacts by itself to return codes, errors, time-outs, latency etc. from the API endpoint we are calling, and the result is a very low rate/throughput. Is this understanding correct? Can someone verify?
The GUI also says "or because currently there is no instance available to execute a request". What instance are they talking about? To me that means there is a possibility that "GCP-specific" resources come into the picture here and have an effect on the "enforced rate". Or?
Anyway, any help/insight would be appreciated.
Thanks
The warning you are seeing can be prompted by either of the two things you mention: "enforced rates" or a lack of GCP resources at the time of the request.
The "enforced rates" that Cloud Tasks is referring to are the ones mentioned here. As you mention, this is due to the server being overloaded and returning too many errors. When this happens, Cloud Tasks acts by itself and slows down execution until the errors stop.
The "currently there is no instance available to execute a request" message means that GCP does not have the resources to create the request. Remember that Cloud Tasks is a managed service, so requests are created by fully managed GCP Compute Engine instances. This is rare, although it does happen from time to time.
To find out which of these two issues you are running into, I would recommend checking your Stackdriver logs with the Cloud Tasks filter. If you see a high number of errors there, you are most likely in "enforced rates" territory.
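Separately from the enforced rate, it may be worth checking how the two queue settings interact with the slow back end. The following is only a back-of-the-envelope sketch: it assumes each dispatch holds one concurrency slot for the full response time, and it plugs in the 5 s latency and the queue settings from the question. The function name is made up for illustration.

```python
def max_sustained_rate(max_concurrent_dispatches, latency_seconds, max_dispatches_per_second):
    """Rough upper bound on completed dispatches per second.

    With N requests in flight and each one taking `latency_seconds`,
    at most N / latency_seconds requests can finish per second.
    The configured per-second limit caps the result further.
    """
    concurrency_bound = max_concurrent_dispatches / latency_seconds
    return min(concurrency_bound, max_dispatches_per_second)


# Settings from the question: 16 concurrent dispatches, 8 dispatches/s, ~5 s per call.
print(max_sustained_rate(16, 5.0, 8))  # 3.2
```

Under those assumptions the queue could not sustain 8 dispatches per second against a 5 s back end even with no errors at all, so raising only --max-dispatches-per-second would not be expected to help.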
Hope you find this useful!
Related
I'm running into a performance issue with the Google Cloud Bigtable Python client. I'm working on a Flask API that writes to and reads from a GCP Bigtable instance. The API uses the Python client to communicate with Bigtable and is deployed to the GCP App Engine flexible environment.
Under low traffic, the API works fine. However, during a load test, the endpoints that read from and write to Bigtable suffer a huge performance decrease compared to a similar endpoint that doesn't communicate with Bigtable. Also, a large percentage of the requests sent to those endpoints receive a 502 Bad Gateway, even when the health check is turned off in App Engine.
I'm aware that the client is currently in Alpha. I wonder if the performance issue is known, or if anyone else has run into the same issue.
Update
I found documentation from Google stating:
There are issues with the network connection. Network issues can
reduce throughput and cause reads and writes to take longer than
usual. In particular, you'll see issues if your clients are not
running in the same zone as your Cloud Bigtable cluster.
In my case, my client was in a different region; moving it to the same region gave a huge increase in performance. However, the performance issue still exists, and the recommendation from the documentation is to put the client in the same zone as Bigtable.
I also considered using Container Engine or Compute Engine, where it is easier to specify the zone, but I want to stay with App Engine for its autoscaling and managed services.
The Bigtable client takes somewhere between 3 ms and 20 ms to complete each request, and because Python is handling the request on a single thread, it just waits during that time until the response comes back. The best solution we found was, for any writes, to publish the request to Pub/Sub and then use Dataflow to write to Bigtable. It is significantly faster because publishing a message in Python takes well below 1 ms, and because Dataflow can be run in exactly the same region as Bigtable and is easy to parallelize, it can write much faster.
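The publishing side of that approach might look roughly like the sketch below. The message layout (a JSON object with `row_key` and `cells`), the function name, and the project/topic names are assumptions made for illustration; the Dataflow job that consumes the messages is not shown. The publisher is passed in so the function can be exercised without credentials.

```python
import json


def enqueue_bigtable_write(publisher, topic_path, row_key, cells):
    """Publish one pending write as a JSON message and return the publish future.

    `publisher` is expected to behave like google.cloud.pubsub_v1.PublisherClient,
    whose publish(topic, data) call takes bytes and returns a future.
    """
    payload = json.dumps({"row_key": row_key, "cells": cells}).encode("utf-8")
    return publisher.publish(topic_path, data=payload)


# In the real service (needs the google-cloud-pubsub package and credentials):
#   from google.cloud import pubsub_v1
#   publisher = pubsub_v1.PublisherClient()
#   topic_path = publisher.topic_path("my-project", "bigtable-writes")  # hypothetical names
#   enqueue_bigtable_write(publisher, topic_path, "user#42", {"cf1:clicks": "1"})
```

The request handler does not need to wait on the returned future, which is what keeps the write off the request's critical path.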
It doesn't, though, solve the scenario where you need frequent reads, or where writes need to be instantaneous.
App Engine has been great for requests that process quickly with no external calls to databases, caches or third-party resources. But we've found that as soon as we introduce any sort of "longer running" component or external latency, the "instance hours" compound and drive costs up considerably. One example is an HTTP POST that runs asynchronously in the background and might take a second or two to process a few more intense database queries: totally invisible and OK from a UX perspective on the client side because it's asynchronous, but expensive in App Engine billing because it's long running.
These expense-inducing situations, where a request is literally just waiting for a response from an external resource and needs almost zero CPU while idling, seem avoidable, but I'm not sure they are avoidable with App Engine.
It's almost like a "long poll" where the response might be left open but doing nothing.
Is there a way to do this on App Engine without paying an insane amount for instance hours, or would we be better off moving to Compute Engine or EC2? Does App Engine scale automatically based on CPU load, or solely on the total count of open (and perhaps inactive) requests? Note that threadsafe is indeed enabled.
There are really two ways to go about this one (top of mind).
Use Task Queues!
If the work doesn't need to happen at exactly the same time as the request, this is exactly what task queues in App Engine are for. They allow you to put a job on a queue and have another module pick up the work. They're kind of great because you can scale your front-end and back-end processes separately.
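The hand-off pattern can be sketched in plain Python. This is only an in-process stand-in (a `queue.Queue` plus one worker thread, with made-up names and job payloads), not the App Engine Task Queue API; on App Engine the queue is a managed service and the worker is a separate handler or module.

```python
import queue
import threading


def start_worker(jobs, handle):
    """Start a background thread that processes jobs until it sees None."""
    def run():
        while True:
            job = jobs.get()
            if job is None:      # sentinel: stop the worker
                break
            handle(job)
    worker = threading.Thread(target=run)
    worker.start()
    return worker


jobs = queue.Queue()
done = []
worker = start_worker(jobs, done.append)

# The "request handler" only enqueues the job and returns immediately.
for order_id in (1, 2, 3):
    jobs.put({"order_id": order_id})

jobs.put(None)   # tell the worker to finish
worker.join()
print(done)      # [{'order_id': 1}, {'order_id': 2}, {'order_id': 3}]
```

The point of the split is that the front end's latency no longer depends on how long the work takes, and the two sides can be sized independently.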
If that doesn't work....
Use App Engine Flexible
Under the hood App Engine Flexible is just running GCE instances. The cost structure is entirely different, since you persistently have a VM running in the background serving your requests.
Hope this helps!
What you're really worried about here is how App Engine scales your instances. Because many of your requests require few resources, your app might be able to handle many more concurrent requests on a single instance than normal. You can look into parameters that shape scaling here. Of particular interest:
max_concurrent_requests: The number of concurrent requests an automatic scaling instance can accept before the scheduler spawns a new instance (Default: 8, Maximum: 80).
There is a danger here, where an instance may fill up with non-long-polling requests and become overburdened. To prevent that, you could isolate your long-polling requests into their own service and set its scaling parameters separately from the rest of your app.
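To get a feel for what that setting does to instance counts, here is a rough worked example. It assumes the scheduler scales purely on the number of concurrent requests (ignoring CPU, latency and the other scaling parameters), and the figure of 400 open requests is invented for illustration.

```python
import math


def instances_needed(open_requests, max_concurrent_requests):
    """Instances required if each one accepts at most max_concurrent_requests."""
    return math.ceil(open_requests / max_concurrent_requests)


# 400 mostly idle, long-running requests at the default and at the maximum setting.
for limit in (8, 80):
    print(limit, instances_needed(400, limit))
# 8 50
# 80 5
```

Under those assumptions, raising the limit from 8 to 80 cuts the instance count by a factor of ten for requests that are mostly waiting.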
I developed an application for a client that uses Play framework 1.x and runs on GAE. The app works great, but it is sometimes crazy slow. It can take around 30 seconds to load a simple page, yet at other times it runs faster, with no code change whatsoever.
Is there any way to identify why it's running slowly? I tried to contact support but I couldn't find any telephone number or email. There has also been no response on the official Google group.
How would you approach this problem? Currently my customer is very angry because of the slow loading time, but switching to another provider is the last option at the moment.
Use GAE Appstats to profile your remote procedure calls. All of the RPCs are slow (Google Cloud Storage, Google Cloud SQL, ...), so if you can reduce the number of RPCs or use some caching data structures, do so and your application will be much faster. Appstats shows you which parts are slow and need attention :)
For example, I created a Google Cloud Storage cache for my application and decreased execution time from 2 minutes to under 30 seconds. RPCs are a bottleneck in GAE.
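As an illustration of the caching idea (not the cache described above, whose details aren't given), here is a minimal time-to-live cache that wraps a slow call. It is written in Python for brevity although the app in question runs on the JVM; the class, the `slow_rpc` stand-in and the 60-second TTL are all assumptions.

```python
import time


class TTLCache:
    """Keep results of a slow call for ttl_seconds to avoid repeating the RPC."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}   # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # still fresh: no RPC
        value = loader(key)                    # miss or expired: do the slow call
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []

def slow_rpc(key):
    """Stand-in for a slow remote call; records each invocation."""
    calls.append(key)
    return "value-for-" + key


cache = TTLCache(ttl_seconds=60)
cache.get_or_load("profile:1", slow_rpc)
cache.get_or_load("profile:1", slow_rpc)
print(len(calls))  # 1
```

The second lookup is served from memory, so only one call reaches the back end within the TTL window.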
Google does not usually provide a support contact for many of its services. The slowness described is probably caused by a cold start. Google App Engine front-end instances go to sleep after about 15 minutes. You could write a cron job to ping the instances every 14 minutes to keep them up.
Combining some answers and adding a few things to check:
Debug using Appstats. Look for "staircase" situations and RPC calls. Maybe something in your app is triggering RPC calls at certain points that don't happen in your logic all the time.
Tweak your instance settings. Add some permanent/resident instances and see if that makes a difference. If you are spinning up new instances, things will be slow, for probably around the time frame (30 seconds or more) you describe. It will seem random. It's not just how many instances, but what combinations of the sliders you are using (you can actually hurt yourself with too little/many).
Look at your app itself. Are you doing lots of memory allocations in the JVM? Allocating and freeing memory is inherently slow and can cause freezes. Are you sure your freezing is not a JVM issue? Try replicating the problem locally, tweak the JVM -Xmx and -Xms settings, and see if you find similar behavior. Also profile your application locally for memory/performance issues. You can cut down on allocations using pooling, DI containers, etc.
Are you running any sort of cron jobs/processing on your front-end servers? Try to move as much as you can to background tasks such as sending emails. The intervals may seem random, but it can be a result of things happening depending on your job settings. 9 am every day may not mean what you think depending on the cron/task options. A corollary - move things to back-end servers and pull queues.
It's tough to give you a good answer without more information. The best someone here can do is give you a starting point, which pretty much every answer here already has.
By making at least one instance permanent, you get a great improvement on first use. It takes about 15 seconds to load the application into an instance, which is why you experience long request times when nobody has been using the application for a while.
We know from the documentation that there is a theoretical limit of 1 message per user per second, but we aren't coming anywhere close to that while running email migrations on a high-end server. What should we do? Should we increase the number of threads per user to more than one (even though the documentation suggests only 1 thread per user)? I've used their GAMME tool and it blows the Email Migration API away in terms of speed, even on lower-end servers.
Does anyone have any suggestions? It's not super-slow, but it's slow enough to be a pain.
The GAMME tool itself uses the Email Migration API; it's not doing anything special, so there are likely other factors slowing your migration. Are you actually hitting the migration API from App Engine? If so, you should be able to use Appstats to profile your application and see if there are other bottlenecks. Where are you pulling messages from?
Do not attempt to use more than 1 thread per user migration; it won't work and you'll get performance issues. DO make sure that you are properly implementing exponential backoff. If your app doesn't respond to 503 error codes by backing off exponentially (1 second the first time, then 2 seconds, 4, 8, etc.), Google will respond by throttling your API calls further.
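A minimal sketch of that backoff schedule follows. The `ServiceUnavailable` exception is a hypothetical stand-in for however your HTTP client surfaces a 503, and the small random jitter is an addition of this sketch rather than something the advice above requires.

```python
import random
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response from the API (hypothetical)."""


def call_with_backoff(request, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Call request(); on ServiceUnavailable wait about 1 s, 2 s, 4 s, ... and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise                          # out of retries: surface the error
            delay = base_delay * (2 ** attempt)
            # Small random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))


attempts = []

def flaky():
    """Fails with a 503 three times, then succeeds."""
    attempts.append(1)
    if len(attempts) < 4:
        raise ServiceUnavailable()
    return "ok"


waits = []   # record the delays instead of actually sleeping
print(call_with_backoff(flaky, sleep=waits.append))  # ok
```

Passing `sleep` as a parameter makes the schedule easy to check without waiting; here `waits` ends up with three delays of roughly 1, 2 and 4 seconds.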
Does anyone have any advice on making the logging in Google App Engine better? I am currently trying to use Splunk Storm, but they are finicky regarding input and go down often. Has anyone else encountered this and solved it in some capacity?
Currently I have a process that runs in a backend, reads from the LogService and pipes the logs into Splunk Storm via its REST API. This often fails, either because Storm goes down or because the backend IP changes.
My issue is with the logging provided within App Engine, as the logs disappear when new versions are pushed and querying the logs with the provided dashboard is almost unusable. Splunk was a potential solution, but the cloud solution leaves a lot to be desired.
Anything that would provide a better interface into my logs would be appreciated.
You can export logs from GAE to BigQuery, which has quite a capable query language. You can use Mache, an open-source project that already does this. You should write your own exporter to expose (and make queryable) the fields (columns) you are interested in.
Since you've decided to use Splunk (or another external service) as permanent storage, it sounds like you need a location to buffer logs between the times when they're written to App Engine's log service and when Splunk is available to accept the logs. To avoid losing logs before version churn causes them to fall out of App Engine, this buffer needs to be fast and highly available.
One reasonable choice is the AE datastore. There's no unreliable hop to a 3rd party, it has an availability SLA, and it can be scaled arbitrarily by sharding writes. The downside would be the cost of R/W operations and the storage footprint of in-flight logs, but you'll incur a comparable cost for another backing store.
Whichever service you choose, have one batch process (e.g. a backend or cron job) write to the buffer from the logs reader API. As long as it runs more often than app updates, logs will always exist in durable storage. Then have another batch process wait for Splunk to be available, upload to it from the buffer, and delete each entry as you get receipt confirmation from Splunk.
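The two batch processes could be sketched as follows. A plain dict stands in for the durable buffer (the datastore or whatever you pick), and `read_logs` and `upload` are hypothetical callables for the logs reader and the Splunk upload; none of this is a real App Engine or Splunk API.

```python
def buffer_new_logs(read_logs, buffer):
    """Batch job 1: copy log records from the reader into the buffer."""
    for record in read_logs():
        buffer[record["id"]] = record      # keyed by id, so re-reading is harmless


def drain_buffer(buffer, upload):
    """Batch job 2: upload buffered records; delete each only after confirmation."""
    delivered = 0
    for record_id in list(buffer):
        try:
            confirmed = upload(buffer[record_id])
        except ConnectionError:
            break                          # sink is down: keep the rest for the next run
        if confirmed:
            del buffer[record_id]
            delivered += 1
    return delivered


buffer = {}
buffer_new_logs(lambda: [{"id": 1, "msg": "a"}, {"id": 2, "msg": "b"}], buffer)

def splunk_down(record):
    raise ConnectionError("sink unavailable")

print(drain_buffer(buffer, splunk_down), len(buffer))      # 0 2
print(drain_buffer(buffer, lambda r: True), len(buffer))   # 2 0
```

Deleting only after a confirmed upload means an outage on the Splunk side delays delivery but does not lose records.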