I have a Google App Engine server up and running. I don't have any cron jobs or anything else that I am aware of that calls the default website URL. I have had over 70,000 requests to /_ah/warmup in the last 24 hours. How can I figure out where this traffic is coming from? Is this a denial-of-service attack on the server?
The warmup request is, as explained by the Google Documentation, an internal call that is made whenever you spin up new instances.
Without any more information, the only thing 70,000 requests to that URL tells you is that your scaling may be a bit too aggressive and your app is spinning up too many instances.
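To see how these requests line up with instance starts, the app can answer the warmup URL itself and log each hit. This is a minimal sketch using only the standard WSGI interface; it assumes the app is served by a WSGI server and that warmup requests are enabled for the service. The names `app` and `warmup_count` are hypothetical, not part of any App Engine API.

```python
import logging

# Hypothetical per-instance counter of warmup requests received.
warmup_count = 0


def app(environ, start_response):
    """Minimal WSGI app that answers App Engine warmup requests."""
    global warmup_count
    path = environ.get("PATH_INFO", "")
    if path == "/_ah/warmup":
        # Expensive initialization (caches, connections) would go here.
        warmup_count += 1
        logging.info("warmup request #%d on this instance", warmup_count)
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"warmed up"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

If the logs show roughly one warmup entry per instance start, the volume is a scaling matter rather than outside traffic.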
This may be the wrong place for this question, so please re-direct me if necessary.
I have deployed a couple of simple functions using Google Cloud Functions that do the following:
1. Read files from AWS and write to Cloud SQL
2. Aggregate Cloud SQL data and write a CSV file to a Cloud Storage bucket
3. Run a simple OLS prediction model on the aggregated data
I have these as separate functions because (1) often takes longer than the Cloud Functions maximum timeout. Because of this, I am considering moving this whole thing to App Engine as a service. My questions about App Engine Standard are:
What do the request timeouts mean? If I were to run this service, do I still have a short time-limit after which it will no longer run?
Is App Engine the best thing to use for this task?
Thanks for all your help
According to the Google documentation, GAE Standard has a maximum timeout of 1 minute for HTTP requests and 10 minutes for cron/tasks in the older environments. Newer environments have a 10-minute limit for both HTTP requests and tasks. If your functions take longer than this, GAE Standard won't work for you. In that case, you should take a look at GAE Flex (see this Google documentation, which compares Flex to Standard).
Secondly, it seems to me that what you have are endpoints that are only hit occasionally or at specific scheduled times. If that is the case, I would also recommend taking a look at Cloud Run. We have a blog article about it, which includes this:
....Another thing to note about Cloud Run is that it only runs when it receives an HTTP request. It plays dead and comes alive to execute your code when an HTTP request comes in. When it is done executing the request, it goes 'dead' again till the next request comes in. This means you're not paying for time spent idling i.e. when it is not doing anything....
You can keep your Cloud Functions and the strong cohesion implemented by each of your 3 functions, and then use Cloud Workflows, a serverless solution, to orchestrate the 3 CF calls. The drawback: you pay for 3 CF invocations and 3 Workflows steps. But does it matter, given that 2 million CF invocations and 5,000 Workflows steps are free?
As proposed by @NoCommandLine, Cloud Run is indeed an alternative, with its timeout of 3600 s (1 h). The drawback: you need to wrap your code in an HTTP request handler and provide a web server like Express or Gunicorn.
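As a rough idea of what that wrapping involves, here is a minimal sketch using only Python's standard-library HTTP server rather than Express or Gunicorn. `run_job`, `JobHandler` and `serve` are hypothetical names; `run_job` stands in for the real work, and a production service would more likely sit behind a proper WSGI server.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_job():
    """Placeholder for the long-running work (hypothetical)."""
    return "done"


class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Run the job synchronously and send its result back to the caller.
        result = run_job().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)


def serve():
    # Cloud Run passes the port to listen on in the PORT environment variable.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), JobHandler).serve_forever()
```

Calling `serve()` as the container's entry point starts the server; a POST to any path then triggers one run of the job.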
A hack is to build a Docker container for your code, with no need for a web server, and run it using Cloud Build, which has a timeout of 24 hours.
Initial connections to my website are extremely slow (100+ seconds). How can I diagnose the issue?
Using the Chrome dev tool network tab, I see that the issue is "initial connection" and not things like SSL or Waiting/TTFB.
This only happens for the first page visit to the website for a given device; after the first page loads, everything on the website is very fast. This consistently happens for new devices, on the same device if I don't visit the website for X days, and on the same device if I clear the cache and browsing history.
The website is a Django app hosted on Google Cloud App Engine with 2 flexible instances.
User traffic to the website is low, so I doubt the issue is load balancing or traffic spikes.
Thanks!
Yesterday I tried opening the page and it took 1.8 min to load the main page and 2.1 min to run a search; later attempts were faster, as you mentioned. I also tried accessing the page today and it loaded quite fast.
From my understanding, the high latency of the first connection might be related to session handling, database connections, SSL certificate problems, a large amount of uncached data, or expensive operations that run before the server sends the response. It's nearly impossible for us to determine which without access to your code, logs and database configuration.
As for how to narrow down the issue I might suggest the following:
Examine your logs for possible causes.
Add timeit logging interleaved with each of the statements that handle the requests and watch for bottlenecks or long-running code.
Deploy the same endpoint without logos, images and other data that would be browser-cached and see if it makes a difference.
Create a simple hello-world endpoint and check its latency. Keep slowly evolving the endpoint to resemble your actual handling code, in the hope of finding where the issue is.
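The timing suggestion above could look something like the following sketch. It uses `time.perf_counter` in a small context manager rather than the `timeit` module, and the stages inside `handle_request` are hypothetical placeholders for whatever your view actually does.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timing")


@contextmanager
def timed(label):
    """Log how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s took %.1f ms", label, elapsed_ms)


def handle_request():
    # Hypothetical stages of a request handler; replace with real calls.
    with timed("load session"):
        time.sleep(0.01)
    with timed("query database"):
        time.sleep(0.02)
    with timed("render template"):
        time.sleep(0.005)
    return "ok"
```

Comparing the logged durations for a slow first request against a fast later one should show which stage, if any, accounts for the difference. If none does, the delay is probably happening before the request reaches your code.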
If only the first connection is slow, it might be because the instance is starting and you do not have minimum idle instances and warmup requests enabled. With that configuration, instances are kept ready to take traffic, so the latency of the first connection will be lower.
As it is stated in the documentation:
If you set a minimum number of idle instances, pending latency will
have less effect on your application's performance. Because App Engine
keeps idle instances in reserve, it is unlikely that requests will
enter the pending queue except in exceptionally high load spikes. You
will need to test your application and expected traffic volume to
determine the ideal number of instances to keep in reserve.
Also you can find more information about warmup requests in this documentation about Configuring Warmup Requests to Improve Performance
I resolved this issue by removing and recreating custom domain settings for my App Engine project, and removing and recreating the corresponding DNS records in domains.google, following these instructions:
https://cloud.google.com/appengine/docs/standard/python/mapping-custom-domains
I'm still not sure what the underlying issue was, but this fixed it. Hope this can help anyone encountering a similar issue.
I had this exact issue. It was killing our load performance once we switched to a load balancer.
What it ended up being was the instance group port setting. We're obviously using SSL certs for the site, but I had specified both port 80 and port 443.
Once I removed port 80 from the instance group that the load balancer refers to, it loaded all the pages immediately.
It appears that App Engine Standard has a warmup feature to warm up an app after a deployment, but I don't see the same feature available for Flex. The readiness and liveness probes also don't work for this, since setting the path to a custom path inside the application doesn't seem to make the probes actually hit the internal endpoint.
Is there some solution I'm missing other than doing things like manually hitting the endpoints myself after the deployment which won't be very reliable since the calls don't necessarily always round robin to each instance?
In App Engine Standard, warmup requests essentially load your app's code into a new instance before any live requests reach that instance. This can happen in the following situations:
When you redeploy a version of your app.
When new instances are created due to the load from requests exceeding the capacity of the current set of running instances.
When maintenance and repairs of the underlying infrastructure or physical hardware occur.
In App Engine Flexible, you can achieve the same result by using the initial_delay_sec setting for liveness checks in your app.yaml file. If you set up its value to give enough time for your code to initialize, the first request coming to that instance will be processed quickly by your already-initialized code.
I have an MVC5 web app running on Azure. Primarily this is a website, but I also have a cron job running (triggered from an external source) which calls a URL (a GET request) to carry out some housekeeping.
This task is asynchronous and can take up to, and sometimes over, the default Azure timeout of 230 seconds. I'm regularly getting 500 errors due to this.
Now, I've done a bit of reading and this looks like it's something to do with the Azure Load Balancer settings. I've also found some threads relating to being able to alter that timeout in various contexts but I'm wondering if anyone has experience in altering the Azure Load Balancer timeout in the context of a Web App? Is it a straightforward process?
but I'm wondering if anyone has experience in altering the Azure Load Balancer timeout in the context of a Web App?
You could configure the Azure Load Balancer settings in virtual machines and cloud services.
So, if you want to do it, I suggest deploying the web app to a virtual machine or migrating it to Cloud Services.
For more detail, you could refer to the link.
If possible, you could try to optimize your query or execution code.
This is a little bit of an old question but I ran into it while trying to figure out what was going on with my app so I figured I would post an answer here in case anyone else does the same.
An Azure Load Balancer (created via ANY means) which mostly comes along with an external IP Address — at least when created via Kubernetes / AKS / Helm — has the 4 min, 240 second idle connection timeout that you referred to.
That being said there are two different types of Public IP Addresses — Basic (which is almost always the default that you probably created) and Standard. More information on the docs: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-standard-overview
You can modify the idle timeout; see the following doc: https://learn.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout
That being said, changing the timeout may or may not solve your problem. In our case we had a Rails app that was connecting to a database outside of Azure. As soon as you add a Load Balancer with a Public IP, all traffic will EXIT through that public IP and be bound by the 4-minute idle timeout.
That's fine except for that Rails anticipates that a connection to a Database is not going to be cut frequently — which in this case happens ALL the time.
We ended up implementing a connection pooling service that sat between our Rails app and our real database (PgBouncer, which is specific to Postgres databases). That service monitored each connection and reconnected when the timer was nearing the Azure LB timeout.
It took a little while to implement but in our case it works flawlessly. You can see some more details over here: What Azure Kubernetes (AKS) 'Time-out' happens to disconnect connections in/out of a Pod in my Cluster?
The longest timeout you can set for a Public IP / Load Balancer is 30 minutes. If you have a connection that you would like to utilize that runs idle longer than that — then you may be out of luck. As of now 30 mins is the max.
I have been writing a Google Chrome extension for Stack Exchange. It's a simple extension that allows you to keep track of your reputation and get notified of comments on Stack Exchange sites.
I've now run into some issues that I can't resolve myself.
My extension uses Google App Engine as its back end to make external requests to the Stack Exchange API. A single client request from the extension for new comments on a single site can trigger many requests to the API endpoint to prepare the response, even for a non-skeetish user. The average user has accounts on at least 3 sites in the Stack Exchange network, and some have more than 10!
Stack Exchange API has request limits:
A single IP address can only make a certain number of API requests per day (10,000).
The API will cut my requests off if I make more than 30 requests over 5 seconds from single IP address.
It's clear that all requests should be throttled to 30 per 5 seconds and currently I've implemented request throttle logic based on a distributed lock with memcached. I'm using memcached as a simple lock manager to coordinate the activity of GAE instances and throttle UrlFetch requests.
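The 30-requests-per-5-seconds rule can be sketched as a sliding-window limiter. This version is single-process and in-memory, so it only illustrates the counting logic; coordinating several GAE instances would still need shared state such as the memcached lock described above. The class and method names are hypothetical.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds."""

    def __init__(self, max_calls=30, period=5.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()  # timestamps of recently allowed calls

    def try_acquire(self):
        """Return True and record the call if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            self.calls.append(now)
            return True
        return False

    def wait_time(self):
        """Seconds until the next call would be allowed (0 if allowed now)."""
        now = self.clock()
        if len(self.calls) < self.max_calls:
            return 0.0
        return max(0.0, self.period - (now - self.calls[0]))
```

A caller would check `try_acquire()` before each UrlFetch and, when it returns False, sleep for `wait_time()` seconds or defer the request to a task queue.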
But I think it's a big failure to limit such powerful infrastructure to issue no more than 30 requests per 5 sec. Such api request rate does not allow me to continue development of new interesting and useful features and one day it will stop working properly at all.
My app now has 90 users and is growing, and I need to come up with a way to maximize the request rate.
As is known, App Engine makes external UrlFetch requests through a shared pool of IP addresses.
My goal is to write request throttle functionality to ensure compliance with the api terms of usage and to utilize GAE distributed capabilities.
So my question is how to provide maximum practical API throughput while complying with the API terms of usage and utilizing GAE's distributed capabilities.
Advice to use another platform/host/proxy is not useful, in my mind.
If you are searching for a way to programmatically manage Google App Engine's shared pool of IPs, I firmly believe that you are out of luck.
Anyway, quoting this advice that is part of the faq, I think you have more than a chance to keep on running your awesome app:
What should I do if I need more
requests per day?
Certain types of applications -
services and websites to name two -
can legitimately have much higher
per-day request requirements than
typical applications. If you can
demonstrate a need for a higher
request quota, contact us.
EDIT:
I was wrong, actually you don't have any chance.
Google App Engine [app]s are doomed.
First off: I'm using your extension and it rocks!
Have you considered using memcached and caching the results?
Instead of taking the results from the API directly, first try to find them in the cache. If they are there, use them; if they are not, retrieve them, cache them, and let them expire after X minutes.
Second, try to batch up users' requests: instead of asking for the reputation of a single user, ask for the reputation of several users together.
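The cache-first lookup described above could be sketched like this. An in-process dictionary stands in for memcached here so the example is self-contained; `TTLCache`, `get_or_fetch` and the `fetch` callback are hypothetical names, and the callback represents whatever function actually calls the Stack Exchange API.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value, calling fetch(key) on a miss or expiry."""
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        # Miss or expired: go to the API, then remember the result.
        value = fetch(key)
        self.store[key] = (now + self.ttl, value)
        return value
```

With a TTL of a few minutes, repeated polls from the same user within that window cost no API requests at all, which leaves more of the quota for everyone else.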