Caching & GZip on GAE (Community Wiki) - google-app-engine

Why does it seem like Google App Engine isn’t setting appropriate cache-friendly headers (like far-future expiration dates) on my CSS stylesheets and JavaScript files? When does GAE gzip those files? My app.yaml marks the respective directories as static_dirs, so the lack of far-future expiration dates is kind of surprising to me.
This is a community wiki to showcase the best practices regarding static file caching and gzipping on GAE!

How does GAE handle caching?
It seems GAE sets near-future cache expiration times but does use the ETag header. This lets browsers ask, “Has this file changed since it had an ETag of X68f0o?” and hear “Nope – 304 Not Modified” back in response.
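Concretely, the exchange looks something like this (hypothetical file name, reusing the ETag value above):

GET /stylesheets/mysitestyles.css HTTP/1.1
Host: example.appspot.com
If-None-Match: "X68f0o"

HTTP/1.1 304 Not Modified
ETag: "X68f0o"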
As opposed to far-future expiration dates, this has the following trade-offs:
Your end users will get the latest copies of your resources, even if they have the same name (unlike far-future expiration). This is good.
Your end users will however still have to make a request to check on the status of that file. This does slow down your site, and is “pure overhead” when the content hasn’t changed. This is not ideal.
Opting for far-future cache expiration instead of (just) etag
Using far-future expiration dates takes two steps and a bit of understanding.
You have to manually update your app to request new versions of resources, e.g. by naming files like mysitestyles.2011-02-11T0411.css instead of mysitestyles.css. There are tools to help automate this, but I’m not aware of any that directly relate to GAE.
Configure GAE to set the expiration times you want using default_expiration and/or expiration in app.yaml (see the GAE docs on static files, and the sketch below).
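For example, a minimal app.yaml sketch (the directory names are hypothetical):

default_expiration: "30d"

handlers:
- url: /stylesheets
  static_dir: stylesheets
  expiration: "365d"  # far-future; pair with versioned file names
- url: /js
  static_dir: js
  expiration: "365d"

Anything not covered by a per-handler expiration falls back to default_expiration.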
A third option: Application manifests
Cache manifests are an HTML5 feature that overrides cache headers. MDN article, DiveIntoHTML5, W3C. This affects more than just your script and style files' caching, however. Use with care!
When does GAE gzip?
According to Google’s FAQ,
Google App Engine does its best to serve gzipped content to browsers that support it. Taking advantage of this scheme is automatic and requires no modifications to applications.
We use a combination of request headers (Accept-Encoding, User-Agent) and response headers (Content-Type) to determine whether or not the end-user can take advantage of gzipped content. This approach avoids some well-known bugs with gzipped content in popular browsers. To force gzipped content to be served, clients may supply 'gzip' as the value of both the Accept-Encoding and User-Agent request headers. Content will never be gzipped if no Accept-Encoding header is present.
This is covered further in the runtime environment documentation (Java | Python).
Some real-world observations show this to be generally true. Assuming a gzip-capable browser:
GAE gzips actual pages (if they have proper content-type headers like text/html; charset=utf-8)
GAE gzips scripts and styles in static_dirs (defined in app.yaml).
Note that you should not expect GAE to gzip images like GIFs or JPEGs as they are already compressed.
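A quick way to check this yourself is to force gzip the way the FAQ describes. A minimal Python sketch (the URL is hypothetical):

import gzip
import urllib.request

# Per the FAQ: send 'gzip' as both Accept-Encoding and User-Agent
req = urllib.request.Request(
    "https://your-app-id.appspot.com/js/app.js",
    headers={"Accept-Encoding": "gzip", "User-Agent": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    encoding = resp.headers.get("Content-Encoding")
    body = resp.read()
    if encoding == "gzip":
        # Browsers decode this transparently; here we do it by hand
        body = gzip.decompress(body)
print(encoding, len(body), "bytes after decoding")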

Related

Redirect 404 page rendering to CDN?

Is it possible to offload all 404 page renders to a CDN instead of an origin web server to prevent DDoS attacks? It seems that if you are being DDoSed and the attack is tying up compute resources by forcing your web server to render 404 pages, those renders would be better served through a CDN. Does anyone have any experience with this that they would be willing to share?
CloudFront automatically protects against DDoS using Shield Standard.
If you'd like to prevent requests made to CloudFront from reaching your origin, here are a few options:
Geographic restrictions: allow requests only from the specific countries where you expect your viewers to be located, or block the ones where you do not. You can configure this within the CloudFront console, and there is no cost to use it.
Add AWS WAF. You can block common application-layer attacks, as well as create specific rules to block requests (for example, for files ending in extensions you do not use, such as .php) or add rate limiting.
Write a CloudFront Function (a JavaScript function that executes at CloudFront's edge locations) to inspect the request and block any that do not match requests your application is capable of serving; for example, you could check that the incoming request matches one of the routes accepted by your application and return a 404 if not (see the sketch below).
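A minimal sketch of that last option (the route list is hypothetical; the AWS samples linked below show production-ready patterns):

function handler(event) {
    var request = event.request;
    var uri = request.uri;
    // Hypothetical allow-list: exact routes plus one static prefix
    var routes = ['/', '/index.html', '/api/health'];
    if (routes.indexOf(uri) === -1 && uri.indexOf('/static/') !== 0) {
        // Generated at the edge; the request never reaches the origin
        return { statusCode: 404, statusDescription: 'Not Found' };
    }
    return request;
}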
Both WAF and CloudFront Functions may add an additional cost. CloudFront has a perpetual free tier (meaning it is applied every month) that includes 1TB of data transfer out and 2 million CloudFront Function executions each month. Function executions beyond that are priced at $0.10 per 1 million invocations.
https://aws.amazon.com/cloudfront/pricing/
Several examples of CloudFront Functions to get you started are available here: https://github.com/aws-samples/amazon-cloudfront-functions

Cache busting with versioning does not seem to work

I am currently using versioning to bust the cache: I generate different file names using a date or version. However, this breaks Google's cached copies of my pages, because Google looks for the old file names.
I have a webpack setup for the chunking.
output.filename = '[name].js?v=' + hash
output.chunkFilename = '[name].js?v=' + hash
And I can see the browser requesting files with v=xxx correctly.
However, sometimes I need to ask my customers to open the dev tools and do a clear-cache-and-hard-refresh, because a normal refresh somehow does not work.
I also use the Cloudflare CDN, and it has its own cache policy.
Cloudflare response headers:
cache-control: max-age=31536000
cf-bgj: minify
cf-cache-status: HIT
cf-polished: origSize=9873
How do I make sure the browser and Cloudflare purge all the JS and CSS files when new code is pushed?
I do not know what to do when a normal refresh does not work.
On Cloudflare there are several ways to control the behavior of the cache:
Understanding the Cloudflare CDN (general rules)
Cache level (can be configured to consider or ignore the querystring)
Page Rules (useful to fine tune caching behavior based on URL patterns)
Origin Cache Control (to control the behavior based on the cache headers returned by your origin server)
You also have various options (depending on your plan) for proactively purging certain resources with Cache Purge (from the dashboard or via the API).
It is worth reviewing the above settings (in particular cache levels and Page Rules) to verify that the querystring is being considered part of the cache key used to retrieve the data. Note that the header cf-cache-status: HIT indicates that the requested resource was served from the CDN's cached copy.
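If the querystring turns out to be the problem, a common alternative (a sketch, assuming a reasonably recent webpack) is to put the hash into the filename itself, so every deploy produces a genuinely new URL that no cache layer can conflate with the old one:

output.filename = '[name].[contenthash].js'
output.chunkFilename = '[name].[contenthash].js'

Your HTML then has to reference the generated names (html-webpack-plugin can handle that), and both browsers and Cloudflare will fetch the new files without any purge.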

Change to static file doesn't happen immediately after deploy

When I change a static file (here page.html) and then run appcfg.py update, the change has not actually taken place when I curl for the file, even after the deployment succeeds and reports that the new files are serving.
Relevant excerpt from my app.yaml:
default_expiration: "10d"

handlers:
- url: /
  static_files: static/page.html
  upload: static/page.html
  secure: always
Google's docs say "Static cache expiration - Unless told otherwise, web proxies and browsers retain files they load from a website for a limited period of time." There shouldn't be any browser cache involved, since I am using curl to get the file, and I don't have a proxy set up at home, at least.
Possible hints at the answer
Interestingly, if I curl for /static/page.html directly, it has updated, but if I curl for /, which should point to the same file, it has not.
Also, if I add some dummy GET arg, such as /?foo, I can see the updated version. I also tried adding the -H "Cache-Control: no-cache" option to my curl command, but I still got the stale version.
How do I see updates to / immediately after deploy?
As pointed out by Omair, the docs for the Python standard environment state that "files are likely to be cached by the user's browser, as well as by intermediate caching proxy servers such as Internet Service Providers". But I've found a way to flush static files cached by your app on Google Cloud.
Head to your Google Cloud Console and open your project. Under the left hamburger menu, head to Storage -> Browser. There you should find at least one bucket: your-project-name.appspot.com. Under the Lifecycle column, click on the link for your-project-name.appspot.com. Delete any existing rules, since they may conflict with the one you will create now.
Create a new rule by clicking on the 'Add rule' button. For the object conditions, choose only the 'Newer version' option and set it to 1. Don't forget to click on the 'Continue' button. For the action, select 'Delete' and click on the 'Continue' button. Save your new rule.
This new rule will take up to 24 hours to take effect, but at least for my project it took only a few minutes. Once it is up and running, the version of the files being served by your app under your-project-name.appspot.com will always be the latest deployed, solving the problem. Also, if you are routinely editing your static files, you should remove any expiration element from handlers related to those static files and the default_expiration element from the app.yaml file, which will help avoid unintended caching by other servers.
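If you prefer the command line, the same rule can be expressed as a lifecycle config file and applied with gsutil (the bucket name is a placeholder):

{
  "lifecycle": {
    "rule": [
      {
        "action": {"type": "Delete"},
        "condition": {"numNewerVersions": 1}
      }
    ]
  }
}

gsutil lifecycle set lifecycle.json gs://your-project-name.appspot.com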
According to App Engine's documentation on static cache expiration, this could be due to caching servers between you and your application respecting the caching headers on the responses:
The expiration time will be sent in the Cache-Control and Expires HTTP response headers, and therefore, the files are likely to be cached by the user's browser, as well as by intermediate caching proxy servers such as Internet Service Providers.
Once a file is transmitted with a given cache expiration time, there is generally no way to clear it out of intermediate caches, even if you clear the browser cache or use curl with the no-cache option. Re-deploying a new version of the app will not reset intermediate caches either.
For files that need to be modified, shorter expiration times are recommended.
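In app.yaml terms, that means giving frequently-edited files their own short per-handler expiration, which overrides default_expiration. A sketch based on the excerpt above:

handlers:
- url: /
  static_files: static/page.html
  upload: static/page.html
  secure: always
  expiration: "5m"  # short-lived, overrides default_expiration: "10d"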

Set headers on file upload to Google Cloud Storage

According to the documentation, one should be able to set object headers on upload to Google Cloud Storage.
Implementation Details
You should specify cache-control only for objects that are accessible to all anonymous users. To be anonymously accessible, an object's ACL must grant READ or FULL_CONTROL permission to AllUsers. If an object is accessible to all anonymous users and you do not specify a cache-control setting, Cloud Storage applies a cache-control setting of 3600 seconds. When serving via XML, Cloud Storage respects the cache-control of the object as set by its metadata.
However, adding headers through the Google API doesn't seem to have any effect when fetching the image back with google.appengine.api.images.get_serving_url.
Changing Cache-Control headers from the gsutil console does take effect, but it takes several days for the changes to be visible on the object (when checking from the gsutil console again); there is still no effect when fetching the image back with the API.
After 2 months of going back and forth with Google's support, we found out that the file is sent to Google Cloud Storage with the proper headers (this can be checked via the gsutil command).
However, the get_serving_url function does not respect the blob's headers (confirmed by Google's engineers).
As of 17 August 2017, there are no plans to fix that.
I thought someone might encounter a similar problem, as there's nothing about this in the documentation.
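For reference, a minimal sketch of setting Cache-Control at upload time with the modern google-cloud-storage client (not necessarily what the original poster used; the bucket and object names are hypothetical). As the answer above explains, the metadata lands on the object, but get_serving_url still ignores it:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")
blob = bucket.blob("images/logo.png")
# Stored as object metadata and honored when the object is served
# directly from Cloud Storage -- but NOT by images.get_serving_url
blob.cache_control = "public, max-age=3600"
blob.upload_from_filename("logo.png", content_type="image/png")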

How does Google App Engine modify HTTP request/response encoding?

I'm trying to send a JSON HTTP request from a Google App Engine application and retrieve the response. While this works great locally, it suddenly breaks when I deploy it to GAE.
To be more precise, the HTTP response body that is returned to my application ends up looking like this instead of being simple JSON:
�\bD�[��8��ʖϣ�M�M$ �\\�bA` #!r���~pvk�cR]�_7E�
I did find one set of circumstances where I get a correct response on GAE, which might give some insight into this behavior: if the response doesn't have a content-type header, it goes through fine, but as soon as a content-encoding header set to "gzip" is present, I get the garbage above as the response body.
Unfortunately I don't have control over the service I'm calling, so the only choice I have is to fix this somehow on my side. To do that, I'm trying to understand the difference in what Google does to the response. Does anyone know?
I understand that Google does some things to HTTP traffic. Is it forcing gzip on my responses as well?
I've also tried playing with encodings, trying to read the response as utf-8 and setting utf-8 as the default encoding for my GAE application as recommended here, but with no effect. I've ruled out incorrect processing of the response in my code or anything I'm using, at least I think so; otherwise I'd have the same problems locally. I'm trying to understand what exactly is happening, with the hope that this will give me some idea of how to prevent it.
EDIT: I figured it out and have a workaround, but it's still a workaround, not a solution. My GAE app calls another web service outside GAE which sometimes gzips its response and sometimes doesn't. When it does, GAE strips the content-type header from the response, preventing my app from correctly decoding the response body. My workaround so far is to get the response bytes and test whether the response is valid JSON, unzipping it manually if it isn't (see the sketch below). I would still like to know whether the stripping of content-type can be prevented...
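A minimal sketch of that workaround (the function name is mine), keyed off the gzip magic number rather than a try/except around JSON parsing:

import gzip
import json

def parse_body(body):
    # gzip streams always start with the magic bytes 0x1f 0x8b
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return json.loads(body)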
As explained at https://cloud.google.com/appengine/kb/#compression , the application should not supply the content-encoding header: "Google App Engine does its best to serve gzipped content to browsers that support it. Taking advantage of this scheme is automatic and requires no modifications to applications".
I believe this architecture (content encoding being controlled not by the application side of things -- on App Engine, your application code -- but by the server/gateway side) originated with WSGI, the standard Python interface between applications and web servers/gateways (which App Engine uses on the Python runtime), but the architecture makes enough sense that we generalized it -- as the above page puts it, "This approach avoids some well-known bugs with gzipped content in popular browsers."
The client is far from powerless, and in fact, if it so chooses, can control the content encoding -- still quoting, "To force gzipped content to be served, clients may supply 'gzip' as the value of both the Accept-Encoding and User-Agent request headers. Content will never be gzipped if no Accept-Encoding header is present.".
