CDN serving private images / videos - database

I would like to know how CDNs serve private data such as images and videos. I came across this Stack Overflow answer, but it seems to be specific to Amazon CloudFront.
As a popular example, let's say the problem in question is serving the content inside Facebook. There is access-controlled content at the individual-user level and at the group level, and there is also some publicly accessible data.
All logic of what can be served to whom resides on the server!

The first request to the CDN goes to the application server and gets validated for access rights. But there is a catch to keep in mind:
Assume that first request is successful. After that, anyone will be able to access the image with that CDN URL. I tested this with a restricted user-uploaded Facebook image, and others could access it via the CDN URL too, even after I logged out. So the image stays accessible until the CDN cache expires.

I believe this should work: all requests first come to the main application server. After it determines whether access is allowed, it either redirects to the CDN or returns an access-denied error.
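A minimal sketch of that flow in Node/Express, assuming the CDN can verify expiring HMAC-signed URLs (the CDN hostname, signing secret, and userCanAccess helper are all hypothetical):

const crypto = require('crypto');
const express = require('express');

const app = express();
const SIGNING_SECRET = process.env.CDN_SIGNING_SECRET; // shared with the CDN (hypothetical setup)

// Placeholder: replace with your real per-user / per-group access checks.
function userCanAccess(user, mediaId) {
  return Boolean(user);
}

// Build a short-lived signed CDN URL; the link stops working after `expires`.
function signedCdnUrl(path, ttlSeconds) {
  const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
  const signature = crypto
    .createHmac('sha256', SIGNING_SECRET)
    .update(path + ':' + expires)
    .digest('hex');
  return 'https://cdn.example.com' + path + '?expires=' + expires + '&sig=' + signature;
}

app.get('/media/:id', (req, res) => {
  if (!userCanAccess(req.user, req.params.id)) {
    return res.status(403).send('Access denied'); // access-denied error
  }
  // Redirect to the CDN with a URL that is only valid for 60 seconds.
  res.redirect(signedCdnUrl('/media/' + req.params.id, 60));
});

app.listen(3000);

Note that this only shortens the exposure window described above: anyone who obtains the signed URL can still use it until it expires, just as the cached Facebook image stayed reachable until the CDN cache expired.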

Each CDN works differently, so unless you specify which CDN you are asking about, it's hard to say more.

Related

Anyone can fetch my blog posts using my GET endpoints and use them on their own site. Is there a way to protect against this?

I have a blogging app built on top of the MERN stack. I am fetching my blog posts on the React front-end; however, I feel anyone could use my blog posts on their own site by hitting the same endpoint. I want to prevent this. Is there a way?
If, for some reason, it isn't enabled already, make sure your endpoints enforce standard Access-Control-Allow-Origin restrictions - that is, that they only permit direct connections from your domain, not from other sites. This will make it slightly more difficult for other sites to scrape yours, because they won't be able to make requests directly from their frontend.
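With Express, for example, that could look something like this (a sketch using the cors middleware; the origin value is a placeholder for your own domain):

const express = require('express');
const cors = require('cors');

const app = express();

// Only allow browser requests whose Origin header matches your site.
app.use(cors({ origin: 'https://your-blog.example.com' }));

app.get('/api/posts', (req, res) => {
  res.json([{ title: 'some post title', content: 'some content' }]);
});

app.listen(3000);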
You could also change your application structure so that the blog data gets sent with the initial HTML response. For a small example, you could embed the data in the page:
<!-- rendered into the initial HTML response by the server -->
<script type="application/json" class="blog-data">
[{"title":"some post title", "content":"some content"}]
</script>
and then read it from your client-side bundle instead of calling the API:
const blogData = JSON.parse(document.querySelector('.blog-data').textContent);
This will also make it a bit harder for a scraper to work - they won't have an endpoint ready to serve the plain blog data, they'll have to parse through your HTML response first.
You could also frequently change up the DOM structure of the data in the HTML response to make it harder.
But web scraping is fundamentally nearly impossible to stop, for someone who's determined enough.
Basically, you can use CORS on your backend to prevent browsers on any origin other than the allowed ones from fetching your endpoints.
It will not, however, protect your API from being called by things like mobile apps, Postman, etc.
If you are worried about server load, you can add something like rate limiting.
But keep in mind that if your API is public, it is public for everyone; you can't restrict its use to your site only.
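For the rate-limiting idea, a minimal sketch with the express-rate-limit package (the window and cap are arbitrary placeholders):

const express = require('express');
const rateLimit = require('express-rate-limit');

const app = express();

// Cap each client IP at 100 requests per 15-minute window.
app.use(rateLimit({ windowMs: 15 * 60 * 1000, max: 100 }));

app.listen(3000);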
Here are a few ideas:
Maybe add some authentication to protect your endpoints.
If you are using CORS, only accept requests from a certain URL.
In your package.json, add a proxy.

Reduce initial server response time with Netlify and Gatsby

I'm running PageSpeed Insights on my website, and one big error that I get sometimes is:
Reduce initial server response time
Keep the server response time for the main document short because all other requests depend on it.
React: If you are server-side rendering any React components, consider using renderToNodeStream() or renderToStaticNodeStream() to allow the client to receive and hydrate different parts of the markup instead of all at once.
I looked up renderToNodeStream() and renderToStaticNodeStream() but I didn't really understand how they could be used with Gatsby.
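From what I can tell, outside of Gatsby the streaming API is used roughly like this (a sketch with a plain Express server; App stands in for the root component):

const express = require('express');
const React = require('react');
const { renderToNodeStream } = require('react-dom/server');
const App = require('./App'); // placeholder: your root component

const app = express();

app.get('/', (req, res) => {
  res.write('<!DOCTYPE html><html><body><div id="root">');
  // Stream the markup to the client in chunks instead of all at once.
  const stream = renderToNodeStream(React.createElement(App));
  stream.pipe(res, { end: false });
  stream.on('end', () => {
    res.write('</div></body></html>');
    res.end();
  });
});

app.listen(3000);

But I don't see where something like this would plug into Gatsby's build.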
It looks like a problem others are having as well.
The domain is https://suddenlysask.com if you want to look at it
My DNS records
Use a CNAME record on a non-apex domain. By using the bare/apex domain you bypass the CDN and force all requests through the load balancer. This means you end up with a single IP address serving all requests (fewer simultaneous connections), the server is proxying to the content without caching, and the distance to the user is likely to be further.
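In zone-file terms the suggestion looks roughly like this (hypothetical records; the Netlify hostname is a placeholder):

; point the www subdomain at the CDN via CNAME
www.suddenlysask.com.   3600  IN  CNAME  your-site.netlify.app.
; then redirect the apex (suddenlysask.com) to www at the registrar or DNS host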
EDIT: Also, your HTML file is over 300KB. That's obscene. It looks like you're including Bootstrap in it twice, you're repeating the same inline <style> tags over and over with slightly different selector hashes, and you have a ton of (unused) utility classes. You only want to inline critical CSS if possible; serve the rest from an external file if you can't treeshake it.
Well, the behavior is unexpected. I ran PageSpeed Insights on your site and it gave me a warning on the first test, with an initial response time of 0.74 seconds. Then I used my developer tools to look at the initial response time on the document root, which was consistently between 300 and 400 ms. So I ran the PageSpeed test again and the response was 140 ms; the test passed. After that it was 120 ms.
I really don't think there is a problem with the site. Still, if you want to try something, I would recommend changing the server or hosting for once and going with something different. I don't know what kind of server the site is deployed on right now. You could try AWS S3 and CloudFront; that works well for me.

Private S3 + CloudFront react app: "XML file does not appear to have any style information associated with it"

This is a follow-up question to the one found here: CloudFront + S3 Website: "The specified key does not exist" when an implicit index document should be displayed
I am trying to host a React single-page app (a static website) on S3, and I want to allow HTTPS access only (using a custom SSL certificate). I have everything configured with CloudFront, and my website shows up at the CloudFront URL just fine. But when I navigate around the app, I get the error shown in the link above.
According to that post, the error is fixed by switching from a REST endpoint to a website endpoint. But in the process, you have to make your S3 bucket public. My question: is there a way to fix this error without switching to a website endpoint and thereby making all my S3 content public? Is there some kind of workaround within the AWS ecosystem where I can combine private S3 content with a process that returns the HTML document without the XML-formatted error? According to this reference (https://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteEndpoints.html#WebsiteRestEndpointDiff), this seems like it may not be possible, but I'm hoping someone can prove me wrong.
Thanks!
The error you're getting usually occurs when your application tries to access something it isn't privileged to.
Since you mentioned the app loads normally but you get this error while you move around, it may be that a component is trying to load a private resource that you haven't covered in the policies you've defined.
My question: is there a way to fix this error without switching to a website endpoint and, in the process, making all my S3 content public?
Definitely! But you need to pinpoint the resource that is being accessed when you get the error. Please provide more info about it.
Lastly, if you switch to website endpoints, you won't be able to serve private S3 content; you'll have to make it all public. You can find more info about this here: https://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteEndpoints.html#WebsiteRestEndpointDiff
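For what it's worth, a commonly documented workaround for single-page apps behind a private S3 REST origin (not something the reference above describes) is a CloudFront custom error response that serves index.html with a 200 status whenever the origin returns 403/404 for a client-side route. A sketch of the relevant distribution-config fragment, with illustrative values:

"CustomErrorResponses": {
  "Quantity": 1,
  "Items": [
    {
      "ErrorCode": 403,
      "ResponsePagePath": "/index.html",
      "ResponseCode": "200",
      "ErrorCachingMinTTL": 10
    }
  ]
}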

Why couldn't I see any text in "http://crawlservice.appspot.com/?key=123456&url=http://mydomain.com#!article"?

OK, I found this link https://code.google.com/p/gwt-platform/wiki/CrawlerSupport#Using_gwtp-crawler-service that explains how you can make your GWTP app crawlable.
I have some GWTP experience, but I know nothing about AppEngine.
Google said its "crawlservice.appspot.com" can parse any Ajax page. Now I have a page "http://mydomain.com#!article" with an article that was pulled from the database. Say that page has the text "this is my article". Now when I open this link:
crawlservice.appspot.com/?key=123456&url=http://mydomain.com#!article, I can see all the JavaScript, but I couldn't find the text "this is my article".
Why?
Now let's check with a real-life example:
Open this link https://groups.google.com/forum/#!topic/google-web-toolkit/Syi04ArKl4k and you will see the text "If i open that url in IE".
Now open http://crawlservice.appspot.com/?key=123456&url=https://groups.google.com/forum/#!topic/google-web-toolkit/Syi04ArKl4k - you can see all the JavaScript, but there is no text "If i open that url in IE".
Why is that?
So if I use http://crawlservice.appspot.com/?key=123456&url=mydomain#!article, will Google's crawler be able to see the text in mydomain#!article?
Also, why key=123456 - does that mean everyone can use this service? Do we get our own key? Does Google limit the number of calls to the service?
Could you explain all of this?
Extra Info:
Christopher suggested that I use this example:
https://github.com/ArcBees/GWTP-Samples/tree/master/gwtp-samples/gwtp-sample-crawler-service
However, I ran into another problem. My app is pure GWTP; it doesn't have an appengine-web.xml in WEB-INF. I have no idea what AppEngine or GAE means, or what Maven is.
Do I need to register for App Engine?
My app may have a lot of traffic, and I am using a GoDaddy VPS. I don't want to register for App Engine, since I would have to pay Google for the extra traffic.
Everything in my GWTP app is OK right now except the crawler function.
So if I don't use Google App Engine, how can I build the crawler function for GWTP?
I tried to use HTMLUnit for my app, but HTMLUnit doesn't work for GWTP (see details here: Why HTMLUnit always shows the HostPage no matter what url I type in (Crawlable GWT APP)?)
I believe you are not allowed to crawl Google Groups. They are probably actively trying to prevent this, which is why you do not see the expected content.
There are a couple of points I wish to elaborate on:
The Google Code documentation is no longer maintained. You should look on Github instead: https://github.com/ArcBees/GWTP/wiki/Crawler-Support
You shouldn't use http://crawlservice.appspot.com. This isn't a Google service, it's out of date and we may decide to delete it down the road. This only serves as a public example. You should create your own application on App Engine (https://appengine.google.com/)
There is a sample here (https://github.com/ArcBees/GWTP-Samples/tree/master/gwtp-samples/gwtp-sample-crawler-service) using GWTP's Crawler Service. You can basically copy-paste it. Just make sure you update the <application> tag in appengine-web.xml to the name of your application and use your own service key in CrawlerModule.
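For reference, a minimal appengine-web.xml looks roughly like this (your-app-id is a placeholder for your App Engine application name):

<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>your-app-id</application>
  <version>1</version>
  <threadsafe>true</threadsafe>
</appengine-web-app>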
Finally, if your client uses GWTP and you followed the documentation, it will work. If you want to try it manually, you must encode the query parameters.
For example, http://crawlservice.appspot.com/?key=123456&url=http://www.arcbees.com#!service will not work, because the hash (everything from # onward) is not sent to the server.
On the other hand http://crawlservice.appspot.com/?key=123456&url=http%3A%2F%2Fwww.arcbees.com%2F%23!service will work.
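In JavaScript, the encoded form can be produced with encodeURIComponent:

// Encode the target URL so the #! fragment survives as a query parameter.
const target = 'http://www.arcbees.com/#!service';
const crawlerUrl =
  'http://crawlservice.appspot.com/?key=123456&url=' + encodeURIComponent(target);
// -> http://crawlservice.appspot.com/?key=123456&url=http%3A%2F%2Fwww.arcbees.com%2F%23!service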

What's the correct way of handling web requests during database maintenance?

Scenario: you are going to do scheduled database maintenance, and will therefore be unable to serve dynamic content (assume the caching system in front of the database also needs maintenance).
During that time, what's the correct way of handling web requests trying to access a dynamic resource?
What's the correct HTTP error code, if any, that goes along with the notice that your service is currently not available? Should you use errors in the 5XX range?
What are the implications in terms of SEO? Will it hurt if search engine crawlers try to access your site and see lots of error codes or pages with the same notice instead of dynamic content? Can you easily recover from that?
503 Service Unavailable is the correct response to use in this situation.
Depending on how your site works, you could just put up a static HTML page in place of everything, saying that the site is undergoing maintenance.
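A minimal sketch of that in Express, with a Retry-After header so well-behaved crawlers know the outage is temporary (the duration is a placeholder):

const express = require('express');
const app = express();

// During maintenance, answer every request with 503 plus a retry hint.
app.use((req, res) => {
  res.set('Retry-After', '3600'); // seconds until clients should retry
  res.status(503).send('<h1>Down for maintenance</h1><p>We will be back shortly.</p>');
});

app.listen(3000);

Search engines generally treat a 503 as temporary, so serving it consistently during the maintenance window should let you recover without lasting SEO damage.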
