I am going to be hosting hundreds of thousands of pages, each only a few KB and all in the exact same format. I originally thought of using a central database, but then realised that, since the content on each page will never change, the database requests would be unnecessary. I also considered something like memcache but, as stated above, decided it wouldn't gain much, since the content will never change.
Question:
As the content is only a few KB per page, is it actually that inefficient to serve them as static pages?
Could it affect the ability to search through the content to find a page, and how? E.g.:
Each page will have a description; when a user runs the search function, it will need to search through the descriptions of every page.
One thing on each individual page will change, but only once every few weeks/months. Should I use a database for that, or serve it statically?
Related
I have a working Hugo site. It has hundreds of pages. However there are times I just want to regenerate a single page.
I know that Hugo is super fast, often rendering hundreds or thousands of pages per second. However, in this case I'm trying to optimize a particular situation, and the ability to generate just this one page would be the best option.
There is no way to request Hugo to update a single file. This is mostly because many Hugo parameters and functions require Hugo to analyze the whole set of pages it renders (internal linking, page counts...).
The only way would be to set all the pages you don't want to update as Draft, but this would have an impact on the site for the reason mentioned above.
You can disable some page kinds using the hugo --disableKinds flag.
See here: https://gohugo.io/commands/hugo/
If it is a speed issue, the best solution is to use partialCached instead of partial, to avoid rendering the same partial for each page. This improves the rendering speed significantly.
https://gohugo.io/functions/partialcached/
I did look through the similar questions and found this one, but the answer there isn't, at least by itself, dynamic enough for my needs.
Similar to that question, I am attempting to put together a multi-tenant application with a different skin per property. However, the answer given in the above question assumes that the various skin resources can be hard-coded into the application. That would be fine if we were talking about 2 or 3 skins, but my application will need to support dozens at launch and probably tens of thousands in its lifetime (each property can create multiple skins for different campaigns).
I have an API where I can request the skin, which is currently a long string of HTML with a token embedded indicating where the application contents should be rendered into the skin (e.g. {{body}}).
One of the things I'll need to do is inject some <link> tags into the <Head> element to pull in some external CSS. If React.Fragment supports attributes (like dangerouslySetInnerHTML), I haven't been able to figure out how. If it's possible, that might be one way.
I'll run into the same problem when I want to inject some pre-application and post-application content into the body of the page, too.
Since I want the skin to be rendered server-side on the first request and then be static until the tab is closed, it makes sense to do this in pages/_document.js. After that, I'm kind of lost for what to do next. Parsing the string that contains the skin content is easy enough, but how do I intermingle that raw HTML with React components?
Project Description:
As a learning exercise for ASP.NET MVC 4, I'm creating a site builder / multi-tenancy site. It's nothing too fancy, just WYSIWYG editing of templates, with custom routing to direct users to the correct template based on subdomain. So usr1.mysite.com is directed to the template edited by usr1. My main concern at the moment is my method of storing the edited templates.
Storage Dilemma:
At first I was simply going to make the templates into views and store the changes made by the user in the database. When usr1's template was displayed, the system would pull up the view and populate it with usr1's data.
Instead, I've implemented a system that takes the user's modified template and saves the whole thing as static HTML files in the file system. Only the path to usr1's site (and some other details) is saved in the database. When usr1.mysite.com is called, I just have a "content" controller retrieve the correct HTML file.
Question:
Is there any reason to choose the database/view method over the static html file method?
Also I'm not concerned with having dynamic content in the end user pages. This is one reason I even tried the file method.
Decision (EDIT):
I'm implementing the file method. After more research (verifying my previous research), I have few doubts that the file system can handle even a few hundred sites. I will structure it so that user data directories are grouped into group directories based on a naming convention I've yet to dream up, probably something like 000usr1 and 000usr2 inside a 000 group directory, with a goal of fewer than 100 files/folders in any given directory and no more than 4 levels deep. That should give me the capability of holding 10,000 sites. I have no plans for any activity near that level with this software, but I do want to get it up and running and torture it for a while to see what it's capable of handling. If anyone expresses any interest, I'll post back some results.
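For illustration, here is a minimal sketch of one possible grouping convention along those lines (Python is used purely for illustration here, and the function and directory names are hypothetical, not part of the project):

import os

def site_directory(user_number, root="sites"):
    # 100 user directories per group directory keeps every directory under
    # the ~100-entry goal and the tree well under 4 levels deep.
    group = user_number // 100
    group_name = "%03d" % group                       # e.g. "000", "001", ...
    user_dir = "%susr%d" % (group_name, user_number)  # e.g. "000usr1"
    return os.path.join(root, group_name, user_dir)

# site_directory(1)   -> sites/000/000usr1
# site_directory(250) -> sites/002/002usr250

With 100 groups of 100 user directories each, a scheme like this tops out right around the 10,000-site goal mentioned above.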
How would a web analytics package such as Piwik, Google Analytics, Omniture, etc. determine which pages are unique within a set of URLs?
E.g. a) a site could have the following pages for a product catalogue
http://acme.com/products/foo
http://acme.com/products/bar
or b) use query string
http://acme.com/catalogue.xxx?product=foo
http://acme.com/catalogue.xxx?product=bar
In either case you can have extra query string vars for things like affiliate links or other uses, so how could you determine that it's the same page?
E.g. both of these are for the foo product page listed above:
http://acme.com/products/foo?aff=somebody
http://acme.com/catalogue.xxx?product=foo&aff=somebody
If you ignore the query string entirely, then all products in catalogue.xxx are collated into one page view.
If you don't ignore the query string, then any extra query string params make requests look like different pages.
If you're dealing with 3rd-party sites, then you can't assume that they are using either method, or rely on something like canonical links being correct.
How could you tackle this?
Different tracking tools handle it differently, but you can explicitly set the reporting URL for all of them.
For instance, Omniture doesn't care about the query string; it will chop it off. Even if you don't specify a pageName and it defaults to the URL in the pages report, it still chops off the query string.
GA will record the full URL, including the query string, every time.
Yahoo Web Analytics only records the query string on first page of the visit and every page afterwards it removes it.
But as mentioned, all of the tools have a way to explicitly specify the URL to be reported, and it is easy to write a bit of JavaScript to remove the query string from the URL and pass that as the URL to report.
You mentioned giving your tracking code to 3rd parties. Since you are already giving them tracking code, it's easy enough to throw that extra bit of JavaScript into it.
For example, with GA (async version), instead of
_gaq.push(['_trackPageview']);
you would do something like
var page = location.href.split('?'); // everything before the first '?'
_gaq.push(['_trackPageview', page[0]]);
Edit:
Or... for GA you can actually specify query parameters to exclude within the reporting tool. Different tools may or may not do this for you, so the code example above can be adapted to any of the tools (pushing to their specific URL variable, obviously).
If you're dealing with third-party sites, you can't assume that their URLs follow any specific format either. You can try downloading the pages and comparing them locally, but even that is unreliable because of issues like rotating advertisements, timestamps, etc.
If you are dealing with a single site (or a small group of them), you can make a pattern to match each URL to a canonical (for you) form. However, this will get unmanageable quickly.
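As a rough illustration of that kind of mapping, here is a minimal sketch (Python; the whitelist of query parameters that actually identify a page is hypothetical, and everything else, such as aff, gets dropped):

from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical: only these query parameters distinguish one page from another.
SIGNIFICANT_PARAMS = {"product"}

def canonical_url(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in SIGNIFICANT_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(sorted(kept)), ""))

# canonical_url("http://acme.com/products/foo?aff=somebody")
#   -> "http://acme.com/products/foo"
# canonical_url("http://acme.com/catalogue.xxx?product=foo&aff=somebody")
#   -> "http://acme.com/catalogue.xxx?product=foo"

The unmanageable part is maintaining that whitelist per site, which is exactly why this doesn't scale to arbitrary third-party sites.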
Of course, this is the reason that search engines like Google recommend the use of rel='canonical' links in the page header; if Google has issues telling the pages apart, it's not a trivial problem.
I'm planning to launch a comic site which serves comic strips (images).
I have little prior experience with serving/caching images.
So these are the 2 methods I'm considering:
1. Using LinkProperty
class Comic(db.Model):
    image_link = db.LinkProperty()
    timestamp = db.DateTimeProperty(auto_now=True)
Advantages:
The images are served straight from static disk space (and disk space is cheap, I take it?)
I can easily set up app.yaml with an expiration date to cache the content in the user's browser
I can set up memcache to retrieve the entities faster (for high traffic)
2. Using BlobProperty
I used this tutorial, and it worked pretty neatly: http://code.google.com/appengine/articles/images.html
Side question: Can I say that using BlobProperty sort of "protects" my images from outside linkage? That is, people can't just link directly to the comic strips.
I have a few worries for method 2.
I can obviously memcache these entities for faster reads.
But then:
Is memcaching images a good thing? My images are large (100-200 KB per image). I think memcache allows only up to 4 GB of cached data? Or is it 1 MB per memcached entity, with unlimited entities...
What if appengine's memcache fails? -> Solution: I'd have to go back to the datastore.
How do I cache these images in the user's browser? If I were using method no. 1, I could easily add an expiration date for the content to my app.yaml, and the pictures would get cached on the user's side.
I would like to hear your thoughts.
Should I use method 1 or 2? Method 1 sounds dead simple and straightforward; should I be wary of it?
[EDITED]
How do I solve this dilemma?
Dilemma: The last thing I want is for people to grab the direct link to an image and put it up on bit.ly, because the user would then be directed only to the image on my server
(and not to the advertising/content around it, which they would have seen had they accessed it from the main page itself).
You're going to be using a lot of bandwidth to transfer all these images from the server to the clients (browsers). Remember that App Engine has a maximum number of files you can upload; I think it is 1000, but it may have increased recently. And if you want to control access to the files, I do not think you can use option #1.
Option #2 is good, but your bandwidth and storage costs are going to be high if you have a lot of content. To solve this problem, people usually turn to Content Delivery Networks (CDNs). Amazon S3 and edgecast.com are two such CDNs that support token-based access URLs. Meaning, you can generate a token in your App Engine app that is good for an IP address, a time window, a geography and some other criteria, and then hand out your CDN URL with this token to the requestor. The CDN serves your images and does the access checks based on the token. This will help you control access, but remember: if there is a will, there is a way, and you can't 100% secure anything - but you can probably get reasonably close.
So instead of storing the content in App Engine, you would store it on the CDN and use App Engine to create URLs with tokens pointing to the content on the CDN.
Here are some links about the signed URLs. I've used both of these:
http://jets3t.s3.amazonaws.com/toolkit/code-samples.html#signed-urls
http://www.edgecast.com/edgecast_difference.htm - look at 'Content Security'
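For example, a time-limited S3 URL can be generated along these lines. This is a minimal sketch assuming the boto3 library; the jets3t link above shows the Java equivalent, and the bucket/key names here are made up:

import boto3

s3 = boto3.client("s3")

# Presigned URL that stops working after one hour.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-comic-strips", "Key": "strips/2010-06-01.png"},
    ExpiresIn=3600,
)

# Embed `url` in the page your App Engine app renders, instead of a raw,
# permanent link to the image.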
In terms of solving your dilemma, I think that there are a couple of alternatives:
- You could cause the images to be rendered in a Flash object that would download the images from your server in some kind of encrypted format that it would know how to decode. This would involve quite a bit of up-front work.
- You could have a valid-one-time link for the image. Each time that you generated the surrounding web page, the link to the image would be generated randomly, and the image-serving code would invalidate that link after allowing it one time (a rough sketch follows below). If you have a high-traffic website, this would be a very resource-intensive scheme.
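A rough sketch of that second idea, assuming App Engine's memcache holds the short-lived tokens (the URL layout and key prefix are hypothetical):

import uuid
from google.appengine.api import memcache

def make_one_time_link(comic_id):
    # Random, unguessable token that maps back to the comic for 10 minutes.
    token = uuid.uuid4().hex
    memcache.add("img-token:" + token, comic_id, time=600)
    return "/img/%s" % token

def resolve_and_invalidate(token):
    # Returns the comic id on first use, None on every use after that.
    key = "img-token:" + token
    comic_id = memcache.get(key)
    if comic_id is not None:
        memcache.delete(key)
    return comic_id

Note that memcache can evict entries early, so a datastore-backed fallback would be needed if broken image links are unacceptable; that is part of what makes the scheme resource-intensive.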
Really, though, you want to consider just how much work it is worth to force people to see ads, especially when a goodly number of them will be coming to your site via Firefox, and there's almost nothing that you can do to circumvent AdBlock.
In terms of choosing between your two methods, there are a couple of things to think about. With option one, where you are storing the images as static files, you will only be able to add new images by doing an appcfg.py update. Since App Engine applications do not allow you to write to the filesystem, you will need to add new images to your development code and do a code deployment. This might be difficult from a site-management perspective. Also, serving the images from memcache would likely not offer you a performance improvement over having them served as static files.
Your second option, putting the images in the datastore, protects your images from linking only to the extent that you have some power to control, through logic, whether they are served or not. The problem you will encounter is that making that decision is difficult. Remember that HTTP is stateless, so finding a way to distinguish a request coming from a link external to your application from one internal to it is going to require trickery.
My personal feeling is that jumping through hoops to make sure that people can't see your comics without seeing ads is solving the problem the wrong way. If the content that you are publishing is worth protecting, people will flock to your website to enjoy it anyway. Through high volumes of traffic, you will more than make up for anyone who links directly to your image, thus circumventing a few ad serves. Don't try to outsmart your consumers. Deliver outstanding content, and you will make plenty of money.
Your method #1 isn't practical: You'd need to upload a new version of your app for each new comic strip.
Your method #2 should work fine. It doesn't automatically "protect" your images from being hotlinked - they're still served up on a URL like any other image - but you can write whatever code you want in the image serving handler to try and prevent abuse.
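For instance, here is a minimal sketch of such a handler for the Python runtime's webapp framework. It assumes the image bytes are stored in a BlobProperty (as in the tutorial linked above), and the Referer check is only an illustration; Referer headers are easily spoofed or missing, so this is a deterrent, not real protection:

from google.appengine.ext import db, webapp

class Comic(db.Model):
    image_data = db.BlobProperty()

class ComicImageHandler(webapp.RequestHandler):
    def get(self, comic_id):
        # Very weak hotlink check: reject requests whose Referer points elsewhere.
        referer = self.request.headers.get("Referer", "")
        if referer and "mycomicsite.example" not in referer:
            self.error(403)
            return
        comic = Comic.get_by_id(int(comic_id))
        if comic is None:
            self.error(404)
            return
        self.response.headers["Content-Type"] = "image/png"
        self.response.out.write(comic.image_data)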
A third option, and a variant of #2, is to use the new Blob API. Instead of storing the image itself in the datastore, you can store the blob key, and your image handler just instructs the blobstore infrastructure what image to serve.
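A minimal sketch of that variant, using the Blobstore download handler (the URL mapping and handler name are illustrative):

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class ServeComicHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, blob_key):
        # send_blob tells the blobstore infrastructure to stream the image
        # itself, so the bytes never pass through your application code.
        if not blobstore.get(blob_key):
            self.error(404)
            return
        self.send_blob(blob_key)

Only the blob key (a short string) lives in your datastore entity; the image bytes stay in the blobstore.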