For years our database driven websites have had URLs that look like:
https://www.example.com/product?id=30
But nowadays, and especially for SEO purposes, we want our URLs to be "nice" and look like:
https://www.example.com/30/myproduct
We run Zope 2.13.x on Debian with Apache 2.4 as the front-end web server. I know that not many people use Zope, but with Apache's mod_rewrite we should be able to proxy the rewrite and have nice URLs that still pass the database arguments needed to serve the pages properly to end users.
There used to be a Zope Cookbook where I wrote a bunch of really detailed tutorials on Zope functionality but that no longer seems to exist and I wanted to share this with the SE community.
The awesome thing is that this is not specific to Zope: it should work for rewriting any parameter-based URL into a nice URL, and it's straightforward once everything is in place.
For complete transparency, I am going to answer my own question so that it's documented for everyone.
Using the rewrite engine in Apache, decide how you want your URLs to look to the end user in their web browser.
For example, if you are calling to a database and have a url that looks like
https://www.example.com/products?id=30&product_name=myproduct
but you want that URL to look like
https://www.example.com/products/30/myproduct
you would use a rewrite rule as follows:
RewriteRule ^/products/(.*)/(.*) /products?id=$1&product_name=$2 [L,P,NE,QSA]
To explain that further:
^/products/(.*)/(.*) says that any time /products is accessed, Apache should look for two variables in the next directory names, i.e. /(.*)/(.*)
If you only wanted one variable you would do ^/products/(.*)
Likewise if you wanted three variables you would do ^/products/(.*)/(.*)/(.*)
From there we need to tell Apache how to interpret that URL in order to rewrite and still allow Zope (or whatever db you may be using) to pass the correct URL parameters. That part is:
/products?id=$1&product_name=$2
Apache will now take the first (.*) and treat that as $1. It will take the second (.*) and treat that as $2 and so on.
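The capture-group behaviour can be sketched with Python's re module (a rough analogue only; Apache uses PCRE, but the greedy (.*) groups backtrack the same way):

```python
import re

# Rough analogue of: RewriteRule ^/products/(.*)/(.*) ...
pattern = re.compile(r"^/products/(.*)/(.*)")

m = pattern.match("/products/30/myproduct")
print(m.group(1), m.group(2))  # 30 myproduct  ($1 and $2)

# With extra slashes the greedy first group swallows the extras,
# so /products/a/b/c yields ('a/b', 'c'), not ('a', 'b/c').
m = pattern.match("/products/a/b/c")
print(m.groups())  # ('a/b', 'c')
```

This greediness is worth knowing when your product names can themselves contain slashes.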
The part in the brackets is extremely important
L = This makes Apache stop processing the rewrite ruleset if the rule matches. This is important because you don't want Apache to get confused and start trying other rewrites.
P = Proxy the request. This ensures the browser's address bar keeps showing https://www.example.com/products/30/myproduct (i.e. the end user never sees the rewritten URL https://www.example.com/products?id=30&product_name=myproduct).
NE = No Escape. This prevents Apache from escaping special characters such as $, =, and & in the rewritten URL; these characters are significant in URL parameters.
QSA = Query String Append. This appends any query string from the incoming request to the rewritten query string instead of replacing it, so extra URL parameters survive the rewrite.
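Putting it together, a minimal sketch of where such a rule might live (a hypothetical virtual host; assumes mod_rewrite and, for the P flag, mod_proxy are enabled):

```apache
<VirtualHost *:443>
    ServerName www.example.com

    RewriteEngine On
    # Nice URL -> parameterised URL, proxied so the browser keeps the nice one
    RewriteRule ^/products/(.*)/(.*) /products?id=$1&product_name=$2 [L,P,NE,QSA]
</VirtualHost>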
Please Note: It is very important to consider how you want your URLs to look (the nice URLs) because that is what you want to submit to the search engines. If you change your URL structure, those nice URLs will no longer work and your search engine rankings may decrease.
Related
I have a folder of Wikipedia articles (XML format).
I want to import the files through the web interface (Special:Import). Currently I do that with iMacros, but it often hangs, needs a lot of resources (memory), and can only process one file at a time, so I am looking for a better solution.
I have figured out that I have to log in to get an edit token, which is needed to upload the file.
I already read this, but got stuck.
To get this running I need two wget/curl command lines:
to log in and get the edit token (push user and password to the form, get the edit token)
to push the file to the form (push the edit token and content to the form)
I can build the loop to process more than one file on my own.
First of all, let's be clear: the web interface is not the right way to do this. MediaWiki installation requirements include shell access to the server, which would allow you to use importDump.php as needed for heavier imports.
Second, if you want to import a Wikipedia article from the web interface then you shouldn't be downloading the XML directly: Special:Import can do that for you. Set
$wgImportSources = array( 'wikipedia' );
or whatever (see manual), visit Special:Import, select Wikipedia from the dropdown, enter the title to import, confirm.
Third, if you want to use the command line, why not use the MediaWiki web API, for which plenty of clients are available? Most clients handle tokens for you.
Finally, if you really insist on using wget/curl over index.php, you can get a token visiting api.php?action=query&meta=tokens in your browser (check on api.php for the exact instructions for your MediaWiki version) and then do something like
curl -d "&action=submit&...#filename.xml" .../index.php?title=Special:Import
(intentionally partial code so that you don't run it without knowing what you're doing).
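To illustrate the API route, here is a minimal Python sketch. The wiki URL is a hypothetical placeholder, and it assumes MediaWiki 1.24+ (where action=query&meta=tokens returns a CSRF token) plus an already logged-in session with import rights:

```python
import urllib.parse

API = "https://wiki.example.org/w/api.php"  # hypothetical endpoint

def token_url(api=API):
    """Build the URL that returns a CSRF token (MediaWiki 1.24+)."""
    params = {"action": "query", "meta": "tokens",
              "type": "csrf", "format": "json"}
    return api + "?" + urllib.parse.urlencode(params)

# Against a real wiki (logged-in session cookie assumed) you would then
# fetch token_url(), read ["query"]["tokens"]["csrftoken"] from the JSON,
# and POST the dump to api.php?action=import -- note the XML has to be
# sent as a multipart file upload together with the token.
```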
I currently have a website I'm working on that I have taken over from another individual. I dumped his SQL file into my database and everything seems to be OK apart from one thing: whenever I try to log in to the back end, or go elsewhere, an additional .co.uk is appended in the address bar, like so:
From: www.domain.co.uk to www.domain.co.uk.co.uk
I've had a dig in the database but I really can't find anything, and I've never faced this issue before. Could anyone shed some light on this for me, or let me know where in the database I could look to identify the problem? Many thanks.
Take a look at the .htaccess file in the root folder, which is hidden and may contain rewrite rules.
Also, I recommend you use this plugin for migrations:
http://wordpress.org/extend/plugins/wp-migrate-db/
I use it whenever I move from localhost to a live site and vice versa. It will also ensure your widgets are preserved, since a plain find-and-replace breaks the serialised data WordPress stores (the embedded string lengths no longer match).
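The serialisation problem is easy to demonstrate: PHP's serialize() stores byte lengths alongside strings (s:16:"www.domain.co.uk"), so a plain find-and-replace leaves stale lengths behind and WordPress discards the value. A length-aware replace, sketched in Python (simplified: assumes no quotes inside values and no nested binary data):

```python
import re

def php_safe_replace(serialized, old, new):
    """Replace old with new inside PHP-serialized text, fixing the
    s:<len>:"value" length prefixes as we go (simplified sketch)."""
    def fix(m):
        value = m.group(1).replace(old, new)
        return 's:%d:"%s"' % (len(value.encode()), value)
    return re.sub(r's:\d+:"(.*?)"', fix, serialized)

data = 's:16:"www.domain.co.uk";'
print(php_safe_replace(data, "domain.co.uk", "example.com"))
# s:15:"www.example.com";
```

This is essentially what the migration plugin does for you across the whole database.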
After migrating, you need to visit Settings > Permalinks so the .htaccess file can be updated according to the new URL for rewrites.
I set up an older Rails 2 project on a brand-new Apache on Debian Squeeze. The project is essentially a single-page site that uses links to scroll the page up and down. My links look like this:
http://mydomain.com/en/#home
These links work fine as long as JavaScript intercepts the click event and simply scrolls to the intended section. But when the user leaves the single page and opens one where these (still identical) links cannot be followed via JavaScript, I only receive:
Forbidden
You don't have permission to access /en/ on this server.
If I change the link to:
http://mydomain.com/en#home
everything works fine and as expected. But I do not want to change my link structure; it already worked on an older Debian 5 box.
I expect this to be an Apache 2 configuration issue, but cannot find anything useful on the net.
Looking forward to any kind of enlightenment.
Thx
Felix
I don't know how or where you are working with JavaScript in relation to this problem, but let me tell you this:
everything after the hash (#) is never passed to the server. That is standard URL behaviour; the fragment is simply not sent with the request.
It was originally intended only for navigating to an anchor within the page, and today it is used for a lot of newer techniques including, but not limited to, JavaScript hooks and DOM-based XSS.
It is possible that an onclick handler intercepts your links and does something else instead, but it is not possible that you end up on http://mydomain.com/en/#home if http://mydomain.com/en/ does not work, because the server only ever sees /en/.
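That the fragment stays client-side is easy to see by splitting the URL (a Python sketch; the HTTP request line a browser sends contains only the path and query):

```python
from urllib.parse import urlsplit

parts = urlsplit("http://mydomain.com/en/#home")
print(parts.path)      # /en/  -- this is all Apache ever sees
print(parts.fragment)  # home  -- the browser keeps this to itself
```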
However, to solve your problem you probably have to adjust your Apache rewrite rule (or enable mod_rewrite in the first place?) so that it also captures links with trailing slashes.
http://mydomain.com/en/ and http://mydomain.com/en are different URLs and could serve completely different pages.
I would strongly recommend not letting this become a mess: set up a strict permanent redirect from one to the other. Which one you choose as the primary is up to you.
I prefer a trailing slash and could supply arguments for that, but they can easily be countered by arguments for the opposite; you will find plenty of discussion if you search for "trailing slash" here.
To solve your problem, find the relevant RewriteRule, copy it, and add the copy with a trailing slash. Check that it works, then make it a redirect to the URL without the trailing slash.
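A sketch of what such a redirect could look like (hypothetical; the exact form depends on whether it lives in the server config or an .htaccess file, and on your existing rules):

```apache
RewriteEngine On
# Permanently redirect the trailing-slash form to the canonical one.
# The browser re-applies the #home fragment itself, since fragments
# never reach the server.
RewriteRule ^/en/$ /en [R=301,L]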
You may also edit your question and post your server config to get help with that.
How would a web analytics package such as piwik/google analytics/omniture etc determine what are unique pages from a set of urls?
E.g. a) a site could have the following pages for a product catalogue
http://acme.com/products/foo
http://acme.com/products/bar
or b) use query string
http://acme.com/catalogue.xxx?product=foo
http://acme.com/catalogue.xxx?product=bar
In either case you can have extra query string vars for things like affiliate links or other uses, so how could you determine that it's the same page?
e.g. both of these are for the foo product pages listed above.
http://acme.com/products/foo?aff=somebody
http://acme.com/catalogue.xxx?product=foo&aff=somebody
If you ignore all the query string then all products in catalogue.xxx are collated into one page view.
If you don't ignore the query string then any extra query string params look like different pages.
If you're dealing with 3rd party sites then you can't assume that they are using either method, or rely on something like canonical links being correct.
How could you tackle this?
Different tracking tools handle it differently, but you can explicitly set the reported URL in all of them.
For instance, Omniture doesn't care about the query string: it chops it off. Even if you don't specify a pageName and it defaults to the URL in the pages report, the query string is still chopped off.
GA will record the full url including query string every time.
Yahoo Web Analytics only records the query string on first page of the visit and every page afterwards it removes it.
But as mentioned, all of the tools have a way to explicitly specify the URL to be reported, and it is easy to write a bit of JavaScript to remove the query string from the URL and pass the result as the URL to report.
You mentioned giving your tracking code to 3rd parties. Since you are already giving them tracking code, it's easy enough to throw that extra bit of javascript into the tracking code you are already giving them.
For example, with GA (async version), instead of
_gaq.push(['_trackPageview']);
you would do something like
var page = location.href.split('?');
_gaq.push(['_trackPageview',page[0]]);
Edit: or, for GA, you can actually specify query parameters to exclude within the reporting tool. Different tools may or may not offer this, so the code example can be applied to any of the tools (substituting their specific URL variable, obviously).
If you're dealing with third-party sites, you can't assume that their URLs follow any specific format either. You can try downloading the pages and comparing them locally, but even that is unreliable because of issues like rotating advertisements, timestamps, etc.
If you are dealing with a single site (or a small group of them), you can make a pattern to match each URL to a canonical (for you) form. However, this will get unmanageable quickly.
Of course, this is the reason that search engines like Google recommend the use of rel='canonical' links in the page header; if Google has issues telling the pages apart, it's not a trivial problem.
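For the single-site case, the per-URL normalisation pattern could be sketched like this (Python; the parameter names here are illustrative assumptions, not any tool's standard):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative: params that identify content, vs. tracking noise to drop
KEEP = {"product", "id"}

def canonical(url):
    """Collapse URL variants onto one reporting URL: drop everything
    from the query string except known content-identifying parameters,
    and sort what remains for stability."""
    s = urlsplit(url)
    q = sorted((k, v) for k, v in parse_qsl(s.query) if k in KEEP)
    return urlunsplit((s.scheme, s.netloc, s.path, urlencode(q), ""))

print(canonical("http://acme.com/catalogue.xxx?product=foo&aff=somebody"))
# http://acme.com/catalogue.xxx?product=foo
```

As the answer notes, maintaining such KEEP lists per site quickly becomes unmanageable, which is exactly why rel="canonical" exists.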
We generally use two different hosting services. On one, everything works ticketyboo, as it does on my local dev servers. On the other server, however, I am having this problem:
I can't access the users controller like this:
http://www.example.com/users/login
But I can like this:
http://www.example.com/Users/login
** note the capitalised 'Users' **
If I move the application to a sub-folder everything works fine (both upper- and lowercase).
The hosting company have looked at it and can't see a problem at their end and they assure me that users is not a reserved word.
You might say this isn't a problem, just use the version that works. Unfortunately it leads to problems downstream where Cake core starts generating urls itself.
Anybody else seen this problem or know the solution?
[This only occurs on the users controller - all others work as expected]
Without seeing all your code / diving in too deeply, I'm not sure what the cause of this problem is. Do you have some special stuff going on in the routes.php file? If you have a specific route defined for users, that could be it.
However, you could make a quick fix -- in UsersController (or AppController if you want to ensure this behavior doesn't pop up elsewhere) just add a line to the beforeFilter() method to capitalize / decapitalize (whichever is more appropriate) the controller parameter.
[edit] - sorry, didn't finish that first paragraph. It still could be the routes file, even though it works on one server and not the other, because it's possible that the working server uses a case-insensitive apache module that normalizes all urls. This is why it's so nice to have your staging and dev setups being EXACTLY the same as production.
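If the difference does come down to a case-normalising module on the working server, one real candidate is Apache's mod_speling (a sketch; whether your host enables it is an assumption to verify):

```apache
<IfModule mod_speling.c>
    CheckSpelling On
    CheckCaseOnly On   # correct only case mismatches, not misspellings
</IfModule>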
While the hosting support denied that the word 'user' or 'users' or 'Users' was in any way reserved, it seems that it was:
"We have removed the users/ redirect"
Problem solved.