Get all the urls under domain (YQL?) - screen-scraping

I want to get all the URLs under a domain.
When I looked at their robots.txt, it clearly states that some folders are off-limits to robots, but I am wondering whether there is a way to get all of the URLs that are open to robots. There is no sitemap referenced in the robots.txt.
For example, their robots.txt has information that looks similar to this:
User-agent: *
Allow: /
Disallow: /A/
Disallow: /B/
Disallow: /C/
...
But I am interested in all the URLs that are available to robots and not included in this blacklist, such as:
/contact
/welcome
/product1
/product2
...
Any idea will be appreciated, and I am also curious whether there is a Yahoo Query Language (YQL) solution for this problem, because this work has probably already been done by Yahoo.
Thanks!

Yes, there is a way to get all the URLs open to robots.
A simple solution would be to go to www.google.com and type site:www.website.com into the search bar.
While that isn't guaranteed to get you every page, it will get you all the pages Google has indexed. And Google adheres to robots.txt, so it seems to fit your purpose.
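If you then want to double-check which of those URLs are actually open to robots, the robots.txt rules themselves can be evaluated with a small script. Here is a minimal sketch using Python's standard urllib.robotparser; www.website.com and the candidate paths are placeholders for whatever the site: search turns up:
from urllib import robotparser

# Hypothetical inputs: the domain and some candidate paths (e.g. from the site: search)
base = "http://www.website.com"
candidates = ["/contact", "/welcome", "/A/secret", "/product1"]

rp = robotparser.RobotFileParser()
rp.set_url(base + "/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# Keep only the paths that a generic crawler ("*") is allowed to fetch
allowed = [path for path in candidates if rp.can_fetch("*", base + path)]
print(allowed)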

Related

Should I input the after-login pages in robots.txt

On our sites, some pages can only be accessed after login...
Is it a good idea to disallow these after-login pages in robots.txt?
I really searched for the answer on Google, but nothing helped...
In general, I would heed the advice from this article:
To summarize, always add the login page to the robots exclusion protocol file, otherwise you will end up:
1 - sacrificing valuable "search engine crawling time" on your site.
2 - spending unnecessary bandwidth and server resources.
3 - potentially even blocking crawlers from your content.
https://blogs.msdn.microsoft.com/carlosag/2009/07/06/seo-tip-beware-of-the-login-pages-add-them-to-robots-exclusion/
Similarly:
https://webmasters.stackexchange.com/questions/86395/using-robots-txt-to-block-sessionid-urls
Ideally, you'd be able to easily exclude all of those pages via some sort of pattern. For example, if all of the URLs for these pages started with /my-account/, then you should be able to do this:
Disallow: /my-account/*
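In a complete robots.txt that rule sits under a User-agent line, for example (the /my-account/ prefix is just the hypothetical example from above, and the trailing * is optional because robots.txt rules are prefix matches):
User-agent: *
Disallow: /my-account/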

Why does Googlebot crawl for /mobile/* and /m/* pages that are not referenced anywhere?

Since the end of May, I have a lot of new 404 errors on the Smartphone Crawl Errors page in Webmaster Tools / Google Search Console. All of them start with /m/ or /mobile/, and none of them exist or are linked to anywhere on the site.
For example, I have 404 errors for the http://www.example.com/mobile/foo-bar/ and http://www.example.com/m/foo-bar pages. According to the Search Console, those pages are linked from the existing page http://www.example.com/foo-bar/, but they are not.
Is Googlebot deciding on its own to look for a mobile version of every page? Can I disable this behavior? Is this because my site is not mobile-friendly yet (a problem for which I received another warning message from Google)?
As @Jonny 5 mentioned in a comment, this seems to be happening because Google guesses that you may have a mobile version of your site in the /m and/or /mobile directories. From what I have read, they will only try those directories if they decide that the pages they initially indexed are not mobile-friendly/responsive. More info on this behavior can be found in these Google Product Forum threads:
https://productforums.google.com/forum/#!topic/webmasters/k3TFeCkFE0Q
https://productforums.google.com/forum/#!topic/webmasters/56CNFxZBFwE
Another helpful comment came from @user29671, who pointed out that your website does in fact have some URLs with /m and /mobile indexed. I found that the same was true for my website, so this behavior may also be limited to sites for which Google has (for whatever reason) indexed a /m and/or /mobile URL. To test whether this is true for your site, go to the following URLs and replace example.com with your website's domain:
https://www.google.com/search?q=site:example.com/m/&filter=0
https://www.google.com/search?q=site:example.com/mobile/&filter=0
As far as preventing this goes, your best bet is either creating a mobile-friendly version of your site or redirecting /m and /mobile pages back to the originals.
You could block those directories in your robots.txt, but that's a bit of a workaround. The better option would be to figure out where exactly Googlebot is picking up those URLs from.
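For reference, that robots.txt workaround would just be a couple of Disallow lines, assuming /m/ and /mobile/ aren't used for anything real on your site:
User-agent: *
Disallow: /m/
Disallow: /mobile/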
If you shared an example page URL where Google says you have links to the /mobile pages, I could look at it and figure out where that's being picked up.
And no, Google doesn't just invent directories to crawl on the off-chance that you might have snuck in a mobile page randomly :)
I have been experiencing the same issue since December 2016. Googlebot constantly tries to crawl my website's pages with the /m/ and /mobile/ prefixes.
All those URLs cause 404 errors and get listed in Google Webmaster Tools as errors.
I received an automated email from GWT on January 2nd, 2017 stating:
Googlebot for smartphones identified a significant increase in the number of URLs on http://example.com that return a 404 (not found) error. If these pages exist on your desktop site, showing an error for mobile users can be a bad user experience. This misconfiguration can also prevent Google from showing the correct page in mobile search results. If these URLs don't exist, no action is necessary.
This is done by a mobile crawler:
IP: 66.249.65.124
Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1)
Browser: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1)
So you are not alone.
Take it easy. It's a Google bug :)
As for redirecting /m and /mobile pages back to the originals, here's a snippet for nginx:
location /m/ {
    # strip the first path segment: /m/foo-bar/ is 301-redirected to /foo-bar/
    rewrite ^/[^/]+(/.*)$ $1 permanent;
}
location /mobile/ {
    # same for /mobile/foo-bar/ -> /foo-bar/
    rewrite ^/[^/]+(/.*)$ $1 permanent;
}
One can also redirect everything to the root:
location /m/ {
    # send any /m/... URL to the site root with a 301
    return 301 $scheme://$host/;
}
location /mobile/ {
    return 301 $scheme://$host/;
}

Kik search engine not picking up mobile app even though robots.txt is empty

Kik's search engine is not picking up our HTML5 mobile app even though our robots.txt file is empty.
Our mobile app is at http://panabee-games.herokuapp.com/spoof/spoof.
Our robots.txt file is at http://panabee-games.herokuapp.com/robots.txt.
Did this happen to others?
Please note: we were asked by Kik customer support to post this question here.
Kik's indexing service determines listing eligibility based on three things:
kik.js is included
all necessary tags are included (description meta, icon link)
robots.txt does not block indexing
This is the rule I would recommend using:
User-agent: KikBot
Disallow:
# Other rules if needed
# User-agent:*
# Disallow: /
I would also recommend putting your kik.js script at the bottom of the body. Other than that your code looks good.
Also, keep in mind that once your app is indexed it is added to the whitelist queue that is checked by a representative at Kik. They very quickly verify that the app does not contain any inappropriate content and then add it to the list of searchable apps.

CakePHP Application: some pages with SSL, some without

I have an application written with the CakePHP framework, and it is currently located in httpdocs. I want a few pages to be redirected to https://.
Detecting whether the user is already on https://... or not shouldn't be a problem. My issue is a different one: as I understand it, I would need to make a copy of the whole project and store it in httpsdocs, right? This sounds silly, but how should it work without duplicating the code? I think I'm missing something, but I don't get it...
I have never had to copy the code for SSL. You should specify the path in the vhost for the site.
On Apache there is a vhost for each: SSL and non-SSL. Both can have the same webroot path.
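As a minimal sketch, assuming a typical CakePHP layout under /var/www/cake and placeholder certificate paths, the two Apache vhosts could look like this:
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/cake/app/webroot
</VirtualHost>

<VirtualHost *:443>
    ServerName www.example.com
    # same webroot as the non-SSL vhost, so nothing is duplicated
    DocumentRoot /var/www/cake/app/webroot
    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/example.crt
    SSLCertificateKeyFile /etc/ssl/private/example.key
</VirtualHost>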
If your web host requires you to put the https part of your website in httpsdocs, then you will need to put something there. But not the whole project: maybe only the webroot (the part that is actually served up by the web host).
Something like:
/cake/app/ --> your app code
/httpdocs/.. --> index.php and possibly CSS, images, etc.
/httpsdocs/.. --> copy of index.php and the rest as well
Of course, you could also use some internal redirect in .htaccess
One suggestion: now that Google indexes https URLs, you could also choose to make the whole site available through https.

Should a www. to m. redirect for mobile devices accessing a PC site use a 301 or a 302?

I have read in a few places that people recommend a 301 redirect to send mobile devices from a PC site to the mobile-optimised site.
Making Websites Mobile Friendly (Google Webmaster blog) - http://googlewebmastercentral.blogspot.com/2011/02/making-websites-mobile-friendly.html
Untangling Your Mobile Metrics With Better Redirects - http://searchengineland.com/untangling-your-mobile-metrics-with-better-redirects-113015?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=feed-main
The arguments seem to be: 1) reduce duplication in Google search results, and 2) preserve the referrer.
However, the side effect of doing a permanent redirect is that you couldn't offer the user the option to go back to the PC version if they wanted to (since the 301 permanent redirect would have been cached by the client). See http://mobiforge.com/designing/story/a-very-modern-mobile-switching-algorithm-part-ii for why giving the user a choice is important.
What is the recommendation for optimising for search (e.g. following Google's SEO guidelines) while at the same time giving the user a choice as to whether to visit the mobile or PC site?
If you craft your rewrite rules carefully, you could probably make the redirection happen only for URLs that don't have the "force" parameter. After all, you probably want fine-grained control over that redirection anyway, rather than blanket rules, so that you can control one-to-one page mapping.
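As a rough sketch of that idea with Apache mod_rewrite (the force=desktop parameter and the m.example.com host are hypothetical, and a 302 is used so the visitor's choice isn't cached):
RewriteEngine On
# skip the redirect when the visitor explicitly asked for the desktop version
RewriteCond %{QUERY_STRING} !(^|&)force=desktop(&|$)
# crude mobile detection based on the User-Agent header
RewriteCond %{HTTP_USER_AGENT} (iphone|android|mobile) [NC]
RewriteRule ^(.*)$ http://m.example.com/$1 [R=302,L]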
Google now says it doesn't matter, as long as you set up your alternate/canonical annotations properly:
HTTP redirection is a commonly used way to redirect clients to device-specific URLs. Usually, the redirection is done based on the user-agent in the HTTP request headers. It is important to keep the redirection consistent with the alternate URL specified in the page's link rel="alternate" tag or in the Sitemap.
For this purpose, it does not matter if the server redirects with an HTTP 301 or a 302 status code.
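For reference, the annotations that quote refers to look roughly like this (URLs are placeholders): the desktop page declares its mobile alternate, and the mobile page points back with a canonical:
<!-- on http://www.example.com/page-1 (desktop version) -->
<link rel="alternate" media="only screen and (max-width: 640px)" href="http://m.example.com/page-1">

<!-- on http://m.example.com/page-1 (mobile version) -->
<link rel="canonical" href="http://www.example.com/page-1">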
