Search engine bots crawl the web and download each page they visit for analysis, right?
How exactly do they download a page? How do they store the pages?
I am asking because I want to run an analysis on a few webpages. I could scrape each page by requesting its address, but wouldn't it make more sense to download the pages to my computer and work on them from there?
wget --mirror
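For a one-off copy, a fuller invocation along the lines of wget --mirror --convert-links --page-requisites --wait=1 https://example.com/ (the URL is a placeholder for your target site) also rewrites links and fetches the images/CSS each page needs, so the local copy is browsable offline.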
Try HTTrack
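HTTrack has both a GUI and a command line; something like httrack "https://example.com/" -O ./mirror (URL and output directory are placeholders) copies the site into a local folder you can browse and analyse offline.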
About the way they do it:
The indexing starts from a designated starting point (an entrance, if you prefer). From there, the spider recursively follows all hyperlinks up to a given depth.
Search engine spiders work like this as well, but many of them crawl simultaneously and other factors come into play. For example, a newly created post here on SO will be picked up by Google very quickly, but an update to a low-traffic website may not be picked up for days.
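A minimal sketch of that idea in Python, using only the standard library; the starting URL, depth limit, and output directory below are placeholders:

    # A toy recursive crawler: start from a seed URL, follow links up to a
    # fixed depth, and save each page to disk for offline analysis.
    import os
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    class LinkExtractor(HTMLParser):
        """Collects the href attribute of every <a> tag on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(url, depth, out_dir="pages", seen=None):
        if seen is None:
            seen = set()
        if depth < 0 or url in seen:
            return
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception as exc:
            print("skipping", url, exc)
            return
        # Save the raw HTML so it can be analysed later without re-fetching.
        os.makedirs(out_dir, exist_ok=True)
        parts = urlparse(url)
        fname = (parts.netloc + parts.path.replace("/", "_")) or "index"
        with open(os.path.join(out_dir, fname + ".html"), "w", encoding="utf-8") as f:
            f.write(html)
        # Extract hyperlinks and recurse one level deeper.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            crawl(urljoin(url, link), depth - 1, out_dir, seen)

    crawl("https://example.com/", depth=1)

A real spider would also respect robots.txt, rate-limit its requests, and normalize URLs before deduplicating; the sketch skips all of that.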
You can use the debugging tools built into Firefox (or Firebug) and Chrome to examine how the page works. As for downloading pages directly, I am not sure. You could try viewing the page source in your browser and then copying and pasting the code.
I just launched a website, http://www.dicorico.com, running on AngularJS with Django for the back-end. The Google PageSpeed Insights scores are not great, and my Google Analytics shows an average page load time in Chrome of 10 seconds since the launch on 22 October. I'd like to identify the issue but have no clue where to start looking. Your help would be much appreciated.
Note that the app uses the approach from http://www.michaelbromley.co.uk/blog/171/enable-rich-social-sharing-in-your-angularjs-app to render HTML so that the content is crawlable by Google.
Thanks,
Laurent
First, you need to rule out the possibility that your own code is what makes performance suffer. To debug the performance of Django projects, use django-debug-toolbar in your development environment.
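A rough sketch of the setup, following the standard django-debug-toolbar instructions (check the docs for the version you install); its SQL and timing panels are usually what reveal slow queries:

    # settings.py (development only): enable django-debug-toolbar
    INSTALLED_APPS += ["debug_toolbar"]
    MIDDLEWARE += ["debug_toolbar.middleware.DebugToolbarMiddleware"]
    INTERNAL_IPS = ["127.0.0.1"]  # the toolbar only renders for these client IPs

    # urls.py: append the toolbar's URLs to your existing urlpatterns
    from django.conf import settings
    from django.urls import include, path

    if settings.DEBUG:
        urlpatterns += [path("__debug__/", include("debug_toolbar.urls"))]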
There are many other factors that could also slow down your website: the instance you use might not be powerful enough to handle the traffic, a backend process run from crontab might be eating up resources, your database might not be optimized, or the web server might simply be misconfigured, etc.
You might need to log into the box and check the memory/CPU/disk usage to determine where the bottleneck is, then try to improve that. There's no single answer for this; hope it helps.
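On a typical Linux box, standard tools such as top (or htop), free -h, df -h, and iostat give a quick view of CPU, memory, and disk pressure.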
I need a sitemap that helps both people and Google discover the pages.
I've tried the WebSphinx application.
I realized that if I put wikipedia.org as the starting URL, it does not crawl any further.
So how do I actually crawl the whole of Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and supply multiple starting URLs?
Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?
Crawling Wikipedia is a bad idea: it is hundreds of TBs of data uncompressed. I would suggest working offline instead, using the various dumps provided by Wikipedia. You can find them at https://dumps.wikimedia.org/
You can create a sitemap for Wikipedia using the page meta information, external links, interwiki links, and redirects databases, to name a few.
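As a rough illustration of the dump route, here is a sketch that streams page titles out of a pages-articles dump using only the Python standard library. The filename is a placeholder for whichever dump you download, and the exact export namespace in the tag names varies between dump versions:

    # Stream <title> elements out of a compressed MediaWiki XML dump without
    # loading the whole file into memory. The filename is a placeholder.
    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-latest-pages-articles.xml.bz2"

    def iter_titles(path):
        with bz2.open(path, "rb") as f:
            for event, elem in ET.iterparse(f, events=("end",)):
                # Tags carry the MediaWiki export namespace, e.g.
                # "{http://www.mediawiki.org/xml/export-0.10/}title".
                if elem.tag.endswith("}title") or elem.tag == "title":
                    yield elem.text
                elem.clear()  # release parsed content as we go

    for i, title in enumerate(iter_titles(DUMP)):
        print(title)
        if i >= 20:  # just peek at the first few titles
            break

From the titles you could then emit sitemap entries of the form https://en.wikipedia.org/wiki/<title>.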
I have a Drupal website that has a ton of data on it. However, people can quite easily scrape the site, because Drupal's classes and IDs are pretty consistent.
Is there any way to "scramble" the code to make it harder to use something like PHP Simple HTML Dom Parser to scrape the site?
Are there other techniques that could make scraping the site a little harder?
Am I fighting a lost cause?
I am not sure if "scraping" is the official term, but I am referring to the process by which people write a script that "crawls" a website and parses sections of it in order to extract data and store it in their own database.
First, I'd recommend googling for web-scraping and anti-scraping topics; you'll find some tools there for fighting web scraping.
As for Drupal, there should be some anti-scraping plugins available (again, search for them).
You might also be interested in my answer laying out categorized anti-scraping techniques. It's aimed at technical as well as non-technical users.
I am not sure, but I think it is quite easy to crawl a website where all content is public, no matter whether the IDs are sequential or not. You should take into account that if a human can read your Drupal site, a script can too.
Depending on your site's nature, if you don't want your content to be indexed by others, you should consider restricting it to registered users. Otherwise, I think you are fighting a lost cause.
I have received some job offers to develop simple static pages (only with a contact form), and I have been tempted to suggest App Engine for the hosting, but is this appropriate? I don't want to see App Engine become the new GeoCities.
I think so. It's free after all, so worth a shot. You can even use something like DryDrop (http://drydrop.binaryage.com/) to make it super easy to manage.
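For a purely static site the App Engine configuration stays tiny. A rough sketch of an app.yaml along those lines (the runtime line and the static/ paths are assumptions to adapt to your project):

    # app.yaml: serve everything under ./static as static files
    runtime: python27        # nothing dynamic is actually executed
    api_version: 1
    threadsafe: true

    handlers:
    - url: /
      static_files: static/index.html
      upload: static/index.html
    - url: /(.*)
      static_files: static/\1
      upload: static/(.*)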
Google Sites could also be a possibility for hosting static pages. Uploading HTML files directly is not supported, but you could copy and paste the source of the pages you have created, as described here.
One limitation that you should take into consideration before suggesting this solution is that App Engine will not work with naked domain names. In other words, if your client wants to host static webpages at myawesomedomain.com, you would have to make sure that users make requests to www.myawesomedomain.com.
Well, AppEngine gives you access to Google's CDN, which is useful. But you might look at SimpleCDN, S3 (and/or CloudFront), AOL's CDN, and such before making a recommendation.
Hosting static pages is fine, as it gives you scope to grow, and since low-traffic pages won't cost anything to host, it's a win-win. Also, putting the site on your own domain can be useful for branding.
If you are only serving static pages, it will be easy to move the website somewhere else if App Engine ever does disappear like GeoCities.
I'm building a little play project and I'd like to use satellite images of a town inside Deep Zoom. What's the easiest way to get them? I'm sure there's a MUCH better way than Print Screen. I've tried Google Maps Downloader, but it doesn't download satellite images, and its maker doesn't seem to be offering it anymore.
Take a look at Deep Earth, unless what you're trying to build is exactly what Deep Earth gives you - in which case it may remove all the fun ;)
http://www.codeplex.com/deepearth
If you want to go your own way, it used to be that you could manually request the various image tiles directly from the MS Virtual Earth server hosting them, if you could calculate the quad keys and build the correct URL, thus bypassing their payment model. While I know they were looking to close this loophole, that's certainly what early versions of Deep Earth did.
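For what it's worth, the quad key calculation itself is straightforward; here is a sketch of the standard Virtual Earth/Bing tile-system conversion (the tile coordinates in the example are arbitrary, and the actual tile URL format is deliberately not shown, since that is the part that was being locked down):

    # Convert tile (x, y) coordinates at a given zoom level into the quad key
    # used by the Virtual Earth / Bing Maps tile system.
    def tile_to_quadkey(tile_x, tile_y, level):
        digits = []
        for i in range(level, 0, -1):
            digit = 0
            mask = 1 << (i - 1)
            if tile_x & mask:
                digit += 1
            if tile_y & mask:
                digit += 2
            digits.append(str(digit))
        return "".join(digits)

    print(tile_to_quadkey(3, 5, 3))  # tile (3, 5) at level 3 -> "213"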
Microsoft Virtual Earth has SOAP and AJAX-based services that you can use in your application. The service has a Staging and Production version. Using the Staging version is free, and could easily serve the needs of a "play project." The Production version costs money and can serve info to a large application with many users.
http://dev.live.com/VirtualEarth/
However, some registration is required to get working with the Staging SDK. You can get started here: http://msdn.microsoft.com/en-us/library/cc980844.aspx