I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. Instead it is injecting old URLs that I gave earlier and have since commented out. It looks like Nutch is injecting URLs in some strange order. What is the reason? Also, could anybody point me to a book or detailed tutorial on Nutch? Most of the tutorials available cover only installation.
As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.
You can nuke your previous runs completely like this user did and start fresh, or you can remove the unwanted URLs from the crawldb in a few different ways:
CLI via bin/nutch mergedb (the command-line front end to CrawlDbMerger)
CLI via bin/nutch updatedb (which also accepts a -filter flag)
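For example, here is a minimal sketch of the mergedb route (the crawl/crawldb paths are placeholders for your own crawl directory). Running it with -filter rewrites the crawldb through your URL filters, so if you add exclusion rules for the old URLs to regex-urlfilter.txt, those entries get dropped from the output:

bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

Then point your next crawl at the filtered copy (or swap it in place of the original crawldb).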
I am using Nutch for web crawling with the crawl script, using the command
bin/crawl -s urls crawl 3
In my portal, each page should have 2 versions available. If I have a page at example.com/docs, there should be 2 versions, example.com/docs?v=1 and example.com/docs?v=2, and this applies to all pages; the content of the two versions is different. When I run the Nutch crawl script, it fetches only one version of the documents.
How can I resolve that?
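One thing worth checking (this is an assumption about your setup, since filter defaults vary by version): the stock conf/regex-urlfilter.txt ships with a rule that skips URLs containing query-string characters, which would drop example.com/docs?v=1 and example.com/docs?v=2 before they are ever fetched:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Removing ? and = from that character class (or deleting the rule) should let both versions through, provided no urlnormalizer plugin strips the query string as well.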
How should I put Angular URLs in the sitemap.xml?
Now I've added this:
https://www.domain.com/user-area#!/logon
but I think Google doesn't like it.
I've read about the _escaped_fragment_ but I don't understand what that means.
The _escaped_fragment_ was recently deprecated by Google. See their blog post here:
http://googlewebmastercentral.blogspot.com/2015/10/deprecating-our-ajax-crawling-scheme.html
Anyway, from what you wrote, it seems like your server structure isn't prepared for the _escaped_fragment_. I won't go into too much detail about it here, since it was deprecated after all.
Anyway, Google's bots weren't always able to process websites with AJAX content (content rendered via JavaScript). To create a workaround, Google proposed adding the hashbang #! to all AJAX sites. Bots would be able to detect the hashbang and know that the website had content rendered through AJAX. The bots would then request a pre-rendered version of the AJAX pages by replacing the hashbang with the _escaped_fragment_. However, this required the server hosting the AJAX pages to know about the _escaped_fragment_ and be able to serve up a pre-rendered page, which was a difficult process to set up and execute.
Now, according to the deprecation blog post, the URL you entered in your sitemap.xml should be fine, since Google's bots should be able to "crawl, render, and index the #! URLs". If you really want to know whether Google can understand your website, I'd recommend using their webmaster console, located at https://www.google.com/intl/en/webmasters/. Using that tool, you can register your site with Google, observe how Google indexes your site, and be notified if any problems arise with it.
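For reference, a sketch of what that sitemap entry might look like (www.domain.com is your placeholder from above); the #! fragment stays in the <loc> as-is:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.domain.com/user-area#!/logon</loc>
  </url>
</urlset>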
In general though, getting Google to index an AJAX site is a pain. I'd strongly recommend using the webmaster console and referring to it frequently. It does help.
I am crawling some websites using Apache Nutch, and I need to boost one website over all the others. Suppose that among 100 URLs in the seed list there is a wiki URL; I want to give all data from the wiki a boost so that it is displayed at the top. I am using Solr 4.10.3.
I recrawl these websites after a few days, so I think an index-time boost via Solr will not work; it will have to be Nutch that does it. Any idea?
OK, so I'm trying to set up Nutch to crawl a site and index the pages into Solr. I'm currently using Nutch 1.9 with Solr 4.10.2.
I've followed these instructions: http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search
The crawling appears to go just fine, but when I check the collection in Solr (using the web UI) there are no documents indexed... any idea where I could check for problems?
Found my problem; I'll leave it as an answer in case anyone else has the same symptoms:
My problem was the proxy configuration. My Linux box has a proxy configured system-wide, but I also had to configure Nutch to use the same proxy. Once I changed that, it started to work.
The configuration is under conf/nutch-default.xml (better yet, override it in conf/nutch-site.xml).
Edit with more info
To be more specific, here is the Proxy configuration I had to change:
<property>
  <name>http.proxy.host</name>
  <value>xxx.xxx.xxx</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
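If your proxy listens on a non-standard port, you will presumably also need to set http.proxy.port in the same file (the value below is just a placeholder):

<property>
  <name>http.proxy.port</name>
  <value>8080</value>
  <description>The proxy port.</description>
</property>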
I am trying to integrate Apache Nutch and Solr, and when Nutch tries to push its output to Solr, it throws
HTTP method POST is not supported by this URL
I checked the configuration but couldn't find the right setting to make the Solr URL accept POST. I'd rather not use Tomcat to run Solr, so how can I make the Solr URL accept POST requests?