I am using Nutch for web crawling, running the crawl script with the command
bin/crawl -s urls crawl 3
In my portal, every page is available in two versions. If there is a page at example.com/docs, it also exists as example.com/docs?v=1 and example.com/docs?v=2, and this holds for all pages; the content of the two versions is different. When I run the Nutch crawl script, it fetches only one version of each document.
How can I resolve that?
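One likely culprit (an assumption, since the URL filter configuration isn't shown above): the default conf/regex-urlfilter.txt rejects any URL containing a query string, so example.com/docs?v=1 and example.com/docs?v=2 would be dropped before fetching and only the plain page would be crawled. Relaxing that rule is a rough sketch of a fix:
# default rule in conf/regex-urlfilter.txt: skip URLs containing characters
# that usually indicate queries, CGI parameters, etc.
# -[?*!@=]
# relaxed version that still skips the other characters but lets the
# ?v=1 / ?v=2 variants through
-[*!@]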
Related
I am crawling some websites using Apache Nutch, and I need to boost one website over all the others. Suppose that out of 100 URLs in the seed, one is a wiki URL; I want to give all data from the wiki a boost so that its pages are displayed at the top. I am using Solr 4.10.3.
I recrawl these websites every few days, so I think an index boost applied on the Solr side will not work; it should be Nutch that does it. Any idea?
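One Nutch-side option worth considering (a sketch, assuming per-URL seed metadata fits the setup): the injector accepts key=value metadata after each seed URL, and nutch.score sets the injected score, which with the default scoring plugin influences the boost value that Nutch sends to Solr at indexing time. The wiki URL and score below are placeholders:
# seed file: URL and metadata, separated by a tab character
http://en.wikipedia.org/wiki/Main_Page	nutch.score=10.0
Alternatively, since the schema shipped with Nutch exposes a host field in Solr, a query-time boost such as defType=edismax&bq=host:en.wikipedia.org^5 would survive recrawls without touching the index (again assuming the host field is populated in your schema).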
I have tried indexing the public URL of a Google Drive document, but it seems that it does not work. Is there any way to crawl Google Drive documents with Nutch and index them with Solr?
Use the Google Drive API to read/manage files:
https://developers.google.com/drive/web/about-sdk
A Drive public URL's page doesn't contain direct links to its subdirectories, so you will get nothing if you crawl those pages.
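As a rough sketch of the API route (using the Drive v3 Java client; the folder ID is a placeholder and the OAuth credential setup is omitted), listing the files in a shared folder looks roughly like this:
import com.google.api.services.drive.Drive;
import com.google.api.services.drive.model.File;
import com.google.api.services.drive.model.FileList;

public class DriveLister {
    // 'service' must be a Drive client already built with valid OAuth credentials
    static void listFolder(Drive service, String folderId) throws Exception {
        FileList result = service.files().list()
                .setQ("'" + folderId + "' in parents")               // files inside the folder
                .setFields("files(id, name, mimeType, webViewLink)") // only the fields we need
                .execute();
        for (File f : result.getFiles()) {
            System.out.println(f.getName() + " -> " + f.getWebViewLink());
        }
    }
}
The returned links could then be fed to Nutch as seeds, or the file content fetched and indexed into Solr directly.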
I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. It is injecting old URLs which I may have given earlier but which are now commented out. It looks like Nutch is injecting URLs in some strange order. What is the reason? Also, could anybody point me to a book or detailed tutorial on Nutch, because most of the tutorials available only cover installation.
As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.
You can nuke your previous runs completely like this user did and start fresh, or you can remove the unwanted URLs from the crawldb by applying your URL filters to it (a short sketch of the mergedb route follows below):
CLI via bin/nutch mergedb (CrawlDbMerger) with the -filter option
CLI via bin/nutch updatedb with the -filter option
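As a rough sketch of the mergedb route (assuming the crawl data lives under crawl/ and the unwanted URLs are already rejected by your current conf/regex-urlfilter.txt rules):
# write a filtered copy of the crawldb, dropping URLs rejected by the URL filters
bin/nutch mergedb crawl/crawldb-filtered crawl/crawldb -filter
# swap the filtered copy in place of the old crawldb
mv crawl/crawldb crawl/crawldb-old
mv crawl/crawldb-filtered crawl/crawldb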
Ok, so I'm trying to set up Nutch to crawl a site and index the pages into Solr. I'm currently using Nutch 1.9 with Solr 4.10.2.
I've followed these instructions: http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search
The crawling appears to go just fine, but when I check the collection in Solr (using the web UI) there are no documents indexed... any idea where I could check for problems?
Found my problem, I'll leave it as an answer in case anyone else has the same symptoms:
My problem was the proxy configuration. My Linux box has the proxy configured system-wide, but I also had to configure Nutch to use the same proxy. Once I changed that, it started to work.
The configuration is under conf/nutch-default.xml
Edit with more info
To be more specific, here is the Proxy configuration I had to change:
<property>
  <name>http.proxy.host</name>
  <value>xxx.xxx.xxx</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
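If the proxy listens on a non-standard port, the companion http.proxy.port property in the same file presumably needs to be set as well (the port below is just an example value):
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
  <description>The proxy port.</description>
</property>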
I have an application which crawls websites using Apache Nutch 2.1 and persists the data to MySQL. I have to integrate Nutch and Solr, which is not a problem, as enough documentation is available on the internet.
After storing the content from the webpages, I want to add search functionality based on Solr. I need to search for keywords in the webpages. For example, if I am crawling movie-related websites and I want to search for a specific movie (as a keyword) in the crawled data, what changes do I need to make to the Solr configuration? Do I need to write a separate plugin altogether, or can I use existing plugins? What type of indexing do I have to add to the Solr configuration?
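For plain keyword search, a custom plugin is usually not required: Nutch's default indexing plugins already send fields such as title and content to Solr, and the schema that ships with Nutch declares them as analyzed text fields. As an illustration only (the core name collection1 and the movie title are placeholders), a keyword query against those fields could look like:
# field-scoped keyword search over the title and content fields indexed by Nutch
http://localhost:8983/solr/collection1/select?q=title:godfather+OR+content:godfather&fl=url,title,score&wt=json
# or let edismax search both fields, weighting title matches higher
http://localhost:8983/solr/collection1/select?defType=edismax&qf=title^2+content&q=godfather&fl=url,title,score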