OK, so I'm trying to set up Nutch to crawl a site and index the pages into Solr. I'm currently using Nutch 1.9 with Solr 4.10.2.
I've followed these instructions: http://wiki.apache.org/nutch/NutchTutorial#A4._Setup_Solr_for_search
The crawling appears to go just fine, but when I check the collection in Solr (using the web UI) there are no documents indexed. Any idea where I could check for problems?
I found my problem; I'll leave it here as an answer in case anyone else has the same symptoms:
My problem was the proxy configuration. My Linux box has a proxy configured system-wide, but I also had to configure Nutch to use the same proxy. Once I changed that, it started to work.
The configuration is under conf/nutch-default.xml (it's generally better to put the override in conf/nutch-site.xml, which takes precedence).
Edit with more info
To be more specific, here is the proxy configuration I had to change:
<property>
  <name>http.proxy.host</name>
  <value>xxx.xxx.xxx</value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>
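If your proxy listens on a non-standard port, you will likely also need to set http.proxy.port alongside it. A minimal sketch, assuming a typical proxy setup (the host and port values below are placeholders, not my actual settings):
<!-- hypothetical values; use your proxy's actual host and port -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>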
Related
I am new to SolrCloud, so sorry for this question. I am using 3 Solr nodes and now I need to update the database settings in data-config.xml, but I have no idea how to do that.
In standalone Solr we have data-config.xml and can edit it in place, but I don't know how to do this in SolrCloud.
Please help me.
data-config.xml is just another piece of a collection's configuration (if the collection needs one; not all do). So you get it ready locally, then deploy it the same way you deploy schema.xml, using the bundled utilities:
bin\solr upconfig ...
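For example, you can re-upload the whole config set (including the updated data-config.xml) to ZooKeeper and then reload the collection so every node picks up the change. A hedged sketch; the config set name, paths, and ZooKeeper address are assumptions for a typical install:
# newer Solr releases expose config upload directly:
bin/solr zk upconfig -n myconf -d /path/to/myconf/conf -z localhost:2181
# older releases ship a zkcli script instead (its path varies by version):
server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
  -cmd upconfig -confname myconf -confdir /path/to/myconf/conf
# either way, reload the collection afterwards so all nodes re-read the config:
curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"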
I have ZooKeeper and the Solr service started, and in Cloudera Manager I have also created HDFS, but I am still not able to get Nutch and Solr working together in Cloudera.
I do not know the next steps to get crawling and indexing of new URLs working, and then to query the Solr index.
Does anyone know how to proceed?
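For whatever it's worth, on a plain (non-Cloudera) Nutch 1.x install the remaining steps usually look like the sketch below; the seed directory, crawl directory, Solr URL, and number of rounds are all assumptions you would adapt to your cluster:
# put your start URLs in urls/seed.txt, then run the bundled crawl script:
# usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
bin/crawl urls crawl http://localhost:8983/solr/collection1 2
# once indexing finishes, query Solr directly to verify documents landed:
curl "http://localhost:8983/solr/collection1/select?q=*:*&wt=json"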
I have just installed Nutch integrated with Solr and started crawling, but Nutch is not immediately crawling the URLs I specify in seed.txt. Instead it is injecting old URLs that I gave it earlier but have since commented out. It looks like Nutch is injecting URLs in some strange order; what is the reason? Also, could anybody point me to a book or detailed tutorial on Nutch? Most of the tutorials available only cover installation.
As mentioned in an answer to a similar question, the old URLs are still in Nutch's crawldb.
You can nuke your previous runs completely (delete the crawl directory) like this user did and start fresh, or you can remove the unwanted URLs via CrawlDbMerger:
CLI via bin/nutch mergedb, passing -filter so your current URL filters are applied to the merged crawldb (see the sketch below)
Java API via org.apache.nutch.crawl.CrawlDbMerger directly
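A minimal sketch of the filtering approach, assuming your crawl data lives in crawl/crawldb and conf/regex-urlfilter.txt already excludes the URLs you no longer want:
# merge the crawldb into a new one, applying the current URL filters
bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter
# swap the filtered db in (keep a backup of the original)
mv crawl/crawldb crawl/crawldb.bak
mv crawl/crawldb_filtered crawl/crawldb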
I am trying to integrate Apache Nutch and Solr, and when Nutch tries to dump its output to Solr, it throws:
HTTP method POST is not supported by this URL
I checked the configuration but couldn't find the right setting to make the Solr URL accept POST, and I don't want to use Tomcat to run Solr. How can I make the Solr URL accept POST?
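For context, this error typically comes back when the URL handed to Nutch points at a page that only answers GET (for example the Solr admin UI at /solr) rather than at a core or collection. A hedged sketch of the indexing command with a core-level URL; the core name collection1 and the crawl paths are assumptions:
# point Nutch at the core/collection itself, not the bare /solr admin page
bin/nutch solrindex http://localhost:8983/solr/collection1 \
  crawl/crawldb -linkdb crawl/linkdb crawl/segments/*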
I have a Solr instance up and running and I can visit the Solr admin page without any problem. I have set up a Solr multicore with one core for CKAN and another core for a different application, and I can see the two collections in the admin page. I don't understand why CKAN is not able to connect to Solr. I have even included the Solr URL in production.ini.
ckan.lib.search Problems were found while connecting to the SOLR server
Edit #1: I installed CKAN from source; I already had Solr running, so all I did was add a new core & collection for CKAN in the existing Solr instance.
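For reference, CKAN reads its Solr endpoint from the solr_url option in production.ini, and that URL has to point at the CKAN core itself rather than the Solr root. A minimal sketch; the host, port, and core name are assumptions for a default install:
solr_url = http://127.0.0.1:8983/solr/ckan
It may also be worth hitting that exact URL's select handler from the CKAN host (e.g. curl "http://127.0.0.1:8983/solr/ckan/select?q=*:*"); a 404 there usually means the core name in production.ini doesn't match the core you created.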