Nutch 1.3 and Solr 4.4.0 integration Job failed - solr

I am trying to crawl the web using nutch and I followed the documentation steps in the nutch's official web site (run the crawl successfully, copy the scheme-solr4.xml into solr directory). but when I run the
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
I get the following error:
Indexer: starting at 2013-08-25 09:17:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
I have to mention that the solr is running but I cannot browse http://localhost:8983/solr/admin (it redirects me to http://localhost:8983/solr/#).
On the other hand, when I stop the solr, I get the same error! Does anybody have any idea about what is wrong with my setting?
P.S. the url that I crawl is: http://localhost/NORC

Check your configuration against: Solr and Nutch
Nutch and Solr's schema files should be the same or you may encounter problems so make sure they match up

When I meet same problem in nutch, the solr's log appear a error message "unknown field host".
After modifying the schema.xml in solr, the nutch's error vanish.

You are missing the name of the core inside your command.
e.g.:
./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/your_corname urls/ crawl 1

Related

Integrating Solr with Nutch issue

I am following a tutorial from here. i have got solr and nutch installed separately and they are both working all fine. The problem comes when i have to integrate them. From the earlier posts on this site i learned that there could some issue with the schema files. As mentioned in the tut i copied the schema.xml of nutch to the schema.xml of solr and restarted the solr. solr stoped because of configuration issues. So i simply copied the contents of each file into the other along with the existing content. Now (and previously as well) i get this error:
Indexer: starting at 2014-08-05 11:10:21
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Can someone suggest what should be done?
I am using apache-nutch-1.8 and solr-4.9.0
Here is how my hadoop.log file looks like:
2014-08-05 12:50:05,032 INFO crawl.Injector - Injector: starting at 2014-08-05 12:50:05
2014-08-05 12:50:05,033 INFO crawl.Injector - Injector: crawlDb: -dir/crawldb
2014-08-05 12:50:05,033 INFO crawl.Injector - Injector: urlDir: urls
.
.
.
.
.
2014-08-05 13:04:21,255 INFO solr.SolrIndexWriter - Indexing 1 documents
2014-08-05 13:04:21,286 WARN mapred.LocalJobRunner - job_local1310160376_0001
org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://my-solr-url:8983/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2014-08-05 13:04:21,544 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
2014-08-05 13:10:37,855 INFO crawl.Injector - Injector: starting at 2014-08-05 13:10:37
.
.
.
may be because of some versioning differences the tutorial suggested to copy the conf/schema.xml whereas in this particular version of solr, the file schema-solr4.xml was supposed to be copied followed by addition of : <field name="_version_" type="long" indexed="true" stored="true"/> in line no 351. Restart the solr by java -jar start.jar and it works all normal! Hope this helps someone!

Error while indexing .xml files in solr

I am trying to index xml files in solr search engine using following command:
java -Durl=http://10.1.11.143:8080/solr/#/ -jar post.jar solr.xml
But I am getting following error:
SimplePostTool version 1.5
Posting files to base url http://10.1.11.143:8080/solr/#/ using content-type application/xml..
POSTing file solr.xml
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://10.1.11.143:8080/solr/#/
1 files indexed.
COMMITting Solr index changes to http://10.1.11.143:8080/solr/#/..
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error for url http://10.1.11.143:8080/solr/#/?commit=true
Time spent: 0:00:00.017
Please help me to come out of this error.
Content of solr.xml is as shown in the picture:
The issue is because of the URL. You didn't mention any requestHandler while updating. Use the following command. It'll work.
java -Durl=http://10.1.11.143:8080/solr/update?commit=true -jar post.jar solr.xml
/update is the requestHandler to index data into Solr.

Nutch message "No IndexWriters activated" while loading to solr

I have run nutch crawler as per nutch tutorial http://wiki.apache.org/nutch/NutchTutorial but when i started loading it to solr i am getting this message i.e. "No IndexWriters activated - check your configuration"
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -dir crawl/segments/
Indexer: starting at 2013-07-15 08:09:13
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
**No IndexWriters activated - check your configuration**
Indexer: finished at 2013-07-15 08:09:21, elapsed: 00:00:07
Make sure that the plugin indexer-solr is included. Go to the file: conf/nutch-site.xml and in the property plugin.includes add the plugin, for instance:
protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
After adding the plugin the No IndexWriters activated - check your configuration warning disappeared in my case.
Check this thread: http://lucene.472066.n3.nabble.com/a-plugin-extending-IndexWriter-td4074353.html
#Tryskele + #Scott101 worked for me:
add plugin.includes property to both /conf/nutch-site.xml and runtime/local/conf/nutch-site.xml files:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Don't know if this is still an issue, but I was having this problem and then realized that my src/plugin/build.xml was missing the indexer-solr plugin. Adding the following and then recompiling nutch fixed it for me:
<ant dir="indexer-solr" target="deploy"/>
Add the below property in conf/nutch-site.xml for plugin
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Let me know if it solves your problem.

Nutch 1.2 Solr 3.6 integration issue

I have crawled a site successfully using NUTCH 1.2 .Now I want to integrate this with solr 3.6 . Problem is when I am issuing command
$ bin/nutch solrindex //localhost:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/* an error occurs
SolrIndexer: starting at 2013-07-08 14:52:27
java.io.IOException: Job failed!
Please help me to solve this issue
Here is my nutch log
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:469)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:69)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:75)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2013-07-08 15:17:39,539 ERROR solr.SolrIndexer - java.io.IOException: Job f
This is mainly the javabin incompatiblity between the Solrj version jars used by Nutch and the Solr 3.6 which you are trying to integrate.
You would need to update the Solrj jars and regenerate the jobs.
Follow the steps as mentioned in the forum.

Error while indexing in solr data crawled by nutch

I have starting working with nutch and solr and I have a problem with integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
nutch shows message:
java.io.IOException: Job failed!
and solr is showing:
SEVERE: org.apache.solr.common.SolrException: ERROR:
[doc=http://nutch.apache.org/] unknown field 'host'
I thought that the reason might be a missing 'host' field in the $SOLR_HOME/example/solr/conf/schema.xml but it is there.
I would be very grateful for your help.
Changing configuration at Nutch side does not effect the schema of Solr. You have to define that field at schema.xml of Solr.

Resources