Integrating Solr with Nutch issue - solr

I am following a tutorial from here. I have Solr and Nutch installed separately, and both work fine on their own. The problem comes when I try to integrate them. From earlier posts on this site I learned that there could be an issue with the schema files. As mentioned in the tutorial, I copied Nutch's schema.xml over Solr's schema.xml and restarted Solr. Solr stopped because of configuration issues. So I simply copied the contents of each file into the other, alongside the existing content. Now (and previously as well) I get this error:
Indexer: starting at 2014-08-05 11:10:21
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Can someone suggest what should be done?
I am using apache-nutch-1.8 and solr-4.9.0
Here is what my hadoop.log file looks like:
2014-08-05 12:50:05,032 INFO crawl.Injector - Injector: starting at 2014-08-05 12:50:05
2014-08-05 12:50:05,033 INFO crawl.Injector - Injector: crawlDb: -dir/crawldb
2014-08-05 12:50:05,033 INFO crawl.Injector - Injector: urlDir: urls
.
.
.
.
.
2014-08-05 13:04:21,255 INFO solr.SolrIndexWriter - Indexing 1 documents
2014-08-05 13:04:21,286 WARN mapred.LocalJobRunner - job_local1310160376_0001
org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://my-solr-url:8983/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2014-08-05 13:04:21,544 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
2014-08-05 13:10:37,855 INFO crawl.Injector - Injector: starting at 2014-08-05 13:10:37
.
.
.

Possibly because of version differences, the tutorial suggested copying conf/schema.xml, whereas for this version of Solr the file schema-solr4.xml should be copied instead, followed by adding <field name="_version_" type="long" indexed="true" stored="true"/> at line 351. Restart Solr with java -jar start.jar and it works normally. Hope this helps someone!
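As a rough illustration of the fix above, the following Python sketch checks whether a schema declares the _version_ field and adds it if it is missing. The sample schema, the helper names has_field and add_version_field, and the <fields> wrapper are assumptions made for the example; a real schema-solr4.xml is much longer.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down schema for illustration only.
SAMPLE_SCHEMA = """<schema name="nutch" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="content" type="text_general" indexed="true" stored="false"/>
  </fields>
</schema>"""


def has_field(schema_xml, field_name):
    """Return True if the schema declares a <field> with the given name."""
    root = ET.fromstring(schema_xml)
    return any(f.get("name") == field_name for f in root.iter("field"))


def add_version_field(schema_xml):
    """Return the schema text with the _version_ field added if it is absent."""
    if has_field(schema_xml, "_version_"):
        return schema_xml
    root = ET.fromstring(schema_xml)
    # Fall back to the root element if there is no <fields> wrapper.
    parent = root.find("fields")
    if parent is None:
        parent = root
    ET.SubElement(parent, "field", {
        "name": "_version_", "type": "long", "indexed": "true", "stored": "true",
    })
    return ET.tostring(root, encoding="unicode")


patched = add_version_field(SAMPLE_SCHEMA)
print(has_field(SAMPLE_SCHEMA, "_version_"), has_field(patched, "_version_"))
```

Note that ElementTree drops XML comments when it parses with default settings, so for a real schema file it may be safer to use only the check and add the line by hand.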

Related

Error while indexing .xml files in solr

I am trying to index XML files in Solr using the following command:
java -Durl=http://10.1.11.143:8080/solr/#/ -jar post.jar solr.xml
But I am getting the following error:
SimplePostTool version 1.5
Posting files to base url http://10.1.11.143:8080/solr/#/ using content-type application/xml..
POSTing file solr.xml
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error
SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://10.1.11.143:8080/solr/#/
1 files indexed.
COMMITting Solr index changes to http://10.1.11.143:8080/solr/#/..
SimplePostTool: WARNING: Solr returned an error #500 Internal Server Error for url http://10.1.11.143:8080/solr/#/?commit=true
Time spent: 0:00:00.017
Please help me resolve this error.
Content of solr.xml is as shown in the picture:
The issue is caused by the URL. It points at the admin UI (the #/ part) and does not name a request handler for the update. Use the following command instead:
java -Durl=http://10.1.11.143:8080/solr/update?commit=true -jar post.jar solr.xml
/update is the request handler that indexes data into Solr.
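The URL correction can be sketched in Python as follows. The helper name solr_update_url is hypothetical; the sketch only shows how the admin-UI fragment is dropped and the /update handler appended, using the address from the question.

```python
from urllib.parse import urlsplit, urlunsplit


def solr_update_url(base_url, commit=True):
    """Turn a Solr base or admin-UI URL into an /update handler URL."""
    parts = urlsplit(base_url)
    # The fragment ("#/...") belongs to the browser-side admin UI, so it is dropped;
    # a trailing slash is removed before the handler name is appended.
    path = parts.path.rstrip("/") + "/update"
    query = "commit=true" if commit else ""
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


print(solr_update_url("http://10.1.11.143:8080/solr/#/"))
```

The resulting string is what would be passed to post.jar through -Durl.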

Nutch 1.3 and Solr 4.4.0 integration Job failed

I am trying to crawl the web using Nutch, and I followed the documentation steps on Nutch's official website (ran the crawl successfully, copied schema-solr4.xml into the Solr directory). But when I run
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
I get the following error:
Indexer: starting at 2013-08-25 09:17:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
I should mention that Solr is running, but I cannot browse http://localhost:8983/solr/admin (it redirects me to http://localhost:8983/solr/#).
On the other hand, when I stop Solr, I get the same error! Does anybody have any idea what is wrong with my setup?
P.S. The URL that I crawl is: http://localhost/NORC
Check your configuration against: Solr and Nutch
Nutch's and Solr's schema files should be the same or you may encounter problems, so make sure they match.
When I hit the same problem in Nutch, Solr's log showed the error message "unknown field host".
After I modified schema.xml in Solr, the Nutch error went away.
You are missing the name of the core in your command.
e.g.:
./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/your_corename urls/ crawl 1

Nutch message "No IndexWriters activated" while loading to solr

I have run the Nutch crawler as per the Nutch tutorial http://wiki.apache.org/nutch/NutchTutorial, but when I start loading the results into Solr I get the message "No IndexWriters activated - check your configuration":
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -dir crawl/segments/
Indexer: starting at 2013-07-15 08:09:13
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
**No IndexWriters activated - check your configuration**
Indexer: finished at 2013-07-15 08:09:21, elapsed: 00:00:07
Make sure that the indexer-solr plugin is included. Open conf/nutch-site.xml and add the plugin to the plugin.includes property, for instance:
protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)
After adding the plugin, the "No IndexWriters activated - check your configuration" warning disappeared in my case.
Check this thread: http://lucene.472066.n3.nabble.com/a-plugin-extending-IndexWriter-td4074353.html
@Tryskele + @Scott101 worked for me:
Add the plugin.includes property to both the /conf/nutch-site.xml and runtime/local/conf/nutch-site.xml files:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Don't know if this is still an issue, but I was having this problem and then realized that my src/plugin/build.xml was missing the indexer-solr plugin. Adding the following and then recompiling nutch fixed it for me:
<ant dir="indexer-solr" target="deploy"/>
Add the property below to conf/nutch-site.xml to enable the plugin:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
Let me know if it solves your problem.
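To check a configuration before re-running the indexer, a small Python sketch like the one below may help. It assumes that Nutch treats plugin.includes as a regular expression that must match the whole plugin id; the helper names and the sample nutch-site.xml (which uses the shorter value from the first answer) are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical nutch-site.xml content for illustration.
NUTCH_SITE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>"""


def plugin_includes(site_xml):
    """Return the plugin.includes value, or an empty string if it is not set."""
    root = ET.fromstring(site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "plugin.includes":
            return (prop.findtext("value") or "").strip()
    return ""


def is_plugin_included(site_xml, plugin_id):
    """Check whether plugin_id fully matches the plugin.includes pattern."""
    pattern = plugin_includes(site_xml)
    return bool(pattern) and re.fullmatch(pattern, plugin_id) is not None


print(is_plugin_included(NUTCH_SITE, "indexer-solr"))
```

Because the value is matched as one pattern, a missing | between two alternatives silently disables the plugins on both sides of the gap, which is worth checking in long values.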

nutch 1.2 solr 3.1 bad request issue

I have crawled a site successfully using Nutch 1.2. Now I want to integrate it with Solr 3.1. The problem is that when I issue the command $ bin/nutch solrindex localhost:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/* an error occurs. I am attaching my Nutch logs.
Please help me solve this issue.
Bad Request
request: //localhost:8080/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:75)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2013-07-08 17:38:47,577 ERROR solr.SolrIndexer - java.io.IOException: Job failed!

Error while indexing in solr data crawled by nutch

I have started working with Nutch and Solr, and I have a problem integrating them. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Nutch shows the message:
java.io.IOException: Job failed!
and Solr shows:
SEVERE: org.apache.solr.common.SolrException: ERROR:
[doc=http://nutch.apache.org/] unknown field 'host'
I thought that the reason might be a missing 'host' field in the $SOLR_HOME/example/solr/conf/schema.xml but it is there.
I would be very grateful for your help.
Changing the configuration on the Nutch side does not affect Solr's schema. You have to define that field in Solr's schema.xml.
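One way to find such gaps before indexing is to compare the field names Nutch sends with the fields the running Solr schema declares. The Python sketch below does this for an inline sample. The schema text, the NUTCH_FIELDS list, and the helper name missing_fields are assumptions for illustration; the real field set depends on your solrindex-mapping.xml and enabled plugins.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened Solr schema; it lacks the "host" field on purpose.
SOLR_SCHEMA = """<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="content" type="text_general" indexed="true" stored="false"/>
  </fields>
</schema>"""

# Assumed list of field names sent by Nutch, for illustration only.
NUTCH_FIELDS = ["id", "title", "content", "host", "url"]


def missing_fields(schema_xml, required):
    """Return the required field names that the schema does not declare, in order."""
    root = ET.fromstring(schema_xml)
    declared = {f.get("name") for f in root.iter("field")}
    return [name for name in required if name not in declared]


print(missing_fields(SOLR_SCHEMA, NUTCH_FIELDS))
```

The sketch only looks at <field> elements and ignores <dynamicField> patterns, so a field it reports as missing may still be accepted by a schema that defines a matching dynamic field. Make sure the file you check is the schema.xml of the Solr instance that is actually running.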
