I have crawled a site successfully using Nutch 1.2. Now I want to integrate this with Solr 3.1. The problem is that when I issue the command
$ bin/nutch solrindex localhost:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
an error occurs. I am attaching my Nutch logs.
Please help me to solve this issue.
Bad Request
request: //localhost:8080/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:436)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:49)
at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:75)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2013-07-08 17:38:47,577 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
I have followed the tutorial and configured Nutch to run on Windows 7 using Cygwin, and I'm using Solr 5.4.0 to index the data.
But Nutch 1.11 has a problem executing the crawl.
Crawl Command
$ bin/crawl -i -D solr.server.url=http://127.0.0.1:8983/solr /urls /TestCrawl 2
Error/Exception
Injecting seed URLs /apache-nutch-1.11/bin/nutch inject /TestCrawl/crawldb /urls
Injector: starting at 2016-01-19 17:11:06
Injector: crawlDb: /TestCrawl/crawldb
Injector: urlDir: /urls
Injector: Converting injected urls to crawl db entries.
Injector: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:739)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:722)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:125)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:348)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.crawl.Injector.inject(Injector.java:323)
at org.apache.nutch.crawl.Injector.run(Injector.java:379)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.crawl.Injector.main(Injector.java:369)
Error running:
/home/apache-nutch-1.11/bin/nutch inject /TestCrawl/crawldb /urls
Failed with exit value 127.
I can see there are multiple problems with your command; try this:
bin/crawl -i -Dsolr.server.url=http://127.0.0.1:8983/solr/core_name path_to_seed crawl 2
The first problem is the space after -D when you pass the Solr parameter. The second problem is that the Solr URL should include the core name as well.
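Before rerunning, it can help to confirm that the core URL actually answers. A quick sanity check with curl (core_name is a placeholder here, exactly as in the command above; the ping handler is available in the stock Solr example configs):
curl http://127.0.0.1:8983/solr/core_name/admin/ping
A healthy core returns a response with status OK; a 404 means the URL or core name is still wrong.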
The hadoop-core jar file is needed when you are working with Nutch.
With Nutch 1.11 the compatible hadoop-core jar is version 0.20.0.
Please download the jar from this link:
http://www.java2s.com/Code/Jar/h/Downloadhadoop0200corejar.htm
Paste that jar into the "C:\cygwin64\home\apache-nutch-1.11\lib" folder and it will run successfully.
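From a Cygwin shell, that copy is just the following (the download location is an assumption; use wherever you saved the jar):
cp ~/Downloads/hadoop-0.20.0-core.jar /home/apache-nutch-1.11/lib/
Note that /home/apache-nutch-1.11 under Cygwin is the same directory as C:\cygwin64\home\apache-nutch-1.11. Then rerun the crawl from the same shell.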
I am following a tutorial from here. I have got Solr and Nutch installed separately and they are both working fine. The problem comes when I have to integrate them. From earlier posts on this site I learned that there could be some issue with the schema files. As mentioned in the tutorial, I copied Nutch's schema.xml over Solr's schema.xml and restarted Solr; Solr stopped because of configuration issues. So I simply copied the contents of each file into the other along with the existing content. Now (and previously as well) I get this error:
Indexer: starting at 2014-08-05 11:10:21
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Can someone suggest what should be done?
I am using apache-nutch-1.8 and solr-4.9.0.
Here is what my hadoop.log file looks like:
2014-08-05 12:50:05,032 INFO crawl.Injector - Injector: starting at 2014-08-05 12:50:05
2014-08-05 12:50:05,033 INFO crawl.Injector - Injector: crawlDb: -dir/crawldb
2014-08-05 12:50:05,033 INFO crawl.Injector - Injector: urlDir: urls
.
.
.
.
.
2014-08-05 13:04:21,255 INFO solr.SolrIndexWriter - Indexing 1 documents
2014-08-05 13:04:21,286 WARN mapred.LocalJobRunner - job_local1310160376_0001
org.apache.solr.common.SolrException: Bad Request
Bad Request
request: http://my-solr-url:8983/solr/update?wt=javabin&version=2
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:155)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:118)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:467)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:535)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2014-08-05 13:04:21,544 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
2014-08-05 13:10:37,855 INFO crawl.Injector - Injector: starting at 2014-08-05 13:10:37
.
.
.
Maybe because of some versioning differences: the tutorial suggested copying conf/schema.xml, whereas in this particular version of Solr the file schema-solr4.xml was supposed to be copied, followed by the addition of <field name="_version_" type="long" indexed="true" stored="true"/> at line no. 351. Restart Solr with java -jar start.jar and it works all normal. Hope this helps someone!
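For this pairing, the whole fix looks roughly like this (paths are assumptions based on the default apache-nutch-1.8 and solr-4.9.0 layouts; adjust to where you unpacked each):
cp apache-nutch-1.8/conf/schema-solr4.xml solr-4.9.0/example/solr/collection1/conf/schema.xml
# edit the copied schema.xml and add the _version_ field shown above
cd solr-4.9.0/example
java -jar start.jar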
I am trying to crawl the web using Nutch, and I followed the documentation steps on Nutch's official website (ran the crawl successfully, copied schema-solr4.xml into the Solr directory). But when I run
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
I get the following error:
Indexer: starting at 2013-08-25 09:17:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)
I have to mention that Solr is running, but I cannot browse http://localhost:8983/solr/admin (it redirects me to http://localhost:8983/solr/#).
On the other hand, when I stop Solr, I get the same error! Does anybody have any idea what is wrong with my setup?
P.S. The URL that I crawl is: http://localhost/NORC
Check your configuration against: Solr and Nutch
Nutch's and Solr's schema files should be the same, or you may encounter problems, so make sure they match up.
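A quick way to check is to diff the two files (a sketch; $NUTCH_HOME and $SOLR_HOME are placeholders for your install directories, and the exact file names depend on your versions):
diff $NUTCH_HOME/conf/schema.xml $SOLR_HOME/example/solr/collection1/conf/schema.xml
Any output means the schemas have drifted apart.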
When I met the same problem in Nutch, Solr's log showed the error message "unknown field host".
After modifying the schema.xml in Solr, the Nutch error vanished.
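If you hit the same Bad Request, the rejected field name is usually easier to find in Solr's log than in Nutch's (the path assumes Solr 4.x's default log4j setup; older examples log to the console instead):
grep -i "unknown field" example/logs/solr.log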
You are missing the name of the core inside your command,
e.g.:
./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/your_corename urls/ crawl 1
Note that the URL should not contain the "#" fragment from the admin UI; point it at the core itself.
I have crawled a site successfully using Nutch 1.2. Now I want to integrate this with Solr 3.6. The problem is that when I issue the command
$ bin/nutch solrindex //localhost:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
an error occurs:
SolrIndexer: starting at 2013-07-08 14:52:27
java.io.IOException: Job failed!
Please help me to solve this issue.
Here is my Nutch log:
java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format
at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:469)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:249)
at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:69)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at org.apache.nutch.indexer.solr.SolrWriter.close(SolrWriter.java:75)
at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:48)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:474)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
2013-07-08 15:17:39,539 ERROR solr.SolrIndexer - java.io.IOException: Job failed!
This is mainly a javabin incompatibility between the SolrJ jars used by Nutch and the Solr 3.6 you are trying to integrate: the two ends speak different versions of the javabin wire format, which is why the response parser reports "expected 2, but 60".
You would need to update the SolrJ jars and regenerate the job file.
Follow the steps as mentioned in the forum.
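In outline, the swap looks like this (a rough sketch assuming a source checkout built with Ant; jar names and paths are illustrative, and with an Ivy-based build you would bump the solrj version in ivy/ivy.xml instead of copying jars):
# replace the SolrJ jar Nutch was built against with the one matching your Solr
cp solr-3.6.0/dist/apache-solr-solrj-3.6.0.jar apache-nutch-1.2/lib/
# rebuild so the new jar is packed into the .job file
cd apache-nutch-1.2
ant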
I have started working with Nutch and Solr, and I have a problem integrating Solr with Nutch. I followed this tutorial: http://wiki.apache.org/nutch/NutchTutorial and after using:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Nutch shows the message:
java.io.IOException: Job failed!
and Solr is showing:
SEVERE: org.apache.solr.common.SolrException: ERROR:
[doc=http://nutch.apache.org/] unknown field 'host'
I thought the reason might be a missing 'host' field in $SOLR_HOME/example/solr/conf/schema.xml, but it is there.
I would be very grateful for your help.
Changing the configuration on the Nutch side does not affect the schema of Solr. You have to define that field in Solr's schema.xml.
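For illustration, a minimal definition that would satisfy the indexer (the type and attributes are an assumption modeled on the schema.xml shipped with Nutch; check that file for the exact types your version expects):
<field name="host" type="string" stored="false" indexed="true"/>
Put it inside the <fields> section of $SOLR_HOME/example/solr/conf/schema.xml and restart Solr so the change is picked up.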