This is similar to solr5.3.15-nutch here, but with a few extra wrinkles. First, as background: I tried Solr 4.9.1 with Nutch and had no problems. I then moved up to Solr 6.0.1. Integration worked great as a standalone, and I got the backend code working to parse the JSON, etc. Ultimately, however, we need security, and we don't want to use Kerberos. According to the Solr security documentation, basic auth and rule-based auth (which is what we want) work only in cloud mode (as an aside, if anyone has suggestions for getting non-Kerberos security working in standalone mode, that would work as well).
So I went through the doc at Solr-Cloud-Ref, using the interactive startup and taking all the defaults, except for the name of the collection, which I made "nndcweb" instead of "gettingstarted". The configuration I took was data_driven_schema_configs. To integrate Nutch, I went through many permutations of attempts; I'll only give the last two, which seemed to come closest based on what I've been able to find so far. Note that all urls actually have http://, but the posting system for Stack Overflow was complaining, so I took them out for the sake of this post. Starting from the earlier Stack Overflow reference, the last one I tried was:
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -D solr.server.url=localhost:8939/solr/nndcweb/ -Dsolr.server.type=cloud -D solr.zookeeper.url=localhost:9983/ -dir crawl/segments/* -normalize
I ended up with the same problem noted in the previous thread mentioned: namely,
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=localhost:8939/solr/nndcweb
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:172)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:217)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=localhost:8939/solr/nndcweb
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3048)
at java.net.URI.&lt;init&gt;(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
I also tried:
bin/nutch solrindex localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -Dsolr.server.type=cloud -D solr.zookeeper.url=localhost:9983/ -dir crawl/segments/* -normalize
and I get the same thing. The help for solrindex indicates using -params with an "&" separating the options (in contrast to using -D). Left unquoted, though, the "&" only serves to tell my Linux shell to try to run some strange things in the background, of course.
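Quoting the whole -params value would at least keep the shell from backgrounding anything; something like this, as a sketch only (the parameter names are just the ones from the -D attempts above, and I'm not sure solrindex actually honors them when passed this way):
bin/nutch solrindex http://localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -params 'solr.server.type=cloud&solr.zookeeper.url=localhost:9983' -dir crawl/segments/* -normalize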
Does anybody have any suggestions on what to try next? Thanks!
Update
I updated the commands used above to reflect the correction of a silly mistake I made. Note that all url references, in practice, do have the http:// prefix, but I had to take them out to be able to post. In spite of the fix, I'm still getting the same exception (a sample of which I used to replace the original above, again with the http:// cut out, which does make things confusing; sorry about that).
Yet Another Update
So, this is interesting. Using the solrindex option, I just took the port out of the zookeeper url, leaving just localhost (with the http:// prefix), 16 characters in all. The URISyntaxException says the problem is at index 18 (from org.apache.hadoop.fs.Path.initialize(Path.java:206)), which is past the end of that url and happens to match the "=" in "solr.zookeeper.url=". So it seems like hadoop.fs.Path.initialize() is taking the whole key=value string as the url. So perhaps I am not setting that up correctly? Or is this a bug in Hadoop? That would be hard to believe.
An Almost There Update
Alright, given the results of the last attempt, I decided to put the solr server type of cloud and the zookeeper url in the nutch-site.xml config file instead of on the command line.
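For reference, the relevant entries in conf/nutch-site.xml look roughly like this (a sketch; the property names are simply the ones I had been passing with -D, so treat them as my assumption rather than documented config):
<property>
  <!-- tell the indexer we are talking to SolrCloud rather than a standalone server -->
  <name>solr.server.type</name>
  <value>cloud</value>
</property>
<property>
  <!-- embedded ZooKeeper started by the SolrCloud getting-started defaults -->
  <name>solr.zookeeper.url</name>
  <value>localhost:9983</value>
</property>
Then I ran: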
bin/nutch solrindex http://localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -normalize
(Great, no complaints about the url from Stack Overflow now.) No URI exception anymore. Now the error I get is:
(cutting verbiage at the top)
Indexing 250 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Digging deeper into the nutch logs, I see the following:
No Collection Param specified on request and no default collection has been set.
Apparently, this has been mentioned on the Nutch mailing list, in connection with Nutch 1.11 and Solr 5 (cloud mode). There it was mentioned that it was not going to work, but that a patch would be uploaded (this was back in January 2016). Digging around on the Nutch development site, I haven't come across anything on this exact issue, just something a little bit similar for Nutch 1.13, which apparently is not officially released yet. Still digging around, but if anybody actually has this working somehow, I'd love to hear how you did it.
Edit July 12-2016
So, after a few weeks' diversion on another unrelated project, I'm back to this. Before seeing S. Doe's response below, I decided to give Elasticsearch a try instead, as this is a completely new project and we're not tied to anything yet. So far so good. Nutch is working well with it, although to use the distributed binaries I had to back the Elasticsearch version down to 1.4.1. I haven't tried the security aspect yet. Out of curiosity, I will try S. Doe's suggestion with Solr eventually and will post how that goes later.
You're not specifying the protocol to connect to Solr: you need to include the http:// portion in solr.server.url. You also used the wrong syntax to specify the port to connect to; the right URL should be: http://localhost:8983/solr/nndcweb/.
About the problem with the URL when using solrindex: I had the same problem, and I know it sounds stupid, but for some reason I can't explain, you can fix it by using the URL-encoded form (replace ":" with "%3A", "/" with "%2F", and so on) instead. At least for me, this fixed the problem.
In your case:
bin/nutch solrindex -D solr.server.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fnndcweb crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -normalize
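If you don't want to do the replacements by hand, something like this should produce the same encoded string (assuming jq is available; any URL-encoding tool would do):
printf '%s' 'http://localhost:8983/solr/nndcweb' | jq -sRr @uri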
I hope it helps.
BTW, now I'm having the exact same problem as you do (Indexer: java.io.IOException: Job failed!)
Related
I've been trying to set up CKAN; however, I am facing some problems with Solr.
Every time I run CKAN, the Solr log file registers a new event, namely:
org.apache.solr.common.SolrException: sort param field can't be found: metadata_modified
I am trying to use CKAN for the first time and I have no experience at all, so I have no idea what that log event means, nor how to fix it.
EDIT:
When I reload the core in Solr, the following is logged:
The schema has been upgraded to managed, but the non-managed schema schema.xml is still loadable. PLEASE REMOVE THIS FILE.
Could anyone help me?
Many thanks.
Well, it turns out Solr 6, for some reason, was the problem.
Downgrading to version 5 worked for me.
This sounds like you are not using CKAN's custom Solr schema. Make sure to go over all points in the setup documentation, especially point 2, and to restart Jetty afterwards:
sudo service jetty restart
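On a typical source install, that boils down to swapping Solr's default schema for CKAN's and restarting Jetty. The paths below are common defaults and are only an illustration; adjust them to wherever your CKAN source and Solr config actually live:
# back up the stock schema, then point Solr at CKAN's schema.xml
sudo mv /etc/solr/conf/schema.xml /etc/solr/conf/schema.xml.bak
sudo ln -s /usr/lib/ckan/default/src/ckan/ckan/config/solr/schema.xml /etc/solr/conf/schema.xml
sudo service jetty restart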
I set up my Solr instance to run the way I wanted. Then the service was restarted, all my setup was removed, and four gettingstarted cores were loaded instead.
Can someone explain why this happened and what I can do to prevent it from happening again? I would like the cores that I built to be persistent.
Thanks for your help
Edit: Looking over:
https://cwiki.apache.org/confluence/display/solr/Moving+to+the+New+solr.xml+Format
I have solr.xml set up exactly like the example. I have the core.properties files set up properly. I don't see how it is supposed to know to load the core I created.
Edit 2: I found this documentation, which states that any core.properties files under the home folder will be used.
https://cwiki.apache.org/confluence/display/solr/Solr+Cores+and+solr.xml
Assuming Solr 5, it sounds like you ran bin/solr restart and got the wrong collections. The reason would be that the restart command needs the same parameters as the start command, most importantly your Solr home path.
Solr home is what you provided with the -s parameter when you started Solr the last time. If you instead started from an example, this guide on Solr home locations should help.
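For example (the home path and port here are placeholders; use whatever you passed to -s and -p when you originally started Solr):
bin/solr restart -p 8983 -s /path/to/your/solr/home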
I am new to Solr.
I have created two cores from the admin page, let's call them "books" and "libraries", and imported some data there. Everything works without a hitch until I restart the server. When I do so, one of these cores disappears, and the logging screen in the admin page contains:
SEVERE CoreContainer null:java.lang.NoClassDefFoundError: net/arnx/jsonic/JSONException
SEVERE SolrCore REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@454055ac (papers) has a reference count of 1
I was testing my query in the admin interface; when I refreshed it, the "libraries" core was gone, even though I could normally query it just a minute earlier. The contents of solr.xml are intact. Even if I restart Tomcat, it remains gone.
Additionally, I was trying to build a query similar to this: "Find books matching 'war peace' in libraries in Atlanta or New York". So given the cores "books" and "libraries", I would issue the following query against "books" (which might be wrong; if it is, please correct me):
(title:(war peace) blurb:(war peace))
AND _query_:"{!join
fromIndex=libraries from=libraryid to=libraryid
v='city:(new york) city:(atlanta)'}"
When I do so, the query fails and the "libraries" core disappears, with the above symptoms. If I re-add it, I can continue working (as long as I don't restart the server or issue another join query).
I am using Solr 4.0; if anyone has a clue what is happening, I would be very grateful. I could not find anything about the meaning of the error message, so if anyone could suggest where to look for that, or how to go about debugging this, it would be really great. I can't even find where the log file itself is located...
I would avoid the Debian package, which may be misconfigured and quirky. It also contains (a very early build of?) Solr 4.0, which itself may have lingering issues, being the first release in a new major version. The package maintainer may not have incorporated the latest and safest Solr release into his package.
A better way is to download Solr 4.1 and set it up yourself with Tomcat or another servlet container.
In case you are looking to install Solr 4.0 and configure it, you can follow the installation procedure from here
Update the solr config for the cores to be persistent.
In your solr.xml, update <solr> or <solr persistent="false"> to <solr persistent="true">
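For example, a legacy-style solr.xml with persistence turned on looks roughly like this (a sketch; the core names and instanceDir values are only illustrations matching the question):
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <!-- with persistent="true", cores added or changed at runtime are written back to this file -->
  <cores adminPath="/admin/cores">
    <core name="books" instanceDir="books" />
    <core name="libraries" instanceDir="libraries" />
  </cores>
</solr>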
I need a simple way to read OpenGrok's DB from a PHP script to do some weird searches (as doing that in Java in OpenGrok itself is beyond my abilities). So I decided to use Solr as a way to query the Lucene DB directly from another language (probably PHP or C).
The problem is that when I point Solr to /var/opengrok/data, it bombs out with:
java.lang.RuntimeException: org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory@/var/opengrok/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@3a329572: files: [] at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1103)
(etc, etc, the backtrace is about three screens long)
I tried to point it somewhere inside data with no luck. The structure looks like this:
/var/opengrok/data/index/$projname/segment*
/var/opengrok/data/spelling...
and it seems like whatever Solr is using expects the segment files directly in the index directory.
I checked to see if there's any version discrepancy, but OpenGrok 0.11 is using Lucene 3.0.2 and I've set Solr to LUCENE_30 as the database version.
Any pointers will be greatly appreciated; Google didn't seem to be able to help with this.
OpenGrok's web interface can consume any well-formed search query (through the URL) and reply with XHTML results that are easily parseable, so you're probably making this more complex than it needs to be by hacking at the Lucene index directly rather than using the UI provided.
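For example, something along these lines (the host, port, webapp context, and parameter names are assumptions about a typical OpenGrok deployment, so check the URL your own search page produces):
# run a full-text search scoped to one project and save the XHTML reply for parsing
curl -s 'http://localhost:8080/source/search?q=myFunction&project=myproject' -o results.xhtml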
I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than to Nutch :)
I have been using Nutch for the past two weeks, and I wanted to know if I can query or search my Nutch crawls on the fly (before they complete). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible.
I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. I see that only the injected URLs show up in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here. But I did all the steps mentioned in the link. I think somewhere in the process a crawl should have happened, and it was missed.
Just wanted to see if someone could help point out where I went wrong in the process. Forgive my foolishness and thanks for your patience.
Cheers,
Abi
This is not possible.
What you could do, though, is chunk the crawl cycle into smaller batches of URLs so that it publishes results more often, with this command:
nutch generate crawl/crawldb crawl/segments -topN <the limit>
If you are using the one-stop crawl command, it should be the same.
I typically use a 24-hour chunking scheme.
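To make that concrete, one chunked cycle with the Nutch 1.x bin commands looks roughly like this (a sketch: the -topN value, the paths, and the Solr URL are placeholders, and the indexing step assumes the classic solrindex usage shown earlier in this thread):
# generate a limited segment, fetch and parse it, then index what we have so far
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=$(ls -d crawl/segments/* | tail -1)
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"
# repeat the cycle; each pass makes the newly fetched pages searchable while the crawl continues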