How does a Solr CDCR target cluster handle commits? The commit settings are the same as on the CDCR source, where the Solr admin UI shows freshly indexed documents with hardly any delay. Both settings are
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
with the ${solr.autoCommit.maxTime} and ${solr.autoSoftCommit.maxTime} system properties unset, as far as I can tell. (Given these settings, I am actually a bit surprised that I see results so shortly after indexing on the source cluster.)
Yet on the target cluster, I never seem to get a commit, certainly not within 15 seconds as the default suggests. I know the data is available, since it appears right after a manual commit.
Any idea how to configure the target to actually perform any commits?
I was facing the same issue. Setting a soft commit interval on the target helped me:
curl -X POST -H 'Content-type: application/json' -d '{"set-property":{"updateHandler.autoSoftCommit.maxTime":1500}}' http://localhost:8983/solr/<yourcollection>/config
I also found a hint in some forum to restart the server, but that did not help me.
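The behaviour is expected with the settings posted in the question: the hard autoCommit has openSearcher=false and the soft commit is disabled (maxTime of -1), so documents are persisted on the target but never become visible until something opens a new searcher. For reference, a minimal solrconfig.xml sketch for the target, equivalent to the Config API call above (15000 ms is just an example value; pick whatever visibility delay you can live with):
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- give the soft commit a real interval so newly replicated documents become searchable -->
  <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
</autoSoftCommit>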
My team is working on a search application for our websites. We are using Collective Solr in Plone to index our intranet and documentation sites. We recently set up shared blob storage on our test instance of the intranet site because Solr was not indexing our PDF files. This appears to be working; however, each time I run the reindexing script (@@solr-maintenance/reindex) it stops after about an hour and a half. I know that it is not indexing our entire site, as there are numerous pages, files, etc. missing when I run a query in the Solr dashboard.
The warning below is the last thing I see in the Solr log before the script stops. I am very new to Solr so I'm not sure what it indicates. When I run the same script on our documentation site, it completes without error.
2017-04-14 18:05:37.259 WARN (qtp1989972246-970) [ ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_284]
java.nio.file.NoSuchFileException: /var/solr/data/uvahealthPlone/data/index/segments_284
I'm hoping someone out there might have more experience with Collective Solr for Plone and could recommend some good resources for debugging this issue. I've done a lot of searching lately but haven't found much useful info.
This was a bug that was fixed some time ago with https://github.com/collective/collective.solr/pull/122
This is similar to solr5.3.15-nutch here, but with a few extra wrinkles. First, as background, I tried Solr 4.9.1 and Nutch with no problems, then moved up to Solr 6.0.1. Integration worked great standalone, and I got backend code working to parse the JSON, etc. However, we ultimately need security, and we don't want to use Kerberos. According to the Solr security documentation, basic auth and rule-based authorization (which is what we want) work only in cloud mode (as an aside, if anyone has suggestions for getting non-Kerberos security working in standalone mode, that would work as well). So I went through the doc at Solr-Cloud-Ref, using the interactive startup and taking all the defaults, except for the name of the collection, which I made "nndcweb" instead of "gettingstarted". The configuration I took was data_driven_schema_configs. To integrate Nutch, I tried many permutations; I'll only give the last two, which seemed to come closest based on what I've been able to find so far. From the earlier Stack Overflow reference, the last one I tried was (note all URLs have http://, but the posting system for Stack Overflow was complaining, so I took them out for the sake of this post):
bin/nutch index crawl/crawldb -linkdb crawl/linkdb -D solr.server.url=localhost:8939/solr/nndcweb/ -Dsolr.server.type=cloud -D solr.zookeeper.url=localhost:9983/ -dir crawl/segments/* -normalize
I ended up with the same problem noted in the previously mentioned thread, namely:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=localhost:8939/solr/nndcweb
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.&lt;init&gt;(Path.java:172)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:217)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at index 15: solr.server.url=localhost:8939/solr/nndcweb
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parse(URI.java:3048)
at java.net.URI.&lt;init&gt;(URI.java:746)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
I also tried:
bin/nutch solrindex localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -Dsolr.server.type=cloud -D solr.zookeeper.url=localhost:9983/ -dir crawl/segments/* -normalize
and I get the same thing. Doing a help on solrindex indicates using -params with an "&" separating the options (in contrast to using -D). However, left unquoted this only tells my Linux shell to run some strange things in the background, of course, since & is the shell's background operator.
Does anybody have any suggestions on what to try next? Thanks!
Update
I updated the commands above to correct a silly mistake I made. Note that all URL references do, in practice, have the http:// prefix, but I had to take them out to be able to post. In spite of the fix, I'm still getting the same exception (a sample of which I used to replace the original above, again with the http:// cut out, which does make things confusing; sorry about that).
Yet Another Update
So, this is interesting. Using the solrindex option, I just took the port out of the ZooKeeper URL, leaving just localhost (with the http:// prefix): 15 characters. The URISyntaxException now says the problem is at index 18 (from org.apache.hadoop.fs.Path.initialize(Path.java:206)), which happens to match the "=" in "solr.zookeeper.url=". So it seems like hadoop.fs.Path.initialize() is taking the whole "property=value" string as the URL. So perhaps I am not setting that up correctly? Or is this a bug in Hadoop? That would be hard to believe.
An Almost There Update
Alright, given the results of the last attempt, I decided to put the Solr server type of cloud and the ZooKeeper URL in the nutch-site.xml config file (roughly as in the sketch at the end of this update). Then did:
bin/nutch solrindex http://localhost:8983/solr/nndcweb crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -normalize
(Great, no complaints about the URL now from Stack Overflow.) No URI exception anymore. Now the error I get is:
(cutting verbiage at the top)
Indexing 250 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:222)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:231)
Digging deeper into the nutch logs, I see the following:
No Collection Param specified on request and no default collection has been set.
Apparently, this has been mentioned on the Nutch mailing list in connection with Nutch 1.11 and Solr 5 (cloud mode). There it was said that it was not going to work, but that a patch would be uploaded (this was back in January 2016). Digging around on the Nutch development site, I haven't come across anything on this issue, only something a little bit similar for Nutch 1.13, which is apparently not officially released. Still digging around, but if anybody actually has this working somehow, I'd love to hear how you did it.
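For reference, this is roughly what I put into nutch-site.xml for the step above; the property names simply mirror the ones I had been passing with -D on the command line, so they may differ between Nutch releases and should be treated as a sketch rather than a verified configuration:
<configuration>
  <property>
    <name>solr.server.url</name>
    <value>http://localhost:8983/solr/nndcweb</value>
  </property>
  <property>
    <name>solr.server.type</name>
    <value>cloud</value>
  </property>
  <property>
    <!-- ZooKeeper connect strings are normally plain host:port, without an http:// scheme -->
    <name>solr.zookeeper.url</name>
    <value>localhost:9983</value>
  </property>
</configuration>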
Edit July 12-2016
So, after a few weeks' diversion on another unrelated project, I'm back to this. Before seeing S. Doe's response below, I decided to give Elasticsearch a try instead, as this is a completely new project and we're not tied to anything yet. So far so good. Nutch is working well with it, although to use the distributed binaries I had to back the Elasticsearch version down to 1.4.1. I haven't tried the security aspect yet. Out of curiosity, I will try S. Doe's suggestion with Solr eventually and will post how that goes later...
You're not specifying the protocol to connect to Solr: you need to include the http:// portion of solr.server.url, and you used the wrong syntax to specify the port to connect to. The right URL should be: http://localhost:8983/solr/nndcweb/.
About the problem with the URL when using solrindex: I had the same problem, and I know it sounds stupid, but for some reason I cannot work out, you can fix it by URL-encoding the value (replace ":" with "%3A", "/" with "%2F", and so on) instead. At least for me this fixed the problem.
In your case:
bin/nutch solrindex -D solr.server.url=http%3A%2F%2Flocalhost%3A8983%2Fsolr%2Fnndcweb crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -normalize
I hope it helps.
BTW, now I'm having the exact same problem as you do (Indexer: java.io.IOException: Job failed!)
I'm having issues with soft auto commit (near real time). I am using Solr 4.3 on Tomcat. The index size is 10.95 GB. With this configuration it takes more than 60 seconds before an indexed document is returned. When I add documents to Solr and search after the soft commit time, I get 0 hits. It takes a long time before the document actually starts showing up, even longer than the autoCommit interval.
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
The machine is Ubuntu 13 with 4 cores and 16 GB RAM; 6 GB are given to Solr running on Tomcat.
Can somebody help me with this?
If you are adding new documents using a Solr client, try using commitWithin; it is more flexible than autoSoftCommit. One more thing: make sure the update log is enabled in solrconfig.xml, and the real-time get handler as well. You can get more details here: http://wiki.apache.org/solr/NearRealtimeSearch
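For reference, the two solrconfig.xml pieces mentioned above typically look like this in Solr 4.x (a sketch based on the stock example config, not your exact setup):
<!-- inside the <updateHandler class="solr.DirectUpdateHandler2"> section -->
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>

<!-- real-time get handler -->
<requestHandler name="/get" class="solr.RealTimeGetHandler">
  <lst name="defaults">
    <str name="omitHeader">true</str>
  </lst>
</requestHandler>
commitWithin itself is passed per update request, for example as the commitWithin parameter on /update or as the second argument to SolrServer.add() in SolrJ, so you can tune visibility per batch instead of globally.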
I am new to Solr.
I have created two cores from the admin page, let's call them "books" and "libraries", and imported some data there. Everything works without a hitch until I restart the server. When I do so, one of these cores disappears, and the logging screen in the admin page contains:
SEVERE CoreContainer null:java.lang.NoClassDefFoundError: net/arnx/jsonic/JSONException
SEVERE SolrCore REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore@454055ac (papers) has a reference count of 1
I was testing my query in the admin interface; when I refreshed it, the "libraries" core was gone, even though I could normally query it just a minute earlier. The contents of solr.xml are intact. Even if I restart Tomcat, it remains gone.
Additionally, I was trying to build a query similar to this: "Find books matching 'war peace' in libraries in Atlanta or New York". So given cores "books" and "libraries", I would issue "books" the following query (which might be wrong, if it is please correct me):
(title:(war peace) blurb:(war peace))
AND _query_:"{!join
fromIndex=libraries from=libraryid to=libraryid
v='city:(new york) city:(atlanta)'}"
When I do so, the query fails and the "libraries" core disappears, with the above symptoms. If I re-add it, I can continue working (as long as I don't restart the server or issue another join query).
I am using Solr 4.0; if anyone has a clue what is happening, I would be very grateful. I could not find anything about the meaning of the error message, so if anyone could suggest where to look for that, or how to go about debugging this, it would be really great. I can't even find where the log file itself is located...
I would avoid the Debian package, which may be misconfigured and quirky. It also contains (a very early build of?) Solr 4.0, which itself may have lingering issues, being the first release in a new major version. The package maintainer may not have incorporated the latest and safest Solr release into his package.
A better way is to download Solr 4.1 yourself and set it up yourself with Tomcat or another servlet container.
In case you are looking to install and configure Solr 4.0, you can follow the installation procedure from here.
Update the Solr config so that the cores are persistent.
In your solr.xml, update <solr> or <solr persistent="false"> to <solr persistent="true">
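For illustration, a minimal legacy-format solr.xml along those lines (the core names are taken from the question; the instanceDir values are assumptions):
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="books" instanceDir="books" />
    <core name="libraries" instanceDir="libraries" />
  </cores>
</solr>
With persistent="true", changes made through the core admin (such as creating or loading a core) are written back to solr.xml, so the cores survive a restart.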
I'm indexing my content, and after upgrading my Solr instance to Solr 4 I'm facing some OutOfMemoryErrors. The exception thrown is:
INFO org.apache.solr.update.UpdateHandler - start commit{flags=0,_version_=0,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false}
ERROR o.a.solr.servlet.SolrDispatchFilter - null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:469)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:240)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:164)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:562)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:395)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:250)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:188)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:166)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.OutOfMemoryError: Java heap space
Is there some known bug or something I could test to get rid of it?
Within this upgrade two things changed:
solr version (from 3.4 to 4.0);
lucene match version (from LUCENE_34 to LUCENE_40).
At a glance, it seems to be running out of memory when accessing the logs. That may not be particularly meaningful with an 'Out of Memory' error, of course, but it's worth a shot, particularly after seeing this complaint regarding Solr 4.0 logging, and particularly if this is occurring during an index rebuild of some form or a heavy load of updates.
So try disabling the update log, which I believe can be done by commenting out:
<updateLog>
<str name="dir">${solr.data.dir:}</str>
</updateLog>
in solrconfig.xml.
EDIT:
Another (possibly better) approach to this, taking another glance at it, might be to commit more often. The growth of the update log seems to be directly related to having a lot of queued updates waiting for commit.
If you do not have autocommit enabled, you might want to try adding it in your config, something like:
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
There's a good bit of related discussion and recommendation to be found on this thread.
I ran into the same problem today, and after reading the thread femtoRgon suggested, I changed the following in solrconfig.xml
<autoCommit>
<maxTime>15000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
to
<autoCommit>
<maxDocs>15000</maxDocs>
<openSearcher>false</openSearcher>
</autoCommit>
It no longer gives me that error. So it commits every 15,000 docs, which in my case is frequent enough not to run into memory issues. On my MacBook Pro it took a few minutes to index ~4M documents containing product information (so, short documents).
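If you also want an upper bound on how long uncommitted updates can sit in the log, both triggers can be combined; whichever limit is reached first fires the commit (the values below simply reuse the ones from above):
<autoCommit>
  <maxDocs>15000</maxDocs>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>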