Solr reindex is stopping prematurely when running Collective Solr for Plone

My team is working on a search application for our websites. We are using Collective Solr in Plone to index our intranet and documentation sites. We recently set up shared blob storage on our test instance of the intranet site because Solr was not indexing our PDF files. This appears to be working; however, each time I run the reindexing script (@@solr-maintenance/reindex) it stops after about an hour and a half. I know that it is not indexing our entire site, as numerous pages, files, etc. are missing when I run a query in the Solr dashboard.
The warning below is the last thing I see in the Solr log before the script stops. I am very new to Solr so I'm not sure what it indicates. When I run the same script on our documentation site, it completes without error.
2017-04-14 18:05:37.259 WARN (qtp1989972246-970) [ ] o.a.s.h.a.LukeRequestHandler Error getting file length for [segments_284]
java.nio.file.NoSuchFileException: /var/solr/data/uvahealthPlone/data/index/segments_284
I'm hoping someone out there might have more experience with Collective Solr for Plone and could recommend some good resources for debugging this issue. I've done a lot of searching lately but haven't found much useful info.

This was a bug that was fixed some time ago by https://github.com/collective/collective.solr/pull/122
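To sanity-check how much of the site actually made it into Solr after upgrading to a collective.solr release that includes that fix, you can compare the document count Solr reports against what you expect from the Plone catalog. This is just a quick check, not part of the fix itself; the host, port and core name (taken from the /var/solr/data/uvahealthPlone path above) may differ in your setup:

curl 'http://localhost:8983/solr/uvahealthPlone/select?q=*:*&rows=0&wt=json'

The numFound value in the response is the total number of indexed documents.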

Related

Sitecore SOLR Errors

I am using Solr with Sitecore in a production environment and I am getting a lot of errors in the Solr log, although the sites are working fine. I have 32 Solr cores and am using Solr version 4.10.3.0 with Sitecore 8.1 update 2. Below is a sample of these errors; can anyone explain them to me?
Most of the errors are self-descriptive, like this one:
undefined field: "Reckless"
indicates that the field in question is not defined in the Solr schema. Try to analyze the queries your system is accepting and figure out which system is sending them.
The less obvious one:
Overlapping onDeckSearchers=2
is a warning about warming searchers, in this case 2 of them running concurrently. It means that there were commits to the Solr index in quick succession, each of which triggered a warming searcher. This is wasteful because even though the first searcher has warmed up and is ready to serve queries, it is thrown away as soon as the new searcher finishes warming and is ready to serve.
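The usual remedy is to stop issuing a commit from the indexing client after every batch and let Solr schedule commits itself. Below is a minimal solrconfig.xml sketch of that approach; the time values are illustrative assumptions, not recommendations for your workload:

<!-- Hard-commit on a timer without opening a new searcher, and use soft commits
     for visibility, so searcher warm-ups are not triggered by every client commit. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>5000</maxTime>
  </autoSoftCommit>
</updateHandler>

<query>
  <!-- Keep this cap low; raising it only hides the underlying commit storm. -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>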

Disappearing cores in Solr

I am new to Solr.
I have created two cores from the admin page, let's call them "books" and "libraries", and imported some data there. Everything works without a hitch until I restart the server. When I do so, one of these cores disappears, and the logging screen in the admin page contains:
SEVERE CoreContainer null:java.lang.NoClassDefFoundError: net/arnx/jsonic/JSONException
SEVERE SolrCore REFCOUNT ERROR: unreferenced org.apache.solr.core.SolrCore#454055ac (papers) has a reference count of 1
I was testing my query in the admin interface; when I refreshed it, the "libraries" core was gone, even though I could normally query it just a minute earlier. The contents of solr.xml are intact. Even if I restart Tomcat, it remains gone.
Additionally, I was trying to build a query similar to this: "Find books matching 'war peace' in libraries in Atlanta or New York". So given cores "books" and "libraries", I would issue the following query against "books" (which might be wrong; if it is, please correct me):
(title:(war peace) blurb:(war peace))
AND _query_:"{!join
fromIndex=libraries from=libraryid to=libraryid
v='city:(new york) city:(atlanta)'}"
When I do so, the query fails and the "libraries" core disappears, with the above symptoms. If I re-add it, I can continue working (as long as I don't restart the server or issue another join query).
I am using Solr 4.0; if anyone has a clue what is happening, I would be very grateful. I could not find out anything about the meaning of the error message, so if anyone could suggest where to look for that, or how to go about debugging this, it would be really great. I can't even find where the log file itself is located...
I would avoid the Debian package, which may be misconfigured and quirky. It also contains (a very early build of?) Solr 4.0, which itself may have lingering issues, being the first release in a new major version. The package maintainer may not have incorporated the latest and safest Solr release into the package.
A better way is to download Solr 4.1 yourself and set it up yourself with Tomcat or another servlet container.
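For reference, here is a rough sketch of such a manual Solr 4.1 + Tomcat 7 setup on a Debian/Ubuntu-style layout; the version number, paths and Tomcat location are assumptions, so adjust them to your environment:

# Download and unpack the official release
wget http://archive.apache.org/dist/lucene/solr/4.1.0/solr-4.1.0.tgz
tar xzf solr-4.1.0.tgz
# Use the bundled example configuration as the Solr home directory
sudo cp -r solr-4.1.0/example/solr /opt/solr-home
# Deploy the web application and point Tomcat at the Solr home
sudo cp solr-4.1.0/dist/solr-4.1.0.war /var/lib/tomcat7/webapps/solr.war
echo 'JAVA_OPTS="$JAVA_OPTS -Dsolr.solr.home=/opt/solr-home"' | sudo tee -a /etc/default/tomcat7
sudo service tomcat7 restart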
In case you are looking to install and configure Solr 4.0, you can follow the installation procedure from here.
Update the Solr config so that the cores are persistent.
In your solr.xml, change <solr> or <solr persistent="false"> to <solr persistent="true">.
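For reference, a minimal legacy-style solr.xml (Solr 4.x) with persistence enabled; the instanceDir values are assumptions and should match your actual core directories:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- persistent="true" makes Solr write core changes back to this file,
     so cores created through the admin UI survive a restart -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="books" instanceDir="books" />
    <core name="libraries" instanceDir="libraries" />
  </cores>
</solr>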

GAE log search is not reliable

I'm running several Python (2.7) applications and constantly hit one problem: log search (from the dashboard in the admin console) is not reliable. It is fine when I search for recent log entries (they are normally found OK), but after some period (one day, for instance) it is not possible to find the same record with the same search query again; it just returns "no results". The admin console shows that I have 1 GB of logs spanning 10-12 days, so old records should be there to find; retention/log size limits are not the reason for this.
Specifically, I have a "cron" request that writes stats to the log every day (that frequency is enough for me), and searching for this request always gives me only the last entry, not one entry per day of the spanned period as expected.
Is this expected behaviour (I do not see clear statements about log storage behaviour in the docs, for example), or is there something to tune? For example, would it help to log less per request? Or maybe there is some advanced use of the query language.
Please advise.
This is a known issue that has already been reported on the googleappengine issue tracker.
As an alternative, you can consider reading your application logs programmatically using the Log Service API and ingesting them into BigQuery, or building your own search index.
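As a minimal sketch of the Log Service API approach on the Python 2.7 runtime (the 24-hour window and the /dump_logs route are assumptions; in practice you would stream these rows into BigQuery or your own index rather than write them to the response):

import time

import webapp2
from google.appengine.api.logservice import logservice

class DumpLogs(webapp2.RequestHandler):
    def get(self):
        end = time.time()
        start = end - 24 * 3600  # last 24 hours
        # include_app_logs=True attaches the application log lines to each request log
        for request_log in logservice.fetch(start_time=start, end_time=end,
                                            include_app_logs=True):
            self.response.out.write(request_log.combined + '\n')
            for app_log in request_log.app_logs:
                # each AppLog carries a timestamp, level and message
                self.response.out.write('  %s %s %s\n' % (app_log.time, app_log.level, app_log.message))

app = webapp2.WSGIApplication([('/dump_logs', DumpLogs)])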
Google App Engine Developer Relations delivered a codelab at Google I/O 2012 about ingesting App Engine logs into BigQuery.
And Streak released a tool called Mache and a Chrome extension to automate this use case.

How to configure Nutch and Solr in Ubuntu 10.10?

I'm trying to build a search engine for my final year project. I have done lots of research on this topic in the last 2 months.
And I found that I will need a crawler to crawl the Internet, a parser, and an indexer.
I am trying to use Nutch as the crawler and Solr to index the data crawled by Nutch. But I am stuck on the installation of both of them. I have been trying to install Nutch and Solr on my system with the help of tutorials from the internet, but nothing has worked for me.
I need some kind of installation guide or a link where I can learn how to install and integrate Nutch and Solr.
Next, I am stuck on the parser. I have no idea about this phase. I need help on how to parse the data before indexing.
I don't want to build Google or something. All I need is certain items from certain websites to be searched.
I have Java experience and can work with it comfortably, but I am not a professional like you guys. Please tell me whether I am going in the right direction or not, and what I should do next.
I am using Ubuntu 10.10, and I have Apache Tomcat 7.
This is for Nutch installation and this is for integration with Solr.
Regarding the parsers: Nutch has its own set of parsers, so you don't have to bother about parsing. Trigger the crawl command and it is done automatically. Unless you want to parse formats apart from those provided by Nutch, it won't be an issue for you. If you want Nutch to parse some .xyz files, then you need to write parser plugins for that and integrate them with Nutch.
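As a rough sketch of what a first crawl-and-index run looks like once both are installed (the seed URL, paths, Solr URL and -depth/-topN values are assumptions, and the solrindex argument order varies slightly between Nutch 1.x releases):

# Seed list: one URL per line
mkdir -p urls
echo 'http://wiki.example.com/' > urls/seed.txt
# Crawl the seeds; Nutch runs its parsers automatically during this cycle
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
# Push the parsed content into Solr
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*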

Nutch querying on the fly

I am a newbie to Nutch and Solr. Well, relatively much newer to Solr than Nutch :)
I have been using Nutch for the past two weeks, and I wanted to know if I can query or search my Nutch crawls on the fly (before a crawl completes). I am asking this because the websites I am crawling are really huge and it takes around 3-4 days for a crawl to complete. I want to analyze some quick results while the Nutch crawler is still crawling the URLs. Someone suggested that Solr would make this possible.
I followed the steps in http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ for this. I see that only the injected URLs are shown in the Solr search. I know I did something really foolish and the crawl never happened; I feel I am missing some information here, but I did all the steps mentioned in the link. I think somewhere in the process there should be crawling happening, and that step is being missed.
I just wanted to see if someone could help me point out where I went wrong in the process. Forgive my foolishness and thanks for your patience.
Cheers,
Abi
This is not possible.
What you could do, though, is chunk the crawl cycle into a smaller number of URLs so that it publishes results more often, with this command:
nutch generate crawl/crawldb crawl/segments -topN <the limit>
If you are using the one-stop crawl command, it should be the same.
I typically use a 24-hour chunking scheme.
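To make that concrete, here is a sketch of one "chunk" of the manual Nutch 1.x crawl cycle; the paths, Solr URL and -topN limit are assumptions, and the solrindex argument order differs slightly between Nutch versions. Indexing each segment into Solr at the end of the chunk is what makes partial results searchable while the overall crawl is still running:

# Generate a limited batch of URLs to fetch, then work on the newest segment
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
SEGMENT=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
# Fold the results back into the crawl and link databases
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# Index this segment into Solr so it is searchable before the next chunk starts
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb crawl/linkdb $SEGMENT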
