Getting stable Solr scores

I run a query against a SOLR core and restrict the result using a filter
like fq: {!frange l=0.7 }query($q). I'm aware that SOLR scores do not
have an absolute meaning, but the 0.7 (just an example) is calculated
based on user input and some heuristics, which works quite well.
The problem is the following: I update quite a few documents in my core.
The updated fields are only meta data fields, which are unrelated to the
above search. But because an update is internally a delete + insert, IDF
and doc counts change. And so do the calculated scores. Suddenly my
query returns different results.
As Yonik explained to me here, this behaviour is by design. So my question is: what is the simplest
and most minimal way to keep the scores and the output of my query stable?
Running optimize after each commit should solve the problem, but I
wonder if there is something simpler and less expensive.

You really need to run optimize. When you optimize the index, Solr purges all documents that are no longer referenced, and the query becomes stable again. Recomputing this metadata every time a document is updated would be expensive, which is why Solr only does it on optimize. There is a good way to see whether your index is more or less stable: the Solr admin API reports Num Docs and Max Doc for the core. If Max Doc is greater than Num Docs, old (deleted) documents are still affecting your relevancy calculation. Optimizing the index makes these two numbers equal again, and once they are equal you can trust that IDF is being calculated correctly.
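For illustration, a minimal sketch (the Solr URL and the core name mycore are placeholders) that reads Num Docs / Max Doc from the CoreAdmin STATUS API and triggers an optimize when they differ:

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
CORE = "mycore"                        # assumed core name

def index_stats():
    # CoreAdmin STATUS reports per-core index stats, including numDocs and maxDoc
    url = f"{SOLR}/admin/cores?action=STATUS&core={CORE}&wt=json"
    with urllib.request.urlopen(url) as resp:
        status = json.load(resp)
    index = status["status"][CORE]["index"]
    return index["numDocs"], index["maxDoc"]

num_docs, max_doc = index_stats()
if max_doc > num_docs:
    # deleted (old) documents are still in the index and skew IDF; merge them away
    urllib.request.urlopen(f"{SOLR}/{CORE}/update?optimize=true&wt=json").read()
```

Whether triggering the optimize automatically like this is acceptable depends on how expensive the merge is for your index size, which is exactly the trade-off the question raises.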

Related

Solr near real time search: impact of reindexing frequently the same documents

We want to use Solr in a Near Real Time scenario. Say, for example, we want to filter / rank our results by number of views.
Solr's soft commit was made for this use case, but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented as a full delete and full re-add of the document in Lucene.
It seems to me that having the same docs in the tlog many times is inefficient and might also be problematic during the merge process (is the doc marked as deleted and re-added n times?)
Any advice / good practice?
Two things you could use for supporting this scenario:
In-place updates: only that field is updated, not the whole doc. Check out the conditions you need to meet to be able to use them (see the sketch after this list).
ExternalFileField: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
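As a rough sketch of the first option (the core name products and the field nb_views are assumptions, and the field must satisfy the in-place-update conditions mentioned above: single-valued, non-indexed, non-stored, numeric docValues):

```python
import json
import urllib.request

SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
CORE = "products"                      # assumed core name

def bump_views(doc_id, amount=1):
    # Atomic/partial update: send only the unique key plus the changed field.
    # If nb_views is a single-valued, non-indexed, non-stored numeric docValues
    # field, Solr can apply this as an in-place update rather than a full
    # delete + re-add of the document.
    body = json.dumps([{"id": doc_id, "nb_views": {"inc": amount}}]).encode("utf-8")
    req = urllib.request.Request(
        f"{SOLR}/{CORE}/update",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

bump_views("doc-42")   # hypothetical document id; visibility depends on your (soft) commit settings
```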

Solr 4.5: When is Solr facet query better than simple query?

I'm working with Apache Solr and would like to get more detailed information about some query options. I discovered facet queries and was wondering when exactly they bring essential advantages, especially in the case of the following example:
There is a stock of books saved on a Solr server. Besides the common attributes a book ought to have, each book has an ISBN. Data about books is provided by third parties, so it's important to check that there are no duplicate ISBNs within the system. In order to check whether a book's ISBN is a duplicate, it has to go through a routed path where, unfortunately, every book is processed individually without any information about preceding or following processes.
The question is:
a) Should you simply query Solr with the current book ISBN and check the total results, or
b) should you send a facet query with a f.isbn.facet.mincount=2 and check if the result contains the current book ISBN?
In both cases, caching results is not possible, so the number of queries will always equal the number of books processed. I simply don't know how Solr works internally and therefore can't make this decision without further information, especially because neither of the above options reduces the number of queries.
If you're going to do a query - do a query. Lucene is highly optimized for doing queries, so that's what you should do. A facet query is for creating facets (counts) from arbitrary queries - so internally it does the same thing. If you generate a facet and then iterate through that one, Lucene has to look at far more documents than if you're just querying for one single value.
The best strategy to get a performance boost would be to perform these operations in batches: check 500 books in the same batch (e.g. isbn:(123 OR 321 OR 567 OR 765)), and then handle the result in your code. If these updates can arrive from many systems in parallel without going through one single source, you'll have to decide how much time you can spend before any duplicates might appear in the streams (this race condition can happen with just one book as well, since two streams can query for a single ISBN and both get a negative result before adding it separately).
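A minimal sketch of such a batch check, assuming a core named books with a single-valued isbn field (both names are placeholders):

```python
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
CORE = "books"                         # assumed core name

def existing_isbns(isbns):
    # One query for the whole batch: isbn:(123 OR 321 OR ...)
    params = urllib.parse.urlencode({
        "q": "isbn:(" + " OR ".join(isbns) + ")",
        "fl": "isbn",          # only the ISBN is needed to spot duplicates
        "rows": len(isbns),
        "wt": "json",
    })
    with urllib.request.urlopen(f"{SOLR}/{CORE}/select?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return {doc["isbn"] for doc in docs}

batch = ["9780306406157", "9781861972712"]   # hypothetical ISBNs
duplicates = existing_isbns(batch)
new_books = [isbn for isbn in batch if isbn not in duplicates]
```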

Apply Solr filter query to only part of the search results

I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of say "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, then generate a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data by marking a specific field during indexing, into two different groups and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating a kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a higher, variable boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries, and the boost can be changed per type value (see the sketch after this list).
The other approach is not to display the results for types other than digital, but still display facets for the other types, with their counts, so users know that other types exist for the search term. You can look into tagging and excluding filters.
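A rough sketch of both ideas as request parameters (core name, field name, and boost value are placeholders):

```python
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
CORE = "products"                      # assumed core name

# Approach 1: a single query that boosts digital items to the top instead of
# filtering them out. The boost value is a placeholder to tune per type.
boost_params = urllib.parse.urlencode({
    "q": "camera",
    "defType": "edismax",
    "bq": "type:digital^100",
    "wt": "json",
})

# Approach 2: filter to digital, but tag the filter and exclude it from the
# type facet, so counts for the other types are still reported.
facet_params = urllib.parse.urlencode({
    "q": "camera",
    "fq": "{!tag=t}type:digital",
    "facet": "true",
    "facet.field": "{!ex=t}type",
    "wt": "json",
})

for params in (boost_params, facet_params):
    with urllib.request.urlopen(f"{SOLR}/{CORE}/select?{params}") as resp:
        print(resp.read()[:300])   # inspect the raw JSON response
```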
Result grouping might give you what you want: just group by that field and request a sufficiently large number of top documents in each group.
But I would test whether its performance is any better than the two-query approach, since the documentation mentions performance in its limitations section.
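For comparison, a minimal sketch of the grouping parameters (the field name and limits are placeholders):

```python
import urllib.parse

# Grouping sketch: one query whose results are split into one group per value
# of the "type" field.
params = urllib.parse.urlencode({
    "q": "camera",
    "group": "true",
    "group.field": "type",
    "group.limit": 20,    # top documents returned inside each group
    "wt": "json",
})
print(params)   # append to {solr-url}/{core}/select? and time it against two queries
```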

Best way to create a Lucene Index with fields that can be updated frequently, and filtering the results by this field

I use Lucene to index my documents and search them. At the moment I have 800k documents indexed in Lucene. Those documents have some fields:
Id: a numeric field used to index the documents
Name: a textual field, stored and analyzed
Description: like Name
Availability: a numeric field used to filter results. This field can be updated frequently, every day.
My question is: what's the best way to create a filter on availability?
1 - Add this information to the index and use a Lucene filter.
With this approach I have to update the document (remove and re-add it, because Lucene 3.0.2 has no in-place update support) every time the availability changes. What is the cost of reindexing?
2 - Don't add this information to the index, and filter the results with a DB select.
This approach will do a lot of selects, because I need to select every id from the database to check availability.
3 - Create a separate index with id and availability.
I don't know if it is a good solution, but I could create one index with the static information and another with the information that is updated frequently. I think that is better than updating the whole document just because some fields changed.
I would stay away from 2: if you can handle the search entirely in Lucene instead of in Lucene + DB, do it. I deal with this case (Lucene search + DB search) in my project, but only because there is no way around it.
The cost of an update is internally:
delete the doc
insert new doc (with new field).
I would just try approach 1 (as it is the simplest); if the performance is good enough, then stick with it. If not, you might look for ways to optimize it, or try 3.
Answer provided from lucene-groupmail:
How often is "frequently"? How many updates do you expect to do in
a day? And how quickly must those updates be reflected in the search
results?
800K documents isn't all that many. I'd go with the simple approach first
and monitor the results, #then# go to a more complex solution if you
see a problem arising. Just update (delete/add) the documents when
the value changes.
Well, the cost to reindex is just about the same as the cost to index it originally.
The old version of the document is marked deleted and the new one is added.
It's essentially the same cost as indexing a new document.
This leaves some gaps in your index, that is the deleted docs are still in
there, but the next optimize will compact them.
From which you may infer that optimizing is the expensive part. I'd do that, say, once daily (or even weekly).
HTH
Erick

Efficiently sorting and paging with Solr when index is changing

I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort_order=sort_path+desc&rows=N&start:12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows:N with start:12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that Solr range queries combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path desc and then reversing the returned chunk to get the previous N sections. I am going to test the performance of this solution, but it seems viable.
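A minimal sketch of this cursor-style paging, assuming a core named sections (the URL, core name, and field list are placeholders):

```python
import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"   # assumed Solr base URL
CORE = "sections"                      # assumed core name
N = 20                                 # page size

def _select(query, sort):
    params = urllib.parse.urlencode({
        "q": query,
        "sort": sort,
        "rows": N,
        "fl": "id,sort_path",
        "wt": "json",
    })
    with urllib.request.urlopen(f"{SOLR}/{CORE}/select?{params}") as resp:
        return json.load(resp)["response"]["docs"]

def page_after(last_sort_path):
    # "Next" page: everything from the last seen sort_path onward, ascending.
    # The anchor document comes back first and can be skipped client-side.
    return _select(f'sort_path:["{last_sort_path}" TO *]', "sort_path asc")

def page_before(first_sort_path):
    # "Previous" page: everything up to the first seen sort_path, descending,
    # then reversed client-side to restore display order.
    docs = _select(f'sort_path:[* TO "{first_sort_path}"]', "sort_path desc")
    return list(reversed(docs))

next_page = page_after("/A/B/C")   # hypothetical anchor path
```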
This is not really a Solr-specific problem, but a general problem with paginating any external data source, because the data source has state independent of the (web) application. For example, it also happens with relational databases. Here's a good overview of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution, "repeat the query for each new request", since the other solutions are much more complex and don't scale, but that approach suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get a lot less documents (therefore a slower rate of inserts/deletes) and you can use the highlighting parameters to get snippets of the sections that matched the user query.
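As a rough sketch of that suggestion, with placeholder field names, the highlighting parameters for the whole-document model might look like:

```python
import urllib.parse

# Highlighting sketch: index whole legal documents and let Solr return snippets
# of the sections that matched the user query. Field names are placeholders.
params = urllib.parse.urlencode({
    "q": "content:liability",
    "hl": "true",
    "hl.fl": "content",      # field(s) to build snippets from
    "hl.snippets": 3,        # up to three snippets per document
    "hl.fragsize": 200,      # snippet length in characters
    "wt": "json",
})
print(params)   # append to {solr-url}/{core}/select? to run the query
```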
Another option would be decreasing your commit rate, but this could end up in less-than-ideal document freshness.
