Solr near real time search: impact of reindexing frequently the same documents - solr

We want to use Solr in a Near Real Time scenario. Say for example we want to filter / rank our results by number of views.
Solr's soft commit was made for this use case, but:
In practice, the same few documents are updated very frequently (just for the nb_view field) while most of the documents are untouched.
As far as I know, each update, even a partial one, is implemented as a full delete and re-addition of the document in Lucene.
It seems to me that having the same docs many times in the tlog is inefficient and might also be problematic during the merge process (is the doc marked as deleted and added n times?)
Any advice / good practice?

Two things you could use for supporting this scenario:
In-place updates: only that field is updated, not the whole doc. Check out the conditions you need to meet to be able to use them (see the sketch below).
ExternalFileField: you keep the values in an external file.
If the scenario is critical, I would test both in real-world conditions if possible, and assess.
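For illustration, here is a minimal SolrJ sketch of such an in-place counter update. The core name "products" and the field name "nb_views" are assumptions, and the field is assumed to be declared single-valued, numeric, with docValues="true", indexed="false" and stored="false" (the conditions Solr requires to perform the update in place):

import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class InPlaceViewCounter {
    public static void main(String[] args) throws Exception {
        // Core name "products" and field name "nb_views" are placeholders.
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-42");                                  // unique key of the doc to touch
            doc.addField("nb_views", Collections.singletonMap("inc", 1));  // atomic "inc" on the counter only
            client.add(doc);
            // Visibility comes from autoSoftCommit; no hard commit per update.
        }
    }
}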

Related

Solr documents with multiple parents

I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside, though, is that when e.g. a user updates their name, I have to run through all of their blog posts and update the corresponding documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google foo is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
By far the most performant and easiest solution would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed more often than the documents are updated. And even if one of the values that are identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant for most users. This assumes that you have to actually search the values and don't just need the values - which you could look up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to divide this into a multi-search problem. One collection for blog posts, one collection for users and one collection for categories. You then search through each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to a Solr cluster for you.
The reason why I always recommend flattening if possible is that most features in Solr (and Lucene) are written for a flat document structure, and a flat structure lets you fully leverage the features available. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the correct queries and the feature set you want (if possible at all). If the documents are flat, it just works.
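As a rough sketch of the flattened approach, here is how indexing and querying such a denormalized blog document might look with SolrJ. The core name "blog" and the field names (user_name, category_name, etc.) are assumptions for illustration only:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class FlattenedBlogIndexing {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/blog").build()) {
            // Denormalize: copy the parent values onto the blog document itself.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "post-1");
            doc.addField("title", "Tuning Solr soft commits");
            doc.addField("content", "Some body text ...");
            doc.addField("user_id", "u-7");            // stable id, never changes
            doc.addField("user_name", "Jane Doe");     // searchable copy of the parent field
            doc.addField("category_name", "search");   // searchable copy of the parent field
            client.add(doc);
            client.commit();

            // One flat query now covers the post fields and the copied parent fields alike.
            SolrQuery q = new SolrQuery("solr");
            q.set("defType", "edismax");
            q.set("qf", "title content user_name category_name");
            System.out.println(client.query(q).getResults().getNumFound());
        }
    }
}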

Solr 4.5: When is Solr facet query better than simple query?

I'm working with Apache Solr and would like to get more detailed information about some query options. I discovered facet queries and was wondering, when exactly do they bring essential advantages; especially in case of the following example:
There is a stock of books that is saved on a Solr server. Besides the common attributes a book ought to have, each has an ISBN. Data about books is provided by third parties, so it's important to check that there are no duplicated ISBNs within the system. In order to check whether a book's ISBN is a duplicate, it has to go through a routed path where - unfortunately - every book is processed individually, without any information about preceding or following processes.
The question is:
a) Should you simply query Solr with the current book ISBN and check the total results, or
b) should you send a facet query with a f.isbn.facet.mincount=2 and check if the result contains the current book ISBN?
In both cases, caching results is not possible. So the number of queries would always equal the number of books processed. I simply don't know how Solr works internally and therefore can't make this decision without further information, especially because the number of queries won't be reduced by either of the above possibilities.
If you're going to do a query - do a query. Lucene is highly optimized for doing queries, so that's what you should do. A facet query is for creating facets (counts) from arbitrary queries - so internally it does the same thing. If you generate a facet and then iterate through that one, Lucene has to look at far more documents than if you're just querying for one single value.
The best strategy to get a performance boost would be to perform these operations in batch - check 500 books in the same batch (i.e. isbn:(123 OR 321 OR 567 OR 765)), and then handle that in your code. If these updates can arrive from many systems in parallel without going through one single source, you'll have to decide how much time you can spend before any duplicates might appear in the streams (this race condition can happen with just one book as well, as two streams can query for a single isbn and get a negative result before adding it separately from both streams).
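A minimal SolrJ sketch of that batching idea, assuming a core named "books" and a string field named "isbn" (both placeholders):

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class IsbnBatchCheck {
    // Returns the subset of candidate ISBNs that already exist in the index.
    static Set<String> alreadyIndexed(SolrClient client, List<String> isbns) throws Exception {
        // One boolean query for the whole batch: isbn:("123" OR "321" OR ...)
        String q = "isbn:(" + isbns.stream()
                .map(i -> "\"" + i + "\"")
                .collect(Collectors.joining(" OR ")) + ")";
        SolrQuery query = new SolrQuery(q);
        query.setFields("isbn");
        query.setRows(isbns.size());   // at most one hit per candidate if isbn is already unique
        Set<String> found = new HashSet<>();
        for (SolrDocument d : client.query(query).getResults()) {
            found.add((String) d.getFirstValue("isbn"));
        }
        return found;
    }

    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()) {
            System.out.println(alreadyIndexed(client, List.of("123", "321", "567", "765")));
        }
    }
}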

How to handle frequently changing multivalue string fields in SOLR?

I have a SOLR (or rather Heliosearch 0.07) core on a single EC2 instance. It contains about 20M documents and takes about 50GB on disk. The core is quite fixed/frozen and performs quite well, if everything is warmed up.
The problem is a multi-valued string field: that field contains assigned categories, which change quite frequently for large parts of the 20M documents. After a commit, the warm-up takes way too long to be usable in production.
The field is used only for faceting and filtering. My idea was to store the categories outside SOLR and to inject them somehow using custom code. I checked quite a few approaches in various JIRA issues and blogs, but I could not find a working solution. Item 2 of this issue suggests that there is a solution, but I don't get what he's talking about.
I would appreciate any solution which allows me to update my category field without having to re-warm my caches afterwards.
I'm not sure that JIRA issue will help you: it seems an advanced topic and, most importantly, it is still unresolved, so not yet available.
Partial document updates are not useful here because a) they require all fields in your schema to be stored, and b) behind the scenes they still reindex the whole document.
From what you say it seems you have one monolithic index: have you considered splitting it into shards, e.g. with SolrCloud? That way each "portion" would be smaller and the autowarm shouldn't be as big a problem.
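If you do go the SolrCloud route, creating a sharded collection might look roughly like this with a recent SolrJ (the collection name, configset name and ZooKeeper address are placeholders, and older 4.x-era clients use a different API):

import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class ShardedCollectionSetup {
    public static void main(String[] args) throws Exception {
        // ZooKeeper address, collection name and configset name are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
            // 4 shards, 1 replica each: every shard holds roughly a quarter of the 20M docs,
            // so its caches have a smaller portion of the data to autowarm after a commit.
            CollectionAdminRequest.createCollection("documents", "myconf", 4, 1)
                    .process(client);
        }
    }
}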

Long queries on very short documents

I'm fresh out of the nursery as far as Lucene/Solr are concerned, so I may be trying to utilize it completely wrong, but I hope someone can point me in the right direction.
My documents (fewer than 3,000) are short statements from a taxonomy. All are single sentences, some no more than 4-6 words long. There is only one field per document, so searching across multiple fields is not a route I would be looking into. What I would like to do is query with the contents of a work-related document and have the relevant taxonomy statements returned.
Currently I am using the default example setup that came with Solr, with added verb synonyms from WordNet, since performed actions are what I am trying to identify (e.g. the taxonomy statement 'Alter garments to specifications').
Basic word matching works as expected, but I would like to make things somewhat more sophisticated. Since the queries are so long I never end up with high relevancy scores when searching against the tiny documents. I'm sure this can be resolved by normalizing scores in some fashion, so I am not really concerned about the scores themselves, but about which statements (documents) actually get identified.
Would I be better off indexing the documents (currently the long queries) on the fly and querying each taxonomy statement, then compiling/sorting the results, or can I perform these long queries on the tiny documents effectively in some other fashion? I presume this may present its own difficulties.
I see no easy way to do what you are trying to do here. Your index of short documents will definitely suffer from a lack of information, and a long query will make every result come out almost equally flat against it. Even expanding the documents by adding WordNet synonyms for every term will, I think, be confusing and misleading. My advice is to check other possible forms of the query.

Efficiently sorting and paging with Solr when index is changing

I'm working on a structured document viewer, where each Solr document is a "section" or "paragraph" in a large set of legal documents, along with assorted metadata. I have a corpus which will probably represent 10^12 or more of these sections. I want to provide paging for the user so that they can view N of these sections at a time in sort_path order.
Now the problem: Even if sort_path is indexed, there are docs being added and removed all the time. A simple sort and paging solution will end up with users possibly skipping sections or jumping around in the ordering unexpectedly, even when they are nowhere near the documents being added/removed in the ordering; this behavior would be unacceptable.
Example: I make the "next" page link point at something like ...sort_order=sort_path+desc&rows=N&start=12345. Then, while the user is viewing the page, a document early in the sort_path order is deleted. Now when they fetch the next N rows, they will have skipped 1 document without knowing.
So, given I have a sort_path field which orders the sections, the front end needs to be able to ask for N sections "before" or "after" sort_path:/X/Y/Z, instead of asking for rows=N with start=12345. I have no idea how to represent this in a Solr query.
I may be pushing the edges of Solr a little far, and it may end up making more sense to store representations of these "section" documents both in Solr (for content searches, which Solr is awesome at) and an RDBMS (for ordering and indexing). I was hoping to avoid that, and this sort of query is still going to be ugly in a database, so maybe you've got some ideas. (Thanks!)
Update:
It turns out that Solr range queries combined with sorting may give me exactly what I need. On the indexed field, I can do something like
sort_path:["/A/B/C" TO *]
to get the "next" N sections, and do
sort_path:[* TO "/A/B/C"]
ordering by sort_path desc and then reversing the returned chunk to get the previous N sections. I am going to test the performance of this solution, but it seems viable.
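A minimal SolrJ sketch of this range-based paging, assuming a core named "sections" (a placeholder); it uses an exclusive lower bound ({ instead of [) so the anchor section itself is not returned again:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocumentList;

public class SortPathPaging {
    // Fetch the N sections that come after a given sort_path value.
    static SolrDocumentList nextPage(SolrClient client, String lastSortPath, int n) throws Exception {
        // Exclusive lower bound so the anchor section is not repeated on the next page.
        SolrQuery q = new SolrQuery("sort_path:{\"" + lastSortPath + "\" TO *]");
        q.setSort("sort_path", SolrQuery.ORDER.asc);
        q.setRows(n);
        return client.query(q).getResults();
    }

    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/sections").build()) {
            System.out.println(nextPage(client, "/A/B/C", 20).size());
        }
    }
}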
This is not really a Solr-specific problem, but a general problem with pagination of any external data source, because the data source has an independent state from the (web) application. For example, it also happens on relational databases. Here's a good coverage of pagination in relational databases, along with the possible solutions. Most web applications / websites take the first solution: "Repeat the query for each new request" since the other solutions are much more complex and not scalable, but this suffers from the problem you describe. Browse the questions on stackoverflow.com for a while and you'll notice it, since questions are being created constantly.
In your case I'd consider modeling the Solr documents as your whole legal documents instead of their individual sections. You'll get a lot less documents (therefore a slower rate of inserts/deletes) and you can use the highlighting parameters to get snippets of the sections that matched the user query.
Another option would be decreasing your commit rate, but this could result in less-than-ideal document freshness.
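If you do model whole legal documents, a highlighting query along these lines could return the matching section snippets. This is only a sketch; the core name "legal", the field name "content" and the query term are assumptions:

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SectionHighlighting {
    public static void main(String[] args) throws Exception {
        try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/legal").build()) {
            SolrQuery q = new SolrQuery("content:negligence");
            q.setHighlight(true);
            q.addHighlightField("content");   // whole-document text field
            q.setHighlightSnippets(3);        // up to three matching fragments per document
            QueryResponse rsp = client.query(q);
            // docId -> (field -> snippets): the snippets stand in for the matching sections
            Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
            hl.forEach((id, fields) -> System.out.println(id + " -> " + fields.get("content")));
        }
    }
}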
