I have multiple collections with almost identical schemas. I'd like to apply certain conditions specific to each collection, while other conditions are the same across all collections, and return a combined result set. Is this possible in Solr? I'd appreciate it if you could share a sample query. I'm using Solr 5.3.0.
You're going to have issues with the "certain conditions specific to each collection" part, as there is no query support for anything like that. You're probably going to have to do the querying and merging yourself.
Otherwise, a possible solution would be the "shard unification" strategy mentioned in Query multiple collections with different fields in solr, but scoring between documents would be local to each shard.
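For the conditions that are shared across all collections, you can at least issue a single distributed request by listing the collections in the collection parameter. A sketch, assuming SolrCloud and hypothetical collection names with compatible schemas:

http://localhost:8983/solr/collection1/select?collection=collection1,collection2&q=status:active

The per-collection conditions would still have to be run as separate queries and merged on the client side.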
Related
I have roughly 50M documents and 90 fields (20 stored + 70 non-stored) in schema.xml, indexed in a single core. The queries are quite complex, with faceting and highlighting. Out of these 90 fields, there are 3-4 fields (all stored) which are updated very frequently. Updating these fields normally would require populating all the fields again, which is a heavy task. If I use atomic/partial updates, we have to supply the non-stored fields again.
Our Solution:
To overcome the above problems, we decided to use SolrCloud and join queries. We split the index into two separate collections: one for the stored fields and one for the non-stored fields, with the relation between the documents being the id of the doc. We kept the frequently updated fields in the stored collection, which let us leverage atomic updates. Also, to overcome the limitation of join queries in cloud mode, we sharded and replicated the stored collection across all nodes, while the non-stored collection was not sharded but was replicated across all nodes. We have a 5-node cluster with 3 additional ZooKeeper instances. Considering the number of docs, the only area of concern is: will join queries eventually degrade search performance? If so, what other options can I consider?
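For reference, a cross-collection join in such a setup would look roughly like this (collection and field names are hypothetical; in SolrCloud the fromIndex collection must be single-sharded and have a replica on every node, which matches the layout described above):

q={!join from=id to=id fromIndex=nonstored_fields}body_text:camera&fq=category:electronics

This returns documents from the stored collection whose ids join to documents in the non-stored collection matching body_text:camera.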
Thinking about joins makes Solr more like a relational database. I found an article on this from the Lucidworks team: Solr and Joins. Even they say that if your solution includes the use of joins, you need to rethink your design.
I think I have a solution for you. First of all, forget two collections. You create one collection, and you have two Solr documents for every single source document: one document holds the stored fields and the other holds the non-stored fields. At update time you update the document that has the stored fields, and you perform the search-related operations on the other document.

Now all you need to do at query time is merge both documents into a single document, which can be done by writing a service layer over Solr.
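As an illustration, the two documents for one logical record might look like this (ids and field names are hypothetical; the shared doc_id is what the service layer merges on):

{"id":"42-stored","doc_id":"42","price":499,"stock_count":12}
{"id":"42-search","doc_id":"42","body_text":"full text that only needs to be searched, not returned"}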
I have an issue with partial/atomic updates triggering index operations in the background on fields I did not modify. This is different from the question, but maybe the use of nested documents is worth thinking about.
I was checking the use of nested documents to separate document header data from the text content to be indexed, since processing the text content consumes a lot of resources. According to the docs, parents and children are indexed as blocks and always have to be indexed together.
This is stated in https://solr.apache.org/guide/8_0/indexing-nested-documents.html:
With the exception of in-place updates, the whole block must be updated or deleted together, not separately. For some applications this may result in tons of extra indexing and thus may be a deal-breaker.
So unless you are able to perform in-place updates (which have their own restrictions in terms of indexed, stored and <copyField.../> directives), the use of nested documents does not seem to be a valid approach.
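For completeness, an in-place update only works on a single-valued, non-indexed, non-stored docValues field that is not the target of a <copyField.../>. A minimal sketch with a hypothetical counter field defined as

<field name="view_count" type="pint" indexed="false" stored="false" docValues="true" />

which would be updated by posting

[{ "id": "doc1", "view_count": { "inc": 1 } }]

to the collection's /update handler, without reindexing the rest of the block.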
I'm currently trying to figure out if Solr is the right tool for me. I have the following setup:
There is the primary document type "blog". Then there are two additional document types "user" and "category". Both of these are parents of the "blog" document type.
Now when searching the "blog" documents, I not only want to search in those fields (e.g. title and content), but also in the parent fields (user>name and category>name).
Of course, I could just flatten that down to a single document for Solr, which would ease the search a lot. The downside, though, is that when e.g. a user updates their name, I have to run through all of their blog posts and update those documents in Solr, instead of just updating a single document.
This becomes even worse when the user has another parent, on which I need to search as well.
Do you have any recommendations about how to handle this use case? Maybe my Google-fu is just not good enough, but what I found (block joins, etc.) doesn't seem to do the trick.
The absolutely most performant and easiest solution would be to flatten everything into a single document. It turns out that these relations aren't updated as often as people think, and that searches are performed more often than the documents are updated. And even if one of the values that is identical across a large set of documents changes, reindexing from the most recent documents (for a blog) and then going backwards will appear rather performant to most users. This assumes that you actually have to search the values and don't just need them for display - in which case you could look them up from secondary storage when displaying an item (and just store the never-changing id in the document).
Another option is to divide this into a multi-search problem: one collection for blog posts, one collection for users, and one collection for categories. You then search through each of the collections for the relevant data and merge it in your search model. You can also use Streaming Expressions to hand off most of this processing to the Solr cluster.
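A rough sketch of such a merge as a streaming expression (collection and field names are hypothetical; both streams must be sorted on the join key, and the /export handler requires docValues on the requested fields):

innerJoin(
  search(blogs, q="title:solr", fl="id,title,user_id", sort="user_id asc", qt="/export"),
  search(users, q="*:*", fl="id,name", sort="id asc", qt="/export"),
  on="user_id=id"
)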
The reason why I always recommend flattening when possible is that most features in Solr (and Lucene) are written for a flat document structure, and a flat structure lets you fully leverage the features available. Since Lucene is by design a flat document store, most other features require special care to support block joins and parent/child relationships, and you end up experimenting a lot to get the correct queries and feature set you want (if that's possible at all). If the documents are flat, it just works.
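To make the flattening concrete, each blog post would simply carry denormalized copies of the parent fields (field names hypothetical):

{"id":"post-1","title":"My first post","content":"Hello world","user_name":"Alice","category_name":"Travel"}

When Alice renames herself, you reindex her posts, most recent first.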
I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of say "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there were a way to run a single query, applying the first filter to get the first set of topDocs and then generating a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data into two different groups by marking a specific field during indexing and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating some kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a variable boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries. The boost can be varied per type value.
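For example, with edismax something like the following would float the digital results to the top without excluding anything (a sketch; the boost value is arbitrary and would need tuning):

q=camera&defType=edismax&bq=type:digital^100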
The other approach is to not display results for types other than digital at all. However, you can still display facets with counts for the other types, so users know that other types exist for the search term. Have a look at tagging and excluding filters.
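A sketch of that: tag the filter and exclude it when computing the facet counts, so the counts still cover all types:

q=camera&fq={!tag=typefq}type:digital&facet=true&facet.field={!ex=typefq}type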
Result grouping might give you what you want. Just group by that parameter and request a sufficiently high number of top documents in each group.
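Something along these lines (raise group.limit to however many documents you need per group):

q=camera&group=true&group.field=type&group.limit=10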
But I would test whether its performance is any better than two queries, since the documentation specifically mentions performance in the limitations section.
Solr results are normally ordered by "best match" against your search criteria. Is it possible to order the results alphabetically by a given Solr field?

I realize that this is not a typical use case, but here's my motivation. We have quite a lot of code written around Solr that performs queries based on user searches against the various fields of our data. Most of the time, we want relevancy ordering (i.e. best matches first).

But one anomalous use case requires that we return data ordered alphabetically by field. I could perform this query using our SQL database (avoiding Solr altogether), but I'd have to replicate an awful lot of code that's tailored around consuming Solr results (facets in particular). I'm hoping to use the same code path, if it's possible to get such an ordering from Solr.
Yes, you just have to set the sort parameter to the field name.
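For example (assuming a single-valued, indexed or docValues field named last_name; free-text fields generally need a single-token string copy to be sortable):

q=*:*&sort=last_name asc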
I just started exploring SolrNet. Previously I had been using MSSQL full-text search.

In SQL Server, my queries perform full-text searches and also have multiple JOINs and WHERE clauses. I am also using custom paging to return only 10 rows out of millions.

I have read a few SolrNet docs and run the sample apps provided on the blogs. All worked well so far. I just need to get an idea: what do I do about JOINs and WHERE clauses?
E.g. if a user searches for Samsung, the db would return 100k records, but if the user searches for Samsung && City='New york' && Price > '500', he would only get a couple of thousand records.
Do I add all columns in Solr and write WHERE clauses in Solr?
What do I do about SQL JOINS?
Thanks in Advance!
There are no joins in Solr. From the Solr wiki:
Solr provides one table. Storing a set of database tables in an index generally requires denormalizing some of the tables. Attempts to avoid denormalizing usually fail.
About WHERE clauses (i.e. filtering), see Querying in SolrNet, Solr query syntax, and Common Solr query parameters.
The Solr equivalent of your where clauses is to map your columns to fields and run queries based on the query syntax. A query like your example:
Samsung && City='New york' && Price >'500'
could be translated to something like this in Solr:
q=Samsung AND city:"new york" AND price:[500 TO *]
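And since you mentioned custom paging, that part maps directly onto the start and rows parameters:

q=Samsung AND city:"new york" AND price:[500 TO *]&start=0&rows=10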
You need to take some care when you map your database to a Solr schema; specifically, you will probably have to denormalize your data. See this page on the Solr wiki for more information. Basically, you can't really do complex JOINs in Solr. It's a "flat" index.
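In practice that means that instead of joining, say, a products table to a stores table at query time, you index one flattened document per product with the joined columns copied in (field names hypothetical):

{"id":"p-1","name":"Samsung Galaxy","city":"new york","price":650,"store_name":"Midtown Electronics"}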