SOLR results are normally ordered by "best match" of your search criteria. Is it possible to order the results alphabetically by a given SOLR field?
I realize that this is not a typical use case, but here's my motivation. We have quite a lot of code written around SOLR that performs queries based on user searches against the various fields of our data. Most of the time, we want a relevancy ordering (i.e. best matches first).
But one anomalous use-case requires that we return data ordered alphabetically by field. I could perform this query using our SQL database (avoiding SOLR altogether), but I'd have to replicate an awful lot of code that's tailored around consuming SOLR results (facets in particular). I'm hoping to use the same code path, if it's possible to get such an ordering from SOLR.
Yes, you just have to set the sort parameter to the field name followed by a direction, for example sort=title asc.
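A minimal sketch of what that looks like when building the request, assuming a hypothetical field called title_sort (sorting generally wants a single-valued, untokenized field, so a dedicated string copy of the field is a common choice). Only the sort parameter itself is Solr's; the function name is made up for illustration.

```python
from urllib.parse import urlencode


def build_sorted_query(user_query, sort_field, descending=False):
    """Return a Solr query string that orders by one field instead of by score."""
    direction = "desc" if descending else "asc"
    params = {
        "q": user_query,
        # Solr expects "<field> <direction>", e.g. "title_sort asc".
        "sort": f"{sort_field} {direction}",
    }
    return urlencode(params)


# "title_sort" is a hypothetical field name used only for this example.
print(build_sorted_query("camera", "title_sort"))
```

Because only one parameter changes, the rest of the code path (facets included) can stay as it is.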
Related
I'm using Solr facets to get the most common values for specific fields. It has occurred to me that (for business logic purposes) it would be preferable to exclude certain values. I cannot seem to find a way to do this, however.
I'm not looking to exclude the filter query, as seems to be commonly discussed.
If I'm getting the top 3 facets for a field, and seeing that "ValueA", "ValueB", and "ValueC", I'd like to say, essentially, "Get facets that aren't ValueB". So my facet instead returns data for "ValueA", "ValueC", and "ValueD".
Use the facet.excludeTerms parameter. According to the source the format seems to be "term1,term2" to exclude those two terms.
The feature was introduced with Solr 6.5.
If you need the same behaviour before Solr 6.5, there are two options. If the terms to exclude vary per query, you're going to have to filter them out in your controller / Solr interfacing code. If the same term or terms should be excluded across the whole index for all queries, add a separate field and filter out those terms while indexing.
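Here is a rough sketch of both routes, assuming the default flat facet format in the JSON response (term, count, term, count, ...). The helper names are hypothetical. The one thing to watch on older versions is asking for a few extra facet values so that enough remain after the exclusion.

```python
def facet_params(field, limit, excluded, has_exclude_terms):
    """Build facet parameters; has_exclude_terms should be True on Solr 6.5+."""
    params = {"q": "*:*", "rows": 0, "facet": "true", "facet.field": field}
    if has_exclude_terms:
        params["facet.limit"] = limit
        params["facet.excludeTerms"] = ",".join(sorted(excluded))
    else:
        # Over-fetch so that `limit` values are still left after filtering.
        params["facet.limit"] = limit + len(excluded)
    return params


def top_facets_excluding(flat_counts, excluded, limit):
    """Client-side fallback: drop excluded terms from a flat [term, count, ...] list."""
    pairs = zip(flat_counts[0::2], flat_counts[1::2])
    kept = [(term, count) for term, count in pairs if term not in excluded]
    return kept[:limit]


# Made-up facet counts in the flat format described above.
sample = ["ValueA", 10, "ValueB", 8, "ValueC", 5, "ValueD", 2]
print(top_facets_excluding(sample, {"ValueB"}, 3))
```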
I'm working with Apache Solr and would like to get more detailed information about some query options. I discovered facet queries and was wondering, when exactly do they bring essential advantages; especially in case of the following example:
There is a stock of books stored on a Solr server. Besides the common attributes a book ought to have, each has an ISBN. Data about books is provided by third parties, so it's important to check that there are no duplicate ISBNs within the system. In order to check if a book's ISBN is a duplicate, it has to go through a routed path, where, unfortunately, every book is processed individually without any information about preceding or following processes.
The question is:
a) Should you simply query Solr with the current book ISBN and check the total results, or
b) should you send a facet query with a f.isbn.facet.mincount=2 and check if the result contains the current book ISBN?
In both cases, caching results is not possible, so the number of queries would always equal the number of books processed. I simply don't know how Solr works internally and therefore can't make this decision without further information, especially because the number of queries won't be reduced by either of the above possibilities.
If you're going to do a query - do a query. Lucene is highly optimized for doing queries, so that's what you should do. A facet query is for creating facets (counts) from arbitrary queries - so internally it does the same thing. If you generate a facet and then iterate through that one, Lucene has to look at far more documents than if you're just querying for one single value.
The best strategy to get a performance boost would be to perform these operations in batch - check 500 books in the same batch (i.e. isbn:(123 OR 321 OR 567 OR 765)), and then handle that in your code. If these updates can arrive from many systems in parallel without going through one single source, you'll have to decide how much time you can spend before any duplicates might appear in the streams (this race condition can happen with just one book as well, as two streams can query for a single isbn and get a negative result before adding it separately from both streams).
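A sketch of the batch idea, assuming the field is called isbn, is an untokenized string field, and comes back in each document. The function names are hypothetical, and the values are quoted so that hyphenated ISBNs are treated as single terms.

```python
def isbn_batch_params(isbns):
    """Build one query that looks up a whole batch of ISBNs at once."""
    clauses = " OR ".join(f'"{isbn}"' for isbn in isbns)
    return {
        "q": f"isbn:({clauses})",
        "fl": "isbn",          # only the ISBN is needed back
        "rows": len(isbns),    # enough rows if each ISBN is indexed at most once
    }


def already_indexed(isbns, docs):
    """Return the ISBNs from the batch that Solr already knows about."""
    indexed = {doc["isbn"] for doc in docs}
    return [isbn for isbn in isbns if isbn in indexed]


batch = ["123", "321", "567", "765"]
print(isbn_batch_params(batch)["q"])
# `docs` would be response["response"]["docs"] from Solr; made up here.
print(already_indexed(batch, [{"isbn": "321"}]))
```

This does nothing about the race condition described above; two streams can still both see a negative result for the same ISBN.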
Sometimes I don't need just the top X results from a SOLR query, but all results (running into millions). This is easily achievable by searching once with 0 rows as a request parameter, and then re-executing the search with the numFound from the result as the number of rows(*)
Of course we can sort the results by e.g. "id asc" to remove relevancy ranking, however, I would like to be able to disable the entire scoring calculation for these queries, as they are probably quite computationally intensive and we just don't need them in these cases.
My question:
Is there a way to make SOLR work in boolean mode and effectively run faster on these often slow queries, when all we need is just all results?
(*) I actually usually simply do a paged query where a script walks through the pages (multi threaded), to prevent timeouts on large result sets, yet keep it fast as possible, but this is not important for the question.
This looks like a related question, but apparently the user asked the wrong question and was only after retrieving all results: Solr remove ranking or modify ranking feature; This question is not answered there.
Use filters instead of queries; there is no score calculation for filters.
There are a couple of things to be aware of:
Solr deep paging allows you to export a large number of results much more quickly
Using an export format such as CSV could be faster than using an XML format, just due to the formatting and it being more compact
And, as already mentioned, if you are exporting everything, put your queries into filter queries (fq) with caching off
For very complex queries, if you can split them into several steps, you can assign different costs to the filters (the cost local parameter) and have them execute in sequence. This lets you use a cheap first filter that gets rid of most of the results and only then apply the more expensive, more precise filters
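To make the deep paging and uncached filter points concrete, here is a sketch of a cursor-based export loop. It assumes a Solr version with cursorMark support (4.7 or later, as far as I know), a unique key field named id, and a hypothetical fetch callable that sends the parameters to /select and returns the parsed JSON response.

```python
def export_all(fetch, filter_query, page_size=1000, unique_key="id"):
    """Yield every document matching filter_query, one cursor page at a time."""
    params = {
        "q": "*:*",                              # match everything
        "fq": "{!cache=false}" + filter_query,   # restrict via an uncached filter
        "sort": f"{unique_key} asc",             # cursors need the unique key in the sort
        "rows": page_size,
        "cursorMark": "*",                       # "*" starts a new cursor
    }
    while True:
        response = fetch(dict(params))
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == params["cursorMark"]:
            break  # the cursor did not advance, so there is nothing left
        params["cursorMark"] = next_cursor
```

Usage would look something like `for doc in export_all(my_fetch, "type:digital"): ...`, where my_fetch wraps whatever HTTP client you already use.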
I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a query filter of say "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, then generate a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data by marking a specific field during indexing, into two different groups and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating a kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternate approaches :-
Instead of filtering the results, use a variable, higher boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries. The boost can be changed according to the type value.
The other approach is not to display the results for types other than digital. However, you can display the facets for the other types, with their counts, so users know whether the other types exist for the search term. You can read up on tagging and excluding filters.
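A sketch of the second approach using Solr's tag and ex local parameters: the filter is tagged, and the facet on the same field excludes that tag, so the counts cover all types while the results stay digital-only. The tag name typeFilter and the function name are arbitrary choices for this example.

```python
from urllib.parse import urlencode


def digital_only_with_type_counts(user_query):
    """Results restricted to type:digital, with facet counts for every type."""
    return [
        ("q", user_query),
        # Tag the filter so the facet below can ignore it.
        ("fq", "{!tag=typeFilter}type:digital"),
        ("facet", "true"),
        # Count all types as if the tagged filter had not been applied.
        ("facet.field", "{!ex=typeFilter}type"),
    ]


print(urlencode(digital_only_with_type_counts("camera")))
```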
Result grouping might give you what you want. Just group by that parameter and specify sufficient top number of documents in each group.
But I would test whether its performance is any better than two queries, since the documentation mentions performance in its limitations section.
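For comparison, a grouped request might be put together roughly like this, assuming the grouping field is type and that a fixed number of documents per group is enough. The function name and the default of 10 are made up; group, group.field and group.limit are the standard grouping parameters.

```python
def grouped_by_type_params(user_query, docs_per_group=10):
    """One request that returns the top documents for each value of `type`."""
    return {
        "q": user_query,
        "group": "true",
        "group.field": "type",          # one group per distinct type value
        "group.limit": docs_per_group,  # top N documents inside each group
    }


print(grouped_by_type_params("camera"))
```

Timing this against the two plain filtered queries on your own index is the only way to tell which is cheaper.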
I'm using Apache Solr and querying an index with a schema that has a text field PostBody, an integer Userid field, and a trie-based datetime field MostRecentActivityDate.
I'm attempting to apply query-time boosting to my select query such that more recent posts are boosted by some factor to assist in scoring. My values for this are in attempts to have a timescale of days rather than years as in many online date boosting examples.
The following two queries produce different results, the only difference between them being where the "code" for the boosting is actually placed (i.e. before or after the field conditionals themselves). In my testing I've also noticed that they both produce different results from when there is no {} boosting code, so it's not as if in one case it's being ignored.
Is anyone able to explain why they would produce different results? Thanks!
{!boost%20b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)} (PostBody:"timmy is great and that is a fact") AND !Userid=2
Vs.
(PostBody:"timmy is great and that is a fact") AND !Userid=2 {!boost%20b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)}
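For reference, Solr's recip(x,m,a,b) function evaluates a/(m*x+b), so the boost above can be reproduced outside Solr to see how it behaves on a timescale of days. This is only a sketch of the arithmetic; the helper names are made up, and the age in days stands in for ms(NOW,MostRecentActivityDate).

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m*x + b)."""
    return a / (m * x + b)


def recency_boost(age_days, m=1.16e-7):
    """Boost for a post whose last activity was age_days ago."""
    return recip(age_days * MS_PER_DAY, m, 1, 1)


for days in (0, 1, 7, 30):
    print(days, round(recency_boost(days), 4))
```

With m=1.16e-7 the boost is 1.0 for a post active right now and falls to roughly 0.09 after one day.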
Since this will be very specific to your data, the best way to figure out what is happening is to turn on query debugging via the debugQuery=on parameter of your search. Here are two links that help explain the debug output.
Debugging Search Applications Relevance - Explanations
Why does id:archangel come before id:hawkgirl when querying for "wings"
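A small sketch of how the two variants could be compared once debugging is on. It assumes JSON output with the parsed query under debug.parsedquery, which is where debugQuery normally puts it; the function names are hypothetical.

```python
def debug_params(query):
    """Request parameters that ask Solr to explain how it parsed and scored the query."""
    return {"q": query, "debugQuery": "on", "wt": "json", "fl": "id,score"}


def parsed_queries_differ(response_a, response_b):
    """True if Solr parsed the two query strings into different queries."""
    return response_a["debug"]["parsedquery"] != response_b["debug"]["parsedquery"]
```

If the parsed queries differ, the placement of the {!boost ...} prefix is changing how the string is parsed rather than just how it is scored, and the per-document explain section of the same debug output shows where the scores diverge.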