I'm using Apache Solr and querying an index with a schema that has a text field PostBody, an integer Userid field, and a trie-based datetime field MostRecentActivityDate.
I'm attempting to apply query-time boosting to my select query so that more recent posts are boosted by some factor in scoring. I chose my values to get a timescale of days rather than years, which is what many online date boosting examples use.
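As a rough check of the timescale, here is a minimal sketch of the arithmetic behind the boost. It relies on Solr's recip(x,m,a,b) function evaluating to a / (m*x + b); the helper names and the sample ages are only illustrative.

```python
# Solr's recip(x, m, a, b) function query evaluates to a / (m * x + b).
def recip(x, m, a, b):
    return a / (m * x + b)

MS_PER_HOUR = 3_600_000
MS_PER_DAY = 24 * MS_PER_HOUR

M = 1.16e-7  # multiplier used in the queries in this question


def boost_for_age(age_ms, m=M):
    """Boost for a document whose MostRecentActivityDate is age_ms in the past."""
    return recip(age_ms, m, 1, 1)


if __name__ == "__main__":
    # Illustrative ages only; ms(NOW, MostRecentActivityDate) supplies the real value.
    samples = [("now", 0), ("1 hour", MS_PER_HOUR),
               ("1 day", MS_PER_DAY), ("30 days", 30 * MS_PER_DAY)]
    for label, age in samples:
        print(f"{label:>8}: {boost_for_age(age):.4f}")
    # With a = b = 1 the boost drops to one half when m * x == 1, i.e. x = 1 / m.
    print("half-boost age in hours:", (1 / M) / MS_PER_HOUR)
```

With a = b = 1 the boost is 1 for a document touched right now and falls to one half at an age of 1/m milliseconds, which for 1.16e-7 is roughly 2.4 hours.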
The following two queries produce different results. The only difference between them is where the boosting "code" is placed (i.e. before or after the field conditionals themselves). In my testing I've also noticed that both produce different results from the query with no {} boosting code, so it's not as if it's being ignored in one case.
Is anyone able to explain why they would produce different results? Thanks!
{!boost%20b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)} (PostBody:"timmy is great and that is a fact") AND !Userid=2
Vs.
(PostBody:"timmy is great and that is a fact") AND !Userid=2 {!boost%20b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)}
Since this will be very specific to your data, the best way to figure out what is happening is to turn on query debugging via the debugQuery=on parameter of your search. Here are two links that help explain the debug output.
Debugging Search Applications Relevance - Explanations
Why does id:archangel come before id:hawkgirl when querying for "wings"
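To make the comparison concrete, here is a small sketch that builds both variants of the query with debugQuery=on so the debug sections can be compared side by side. The host and path in SOLR_SELECT are assumptions, as are the fl and wt values; only the q strings come from the question.

```python
from urllib.parse import urlencode

# Assumed endpoint; replace with your own Solr select handler.
SOLR_SELECT = "http://localhost:8983/solr/select"

BOOST = "{!boost b=recip(ms(NOW,MostRecentActivityDate),1.16e-7,1,1)}"
CONDITIONS = '(PostBody:"timmy is great and that is a fact") AND !Userid=2'


def debug_url(q, base=SOLR_SELECT):
    """Return a select URL for q with debugging enabled and scores returned."""
    params = {"q": q, "fl": "*,score", "debugQuery": "on", "wt": "json"}
    # urlencode escapes the braces, quotes and spaces in q.
    return base + "?" + urlencode(params)


boost_first = debug_url(BOOST + " " + CONDITIONS)
boost_last = debug_url(CONDITIONS + " " + BOOST)

if __name__ == "__main__":
    print(boost_first)
    print(boost_last)
```

In the two responses, compare the parsedquery entry of the debug section first: it shows how each q string was actually interpreted. The explain entries then show how each document's score was put together.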
Related
I am experimenting with boosting in Solr and have become confused about how my document scores are being affected.
I have a collection of technical documents that contain fields like Title, Symptoms, Resolution, Classification, Tags, etc. All the fields listed are required except Tags which is optional. All fields are copied to _text_ and that field is the default search field.
When I run a default query
http://search:8983/solr/articles-experimental/select?defType=edismax&fl=id,%20tags,%20score&q=virtualization&qf=_text_
The top article (Article 42014) comes back with a score of 4.182179. This document has 6 instances of the word virtualization in multiple fields -- Title, Symptoms, Resolution, and Classification. This particular article does not have any Tags value.
I now want to experiment with boosting so that articles that have Tag values matching the search terms appear closer to the top of the results. To do this, I send the following query
http://search:8983/solr/articles-experimental/select?defType=edismax&fl=id,tags,score&q=virtualization&qf=tags^2%20_text_
which keeps the same Article 42014 at the top of the list but now with a score of 4.269944. However, results 2 through 65 now all have the same score of 4.255975. In the non-boosted query the scores range from 4.056591 down to 2.7029662.
In addition, the set of document IDs coming back is not quite the same as before. I certainly expect some differences, but not to the extent that I am seeing, considering that the vast majority of the articles coming back have the search term as a tag.
Ultimately, I am having trouble finding out exactly how boosting changes the score and what is an "appropriate" boost value. Understanding that it is probably subjective, what criteria should I be considering?
Well, with all the parameters you set for edismax (plus the default values for the ones you don't set), Solr just runs its scoring algorithm (BM25 nowadays) and calculates all the scores.
The specific boost values you should use for your query are impossible to guess; you have to try and retry. It is a known pain. I even built vifun, a tool to help me visualize how different parameters affect the score with edismax.
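A minimal sketch of the "try and retry" loop: it generates one edismax URL per candidate tags boost, using the endpoint and parameters from the question and adding debugQuery=on so each response explains its scores. The candidate boost values (1, 2, 5, 10) are arbitrary assumptions, not recommendations.

```python
from urllib.parse import urlencode

# Endpoint taken from the question.
BASE = "http://search:8983/solr/articles-experimental/select"


def edismax_url(tag_boost, query="virtualization"):
    """Build an edismax query URL with the given boost on the tags field."""
    params = {
        "defType": "edismax",
        "q": query,
        "qf": f"tags^{tag_boost} _text_",
        "fl": "id,tags,score",
        "debugQuery": "on",  # include per-document score explanations
    }
    return BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Arbitrary candidate boosts; compare the ranking and explain output of each.
    for boost in (1, 2, 5, 10):
        print(boost, edismax_url(boost))
```

Running the same search term through each URL and comparing the top results is a manual version of what a visualization tool automates.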
From reading the docs the search +term +another_term should return the same documents as term AND another_term. But I'm getting different results. Someone suggested that one of the terms is actually acting as an OR. But I thought the search queries were baked into SOLR.
Where in the Solr config would I check for this?
If you enable the debug flag in the admin UI when you run those two queries, it will show you what they get translated to at the lowest level, after the query parser has processed them. You can compare the two and see whether something is different.
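The same comparison can be done outside the admin UI. The sketch below builds both forms of the query with debugQuery=on; the host and core name are hypothetical. One thing it also guards against: a literal + in a URL query string is decoded as a space, so +term has to be sent as %2Bterm or the + never reaches the query parser. That is only one possible cause of the difference, not a diagnosis.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def debug_url(q):
    """Build a select URL that returns no rows, only the debug output."""
    # urlencode turns '+' into %2B and spaces into '+'.
    return SOLR_SELECT + "?" + urlencode({"q": q, "debugQuery": "on", "rows": 0})


plus_form = debug_url("+term +another_term")
and_form = debug_url("term AND another_term")

if __name__ == "__main__":
    print(plus_form)
    print(and_form)
```

If the parsedquery values in the two debug sections differ, the difference shows which clause ended up optional instead of required.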
I am working with two different search tools: DtSearch and Solr. I do a FULL_TEXT search on one indexed search term ("2008/12/02") and unfortunately the two give different hits even though the data are the same. Another strange thing I notice is that Solr gives three DOC_IDs as hits while DtSearch gives me five for the same search term.
I am confused about date searching now. How can it be possible though the data are the same?
Do I need to apply some extra settings in config files? Is there any way I get consistent output?
Thank you,
I have a Solr solution working which requires two queries, but I'm looking for a way to do it in a single query. My idea is that if I can figure out a way to do this, I won't have to incur the overhead of twice the load on the Solr cluster.
The details: I'm running a simple query like "q=camera" with a filter query of, say, "fq=type:digital". The second query is identical to the first, but the filter is the inverse, like "fq=-type:digital". I'm imagining that if there's a way to run a single query while applying the first filter to get the first set of topDocs, then generate a second set with the second filter, the results could be merged and returned (it doesn't matter if sorting re-sorts and mixes the two sets).
I experimented with partitioning the data by marking a specific field during indexing, into two different groups and then using Solr "grouping" queries, but the response time for these wasn't acceptable in my setup.
I'm looking for suggestions on the most Solr-congruent approach to experiment with: tuning to improve the performance of the two-query solution, or investigating a kind of custom Solr post-filter (I read Yonik's 2/2012 blog post).
I have to implement this in Solr 3.5, although if there's a slam dunk solution in 4.0 I'll eventually be able to move to that.
I can think of two alternative approaches:
Instead of filtering the results, use a variable, higher boost so that all the results for type:digital come out on top and the rest of the documents follow. No need for separate queries. The boost can be changed per type value.
The other approach is not to display the results for types other than digital. However, you can display facets for the other types, with their counts, so users know whether other types exist for the search term. You can check on tagging and excluding filters.
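Both approaches can be sketched as single requests. This assumes the edismax parser (for the bq boost query), a hypothetical host and core name, an arbitrary boost of 100, and a made-up tag name typefilter. Note that bq adds to the score rather than strictly partitioning the results, so whether every digital document lands on top depends on how large the boost is relative to the other scores.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def boosted_url(term="camera", preferred="digital", boost=100):
    """Approach 1: no filter, but boost documents of the preferred type."""
    params = {
        "defType": "edismax",
        "q": term,
        "bq": f"type:{preferred}^{boost}",  # boost value is an arbitrary assumption
    }
    return SOLR_SELECT + "?" + urlencode(params)


def faceted_url(term="camera", shown="digital"):
    """Approach 2: filter to one type, but facet on type ignoring that filter."""
    params = [
        ("q", term),
        ("fq", "{!tag=typefilter}type:" + shown),   # tag the filter
        ("facet", "true"),
        ("facet.field", "{!ex=typefilter}type"),    # exclude it when counting
    ]
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(boosted_url())
    print(faceted_url())
```

The second request returns only digital documents, while the type facet still reports counts for every type that matches the search term.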
Result grouping might give you what you want. Just group by that parameter and specify sufficient top number of documents in each group.
But I would test whether its performance is any better than two queries, since performance is mentioned in the limitations section of its documentation.
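A hedged sketch of one way to try this: instead of grouping on a field, it uses two group.query parameters so one group holds the type:digital matches and the other holds everything else. It assumes your Solr version supports group.query; the host, core name and group.limit value are placeholders.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def grouped_url(term="camera", type_value="digital", per_group=10):
    """One request that returns two groups: matching the type, and not matching it."""
    params = [
        ("q", term),
        ("group", "true"),
        ("group.query", f"type:{type_value}"),
        # Written as '*:* -type:...' to avoid relying on pure-negative query handling.
        ("group.query", f"*:* -type:{type_value}"),
        ("group.limit", per_group),  # documents returned per group
    ]
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(grouped_url())
```

Time this against the two separate filtered queries on real data before committing to it.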
SOLR results are normally ordered by "best match" of your search criteria. Is it possible to order the results alphabetically by a given SOLR field?
I realize that this is not a typical use case, but here's my motivation. We have quite a lot of code written around SOLR that performs queries based on user searches against the various fields of our data. Most of the time, we want a relevancy ordering (i.e. best matches first).
But one anomalous use-case requires that we return data ordered alphabetically by field. I could perform this query using our SQL database (avoiding SOLR altogether), but I'd have to replicate an awful lot of code that's tailored around consuming SOLR results (facets in particular). I'm hoping to use the same code path, if it's possible to get such an ordering from SOLR.
Yes, you just have to set the sort parameter to the field name followed by a direction, for example sort=field-name asc.
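A minimal sketch of such a request. The host, core name and the field name title_sort are hypothetical. Sorting generally behaves as expected only on single-valued fields that are not tokenized (a string field, for instance), so a dedicated copy of the field may be needed.

```python
from urllib.parse import urlencode

# Hypothetical host and core name; replace with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def sorted_url(query, field, descending=False):
    """Build a select URL whose results are ordered by field instead of score."""
    direction = "desc" if descending else "asc"
    params = {"q": query, "sort": f"{field} {direction}"}
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    # title_sort is a made-up field name for illustration.
    print(sorted_url("camera", "title_sort"))
```

Since only the sort parameter changes, the rest of the request (facets included) can go through the same code path as the relevancy-ordered queries.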