Solr: Boosting documents that match all terms

I would like documents that match all terms to rank the highest, followed by the partial matches. Among the full matches and among the partial matches, the documents should be ranked by the default behavior (TF-IDF). I figured (correct me if I'm wrong) that the best way to do this would be with boost queries, but I am not sure what the correct syntax is.
Here are some of my handler settings:
<str name="defType">edismax</str>
<str name="qf">parent_and_self_description details^0.0001 info^0.0001 code^10000</str>
And let's say an example query is q=cheese apple
How should I set my bq? I guessed something like bq=(cheese AND apple)^100 or bq=(+cheese +apple)^100 but obviously this is not working so it must be syntactically wrong. Thank you.

Related

Solr SpellCheck Component custom freq field?

I'm currently playing around with the Solr SpellCheck component and at the moment I've a core which is my 'dictionary'. In this core there is a huge list of words with a "score".
Example document:
"keyword":"facebook",
"frequency":89504,
A word is only listed once in the core, so when I execute a spellcheck for example faceboek
spell?omitHeader=true&wt=xml&json.nl=flat&spellcheck=true&spellcheck.q=faceboek&spellcheck.build=false
it returns facebook with a freq of 1 because that word is only listed once in my core. However, I want the returned freq to be the value of my frequency field.
Return example:
<lst>
<str name="word">facebook</str>
<int name="freq">1</int>
</lst>
So my question is: is it possible to replace the freq value with the frequency field every document has, or is there another solution to this?
Thank you for your time. I'll provide more information if the question is unclear.
Consider creating a separate core/collection with your suggestions and use that instead.
That will allow you to apply a boost to each document (i.e. suggestion) by using freq, and use fuzzy search (q=term~) to find suggestions (if they're misspelled).
Depending on the use case, the Suggester can also be useful, but a dedicated collection will give you the most flexibility (i.e. you can score it any way you want).
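A rough sketch of what a request against such a dedicated collection might look like: a fuzzy match on the keyword field, with the score multiplied by a function of the frequency field. The core name `suggestions`, the `/select` path, and the choice of `log(frequency)` as the boost function are assumptions for illustration, not part of the original setup.

```python
from urllib.parse import urlencode

def build_suggestion_url(term, base="http://localhost:8983/solr/suggestions/select"):
    """Build a fuzzy-search URL whose scores are scaled by the frequency field.

    The base URL and the boost function are hypothetical; adjust to your core.
    """
    params = [
        ("defType", "edismax"),
        ("qf", "keyword"),                  # field holding the dictionary word
        ("q", term + "~"),                  # trailing ~ requests a fuzzy match
        ("boost", "log(frequency)"),        # assumed multiplicative boost function
        ("fl", "keyword,frequency,score"),
        ("rows", "5"),
    ]
    return base + "?" + urlencode(params)

print(build_suggestion_url("faceboek"))
```

Whether a logarithm, the raw frequency, or some other function gives the best ordering depends on how the frequencies are distributed, so treat the boost as something to tune.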

LUCENE: search for terms that match a regex

I need to search for any terms in the Lucene index that match a particular regex. I know that I can do it using the TermsComponent in Solr, if it is configured like this:
<searchComponent name="terms" class="solr.TermsComponent"/>
<!-- A request handler for demonstrating the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
For example, I want to fetch any terms containing "surface defects". Using solr I can do this:
http://localhost:8983/solr/core1/terms?terms.fl=content&
terms.regex=^(.*?(\bsurface%20defects\b)[^$]*)$&
terms.sort=count&
terms.limit=10000
But my question is, how can I achieve the same by using the Lucene API, not Solr? I looked into the org.apache.solr.handler.component.TermsComponent class but it is not very obvious to me.
You can use a RegexQuery:
Query query = new RegexQuery(new Term("myField", myRegex));
Or the QueryParser:
String queryString = "/" + myRegex + "/";
QueryParser parser = new QueryParser("myField", new KeywordAnalyzer());
Query query = parser.parse(queryString);
Now, my question is: Are you sure that regex works in Solr?
I haven't tried the TermsComponent regex functionality, so maybe it's doing some fancy SpanQuery footwork here, or running regexes on the stored fields retrieved, or something like that, but you are using regex syntax that is not supported by Lucene, and may be making some general assumptions about how regexes work in Lucene that are not accurate.
The big one: a Lucene regex query must match the whole term. If your field is not analyzed, the general idea here should work. If it is analyzed with, say, StandardAnalyzer, you cannot use a regex query to search like this, since "surface defects" would be split into multiple terms. On the plus side, in that case, a simple PhraseQuery would probably work just fine, as well as being faster and easier. (In general, on Lucene regex queries: you probably don't need them, and if you do, you probably should have analyzed better.)
^ and $: won't work. You are attempting to match terms, and must match the whole term in order to match. As such, these don't serve any purpose, and aren't supported.
.*?: not really wrong, but reluctant matching isn't supported, so it is redundant. .* does the same thing here.
[^$]*: if you are trying not to match dollar signs, fine; otherwise, I'm not sure what regex engine would support this. $ in a character class is just a dollar sign.
\b: no support in Lucene regexes. The whole idea of analysis is that the content should already be split on word breaks, so what purpose would this serve?
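To make the whole-term point concrete, here is a small illustration. It uses Python's `re.fullmatch` as a stand-in for whole-term matching; Python's regex dialect is not Lucene's, and the term lists are made-up examples of what an unanalyzed and an analyzed field might contain.

```python
import re

def matching_terms(terms, pattern):
    """Return the terms that the pattern matches in full (whole-term matching)."""
    compiled = re.compile(pattern)
    return [t for t in terms if compiled.fullmatch(t)]

# Field indexed as a single token (not analyzed): the phrase sits inside one term.
unanalyzed = ["minor surface defects found", "no issues"]
# Field split on word breaks by an analyzer: the phrase spans two separate terms.
analyzed = ["minor", "surface", "defects", "found", "no", "issues"]

print(matching_terms(unanalyzed, r".*surface defects.*"))  # ['minor surface defects found']
print(matching_terms(analyzed, r".*surface defects.*"))    # []
```

No anchors are needed in the pattern because the match is already forced to cover the whole term, and the analyzed case returns nothing because no single term contains both words.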

Solr supporting varying facets for different types of products

I am using Solr for indexing different types of products. The product types (category) have different facets. For example:
camera: megapixel, type (slr/..), body construction, ..
processors: no. of cores, socket type, speed, type (core i5/7)
food: type, origin, shelf-life, weight
tea: type (black/green/white/..), origin, weight, use milk?
serveware: type, material, color, weight
...
And they have common facets as well:
Brand, Price, Discount, Listing timeframe (like new), Availability, Category
I need to display the relevant facets and breadcrumbs when user clicks on any category, or brand page or a global search across all types of products. This is same as what we see on several ecommerce sites.
The question that I have is:
Since the facet types are more or less unique across different types of products, do I put them in separate schemas? Is that the way to do it? The fundamental worry is that those fields will not have any data for other types of products. And are there any implementation principles here that make retrieving the respective facets for a given product type easier?
I would like to have a design that is scalable to accommodate lots of items in each product type as we go forward, as well as that is easy to use and performance oriented, if possible. Right now I am having a single instance of Solr.
The only risk of underpopulated facets is that they can misrepresent the search. I'm sure you've used a search site where the metadata you want to facet on is underpopulated, so that when you apply the facet you also eliminate from your result set a number of records that should have been included. The thing to watch is that the facet values are populated consistently where they are appropriate. That means that your "tea" records don't need to have a number of cores listed, and it won't impact anything, but all of your "processor" records should, and (to whatever extent possible) they should be populated consistently. This means that if one processor lists its number of cores as "4", and another says "quadcore", these are two different values and a user applying either facet value will eliminate the other processor from their result. If a third quadcore processor is entirely missing the "number of cores" stat from the no_cores facet field (field name is arbitrary), then your facet could become counterproductive.
So, we can throw all of these records into the same Solr, and as long as the facets are populated consistently where appropriate, it's not really necessary that they be populated for all records, especially when not applicable.
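One way to keep values consistent is to normalize them before indexing. The following is only a sketch: the alias table and the function name are hypothetical, and a real catalogue would need its own mapping.

```python
# Hypothetical mapping from free-text spellings to one canonical facet value.
CORE_COUNT_ALIASES = {
    "dualcore": "2", "dual-core": "2", "dual core": "2",
    "quadcore": "4", "quad-core": "4", "quad core": "4",
}

def normalize_no_cores(raw):
    """Return a canonical core-count string, or None when the value is missing."""
    if raw is None:
        return None
    value = str(raw).strip().lower()
    if not value:
        return None
    # Known aliases collapse to a single value; anything else passes through as-is.
    return CORE_COUNT_ALIASES.get(value, value)

print(normalize_no_cores("Quadcore"))  # 4
print(normalize_no_cores(4))           # 4
```

Records for which this returns None are the ones worth flagging, since they would silently drop out of the result set when a user applies the facet.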
Applying facets dynamically
Most of what you need to know is in the faceting documentation of Solr. The important thing is to specify the appropriate arguments in your query to tell Solr which facets you want to use. (Until you actually facet on a field, it's not a facet but just a field that's both stored="true" and indexed="true".) For a very dynamic effect, you can specify all of these arguments as part of the query to Solr.
&facet=true
This may seem obvious, but you need to turn on faceting. This argument is convenient because it also allows you to turn off faceting with facet=false even if there are lots of other arguments in your query detailing how to facet. None of it does anything if faceting is off.
&facet.field=no_cores
You can repeat this parameter for as many fields as you're interested in faceting on.
&facet.limit=7
&f.no_cores.facet.limit=4
The first line here limits the number of values returned by Solr for each facet field to 7. The 7 most frequent values for the facet (within the search results) will be returned, with their record counts. The second line overrides this limit for the no_cores field specifically.
&facet.sort=count
You can list the facet field's values either in order of how many records each appears in (count), or in index order (index). Index order generally means alphabetically, but depends on how the field is indexed. This parameter is used together with facet.limit, so if the number of facet values returned is limited by facet.limit they will either be the most numerous values in the result set or the earliest in the index, depending on how this value is set.
&facet.mincount=1
There are very few circumstances in which you will want to see facet values that appear zero times in your search results, and this can fix the problem if it pops up.
The end result is a very long query:
http://localhost/solr/collection1/search?facet=true&facet.field=no_cores&
facet.field=socket_type&facet.field=processor_type&facet.field=speed&
facet.limit=7&f.no_cores.facet.limit=4&facet.mincount=1&defType=dismax&
qf=name+manufacturer+no_cores+description&
fl=id,name,no_cores,description,price,shipment_mode&q="Intel"
This is definitely effective, and allows for the greatest amount of on-the-fly decision-making about how the search should work, but isn't very readable for debugging.
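If you do build the query dynamically, it is easier to assemble it from a list of parameters than to concatenate strings by hand. A minimal sketch, assuming the same host, path, and field names as the example URL; the function name is made up.

```python
from urllib.parse import urlencode

def build_facet_url(query, facet_fields, base="http://localhost/solr/collection1/search"):
    """Assemble a faceted search URL; facet.field is repeated once per field."""
    params = [("q", query), ("facet", "true")]
    params += [("facet.field", field) for field in facet_fields]
    params += [
        ("facet.limit", "7"),             # default cap on values per facet
        ("f.no_cores.facet.limit", "4"),  # per-field override for no_cores
        ("facet.mincount", "1"),          # hide values with zero matches
    ]
    return base + "?" + urlencode(params)

print(build_facet_url('"Intel"', ["no_cores", "socket_type", "processor_type", "speed"]))
```

Using a list of pairs rather than a dictionary matters here, because the same parameter name has to appear several times.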
Applying facets less dynamically
So these features allow you to specify which fields you want to facet on, and do it dynamically. But, it can lead to a lot of very long and complex queries, especially if you have a lot of facets you use in each of several different search modes.
One option is to formalize each set of commonly used options in a request handler within your solrconfig.xml. This way, you apply the exact same arguments but instead of listing all of the arguments in each query, you just specify which request handler you want.
<requestHandler name="/processors" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">dismax</str>
<str name="echoParams">explicit</str>
<str name="fl">id,name,no_cores,description,price,shipment_mode</str>
<str name="qf">name manufacturer no_cores description</str>
<str name="sort">score desc</str>
<str name="rows">30</str>
<str name="wt">xml</str>
<str name="q.alt">*</str>
<str name="facet.mincount">1</str>
<str name="facet.field">no_cores</str>
<str name="facet.field">socket_type</str>
<str name="facet.field">processor_type</str>
<str name="facet.field">speed</str>
<str name="facet.limit">10</str>
<str name="facet.sort">count</str>
</lst>
<lst name="appends">
<str name="fq">category:processor</str>
</lst>
</requestHandler>
If you set up a request handler in solrconfig.xml, all it does is serve as a shorthand for a set of query arguments. You can have as many request handlers as you want for a single Solr index, and you can alter them without rebuilding the index (reload the Solr core or restart the server application, e.g. JBoss or Tomcat, to put changes into effect).
There are a number of things going on with this request handler that I didn't get into, but it's all just a way of representing default Solr request arguments so that your live queries can be simpler. This way, you might make a query like:
http://localhost/solr/collection1/processors?q="Intel"
to return a result set with all of your processor-specific facets populated, and filtered so that only processor records are returned. (This is the category:processor filter, which assumes a field called category where all the processor records have a value processor. This is entirely optional and up to you.) You will probably want to retain the default search request handler that doesn't filter by record category, and which may not choose to apply any of the available (stored="true" and indexed="true") fields as active facets.
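As a mental model of how the handler's sections combine with a live request, the sketch below treats defaults as values that apply only when the request does not supply that parameter, and appends as values that are always added. It is a simplified illustration in plain Python, not Solr's actual implementation, and the function name is made up.

```python
def effective_params(request, defaults, appends):
    """Merge request parameters with handler 'defaults' and 'appends' sections.

    All arguments map a parameter name to a list of values.
    """
    merged = {name: list(values) for name, values in defaults.items()}
    for name, values in request.items():
        merged[name] = list(values)                    # request replaces defaults
    for name, values in appends.items():
        merged.setdefault(name, []).extend(values)     # appends are always added
    return merged

defaults = {"rows": ["30"], "facet.field": ["no_cores", "socket_type"]}
appends = {"fq": ["category:processor"]}
request = {"q": ['"Intel"'], "rows": ["10"], "fq": ["price:[0 TO 100]"]}

print(effective_params(request, defaults, appends))
```

Here the request's rows value wins over the default, while the category filter is applied in addition to the filter sent with the request.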

Querying across multiple fields with different boosts in Solr

In Solr, what is the best way of querying across different fields where each query on each field has a different weighting?
We're using C# and ASP.NET, with SolrNet being used to query Solr. Our index looks a bit like this:
document_id
title
text_content
tags
[some more fields...]
This is then queried using keywords, where each keyword has a different weight. So, for example, "ipad" might have a weight of 40, but "android" might have a weight of 25.
In conjunction with this, each field has a different base weight. For example, keywords are more valuable than the page title, which is more valuable than the text content.
So, we end up with something like the following:
title^25
text_content^10
tags^50
And the following keywords:
ipad^25
apple^22
microsoft^15
windows^15
software^20
computer^18
So, each search query has a different weighting, and each field has a different weight. As a result, we end up with search criteria that looks like this:
title:ipad^50
title:apple^47
title:microsoft^40
[more titles...]
text_content:ipad^35
text_content:apple^32
text_content:microsoft^25
[lots more...]
This translates into a very, very long search query, which exceeds the limit allowed. It also seems like a very inefficient way of doing things, and I was wondering if there's a better way of achieving this.
Effectively, we have a list of keywords with varied weights, and a list of fields in Solr which also have varied weights, and the idea is to query the index to retrieve the most relevant documents.
Further complicating this matter, though it may be out of the scope of this question, is that the query also includes filters to filter out documents. This is done using the following type of query:
&fq=(-document_id:4f845eb321c90b0aec5ee0eb)&fq=(-document_id:4f845cd421c90b0aec5ee041)&fq=(-document_id:4f845cea21c90b0aec5ee049)&fq=(-document_id:4f845cf821c90b0aec5ee04d)&fq=(-document_id:4f845d0e21c90b0aec5ee056)&fq=(-document_id:4f845d3521c90b0aec5ee064)&fq=(-document_id:4f845d3921c90b0aec5ee065)&fq=(-document_id:4f845d4921c90b0aec5ee06b)&fq=(-document_id:4f845d7521c90b0aec5ee07b)&fq=(-document_id:4f845d9021c90b0aec5ee084)&fq=(-document_id:4f845dac21c90b0aec5ee08e)&fq=(-document_id:4f845dbc21c90b0aec5ee093)
These can also add a lot of characters to the search query, and it would be good if there was also a better way to handle this as well.
Any help or advice is most appreciated. Thanks.
I would suggest adding those default parameters to your request handler configuration within solrconfig.xml. They are always the same, right?
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="qf">title^25 text_content^10 tags^50</str>
</lst>
</requestHandler>
You should be able to add your static filters and so on in the same way, so that you don't have to specify those values unless you want to do something different from the default, and you end up with much shorter URLs.
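With the field weights moved into the handler, the request only needs to carry the keyword weights. A hedged sketch of building that request follows; the function name is made up, and collapsing the exclusions into one grouped filter is my own suggestion rather than something from the question. As far as I understand, term boosts and qf boosts combine multiplicatively, so the resulting numbers will not equal the additive weights shown in the question.

```python
from urllib.parse import urlencode

def build_keyword_query(keyword_weights, excluded_ids=()):
    """Build query-string parameters: boosted keywords plus one exclusion filter."""
    # One boosted clause per keyword; the field weights come from qf in the handler.
    q = " ".join(f"{keyword}^{weight}" for keyword, weight in keyword_weights.items())
    params = [("q", q)]
    if excluded_ids:
        # A single grouped filter instead of one fq per excluded document.
        params.append(("fq", "-document_id:(" + " ".join(excluded_ids) + ")"))
    return urlencode(params)

weights = {"ipad": 25, "apple": 22, "microsoft": 15}
excluded = ["4f845eb321c90b0aec5ee0eb", "4f845cd421c90b0aec5ee041"]
print(build_keyword_query(weights, excluded))
```

If the parameter string is still too long for a GET request, the same parameters can be sent in the body of a POST instead.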

How can SOLR be made to boost within result set?

I have indexed some documents that have title, content and keyword (multi-value).
I want to search on title and content, and then, within these results boost by keyword.
I have set up my qf as such:
<str name="qf">
content^0.5 title^1.0
</str>
And my bq as such:
<str name="bq">keyword:(*.*)^1.0</str>
But I'm fairly sure that this is boosting on all keywords (not just the ones matching my search term).
Does anyone know how to achieve what I want? (I'm using the DisMax query request handler, by the way.)
I don't think that's how the boost works. Boost is supposed to specify the importance of a match on a specific field.
So by doing something like content^0.5 title^1.0 keyword^5.0, you can make your queries give extra importance to the keyword.
You might be able to force it by doing a complex query. For instance you can use the "+" operator to make it required. So something like this if you were searching for "query":
+(content:query title:query) keyword:query
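A small helper that builds this kind of query string for an arbitrary search term might look like the sketch below. The function name and defaults are made up, it assumes a single-word term with no special characters that would need escaping, and it assumes a query parser that accepts fielded clauses.

```python
def build_boosted_query(term, required_fields=("content", "title"), boost_field="keyword"):
    """Require a match in one of the required fields; a boost_field match only adds score."""
    required = " ".join(f"{field}:{term}" for field in required_fields)
    return f"+({required}) {boost_field}:{term}"

print(build_boosted_query("query"))  # +(content:query title:query) keyword:query
```

Because the keyword clause is optional, it never adds documents to the result set; it only raises the score of documents that already matched on content or title.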
