LUCENE: search for terms that match a regex - solr

I need to search for any terms in the Lucene index that match a particular regex. I know that I can do it using the TermsComponent in Solr, if it is configured like this:
<searchComponent name="terms" class="solr.TermsComponent"/>
<!-- A request handler for demonstrating the terms component -->
<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<bool name="terms">true</bool>
<bool name="distrib">false</bool>
</lst>
<arr name="components">
<str>terms</str>
</arr>
</requestHandler>
For example, I want to fetch any terms containing "surface defects". Using Solr, I can do this:
http://localhost:8983/solr/core1/terms?terms.fl=content&
terms.regex=^(.*?(\bsurface%20defects\b)[^$]*)$&
terms.sort=count&
terms.limit=10000
But my question is: how can I achieve the same using the Lucene API, not Solr? I looked into the org.apache.solr.handler.component.TermsComponent class, but it is not very obvious to me.
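As an aside on the Solr request above: characters such as spaces and | in terms.regex need to be percent-encoded before the URL is sent. The following is a minimal sketch in plain Java, assuming Java 10 or later; the class and method names are hypothetical, and the regex is a simplified stand-in for the one in the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class TermsUrlBuilder {
    // Builds a TermsComponent request URL with the field and regex URL-encoded.
    // baseUrl is assumed to point at the core, e.g. http://localhost:8983/solr/core1
    static String termsUrl(String baseUrl, String field, String regex, int limit) {
        return baseUrl + "/terms"
                + "?terms.fl=" + URLEncoder.encode(field, StandardCharsets.UTF_8)
                + "&terms.regex=" + URLEncoder.encode(regex, StandardCharsets.UTF_8)
                + "&terms.sort=count"
                + "&terms.limit=" + limit;
    }

    public static void main(String[] args) {
        // The space in the regex is encoded as '+', which is valid in a query string.
        System.out.println(termsUrl("http://localhost:8983/solr/core1",
                "content", ".*surface defects.*", 10000));
    }
}
```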

You can use a RegexpQuery:
Query query = new RegexpQuery(new Term("myField", myRegex));
Or the QueryParser:
String queryString = "/" + myRegex + "/";
QueryParser parser = new QueryParser("myField", new KeywordAnalyzer());
Query query = parser.parse(queryString);
Now, my question is: Are you sure that regex works in Solr?
I haven't tried the TermsComponent regex functionality, so maybe it's doing some fancy SpanQuery footwork here, or running regexes on the stored fields retrieved, or something like that, but you are using regex syntax that is not supported by Lucene, and may be making some general assumptions about how regexes work in Lucene that are not accurate.
The big one: a Lucene regex query must match the whole term. If your field is not analyzed, the general idea here should work. If it is analyzed with, say, StandardAnalyzer, you cannot use a regex query to search like this, since "surface defects" would be split into multiple terms. On the plus side, in that case a simple PhraseQuery would probably work just fine, as well as being faster and easier. (In general, on Lucene regex queries: you probably don't need them, and if you do, you probably should have analyzed better.)
^ and $: these won't work. A regex already has to match the whole term, so anchors serve no purpose and aren't supported.
.*?: not really wrong, but reluctant matching isn't supported, so the ? is redundant. .* does the same thing here.
[^$]*: if you are trying not to match dollar signs, fine; otherwise, I'm not sure what regex engine would support this. $ in a character class is just a dollar sign.
\b: not supported in Lucene regexes. The whole idea of analysis is that the content should already be split on word breaks, so what purpose would this serve?
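To illustrate the whole-term rule, here is a minimal sketch in plain Java. It uses java.util.regex rather than Lucene's own regex engine (the two differ in syntax), and the term lists are made-up examples, so treat it only as an illustration of why the same pattern behaves differently on analyzed and unanalyzed fields.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class WholeTermMatch {
    // Returns only the terms that the regex matches in full,
    // mirroring the rule that a regex query must match the whole term.
    static List<String> matchingTerms(List<String> terms, String regex) {
        Pattern p = Pattern.compile(regex);
        return terms.stream()
                .filter(t -> p.matcher(t).matches()) // matches() requires the entire term to match
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical terms from an unanalyzed field: each field value is one term.
        List<String> unanalyzed = List.of("visible surface defects found", "no issues");
        // Hypothetical terms from an analyzed field: the text was split into words.
        List<String> analyzed = List.of("visible", "surface", "defects", "found");

        System.out.println(matchingTerms(unanalyzed, ".*surface defects.*")); // one term matches
        System.out.println(matchingTerms(analyzed, ".*surface defects.*"));   // no single term matches
        System.out.println(matchingTerms(unanalyzed, "surface defects"));     // no match without .* on both sides
    }
}
```

On the analyzed term list no single term contains both words, which is why a phrase query is the better fit there.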

Related

Solr 5 how to search in specific field

I am using Solr version 5 for searching data. I am using the query below, which searches for the keywords in all fields.
http://localhost:8983/solr/document/select?q=keyword1+keyword2&wt=json
Can anyone suggest a query that searches for a keyword only in the title field?
Thanks.
Use
http://localhost:8983/solr/document/select?q=title:*yourkeyword*&wt=json
or, for an exact match,
http://localhost:8983/solr/document/select?q=title:"yourkeyword"&wt=json
You can not search for a keyword in all fields without some extra work:
How can I search all field in SOLR that contain the keywords,.?
The "q"-Parameter contains the query string and for the standard parser this means that you must specify the field via colon like in
fieldname:searchterm
or the standard parser will use the default field. The default field is specified in the "df"-Parameter and if you did not change your solrconfig.xml you will search in the "text"-Field because you will find something like
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">text</str>
</lst>
</requestHandler>
P.S. If you want to search in all fields, you either have to copy all field content into one field or use a specific query parser such as the dismax parser, where you can list all your fields in the "qf" parameter.
P.P.S. You can not search in all fields but you can highlight in all fields :-)
The best way is to run the query from the Admin console. When you run it, the console also shows the actual request URL that was executed. Just copy that query and use it.
About the question of searching for a specific field value in Solr: in the Admin console, look for the 'q' text box, enter yourfield:value, and hit the 'Execute Query' button. The generated request URL appears at the top right.
Generated Query: ......select?indent=on&q=YOURFIELD:"VALUE"&wt=json

Querying across multiple fields with different boosts in Solr

In Solr, what is the best way of querying across different fields where each query on each field has a different weighting?
We're using C# and ASP.NET, with SolrNet being used to query Solr. Our index looks a bit like this:
document_id
title
text_content
tags
[some more fields...]
This is then queried using keywords, where each keyword has a different weight. So, for example, "ipad" might have a weight of 40, but "android" might have a weight of 25.
In conjunction with this, each field has a different base weight. For example, keywords are more valuable than page title, which are more valuable than text content.
So, we end up with something like the following:
title^25
text_content^10
tags^50
And the following keywords:
ipad^25
apple^22
microsoft^15
windows^15
software^20
computer^18
So, each search query has a different weighting, and each field has a different weight. As a result, we end up with search criteria that looks like this:
title:ipad^50
title:apple^47
title:microsoft^40
[more titles...]
text_content:ipad^35
text_content:apple^32
text_content:microsoft^25
[lots more...]
This translates into a very, very long search query, which exceeds the limit allowed. It also seems like a very inefficient way of doing things, and I was wondering if there's a better way of achieving this.
Effectively, we have a list of keywords with varied weights, and a list of fields in Solr which also have varied weights, and the idea is to query the index to retrieve the most relevant documents.
Further complicating this matter, though it may be out of the scope of this question, is that the query also includes filters to filter out documents. This is done using the following type of query:
&fq=(-document_id:4f845eb321c90b0aec5ee0eb)&fq=(-document_id:4f845cd421c90b0aec5ee041)&fq=(-document_id:4f845cea21c90b0aec5ee049)&fq=(-document_id:4f845cf821c90b0aec5ee04d)&fq=(-document_id:4f845d0e21c90b0aec5ee056)&fq=(-document_id:4f845d3521c90b0aec5ee064)&fq=(-document_id:4f845d3921c90b0aec5ee065)&fq=(-document_id:4f845d4921c90b0aec5ee06b)&fq=(-document_id:4f845d7521c90b0aec5ee07b)&fq=(-document_id:4f845d9021c90b0aec5ee084)&fq=(-document_id:4f845dac21c90b0aec5ee08e)&fq=(-document_id:4f845dbc21c90b0aec5ee093)
These can also add a lot of characters to the search query, and it would be good if there was also a better way to handle this as well.
Any help or advice is most appreciated. Thanks.
I would suggest adding those default parameters to your request handler configuration in solrconfig.xml. They are always the same, right?
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="qf">title^25 text_content^10 tags^50</str>
</lst>
</requestHandler>
You should be able to add your static filters and so on in the same way, so that you don't have to specify those values unless you want something different from the default, and you end up with much shorter URLs.
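With the field weights moved into the handler defaults, the request only needs to carry the keyword weights. The following is a rough sketch of how the client side could assemble that in plain Java; the class and method names are hypothetical. It also collapses the many exclusion filters into a single negative filter query, which is an assumption about what fits your setup rather than something taken from the configuration above. Note that with edismax the field boost and the keyword boost act as factors in scoring, so the result will not be identical to the additive weights in the question.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class BoostedQueryBuilder {
    // Joins keyword weights into a q value such as "ipad^25 apple^22".
    // The per-field weights are expected to come from qf in the handler defaults.
    static String keywordQuery(Map<String, Integer> weights) {
        return weights.entrySet().stream()
                .map(e -> e.getKey() + "^" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    // Collapses many excluded ids into one filter query,
    // e.g. -document_id:(id1 OR id2), instead of one fq parameter per id.
    static String exclusionFilter(String field, List<String> ids) {
        return "-" + field + ":(" + String.join(" OR ", ids) + ")";
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the keywords in insertion order.
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("ipad", 25);
        weights.put("apple", 22);
        weights.put("microsoft", 15);

        System.out.println("q=" + keywordQuery(weights));
        System.out.println("fq=" + exclusionFilter("document_id",
                List.of("4f845eb321c90b0aec5ee0eb", "4f845cd421c90b0aec5ee041")));
    }
}
```

Both values still need to be URL-encoded before they are placed in a request.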

Apache Solr TermsComponent: How to prevent from splitting words after one character. E.g. "t-shirt"

I'm trying to get autosuggestions for search terms. But I've run into a problem with words containing characters like "-" and "&", which are being split after just one character.
Example:
/solr/terms/?terms=true&terms.fl=item&terms.limit=10&terms.sort=count&terms.prefix=t
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
</lst>
<lst name="terms">
<lst name="item">
<int name="top">11335</int>
<int name="tshirt">10249</int>
<int name="t">10156</int>
<int name="trouser">4771</int>
<int name="tight">1577</int>
</lst>
</lst>
</response>
The problem lies with tshirt and t. "t" only appears within "t-shirt", so how do I prevent Solr from splitting words after just one character when there is no whitespace after it? "t shirt" should split; "t-shirt" and "h&m" should not.
Thanks for your help!
The field type for item seems to be a text type with WordDelimiterFilterFactory as one of the filters in the analysis chain.
By default, WordDelimiterFilterFactory splits on intra-word delimiters.
So t-shirt generates two tokens, t and shirt, which is why the term t appears for you.
If you want to use terms for autosuggest, remove or tune the WordDelimiterFilterFactory as required.
You can use a TextField with a basic configuration, for example WhitespaceTokenizerFactory with the lowercase and ASCII folding filters, so the tokens are minimally analyzed and don't appear fragmented.
You can also list words you don't want split in protwords.txt, or map certain characters in wdfftypes.txt so they won't be used to split terms.
Also check this link for a good autosuggester: http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
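The difference between the two analysis styles can be imitated in plain Java. This is only a rough stand-in for illustration: the two methods below are hypothetical and do not reproduce the actual behavior of WhitespaceTokenizerFactory or WordDelimiterFilterFactory, which have many more options.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TokenizeSketch {
    // Rough stand-in for a whitespace tokenizer followed by lowercasing.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // Rough stand-in for splitting on intra-word delimiters such as '-' and '&'.
    static List<String> delimiterTokens(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("[^\\p{Alnum}]+"));
    }

    public static void main(String[] args) {
        System.out.println(whitespaceTokens("T-Shirt")); // stays one token
        System.out.println(delimiterTokens("T-Shirt"));  // splits into two tokens
        System.out.println(whitespaceTokens("t shirt")); // whitespace still splits
        System.out.println(delimiterTokens("H&M"));      // splits on '&'
    }
}
```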
If that's the only problem you have using the TermsComponent to make auto suggestions the answer you got is perfect, but I'd like to propose an alternative answer.
The TermsComponent is fast and pretty easy to use, but it has the following limitations:
you can't apply any filter to your suggestions;
you may have trouble with case-sensitive queries: for example, if you use the LowerCaseFilterFactory and index the word Word, you'll get the suggestion only when typing w, not W. You basically need to lowercase the query yourself before submitting it to Solr, since you can't apply any tokenizer or filter to your query.
Depending on your requirements, you might want to consider different ways to make auto suggestions with Solr. The Different ways to make auto suggestions with Solr article should be useful in order to make the right choice.

Solr search query over multiple fields

Is it possible to search in Solr over two fields using two different words and get back only those results which contain both of them?
For example, if I have the fields "type" and "location", I want only those results that have type='furniture' and location='office'.
You can use boolean operators and search on individual fields.
q=type:furniture AND location:office
If the values are fixed, it is better to use filter queries for performance.
fq=type:furniture AND location:office
The suggested solutions have the drawback that you have to take care of escaping special characters.
If the user searches for "type:d'or AND location:coffee break" the query will fail.
I suggest combining two edismax handlers:
<requestHandler name="/combine" class="solr.SearchHandler" default="false">
<lst name="invariants">
<str name="q">
(_query_:"{!edismax qf='type' v=$uq1}"
AND _query_:"{!edismax qf='location' v=$uq2}")
</str>
</lst>
</requestHandler>
Call the request handler like this:
http://localhost:8983/solr/collection1/combine?uq1=furniture&uq2=office
Explanation
The variables $uq1 and $uq2 will be replaced by the request parameters uq1 and uq2.
The result of the first edismax query (uq1) is combined by logical AND with the second edismax query (uq2)
Solr Docs
https://wiki.apache.org/solr/LocalParams
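If you would rather keep the plain field:value form, the user input has to be escaped first. SolrJ provides ClientUtils.escapeQueryChars for this; the version below is a standalone sketch modeled on it in plain Java, with hypothetical class and method names, so check the character list against the query parser version you actually run.

```java
class QueryEscaper {
    // Characters treated as special by the Lucene/Solr query syntax (assumed list).
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Prefixes every special character and every whitespace character with a backslash.
    static String escape(String input) {
        StringBuilder sb = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds a single field clause from raw user input.
    static String fieldClause(String field, String userInput) {
        return field + ":" + escape(userInput);
    }

    public static void main(String[] args) {
        // The escaped space keeps "coffee break" inside the location clause.
        System.out.println(fieldClause("type", "d'or")
                + " AND " + fieldClause("location", "coffee break"));
    }
}
```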
You can also use the boost query (bq) parameter of the dismax request handler:
type=dismax&bq=type:furniture AND location:office
fq=type:furniture AND location:office
Instead of using AND, this could be broken into two filter queries as well:
fq=type:furniture
fq=location:office

How can SOLR be made to boost within result set?

I have indexed some documents that have title, content and keyword (multi-value).
I want to search on title and content, and then, within these results boost by keyword.
I have set up my qf as such:
<str name="qf">
content^0.5 title^1.0
</str>
And my bq as such:
<str name="bq">keyword:(*.*)^1.0</str>
But I'm fairly sure that this is boosting on all keywords (not just the ones matching my search term).
Does anyone know how to achieve what I want? (I'm using the DisMax query request handler, by the way.)
I don't think that's how the boost works. Boost is supposed to specify the importance of a match on a specific field.
So by doing something like content^0.5 title^1.0 keyword^5.0, you can make your queries give extra importance to the keyword.
You might be able to force it by doing a complex query. For instance you can use the "+" operator to make it required. So something like this if you were searching for "query":
+(content:query title:query) keyword:query
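The query above can be assembled for any search term with a few lines of Java. This is a minimal sketch with a hypothetical helper; it assumes a query parser that accepts field-qualified clauses (the standard Lucene parser or edismax), that the term has already been escaped, and that an optional boost on the keyword clause is wanted.

```java
class BoostWithinResults {
    // Requires a match in content or title, and adds an optional,
    // boosted keyword clause that only raises the score of documents already matched.
    static String query(String term, double keywordBoost) {
        return "+(content:" + term + " title:" + term + ") keyword:" + term + "^" + keywordBoost;
    }

    public static void main(String[] args) {
        System.out.println(query("query", 5.0));
    }
}
```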
