How do you create Solr queries with wildcard searches and scoring, fuzzy search, distance searching and other features - solr

I am trying to build a search over my domain with Solr, and I am having trouble producing a keyword search that fulfils our requirements. My issue:
When my users search, the requirement is that the search must return results with partial token matches. For example:
Consider the text field: "CA-1234-ABCD California project"
The following keyword searches (what the user puts in the search field) should match this field:
``
"California"
"Cali"
"CA-1234-ABCD"
"ABCD"
"ABCD-1234"
``
etc.
With a text_en field (as configured in the example schema), the tokenization, stemming and grammar processing will allow non-wildcard searches to work for partial words/tokens in many cases, but Solr still seems limited to exact token match in many situations. For example, the following query does not match:
name:cali
The only way I have found to get the user experience that is required is to use a wildcard search:
name:*cali*
The problem with this is that tf scoring (and, it seems, other functionality such as fuzzy search) doesn't work with a wildcard search.
The question is: is there a way to get partial token matching (for all tokens, not just those that share a common stem, etc.) while retaining tf scoring and other advanced query functionality?
My best workaround at the moment is a query that includes both wildcard and non-wildcard clauses, such as:
name:cali OR name:*cali*
but I don't know if that is a good strategy here. Does Solr provide a way?
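The combined-clause workaround above can be generated from raw user input. The following is a minimal Python sketch, not a recommended Solr configuration: the helper names, the boost value of 2 on the exact clause, and joining several keywords with AND are all assumptions made for illustration. The boost is only there so that documents matching the plain clause tend to rank above wildcard-only matches.

```python
import re

# Characters with special meaning in the Lucene/Solr standard query parser.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/]|&&|\|\|)')


def escape_term(term):
    """Backslash-escape query-syntax characters in a user-supplied keyword."""
    return _SPECIAL.sub(r'\\\1', term)


def partial_match_query(field, keywords, exact_boost=2):
    """Build '(field:kw^boost OR field:*kw*)' per keyword (hypothetical helper).

    The boost value and the AND between keywords are assumptions,
    not something the question prescribes.
    """
    clauses = []
    for word in keywords.split():
        safe = escape_term(word)
        clauses.append(f"({field}:{safe}^{exact_boost} OR {field}:*{safe}*)")
    return " AND ".join(clauses)


print(partial_match_query("name", "cali"))
```

Leading wildcards such as *cali* can be slow on large indexes, so this is better treated as a stopgap than as a final design.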

Related

Solr syntax for phrase query

I have a field with definition:
"replace-field": {
"name":"search_words",
"type":"lowercase",
"stored":true,
"indexed": true,
"multiValued": true
}
that contains sentences as an array (hence multiValued: true):
"id":500,
"search_words":["How much oil should you pour into the engine",
"How important is engine oil?"]
How should I create a query that would return that document (with id = 500) when the user inputs the phrase "engine oil"?
With single-term queries I can use *engine* and it finds that document because engine is in the middle of the sentence, but I can't find a way to search for phrases within sentences. Is it even possible using Solr?
Solr supports phrase search; that is what it was actually designed for. Wildcard searches are not really how you should use Solr by default: the field type should tell Solr how to process the text in the field so that you get hits when querying it in the regular way.
In this case the standard text_en would probably work fine, or a field definition with a StandardTokenizer and a LowerCaseFilter (and possibly a WordDelimiterGraphFilter to get rid of special characters).
The query would then be search_words:"engine oil".
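As a rough illustration of sending that phrase query over HTTP, here is a small Python sketch that only builds the request URL. The host, the core name "mycore", and the helper name are placeholders invented for this example; the /select handler and the q and wt parameters are standard Solr.

```python
from urllib.parse import urlencode


def phrase_query_url(base_url, core, field, phrase):
    """Return a /select URL for an exact phrase query (hypothetical helper)."""
    # Escape backslashes and double quotes so the phrase stays one quoted unit.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    params = {"q": f'{field}:"{escaped}"', "wt": "json"}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Placeholder host and core name; substitute your own.
print(phrase_query_url("http://localhost:8983", "mycore", "search_words", "engine oil"))
```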

Azure search contains word not working as expected

I am new to Azure Search. I am trying to use "contains" logic in my search query. I looked it up and found out that I need to add something like following in my search query.
&queryType=full&search=/.*_search.*/
where _search is the string I want to search for. The "contains" logic itself works fine: for example, when I search for sweep, I get well sweep-cmu in the results.
But when I search for well sweep-cmu, I get zero results. Why? And how can I improve my query to get results when I enter both partial and full strings?
If you want an exact match for the search query, surround the query with double quotes.
e.g. "well sweep-cmu"
This will return all documents which contain the exact phrase.
Since you've just started to play with Azure Search you might find this article particularly interesting. It explains how the full text search works in Azure Search.
https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture
In order to get results for partial terms, you should use wildcard expressions in your search queries. The above article explains this in detail.
PS: Some wildcard queries can be very expensive and hence slow.
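One way to combine the two suggestions is to pick the query form based on the input. The Python sketch below is only an illustration: routing on whitespace and the function name are assumptions, and it does not escape regex metacharacters or embedded quotes in the user's text.

```python
def build_search_params(user_input):
    """Choose between a quoted phrase and a 'contains' regex (hypothetical helper)."""
    text = user_input.strip()
    if " " in text:
        # Multi-word input: exact phrase in double quotes, as the answer suggests.
        search = f'"{text}"'
    else:
        # Single word: the "contains" regex form from the question.
        search = f"/.*{text}.*/"
    return {"queryType": "full", "search": search}


print(build_search_params("sweep"))
print(build_search_params("well sweep-cmu"))
```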

Solr highlighting gives field/snippets with ANY term, instead of those that satisfy the query fully

I'm using Solr 5.x with the standard highlighter, and I'm getting snippets that match only one of the search terms, even when I specify q.op=AND.
I need ONLY the fields and snippets that match ALL the terms (unless I specify q.op=OR or omit it), i.e. the field/snippet must satisfy the query. Solr does return the field/snippet that has all the terms, but it also returns many others.
I'm using hl.fl=* to get only the fields that contain the terms, and searching against the default field ('text', containing the full document). I need to use * because I have multiple dynamic fields. Most fields are of type 'text_general' (for search and highlighting), and some are of type 'string' for faceting.
If it's not possible for snippets to have all the terms, I MUST at least get only the fields that satisfy the query fully (this question talks mostly about matching all the terms, but the search query can become arbitrarily complex, so the fields/snippets should match the query).
Next, I also need snippets highlighted for proximity-based searches. What should I do/use for this? The fields returned in highlighting in this scenario should also satisfy the proximity query (rather than my getting a field that contains any term, without regard to proximity constraints, other query terms, etc.).
Thanks for your help.
I've also encountered the same problem with highlighting. In my case, a query like
(foo AND bar) OR eggs
highlighted eggs and foo even though bar was not present in the document. I didn't manage to come up with a proper solution, but I devised a dirty workaround.
I use the following query:
id:highlighted_document_id AND text:(my_original_query)
with debugQuery set to true. Then I parse the explain text for highlighted_document_id. The text contains the terms from the query that contributed to the score. Terms that should not be highlighted are not present in the explanation.
The Python regular expressions I use to extract the terms (valid for Solr 5.2.1):
import re
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
Then I simply look for the highlight markers in the highlighted text and remove them if the term doesn't match any of the terms found by term_regex and wildcard_term_regex.
The solution is probably pretty limited, but works for me.
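The post-processing step described above could look roughly like the following Python sketch. The two regular expressions are the ones quoted in the answer; everything else is an assumption for illustration: the function names, the <em> markers (Solr's default, but configurable), and the sample explain text, which is a hand-written approximation rather than real Solr output.

```python
import re

# Regexes quoted from the answer (reported to work for Solr 5.2.1 explain output).
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')


def contributing_terms(explain_text):
    """Collect the terms that appear in the explain text, i.e. that scored."""
    terms = set(term_regex.findall(explain_text))
    terms |= set(wildcard_term_regex.findall(explain_text))
    return terms


def strip_unmatched_highlights(snippet, terms, pre="<em>", post="</em>"):
    """Remove highlight markers around words that did not contribute."""
    pattern = re.compile(re.escape(pre) + r"(.+?)" + re.escape(post))

    def keep_or_strip(match):
        word = match.group(1)
        return match.group(0) if word.lower() in terms else word

    return pattern.sub(keep_or_strip, snippet)


# Hand-written approximation of an explain entry for (foo AND bar) OR eggs
# on a document that contains foo and eggs but not bar.
sample_explain = (
    "0.35 = weight(text:eggs in 7) [DefaultSimilarity], result of:\n"
    "  0.35 = score(doc=7,freq=1.0), product of:\n"
)
terms = contributing_terms(sample_explain)
print(strip_unmatched_highlights("<em>foo</em> with <em>eggs</em>", terms))
```

Note that the comparison is between the highlighted surface word and the indexed term, so stemming or other analysis can make them differ; that is one of the limitations of this workaround.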

Solr/Lucene - partial fuzzy match

How do you set up partial (substring) fuzzy match in Solr 4.2.1?
For example, if you have a list of US cities indexed, I would like a search term "Alber" to match "Albuquerque".
I have tried using the NGramFilterFactory on the <fieldType> and rebuilt the index but queries do not return results as expected - they still work as if I had just done the standard text_general defaults. Exact matches work, and explicit fuzzy searches would work given sufficient similarity (for example "Alberquerque~" with one misspelling would work.)
I did go to the analyzer tool in the Solr admin and saw that my ngrams were indeed being generated.
Is there something I'm missing on the query side?
Or should I take a different approach altogether?
And can this work with dismax? (Multiple fields indexed like this with different weights)
Thanks!
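To reason about what n-grams can and cannot match here, it may help to compare the grams of the search term and the indexed value outside Solr. The Python sketch below only imitates what a character n-gram filter emits; it is not Solr's implementation, the function name is made up, and whether a real query matches also depends on the query-side analyzer and the default operator.

```python
def char_ngrams(text, n):
    """Return the set of lowercase character n-grams of length n."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


query, city = "Alber", "Albuquerque"
for n in (2, 3):
    shared = char_ngrams(query, n) & char_ngrams(city, n)
    print(n, sorted(shared))
```

With 3-grams only "alb" is shared, so a query that requires every gram of "Alber" to be present would not match, while one that accepts any shared gram would.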

Return stemmed word in Solr

We have stemming in our Solr search and we need to retrieve the word/phrase after stemming. That is if I search for "oranges", through stemming a search for "orange" is carried out. If I turn on debugQuery I would be able to see this, however we'd like to access it through the result if possible. Basically, we need this, because we pass the searched word as a parameter to a 3rd party application which highlights the word in an online PDF reader. Currently, if a user searches for "oranges" and a document contains "orange", then the PDF wouldn't highlight anything since it tries to highlight "oranges" not "orange".
Thanks all in advance,
Krt_Malta
I've no experience with Solr, but if you need it just for presentation to users, you could stem their queries yourself using the same stemmer Solr uses. This would probably be faster since it would avoid a trip to Solr's index. For English this would presumably be http://tartarus.org/~martin/PorterStemmer/ - or you could check Solr's implementation.
However, a word of caution, most stemming algorithms do not guarantee that stemmed words will be actual words. Check here http://snowball.tartarus.org/algorithms/english/stemmer.html for examples.
You could use the implicit analysis request handler to get the stemmed word.
For your example, if you are using the text_en field and the Snowball Stemmer, the URL
<YOUR SOLR HOST>/solr/<YOUR COLLECTION>/analysis/field?analysis.query=oranges&analysis.fieldtype=text_en&verbose_output=1
would give you a JSON response, including the following:
"org.apache.lucene.analysis.snowball.SnowballFilter",
[
{
"text": "orang",
...
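A small Python sketch for building that analysis URL programmatically follows. The host and collection values are placeholders, the helper name is hypothetical, and the added wt=json parameter is an assumption to request JSON explicitly. Parsing the response is left out because the full structure is not shown above.

```python
from urllib.parse import urlencode


def analysis_url(base_url, collection, word, field_type="text_en"):
    """Build a field-analysis request URL for a single query word."""
    params = {
        "analysis.query": word,
        "analysis.fieldtype": field_type,
        "verbose_output": 1,
        "wt": "json",  # assumption: ask for JSON explicitly
    }
    return f"{base_url}/solr/{collection}/analysis/field?{urlencode(params)}"


# Placeholder host and collection; substitute your own.
print(analysis_url("http://localhost:8983", "mycollection", "oranges"))
```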

Resources