Azure Search Highlight Partial Match - azure-cognitive-search

I have turned Hit Highlighting on and it is working well for entire word matches. But we append a wildcard character at the end of each word the user specifies and highlighting is not working on the partial matches. We are getting the results back, but the .Highlights object is null so no highlighting is available for partial matching.
Here is how we configure the SearchParameters:
var parameters = new SearchParameters
{
    Filter = newFilter,
    QueryType = QueryType.Full,
    Top = recordsPerPage,
    Skip = skip,
    SearchMode = SearchMode.Any,
    IncludeTotalResultCount = true,
    HighlightFields = new List<string> { "RESULT" },
    HighlightPreTag = "<font style=\"color:blue; background-color:yellow;\">",
    HighlightPostTag = "</font>"
};
return parameters;
response = indexClient.Documents.Search<SearchResultReturn>(query, parameters);
Here is an example of our query string: ("the") the*^99.95
The idea is we search for the exact string the user specified (multiple words) and then we do a wild-card search for each individual word specified.
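A minimal Python sketch of how such a query string could be assembled. The function name `build_query` and the handling of multi-word input (one boosted prefix clause per word) are assumptions based on the description above, and the input is assumed to contain plain words with no Lucene special characters:

```python
def build_query(user_input, boost=99.95):
    """Exact phrase first, then one boosted prefix clause per word."""
    words = user_input.split()
    # Quote the whole input so it is searched as an exact phrase.
    phrase = '("{}")'.format(" ".join(words))
    # Append '*' to every word to turn it into a prefix (wildcard) clause.
    prefixes = ["{}*^{}".format(word, boost) for word in words]
    return " ".join([phrase] + prefixes)


print(build_query("the"))  # ("the") the*^99.95
```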
So for the above example we are getting all the results that contain "the" and "the*" but only the words "the" have the highlighting. "They", "There", etc do not have any highlighting even if "They" is the only matching entry in the result ("the" was not in the result).
Again the query is bringing back the correct results, it's just the highlighting is not working for partial matches.
Is there some other setting I need to be able to highlight partial matches?

Thanks for reporting the issue.
Unfortunately, it is a known limitation in Azure Search that matches are sometimes not highlighted for broad wildcard searches. Highlighting is an independent process that runs after search. Once the matching documents are retrieved, the highlighter looks up the search index for all terms that match the wildcard criteria and uses those terms to highlight the retrieved documents. For broad wildcard queries, like a* (or the*), the highlighter only uses the top N most significant terms, based on their frequencies in the corpus, for performance reasons. In your example, 'they' and 'there' are probably not included in the highlights because they appear in most documents.
As this is a limitation of wildcard queries, one workaround is to preprocess the index so that you do not need to issue wildcard/prefix queries. Please take a look at custom analysis (https://learn.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search). You can, for example, use the edgeNGram token filter to store prefixes of words in the index and then issue a regular term query with the prefix (without the '*' operator).
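To illustrate the idea (this is a toy in-memory index in Python, not the Azure Search implementation), the sketch below stores every prefix of each word at index time, so a plain term lookup for "the" finds "they" and "there" without any wildcard expansion. The names `edge_ngrams` and `build_index` and the gram sizes are assumptions chosen for the example:

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=2, max_gram=10):
    """Prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(docs):
    """Map every stored prefix to the (doc id, original word) pairs it came from."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for gram in edge_ngrams(word):
                index[gram].add((doc_id, word))
    return index


docs = {1: "They left early", 2: "There is the report"}
index = build_index(docs)

# A plain term lookup for "the" (no '*') now finds the longer words too.
matches = sorted(index["the"])
print(matches)  # [(1, 'they'), (2, 'the'), (2, 'there')]
```

Note that the query term itself is looked up as typed; only the indexed text is expanded into prefixes. In a real service this would presumably mean using different analyzers at index time and at query time.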
Hope this helps. Please let me know if you have any further questions.
Nate

Thanks for the reply, but that doesn't seem to be the issue; it seems to be an issue with the boosting function I have on the search.
When I removed the boosting function, partial highlighting worked as expected. When I added the boosting function back in, partial highlighting stopped working. Can you verify whether that is a bug?
Here is my boosting function:
"scoringProfiles": [
  {
    "name": "PreRiskBoost",
    "text": null,
    "functions": [
      {
        "fieldName": "PreRiskCount",
        "freshness": null,
        "interpolation": "linear",
        "magnitude": {
          "boostingRangeStart": 1,
          "boostingRangeEnd": 99,
          "constantBoostBeyondRange": true
        },
        "distance": null,
        "tag": null,
        "type": "magnitude",
        "boost": 10
      }
    ],
    "functionAggregation": "sum"
  }
],
"defaultScoringProfile": "PreRiskBoost"
Do you know why having the Boosting function prevents partial highlighting from working?

Related

How do you create Solr Queries with wildcard searches and scoring, fuzzy search, distance searching and other features

I am trying to build a search over my domain with Solr, and I am having trouble producing a keyword search that fulfils our requirements. My issue:
When my users search, the requirement is that the search must return results with partial token matches. For example:
Consider the text field: "CA-1234-ABCD California project"
The following keyword searches (what the user puts in the search field) should match this field:
"California"
"Cali"
"CA-1234-ABCD"
"ABCD"
"ABCD-1234"
etc.
With a text_en field (as configured in the example schema), the tokenization, stemming and grammar processing will allow non-wildcard searches to work for partial words/tokens in many cases, but Solr still seems limited to exact token match in many situations. For example, the following query does not match:
name:cali
The only way I have found to get the user experience that is required is to use a wildcard search:
name:*cali*
The problem with this is that tf scoring (and it seems other functionality like fuzzy searches) don't work with a wildcard search.
The question is, is there a way to get partial token matching (for all tokens not just those that have common stems/etc.) while retaining tf scoring and other advanced query functionality?
My best workaround at the moment is a query that includes both wildcard and non-wildcard clauses, such as:
name:cali OR name:*cali*
but I don't know if that is a good strategy here. Does SOLR provide a way?

Elements getting added in Solr index but not able to search elements as desired

I'm working with solr to store web crawling search results to be used in a search engine. The structure of my documents in solr is the following:
{
word: The word received after tokenizing the body obtained from the html.
url: The url where this word was found.
frequency: The no. of times the word was found in the url.
}
When I go to the Solr dashboard on my system, which is http://localhost:8983/solr/#/CrawlerSearchResults/query, I'm able to find a word, say "Amazon", with the query "word: Amazon", but when I search for Amazon directly I get no results. Could you please help me out with this issue?
Image links below.
First case
Second case (No results)
Thanks,
Nilesh.
In your second example, the value is searched against the default search field (since you haven't provided a field name). This is by default a field named _text_.
To support just typing a query into the q parameter without field names, you can either set the default field name to search in with df=word in your URL, or use the edismax query parser (defType=edismax) and the qf parameter (query fields). qf allows multiple fields and giving them a weight, but in your case it'd just be qf=word.
Second - what you're doing seems to replicate what Lucene is doing internally, so I'm not sure why you'd do it this way (each word is what's called a "token", and each count is what's called a term frequency). You can write a custom similarity to add custom scoring based on these parameters.
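The two options above can be sketched as request URLs built in Python. The core name comes from the question; the `/select` handler path and the helper names are assumptions:

```python
from urllib.parse import urlencode

# Assumed select handler for the core named in the question.
BASE = "http://localhost:8983/solr/CrawlerSearchResults/select"


def with_default_field(term):
    """Option 1: keep the standard parser, point the default field at 'word'."""
    return BASE + "?" + urlencode({"q": term, "df": "word"})


def with_edismax(term):
    """Option 2: use edismax and list the query fields in qf."""
    return BASE + "?" + urlencode({"q": term, "defType": "edismax", "qf": "word"})


print(with_default_field("Amazon"))
print(with_edismax("Amazon"))
```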

Azure search contains word not working as expected

I am new to Azure Search. I am trying to use "contains" logic in my search query. I looked it up and found out that I need to add something like following in my search query.
&queryType=full&search=/.*_search.*/
where _search is the string I want to search for. Now what happens is that the "contains" logic works fine. For example, I try to search sweep and I get well sweep-cmu in the results.
But when I search well sweep-cmu, I get zero results. Why, and how can I improve my query to get results when I enter both partial and full strings?
If you want exact match for the search query please surround the query with double quotes.
eg: "well sweep-cmu"
This will return all documents which contain the exact phrase.
Since you've just started to play with Azure Search you might find this article particularly interesting. It explains how the full text search works in Azure Search.
https://learn.microsoft.com/en-us/azure/search/search-lucene-query-architecture
In order to get results for partial terms, you should use wildcard expressions in your search queries. The above article explains this in detail.
PS: Some wildcard queries can be very expensive and hence slow.
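One possible reason the single regex returns nothing for well sweep-cmu is that a regular expression is matched against individual indexed terms, and no single term contains a space or hyphen if the analyzer splits the text into well, sweep and cmu. Under that assumption, a hedged Python sketch of a "contains" query that applies one regex per word looks like this (the helper names are hypothetical, and requiring all words via searchMode=all is a design choice, not something stated above):

```python
import re
from urllib.parse import urlencode


def contains_query(user_input):
    """One /.*term.*/ clause per alphanumeric run in the input."""
    terms = re.findall(r"[A-Za-z0-9]+", user_input.lower())
    return " ".join("/.*{}.*/".format(term) for term in terms)


def query_string(user_input):
    """URL-encoded parameters: full Lucene syntax, every clause required."""
    return urlencode({
        "queryType": "full",
        "searchMode": "all",
        "search": contains_query(user_input),
    })


print(contains_query("well sweep-cmu"))  # /.*well.*/ /.*sweep.*/ /.*cmu.*/
```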

Solr highlighting gives field/snippets with ANY term, instead of those that satisfy the query fully

I'm using Solr 5.x with the standard highlighter, and I'm getting snippets that match only one of the search terms, even if I specify q.op=AND.
I need ONLY the fields and snippets that match ALL the terms (unless I say q.op=OR or just omit it), i.e. the field/snippet must satisfy the query. Solr does return the field/snippet that has all the terms, but it also returns many others.
I'm using hl.fl=* to get only the fields that contain the terms, and searching against the default field ('text', containing the full doc). I need to use * since I have multiple dynamic fields. Most fields are of type 'text_general' (for search and highlighting), and some are of type 'string' for faceting.
If it's not possible for snippets to have all the terms, I MUST get only the fields that satisfy the query fully (the question talks mostly about matching all the terms, but the search query can become arbitrarily complex, so the fields/snippets should match the query).
Next, I also need snippets highlighted for proximity-based searches. What should I do/use for this? The fields returned in highlighting in this scenario should also satisfy the proximity query (currently I get any field that contains any term, without regard to proximity constraints, other query terms, etc.).
Thanks for your help.
I've also encountered the same problem with highlighting. In my case, the query like
(foo AND bar) OR eggs
highlighted eggs and foo even though bar was not present in the document. I didn't manage to come up with a proper solution, but I devised a dirty workaround.
I use the following query:
id:highlighted_document_id AND text:(my_original_query)
with debugQuery set to true. Then I parse explain text for highlighted_document_id. The text contains the terms from the query, which have contributed to the score. The terms, which should not be highlighted, are not present in the explanation.
The Python regex expressions I use to extract the terms (valid for Solr 5.2.1):
import re
# Both patterns capture query terms from the debugQuery explain text.
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
Then I simply search for the markings in the highlighted text and remove them if the marked term doesn't match any of the terms found by term_regex and wildcard_term_regex.
The solution is probably pretty limited, but works for me.
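A rough Python sketch of that post-processing step, using the two regexes above. The `<em>` tags, the sample explain line and the helper names are assumptions for illustration; real explain output and stemmed terms may need extra handling:

```python
import re
from fnmatch import fnmatchcase

term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
highlight_regex = re.compile(r'<em>(.*?)</em>')  # assumed highlight markup


def contributing_terms(explain_text):
    """Terms (plain or wildcard) that appear in the score explanation."""
    return set(term_regex.findall(explain_text)) | set(
        wildcard_term_regex.findall(explain_text))


def prune_highlights(snippet, explain_text):
    """Drop the markup around words that did not contribute to the score."""
    patterns = contributing_terms(explain_text)

    def keep_or_strip(match):
        word = match.group(1)
        # fnmatchcase treats 'egg*' as a wildcard and 'eggs' as a literal.
        if any(fnmatchcase(word.lower(), p) for p in patterns):
            return match.group(0)
        return word

    return highlight_regex.sub(keep_or_strip, snippet)


# Hypothetical explain line for a document that matched only 'eggs'.
explain = "0.35 = weight(text:eggs in 12) [DefaultSimilarity], result of:"
print(prune_highlights("<em>foo</em> and <em>eggs</em>", explain))
```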

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion works by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
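The effect can be reproduced with a toy model in Python. The stemmer below is a deliberately crude stand-in (an assumption, not the stemmer Solr uses) that happens to map both words to "gatorad":

```python
def toy_stem(word):
    """Toy stemmer: lowercase, then drop a trailing 's' and a trailing 'e'."""
    word = word.lower()
    if word.endswith("s"):
        word = word[:-1]
    if word.endswith("e"):
        word = word[:-1]
    return word


# Only stems are stored in the (toy) inverted index.
indexed_terms = {toy_stem(w) for w in ["Gatorade", "gatorades", "water"]}


def term_search(query):
    """Normal query: the input is stemmed before the term lookup."""
    return toy_stem(query) in indexed_terms


def prefix_search(query):
    """Wildcard query: the prefix is compared as typed against stored stems."""
    prefix = query.rstrip("*")
    return sorted(t for t in indexed_terms if t.startswith(prefix))


print(term_search("gatorade"))     # True
print(prefix_search("gatorade*"))  # []
print(prefix_search("gatorad*"))   # ['gatorad']
```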
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory.
This will create index terms for n-grams, or parts of words. "Documents", with a min ngram size of 5 and a max ngram size of 8, would be indexed as: Docum, Docume, Documen, Document.
There is a bit of a tradeoff in index size and indexing time. One of the Solr books quotes, as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
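To get a feel for why edge n-grams are cheaper than full n-grams, here is a small Python sketch that enumerates both for a single word. It only imitates what the filters produce; the function names are made up for the example and no lowercasing is applied:

```python
def edge_ngrams(word, min_gram, max_gram):
    """Prefixes only: one gram per length."""
    upper = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, upper + 1)]


def all_ngrams(word, min_gram, max_gram):
    """Every substring of each length, from every start position."""
    upper = min(max_gram, len(word))
    return [word[i:i + n]
            for n in range(min_gram, upper + 1)
            for i in range(len(word) - n + 1)]


print(edge_ngrams("Documents", 5, 8))      # ['Docum', 'Docume', 'Documen', 'Document']
print(len(all_ngrams("Documents", 5, 8)))  # 14
```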
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution
