Synonym Maps in Azure Search, synonym phrases - azure-cognitive-search

I'm trying to use synonym maps in Azure Search and I'm running into a problem. I want several words and phrases to map to a single search query.
In other words, when I search for any of:
product 123, product0123, product 0123
I want the search to return results for the query phrase:
product123.
After reading the tutorial, it all seemed pretty straightforward.
I'm using the .NET Azure.Search SDK 5.0, so I've done the following:
var synonymMap = new SynonymMap
{
    Name = "test-map",
    Format = SynonymMapFormat.Solr,
    Synonyms = "product 123, product0123, product 0123=>product123\n"
};
_searchClient.SynonymMaps.CreateOrUpdate(synonymMap);
and I use the map on one of the search fields:
index.Fields.First(x => x.Name == "Title").SynonymMaps = new[] {"test-map"};
So far so good. Now if I search for product0123, I get results for product123, as I would expect. But if I search for the phrase product 123 or product 0123, I get a bunch of irrelevant results. It's almost as if synonym maps do not work with multi-word items.
So my question is: am I using synonym maps incorrectly, or do these maps only work with single-word synonyms?

Are the phrases product 123 and product 0123 in double quotes? Phrases must be in double quotes ("product 123"). Double quotes are the phrase search operator, and for synonyms they ensure that the terms in the phrase are analyzed and matched against the rules in the synonym map as a phrase. Without them, the query parser splits the unquoted phrase into individual terms and tries synonym matching on each term separately. The query becomes product OR 123 in that case.
This documentation explains how queries are parsed (stage 1) and analyzed (stage 2). Synonyms are applied in the second stage.
To answer your second question in the comment: unfortunately, double quotes are required to match multi-word synonyms. However, as the application developer, you have full control over what gets passed to the search service. For example, given the query product 123 from the user, you can rewrite the query under the hood to improve precision and recall before it gets passed to the search service. Phrase or proximity searches can be used to improve precision, and fuzzy or wildcard (such as prefix) searches can be used to improve recall. You would rewrite the query product 123 to something like "product 123"~10 product 123, and synonyms will apply to the phrased part of the query.
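As a rough illustration of that rewrite, here is a minimal client-side sketch in Python. The function name rewrite_query and the default slop of 10 are assumptions made for illustration, not part of any SDK. As far as I know, the ~N proximity operator is only available with the full Lucene query syntax (queryType=full), so treat that as a prerequisite to verify.

```python
def rewrite_query(user_query: str, slop: int = 10) -> str:
    """Return '"<phrase>"~slop <terms>' for multi-word input.

    Hypothetical helper: the quoted part lets multi-word synonym rules
    match, and the unquoted terms keep the original recall.
    """
    # Drop any quotes the user typed so the phrase we build stays well-formed.
    terms = user_query.replace('"', " ").split()
    phrase = " ".join(terms)
    if len(terms) < 2:
        # A single term needs no phrase wrapping.
        return phrase
    return f'"{phrase}"~{slop} {phrase}'


if __name__ == "__main__":
    print(rewrite_query("product 123"))
    print(rewrite_query("product0123"))
```

The result is then sent as the search text in place of the raw user input.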
Nate

Related

Solr query string not working for full text searches

I'm following this tutorial on how to perform indexing on sample documents using Solr. The default collection is "gettingstarted" as shown. Now I'm trying to query it. There are 52 entries as shown:
However, when I replace the q argument with, say, electronics, it should return 14 results, but I get nothing.
When I replace the query string q with cat:electronics, I actually get the 14 results. But why is this the case? Isn't q=word supposed to search for word wherever it appears?
No, it's not. Your assumption that:
isn't q=word supposed to search for word wherever it appears?
is wrong. If you're using word as your only query, and nothing more - you're searching for word in the default search field. It does not search all available fields in all available documents.
Also be aware that the default query parser assumes that your query is in the Lucene Query Syntax. To handle more "natural" querying, you can use the edismax query parser. This query parser supports the qf parameter that tells Solr which fields to search, instead of having to use the cat:electronics syntax. Your example would then be q=electronics&qf=cat.
In the example documents you've given, qf=series_t author name cat is probably a decent value to search all these fields for the given query. You can also append ^<weight> to a field name to give hits in the different fields different weights. qf=name^10 cat would give a hit in name ten times the weight of a hit in the cat field.
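To make the parameters concrete, here is a small Python sketch that builds such a request URL. The helper name build_edismax_url and the host and port in the usage example are assumptions; q, defType and qf are the standard Solr parameters discussed above.

```python
from urllib.parse import urlencode


def build_edismax_url(base_url, query, field_weights):
    """Build a Solr select URL that searches several fields via edismax.

    field_weights maps a field name to its boost; a boost of 1 is written
    as the bare field name.
    """
    qf = " ".join(
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in field_weights.items()
    )
    params = {"q": query, "defType": "edismax", "qf": qf}
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed local Solr instance with the "gettingstarted" collection.
    url = build_edismax_url(
        "http://localhost:8983/solr/gettingstarted/select",
        "electronics",
        {"name": 10, "cat": 1},
    )
    print(url)
```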

Azure Search: boost results that contains word

I have an airports database in Azure Search. When searching, I would like to boost airports that contain the word "international" in the airport name.
Given two results with the same score, I would like to boost the one that has the word "international" in the airport name using just Azure Search (i.e., if possible, without any code to manipulate the results after retrieving them).
I tried term boosting, but it returns a list of airports that have "international" in them, which is not what I want.
I looked at the scoring functions, but none of them seems to suit my needs.
In essence, I do not want to "match" results that contain the word "international";
I want to "boost" results that contain the word "international" after the user keys in the query text.
If you want results containing a term to score higher, but you don't want to require matching documents to contain the term, you can use OR as well as AND. For example, if the user typed "Dallas", your query could look like this:
Dallas OR (Dallas AND airportName:international)
If you further want to control the impact that the term international has on the score, you can use term boosting.
You might find this article on how Azure Search processes queries to be helpful.
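A minimal sketch of building that query string on the client, in Python. The helper name boost_query and its parameters are hypothetical. Fielded terms such as airportName:international and the ^N boost belong to the full Lucene query syntax, so this assumes the request uses it.

```python
def boost_query(user_text, field="airportName", term="international", boost=None):
    """Return '<q> OR (<q> AND field:term[^boost])'.

    Documents matching only <q> are still returned; those that also
    contain the term in the given field score higher.
    """
    words = user_text.split()
    # Parenthesize multi-word input so OR/AND apply to the whole input.
    base = words[0] if len(words) == 1 else "(" + " ".join(words) + ")"
    boosted = f"{field}:{term}" if boost is None else f"{field}:{term}^{boost}"
    return f"{base} OR ({base} AND {boosted})"


if __name__ == "__main__":
    print(boost_query("Dallas"))
    print(boost_query("Dallas", boost=2))
```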

How can I tune the Retrieve and Rank ranker with a dictionary/model of domain specific phrases?

We are trying to group phrases together in order to improve results.
For instance, if the user asks a question like "When do I have to change the filter of my air conditioning?" with a domain specific phrase such as “air conditioning”, R&R returns some answers containing the term “air” and no “conditioning” or it returns answers containing other terms like air bag or air filter.
This can be accomplished with a raw Solr instance by putting the phrase in quotes. So, the Solr query would look like the following:
...
"debug": {
"rawquerystring": "When do I have to change the filter of my \"air conditioning\" ?",
"querystring": "When do I have to change the filter of my \"air conditioning\" ?",
"parsedquery": "text:when text:do text:i text:have text:to text:change text:the text:filter text:of text:my PhraseQuery(text:\"air conditioning\") text:?",
"parsedquery_toString": "text:when text:do text:i text:have text:to text:change text:the text:filter text:of text:my text:\"air conditioning\" text:?",
...
However, the R&R guide states:
The syntax is different from standard Solr syntax as follows:
You can search for a single term, or a phrase. You do not need to
surround the phrase with double quotation marks as with Solr, but you
can include phrases in the query and they are accounted for by the
ranker models.
We could not find more details regarding the above statement.
But, as we understand it, the ranker is supposed to identify phrases. If that is the case, is there a way to set a dictionary of phrases in order to tune the ranker?
Or could we set our own model of legal phrases? What are the options to accomplish this goal?
Thanks
Currently R&R doesn't support strict phrase querying, though there are features that take term ordering and adjacent terms into consideration. We are working on a new version of the service, in which users will be able to use the full regular Solr query syntax (including specifying phrases) for document retrieval.

Multi-word synonym search in Solr

I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".
Yes, sadly, this is a well-known problem caused by how the Solr query parser splits on whitespace before analyzing. So instead of seeing "spider" before "man" in the token stream, the analyzer sees each word on its own: just "spider" with nothing before or after it, and just "man" with nothing before or after it.
This is because most Solr query forms treat a space as basically an "OR". Solr searches for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, and then generating a query from that.
For more background, there's this blog post
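The effect can be reproduced with a toy model in Python. This is only an illustration of the ordering problem, not Solr's actual implementation; match_synonyms and RULES are made-up names, and the rules map token sequences to a single replacement token.

```python
def match_synonyms(tokens, rules):
    """Greedy longest-match synonym replacement over a list of tokens."""
    longest = max(len(key) for key in rules)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            # No rule starts at this position; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


RULES = {("spider", "man"): "peter", ("spiderman",): "peter"}
query = "spider man"

# Analyzer sees the whole query: the two-token rule can fire.
whole = match_synonyms(query.split(), RULES)

# Query is split on whitespace first, then each chunk is analyzed alone:
# the two-token rule never sees "spider" next to "man".
per_term = [t for chunk in query.split() for t in match_synonyms([chunk], RULES)]

if __name__ == "__main__":
    print(whole)     # ['peter']
    print(per_term)  # ['spider', 'man']
```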
There are a number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a black box, and I've found it can generate some complex query forms that cause weird performance and relevance bugs.
Lucidworks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and can have synonym expansion applied.
OpenSource Connections' Match query parser. Searches a single field using a query-specified analyzer that runs before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy -- Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.
My usual strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this in both my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One important detail is that I choose a normalized form (e.g. "peter") that is only one word. I could organize it this way:
peter, spiderman, spider man => Mary Jane
but because Mary Jane is two words, it may (depending on other features of my search), match the two words separately as well as together. By choosing a single word form to normalize into, I make sure that my tokenizer won't try to break it up.
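The normalization idea can be sketched outside Solr in a few lines of Python. This is a toy model under simplifying assumptions (whitespace tokenization, lowercasing in place of ignoreCase="true", all query tokens required to match); parse_rule, normalize and matches are hypothetical names, not Solr APIs.

```python
def parse_rule(line):
    """Parse 'a, b c, d e => target' into {token_tuple: target}."""
    left, right = line.split("=>")
    target = right.strip().lower()
    return {
        tuple(phrase.lower().split()): target
        for phrase in left.split(",")
        if phrase.strip()
    }


def normalize(text, rules):
    """Replace every listed phrase with the single normalized token."""
    tokens = text.lower().split()
    longest = max(len(key) for key in rules)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.append(rules[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def matches(document, query, rules):
    """True if every normalized query token occurs in the normalized document."""
    doc_tokens = set(normalize(document, rules))
    return all(token in doc_tokens for token in normalize(query, rules))


rules = parse_rule("spiderman, spider man, Mary Jane => peter")
doc = "The Amazing Spider Man"

if __name__ == "__main__":
    print(normalize(doc, rules))                   # ['the', 'amazing', 'peter']
    print(matches(doc, "Mary Jane", rules))        # True
    print(matches(doc, "Mary is amazing", rules))  # False
```

Because the same normalization runs on both sides, "Mary" on its own is left untouched, which mirrors the behavior described above.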
It's a known limitation within Solr / Lucene. Essentially you would have to provide an alternative form of tokenization so that specific space delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do this client side - i.e. in your application that is calling Solr, when indexing, keep a list of synonym phrases and find / replace those phrase values with an alternative (for example removing the spaces or replacing it with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi, HelloThere, Wotcha => Hello
Similarly, when you search, replace any occurrences of "Hello There" in the query string with HelloThere, and they will then be matched as a synonym of Hello.
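A minimal sketch of that find-and-replace step in Python, to be applied to both the text being indexed and the query string. The function name collapse_phrases is hypothetical, and removing the spaces is just one of the replacement strategies mentioned above.

```python
import re


def collapse_phrases(text, phrases):
    """Replace each known phrase with its space-free form, e.g.
    'Hello There' -> 'HelloThere', ignoring case and extra whitespace."""
    # Handle longer phrases first so they are not pre-empted by shorter ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        replacement = "".join(words)
        # A function replacement avoids backslash handling in the result.
        text = re.sub(pattern, lambda m: replacement, text, flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(collapse_phrases("Well, hello   there friend", ["Hello There"]))
```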
Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on github. This works by maintaining a token stream so that it can work out if a combination of two or more sequential tokens matches one of the synonym phrases, and if it doesn't, it throws away the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach - would be nice to have by default in Solr as part of the SynonymFilter.

Compound word search engine design

We have a search function using SQL Server's Full-Text Search. It is an any word search and works very well.
However, quotation marks around compound terms don't work with Full-Text Search.
So, currently a search for "peanut butter" returns peanut butter first, then peanuts and butter, etc.
We want the system to recognize certain compound terms and exclude all else.
So a search for: coffee ethiopian ground - would still perform an any word search.
However, a search for: ground coffee - would recognize the compound term and return only exact matches for "ground coffee".
Is the only way to do this to build your own dictionary of compound terms? Are there any other options?
Thanks, Jon
As long as you use CONTAINS or CONTAINSTABLE, SQL Server should honor your double quotes and return only compound-term matches.
I suspect you are using FREETEXT or FREETEXTTABLE, which perform more of a natural-language search and ignore double quotes.
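If the application should decide when to quote, one possible approach is to build the CONTAINS search condition on the client from a list of known compound terms. The Python sketch below is an assumption about how that could look, not something from the answer above: contains_condition is a hypothetical helper, and it treats the input as a compound term only when the whole query equals a listed term.

```python
def contains_condition(user_query, compound_terms):
    """Build a full-text search condition string for CONTAINS.

    A known compound term becomes one quoted phrase; anything else
    becomes an any-word search with the words joined by OR.
    """
    # Strip quotes typed by the user, collapse whitespace, and lowercase.
    normalized = " ".join(user_query.replace('"', " ").lower().split())
    known = {" ".join(term.lower().split()) for term in compound_terms}
    if normalized in known:
        return f'"{normalized}"'
    return " OR ".join(f'"{word}"' for word in normalized.split())


if __name__ == "__main__":
    terms = ["ground coffee", "peanut butter"]
    print(contains_condition("ground coffee", terms))
    print(contains_condition("coffee ethiopian ground", terms))
```

The resulting string would be passed as a parameter to the CONTAINS predicate rather than concatenated into the SQL text.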
