Compound word search engine design - sql-server

We have a search function using SQL Server's Full-Text Search. It is an any-word search and works very well.
However, quotation marks around compound terms don't work with Full-Text Search.
So, currently a search for "peanut butter" returns peanut butter first, then peanuts and butter, etc.
We want the system to recognize certain compound terms and exclude all else.
So a search for: coffee ethiopian ground - would still perform an any word search.
However, a search for: ground coffee - would recognize the compound term and return only exact matches for "ground coffee".
Is the only way to do this to build your own dictionary of compound terms? Are there any other options?
Thanks, Jon

As long as you use CONTAINS or CONTAINSTABLE, SQL Server should honor your double quotes and return only matches for the compound term.
I suspect you are using FREETEXT or FREETEXTTABLE which performs more of a natural-language search and ignores double quotes.
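For illustration, a minimal sketch of the difference from ADO.NET; the connection string, Products table and Description column are placeholders, not from your schema:
using System;
using System.Data.SqlClient;

class FullTextPhraseDemo
{
    static void Main()
    {
        using (var conn = new SqlConnection("Server=.;Database=Shop;Integrated Security=true"))
        {
            conn.Open();

            // CONTAINS honours the double quotes: only rows containing the
            // exact phrase "peanut butter" come back.
            var phrase = new SqlCommand(
                "SELECT Id, Description FROM Products " +
                "WHERE CONTAINS(Description, '\"peanut butter\"')", conn);

            // FREETEXT ignores the quotes and matches peanut and/or butter,
            // which is the behaviour described in the question.
            var anyWord = new SqlCommand(
                "SELECT Id, Description FROM Products " +
                "WHERE FREETEXT(Description, 'peanut butter')", conn);

            foreach (var cmd in new[] { phrase, anyWord })
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine(reader["Description"]);
        }
    }
}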

Related

Synonym Maps in Azure Search, synonym phrases

I'm trying to use synonym maps in Azure Search and I'm running into a problem. I want to have several words and phrases map to a single search query.
In other words, when I search for any of:
product 123, product0123, product 0123
I want the search to return results for the query phrase:
product123.
After reading the tutorial it all seemed pretty straightforward.
I'm using the .NET Azure.Search SDK 5.0, so I've done the following:
var synonymMap = new SynonymMap
{
    Name = "test-map",
    Format = SynonymMapFormat.Solr,
    Synonyms = "product 123, product0123, product 0123=>product123\n"
};
_searchClient.SynonymMaps.CreateOrUpdate(synonymMap);
and I use the map on one of the search fields:
index.Fields.First(x => x.Name == "Title").SynonymMaps = new[] {"test-map"};
So far so good. Now if I do a search for product0123, I get results for product123 as I would expect. But if I search for a phrase like product 123 or product 0123, I get a bunch of irrelevant results. It's almost as if the synonym maps do not work with multi-word items.
So I guess my question is: am I using synonym maps incorrectly, or do these maps only work with single-word synonyms?
Are the phrases, product 123 or product 0123, in double quotes? The phrases are required to be in double quotes ("product 123"). Double quotes are the operator for phrase search, and in the case of synonyms they ensure that the terms in the phrase are analyzed and matched against the rules in the synonym map as a phrase. Without them, the query parser separates the unquoted phrase into individual terms and tries synonym matching on each term separately. The query becomes product OR 123 in that case.
This documentation explains how queries are parsed (stage 1) and analyzed (stage 2). The application of synonyms is done in the second stage.
To answer your second question in the comment, unfortunately double quotes are required to match multi-word synonyms. However, as an application developer, you have full control over what gets passed to the search service. For example, given the query product 123 from the user, you can rewrite the query under the hood to improve precision and recall before it gets passed to the search service. Phrase and proximity searches can be used to improve precision, and wildcard, fuzzy, or prefix searches can be used to improve recall. You would rewrite the query product 123 to something like "product 123"~10 product 123, and synonyms will apply to the phrased part of the query.
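For example, a rough sketch of that rewrite with the same Azure.Search SDK 5.0; the index client and helper names are assumptions, and the point is only the query-string manipulation plus the full Lucene query type:
using Microsoft.Azure.Search;
using Microsoft.Azure.Search.Models;

static class SynonymFriendlySearch
{
    // Turns a raw user query such as  product 123  into
    //   "product 123"~10 product 123
    // so the phrased part can hit the synonym map while the loose terms keep recall.
    static string RewriteQuery(string userQuery)
    {
        var trimmed = userQuery.Trim();
        return $"\"{trimmed}\"~10 {trimmed}";
    }

    public static DocumentSearchResult Search(ISearchIndexClient indexClient, string userQuery)
    {
        var parameters = new SearchParameters
        {
            // Full Lucene syntax is needed for the phrase and proximity operators.
            QueryType = QueryType.Full
        };
        return indexClient.Documents.Search(RewriteQuery(userQuery), parameters);
    }
}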
Nate

Solr/Lucene query lemmatization with context

I have successfully implemented a Czech lemmatizer for Lucene. I'm testing it with Solr and it works nicely at index time. But it doesn't work so well when used for queries, because the query parser doesn't provide any context (the words before or after) to the lemmatizer.
For example, the phrase pila vodu is analyzed differently at index time than at query time. It uses the ambiguous word pila, which could be the noun pila (a saw, e.g. a chainsaw) or a form of pít (the past tense of the verb "to drink").
pila vodu ->
Index time: pít voda
Query time: pila voda
.. so the word pila is not found and not highlighted in a document snippet.
This behaviour is documented on the Solr wiki (quoted below), and I can confirm it by debugging my code (only the isolated strings "pila" and "vodu" are passed to the lemmatizer).
... The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, ...
So my question is:
Is it possible to somehow change, configure or adapt the query parser so the lemmatizer would see the whole query string, or at least some context of individual words? I would like to have a solution also for different solr query parsers like dismax or edismax.
I know that there is no such issue with phrase queries like "pila vodu" (quotes), but then I would lose the documents without the exact phrase (e.g. documents with "pila víno" or even "pila dobrou vodu").
Edit - trying to explain / answer the following question (thank you @femtoRgon):
If the two terms aren't a phrase, and so don't necessarily come together, then why would they be analyzed in context to one another?
It would certainly be better to analyze only terms that occur together. For example, at indexing time the lemmatizer detects sentences in the input text and analyzes together only the words from a single sentence. But how can I achieve something similar at query time? Is implementing my own query parser the only option? I quite like the pf2 and pf3 options of the edismax parser; would I have to implement them again in my own parser?
The underlying idea is in fact a bit deeper, because the lemmatizer does word-sense disambiguation even for words that have the same lexical base. For example, the word bow has about 7 different senses in English (see Wikipedia), and the lemmatizer distinguishes such senses. So I would like to exploit this potential to make searches more precise: to return only documents containing the word bow in the concrete sense required by the query. So my question could be extended to: how do I get the correct <lemma;sense> pair for a query term? The lemmatizer is very often able to assign the correct sense if the word is presented in its usual context, but it has no chance when there is no context.
Finally, I implemented my own query parser.
It wasn't that difficult thanks to the edismax sources as a guide and a reference implementation. I could easily compare my parser results with the results of edismax...
Solution:
First, I analyze the whole query string together. This gives me the list of "tokens".
There is a little clash with stop words: it is not that easy to get tokens for stop words, as they are omitted by the analyzer, but you can detect them from the PositionIncrementAttribute.
From the "tokens" I construct the query in the same way edismax does (e.g. creating all 2-token and/or 3-token phrase queries combined in DisjunctionMaxQuery instances).

How can I permit Full-Text Search to search for a conjunction?

When I do a search like People in England, the full-text search engine ignores the whole search and returns 0 results. I think it is because it separates the words ("People", "in" and "England") and ignores the word "in" because it would return too many results.
I don't want the exact phrase ("People in England"), but I'd like to find the words People, in and England in the same text.
You want the "in" keyword (like OR, AND, ...) to be considered a simple word in the criteria, right?
You must configure the stopwords (stoplist) for your full-text query; here is a link about it:
https://dba.stackexchange.com/questions/44032/searching-for-keywords-in-fulltext-indexes-using-the-contains-function
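If it helps, a rough sketch of that configuration, executed from ADO.NET; the stoplist, table, column and index names are placeholders, and the linked answer covers the details:
using System.Data.SqlClient;

static class StoplistSetup
{
    // Builds a stoplist from the system one, removes "in" (LCID 1033, English),
    // and attaches it to the table's full-text index so "in" is no longer
    // discarded from queries; the index is then repopulated so the change takes effect.
    public static void MakeInSearchable(SqlConnection conn)
    {
        var sql = @"
            CREATE FULLTEXT STOPLIST PeopleStoplist FROM SYSTEM STOPLIST;
            ALTER FULLTEXT STOPLIST PeopleStoplist DROP 'in' LANGUAGE 1033;
            ALTER FULLTEXT INDEX ON dbo.People SET STOPLIST PeopleStoplist;";
        using (var cmd = new SqlCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }

    // With that in place, all three words can be required in the same text:
    //   SELECT * FROM dbo.People
    //   WHERE CONTAINS(Bio, '"People" AND "in" AND "England"');
}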

Multi-word synonym search in Solr

I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".
Yes, sadly this is a well-known problem, due to how the Solr query parser breaks the query up on whitespace before analyzing it. So instead of seeing "spider" before "man" in the token stream, you simply see each word on its own: just "spider" with nothing before/after, and just "man" with nothing before/after.
This is because most Solr query forms see a space as basically an "OR": they search for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, and then generating a query from that.
For more background, there's this blog post
There are a large number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a black box, and I've found it can generate some complex query forms that cause weird performance and relevance bugs.
Lucidworks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and can have synonym expansion applied.
OpenSource Connections' Match query parser. Searches a single field using a query-specified analyzer run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy. Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query into a non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.
My usual strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this in both my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One of the important details is that I chose a normalized form (e.g. "peter") that is only one word. I could have organized it this way:
peter, spiderman, spider man => Mary Jane
but because Mary Jane is two words, it may (depending on other features of my search) match the two words separately as well as together. By choosing a single-word form to normalize into, I make sure that my tokenizer won't try to break it up.
It's a known limitation within Solr / Lucene. Essentially, you would have to provide an alternative form of tokenization so that specific space-delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do it client side, i.e. in the application that is calling Solr: when indexing, keep a list of synonym phrases and find/replace those phrase values with an alternative form (for example, removing the spaces or replacing them with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi, HelloThere, Wotcha => Hello
Similarly, when you search, replace any occurrences of "Hello There" in the query string with HelloThere, and then they will be matched as a synonym of Hello.
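A minimal sketch of that client-side replacement; the phrase list and single-token forms are application-specific choices, and the same helper is applied to documents before indexing and to query strings before searching:
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PhraseNormalizer
{
    // Phrases that should survive tokenization as a single token.
    static readonly Dictionary<string, string> Phrases = new Dictionary<string, string>
    {
        { "Hello There", "HelloThere" },
        { "spider man", "spiderman" }
    };

    // Run this over document text before indexing and over the query string
    // before sending it to Solr, so both sides agree on the single-token form.
    public static string Normalize(string text)
    {
        foreach (var pair in Phrases)
            text = Regex.Replace(text, Regex.Escape(pair.Key), pair.Value, RegexOptions.IgnoreCase);
        return text;
    }
}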
Alternatively, you could use the AutoPhraseTokenFilter that Lucidworks created, available on GitHub. This works by maintaining a token stream so that it can work out whether a combination of two or more sequential tokens matches one of the synonym phrases, and if it doesn't, it throws away the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach; it would be nice to have by default in Solr as part of the SynonymFilter.

Questions about SQL Server Full-Text Search Syntax

I'm trying to work out some of the finer details of the Full-Text Search syntax. I have the basics down but have run up against the following questions.
Near the top of this page, it shows cases where double quotes are required and includes the example "oatmeal". Why does this need double quotes? Is this a typo?
Near the bottom of the same page, it states that any instances of AND, AND NOT, OR, or OR NOT should be wrapped in quotes in some cases. Why? Aren't these stop words anyway?
This page provides a number of examples that include quoted and unquoted search terms. What is the difference between CONTAINS(Column, 'term') and CONTAINS(Column, '"term"')?
1) Full-text search uses double quotes as a text delimiter, and it's good practice to use them, especially if you are likely to have more complex search terms. E.g. you can put phrases or words in double quotes.
"Oatmeal"
"Hot Oatmeal"
2) I think the logic in the "AND, AND NOT, OR, OR NOT" section is to find the Boolean terms and wrap the content on either side of them in quotes.
e.g. Oatmeal or Hot Oatmeal
translates to
'"Oatmeal" or "hot oatmeal"'
rather than
'oatmeal "or" hot oatmeal'
3) It should be contains(column, 'terms'); containstable uses the three-argument form containstable(table, column, 'terms'). I have done some testing on a database here with 30k documents in it: contains(,'"dentist"') and contains(,'dentist') returned the same 652 rows, and containstable returned them with the same ranking.
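To make the third point concrete, a small sketch of both forms against a hypothetical Documents table (the table, column and key names are assumptions):
using System;
using System.Data.SqlClient;

static class ContainsVsContainsTable
{
    public static void Run(SqlConnection conn)
    {
        // CONTAINS is a predicate in the WHERE clause; for a single word the
        // quoted and unquoted forms return the same rows.
        var containsSql =
            "SELECT Id, Title FROM Documents " +
            "WHERE CONTAINS(Body, '\"dentist\"')";

        // CONTAINSTABLE is a rowset function: it returns [KEY] and [RANK],
        // which you join back to the base table to order by relevance.
        var containsTableSql =
            "SELECT d.Id, d.Title, ft.[RANK] " +
            "FROM CONTAINSTABLE(Documents, Body, '\"dentist\"') AS ft " +
            "JOIN Documents AS d ON d.Id = ft.[KEY] " +
            "ORDER BY ft.[RANK] DESC";

        foreach (var sql in new[] { containsSql, containsTableSql })
            using (var cmd = new SqlCommand(sql, conn))
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader["Title"]);
    }
}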
