Multi-word synonym search in Solr - solr

I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".

Yes sadly this is a well known problem due to how the Solr query parser breaks up on whitespace before analyzing. So instead of seeing "spider" before "man" in the token stream, you instead simply see each word on its own. Just "spider" with nothing before/after and just "man" with nothing before/after.
This is because most Solr query forms see a space as basically an "OR". Search for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, then generating a query from that.
For more background, there's this blog post
There's a large number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a blackbox, and I've found it can generate some complex query forms that generate weird performance and relevance bugs.
Lucidwork's autophrase query parser By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and can have synonym expansion applied
OpenSource Connection's Match query parser. Searches a single field using a query-specified analyzer run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy -- Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.

My usually strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this both in my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One of the important details, is that I choose a normalized form (e.g. "peter") that is only one word. I could organize it this way:
peter, spiderman, spider man => Mary Jane
but because Mary Jane is two words, it may (depending on other features of my search), match the two words separately as well as together. By choosing a single word form to normalize into, I make sure that my tokenizer won't try to break it up.

It's a known limitation within Solr / Lucene. Essentially you would have to provide an alternative form of tokenization so that specific space delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do this client side - i.e. in your application that is calling Solr, when indexing, keep a list of synonym phrases and find / replace those phrase values with an alternative (for example removing the spaces or replacing it with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi HelloThere Wotcha => Hello
Similarly when you search, replace any incidences of "Hello There" in the query string with HelloThere and then they will be matched as a synonym of Hello.
Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on github. This works by maintaining a token stream so that it can work out if a combination of two or more sequential tokens matches one of the synonym phrases, and if it doesn't, it throws away the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach - would be nice to have by default in Solr as part of the SynonymFilter.

Related

Querytype=Full and searching for stop words returns no results

When using azure cognitive search, we are using full query syntax. When searching for something like: the document we create a query like this (this is a simplified example):
(Title:the OR Contents:the) AND (Title:document OR Contents:document)
(we need to split up the query for unrelated reasons)
The problem is that the could be a stopword in the language we are searching in (we search in several languages), causing the entire query to fail. We would like to be able to ignore stop words in generating queries like this, of have the search engine simply return true for the specific stop word search parts
I figure the latter is not possible. (or is it?). Might there be a way to query the stop words for specific language analyzers so we can exclude the stop words ourselves? Or is there a way to alter out query to be able to handle stop words better?
If you want to strip stop words from your search query the only thing I can think of is calling the analyzer with the search query and check the returned tokens.
In this example you would call the en.microsoft analyzer with the search query "the document".
The tokens returned only contain "document", so you know "the" is considered a stop word by the analyzer. But when searching multiple languages you might need to call multiple analyzers and strip stop words for all those languages.

Can changes in synonyms.txt file take effect without reindex?

We are using Sunspot-solr 4.0 when I update synonyms file it does not change anything in search. Do I really need to re-index after making changes in synonyms.txt or there is any other trick to update synonyms file that I am missing?
That depends on when you're expanding the synonyms. If you're expanding at query time, the updates will be visible without any reindexing, but if you're expanding at index time (which is the recommended way), you'll have to reindex to get the new synonyms included in the index.
The reasoning behind recommending expansion at index time compared to query time is described in the old wiki:
This is because there are two potential issues that can arrise at query time:
The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, and will not know that they match a synonym.
Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occuring in a document
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Televesion and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
There's an really detailed explanation of what's actually happening behind the scenes available in Better synonym handling in Solr.
As long as you're aware of these issues and the trade-off, doing query time synonyms could work fine - but you'll have to test it against your queries and what you expect the results to be - and be aware of the pitfalls.

How to only remove stopwords when they are not nouns?

I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with high IDF terms. However, the corpus includes a lot part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to get removed by the stopword filter factory as they need to be searchable.
The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?
Thanks!
Only you can decide whether this is the right approach. If you can integrate POS tagger in and it gives you useful results - that's good.
But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (store=false) field and process it without WordDelimiterFilterFactory all together. Then you search over both copies of your data, possibly with different boost for different fields.

Return stemmed word in Solr

We have stemming in our Solr search and we need to retrieve the word/phrase after stemming. That is if I search for "oranges", through stemming a search for "orange" is carried out. If I turn on debugQuery I would be able to see this, however we'd like to access it through the result if possible. Basically, we need this, because we pass the searched word as a parameter to a 3rd party application which highlights the word in an online PDF reader. Currently, if a user searches for "oranges" and a document contains "orange", then the PDF wouldn't highlight anything since it tries to highlight "oranges" not "orange".
Thanks all in advance,
Krt_Malta
I've no experience with Solr but if you need it just for presentation to users you could stem their queries using the same stemmer Solr uses yourself. This would probably be faster since it would avoid a trip to Solr's index. For English this would presumably be http://tartarus.org/~martin/PorterStemmer/ - or you could check Solr's implementation.
However, a word of caution, most stemming algorithms do not guarantee that stemmed words will be actual words. Check here http://snowball.tartarus.org/algorithms/english/stemmer.html for examples.
You could use the implicit analysis request handler to get the stemmed word.
For your example, if you are using the text_en field and the Snowball Stemmer, the URL
<YOUR SOLR HOST>/solr/<YOUR COLLECTION>/analysis/field?analysis.query=oranges&analysis.fieldtype=text_en&verbose_output=1
would give you a json response, including the following:
"org.apache.lucene.analysis.snowball.SnowballFilter",
[
{
"text": "orang",
...

Oracle Text - Index a BLOB Field (which contains PDF data)

Do any of you have any experience with using Oracle Text to search for content inside PDF files?
I have a table, with a field called FILEDATA(blob).
I would like to do the following query:
SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;
However, i'm not too sure about the type of index to add to it.
I found the following code:
begin
ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER');
ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-');
end;
/
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
parameters ('lexer doc_lexer sync (on commit)');
Ref: http://www.devx.com/dbzone/Article/21563/1954
I have no idea what BASIC_LEXER is. I'm at a bit of a loss. I shall endeavour to continue searching for an answer. Any help would be great.
Thanks.
I've used Oracle Text to index not only PDF's but other data like XML structures. Oracle has the concept of lexers which take content and parses, tokenizes and indexes the tokens. The basic lexer handles English words, there are other lexers for Chinese, Japanese, Korean, etc. The printjoin attribute allows you to index characters that are normally excluded such as hyphes, quotes, etc.
The index you have defined above will work. Keep in mind that Oracle Text indexing is an asynchronous process, meaning the commit occurs and then sometime in the future the document is indexed. However you will need to synchronize the index as part of a scheduled job or the like. With the option "sync (on commit)" on your index, it will index the document as part of the transaction. This is noteworthy only if you are indexing sizable PDF documents.
I would recommend utilizing progressive relaxation for any search you may want to run, as it can being with a restrictive search and expand out to a more generic search, thereby providing the user with results that are decreasing in relevancy. For instance:
<query>
<textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
</progression>
</textquery>
<score datatype="INTEGER" algorithm="COUNT"/>
</query>
The above query tokenizes the search keywords "cat dog" attempts to find them as a phrase, then any documents contains cat AND dog (not necessarily beside each other), then any document containing cat OR dog, documents containing both words are scored higher than if a document just has a single one. Futhermore the structure automatically dedups the results as it returns them.
All of that being said, you could simply define your index as:
create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT)
indextype is ctxsys.context
parameters ('sync (on commit)');
and it would probably work very well for your needs. You would only need to change the behavior of the lexer if you have a need for doing so. I hope this helps.

Resources