The synonyms don't seem to function in Azure Search
I updated my synonym map with the following payload:
{
  "name": "synonymmap1",
  "format": "solr",
  "synonyms": "Bob, Bobby, Bobby\nBill, William, Billy\nHarold, Harry\nElizabeth, Beth\nMichael, Mike\nRobert, Rob\n"
}
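For reference, a payload like this is uploaded with a PUT against the synonym maps endpoint of the service. A sketch of the request (service name, admin key, and preview api-version are placeholders, not values from this thread):

```http
PUT https://[service-name].search.windows.net/synonymmaps/synonymmap1?api-version=[preview-api-version]
Content-Type: application/json
api-key: [admin-api-key]

{
  "name": "synonymmap1",
  "format": "solr",
  "synonyms": "Bob, Bobby\nBill, William, Billy\n"
}
```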
When I then retrieve the synonym map, I see this:
{
"@odata.context":
"https://athenasearchdev.search.windows.net/$metadata#synonymmaps",
"value": [
{
"@odata.etag": "\"0x8D4E7F3C1A9404D\"",
"name": "synonymmap1",
"format": "solr",
"synonyms": "Bob, Bobby,Bobby\n\r\n Bill, William, Billy\n\r\n Harold, Harry\n\r\n Elizabeth, Beth,Liza, Elize\n\r\n Michael,Mike\n\r\n Robert, Rob\n\r\n"
}
]
}
However, the synonyms don't seem to work: for example, the results for a search on "Mike" and a search on "Michael" are not identical.
I understand this is a preview feature, but I wanted help with the following:
a) Once terms are defined as synonyms, should we not expect exactly the same results and search scores across all synonym variations?
b) Can these synonyms apply at a column level (e.g. first name alone and not address), or do they always apply across the whole document?
c) If we have a large set of synonyms (over 1,000), does that lead to a performance impact?
I am Nate from Azure Search. To answer the questions first:
a) Yes, you should. If "Bill" and "William" are defined as synonyms, searching on either should yield the same results.
b) It's always at the column level. You use the field/column property called 'synonymMaps' to specify which synonym maps to use. Please see "Setting the synonym map in the index definition" in https://azure.microsoft.com/en-us/blog/azure-search-synonyms-public-preview/ for more information.
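Concretely, the field definition in the index references the map by name through that property. A sketch (the field name here is illustrative, not from this thread):

```json
{
  "name": "firstName",
  "type": "Edm.String",
  "searchable": true,
  "synonymMaps": [ "synonymmap1" ]
}
```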
c) Do you mean over 1,000 synonyms for a single word, or 1,000 synonym rules in the synonym map? The former definitely impacts performance, because the search query will expand to thousands of terms; in fact, you can't define more than 50 synonyms in a single rule. The latter, thousands of rules in a synonym map, shouldn't impact performance unless the rules are constantly updated.
Regarding your comment that the synonyms don't work: based on your questions, I was wondering whether the synonyms feature was enabled in the index definition. Could you check that? If it still doesn't work, feel free to drop me an email at nateko@microsoft.com.
The extraneous newline characters you see in the retrieved synonym map may have been inserted by the HTTP client you were using at upload time. Some HTTP clients, Fiddler and Postman for example, automatically insert a newline character at each line ending so you don't have to do it yourself.
Thanks,
Nate
Related
I have some data in an Azure Search suggester with language-specific characters (í, ó, ú, etc.). Unless I search with those characters, I don't get any results back. This would be solved if I were able to add an analyzer to the suggester (as Lucene indexes have).
"suggesters": [
  {
    "name": "suggester",
    "searchMode": "analyzingInfixMatching",
    "sourceFields": [
      "Name"
    ]
  }
]
Suggesters are not supported on fields that use custom analyzers. In some scenarios it makes sense to create another field analyzed with the standard analyzer (or a language analyzer) and use it for suggestions only.
We realize this is a limitation; please vote for this feature to help us prioritize it.
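A sketch of that workaround (the second field name and the custom analyzer name are illustrative): keep the custom-analyzed field for search, add a standard-analyzed copy, and point the suggester at the copy:

```json
"fields": [
  { "name": "Name", "type": "Edm.String", "searchable": true, "analyzer": "my_custom_analyzer" },
  { "name": "NameSuggest", "type": "Edm.String", "searchable": true, "analyzer": "standard.lucene" }
],
"suggesters": [
  { "name": "suggester", "searchMode": "analyzingInfixMatching", "sourceFields": [ "NameSuggest" ] }
]
```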
I have a rule from my SMEs for SOLR Search relevancy. It goes like this.
When words "XX", "YY", or "ZZ" are in the User's search terms, heavily boost the document_type "MMMM" in the results. (But ONLY then, which means I can't weight the doc itself I think.)
I can imagine building a "Query Pre-Processor" that checks for the presence of the specified terms "XX", etc. and then plugs them into a pre-built query that heavily boosts document_type "MMMM".
That feels more than a little clunky to me. Doing this in code and handling a "union" situation where terms from two rules are in the search doesn't sound like something I'd like to maintain.
I'm wondering if there could be a way to leverage SOLR to do this? The first thing that comes to mind is to put those particular search terms "XX", etc.. into any document_type "MMMM" when pre-processing the data to go into SOLR.
Just tossing them into the document's text is probably not going to change the weighting all that much -- especially if the term is in other documents NOT part of that document_type -- and that suggests to my mind an "important_abbreviations" field on all documents and a "standard" practice of including a boost for that general field on all queries. I say that because I don't recall ever seeing a way to boost a particular field within a doc except in a query.
I'm wondering if anyone else out there has solved this problem and if so, how -- since both of these feel a little clunky to me.
Attempting One Possible Answer: Please feel free to critique, advise or warn.
(I'm aware that an "abbreviation" field feels a bit like synonyms. Please comment if you think synonyms would be a better way to approach this.)
Step 1: Make an "abbreviation" multivalued field in SOLR on all collection docs.
Step 2: Add "XX", "YY", "ZZ" to all documents of type "MMMM" when I build the solrInputDocument to send to SOLR.
Step 3: Boost the "abbreviation" field when adding the abbreviations in step 2 so that resulting xml looks like this:
<field name="abbreviation" boost="5.0">myXXAbbreviationGoesHere</field>
[Concern: Can I boost some fields of type "abbreviation" and not others? In other words, will SOLR respect/correctly calculate the field boost value if it's "2" on one document "5" on another and there is no boost on a 3rd document?]
Step 4: Do a copyField and drop "abbreviation" into the default "text" search field. [This probably loses me my field-specific weighting, yes? -- Thus 5 or 6 below.]
Step 5: OR - add a Request Handler that forces doing search on the abbreviations field directly on every incoming search. Not totally sure on this one, but I got the idea from this stackoverflow question: Solr - Boosting result if query is found in a special field
Step 6: OR - append the query text for searching "abbreviation" on every query entered in my UI - before submission to SOLR.
[In this case, I want to search the default field AND the "abbreviation" field with this single query. I assume that's possible, I just haven't tried to write the query yet. Comments gratefully accepted.]
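Step 6 can be sketched as building the request parameters so a single query searches both fields via edismax. This is a minimal Python sketch; the field names "text" and "abbreviation" and the boost value come from the steps above, and the exact boost is an assumption to tune:

```python
from urllib.parse import urlencode

def build_solr_params(user_query: str) -> str:
    """Build the query string for a Solr select request that searches
    the default "text" field and boosts matches on "abbreviation"."""
    params = {
        "q": user_query,
        "defType": "edismax",           # dismax-style parser over multiple fields
        "qf": "text abbreviation^5.0",  # abbreviation matches weighted 5x
    }
    return urlencode(params)

print(build_solr_params("XX widgets"))
```

Appending these parameters to /solr/&lt;core&gt;/select would search the default field and the abbreviation field with one query.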
I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".
Yes, sadly this is a well-known problem, due to how the Solr query parser breaks up on whitespace before analyzing. So instead of seeing "spider" before "man" in the token stream, you simply see each word on its own: just "spider" with nothing before/after, and just "man" with nothing before/after.
This is because most Solr query forms see a space as basically an "OR". Search for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, then generating a query from that.
For more background, there's this blog post
There's a large number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a blackbox, and I've found it can generate some complex query forms that generate weird performance and relevance bugs.
Lucidworks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases ("spider man") that should not be broken into OR queries and can have synonym expansion applied.
OpenSource Connection's Match query parser. Searches a single field using a query-specified analyzer run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy -- Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.
My usual strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this in both my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One important detail is that I chose a normalized form (e.g. "peter") that is only one word. I could have organized it this way:
peter, spiderman, spider man => Mary Jane
but because "Mary Jane" is two words, it may (depending on other features of my search) match the two words separately as well as together. By choosing a single-word form to normalize into, I make sure that my tokenizer won't try to break it up.
It's a known limitation within Solr / Lucene. Essentially you would have to provide an alternative form of tokenization so that specific space delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do it client side, i.e. in your application that is calling Solr: when indexing, keep a list of synonym phrases and find/replace those phrase values with an alternative (for example, removing the spaces or replacing them with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi, HelloThere, Wotcha => Hello
Similarly when you search, replace any incidences of "Hello There" in the query string with HelloThere and then they will be matched as a synonym of Hello.
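The client-side find/replace can be sketched in a few lines of Python (the phrase list is illustrative, and this simple version is case-sensitive, unlike Solr's ignoreCase option):

```python
# Phrases to collapse into single tokens, applied identically at
# index time and query time so the two sides stay in sync.
PHRASE_MAP = {
    "Hello There": "HelloThere",
}

def collapse_phrases(text: str) -> str:
    """Replace each synonym phrase with its single-token form."""
    for phrase, token in PHRASE_MAP.items():
        text = text.replace(phrase, token)
    return text

# Run over document text before indexing and over the query string
# before sending it to Solr.
print(collapse_phrases("Hello There, how are you?"))
```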
Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on github. This works by maintaining a token stream so that it can work out if a combination of two or more sequential tokens matches one of the synonym phrases, and if it doesn't, it throws away the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach - would be nice to have by default in Solr as part of the SynonymFilter.
I am new to Solr. We have CRM data for contacts and companies numbering in the millions, and we have switched to Solr for fast search results.
PROBLEM: We have large inclusion and exclusion lists with names of companies or contacts.
Ex: Include or exclude "Company A" & "Company B" & "Company C" ... & "Company n", where n may be 10,000.
What would be the best way to do this kind of query using Solr?
WHAT I HAVE TRIED:
Setting "q" ==> field_name:("companyA" OR "companyB" ... OR "Company n")
This works only for a list of around 400 entries.
Looking forward to assistance on this.
You can increase the maximum number of boolean clauses; see http://wiki.apache.org/solr/SolrConfigXml
Performance hint: in your case I would think about packaging the inclusion and exclusion lists into a filter query and letting the filter cache hold the results for reuse.
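One way to do that (assuming the field is named company_name; the terms query parser is available in Solr 4.10 and later) is to send the whole list as a single filter query, which sidesteps the boolean-clause limit and caches as one entry:

```text
fq={!terms f=company_name}Company A,Company B,Company C
```

An exclusion list can go into its own, negated filter query.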
This can happen for multiple reasons:
Check how you are querying Solr: is it a GET or a POST? With GET, all the parameters are passed as part of the URL, i.e. http://&lt;host&gt;/solr/select?q=field_name:(...). URL length is limited by clients and servers (Internet Explorer, for example, caps it at 2,048 characters), so if your programmatically formed URL exceeds the limit, either change the query model or make it a POST call.
If #1 doesn't apply in your case, check for the maxBooleanClauses tag in the solrconfig.xml file. If it is missing, add it per the guidelines on the Solr wiki:
http://wiki.apache.org/solr/SolrConfigXml#The_Query_Section
You can increase the value of maxBooleanClauses in solrconfig.xml to the desired level. By default the value is 1024.
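The relevant fragment of solrconfig.xml's query section looks like this (the value shown is just an example to raise the limit):

```xml
<query>
  <!-- raise the per-query limit on boolean clauses -->
  <maxBooleanClauses>10240</maxBooleanClauses>
</query>
```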
Shishir
I've got Solr happily running, indexing a list of department names that contain US states. It is working well; however, searching for "Virginia" will turn up results containing "West Virginia", which, while certainly helpful for some business requirements, is not in ours.
Is there a special way of saying that a query for X must not contain Y (I don't mind crafting a special query for the case of "Virginia"), or can I only do this post-query by iterating over the results and excluding results with "West Virginia"?
Use a minus sign (hyphen) in front of the phrases/terms you want to exclude. If you use the dismax query parser, you don't even need to specify field names.
Examples:
using dismax:
q=virginia -"west virginia"
using standard query parser:
q=field_name:(virginia -"west virginia")
Refer to the Solr Query Syntax wiki page and its further links for more examples.
You could make a state field that is a string type and just search on state:"virginia" (lowercase the string before indexing / searching)