How to perform multi left-edge keyword matching in Solr

I have a requirement, configured through Solr's schema.xml, to match a keyword against the left edge of each word in a multi-word field. That is, the left-edge match should be attempted on every word that follows a delimiter, not only on the first word.
For example, say I have two titles:
1-> Title is : Split Air Conditioner
2-> Title is : Plastic chair
My delimiter is a space. When I query for "air",
I want it to return only "Split Air Conditioner" and not "Plastic chair".
Note: "Plastic chair" currently appears in my results because "air" is contained in the word ch(air). I am currently using EdgeNGramFilterFactory.

You should use StandardTokenizerFactory instead of EdgeNGramFilterFactory for the title field.
But yes, as John suggested, please share your schema and field definition, as that will help us resolve your issue.
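To make the desired behaviour concrete, here is a minimal Python sketch (not Solr code) of what splitting on whitespace and then taking left-edge n-grams of each word would index. The function names and the gram limits are assumptions chosen for illustration, and the query is assumed to be lowercased but not n-grammed.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the left-edge prefixes of a single token."""
    token = token.lower()
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(title):
    """Split on whitespace first, then take edge n-grams of every word."""
    terms = set()
    for token in title.split():
        terms.update(edge_ngrams(token))
    return terms


def matches(query, title):
    """The query is only lowercased, so it must equal some word prefix."""
    return query.lower() in index_terms(title)


print(matches("air", "Split Air Conditioner"))
print(matches("air", "Plastic chair"))
```

With this scheme "air" is a prefix of the word "Air" but not of "chair", so only the first title matches.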

Related

Hybris: Solr facet truncate Japanese words

I found a similar issue here.
solr facet search truncate words
When I use a Solr facet on the manufacturer name of products, the actual manufacturer name is "化学商品", but the Solr navigation area shows it as two options, "化学" and "商品", which means the value is being split into separate terms. For English manufacturer names it works fine.
I cannot use fieldType string. I am using text.
How do I avoid this for Japanese characters so that only the full manufacturer name is shown? I also tried the tokenizer class CJKTokenizerFactory, but it didn't work.
Any help is greatly appreciated!
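You cannot use text for facets, because faceting on a tokenized field returns the individual indexed terms rather than the original value. If you wish to both search and facet on manufacturer name, then index this information twice, once as string and once as text, and use these different representations in the appropriate places.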
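The difference between faceting on a tokenized field and on a string field can be sketched in Python. This is only an illustration: crude_tokenize is a hypothetical stand-in that cuts a value into two-character chunks to mimic the split described in the question, and it is not how any Solr tokenizer is implemented.

```python
from collections import Counter


def crude_tokenize(value, size=2):
    """Hypothetical stand-in tokenizer: fixed two-character chunks."""
    return [value[i:i + size] for i in range(0, len(value), size)]


def facet_counts(values, tokenized):
    """Count documents per indexed term, as a facet would."""
    counts = Counter()
    for value in values:
        terms = crude_tokenize(value) if tokenized else [value]
        counts.update(set(terms))  # each document counts once per term
    return counts


docs = ["化学商品", "化学商品", "化学工業"]
text_facets = facet_counts(docs, tokenized=True)     # facets on fragments
string_facets = facet_counts(docs, tokenized=False)  # facets on full names

print(sorted(text_facets))
print(sorted(string_facets))
```

The tokenized variant produces facet entries for the fragments, while the untokenized variant keeps one entry per full manufacturer name.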

Multi-word synonym search in Solr

I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".
Yes, sadly this is a well-known problem caused by the Solr query parser splitting on whitespace before analysis. Instead of seeing "spider" followed by "man" in the token stream, the analyzer sees each word on its own: just "spider" with nothing before or after it, and just "man" with nothing before or after it.
This is because most Solr query forms treat a space as basically an "OR". They search for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, and then generating a query from that.
For more background, there's this blog post
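The effect of splitting on whitespace before analysis can be sketched in Python. This is a simplified model, not the Solr query parser: the SYNONYMS table and the function names are hypothetical, and the matcher only handles phrases of one or two words.

```python
# Hypothetical synonym table keyed by token tuples.
SYNONYMS = {
    ("spider", "man"): "peter",
    ("spiderman",): "peter",
    ("mary", "jane"): "peter",
}


def apply_synonyms(tokens):
    """Replace known one- or two-token phrases, longest match first."""
    out, i = [], 0
    while i < len(tokens):
        for length in (2, 1):
            key = tuple(tokens[i:i + length])
            if len(key) == length and key in SYNONYMS:
                out.append(SYNONYMS[key])
                i += length
                break
        else:  # no synonym starts at this position
            out.append(tokens[i])
            i += 1
    return out


def analyze_whole(text):
    """The analyzer sees the full token stream."""
    return apply_synonyms(text.lower().split())


def analyze_presplit(text):
    """Each whitespace-separated chunk is analyzed in isolation."""
    out = []
    for chunk in text.lower().split():
        out.extend(apply_synonyms([chunk]))
    return out


print(analyze_whole("spider man"))
print(analyze_presplit("spider man"))
```

When each chunk is analyzed in isolation, the two-word entry can never match, which is the behaviour described above.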
There are a number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a black box, and I've found it can generate complex query forms that cause odd performance and relevance bugs.
Lucidworks' autophrase query parser. By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and can have synonym expansion applied.
OpenSource Connections' Match query parser. Searches a single field using a query-specified analyzer run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy -- Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.
My usual strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this in both my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One important detail is that I choose a normalized form (e.g. "peter") that is only one word. I could organize it this way:
peter, spiderman, spider man => Mary Jane
but because Mary Jane is two words, it may (depending on other features of my search), match the two words separately as well as together. By choosing a single word form to normalize into, I make sure that my tokenizer won't try to break it up.
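The normalize-to-one-form idea can be sketched in Python as follows. This is an illustration under stated assumptions, not Solr's synonym filter: the NORMALIZE table mirrors the synonyms.txt line above, the analyzer is assumed to receive the whole text on both the index and query side, and matching is simplified to "every query term must appear in the document".

```python
# Hypothetical table mirroring: spiderman, spider man, Mary Jane => peter
NORMALIZE = {
    ("spiderman",): "peter",
    ("spider", "man"): "peter",
    ("mary", "jane"): "peter",
}


def normalize(text):
    """Lowercase, split on whitespace, and collapse known phrases."""
    tokens = text.lower().split()
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        single = (tokens[i],)
        if len(pair) == 2 and pair in NORMALIZE:
            out.append(NORMALIZE[pair])
            i += 2
        elif single in NORMALIZE:
            out.append(NORMALIZE[single])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


def matches(query, document):
    """Simplified matching: all normalized query terms must be indexed."""
    doc_terms = set(normalize(document))
    return all(term in doc_terms for term in normalize(query))


doc = "The Amazing Spider Man"
print(normalize(doc))
print(matches("Mary Jane", doc))
print(matches("Mary is amazing", doc))
```

"Mary" on its own is left untouched because only the two-word phrase is listed, so the last query does not match the document.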
It's a known limitation within Solr / Lucene. Essentially you would have to provide an alternative form of tokenization so that specific space delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do this client side - i.e. in your application that is calling Solr, when indexing, keep a list of synonym phrases and find / replace those phrase values with an alternative (for example removing the spaces or replacing it with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi, HelloThere, Wotcha => Hello
Similarly, when you search, replace any instances of "Hello There" in the query string with HelloThere, and they will then be matched as a synonym of Hello.
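A minimal Python sketch of this client-side replacement, to be applied to both the text being indexed and the query string. The PHRASES table and the collapse_phrases name are hypothetical, and the matching rule (case-insensitive, any run of whitespace between the words) is an assumption.

```python
import re

# Hypothetical phrase list: multi-word phrase -> single-token replacement.
PHRASES = {"hello there": "HelloThere"}


def collapse_phrases(text):
    """Replace each known phrase with its single-token form."""
    for phrase, replacement in PHRASES.items():
        words = [re.escape(word) for word in phrase.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text


print(collapse_phrases("Well, hello   there friend"))
print(collapse_phrases("Hello Theresa"))
```

The word boundaries keep the replacement from firing inside longer words, so "Hello Theresa" is left unchanged.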
Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on GitHub. It works by buffering the token stream so that it can work out whether a combination of two or more sequential tokens matches one of the synonym phrases; if it doesn't, it releases the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach, and it would be nice to have it by default in Solr as part of the SynonymFilter.

Apache Solr phrase query is not aware of filters from schema.xml

I am a newbie with Solr and I have a question about the query mechanism.
In my Solr schema.xml, for a particular field (say field1), I have a standard tokenizer that splits text into words, plus a couple of filters. One of the filters is a solr.KeepWordFilterFactory with an extremely short dictionary (just 10 words, say: red, orange, yellow, green, etc.). I tested the schema with the analysis menu of Solr and everything works.
That is, a document with the text "Red fox was sitting on green grass" would translate to {"red", "green"}.
However, when I submit the query field1:"red green", it fails to find such a document, as if the query were applied to the unfiltered yet tokenized source.
Can you confirm that this is what the standard query parser actually does, i.e. that the filters are applied exclusively at index time and not for the actual search? (I understand that the search will be applied only to those documents where the index matches the analyzed query.) If not, how does the phrase query actually work in the above example?
When you run a query like "red green", Lucene expects to find these terms in consecutive positions, so pos(green) = pos(red) + 1. When you write it as "red green"~10, you give it 10 moves to shuffle the terms around and try to make them consecutive (this is called phrase slop).
Other than that, what a KeywordMarkerFilter does is mark tokens with the keyword flag. Filters following it can implement logic that checks whether a token is a keyword before modifying it. It does not stop Lucene from indexing tokens not marked as keywords, but it can stop later filters from modifying the marked ones.
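The position argument above can be sketched in Python. This is a simplified model rather than Lucene's implementation: it assumes the keep-word filter leaves position gaps where tokens were removed, it only handles an in-order two-term phrase, and it treats slop as the number of positions allowed between the two terms. The KEEP set and function names are hypothetical.

```python
# Hypothetical keep-word dictionary from the question.
KEEP = {"red", "orange", "yellow", "green"}


def analyze(text):
    """Keep only dictionary words, preserving their original positions."""
    tokens = text.lower().split()
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok in KEEP]


def phrase_match(doc_tokens, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` positions between."""
    firsts = [pos for pos, tok in doc_tokens if tok == first]
    seconds = [pos for pos, tok in doc_tokens if tok == second]
    return any(0 <= s - f - 1 <= slop for f in firsts for s in seconds)


doc = analyze("Red fox was sitting on green grass")
print(doc)
print(phrase_match(doc, "red", "green"))
print(phrase_match(doc, "red", "green", slop=10))
```

Under these assumptions "red" and "green" stay four positions apart after filtering, so the exact phrase fails while the sloppy phrase succeeds.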

Solr Sunspot -- How to make 2 strings like "Co." and "Company" appear identical?

I have many records that contain abbreviations in a column of names. I might have the record "ABC Brewing Co.", and I want solr to interpret this as being identical to a search for "ABC Brewing Company". Any ideas on how this might be accomplished?
You should look into using the SynonymFilterFactory at index or query time to accomplish this.

Query problem in Solr

We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field shop_keyword which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product which has a keyword "apple" and another which has "orange", a search for shops having Apple AND Orange would return the shop for these products.
However, this is incorrect since we want that a search for shops having Apple AND Orange returns shop(s) having products with both "apple" and "orange" as keywords.
We tried solving this problem by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post, "Querying Solr documents with one of the fields multi-valued", Solr does not support "all words must match in the same value of a multi-valued field".
(Hope I explained myself well)
How can we go about this? Ideally, we shouldn't change our search infrastructure dramatically.
Thanks!
Krt_Malta
I am going to assume shop_keyword is a text field.
A keyword search of Apple AND Orange would return only documents whose shop_keyword contains both Apple and Orange, provided you are searching on that field exclusively (shop_keyword:(Apple AND Orange)). For example, you should only see results that contain:
Apple Orange
And not:
Apple Mango
(I was able to confirm this on my local Solr instance with a text field)
However, you would see results that contain:
Apple Lime Orange Tree
(where "Orange Tree" is a single keyword that contains a space)
From the link you posted, it seems like this is the problem. So your real problem is that you have spaces in your keywords, which Solr is also using as a delimiter of sorts, in which case the technical solutions listed there are the only ones I know of. However...
If you have control of the terms and they aren't used in a free text search (or for google), you could consider replacing the spaces inside the keywords (for example with underscores) and adding quotes to your search. That would solve your problem:
shop_keyword:("Apple" AND "Orange")
This wouldn't return "Orange_Tree".
If you went this route you could use a separate field to index terms for free text search and other non-programmatic purposes.
Not ideal, but I hope that kinda helps =).
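The requirement in the question, that both keywords belong to the same product rather than merely to the same shop, can be sketched in Python. The data layout and function names are hypothetical and only illustrate the two readings of "Apple AND Orange"; this is not a Solr query.

```python
def flat_match(shop, required):
    """All product keywords merged into one shop-level field."""
    all_keywords = set().union(*shop["products"])
    return required <= all_keywords


def per_product_match(shop, required):
    """All required keywords must occur within a single product."""
    return any(required <= product for product in shop["products"])


# Shop A sells one apple product and one orange product.
shop_a = {"name": "A", "products": [{"apple"}, {"orange"}]}
# Shop B sells one product tagged with both keywords.
shop_b = {"name": "B", "products": [{"apple", "orange"}, {"mango"}]}
required = {"apple", "orange"}

for shop in (shop_a, shop_b):
    print(shop["name"], flat_match(shop, required), per_product_match(shop, required))
```

The flattened field matches both shops, while the per-product check matches only shop B, which is the result the question asks for.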

Resources