Match the whole phrase using userInput in Vespa

Assume that I have two documents with the following content.
{
"title": "Windsor Farmhouse Wood Writing Desk Light Brown - Martin Furniture Furniture"
}
{
"title": "Benjara 34 in. Rectangular Light Brown/White 1 Drawer Computer Desk, Light Brown & White"
}
The definition of the field is as follows.
field title type string {
indexing: summary | index | attribute
index: enable-bm25
}
How can I match only the first document and not the second document when I want to match the phrase desk light in Vespa 8? In other words, I want to match only documents with ... desk light ..., but not others like ... desk, light ....
I tried the following query, but in Vespa 8 it seems to behave like a weakAnd operation and matches both documents. It also matches documents that contain only ... desk ..., which is expected from weakAnd but is not what I want.
_desk_light=desk light
yql=select id, title, summaryfeatures from sources * where ([{"defaultIndex": "title"}](userInput(#_desk_light)));
I also tried adding the grammar: phrase annotation to userInput. Both documents are still matched.
_desk_light=desk light
yql=select id, title, summaryfeatures from sources * where ([{"defaultIndex": "title", "grammar": "phrase"}](userInput(#_desk_light)));
I would really appreciate any advice. Thanks!

Using grammar: phrase is the right solution if you only want to match the exact phrase "desk light", but in this case you'll still match both documents as they both contain that phrase (commas are ignored).
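To see why both titles still match, here is a minimal Python sketch of the effect. It assumes a simplified tokenizer that lowercases the text and keeps only runs of letters and digits; Vespa's actual linguistics processing is more involved, and `tokenize` and `contains_phrase` are hypothetical helpers used only for illustration.

```python
import re

def tokenize(text):
    # Simplified stand-in for index-time tokenization (an assumption):
    # lowercase the text and keep only runs of letters and digits,
    # so punctuation such as commas never becomes a token.
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    # True if the phrase tokens appear as consecutive tokens in the text.
    tokens = tokenize(text)
    wanted = tokenize(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

doc1 = "Windsor Farmhouse Wood Writing Desk Light Brown - Martin Furniture Furniture"
doc2 = "Benjara 34 in. Rectangular Light Brown/White 1 Drawer Computer Desk, Light Brown & White"

print(contains_phrase(doc1, "desk light"))  # True
print(contains_phrase(doc2, "desk light"))  # True: the comma is dropped before matching
```

Under this assumption, "Desk, Light" and "Desk Light" produce the same token sequence, so a phrase match cannot tell them apart.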

Related

Solr query, at least one occurrence of phrase without another phrase before it

How can I search for documents with phrases, where at least one occurrence of this phrase does not have a specific phrase before it?
To give an example, let's say I have three documents with a text field of type text_en:
document 1 text: Oranges are tasty.
document 2 text: The color orange is bright, an orange is healthy.
document 3 text: The color orange.
I want all documents that mention orange, except those in which every occurrence of orange is preceded by "color".
The query should return documents 1 and 2. Document 2 is the tricky case: it contains an occurrence of orange without "color" directly before it, so I want it returned even though "color orange" also appears in it. Document 3 should not be returned.
I tried to use boolean queries with negations, but this of course excludes document 2:
q=textfield:"orange"&fq=NOT textfield:"color orange"
Can such a query as described above be done in Solr?

Apache Solr phrase query is not aware of filters from schema.xml

I am a newbie with Solr and I have a question about the query mechanism.
In my Solr schema.xml, for a particular field (say field1), I have a standard tokenizer that splits text into words, plus a couple of filters. One of the filters is a solr.KeepWordFilterFactory filter that has an extremely short dictionary (just 10 words, say: red, orange, yellow, green, etc.). I tested the schema with the analysis screen of Solr and everything works.
That is, a document with the text "Red fox was sitting on green grass" would translate to {"red", "green"}.
However, when I submit the query field1:"red green", it fails to find such a document, as if the query were applied to the tokenized but unfiltered source.
Can you confirm that this is what the standard query parser actually does? That is, are the filters applied exclusively at index time, but not for the actual search? (I understand that the search will be applied only to those documents where the index matches the analyzed query.) If not, how does the phrase query actually work in the above example?
When you run a query like "red green", Lucene expects to find these terms in consecutive positions, so pos(green) = pos(red) + 1. When you write it as "red green"~10, you give it 10 moves to shuffle the terms around and try to make them consecutive (this is called phrase slop).
Other than that, what a KeywordMarkerFilter does is mark tokens with the keyword flag. Filters following it can implement logic that checks whether a token is a keyword before modifying it. It does not stop Lucene from indexing tokens that are not marked as keywords, but it can stop later filters from modifying the marked ones.
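The position arithmetic in the phrase query explanation above can be sketched in a few lines of Python. This is a simplified model for an in-order, two-term phrase only, not Lucene's actual sloppy-phrase scorer. `phrase_matches` is a hypothetical helper, and the example positions assume that tokens removed by a filter still leave gaps in the position sequence.

```python
def phrase_matches(positions_a, positions_b, slop=0):
    # True if some occurrence of the second term follows some occurrence
    # of the first term with at most `slop` positions in between.
    # slop=0 means strictly consecutive: pos(b) == pos(a) + 1.
    return any(
        0 <= pb - pa - 1 <= slop
        for pa in positions_a
        for pb in positions_b
    )

# "Red fox was sitting on green grass": red at position 0, green at position 5
# (assuming the filtered-out words keep their position gaps).
red_positions = [0]
green_positions = [5]

print(phrase_matches(red_positions, green_positions))           # False
print(phrase_matches(red_positions, green_positions, slop=10))  # True
```

In this model the exact phrase "red green" fails because four positions separate the two terms, while a slop of 4 or more lets it match.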

how to perform multi left-edge keyword matching in solr

I have a requirement, to be solved through Solr's schema.xml, where I need to search for a left-edge keyword in a multi-word field, so that the match is attempted at the left edge of each word after the delimiter.
For example, let's say my titles are:
1-> Title is : Split Air Conditioner
2-> Title is : Plastic chair
Now I query on "air". My delimiter is a space.
I want it to give me only "Split Air Conditioner" and not "Plastic chair"
Note: "Plastic chair" appears in my results because "air" is contained in the keyword ch(air). I am currently using EdgeNGramFilterFactory.
You should use StandardTokenizerFactory instead of EdgeNGramFilterFactory for the title field.
But yes, as John suggested, please share your schema and field definition, as it will help us resolve your issue.

Solr synonyms aren't working right

We have a large restaurant menu database where users can search for menu items. There are many items where the words side by side name a unique dish, but the individual words are so common that they appear all over the place.
Example: Users want to search for "cheese steak"
In the database...it can be "cheesesteak" or "cheese steak"
In my synonym file I have:
cheesesteak => cheesesteak, cheese steak
cheese steak => cheesesteak, cheese steak
When I search for "cheesesteak", I get valid results. I get menu items with "cheesesteak" and also "cheese steak" (words side by side)
But when I search for "cheese steak", I get all kinds of irrelevant results like "steak salad with blue cheese". It is picking up anything with the words cheese and steak.
Is there a way to configure this synonym file so it works? I don't want to force user to enter quotes, etc.
What you are looking for is proximity search, where scoring improves with the correct ordering and distance of words. From the Solr FAQ:
A proximity search can be done with a sloppy phrase query. The closer
together the two terms appear in the document, the higher the score
will be. A sloppy phrase query specifies a maximum "slop", or the
number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents
where "batman" occurs within 100 words of "movie":
q=text:"batman movie"~100
What you should do is use edismax and let boosting surface the most relevant docs. You can also do this with the standard handler if you add boosting queries or an optional phrase clause alongside the required terms, like +cheese +steak ("cheesesteak"^100 "steak cheese"^50)
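As a small illustration of assembling such a query string on the client side, here is a Python sketch. `build_boosted_query` is a hypothetical helper, not part of any Solr client library, and the terms and boost values are simply the ones from the example above.

```python
def build_boosted_query(required_terms, boosted_phrases):
    # Required terms get a leading '+', so every one of them must match.
    required = " ".join(f"+{term}" for term in required_terms)
    # Optional quoted phrases carry a boost; they only affect ranking.
    optional = " ".join(
        f'"{phrase}"^{boost}' for phrase, boost in boosted_phrases
    )
    return f"{required} ({optional})"

query = build_boosted_query(
    ["cheese", "steak"],
    [("cheesesteak", 100), ("steak cheese", 50)],
)
print(query)  # +cheese +steak ("cheesesteak"^100 "steak cheese"^50)
```

The required terms decide which documents are returned at all, while the boosted phrases push documents containing the adjacent words toward the top.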

Query problem in Solr

We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field shop_keyword which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product which has a keyword "apple" and another which has "orange", a search for shops having Apple AND Orange would return the shop for these products.
However, this is incorrect since we want that a search for shops having Apple AND Orange returns shop(s) having products with both "apple" and "orange" as keywords.
We tried solving this problem by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post, "Querying Solr documents with one of the fields multi-valued", Solr does not support "all words must match in the same value of a multi-valued field".
(Hope I explained myself well)
How can we go about this? Ideally, we shouldn't change our search infrastructure dramatically.
Thanks!
Krt_Malta
I am going to assume shop_keyword is a text field.
A keyword search of Apple AND Orange would return only shop_keyword values that contain both Apple and Orange, provided you are searching on that field exclusively (shop_keyword:(Apple AND Orange)). For example, you should only see results that contain:
Apple Orange
And not:
Apple Mango
(I was able to confirm this on my local Solr instance with a text field)
However, you would see results that contain:
Apple Lime Orange Tree
(where "Orange Tree" is a single word but has spaces)
From the link you posted, it seems like this is the problem. So your real problem is that you have spaces in your keywords, which Solr is also using as a delimiter of sorts, in which case the technical solutions listed there are the only ones I know of. However...
If you have control of the terms and they aren't used in a free text search (or for Google), you could consider removing the spaces from the keywords and adding quotes to your search. That would solve your problem:
shop_keyword:("Apple" AND "Orange")
That wouldn't return "Orange_Tree".
If you went this route you could use a separate field to index terms for free text search and other non-programmatic purposes.
Not ideal, but I hope that kinda helps =).
