How to perform a full-text search in Vespa? - vespa

I am trying to do a full-text search on a field of some documents, and I am looking for advice on how to do so. I first tried this type of request:
GET http://localhost:8080/search/?query=lord+of+the+rings
But it only returned the documents where the field was an exact match, containing nothing other than the given string, so I tried the equivalent in YQL:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text CONTAINS "lord of the rings";
And I got exactly the same results. But on further reading of the documentation I came across the MATCHES operator, and it does give me the results I seem to be looking for, with this kind of request:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text MATCHES "lord of the rings";
However, for some requests of this type I got a timeout error like this one, and I don't know why:
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 0
},
"errors": [
{
"code": 12,
"summary": "Timed out",
"source": "site",
"message": "Timeout while waiting for sc0.num0"
}
]
}
}
So I worked around this issue by setting a timeout value greater than the default:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text MATCHES "lord of the rings";&timeout=20000
My question is: am I doing full-text search the right way, and how could I improve it?
EDIT: Here is the corresponding search definition:
search site {
document site {
field text type string {
stemming: none
normalizing: none
indexing: attribute
}
field title type string {
stemming: none
normalizing: none
indexing: attribute
}
}
fieldset default {
fields: title, text
}
rank-profile post inherits default {
rank-type text: about
rank-type title: about
first-phase {
expression: nativeRank(title, text)
}
}
}

What does your search definition file look like? I suspect you have put your text content in an "attribute" field, which defaults to "word match" semantics. You probably want "text match" semantics which means you'll need to put your content in an "index" type field.
https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#match
The "MATCHES" operator you are using interprets your input as a regular expression, which is powerful, but slow as it applies the regular expression on all attributes (further optimizations to something like https://swtch.com/~rsc/regexp/regexp4.html are possible but not currently implemented).
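To make the difference between the match modes concrete, here is a toy Python sketch. It is not how Vespa implements matching (casing, normalization and stemming are ignored, and the function names are made up for illustration). It only mimics the semantics described above: word match treats the whole field value as one token, text match works on individual tokens, and a regular expression has to be tried against every stored value.

```python
import re
from typing import Iterable, List


def word_match(field_value: str, query: str) -> bool:
    """Toy 'word match': the whole field value is a single token."""
    return field_value == query


def text_match(field_value: str, query: str) -> bool:
    """Toy 'text match': the field is split into tokens, and the query
    tokens must appear as a consecutive phrase somewhere in it."""
    tokens = field_value.split()
    wanted = query.split()
    n = len(wanted)
    return n > 0 and any(
        tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1)
    )


def regex_match(values: Iterable[str], pattern: str) -> List[str]:
    """Toy 'MATCHES': the pattern is tried against every stored value,
    which is why the cost grows with the number of documents."""
    compiled = re.compile(pattern)
    return [value for value in values if compiled.search(value)]


docs = ["lord of the rings", "the lord of the rings trilogy"]
exact_only = [d for d in docs if word_match(d, "lord of the rings")]
phrase_hits = [d for d in docs if text_match(d, "lord of the rings")]
regex_hits = regex_match(docs, "lord of the rings")
```

With these two sample strings, exact_only contains only the first one, while phrase_hits and regex_hits contain both, which is the behaviour the question describes for CONTAINS on an attribute versus MATCHES.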

Related

Characters to split the user-query in Vespa engine

We split the user query on ASCII spaces to create a weakAnd(...).
The user input "Watch【Docudrama】" does not contain any whitespace, but it throws an error.
Question: Which code points besides whitespace should be used to split the query?
YQL (fails):
select * from post where text contains "Watch【Docudrama】" limit 1;
YQL (works):
select * from post where weakAnd(text contains "Watch",text contains "【Docudrama】") limit 1;
Error message:
{
"root": {
"id": "toplevel",
"relevance": 1,
"fields": {
"totalCount": 0
},
"errors": [
{
"code": 4,
"summary": "Invalid query parameter",
"source": "content",
"message": "Can not add WORD_ALTERNATIVES text:[ Watch【Docudrama】(1.0) watch(0.7) ] to a segment phrase"
}
]
}
}
Are you sure you need to use WAND for this? Try setting the user query grammar to "any" (default is "all"), which will use the "OR" operator for user supplied terms. There is an example here: https://docs.vespa.ai/documentation/reference/query-language-reference.html#userinput
The process of splitting up the query is known as tokenization. This is a complex and language-dependent process; Vespa uses Apache OpenNLP to do this (and more). https://docs.vespa.ai/documentation/linguistics.html has more information and also references to the code which performs this operation.
If you really want to use WAND, instead of reimplementing the query parsing logic outside Vespa, I suggest you create a Java searcher which descends the query tree and modifies it by replacing the created AndItem with WeakAndItem. See https://docs.vespa.ai/documentation/searcher-development.html and the code example here: https://docs.vespa.ai/documentation/advanced-ranking.html
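As a rough sketch of the first suggestion, the request below passes the raw user text as its own parameter and lets Vespa tokenize it through userInput with the "any" grammar. It is written in Python only to show the URL encoding. The function names and the "userinput" parameter name are made up for this example, the "defaultIndex" annotation is my assumption for targeting the text field, and the annotation syntax has changed between Vespa versions, so check the query language reference linked above before relying on it.

```python
from urllib.parse import urlencode


def build_any_grammar_params(user_text):
    """Request parameters for a userInput query using the 'any' grammar.

    The raw user text travels in its own parameter and is referenced
    from YQL as @userinput, so it is never spliced into the YQL string.
    """
    yql = (
        'select * from post where '
        '([{"grammar": "any", "defaultIndex": "text"}]userInput(@userinput)) '
        'limit 1;'
    )
    return {"yql": yql, "userinput": user_text}


def build_search_url(params, base_url="http://localhost:8080/search/"):
    """Percent-encode the parameters (UTF-8) and append them to the endpoint."""
    return base_url + "?" + urlencode(params)


params = build_any_grammar_params("Watch【Docudrama】")
url = build_search_url(params)
```

The point of the design is that no splitting happens on the client side at all: Vespa's own linguistics decides where the tokens are.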

Solr Conditional Highlighting: How to highlight with conditions?

In a Solr implementation, I am trying to do conditional highlighting that depends on fields other than the one we search on.
I want the "content" field of a matching result to be highlighted only if Solr indicates that this field can be exposed for that element.
Given a Solr index populated with:
[{ "firstname":"Roman",
"content": "A quick response is the best",
"access":"" },
{ "firstname":"Roman",
"content": "Responsive is important",
"access":"contentAuthorized" }
]
I would like to get both documents in the response, with the highlight on the "content" field only for the one with "access":"contentAuthorized", so I am executing the query:
q:(firstname:r* OR (+tags:contentAuthorized AND +content:r*))
The expected answer would be:
...
{
{
"firstname":"Roman"
},
{
"firstname":"Roman"
}
},
"highlighting":{
"0f278cb5-7150-42f9-8dca-81bfa68a9c6e":{
"firstname":["<em>Roman</em>"]
},
"105c6464-0350-4873-9936-b46c39c88647":{
"firstname":["<em>Roman</em>"],
"content":["<em>Responsive</em> is important"]
}
}
But I actually get:
...
{
{
"firstname":"Roman"
},
{
"firstname":"Roman"
}
},
"highlighting":{
"0f278cb5-7150-42f9-8dca-81bfa68a9c6e":{
"firstname":["<em>Roman</em>"],
"content":["A quick <em>response</em> is the best"]
},
"105c6464-0350-4873-9936-b46c39c88647":{
"firstname":["<em>Roman</em>"],
"content":["<em>Responsive</em> is important"]
}
}
So I get a highlight on "content" for an element where (+tags:contentAuthorized AND +content:r*) is false.
Does anyone have an idea of how I could do conditional highlighting with Solr?
Thank you for reading this and for taking your time to think about it :D
If you want highlighting to be applied on certain fields only, then you need to set the query parameter hl.fl to those fields. In your case hl.fl=content. You should then set hl.requireFieldMatch=true.
Refer to Solr Highlighting documentation:
By default, false, all query terms will be highlighted for each field to be highlighted (hl.fl) no matter what fields the parsed query refer to. If set to true, only query terms aligning with the field being highlighted will in turn be highlighted.
For further info on how to use the query parameters: https://solr.apache.org/guide/8_6/highlighting.html
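For illustration, here is a small Python sketch that assembles those parameters into a select request. The host, port and core name ("mycore") are placeholders I made up, and so are the function names; hl, hl.fl and hl.requireFieldMatch are the parameters discussed above.

```python
from urllib.parse import urlencode


def build_highlight_params(query, highlight_fields=("content",)):
    """Solr query parameters that restrict highlighting to the given
    fields and to query terms that actually target those fields."""
    return {
        "q": query,
        "hl": "true",
        "hl.fl": ",".join(highlight_fields),
        "hl.requireFieldMatch": "true",
    }


def build_select_url(params, base_url="http://localhost:8983/solr/mycore/select"):
    """base_url is a hypothetical local core; adjust it to your setup."""
    return base_url + "?" + urlencode(params)


params = build_highlight_params(
    "firstname:r* OR (+tags:contentAuthorized AND +content:r*)"
)
url = build_select_url(params)
```

Passing highlight_fields=("firstname", "content") would keep the firstname highlights from the expected output while still requiring a field match.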

Highlight matches in MongoDB full text search

Is it possible to define which part of the text in which of the indexed text fields matches the query?
No. As far as I know, and as far as I can tell from the Jira, no such feature currently exists. You can, of course, attempt to highlight the parts of the text yourself, but that requires implementing both the highlighting and the stemming according to the rules applied by MongoDB.
The whole feature is somewhat complicated, even to consume, as can be seen from the respective Elasticsearch documentation.
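To show what "highlight it yourself" involves, here is a deliberately naive Python sketch (the function name and the tag defaults are my own). It wraps whole-word, case-insensitive matches of the query terms and does no stemming at all, so it will miss forms that MongoDB's text index would still match.

```python
import re


def highlight_terms(text, terms, open_tag="<em>", close_tag="</em>"):
    """Wrap whole-word, case-insensitive occurrences of terms in tags.

    No stemming is applied, so 'banana' does not highlight 'Bananas'.
    """
    cleaned = [term for term in terms if term]
    if not cleaned:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in cleaned) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)


result = highlight_terms("Bananas come in a bunch", ["bunch", "banana"])
# result == "Bananas come in a <em>bunch</em>"
```

The unhighlighted "Bananas" is exactly the gap mentioned above: to close it you would have to reproduce the stemmer's rules on the client.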
Refer to the MongoDB documentation on highlighting:
db.fruit.aggregate([
{
$searchBeta: {
"search": {
"path": "description",
"query": ["variety", "bunch"]
},
"highlight": {
"path": "description"
}
}
},
{
$project: {
"description": 1,
"_id": 0,
"highlights": { "$meta": "searchHighlights" }
}
}
])
I'm afraid that solution applies only to MongoDB Atlas at the moment, @LF00.

Elastic Search Interaction of Highlights with Synonym Filter

We have an analyzer which includes the synonym filter which is defined as follows:
synonym_filter :
type : synonym
synonyms_path : synonyms.txt
ignore_case : true
expand : true
format : solr
In the synonym file we have a synonym defined as follows:
dawdle,waste time
Then in our data we have an entity with a name field "dawdle company".
Because of the synonym filter this gets analyzed to something like:
1 -dawdle- 2 -company- 3
1 -wasted- 2 -time- 3
With time and company in the same position. Then when performing a search for "wasted time" we get a hit on this entity. We would like the highlight to be "dawdle", since that is the equivalent synonym, but it seems Elasticsearch sees this as two hits, since it matched "wasted" and "time", and it returns two highlights: "dawdle" and "company".
Is there a recommended way to solve this kind of issue, where an unexpected word is returned in the highlights because it occupies the same position as a search term that was inserted because of a synonym?
@SergeyS, the situation both you and @user2430530 have is perfectly described in this section of the documentation.
The suggestion there is to define a single term for each series of synonyms, so that you do not get that mix-up of highlighted terms in the result.
Something like this:
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"dawdle, waste time=>waste_time"
]
}
}
}
Then you'll get the desired result from ES:
"highlight": {
"text": [
"some <em>dawdle</em> company"
]
}
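To see why contracting to a single term avoids the stray highlight, here is a toy Python model of the analysis step. It is not Elasticsearch code and the function names are made up; it just splits on whitespace and replaces any listed synonym, single-word or multi-word, with one token, so positions never overlap.

```python
def contract_synonyms(text, synonyms, replacement):
    """Toy analyzer: whitespace-split the text and replace every listed
    synonym (single- or multi-word) with one replacement token."""
    # Try longer phrases first so multi-word synonyms win over shorter ones.
    phrases = sorted((s.split() for s in synonyms), key=len, reverse=True)
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            if phrase and tokens[i:i + len(phrase)] == phrase:
                out.append(replacement)
                i += len(phrase)
                break
        else:
            # No synonym starts at this position: keep the token as-is.
            out.append(tokens[i])
            i += 1
    return out


def matching_positions(indexed_tokens, query_tokens):
    """Positions in the indexed text that a query token matches."""
    wanted = set(query_tokens)
    return [pos for pos, tok in enumerate(indexed_tokens) if tok in wanted]


synonyms = ["dawdle", "waste time"]
indexed = contract_synonyms("dawdle company", synonyms, "waste_time")
query = contract_synonyms("waste time", synonyms, "waste_time")
hits = matching_positions(indexed, query)
```

Here indexed is ["waste_time", "company"], query is ["waste_time"], and hits is [0], so only the position holding "dawdle" is a candidate for highlighting.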

Solr, adding a record via JSON with a multi-value field and boosted values

I'm pretty new to Solr. I'm trying to add a multi-value field with a boost value defined for each value, all defined via JSON. In other words, I'd like this to work:
[{ "id": "ID1000",
"tag": [
{ "boost": 1, "value": "A test value" },
{ "boost": 2, "value": "A boosted value" } ]
}]
I know how to do that in XML (multiple <field name = 'tag' boost = '...'>), but the JSON code above doesn't work: the server says "Error parsing JSON field value. Unexpected OBJECT_START". Does Solr have a limitation or a bug here?
PS: I fixed the originally-missing ']' and that's not the problem.
EDIT: It seems the way to go should be payloads (http://wiki.apache.org/solr/Payloads), but I couldn't make them work on Solr (I followed this: http://sujitpal.blogspot.co.uk/2011/01/payloads-with-solr.html). I am leaving the question open to see if someone can help further.
I found the following sentence in the Solr Relevancy FAQ, in the Query Elevation Component section:
An Index-time boost on a value of a multiValued field applies to all values for that field.
I do not think adding an individual boost to each value in the multivalued field is going to work. I know that the XML will allow it, but I would guess that it may only apply the boost value from the last value applied to the field.
So based on that, I would change the JSON to the following and see if that works.
[
{
"id": "ID1000",
"tag": {
"boost": 2,
"value": [ "A test value", "A boosted value"]
}
}
]
The JSON seems to be invalid: it is missing a closing ].
[
{
"id": "ID1000",
"tag": [
{
"boost": 1,
"value": "A test value"
},
{
"boost": 2,
"value": "A boosted value"
}
]
}
]
You hit an edge case. You can have boosts on single values, and you can have an array of values, but not one inside the other (from my reading of the Solr 4.1 source code).
That might be something to create as an enhancement request.
If you are generating that JSON by hand, you can try:
"tag": { "boost": 1, "value": "A test value" },
"tag": { "boost": 2, "value": "A boosted value" }
I believe Solr will merge the values then. But if you are generating it via a framework, it will most likely disallow or override multiple object property names (tag here).
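As an example of the "override" behaviour on the framework side, Python's standard json module silently keeps only the last value when a key is repeated, unless you ask for the raw key/value pairs. This says nothing about what Solr itself does with the repeated "tag" key; it only shows why a dict-based generator cannot produce that document.

```python
import json

raw = (
    '{"id": "ID1000",'
    ' "tag": {"boost": 1, "value": "A test value"},'
    ' "tag": {"boost": 2, "value": "A boosted value"}}'
)

# A plain dict keeps only the last value for a repeated key.
as_dict = json.loads(raw)

# object_pairs_hook receives every (key, value) pair in document order,
# duplicates included; nested objects become lists of pairs as well.
as_pairs = json.loads(raw, object_pairs_hook=list)
top_level_keys = [key for key, _ in as_pairs]
```

After running this, as_dict["tag"] holds only the boost-2 entry, while top_level_keys is ["id", "tag", "tag"].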
The error has nothing to do with boosting.
I get the same error with a very simple json doc.
No luck solving it.
See "Solr errors when trying to parse a collection: Error parsing JSON field value. Unexpected OBJECT_START".
I hit the same error message. Actually, the error message was misleading. The real underlying error was that two of the fields required by schema.xml in the Solr configuration were missing from the JSON payload.
An error message of the kind "required parameters are missing in the document" would have been more helpful here. You might want to check if some required fields are missing in the json payload.
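A quick client-side check along those lines, as a Python sketch. The list of required fields is hypothetical here; in practice you would copy it from the fields marked required="true" in your schema.xml.

```python
def missing_required_fields(doc, required_fields):
    """Return the required field names that are absent (or None) in doc."""
    return [name for name in required_fields if doc.get(name) is None]


# Hypothetical required fields; take the real ones from your schema.xml.
REQUIRED = ["id", "title"]

doc = {"id": "ID1000", "tag": ["A test value", "A boosted value"]}
missing = missing_required_fields(doc, REQUIRED)
# missing == ["title"]
```

Running this before posting gives you the clearer "required field is missing" message that the server did not provide in this case.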
