Characters to split the user-query in Vespa engine - vespa

We split the user query on ASCII spaces to build a weakAnd(...).
The user input "Watch【Docudrama】" contains no whitespace, yet the query fails with an error.
Question: Which code points besides whitespace should be used to split the query?
YQL (fails):
select * from post where text contains "Watch【Docudrama】" limit 1;
YQL (works):
select * from post where weakAnd(text contains "Watch",text contains "【Docudrama】") limit 1;
Error message:
{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 0
    },
    "errors": [
      {
        "code": 4,
        "summary": "Invalid query parameter",
        "source": "content",
        "message": "Can not add WORD_ALTERNATIVES text:[ Watch【Docudrama】(1.0) watch(0.7) ] to a segment phrase"
      }
    ]
  }
}

Are you sure you need to use WAND for this? Try setting the user query grammar to "any" (default is "all"), which will use the "OR" operator for user supplied terms. There is an example here: https://docs.vespa.ai/documentation/reference/query-language-reference.html#userinput
The process of splitting up the query is known as tokenization. It is a complex, language-dependent process, and Vespa uses Apache OpenNLP to do this (and more): https://docs.vespa.ai/documentation/linguistics.html has more information, along with references to the code that performs this operation.
If you really want to use WAND, instead of reimplementing the query parsing logic outside Vespa, I suggest you create a Java searcher which descends the query tree and modifies it by replacing the created AndItem with WeakAndItem. See https://docs.vespa.ai/documentation/searcher-development.html and the code example here: https://docs.vespa.ai/documentation/advanced-ranking.html
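As a rough illustration of the first suggestion (passing the raw text to Vespa and letting it tokenize), here is a minimal Python sketch that builds the request. The function name build_any_query is hypothetical, the bracketed annotation form is the older YQL syntax and may differ in your Vespa version, and "defaultIndex": "text" is an assumption taken from the field name in the question.

```python
from urllib.parse import urlencode


def build_any_query(user_text: str, doc_type: str = "post", field: str = "text") -> str:
    """Return a /search/ URL path that hands the raw user text to Vespa.

    Hypothetical helper: the text is passed as a separate request parameter
    and referenced from YQL with userInput(@userinput), so Vespa tokenizes it
    instead of the client splitting on spaces.
    """
    # Older bracketed annotation syntax (assumption: adjust for your Vespa version).
    yql = (
        f"select * from {doc_type} where "
        f'([{{"grammar": "any", "defaultIndex": "{field}"}}]userInput(@userinput)) '
        "limit 1;"
    )
    # urlencode percent-encodes the non-ASCII brackets and the YQL punctuation.
    return "/search/?" + urlencode({"yql": yql, "userinput": user_text})


print(build_any_query("Watch【Docudrama】"))
```

The point of the design is that the client never decides where a token boundary is; that decision stays with the linguistics module on the Vespa side.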

Related

How to ensure all items in a collection match a filter in Azure Cognitive Search

I have Azure Cognitive Search running, and my index is working as expected.
We are trying to add a security filter to the search, based on the current user's permissions.
The user's permissions come to me as an IEnumerable, but I currently select just a string[], pass that into my filter, and then do a string.Join, which produces this:
permission1, permission2, permission3, permission4
In our SQL database, we have a view that the index gets its data from. The view has a column called RequiredPermissions, which is a Collection(Edm.String) in the index, and the data looks like this:
[ 'permission1', 'permission2', 'permission3' ]
The requirement is that for a record to return in the results, a user's permissions must contain all of the RequiredPermissions for that record.
So if we have a user with the following permissions
permission1, permission3, permission5
And we have the following records
Id, SearchText, Type, Permissions
1, abc, User, [ 'permission1', 'permission2' ]
2, abc.pdf, Document, [ 'permission1' ]
3, abc, Thing, [ 'permission1', 'permission3' ]
4, abc, Stuff, [ 'permission3', 'permission4' ]
If the user searched for 'abc', all four of these records would match, so I need a $filter that removes the results for which the user lacks the proper permissions. I would expect the following results:
Id, Returned, Reason
1, no, the user does not have permission2
2, yes, the user has permission1 and nothing else is needed
3, yes, the user has both permission1 and permission3
4, no, the user does not have permission4
If I run the following filter, then I get back anything that has permission1 or permission3, which is not acceptable, since the user should not see items Id 1 or 4
RequiredPermissions/any(role: search.in(role, 'permission1, permission3', ','))
If I run this filter, then I get nothing back, everything is rejected, because no records have permission5, and the user has it
RequiredPermissions/all(role: not search.in(role, 'permission1, permission3', ','))
If I try to run the search using 'all' and without the 'not' I get the following error
RequiredPermissions/all(role: search.in(role, 'permission1, permission3', ','))
Invalid expression: Invalid lambda expression. Found a test for equality or inequality where the opposite was expected in a lambda expression that iterates over a field of type Collection(Edm.String). For 'any', please use expressions of the form 'x eq y' or 'search.in(...)'. For 'all', please use expressions of the form 'x ne y', 'not (x eq y)', or 'not search.in(...)'.\r\nParameter name: $filter
So it seems that I cannot use the 'not' with 'any', and I must use the 'not' with 'all'
What I want is a way to say that the user holds every permission listed in the record's RequiredPermissions column.
I am currently just working in Postman against the REST API to solve this, but I will eventually move this into .NET.
Your scenario can't be implemented with Collection(Edm.String) due to the limitations on how all and any work on such collections (documented here).
Fortunately, there is an alternative. You can model permissions as a collection of complex types, which allows you to use all the way that you need to implement your permissions model. Here is a JSON example of how the field would be defined:
{
  "name": "test",
  "fields": [
    { "name": "Id", "type": "Edm.String", "key": true },
    {
      "name": "RequiredPermissions",
      "type": "Collection(Edm.ComplexType)",
      "fields": [{ "name": "Name", "type": "Edm.String" }]
    }
  ]
}
Here is a JSON example of what a document would look like with its permissions defined:
{ "@search.action": "upload", "Id": "1", "RequiredPermissions": [{"Name": "permission1"}, {"Name": "permission2"}] }
Here is how you could construct a filter that has the desired effect:
RequiredPermissions/all(perm: search.in(perm/Name, 'permission1,permission3,permission5'))
While this works, you are strongly advised to test the performance of this solution with a realistic set of data. Under the hood, all is executed as a negated any, and negated queries can sometimes perform poorly with the type of inverted indexes used by a search engine.
Also, please be aware that there is currently a limit of 3000 on the total number of elements across all complex collections in a document. So if RequiredPermissions were the only complex collection in your index, you could have at most 3000 permissions defined per document.
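To make the intended semantics concrete, here is a small Python sketch that builds the filter string from a user's permissions and checks the four sample records from the question locally. The helper names are hypothetical, the local check is only a stand-in for what the service evaluates, and it assumes permission names contain no commas, spaces, or single quotes.

```python
def build_permission_filter(user_permissions):
    """Build the OData $filter for the complex-collection model shown above."""
    # Assumption: names need no escaping (no commas, spaces, or single quotes).
    joined = ",".join(user_permissions)
    return f"RequiredPermissions/all(perm: search.in(perm/Name, '{joined}'))"


def is_visible(required, user_permissions):
    """Local stand-in for the filter: every required permission must be held."""
    held = set(user_permissions)
    return all(name in held for name in required)


# The four sample records from the question (Id -> RequiredPermissions).
records = {
    1: ["permission1", "permission2"],
    2: ["permission1"],
    3: ["permission1", "permission3"],
    4: ["permission3", "permission4"],
}
user = ["permission1", "permission3", "permission5"]

print(build_permission_filter(user))
print([doc_id for doc_id, required in records.items() if is_visible(required, user)])  # [2, 3]
```

Records 1 and 4 drop out because they require permission2 and permission4 respectively, which matches the expected results table in the question.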

Differences between Suggesters and NGram

I've built an index with a Custom Analyzer
"analyzers": [
  {
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "name": "ingram",
    "tokenizer": "whitespace",
    "tokenFilters": [ "lowercase", "NGramTokenFilter" ],
    "charFilters": []
  }
],
"tokenFilters": [
  {
    "@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
    "name": "NGramTokenFilter",
    "minGram": 3,
    "maxGram": 8
  }
],
I came upon Suggesters and was wondering what the pros/cons were between these 2 approaches.
Basically, I'm building a JavaScript autocomplete text box. I need to do partial text search inside the searched text (i.e. search=ell would match on "Hello World").
Azure Search offers two features to enable this depending on the experience you want to give to your users:
- Suggestions: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions
- Autocomplete: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
Suggestions will return a list of matching documents even with incomplete query terms, and you are right that it can be reproduced with a custom analyzer that uses ngrams. It's just a simpler way to accomplish that (since we took care of setting up the analyzer for you).
Autocomplete is very similar, but instead of returning matching documents, it will simply return a list of completed "terms" that match the incomplete term in your query. This will make sure terms are not duplicated in the autocomplete list (which can happen when using the suggestions API, since as I mentioned above, suggestions return matching documents, rather than a list of terms).
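For comparison, here is a toy Python sketch of what the custom analyzer in the question does at indexing time: whitespace tokenization, lowercasing, then n-grams of length 3 to 8 per token. It only mimics the configuration for illustration and is not how the service is implemented; the function names are hypothetical.

```python
def ngrams(token, min_gram=3, max_gram=8):
    """Return every substring of token with length min_gram..max_gram."""
    token = token.lower()  # mirrors the "lowercase" token filter
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def index_terms(text):
    """Whitespace-tokenize the text, then expand each token into n-grams."""
    terms = set()
    for token in text.split():
        terms |= ngrams(token)
    return terms


terms = index_terms("Hello World")
print("ell" in terms)  # True: "ell" is a 3-gram of "hello"
print("el" in terms)   # False: shorter than minGram
```

This is why search=ell can match "Hello World" with the n-gram analyzer: the infix is itself an indexed term. It also shows the cost, since every token is expanded into many terms.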

How to perform a full-text search in Vespa?

I am trying to do a full-text search on a field of some documents, and I am looking for advice on how to do so. I first tried this type of request:
GET http://localhost:8080/search/?query=lord+of+the+rings
But it only returned documents whose field exactly matched the given string and contained nothing else, so I tried the equivalent in YQL:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text CONTAINS "lord of the rings";
And I got exactly the same results. On reading the documentation further I came across the MATCHES operator, and it does give me the results I seem to be looking for, with this kind of request:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text MATCHES "lord of the rings";
However, for some requests of this type I get a timeout error like the following, and I don't know why:
{
  "root": {
    "id": "toplevel",
    "relevance": 1,
    "fields": {
      "totalCount": 0
    },
    "errors": [
      {
        "code": 12,
        "summary": "Timed out",
        "source": "site",
        "message": "Timeout while waiting for sc0.num0"
      }
    ]
  }
}
So I worked around this issue by setting a timeout greater than the default:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text MATCHES "lord of the rings";&timeout=20000
My question is: am I doing full-text search the right way, and how could I improve it?
EDIT: Here is the corresponding search definition:
search site {
    document site {
        field text type string {
            stemming: none
            normalizing: none
            indexing: attribute
        }
        field title type string {
            stemming: none
            normalizing: none
            indexing: attribute
        }
    }
    fieldset default {
        fields: title, text
    }
    rank-profile post inherits default {
        rank-type text: about
        rank-type title: about
        first-phase {
            expression: nativeRank(title, text)
        }
    }
}
What does your search definition file look like? I suspect you have put your text content in an "attribute" field, which defaults to "word match" semantics. You probably want "text match" semantics which means you'll need to put your content in an "index" type field.
https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#match
The "MATCHES" operator you are using interprets your input as a regular expression, which is powerful, but slow as it applies the regular expression on all attributes (further optimizations to something like https://swtch.com/~rsc/regexp/regexp4.html are possible but not currently implemented).
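The difference between the two match modes can be pictured with a toy Python model. This is only a simplified illustration of the semantics described above, not Vespa's implementation: "word match" compares the query against the whole field value, while "text match" tokenizes the field and looks for the query tokens inside it. The function names and the sample title are made up for the example.

```python
import re


def tokens(text):
    """Very rough tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def word_match(field_value, query):
    """Toy model of attribute 'word' matching: the whole value must match."""
    return field_value.lower() == query.lower()


def text_match(field_value, query):
    """Toy model of 'text' matching: query tokens appear as a phrase in the field."""
    f, q = tokens(field_value), tokens(query)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))


title = "The Lord of the Rings: The Two Towers"
print(word_match(title, "lord of the rings"))  # False
print(text_match(title, "lord of the rings"))  # True
```

In this model the first two requests in the question behave like word_match, which is why only exact field values came back; switching the fields to index-type fields is what gives text_match-like behavior without resorting to regular expressions.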

SoapUI: Count Nodes Returned in JSON Array Response

I've learned so much using SoapUI, but, I'm just stuck on this one thing. I have the following payload returned:
[
{
"#c": ".CreditPaymentInfo",
"supplementalInfo": null,
"date": "06/30/2015 17:03:50",
"posTxCode": "107535",
"amt": 2.56,
"transactionId": 235087,
"id": 232163,
"cardType": "CREDIT",
"cardHolderName": "SMITH2/JOE",
"expMonthYear": "0119",
"lastFourDigits": "4444",
"approvalCode": "315PNI",
"creditTransactionNumber": "A71A7DB6C2F4"
},
{
"#c": ".CreditPaymentInfo",
"supplementalInfo": null,
"date": "07/01/2015 15:53:29",
"posTxCode": "2097158",
"amt": 58.04,
"transactionId": 235099,
"id": 232176,
"cardType": "CREDIT",
"cardHolderName": "SMITH2/JOE",
"expMonthYear": "0119",
"lastFourDigits": "4444",
"approvalCode": "",
"creditTransactionNumber": null
}
]
I would like to count how many nodes are returned... so, in this case, I would expect that 2 nodes be returned whenever I run this test step in SoapUI.
I was attempting to get this done using the JsonPath Count assertion, but I just can't seem to format it correctly.
Any help would be greatly appreciated!
I have not used JsonPath, but you can do this with XPath ... which works for all older versions too.
Internally SoapUI represents everything as XML. So you could use XPath assertion to check for:
${#ResponseAsXml#count(//*:id)}
and make sure it comes back as
2
Counted successfully with 'JsonPath count' using one of the following (assuming my top level object is an array) :
$
$.
$[*]
If you need to be more specific about the objects you're counting, you can rely only on the third syntax, specifying one of the fields that repeats in every object. One of the following worked for me:
$[*].fieldName
$[*].['fieldName']
Should return 2 in your case with one of the following :
$[*].#c
$[*].['#c']
$[*].id
$[*].['id']
And so on
This is a JSON format. Just use JsonPath Count. Use $ at the top and 2 at the bottom.
$[index].your.path.here
thus
$[0].date would return "06/30/2015 17:03:50"
and
$[1].date would return "07/01/2015 15:53:29"
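If you want to sanity-check the expected values outside SoapUI, the same counts can be reproduced with a few lines of Python. This is only a cross-check under the assumption that the response is a top-level JSON array; the payload below is a trimmed copy of the one in the question.

```python
import json

# Trimmed copy of the response from the question (two objects in an array).
payload = """
[
  {"date": "06/30/2015 17:03:50", "transactionId": 235087, "id": 232163},
  {"date": "07/01/2015 15:53:29", "transactionId": 235099, "id": 232176}
]
"""

nodes = json.loads(payload)

# Counterpart of the JsonPath Count expression $[*]: number of array elements.
node_count = len(nodes)

# Counterpart of $[*].id: count only the elements that carry an "id" field.
id_count = sum(1 for node in nodes if "id" in node)

print(node_count, id_count)   # 2 2
print(nodes[0]["date"])       # counterpart of $[0].date
```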

Solr, adding a record via JSON with a multi-value field and boosted values

I'm pretty new to Solr, I'm trying to add a multi-value field with boost values defined for each value, all defined via JSON. In other words, I'd like this to work:
[{ "id": "ID1000",
"tag": [
{ "boost": 1, "value": "A test value" },
{ "boost": 2, "value": "A boosted value" } ]
}]
I know how to do that in XML (multiple <field name = 'tag' boost = '...'>), but the JSON code above doesn't work; the server says "Error parsing JSON field value. Unexpected OBJECT_START". Is this a Solr limitation or a bug?
PS: I fixed the originally-missing ']' and that's not the problem.
EDIT: It seems the way to go should be payloads (http://wiki.apache.org/solr/Payloads), but I couldn't get them to work on Solr (I followed this: http://sujitpal.blogspot.co.uk/2011/01/payloads-with-solr.html). Leaving the question open to see if someone can help further.
I found the following sentence in the Solr Relevancy FAQ (Query Elevation Component section):
An Index-time boost on a value of a multiValued field applies to all values for that field.
I do not think adding an individual boost to each value in the multivalued field is going to work. I know that the XML will allow it, but I would guess that it may only apply the boost value from the last value applied to the field.
So based on that I would change the JSON to the following and see if that works.
[
{
"id": "ID1000",
"tag": {
"boost": 2,
"value": [ "A test value", "A boosted value"]
}
}
]
The JSON seems to be invalid: it is missing a closing ]
[
{
"id": "ID1000",
"tag": [
{
"boost": 1,
"value": "A test value"
},
{
"boost": 2,
"value": "A boosted value"
}
]
}
]
You hit an edge case. You can have the boosts on single values and you can have an array of values. But not one inside another (from my reading of Solr 4.1 source code)
That might be something to create as an enhancement request.
If you are generating that JSON by hand, you can try:
"tag": { "boost": 1, "value": "A test value" },
"tag": { "boost": 2, "value": "A boosted value" }
I believe Solr will merge the values then. But if you are generating it via a framework, it will most likely disallow or override multiple object property names (tag here).
The error has nothing to do with boosting.
I get the same error with a very simple json doc.
No luck solving it.
See "Solr errors when trying to parse a collection: Error parsing JSON field value. Unexpected OBJECT_START".
I hit the same error message. The error message was actually misleading. The real underlying error was that two of the fields required by schema.xml in the Solr configuration were missing from the JSON payload.
An error message along the lines of "required fields are missing in the document" would have been more helpful here. You might want to check whether some required fields are missing from the JSON payload.
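A quick way to rule this cause out before posting is to compare each document against the required fields yourself. The sketch below is a minimal Python check; the REQUIRED_FIELDS set is hypothetical and has to be copied from the fields marked required="true" in your own schema.xml.

```python
import json

# Hypothetical: replace with the fields marked required="true" in schema.xml.
REQUIRED_FIELDS = {"id", "title"}


def missing_required(docs, required=REQUIRED_FIELDS):
    """Map each document's position to the required fields it lacks."""
    report = {}
    for position, doc in enumerate(docs):
        missing = sorted(required - set(doc))
        if missing:
            report[position] = missing
    return report


payload = json.loads('[{"id": "ID1000", "tag": ["A test value", "A boosted value"]}]')
print(missing_required(payload))  # {0: ['title']}
```

An empty result means every document carries the required fields, in which case the parse error has another cause, such as the nested boost objects discussed above.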

Resources