Using inherited struct array attributes in ranking and searching - Vespa

I am currently using document inheritance to avoid having to include specific fields in every document type that I make. It works for most scenarios, but when an array of structs is inherited, using that field in ranking and searching seems to break things. For example, take the following application.
schema base {
    document base {
        field simple_base type string {
            indexing: summary | index
        }
        struct base_struct {
            field name type string {}
        }
        field complex_base type array<base_struct> {
            indexing: summary
            struct-field name { indexing: attribute }
        }
    }
}
schema sub {
    document sub inherits base {
        field simple_sub type string {
            indexing: summary | index
        }
        struct sub_struct {
            field name type string {}
        }
        field complex_sub type array<sub_struct> {
            indexing: summary
            struct-field name { indexing: attribute }
        }
    }
    rank-profile default inherits default {
        first-phase {
            expression: nativeRank(simple_sub, complex_sub.name, simple_base, complex_base.name)
        }
    }
}
If you try to prepare this application, you will get the error:
WARNING: invalid rank feature 'nativeRank(simple_sub,complex_sub.name,simple_base,complex_base.name)': The parameter list used for setting up rank feature nativeRank is not valid: Param[3]: Field 'complex_base.name' was not found in the index environment
If you take complex_base.name out of the nativeRank function, it compiles properly. Additionally, if you try to search using the various fields after feeding the following JSON:
{
    "put": "id:content:sub::0",
    "fields": {
        "simple_base": "simple",
        "simple_sub": "simple",
        "complex_base": [
            { "name": "complex" }
        ],
        "complex_sub": [
            { "name": "complex" }
        ]
    }
}
Then the following are the results:
/search/?query=simple_sub:simple -> 1 result
/search/?query=complex_sub.name:complex -> 1 result
/search/?query=simple_base:simple -> 1 result
/search/?query=complex_base.name:complex -> 0 results
If this is the intended behavior, is there a workaround other than copy/pasting the complex_base field into every document type that needs to inherit it? Thank you for your assistance.

This is not related to inheritance. Vespa.ai does not support ranking for user-defined struct fields. See https://github.com/vespa-engine/vespa/issues/11580.
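A possible workaround (a sketch, not from the linked issue) is to denormalize the struct values into a parallel primitive field at feed time, since a plain array<string> field can be indexed and used in nativeRank. Assuming the base document also declares a hypothetical companion field complex_base_names, the feed side could look like this in Python:

import requests

# Hedged sketch. Assumes the base schema declares a hypothetical field:
#     field complex_base_names type array<string> {
#         indexing: summary | index
#     }
# Unlike array<struct> attributes, this flat field can be used in
# nativeRank and in ordinary text search.
doc = {
    "fields": {
        "simple_base": "simple",
        "complex_base": [{"name": "complex"}],
    }
}
# Mirror each struct's name into the flat companion field.
doc["fields"]["complex_base_names"] = [
    entry["name"] for entry in doc["fields"]["complex_base"]
]

# Put the document via the document/v1 API (endpoint is deployment-specific).
requests.post(
    "http://localhost:8080/document/v1/content/sub/docid/0", json=doc
).raise_for_status()

The cost is duplicated data and keeping the two fields in sync in the feed client.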

Related

Searching non-primitive types within a struct

I need to search within an array nested in another array. Let's say I have the following document:
schema foos {
    document foos {
        struct foo {
            field bars type array<string> {}
        }
        field baz type string {
            indexing: summary
        }
        field foos type array<foo> {
            indexing: summary
            struct-field bars { indexing: attribute } // breaks due to non-primitive typing
        }
    }
}
I need to be able to search within the bars field, or at least access that field within a rank-profile, whilst also being able to search based on baz. My first thought for a solution would be to structure it as follows:
schema foos {
    document foos {
        field id type string {
            indexing: summary
        }
        field baz type string {
            indexing: summary | attribute
        }
        field foos type array<reference<foo>> {
            indexing: summary
        }
    }
}
schema foo {
    document foo {
        field foos_ref type reference<foos> {
            indexing: attribute
        }
        field bars type array<string> {
            indexing: summary | index
        }
    }
    import field foos_ref.baz as baz {}
}
This would allow me to search within the foo cluster and then fetch the corresponding foos reference, but the overall goal is to give the user a list of foos documents, which would require multiple follow-up searches over the returned foo documents, making searches slow overall.
If there is a recommended way to handle situations like these, any help would be appreciated. Thank you.
First, note that a reference field used for parent-child relationships can only be single-valued, not an array (https://docs.vespa.ai/documentation/reference/schema-reference.html#type:reference). The reference field is specified in the child type to reference the parent. The schemas can be defined as follows, where foos is the parent type and foo is the child type:
schema foos {
    document foos {
        field id type string {
            indexing: summary | attribute
        }
        field baz type string {
            indexing: summary | attribute
        }
    }
}
schema foo {
    document foo {
        field foos_ref type reference<foos> {
            indexing: attribute
        }
        field bars type array<string> {
            indexing: summary | index
        }
    }
    import field foos_ref.baz as foos_baz {}
    import field foos_ref.id as foos_id {}
}
Now you can search for foo documents using the fields bars and foos_baz in the query. Use grouping (https://docs.vespa.ai/documentation/grouping.html) on the foos_id field to structure the result around the foos documents instead. This is handled in a single query request.
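For illustration, here is a hedged sketch of such a request in Python (term values and the endpoint are placeholders; YQL and grouping syntax as in the linked docs):

import requests

# One request: select foo children by bars/foos_baz, then group the hits
# by the imported foos_id so the result is organized per foos parent.
yql = (
    'select * from sources foo where bars contains "some_bar" '
    'and foos_baz contains "some_baz" '
    '| all(group(foos_id) each(max(10) each(output(summary))))'
)
response = requests.get("http://localhost:8080/search/", params={"yql": yql})
print(response.json())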

Using $rename in MongoDB for an item inside an array of objects

Consider the following MongoDB collection of a few thousand objects:
{
    _id: ObjectId("xxx"),
    FM_ID: "123",
    Meter_Readings: [
        { Date: 2011-10-07, Begin_Read: true, Reading: 652 },
        { Date: 2018-10-01, Begin_Reading: true, Reading: 851 }
    ]
}
The wrong key ("Begin_Reading") was entered for the 2018 entry and needs to be renamed to "Begin_Read". Using another aggregate, I have a list of all the objects that have the incorrect key. The objects within the array don't have an _id value, so they are hard to select. I was thinking I could iterate through the collection, find the array index of the errored readings, and use the _id of the parent document to perform the $rename on the key.
I am trying to get the index of the array, but cannot seem to select it correctly. The following aggregate is what I have:
[
    {
        '$match': {
            '_id': ObjectId('xxx')
        }
    },
    {
        '$project': {
            'index': {
                '$indexOfArray': [
                    '$Meter_Readings',
                    {'$eq': ['$Meter_Readings.Begin_Reading', True]}
                ]
            }
        }
    }
]
Its result is always -1, which I think means my expression must be wrong, as the expected result is 1.
I'm using Python for this script (I can use JavaScript as well). If there is a better way to do this (maybe a filter?), I'm open to alternatives; this is just what I've come up with.
I fixed this myself. I was close with the aggregate, but I needed to look at a different field; for some reason that one did not work:
{
    '$project': {
        'index': {
            '$indexOfArray': ['$Meter_Readings.Water_Year', 2018]
        }
    }
}
What I did learn is that to find an object within an array, you can reference a field of the array's elements directly in the path passed to $indexOfArray. I hope that might help someone else.
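One caveat worth adding (from the MongoDB docs, not the original post): $rename does not work on fields inside array elements, so once the index is known, the rename has to be emulated with a $set/$unset pair on the positional path. A minimal pymongo sketch along the lines of the poster's plan:

from bson import ObjectId
from pymongo import MongoClient

coll = MongoClient()["mydb"]["readings"]  # database/collection names assumed
doc_id = ObjectId("xxx")                  # placeholder _id from the question

# Locate the element's index via a field that exists on every element
# (here Water_Year, as in the poster's data). Missing fields are skipped
# by the path expression, which would otherwise misalign the indexes.
idx = next(coll.aggregate([
    {"$match": {"_id": doc_id}},
    {"$project": {"index": {"$indexOfArray": ["$Meter_Readings.Water_Year", 2018]}}},
]))["index"]

if idx >= 0:  # -1 means no matching element
    value = coll.find_one({"_id": doc_id})["Meter_Readings"][idx]["Begin_Reading"]
    # Emulate $rename on an array element: copy the value, drop the old key.
    coll.update_one(
        {"_id": doc_id},
        {
            "$set": {f"Meter_Readings.{idx}.Begin_Read": value},
            "$unset": {f"Meter_Readings.{idx}.Begin_Reading": ""},
        },
    )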

Elasticsearch: how to perform a search like MSSQL "LIKE 'word%'" with tokenized strings?

Currently we are performing full-text search in MSSQL with the query:
select * from contract where number like 'word%'
The problem is that a contract number may look like
АА-1641471
TST-100069
П-5112-90-00230
001-1000017
1617/292/000001
and ES splits all of these into tokens.
How can I configure ES not to split these contract numbers into tokens and perform the same search as the SQL query above?
The closest solution I've found is to perform a query like this:
{
    "size": 10,
    "query": {
        "regexp": {
            "contractNumber": {
                "value": ".*п-11.*"
            }
        }
    }
}
This solution works the same as MSSQL LIKE 'word%' with values like 1111 or 2568, but fails with п-11.
One option could be to use the wildcard query, which can perform any type of wildcard combination, i.e. %val%, %val, or val%:
{
    "query": {
        "wildcard": { "contractNumber": "*11" }
    }
}
NOTE: It's not recommended to start with a wildcard in the search; it can be extremely slow.
To make this work with string values and prevent them from being tokenized, you need to update your index and tell the analyser to stay away. One way of doing that is to define the property as type keyword instead of text:
PUT /_template/template_1
{
    "index_patterns": ["your_index*"],
    "order": 0,
    "settings": {
        "number_of_shards": 1
    },
    "mappings": {
        "your_document_type": {
            "properties": {
                "contractNumber": {
                    "type": "keyword"
                }
            }
        }
    }
}
NOTE: replace your_index with your index name and your_document_type with the document type.
When the mapping is added, delete the current index and recreate it. It will then use the template for properties, and your contractNumber will be indexed as a keyword.
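With the keyword mapping in place, a prefix query is the closest analogue of LIKE 'word%' and much cheaper than a leading wildcard or regexp. A hedged sketch in Python (index name and endpoint are placeholders; note that keyword values match case-sensitively unless a normalizer is added):

import requests

# Prefix query against the un-tokenized keyword field: the direct
# equivalent of SQL LIKE 'П-5112%'.
query = {
    "size": 10,
    "query": {"prefix": {"contractNumber": {"value": "П-5112"}}},
}
response = requests.post("http://localhost:9200/your_index/_search", json=query)
print(response.json())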

Spark get datatype of nested object

I have some JSON data which looks like this:
{
    "key1": "value1",
    "key2": [1, 2, 3],
    "key3": {
        "key31": "value31",
        "key32": "value32"
    },
    "key4": [
        {
            "key41": "value411",
            "key42": "value412",
            "key43": "value413"
        },
        {
            "key41": "value421",
            "key42": "value422",
            "key43": "value423"
        }
    ],
    "key5": {
        "key51": [
            {
                "key511": "value511",
                "key512": "value512",
                "key513": "value513"
            },
            {
                "key511": "value521",
                "key512": "value522",
                "key513": "value523"
            }
        ]
    },
    "key6": {
        "key61": {
            "key611": [
                {
                    "key_611": "value_611",
                    "key_612": "value_612",
                    "key_613": "value_613"
                },
                {
                    "key_611": "value_621",
                    "key_612": "value_622",
                    "key_613": "value_623"
                },
                {
                    "key_611": "value_621",
                    "key_612": "value_622",
                    "key_613": "value_623"
                }
            ]
        }
    }
}
It contains a mix of simple, complex, and array-type values.
If I try to get the datatype of key1 with schema("key1").dataType, I get StringType, and the lookup works likewise for key2, key3, and key4.
For key5 also, I get StructType.
But when I try to get the datatype for key51, which is nested under key5, using schema("key5.key51").dataType, I get the following error:
java.lang.IllegalArgumentException: Field "key5.key51" does not exist.
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
    at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
    at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
    at scala.collection.AbstractMap.getOrElse(Map.scala:59)
    at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
    ... 48 elided
The main intention for me is to be able to explode a given field if it's of ArrayType and not explode it for any other type.
The explode function recognizes the given key (key5.key51) properly and explodes the array; the problem is with determining the datatype beforehand.
One possible solution for me is to do a select of key5.key51 as a separate column key51 and then explode that column.
But is there any better and more elegant way of doing this while still being able to determine the datatype of the given column?
The simplest solution is to select the field of interest, and then retrieve the schema:
df.select("key5.key51").schema.head.dataType
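The same check can drive the conditional explode the question asks for; a hedged PySpark sketch, assuming the same df:

from pyspark.sql.functions import col, explode
from pyspark.sql.types import ArrayType

# Explode key5.key51 only if it is an ArrayType; other types pass through.
dtype = df.select("key5.key51").schema.fields[0].dataType
if isinstance(dtype, ArrayType):
    df = df.withColumn("key51", explode(col("key5.key51")))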
Using the full schema directly would require traversing it, which is hard to do right in the presence of embedded dots in field names, StructTypes, and complex types (Maps and Arrays).
Here is some (recursive) code to find the names of all ArrayType fields:
import org.apache.spark.sql.types._

// Collect the dot-path components of every ArrayType field reachable from f.
// Returning one Seq per path keeps sibling array fields from being merged.
def findArrayTypes(parents: Seq[String], f: StructField): Seq[Seq[String]] = {
  f.dataType match {
    case _: ArrayType => Seq(parents)
    case struct: StructType =>
      struct.fields.toSeq.flatMap(child => findArrayTypes(parents :+ child.name, child))
    case _ => Seq.empty
  }
}

val arrayTypeColumns = df.schema.fields.toSeq
  .flatMap(f => findArrayTypes(Seq(f.name), f))
  .map(_.mkString("."))
For your dataframe, this gives:
arrayTypeColumns.foreach(println)
key2
key4
key5.key51
key6.key61.key611
This does not work yet for arrays inside maps or nested arrays.

findBy query not returning correct page info

I have a Person collection that is made up of the following structure:
{
    "_id": ObjectId("54ddd6795218e7964fa9086c"),
    "_class": "uk.gov.gsi.hmpo.belt.domain.person.Person",
    "imagesMatch": true,
    "matchResult": {
        "_id": null,
        "score": 1234,
        "matchStatus": "matched",
        "confirmedMatchStatus": "notChecked"
    },
    "earlierImage": DBRef("image", ObjectId("54ddd6795218e7964fa9086b")),
    "laterImage": DBRef("image", ObjectId("54ddd67a5218e7964fa908a9")),
    "tag": DBRef("tag", ObjectId("54ddd6795218e7964fa90842"))
}
Notice that the "tag" is a DBRef.
I've got a Spring Data finder that looks like the following:
Page<Person> findByMatchResultNotNullAndTagId(@Param("tagId") String tagId, Pageable page);
When this code is executed the find query looks like the following:
{ matchResult: { $ne: null }, tag: { $ref: "tag", $id: ObjectId('54ddd6795218e7964fa90842') } } sort: {} projection: {} skip: 0 limit: 1
Which is fine: I get a collection of 1 person back (limit=1). However, the page details are not correct. I have 31 persons in the collection, so I should have 31 pages. What I get is the following:
"page" : {
"size" : 1,
"totalElements" : 0,
"totalPages" : 0,
"number" : 0
}
The count query looks like the following:
{ count: "person", query: { matchResult: { $ne: null }, tag.id: "54ddd6795218e7964fa90842" } }
That tag.id doesn't look correct to me compared with the equivalent find query above.
I've found that if I add a new method to org.springframework.data.mongodb.core.MongoOperations:
public interface MongoOperations {
    public long count(Query query, Class<?> entityClass, String collectionName);
}
And then re-jig AbstractMongoQuery.execute(Query query) to use that method instead of the similar method without the entityClass parameter then I get the correct paging results.
Question: Am I doing something wrong or is this a bug in Spring Data Mongo?
Edit
Taking inspiration from Christoph, I've added the following test code on Git: https://github.com/tedp/Spring-Data-Test
The information contained in the returned Page depends on the query executed. Assuming a total of 31 elements in your collection, only a few of them, or even just one, might match the given criteria by referencing the tag with id 54ddd6795218e7964fa90842. Therefore you only get the total number of elements that match the query, not the total number of elements in your collection.
This bug was actually fixed in DATAMONGO-1120, as pointed out by Christoph. I needed to override the Spring Data version to use 1.6.2.RELEASE until the next iteration of Spring Boot, where presumably Spring Data will be uplifted to at least 1.6.2.RELEASE.
