Vespa response: Summary data is incomplete: Timed out waiting for summary data - vespa

I am deploying a simple text retrieval system with Vespa. However, I found that when I set topk to a larger number, e.g. 40, the response includes the error message "Summary data is incomplete: Timed out waiting for summary data." along with some unexpected ids. The system works fine for a small topk such as 10. The response was as follows:
{'root': {'id': 'toplevel', 'relevance': 1.0, 'fields': {'totalCount': 1983140}, 'coverage': {'coverage': 19, 'documents': 4053984, 'degraded': {'match-phase': False, 'timeout': True, 'adaptive-timeout': False, 'non-ideal-state': False}, 'full': False, 'nodes': 1, 'results': 1, 'resultsFull': 0}, 'errors': [{'code': 12, 'summary': 'Timed out', 'message': 'Summary data is incomplete: Timed out waiting for summary data. 1 responses outstanding.'}], 'children': [{'id': 'index:square_datastore_content/0/34b46b2e96fc0aa18ed4941b', 'relevance': 44.44359956427316, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/16dbc34c5e77684cd6f554fd', 'relevance': 43.94371735208669, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/9f2fd93f6d74e88f96d7014f', 'relevance': 43.298002713993384, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/76c4e3ee15dc684a78938a9d', 'relevance': 40.908658368905485, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/c04ceee4b9085a4d041d8c81', 'relevance': 36.13561898237115, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/13806c518392ae7b80ab4e4c', 'relevance': 35.688377118163714, 'source': 'square_datastore_content'}, {'id': 'index:square_datastore_content/0/87e0f13fdef1a1c404d3c8c6', 'relevance': 34.74150232183567, 'source': 'square_datastore_content'}, ...]}}
I am using the schema:
schema wiki {
    document wiki {
        field title type string {
            indexing: summary | index
            index: enable-bm25
        }
        field text type string {
            indexing: summary | index
            index: enable-bm25
        }
        field id type long {
            indexing: summary | attribute
        }
        field dpr_embedding type tensor<bfloat16>(x[769]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
        }
    }
    fieldset default {
        fields: title, text
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(text)
        }
    }
    rank-profile dpr inherits default {
        first-phase {
            expression: closeness(dpr_embedding)
        }
    }
}
This error occurred for both BM25 and DPR dense retrieval. So what is wrong? And what can I do? Thanks.

The default Vespa timeout is 500 ms and can be adjusted with &timeout=x, where x is given in seconds; e.g. &timeout=2 uses an overall request timeout of 2 seconds.
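For example, a minimal sketch of raising the timeout from Python; the endpoint and the query parameters are assumptions for illustration, not taken from your deployment:

import requests

# Hypothetical local Vespa container endpoint; adjust host/port to your deployment.
response = requests.get(
    "http://localhost:8080/search/",
    params={
        "yql": "select * from sources * where userQuery()",
        "query": "example question",
        "ranking": "bm25",   # rank profile from the schema above
        "hits": 40,
        "timeout": 2,        # overall request timeout in seconds
    },
)
print(response.json())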
A query is executed in two protocol phases:
1. Find the top k matches given the query/ranking profile combination; each content node returns up to k results.
2. The stateless container merges the results and finally asks for summary data (e.g. the contents of only the global top k results).
See https://docs.vespa.ai/en/performance/sizing-search.html for an explanation of this.
In your case you are hit by two things:
1. A soft timeout at the content node (coverage is reported to be only 19%): within the default timeout of 500 ms it could retrieve and rank only 19% of the available content. At 500 ms minus a small margin it timed out and returned what it had retrieved and ranked up until then.
2. When trying to use the time left, it also timed out waiting for the summary data for those documents it did manage to retrieve and rank within the soft timeout; this is the incomplete summary data response.
Generally, if you want cheap BM25 search, use WAND (https://docs.vespa.ai/en/using-wand-with-vespa.html). If you want to search using embeddings, use ANN instead of brute-force NN. We also have a complete sample application reproducing DPR (Dense Passage Retrieval) here: https://github.com/vespa-engine/sample-apps/tree/master/dense-passage-retrieval-with-ann
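As a rough illustration of both suggestions, here is a hedged Python sketch; the endpoint, the YQL annotation syntax, and the query-tensor wiring are assumptions based on the schema above and the linked documentation, so verify them against your Vespa version:

import requests

SEARCH_URL = "http://localhost:8080/search/"  # assumed local endpoint

# Cheap text retrieval: weakAnd (WAND) retrieves roughly the best 40 candidates
# instead of OR-ing all query terms exhaustively.
weakand_params = {
    "yql": 'select * from sources * where {targetHits: 40}weakAnd('
           'default contains "who", default contains "invented", default contains "radio")',
    "ranking": "bm25",
    "hits": 40,
    "timeout": 2,
}

# Dense retrieval: approximate nearest neighbor instead of a brute-force scan.
# This assumes an HNSW index on dpr_embedding and a query tensor named
# query_embedding declared in the dpr rank profile.
ann_params = {
    "yql": 'select * from sources * where {targetHits: 40}'
           'nearestNeighbor(dpr_embedding, query_embedding)',
    "ranking": "dpr",
    "ranking.features.query(query_embedding)": "[0.1, 0.2, ...]",  # placeholder tensor values
    "hits": 40,
    "timeout": 2,
}

for params in (weakand_params, ann_params):
    print(requests.get(SEARCH_URL, params=params).json())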

Related

Delimit records by recurring value

I have documents that contain an object array. Within that array are pulses in a dataset. For example:
samples: [{"time":1224960,"flow":0,"temp":null},{"time":1224970,"flow":0,"temp":null},
{"time":1224980,"flow":23,"temp":null},{"time":1224990,"flow":44,"temp":null},
{"time":1225000,"flow":66,"temp":null},{"time":1225010,"flow":0,"temp":null},
{"time":1225020,"flow":650,"temp":null},{"time":1225030,"flow":40,"temp":null},
{"time":1225040,"flow":60,"temp":null},{"time":1225050,"flow":0,"temp":null},
{"time":1225060,"flow":0,"temp":null},{"time":1225070,"flow":0,"temp":null},
{"time":1225080,"flow":0,"temp":null},{"time":1225090,"flow":0,"temp":null},
{"time":1225100,"flow":0,"temp":null},{"time":1225110,"flow":67,"temp":null},
{"time":1225120,"flow":23,"temp":null},{"time":1225130,"flow":0,"temp":null},
{"time":1225140,"flow":0,"temp":null},{"time":1225150,"flow":0,"temp":null}]
I would like to construct an aggregate pipeline to act on each collection of consecutive 'samples.flow' values above zero. As in, the sample pulses are delimited by one or more zero flow values. I can use an $unwind stage to flatten the data but I'm at a loss as to how to subsequently group each pulse. I have no objections to this being a multistep process. But I'd rather not have to loop through it in code on the client side. The data will comprise fields from a number of documents and could total in the hundreds of thousands of entries.
From the example above I'd like to be able to extract:
[{"time":1224980,"total_flow":123,"temp":null},
{"time":1225020,"total_flow":750,"temp":null},
{"time":1225110,"total_flow":90,"temp":null}]
or variations thereof.
If you are not looking for specific values on the time field, then you can use this pipeline with $bucketAuto:
[
  {
    "$bucketAuto": {
      "groupBy": "$time",
      "buckets": 3,
      "output": {
        "total_flow": { "$sum": "$flow" },
        "temp": { "$first": "$temp" },
        "time": { "$min": "$time" }
      }
    }
  },
  {
    "$project": { "_id": 0 }
  }
]
If you are looking for specific values for time, then you will need to use $bucket and provide it a boundaries argument with precalculated lower bounds. I think this solution should do the job.
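For reference, a minimal sketch of the $bucket variant written as a pymongo pipeline; the collection name is hypothetical, and the boundaries (the lower bound of each pulse plus one upper sentinel) would have to be precalculated from where the zero-flow gaps fall:

from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
coll = client["test"]["pulses"]  # hypothetical collection holding the unwound samples

# Lower bound of each pulse plus one upper sentinel, precalculated for the example data.
boundaries = [1224980, 1225020, 1225110, 1225160]

pipeline = [
    {"$bucket": {
        "groupBy": "$time",
        "boundaries": boundaries,
        "default": "outside_pulses",   # catches samples not covered by the boundaries
        "output": {
            "total_flow": {"$sum": "$flow"},
            "temp": {"$first": "$temp"},
            "time": {"$min": "$time"},
        },
    }},
    {"$project": {"_id": 0}},
]

for doc in coll.aggregate(pipeline):
    print(doc)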

How to fetch only condition satisfied/match object from an array in mongodb?

I am new to MongoDB.
This is my 'masterpatients' collection; it has many documents. Every document contains a 'visits' array, and every visits array contains multiple objects. I want only those objects which satisfy my input.
I am expecting only the output below: if the facility matches my input and the visit date falls within my provided date range, then the query should return only that object, as given below.
_id:5ef59134a3d8d92e580510fe
flag:0
name:"emicon_test"
dob:2020-06-25T00:00:00.000+00:00
visits:[
{
visit:2020-06-09T10:36:10.635+00:00,
facility:"Atria Lady Lake"
},
{
visit:2020-05-09T10:36:10.635+00:00,
facility:"demo"
}]
_id:5ee3213040f8830e04ff74a8
flag:0
name:"xyz"
dob:1995-06-25T00:00:00.000+00:00
visits:[
{
visit:2020-05-01T10:36:10.635+00:00,
facility:"pqr"
},
{
visit:2020-05-15T10:36:10.635+00:00,
facility:"demo"
},
{
visit:2020-05-09T10:36:10.635+00:00,
facility:"efg"
}]
My query input parameters is facility='demo' and visit date range is from '1st May 2020' to '10th May 2020'
output expected:
_id:5ef59134a3d8d92e580510fe
flag:0
name:"emicon_test"
dob:2020-06-25T00:00:00.000+00:00
visits:[
{
visit:2020-05-09T10:36:10.635+00:00,
facility:"demo"
}]
Thanks in advance.
I got an answer.
MasterPatientModel.aggregate([
  { "$unwind": "$visits" },
  {
    "$match": {
      "visits.visit": {
        "$gte": new Date(req.body.facilitySummaryFromdate),
        "$lte": new Date(req.body.facilitySummaryTodate)
      },
      "visits.facility": req.body.facilitySummary
    }
  }
])
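If you query from Python rather than Mongoose, here is a minimal pymongo sketch of the same idea; the database/collection names and date values are assumptions, and the optional $group stage (not part of the answer above) rebuilds the filtered visits array per patient:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
coll = client["test"]["masterpatients"]  # assumed database/collection names

pipeline = [
    {"$unwind": "$visits"},
    {"$match": {
        "visits.visit": {"$gte": datetime(2020, 5, 1), "$lte": datetime(2020, 5, 10)},
        "visits.facility": "demo",
    }},
    # Optional: regroup so each patient document keeps a filtered visits array.
    {"$group": {
        "_id": "$_id",
        "flag": {"$first": "$flag"},
        "name": {"$first": "$name"},
        "dob": {"$first": "$dob"},
        "visits": {"$push": "$visits"},
    }},
]

for doc in coll.aggregate(pipeline):
    print(doc)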
You cannot filter the contents of a mongo collection property on the server.
Make the visits array into a top-level collection/model and you can then filter by these criteria on the server.
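A rough pymongo sketch of that alternative layout; the 'visits' collection, its fields, and the patient reference are all hypothetical:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance
visits = client["test"]["visits"]  # hypothetical top-level collection, one document per visit

matching = visits.find({
    "facility": "demo",
    "visit": {"$gte": datetime(2020, 5, 1), "$lte": datetime(2020, 5, 10)},
})
for v in matching:
    print(v["patient_id"], v["visit"], v["facility"])  # patient_id is an assumed back-reference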

Mongodb: upsert multiple documents at the same time (with key like $in)

I stumbled upon a funny behavior in MongoDB:
When I run:
db.getCollection("words").update({ word: { $in: ["nico11"] } }, { $inc: { nbHits: 1 } }, { multi: 1, upsert: 1 })
it will create "nico11" if it doesn't exist, and increase nbHits by 1 (as expected).
However, when I run:
db.getCollection("words").update({ word: { $in: ["nico10", "nico11", "nico12"] } }, { $inc: { nbHits: 1 } }, { multi: 1, upsert: 1 })
it will correctly update the keys that are already in the DB, but not insert the missing ones.
Is that the expected behavior, and is there any way I can provide an array to mongoDB, for it to update the existing elements, and create the ones that need to be created?
That is expected behaviour according to the documentation:
The update creates a base document from the equality clauses in the <query> parameter, and then applies the update expressions from the <update> parameter. Comparison operations from the <query> will not be included in the new document.
And, no, there is no way to achieve what you are attempting to do here using a simple upsert. The reason for that is probably that the expected outcome would be impossible to define. In your specific case it might be possible to argue along the lines of: "oh well, it is kind of obvious what we should be doing here". But imagine a more complex query like this:
db.getCollection("words").update({
a: { $in: ["b", "c" ] },
x: { $in: [ "y", "z" ]}
},
{ $inc: { nbHits: 1 } },
{ multi: 1, upsert: 1 })
What should MongoDB do in this case?
There is, however, the concept of bulk write operations in MongoDB where you would need to define three separate updateOne operations and package them up in a single request to the server.
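A minimal pymongo sketch of that approach; the database and collection names are assumed:

from pymongo import MongoClient, UpdateOne

client = MongoClient()  # assumes a local MongoDB instance
words = client["test"]["words"]  # assumed database name

# One upserting updateOne per word, sent to the server as a single bulk request.
ops = [
    UpdateOne({"word": w}, {"$inc": {"nbHits": 1}}, upsert=True)
    for w in ["nico10", "nico11", "nico12"]
]
result = words.bulk_write(ops, ordered=False)
print(result.upserted_ids, result.modified_count)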

sagemaker clienterror rows 1-5000 have more fields than expected size 3

I have created a K-means training job with a csv file that I have stored in S3. After a while I receive the following error:
Training failed with the following error: ClientError: Rows 1-5000 in file /opt/ml/input/data/train/features have more fields than expected size 3.
What could be the issue with my file?
Here are the parameters I am passing to sagemaker.create_training_job
TrainingJobName=job_name,
HyperParameters={
    'k': '2',
    'feature_dim': '2'
},
AlgorithmSpecification={
    'TrainingImage': image,
    'TrainingInputMode': 'File'
},
RoleArn='arn:aws:iam::<my_acc_number>:role/MyRole',
OutputDataConfig={
    "S3OutputPath": output_location
},
ResourceConfig={
    'InstanceType': 'ml.m4.xlarge',
    'InstanceCount': 1,
    'VolumeSizeInGB': 20,
},
InputDataConfig=[
    {
        'ChannelName': 'train',
        'ContentType': 'text/csv',
        "CompressionType": "None",
        "RecordWrapperType": "None",
        'DataSource': {
            'S3DataSource': {
                'S3DataType': 'S3Prefix',
                'S3Uri': data_location,
                'S3DataDistributionType': 'FullyReplicated'
            }
        }
    }
],
StoppingCondition={
    'MaxRuntimeInSeconds': 600
}
I've seen this issue appear when doing unsupervised learning, such as the above example using clustering. If you have CSV input, you can also address this issue by setting label_size=0 in the ContentType parameter of the SageMaker API call, within the InputDataConfig branch.
Here's an example of what the relevant section of the call might look like:
"InputDataConfig": [
{
"ChannelName": "train",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "some/path/in/s3",
"S3DataDistributionType": "ShardedByS3Key"
}
},
"CompressionType": "None",
"RecordWrapperType": "None",
"ContentType": "text/csv;label_size=0"
}
]
Make sure your .csv doesn't have column headers, and that the label is the first column.
Also make sure your values for the hyperparameters are accurate, i.e. feature_dim means the number of features in your set. If you give it the wrong value, it'll break.
Here's a list of SageMaker k-NN hyperparameters and their meanings: https://docs.aws.amazon.com/sagemaker/latest/dg/kNN_hyperparameters.html
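As a quick local sanity check (the file name and the expected dimensions below are assumptions), you can verify that every row of the CSV has exactly the number of columns the algorithm expects before uploading it to S3:

import csv

FEATURE_DIM = 2   # must match the feature_dim hyperparameter
LABEL_SIZE = 0    # 0 for unsupervised training such as K-means
EXPECTED_COLUMNS = FEATURE_DIM + LABEL_SIZE

with open("features.csv", newline="") as f:  # hypothetical local copy of the training file
    for line_no, row in enumerate(csv.reader(f), start=1):
        if len(row) != EXPECTED_COLUMNS:
            print(f"Row {line_no} has {len(row)} fields, expected {EXPECTED_COLUMNS}")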

ArangoDB Indexes and arrays

I am trying to use a document collection for fast lookups. Sample document:
document Person {
    ...
    groups: ["admin", "user", "godmode"],
    contacts: [
        {
            label: "main office",
            items: [
                { type: "phone", value: '333444222' },
                { type: "phone", value: '555222555' },
                { type: "email", value: 'bob@gmail.com' }
            ]
        }
    ]
    ...
}
Created a hash index for the "groups" field.
Query: FOR P IN Person FILTER "admin" IN P.groups RETURN P
Result: working, BUT no index is used according to explain.
Question: How do I use queries with array filters and indexes? Performance is the main factor.
Created a hash index for "contacts[].items[].value".
Query: FOR P IN Person FILTER "333444222" == P.contacts[*].items[*].value RETURN P
Result: double usage of the wildcard is apparently not supported?? The index is not used and the query returns nothing.
Question: How do I organize fast lookups for this structure with indexes?
P.S. I also tried the MATCHES function and multi-level FOR-IN; hash indexes for arrays are never used.
ArangoDB version 2.6.8
Indexes can be used from ArangoDB version 2.8 on.
For the first query (FILTER "admin" IN p.groups), an array hash index on field groups[*] will work:
db._create("persons");
db.persons.insert(personDataFromOriginalExample);
db.persons.ensureIndex({ type: "hash", fields: [ "groups[*]" ] });
This type of index does not exist in versions prior to 2.8.
With an array index in place, the query will produce the following execution plan (showing that the index is actually used):
Execution plan:
Id NodeType Est. Comment
1 SingletonNode 1 * ROOT
6 IndexNode 1 - FOR p IN persons /* hash index scan */
3 CalculationNode 1 - LET #1 = "admin" in p.`groups` /* simple expression */ /* collections used: p : persons */
4 FilterNode 1 - FILTER #1
5 ReturnNode 1 - RETURN p
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash persons false false 100.00 % [ `groups[*]` ] "admin" in p.`groups`
The second query will not be supported by array indexes, as it contains multiple levels of nesting. The array indexes in 2.8 are restricted to one level, e.g. groups[*] or contacts[*].label will work, but not contacts[*].items[*].value.
About 1.): this is work in progress and will be included in one of the next releases (most likely 2.8).
We have not yet decided on the AQL syntax to retrieve the array, but FILTER "admin" IN P.groups is among the most likely ones.
About 2.): having implemented 1., this will work out of the box as well; the index will be able to cover several depths of nesting.
Neither of the above can be properly indexed in the current release (2.6).
The only alternative I can offer is to externalize the values and use edges instead of arrays.
With your data, it would look like the following (in arangosh).
I used fixed _key values for simplicity; it works without them as well:
db._create("groups"); // saves the group elements
db._create("contacts"); // saves the contact elements
db._ensureHashIndex("value") // Index on contacts.value
db._create("Person"); // You already have this
db._createEdgeCollection("isInGroup"); // Save relation group -> person
db._createEdgeCollection("hasContact"); // Save relation item -> person
db.Person.save({_key: "user"}) // The remainder of the object you posted
// Now the items
db.contacts.save({_key:"phone1", type: "phone", value: '333444222' });
db.contacts.save({_key:"phone2", type: "phone", value: '555222555' });
db.contacts.save({_key:"mail1", type: "email", value: 'bob#gmail.com'});
// And the groups
db.groups.save({_key:"admin"});
db.groups.save({_key:"user"});
db.groups.save({_key:"godmode"});
// Finally the relations
db.hasContact.save({"contacts/phone1", "Person/user", {label: "main office"});
db.hasContact.save({"contacts/phone2", "Person/user", {label: "main office"});
db.hasContact.save({"contacts/mail1", "Person/user", {label: "main office"});
db.isInGroup.save("groups/admin", "Person/user", {});
db.isInGroup.save("groups/godmode", "Person/user", {});
db.isInGroup.save("groups/user", "Person/user", {});
Now you can execute the following queries:
Fetch all admins:
RETURN NEIGHBORS(groups, isInGroup, "admin")
Get all users having a contact with value 333444222:
FOR x IN contacts FILTER x.value == "333444222" RETURN NEIGHBORS(contacts, hasContact, x)
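If you need to run these lookups from Python, a minimal sketch using ArangoDB's HTTP cursor API; the host, port, database and lack of authentication are assumptions for a default local setup:

import requests

CURSOR_URL = "http://localhost:8529/_db/_system/_api/cursor"  # assumed local server and database

query = """
    FOR x IN contacts
        FILTER x.value == @value
        RETURN NEIGHBORS(contacts, hasContact, x)
"""

# Add auth=(user, password) if authentication is enabled on your server.
response = requests.post(
    CURSOR_URL,
    json={"query": query, "bindVars": {"value": "333444222"}},
)
print(response.json()["result"])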
