How to perform exact nearest neighbors search in Vespa? - vespa

I have the following schema:
schema embeddings {
    document embeddings {
        field id type int {}
        field text_embedding type tensor<double>(d0[960]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
        }
    }
    rank-profile distance {
        num-threads-per-search: 1
        inputs {
            query(query_embedding) tensor<double>(d0[960])
        }
        first-phase {
            expression: distance(field, text_embedding)
        }
    }
}
and the following query body:
body = {
    'yql': 'select * from embeddings where ({approximate:false, targetHits:10} nearestNeighbor(text_embedding, query_embedding));',
    'hits': 10,
    'input': {
        'query(query_embedding)': [...],
    },
    'ranking': {
        'profile': 'distance',
    },
}
The thing is, this query returns different results depending on the targetHits parameter. For example, the top-1 distance for targetHits: 10 is 2.847000, while the top-1 distance for targetHits: 200 is 3.028079.
What's more, if I perform the same query using the Vespa CLI:
vespa query -t http://query "select * from embeddings where ([{\"targetHits\":10}] nearestNeighbor(text_embedding, query_embedding));" \
"approximate=false" \
"ranking.profile=distance" \
"ranking.features.query(query_embedding)=[...]"
I receive yet a third result:
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 10
        },
        "coverage": {
            "coverage": 100,
            "documents": 1000000,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:embeddings:embeddings::926288",
                "relevance": 0.8158006540357854,
                ...
where, as we can see, the top-1 distance is 0.8158.
So, how can I perform an exact (not approximate) nearest neighbor search whose results do not depend on any parameters?

Vespa sorts results by descending relevance score. When you use the distance rank-feature instead of closeness as the relevance score (your first-phase ranking expression), you end up inverting the order, so that more distant (worse) neighbors are ranked higher. As you increase targetHits you get even worse neighbors.
The correct query syntax for exact search is to set approximate:false:
select * from embeddings where ({approximate:false, targetHits:10} nearestNeighbor(text_embedding, query_embedding));
But you want to use closeness(field, text_embedding) in your first-phase ranking expression.
From https://docs.vespa.ai/en/nearest-neighbor-search.html
The closeness(field, image_embedding) is a rank-feature calculated by the nearestNeighbor query operator. The closeness(field, tensor) rank feature calculates a score in the range [0, 1], where 0 is infinite distance, and 1 is zero distance. This is convenient because Vespa sorts hits by decreasing relevancy score, and one usually want the closest hits to be ranked highest.
The first-phase is part of Vespa’s phased ranking support. In this example the closeness feature is re-used and documents are not re-ordered.
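The inversion is easy to see in a small sketch. For the euclidean metric the docs define closeness as 1 / (1 + distance); the distances below are made-up values:

```python
# Made-up distances for three hits; smaller distance = better neighbor.
hits = {"doc_a": 0.226, "doc_b": 1.5, "doc_c": 3.0}

# relevance = distance(field, text_embedding): Vespa sorts by descending
# relevance, so the farthest (worst) hit comes out on top.
by_distance = sorted(hits, key=hits.get, reverse=True)

# relevance = closeness(field, text_embedding) = 1 / (1 + distance):
# descending order now puts the nearest hit first.
by_closeness = sorted(hits, key=lambda h: 1 / (1 + hits[h]), reverse=True)

print(by_distance[0])   # worst neighbor ranked first
print(by_closeness[0])  # best neighbor ranked first
```

This is exactly why raising targetHits made the top-1 "score" worse: more candidates means even more distant neighbors at the top when distance itself is the relevance score.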

Related

How to show all bins of histogram in selected range in Google Looker Studio?

Here is a post showing how to plot a histogram in Google Looker Studio. The solution in that post works well when the data is almost "continuous", but some bins are not shown when there is a big gap in the data. For example, if my raw data is [1,2,3,4,5,6,7,8,9,1000], the bins between 10 and 999 will not be shown. How can I show those bins with a zero y-value in Google Looker Studio?
My data source is a CSV file. A general solution is better.
There is no way to force zero-count bars with the standard charts in Looker Studio. A solution is to use a custom viz, such as the Vega plugin.
Then use the following Vega-Lite code:
{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "mark": {"type": "bar", "tooltip": true},
  "encoding": {
    "x": {
      "bin": {"binned": true, "step": 1},
      "field": "$dimension0"
    },
    "y": {
      "aggregate": "sum",
      "field": "$metric0"
    }
  }
}
The step under bin sets the width of a bar.
To build a non-linear range, you need to create an extra calculated field:
CASE
  WHEN Age <= 5 THEN FLOOR(Age/1) * 1
  WHEN Age <= 20 THEN FLOOR(Age/5) * 5
  WHEN Age <= 100 THEN FLOOR(Age/10) * 10
  WHEN Age <= 1000 THEN FLOOR(Age/100) * 100
  ELSE -1
END
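For reference, here is the same non-linear binning logic as a small Python sketch (the thresholds and widths mirror the CASE expression above):

```python
import math

def bin_start(age):
    """Return the lower edge of the non-linear bin containing `age`.

    Bin width grows with the value: 1 up to 5, 5 up to 20,
    10 up to 100, 100 up to 1000; anything larger falls in -1.
    """
    if age <= 5:
        return math.floor(age / 1) * 1
    if age <= 20:
        return math.floor(age / 5) * 5
    if age <= 100:
        return math.floor(age / 10) * 10
    if age <= 1000:
        return math.floor(age / 100) * 100
    return -1

print(bin_start(17))   # 15
print(bin_start(250))  # 200
```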

FQL Fauna Function - Query Indexed Document Data Given Conditions

I have a collection of shifts for employees, data (trimmed out some details, but this is the structure for start/end times) looks like this:
{
"ref": Ref(Collection("shifts"), "123451234512345123"),
"ts": 1234567891012345,
"data": {
"id": 1,
"start": {
"time": 1659279600000
},
"end": {
"time": 1659283200000
},
"location": "12341234-abcd-1234-cdef-123412341234"
}
}
I have an index, shifts_by_location, that returns an array of shifts in this format: ["id", "startTime", "endTime"] ...
Now I want to create a user-defined function that filters these results so the "start" and "end" times fall between given dayStart and dayEnd times, to get shifts by date. Hoping to get some FQL assistance here, thanks!
Here's my broken attempt:
Query(
  Lambda(
    ["location_id", "dayStart", "dayEnd"], // example: ["124-abd-134", 165996000, 165922000]
    Map(
      Paginate(Match(Index("shifts_by_location"), Var("location_id"))),
      Lambda(
        ["id", "startTime", "endTime"],
        If(
          And(
            GTE(Var("startTime"), Var("dayStart")), // GOAL -> shift starts after 8am on given day
            LTE(Var("endTime"), Var("dayEnd"))      // GOAL -> shift ends before 5pm on given day
          ),
          Get(Var("shift")) // GOAL -> return shift for given day
        )
      )
    )
  )
)
I found a working solution with this query; the biggest fix was simply to use Filter instead of Map, which seems obvious in hindsight:
Query(
  Lambda(
    ["location_id", "dayStart", "dayEnd"],
    Filter(
      Paginate(Match(Index("shifts_by_location"), Var("location_id"))),
      Lambda(
        ["start", "end", "id"],
        And(GTE(Var("start"), Var("dayStart")), LTE(Var("end"), Var("dayEnd")))
      )
    )
  )
)
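The equivalent of that Filter predicate in plain Python, assuming each index entry is a (start, end, id) tuple as declared in the Lambda binding:

```python
def shifts_in_window(entries, day_start, day_end):
    # Keep only shifts that start at or after day_start and end at or
    # before day_end, mirroring the And(GTE(...), LTE(...)) predicate.
    return [e for e in entries if e[0] >= day_start and e[1] <= day_end]

entries = [
    (1659279600000, 1659283200000, 1),  # inside the window
    (1659179600000, 1659183200000, 2),  # before the window
]
print(shifts_in_window(entries, 1659279600000, 1659283200000))
```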

ElasticSearch order based on type of hit

I started using Elasticsearch in my ReactJS project, but I'm having some trouble with it.
When I search, I'd like my results ordered based on this table:
                Full-hit  Start-hit  Sub-hit  Fuzzy-hit
category            1         3         5        10
trade name          2         4         6        11
official name       7         8         9        12
The definitions (the way I see them, unless I'm wrong) are as follows:
Full-hit
examples:
Term "John" has a full-hit on "John doe"
Term "John" doesn't have a full-hit on "JohnDoe"
Start-hit
examples:
Term "John" has a start-hit on "Johndoe"
Term "Doe" doesn't have a start-hit on "JohnDoe"
sub-hit
examples:
Term "baker" has a sub-hit on "breadbakeries"
Term "baker" doesn't have a sub-hit on "de backer"
fuzzy-hit
From my understanding, a fuzzy-hit is when the queried word has one mistake or one letter missing.
examples:
Term "bakker" has a fuzzy-hit on "baker"
Term "bakker" doesn't have a fuzzy-hit on "bakers"
I found out that we can boost fields like this
fields = [
`category^3`,
`name^2`,
`official name^1`,
];
But that is not based on the full-, start-, sub-, or fuzzy-hit
Is this doable in ReactJS with Elasticsearch?
Let me make sure I understand your problem.
In a nutshell:
1. If a full-hit is found in the category field, boost it by 1.
2. If a full-hit is found in the official_name field, boost it by 7.
...and so on for all 12 possibilities?
If this is what you want, you are going to need 12 separate queries, all covered under one giant bool -> should clause.
I won't write out the query for you, but I will give you some pointers on how to structure the 4 subtypes of queries.
Full Hit
{
  "term": { "category/trade_name/official_name": { "value": "the_search_term" } }
}
Start-hit
{
  "match_phrase_prefix": { "category/trade_name/official_name": "the search term" }
}
Sub-hit
{
  "regexp": {
    "category/trade_name/official_name": { "value": ".*term.*" }
  }
}
Fuzzy
{
  "fuzzy": {
    "category/trade_name/official_name": { "value": "term" }
  }
}
You will need one giant bool:
{
  "query": {
    "bool": {
      "should": [
        // category field queries, 4 clauses in total
        // trade_name field queries, 4 clauses
        // official_name field queries, 4 clauses
      ]
    }
  }
}
To each clause, assign a boost as per your table.
That's it.
HTH.
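To avoid hand-writing 12 clauses, the bool query can be assembled programmatically. A sketch in Python; the field names and the rank-to-boost mapping (13 - rank, so rank 1 gets the biggest boost) are assumptions for illustration, not something Elasticsearch prescribes:

```python
# Ranks from the table above: 1 is the best match type, 12 the worst.
RANKS = {
    "category":      {"term": 1, "match_phrase_prefix": 3, "regexp": 5, "fuzzy": 10},
    "trade_name":    {"term": 2, "match_phrase_prefix": 4, "regexp": 6, "fuzzy": 11},
    "official_name": {"term": 7, "match_phrase_prefix": 8, "regexp": 9, "fuzzy": 12},
}

def build_query(term):
    should = []
    for field, kinds in RANKS.items():
        for kind, rank in kinds.items():
            # regexp needs the wrapped pattern for a sub-hit.
            value = f".*{term}.*" if kind == "regexp" else term
            # match_phrase_prefix takes "query", the others take "value".
            param = "query" if kind == "match_phrase_prefix" else "value"
            should.append({kind: {field: {param: value, "boost": 13 - rank}}})
    return {"query": {"bool": {"should": should}}}

q = build_query("john")
print(len(q["query"]["bool"]["should"]))  # 12
```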

I would like to search by filtering by date in mongodb

I would like to get the total count of names that do not have "SENT" as their status, and I would also like to filter that count by a start date and an end date. At the moment only the first part works; the filtering by date does not. My field is called "date".
Sorry, my code did not indent ):
db.name.aggregate([
  { $group: { _id: "$id", sent: { $max: { $cond: { if: { $eq: ["$status", "SENT"] }, then: 1, else: 0 } } } } },
  { $match: { sent: 0 } },
  { $count: "total" }
])
You can rewrite your query to add the $match as the first stage, including the date and status filters, followed by a $count stage to count the matched documents.
Something like:
db.name.aggregate([
  { "$match": { "status": { "$ne": "SENT" }, "date": { "$gte": input date1, "$lte": input date2 } } },
  { "$count": "total" }
])
You don't need aggregation for this request. You can use a regular query:
db.name.count({ "status": { "$ne": "SENT" }, "date": { "$gte": input date1, "$lte": input date2 } })
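What either query computes, spelled out in Python on plain dicts (field names from the question; the date bounds are placeholder numbers):

```python
def total_not_sent(docs, date1, date2):
    # Count documents whose status is not "SENT" and whose date
    # falls inside the inclusive [date1, date2] window.
    return sum(
        1 for d in docs
        if d["status"] != "SENT" and date1 <= d["date"] <= date2
    )

docs = [
    {"status": "PENDING", "date": 5},
    {"status": "SENT",    "date": 6},
    {"status": "FAILED",  "date": 20},  # outside the [1, 10] window
]
print(total_not_sent(docs, 1, 10))  # 1
```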

lucene solr - how to know numCount of each word in query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and get the match count of each word from each query.
But this method takes too much time.
So, is there a way to get the numCount of each word with one query?
Any help appreciated!
In this case, function queries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example syntax: termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of totaltermfreq. Example syntax: ttf(text,'memory')
The following query, for instance (URL-decoded for readability):
q=*:*&fl=cntOnSummary:termfreq(summary,'hello') cntOnTitle:termfreq(title,'entry') cntOnSource:termfreq(source,'activities')&wt=json&indent=true
returns the following results:
"docs": [
  {
    "id": ["id-1"],
    "source": ["activities", "activities"],
    "title": "Ajones3 Activity Entry 1",
    "summary": "hello hello",
    "cntOnSummary": 2,
    "cntOnTitle": 1,
    "cntOnSource": 1,
    "score": 1
  },
  {
    "id": ["id-2"],
    "source": ["activities", "activities"],
    "title": "Common activity",
    "cntOnSummary": 0,
    "cntOnTitle": 0,
    "cntOnSource": 1,
    "score": 1
  }
]
Please notice that while this works well on single-value fields, for multivalued fields the functions seem to consider just the first entry; for instance, in the example above, termfreq(source,'activities') returns 1 instead of 2.