How to use "map" data structure "in a query" - vespa

I want to include a map in a Vespa query (not as a document attribute) and look it up in a ranking expression, but I have some questions.
Can I use a map in a Vespa query?
If it's possible, how do I look up its elements in an expression?
If it's impossible, can I use a mapped tensor type instead?

To pass sparse values to a ranking expression that accesses them individually (such as an XGBoost or other GBDT model), pass them as individual query features: query(mykey) etc.
In an HTTP request:
ranking.features.query(mykey)=30.3
Or in Java code (in a Searcher):
query.getRanking().getFeatures().put("query(mykey)", String.valueOf(30.3));
You may also want to assign a default value to each query feature used in your model. See https://docs.vespa.ai/documentation/ranking.html#using-query-variables
(You would use a mapped tensor query feature instead of many scalar query features if your model computes on the map as a whole, e.g. by joining it with a document map; a sketch of that follows below.)
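If you do go the mapped tensor route, a minimal sketch could look like this (the feature name query(mymap), the document field mymap and the dimension name key are assumptions, not from the question). Declare the type of the query tensor in a query profile type referenced by the default query profile:

<query-profile-type id="root">
    <field name="ranking.features.query(mymap)" type="tensor(key{})" />
</query-profile-type>

Pass the tensor in the HTTP request:

ranking.features.query(mymap)={ {key:foo}:1.0, {key:bar}:2.0 }

And reference it in the rank profile, e.g. joined with a document tensor attribute of the same type:

first-phase {
    expression: sum(query(mymap) * attribute(mymap))
}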

There are examples of how to search map fields using the sameElement query operator here: https://docs.vespa.ai/documentation/reference/query-language-reference.html#sameelement. But if I understand you correctly, you want to pass a map as query input? If so, the question is whether you want to use this for recall (deciding which documents match) or purely as an input to ranking in a configured ranking expression. If the latter, you can use tensors, but tensors cannot be used for recall/search.
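For reference, a sameElement query against a map field could look roughly like this (the field name my_map and a map<string, string> type are assumptions):

select * from sources * where my_map contains sameElement(key contains 'foo', value contains 'bar')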

Related

Getting the same record on multiple pages when implementing pagination in Vespa

I am getting the same record on different pages when implementing pagination using group by.
I am using the query below:
http://<hostname>:<port>/search/?yql=select * from sources document_name where sddocname contains 'document_name' | all(group(key) max(2) each(each(output(summary()))));
Are you looking at the grouping results or the normal hits structure? Please note that the grouping expression will not in any way affect the normal hits returned.
You will probably want to add LIMIT 0 / hits=0 and only look at the results from the grouping expression.
You also need a (stable) ordering of the hits for pagination by continuations to work well. This is usually the case as in most use cases there will be a ranking expression in place.
The default ordering in grouping expressions is by rank - in grouping expression syntax this would be order(max(relevance())).
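Making that ordering explicit in the grouping expression from the question would look like this:

all(group(key) order(max(relevance())) max(2) each(each(output(summary()))))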
The query above only limits on document type. All documents of that document type will match this query equally well. I tested this using the "album-recommendation-selfhosted" sample app, and relevance was 0 for all documents. When the relevance is the same for all documents, the order will essentially be random. The same thing may occur when doing e.g. order(-count()) if count() is the same for several groups.
I was able to achieve the expected results by adding and using a rank profile that uses the random.match rank feature: https://docs.vespa.ai/documentation/reference/rank-features.html#random (a sketch is shown below).
I believe this should ensure a stable ordering of hits, although this may still produce different results if the query is dispatched to different (groups of) content hosts. If you need a stable global ordering, consider storing a random float/double in each document to rank/order by - this can also be used as a "tie breaker" to help ensure a stable order from ranking expressions.
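A minimal sketch of that approach, assuming a rank profile named random_order is added to the search definition and the grouping query from the question is reused with hits=0:

rank-profile random_order {
    first-phase {
        expression: random.match
    }
}

http://<hostname>:<port>/search/?hits=0&ranking=random_order&yql=select * from sources document_name where sddocname contains 'document_name' | all(group(key) max(2) each(each(output(summary()))));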

How do processed tokens get stored in the base index in Vespa?

While working with a search definition which looks like:
search music {
    document music {
        field title type string {
            indexing: summary | attribute | index
        }
    }
}
if I use my own custom logic for tokenizing the string by developing a document processor (I save the processed tokens in the context of the Processing), how do I store the tokens in the base index? And how are they mapped back to the original content of the field at recall time for a particular query? Do we solve this with a ProcessingEndPoint? If yes, how?
First, you should almost certainly drop "attribute" for this field - "attribute" means the text will be stored in a forward store in memory in addition to creating an index for searching. That may be useful for structured data for sorting, grouping and ranking, but not for a free-text field.
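With "attribute" dropped, the field definition would simply be:

field title type string {
    indexing: summary | index
}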
Unnecessary details (a simpler approach follows below):
You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing is stored as annotations over the text, which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html
The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing, are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization on the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
However, there is a much simpler way to do this: You can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html: Create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed both on the document processing and query side.
The reason for doing this is usually to provide linguistics for languages other than English. If you do this, please consider contributing your linguistics code back to Vespa.
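A minimal sketch of such a component (package and class names here are made up, and MyTokenizer stands in for your own Tokenizer implementation):

package com.example.linguistics;

import com.yahoo.language.process.Tokenizer;
import com.yahoo.language.simple.SimpleLinguistics;

// Register this as a component in services.xml; Vespa will then use it
// both during document processing/indexing and on the query side.
public class MyLinguistics extends SimpleLinguistics {

    private final Tokenizer tokenizer = new MyTokenizer();

    @Override
    public Tokenizer getTokenizer() {
        return tokenizer; // your custom tokenization logic
    }
}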

Google App Engine - keyword search + ordering on other properties

Say I have an entity that looks a bit like this:
class MyEntity(db.Model):
    keywords = db.StringListProperty()
    sortProp = db.FloatProperty()
I have a filter that does a keyword search by doing this:
query = MyEntity.all()\
    .filter('keywords >=', unicode(kWord))\
    .filter('keywords <', unicode(kWord) + u"\ufffd")\
    .order('keywords')
Which works great. The issue I'm running into is that if I try to put an order on that using 'sortProp':
.order('sortProp')
ordering has no effect. I realize why - the documentation specifically says this is not possible and that sort order is ignored when using equality filters with a multi-valued property (from the Google docs):
One important caveat is queries with both an equality filter and a
sort order on a multi-valued property. In those queries, the sort
order is disregarded. For single-valued properties, this is a simple
optimization. Every result would have the same value for the property,
so the results do not need to be sorted further. However, multi-valued
properties may have additional values. Since the sort order is
disregarded, the query results may be returned in a different order
than if the sort order were applied. (Restoring the dropped sort order
would be expensive and require extra indices, and this use case is
rare, so the query planner leaves it off.)
My question is: does anyone know of a good workaround for this? Is there a better way to do a keyword search that circumvents this limitation? I'd really like to combine using keywords with ordering for other properties. The only solution I can think of is sorting the list after the query, but if I do that I lose the ability to offset into the query and I may not even get the results with the highest sort order if the data set is large.
Thanks for your tips!
Workaround 1:
Apply a stemming algorithm to your keywords; then you won't need to do the prefix-comparison lookup.
Workaround 2:
Store all unique keywords in a separate entity group ("table"). From this group, find the keywords which match your criteria, then do a query with keywords IN [kw1, kw2, ...]. Make sure that the number of matching keywords is not too big; for example, you can select only the first 10.
Workaround 3:
Reorder the list of items on the application side.
Workaround 4:
Use IndexTank for full-text search, or apply for the "Trusted Tester Program" as mentioned by @proppy.
Instead of doing prefix matches, properly tokenize, stem and normalize your strings, and do equality comparisons on them.

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)
Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of programming (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.
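For the first option, the schema.xml change would be along these lines (field and type names are placeholders; the idea is to index the content of one field under a second name so MLT can match against it):

<field name="field_a_copy" type="text_general" indexed="true" stored="false"/>
<copyField source="field_a" dest="field_a_copy"/>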
I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794 amounts to using the B field example text to find documents that are similar in field A, and may be mimicking exactly the kind of query default MLT does on its usual model field.
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.
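A fuller request along those lines might look like this (the /mlt handler path and the extra parameters are assumptions, and stream.body would need URL encoding in practice):

http://localhost:8983/solr/mycore/mlt?stream.body=foo bar baz&mlt.fl=field_a&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details&fl=id,score&rows=10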

Developing custom facet calculations in SOLR

I'm looking into using Solr for a project where we have some specific faceting requirements. From what I've learned, Solr provides range-based facets, where Solr can provide facets of different value ranges or date ranges, i.e. field values are "grouped" and aggregated into different bins.
I would like to do something similar, but I want to create a custom function that maps field values to my specific facets, so that each field value is evaluated using a function to see which facet it belongs to.
myFacet = myFacetMapper(fieldValue)
It's sort of a more advanced version of range facets, but where values are mapped using a custom function rather than just into different bins.
Does anyone know if this is possible and where to start?
I would look into using SimpleFacets to implement your logic. Then you embed it inside a SearchComponent, which you can register in your solrconfig. Look at the code of FacetComponent for an example.
Create another field with value = myFacetMapper(field), then do normal faceting on that field.
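A sketch of that second approach using SolrJ (field names, the core URL and myFacetMapper are assumptions made up for illustration):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CustomFacetIndexer {

    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        // Precompute the custom facet bucket at index time and store it in its own field
        double fieldValue = 42.0;
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("my_value", fieldValue);
        doc.addField("my_custom_facet", myFacetMapper(fieldValue));
        client.add(doc);
        client.commit();

        // Then do normal field faceting on the precomputed bucket
        SolrQuery query = new SolrQuery("*:*");
        query.setFacet(true);
        query.addFacetField("my_custom_facet");
        System.out.println(client.query(query).getFacetField("my_custom_facet").getValues());

        client.close();
    }

    // Stand-in for the custom mapping function from the question
    static String myFacetMapper(double value) {
        return value < 10 ? "small" : value < 100 ? "medium" : "large";
    }
}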
