How to merge (union) multiple search documents in vespa.ai? - vespa

I have 10 search definitions in Vespa that all share exactly the same schema, and I want to merge (union) them all.
Example:
I have search_definition_1.sd
I have search_definition_2.sd
I have search_definition_3.sd
.
.
.
I have search_definition_10.sd
Now I need to search across all of the search definitions at once. Is it possible to union them all into a new search_definition_1_to_10.sd, or to search all of them at once?

Vespa will by default query all schemas (document types), so this is supported out of the box. For more fine-grained control, see https://docs.vespa.ai/documentation/federation.html
I did not get what you mean by 10 exactly similar schemas, though; there should be no need to have identical schemas.

Unless you use the restrict parameter to limit the query to specific document types, Vespa federates and searches all document types in a content cluster and merges the hits by relevance score. See the end of this document: https://docs.vespa.ai/documentation/schemas.html#querying-multiple-document-types and also https://docs.vespa.ai/documentation/reference/query-api-reference.html#model.restrict
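As a rough illustration of the difference between querying everything and using restrict, here is a minimal sketch that only builds the request URL. The host, the /search/ path, the YQL string and the document type names are assumptions chosen for the example, not something taken from the question's actual deployment.

```python
from urllib.parse import urlencode


def build_vespa_query_url(user_query, doc_types=None, host="http://localhost:8080"):
    """Build a Vespa query URL (host and path are assumed defaults)."""
    params = {
        # Search all sources; userQuery() picks up the 'query' parameter.
        "yql": "select * from sources * where userQuery()",
        "query": user_query,
    }
    if doc_types:
        # Only add 'restrict' when the search should be limited to some
        # document types; without it, all document types are searched.
        params["restrict"] = ",".join(doc_types)
    return f"{host}/search/?{urlencode(params)}"


# Searches every document type in the content cluster.
print(build_vespa_query_url("nikon"))
# Searches only two (hypothetical) document types.
print(build_vespa_query_url("nikon", ["search_definition_1", "search_definition_2"]))
```

Leaving restrict out is the "union" case from the question: no merged schema is needed.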

Related

No matches when mixing keywords

I am trying to do a product search setup using Solr. It does return results for keywords that follow the same order in the product name. However, when the keywords are mixed up, no results are returned. I would like to get results with scores that closely match the given keywords in any order.
My question on scoring has the schema, data configuration and query. Any help will be greatly appreciated.
As long as you enter your query as a regular query, instead of using wildcards, any hits in a text_general field as you've defined should be returned.
You can use the mm parameter to adjust how many of the supplied query terms need to match. I suggest using the edismax query parser, as that allows you to write more "natural" queries instead of having to add the field names in the query itself:
defType=edismax&qf=catchall&q=nikon dslr
defType=edismax&qf=catchall&q=dslr nikon
should both give the same set of documents (but possibly different scores when using phrase boosts).
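A small sketch of how those parameters could be assembled, with mm added explicitly. The catchall field comes from the example above; the mm value of 100% (all terms must match) is only an illustrative choice.

```python
from urllib.parse import urlencode


def edismax_params(keywords, mm="100%"):
    """Return a URL-encoded edismax parameter string for the given keywords."""
    return urlencode({
        "defType": "edismax",   # use the extended dismax parser
        "qf": "catchall",       # field(s) to search, as in the example above
        "q": keywords,          # plain keywords, no field names or wildcards
        "mm": mm,               # minimum-should-match; illustrative value
    })


# Keyword order does not change which parameters are sent, only the q value.
print(edismax_params("nikon dslr"))
print(edismax_params("dslr nikon"))
```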

Solr : Boost Results from a specific collection

We have a Solr index with multiple collections, i.e. collection_data_sales and collection_data_marketing. When the user performs a search query, both collections are queried through a collection alias. Both collections have the same Solr schema.
Is there a way to boost the results from a specific collection?
For example, suppose the user specifies the sales data collection: the search should still run on both collection_data_sales and collection_data_marketing, but documents from collection_data_sales should get a boost.
If you are able to tell the two collections apart using data in the documents, that is enough. Let's imagine that the schema has a field named type, so that documents in collection_data_marketing have type:marketing and documents in collection_data_sales have type:sales.
The only thing you have to do now is use a boost function, for example:
bf=sum(product(query($q1),10),product(query($q2),3))&q1=type:sales&q2=type:marketing
In this example sales will have weight 10 and marketing will have weight 3.
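For more than two collections, a boost function of this shape could be generated from a weight table. This is only a sketch: the type field and the weights follow the example above, and the helper name is made up for illustration.

```python
def build_boost_params(weights):
    """Build bf plus the referenced q1, q2, ... parameters from {type: weight}."""
    params = {}
    terms = []
    for i, (type_value, weight) in enumerate(weights.items(), start=1):
        name = f"q{i}"
        # Each sub-query selects the documents of one collection by type.
        params[name] = f"type:{type_value}"
        # Scale that sub-query's score by the collection's weight.
        terms.append(f"product(query(${name}),{weight})")
    # No spaces inside the function, to keep bf a single function string.
    params["bf"] = f"sum({','.join(terms)})"
    return params


print(build_boost_params({"sales": 10, "marketing": 3}))
```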

Does Lucene / Solr support hypernyms and hyponyms?

For example, houses are buildings, therefore when searching for 'buildings' Lucene would return matches for 'house' as well. This is not the same as synonyms, searching for 'house' shouldn't match 'building'.
You can simply construct a dictionary/hash table that maps each term to its hyponyms (its more specific terms) and write a query expansion module on top of it. To put it simply: (1) the user types, say, "Building" in the search box, (2) you send the query to your hash table, (3) you retrieve the hyponyms for Building, and (4) you expand the query to something like q=Building+House+Apartment+Villa. Because the table only maps general terms to specific ones, a search for "House" is not expanded to "Building".
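The steps above can be sketched as a plain dictionary lookup. The table contents and the function name are illustrative assumptions; a real table would be built from a thesaurus or curated by hand.

```python
# Hypothetical table: general term -> more specific terms.
EXPANSIONS = {
    "building": ["house", "apartment", "villa"],
}


def expand_query(term, expansions=EXPANSIONS):
    """Return the term followed by its more specific terms, space-separated."""
    # Lookup is one-directional: 'building' expands to 'house',
    # but 'house' has no entry and is returned unchanged.
    extra = expansions.get(term.lower(), [])
    return " ".join([term] + extra)


print(expand_query("Building"))  # Building house apartment villa
print(expand_query("house"))     # house
```

The expanded string can then be sent as the q parameter in place of the user's original term.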

Solr statistical information

Is it possible to get some kind of stats from Solr, e.g. the most frequently used words (unigrams) or phrases (bigrams, trigrams)?
Take a look at the schema browser (e.g. http://localhost:8983/solr/admin/schema.jsp), it gives you the top terms for any given field. You can also access this information with the LukeRequestHandler (e.g. http://localhost:8983/solr/admin/luke).
The TermsComponent also gives you information about indexed terms in a field and the number of documents that match each term.
The StatsComponent gives you statistics about numeric fields.

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)
Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of programming (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.
I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794 amounts to using the B field example text to find documents that are similar in field A, and may be mimicking exactly the kind of query default MLT does on its usual model field.
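A minimal sketch of that re-composition step, assuming the interestingTerms value has already been parsed from the response into a flat list of alternating "field:term" strings and boosts, as in the sample above. The function name is hypothetical.

```python
def recompose_terms(interesting_terms, source_field, target_field):
    """Turn a flat [term, boost, term, boost, ...] list into a boosted OR query."""
    clauses = []
    # Even positions hold "field:term", odd positions hold the boost.
    for term, boost in zip(interesting_terms[0::2], interesting_terms[1::2]):
        field, _, text = term.partition(":")
        if field != source_field:
            continue  # skip terms discovered in other fields
        # Same term and boost, but aimed at the target field.
        clauses.append(f"{target_field}:{text}^{boost}")
    return " ".join(clauses)


terms = ["field_b:foo", 5.0, "field_b:bar", 2.9085307, "field_b:baz", 1.67070794]
print(recompose_terms(terms, "field_b", "field_a"))
# field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794
```

The resulting string is meant to be sent as an ordinary query, with the default operator left as OR.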
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.
