How processed tokens get stored in base index in Vespa?

How processed tokens get stored in base index in Vespa? - vespa

While working with search definition which looks like
search music{
document music{
field title type string {
indexing: summary | attribute | index
}
}
}
if I use my custom logic of tokenizing string by developing document processor (I save processed tokens in context of Processing), how to store tokens in the base index? and how they are mapped back to the original content of the field, while recall for a particular query? Do we solve it by ProcessingEndPoint? If yes, how?

First, you should almost certainly drop "attribute" for this field - "attribute" means the text will be stored in a forward store in memory in addition to creating an index for searching. That may be useful for structured data for sorting, grouping and ranking, but not for a free-text field.
Unnecessary details:
You can perform your own document processing by adding document processor components: http://docs.vespa.ai/documentation/docproc-development.html. Token information for indexing are stored as annotations over the text which are consumed by the indexer: http://docs.vespa.ai/documentation/annotations.html
The code doing this in Vespa (called by a document processor) is https://github.com/vespa-engine/vespa/blob/master/indexinglanguage/src/main/java/com/yahoo/vespa/indexinglanguage/linguistics/LinguisticsAnnotator.java, and the annotations it adds, which are consumed during indexing are https://github.com/vespa-engine/vespa/blob/master/document/src/main/java/com/yahoo/document/annotation/AnnotationTypes.java. You'd also need to do the same tokenization at the query side, in a Searcher: http://docs.vespa.ai/documentation/searcher-development.html
However, there is a much simpler way to do this: You can plug in your own tokenizer as described here: http://docs.vespa.ai/documentation/linguistics.html: Create your own component subclassing SimpleLinguistics and override getTokenizer to return your implementation. This will be executed by Vespa as needed both on the document processing and query side.
The reason for doing this is usually to provide linguistics for other languages than english. If you do this, please consider providing your linguistics code back to Vespa.

Related

Implement morphological search using Solr

I am trying to implement morphological search using Solr.
Here's a quick intro to morpholgical search:
It means that the search algorithm considers all grammar forms of words when creating the search index and searching for the requested phrases.
For example, when indexing the word child, the system adds both child and children to the index. Similar rule applies to verbs: for bring, the system adds bringing, brought etc. Consequently, if a user searches for a phrase "children bring", the system will display all results containing child, children, bring, bringing, brought etc.
Here are my two options:
1) Lemmatize each token and use that at index time as well as do the same with the query string at search time.
I DON't WANT to use this approach since this would make my index inconsistent when I start supporting morphpological search, since the previous documents will lack the lemma tokens. I don't want to reindex either.
2) Only at query time, find all variants of the lemma (eg: lemma of 'brought' is 'bring')and generate these as additional tokens through my Token Filter. This would serve a morphological search without having to index/reindex anything.
Question:
Are there any good Java libraries which would give me variants/inflections of a lemma (or the root word. eg: lemma of 'brought' is 'bring') ?

Something near to your requirement is using solr synonym dictionary and synonym filter.There you can add base word like child and add variants like kid,children,baby.
Collection reload would be required after editing dictionary each time. And search would be performed on every variant of child if "kid" is searched.

Is there a way to get a list of all the words in the solr spellcheck index?

I'm using Solr's SpellCheckComponent with IndexBasedSpellChecker. Wondering if there's a way to get an output of all the words in the dictionary.
Might help us catch some of the misspellings on our site.

yes, there is. IndexBasedSpellChecker, according to the doc: "The IndexBasedSpellChecker uses a Solr index as the basis for a parallel index used for spell checking. It requires defining a field as the basis for the index terms "
So it just uses one field you choose from the index. To enumerate all terms on a field you use the Terms component and you set terms.fl to that field. If you have lots of terms, you could play do some scrolling with terms.lower, terms.limit and terms.upper to get the info in multiple calls.

Relevance feedback in Apache Solr

I would like to implement relevance feedback in Solr. Solr already has a More Like This feature: Given a single document, return a set of similar documents ranked by similarity to the single input document. Is it possible to configure Solr's More Like This feature to behave like More Like Those? In other words: Given a set of documents, return a list of documents similar to the input set (ranked by similarity).
According to the answer to this question turning Solr's More Like This into More Like Those can be done in the following way:
Take the url of the result set of the query returning the specified documents. For example, the url http://solrServer:8983/solr/select?q=id:1%20id:2%20id:3 returns the response to the query id:1 id:2 id:3 which is practically the concatenation of documents 1, 2, 3.
Put the above url (concatenation of the specified documents) in the url.stream GET parameter of the More Like This handler: http://solrServer:8983/solr/mlt?mlt.fl=text&mlt.mintf=0&stream.url=http://solrServer:8983/solr/select%3Fq=id:1%20id:2%20id:3. Now the More Like This handler treats the concatenation of documents 1, 2 and 3 as a single input document and returns a ranked set of documents similar to the concatenation.
This is a pretty bad implementation: Treating the set of input documents like one big document discriminates against short documents because short documents occupy a small portion of the entire big document.
Solr's More Like This feature is implemented by a variation of The Rocchio Algorithm: It takes the top 20 terms of the (single) input document (the terms with the highest TF-IDF values) and uses those terms as the modified query, boosted according to their TF-IDF. I am looking for a way to configure Solr's More Like This feature to take multiple documents as its input, extract the top n terms from each input document and query the index with those terms boosted according to their TF-IDF.
Is it possible to configure More Like This to behave that way? If not, what is the best way to implement relevance feedback in Solr?

Unfortunately, it is not possible to configure the MLT handler that way.
One way to do it would be to implement a custom SearchComponent and register it to a (dedicated) SearchHadler.
I've already done something similar and it is quite easy if you look a the original implementation of MLT component.
The most difficult part is the synchronization of the results from different shard servers, but it can be skipped if you do not use shards.
I would also strongly recommend to use your own parameters in your implementation to prevent collisions with other components.

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important as the hardware I run the SOLR on is not too powerful and trying to boost on query time returns with OutOfMemory Exception. (Even If I could work around that increasing memory for java I prefer to be on the safe side and implement the index the most efficient way possible.)

Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.

You can check upon DIH Special Commands which has a $docBoost command
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems no $fieldBoost Command.
For you case though, if you are using DefaultSimilarity, shorter fields are boosted higher then longer fields in the Score calculation.
You can surely implement your own Simiarity class with a changed TF (Term Frequency) and LengthNorm Calculation as your needs.

Is it possible to have SOLR MoreLikeThis use different fields for model and matches?

Let's say I have documents with two fields, A and B.
I'd like to use SOLR's MoreLikeThis, but with a twist: I'm most interested in boosting documents whose A field is like my model document's B field. (That is, extract MLT's 'interesting terms' from the model B field, but only collect MLT results based on the A field.)
I don't see a way to use the mlt.fl fields or mlt.qf boosts to achieve this effect in a single query. (It seems mlt.fl specifies fields used for both discovery of 'interesting terms' and matching to those terms.) Am I missing some option?
Or will I have to extract the 'interesting terms' myself and swap the 'field:term' details?
(Other ideas in this same vein appreciated as well.)

Two options I see are:
Use a copyField - index your original document with a copy of field A named B, and then query using B.
Extend MoreLikeThisHandler and change the fields you query.
The first option costs a bit of programming (mostly configuration changes) and some memory consumption. The second involves more programming but no memory footprint increase. Hope one of them suits your needs.

I now think there are two ways to achieve the desired effect (without customizing the MLT source code).
First option: Do an initial MLT query with the MLT handler, adding the parameter &mlt.interestingTerms=details. This includes the list of terms that were deemed interesting, ranked with their relative boosts. The usual behavior uses those discovered terms against the same mlt.fl fields to find similar documents. For example, the response will include something like:
"interestingTerms":
["field_b:foo",5.0,"field_b:bar",2.9085307,"field_b:baz",1.67070794]
(Since the only thing about this initial query that's interesting is the interestingTerms, throwing in an fq that rules out all docs could help it skip unnecessary scoring work.)
Explicitly re-composing that interestingTerms info into a new OR query field_a:foo^5.0 field_a:bar^2.9085307 field_a:baz^1.67070794 amounts to using the B field example text to find documents that are similar in field A, and may be mimicking exactly the kind of query default MLT does on its usual model field.
Second option: Grab the model document's actual field B text, and feed it directly as a ContentStream body, to be used in lieu of a query, for specifying the model document. Then target mlt.fl at field A for the sake of collecting similar results. For example, a fragment of the parameters might be …&stream.body=foo bar baz&mlt.fl=field_a&…. Again, the net effect being that model text originally from field_b is finding documents similar only in field_a.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight