How does filtered document list lookup work in nearest neighbour search pre-filtering? - Vespa

In pre-filter based ANN, once Vespa has the list of documents that pass the pre-filter, it starts the HNSW algorithm to find the nearest neighbours. In the HNSW algorithm, Vespa starts from a node and looks for neighbours that are present in the pre-filter list. How is the lookup of a neighbour in that document list implemented in Vespa? Is it a linear search or hashing?

This blog post has an excellent overview of how Vespa combines filters with HNSW search: https://blog.vespa.ai/constrained-approximate-nearest-neighbor-search/.
In the pre-filter case, the resulting "allow" document list is a bitvector, and the lookup is O(1).
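For intuition, here is a minimal Python sketch of the idea (not Vespa's actual implementation, which is C++ inside the engine): the pre-filter result is materialized as a bitvector indexed by local document id, so checking whether a neighbour candidate passes the filter is a single bit test.

# Minimal sketch (not Vespa's C++ implementation) of how an "allow" bitvector
# makes the filter check during HNSW neighbour expansion O(1) per neighbour.
# The graph structure and document ids here are made up for illustration.

class AllowList:
    """Bitvector over local doc ids; one bit per document."""

    def __init__(self, num_docs):
        self.bits = bytearray((num_docs + 7) // 8)

    def allow(self, doc_id):
        self.bits[doc_id >> 3] |= 1 << (doc_id & 7)

    def is_allowed(self, doc_id):          # O(1): one index + one mask
        return bool(self.bits[doc_id >> 3] & (1 << (doc_id & 7)))


def expand_neighbours(graph, node, allow_list):
    """During HNSW search, only neighbours whose bit is set become candidates."""
    return [n for n in graph[node] if allow_list.is_allowed(n)]


# Toy usage: docs 2 and 5 pass the pre-filter.
graph = {0: [1, 2, 5], 1: [0, 3]}
allowed = AllowList(num_docs=8)
for d in (2, 5):
    allowed.allow(d)
print(expand_neighbours(graph, 0, allowed))   # -> [2, 5]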

Related

Why is bidirectional graph search complete only when BFS is used?

In every article I found, it seems to say that bidirectional search is complete only when BFS is used in both directions. I do not really understand that, because there are many more "complete" search algorithms. For example, if one of the directions used IDS (iterative deepening search) or A* instead of BFS, would it not be complete?
So, my main question is: what is the basis of the phrase "only when BFS is used in both directions is bidirectional search complete"? And what are the true criteria for the completeness of a search algorithm like that?
Thanks.
I thought about running bidirectional graph search code in Python to determine whether those variants are complete, but I do not know whether it would work for every example, so doing that would be somewhat inaccurate.
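For reference, here is a minimal sketch of plain bidirectional search with BFS on both sides, on an undirected graph; the graph and node names are made up. Both frontiers expand layer by layer, which is what the usual completeness argument for BFS is based on.

# A minimal sketch of bidirectional search using BFS in both directions,
# on an undirected graph given as an adjacency dict. Node names are made up.
from collections import deque

def bidirectional_bfs(graph, start, goal):
    if start == goal:
        return True
    frontier_s, frontier_g = deque([start]), deque([goal])
    seen_s, seen_g = {start}, {goal}
    while frontier_s and frontier_g:
        # Expand one full layer from the current side.
        for _ in range(len(frontier_s)):
            node = frontier_s.popleft()
            for nbr in graph.get(node, []):
                if nbr in seen_g:          # the two frontiers met: a path exists
                    return True
                if nbr not in seen_s:
                    seen_s.add(nbr)
                    frontier_s.append(nbr)
        # Swap roles so the other side expands its next layer.
        frontier_s, frontier_g = frontier_g, frontier_s
        seen_s, seen_g = seen_g, seen_s
    return False

graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(bidirectional_bfs(graph, "A", "D"))   # -> True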

Dgraph: Deep graph traversal possible with recurse?

I have a couple of questions regarding the capabilities of Dgraph regarding graph traversal.
Let's say we have a dataset that consists of nodes of the type post. Each post can have n posts that are replies to this post. The depth of this tree is not limited.
Is it possible with Dgraph to search through all leaf nodes starting from one starting node and return all leaves that fulfill a certain condition?
Is it possible to set a depth limit to not end up with a gigantic dataset?
Is it also possible to find the children of all parent nodes that fulfill a certain condition?
And finally: Are edges in Dgraph directed? And can I include that in the query?
Author of Dgraph here.
Is it possible with Dgraph to search through all leaf nodes starting from one starting node and return all leaves that fulfill a certain condition?
Yes. You could use the recurse directive (https://docs.dgraph.io/query-language/#recurse-query).
Is it possible to set a depth limit to not end up with a gigantic dataset?
Yes. Recursion supports a maximum depth.
Is it also possible to find the children of all parent nodes that fulfill a certain condition?
Yes. You can traverse an edge, and put a filter on it. https://docs.dgraph.io/query-language/#applying-filters
And finally: Are edges in Dgraph directed? And can I include that in the query?
Edges in Dgraph are directed. However, Dgraph also supports a "reverse" index, which can be used to automatically generate the edges in the reverse direction. You can then traverse these reverse edges by adding a tilde (~) in front of the predicate name.
https://docs.dgraph.io/query-language/#reverse-edges
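For illustration, here is a sketch of such a recursive traversal run through the official pydgraph client. The predicate name reply, the root uid, and the connection address are assumptions, so adjust them to your own schema. A @filter on the root function or on the recursed block restricts which nodes are returned, and ~reply (given an @reverse index on the predicate) traverses the edge in the opposite direction.

# Sketch of a recursive traversal with the @recurse directive via pydgraph.
# The predicate "reply", the root uid 0x1 and the address are hypothetical.
import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

query = """
{
  thread(func: uid(0x1)) @recurse(depth: 5) {
    uid
    text
    reply        # forward edge: a post -> its replies; depth caps the recursion
  }
}
"""

txn = client.txn(read_only=True)
try:
    resp = txn.query(query)
    print(json.loads(resp.json))
finally:
    txn.discard()

stub.close()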

Sorting in Beam Search

Although I have a good understanding of beam search, I have a question about it. When we select the n best paths, should we sort them, or should we simply keep them in the order in which they exist and just discard the other, more expensive nodes?
I searched a lot about this, but everywhere it only says "keep the best"; I found nothing about whether we should sort them or not.
I think that we should sort them, because by sorting we will reach the goal node more quickly. But I want confirmation of this sorting idea, and I have not found it so far.
I will be thankful if you can help me improve my understanding.
When we select n best paths should we sort them or simply we should keep them in the order in which they exist and just discard other expensive nodes?
We just sort them and keep the top k.
At each step after the initialization, you sort the beam_size * vocabulary_size hypotheses and choose the top k. For each of the beam_size * vocabulary_size hypotheses, its weight/probability is the product of all probabilities along its history, normalized by the length (length normalization).
One problem arises from the fact that the completed hypotheses may have different lengths. Because models generally assign lower probabilities to longer strings, a naive algorithm would also choose shorter strings for y. This was not an issue during the earlier steps of decoding; due to the breadth-first nature of beam search, all the hypotheses being compared had the same length. The usual solution to this is to apply some form of length normalization to each of the hypotheses, for example simply dividing the negative log probability by the number of words: score(y) = -(1/T) Σ_{i=1..T} log P(y_i | y_1, …, y_{i−1}, x).
For more information please refer to this answer.
Reference:
https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
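To make the sorting step concrete, here is a minimal sketch of a single beam-search step; the toy expand() function and its two-token vocabulary are made up, and the score is the length-normalized sum of log probabilities described above.

# Minimal sketch of one beam-search step: expand every hypothesis in the beam,
# score each extension by its length-normalized sum of log probabilities,
# sort all candidates, and keep only the top beam_size. The toy expand()
# function and vocabulary are made up for illustration.
import math

def step(beam, expand, beam_size):
    """beam: list of (tokens, sum_logprob). expand(tokens) -> [(token, prob), ...]."""
    candidates = []
    for tokens, sum_logprob in beam:
        for token, prob in expand(tokens):
            new_tokens = tokens + [token]
            new_sum = sum_logprob + math.log(prob)
            score = new_sum / len(new_tokens)      # length normalization
            candidates.append((score, new_tokens, new_sum))
    # Sort the beam_size * vocabulary_size candidates and keep the top k.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(tokens, s) for _, tokens, s in candidates[:beam_size]]

# Toy model: every hypothesis can be extended by "a" (p=0.6) or "b" (p=0.4).
expand = lambda tokens: [("a", 0.6), ("b", 0.4)]
beam = [([], 0.0)]
for _ in range(3):
    beam = step(beam, expand, beam_size=2)
print(beam)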
Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number of best states at each level (called the beam width). Only those states are expanded next. The greater the beam width, the fewer states are pruned. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search.
NOTE: I got this information from Wikipedia during my search; maybe it is helpful.

Forward Index vs Inverted index Why?

I was reading about inverted indexes (used by text search engines like Solr, Elasticsearch, etc.) and, as I understand it (if we take "Person" as an example):
The attribute to Person relationship is inverted:
John -> PersonId(1), PersonId(2), PersonId(3)
London -> PersonId(1), PersonId(2), PersonId(5)
I can now search the person records for 'John who lives in London'
Doesn't this solve all the problems? Why do we have the forward (or regular database) index at all? In other words, in what cases is regular indexing useful? Please explain. Thanks.
The point that you're missing is that there is no real technical distinction between a forward index and an inverted index. "Forward" and "inverted" in this case are just descriptive terms to distinguish between:
A list of words contained in a document.
A list of documents containing a word.
The concept of an inverted index only makes sense if the concept of a regular (forward) index already exists. In the context of a search engine, a forward index would be the term vector; a list of terms contained within a particular document. The inverted index would be a list of documents containing a given term.
When you understand that the terms "forward" and "inverted" are really just relative terms used to describe the nature of the index you're talking about - and that really an index is just an index - your question doesn't really make sense any more.
Here's an explanation of inverted index, from Elasticsearch:
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
Inverted indexing is for fast full-text search. Regular (forward) indexing is less efficient for queries, because the engine has to look through all entries for a term, but it is very fast at indexing time!
You can say this:
Forward index: fast indexing, less efficient queries
Inverted index: fast queries, slower indexing
But it's always context related. If you compare it with MySQL: MyISAM has fast reads, InnoDB has fast inserts/updates and slower reads.
Read more here: https://www.found.no/foundation/indexing-for-beginners-part3/
In a forward index, the input is a document and the output is the words contained in that document.
{
doc1: [word1, word2, word3],
doc2: [word4, word5]
}
In the reverse/inverted index, the input is a word, and the output is all the documents in which that word is contained.
{
word1: [doc1, doc10, doc3],
word2: [doc5, doc3]
}
Search engines make use of reverse/inverted index to get us documents from keywords.
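As a small illustration of how the two views relate, here is a sketch that derives the inverted index from a forward index; the documents and words are made up.

# Build an inverted index (word -> documents) from a forward index
# (document -> words). The documents and words are made up for illustration.
from collections import defaultdict

forward_index = {
    "doc1": ["word1", "word2", "word3"],
    "doc2": ["word4", "word5"],
    "doc3": ["word1", "word2"],
}

inverted_index = defaultdict(set)
for doc, words in forward_index.items():
    for word in words:
        inverted_index[word].add(doc)

# Query: documents containing both "word1" and "word2".
print(inverted_index["word1"] & inverted_index["word2"])   # -> {'doc1', 'doc3'}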

SOLR: size of term dictionary and how to prune it

Two related questions:
Q1. I would like to find out the term dictionary size (in number of terms) of a core.
One thing I do know how to do is to list the file size of *.tim. For example:
> du -ch *.tim | tail -1
1,3G total
But how can I convert this to a number of terms? Even a rough estimate would suffice.
Q2. A typical technique in search is to "prune" the index by removing all rare (very low frequency) terms. The objective is not to reduce the size of the index, but the size of the actual term dictionary. What would be the simplest way to do this in Solr, or programmatically in SolrJ?
More exactly: I wish to eliminate these terms (tokens) from an existing index (the term dictionary and all the other places in the index). The result should be similar to 1) adding the terms to a stop word list, 2) re-indexing the entire collection, and 3) removing the terms from the stop word list.
You can get this information on the Schema Browser page by clicking "Load Term Info", in the Luke admin handler (https://wiki.apache.org/solr/LukeRequestHandler), and also in the Stats component (https://cwiki.apache.org/confluence/display/solr/The+Stats+Component).
To prune the index, you could do a facet on the field and get the terms with low frequency. Then get the docs and update each document without this term (this could be difficult, because it depends on the analyzers and tokenizers of your field). Alternatively, you can use the Lucene libraries to open the index and do it programmatically.
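As a sketch of the first step (finding the rare terms), the Terms component can list the terms of a field together with their document frequency. This assumes the default /terms handler is enabled; the collection name mycollection and the field text are hypothetical.

# Sketch: list rare terms of a field via Solr's Terms component, as a starting
# point for deciding what to prune. Assumes the default /terms handler is
# enabled; the collection "mycollection" and field "text" are hypothetical.
import requests

SOLR = "http://localhost:8983/solr/mycollection"

params = {
    "terms.fl": "text",        # field whose term dictionary we inspect
    "terms.mincount": 1,
    "terms.maxcount": 2,       # only terms appearing in at most 2 documents
    "terms.limit": 1000,
    "terms.sort": "index",
    "wt": "json",
    "json.nl": "map",          # return {term: docFreq} instead of a flat list
}

resp = requests.get(f"{SOLR}/terms", params=params)
rare_terms = resp.json()["terms"]["text"]
for term, doc_freq in rare_terms.items():
    print(term, doc_freq)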
You can check the number and distribution of your terms in the Admin UI under the collection's Schema Browser screen; you need to click "Load Term Info".
Or you can use Luke, which allows you to look inside the Lucene index.
It is not clear what you mean by 'remove'. You can, for example, add the terms to the stopwords in the analyzer chain if you want to avoid indexing them.

Resources