Does Vespa support comparators for string matching like Levenshtein, Jaro–Winkler, Soundex, etc.? Is there any way we can implement them as plugins, as some are available in Elasticsearch? What are the approaches to implementing this type of search?
The match modes supported by Vespa are documented here: https://docs.vespa.ai/documentation/reference/schema-reference.html#match, plus regular expressions for attribute fields: https://docs.vespa.ai/documentation/reference/query-language-reference.html#matches
None of the mentioned string matching/ranking algorithms are supported out of the box. Both edit distance variants sound more like a text ranking feature, which should be easy to implement. (Open a GitHub issue at https://github.com/vespa-engine/vespa/issues)
The matching in Vespa happens in a C++ component, so there is no plugin support there yet.
You can, however, deploy a Java plugin in the container by writing a custom searcher (https://docs.vespa.ai/documentation/searcher-development.html). Then you can work on the top-k hits, using e.g. regular expression or n-gram matching to retrieve candidate documents. The Soundex algorithm can be implemented accurately using a searcher and a document processor.
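As an illustration, here is a minimal sketch of such a searcher that re-ranks the returned hits by Levenshtein distance; the field name `title` and the query property `fuzzy.term` are just assumptions for the example, not anything Vespa provides:

```java
package com.example;

import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.result.Hit;
import com.yahoo.search.searchchain.Execution;

public class EditDistanceSearcher extends Searcher {

    @Override
    public Result search(Query query, Execution execution) {
        Result result = execution.search(query);
        execution.fill(result); // make sure field values are available on the hits

        String target = query.properties().getString("fuzzy.term", "");
        for (Hit hit : result.hits()) {
            if (hit.isMeta()) continue;
            Object value = hit.getField("title");
            if (value != null) {
                // Re-score: smaller edit distance => higher relevance
                hit.setRelevance(1.0 / (1 + levenshtein(target, value.toString())));
            }
        }
        result.hits().sort(); // re-order by the new relevance scores
        return result;
    }

    // Classic two-row dynamic-programming Levenshtein distance
    private static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```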
Related
Do you know if I can get synonyms.txt files for all languages supported by Solr?
Thanks for your help.
Previously we were using Verity, which provides a dictionary of synonyms for each supported language, but we may want to move to Solr/Lucene.
I know that we can provide a custom synonym list, but that is not what I want. I am looking for a way to get a default dictionary of synonyms for each language supported by Lucene.
There is no 'out of the box' synonym resource provided for all the languages.
At least for some languages, you have WordNet (which is free); see the Solr wiki for WordNet usage.
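For example, a minimal sketch of a Solr field type whose analyzer expands WordNet synonyms; wn_s.pl is the WordNet prolog synonym file you would place in the config directory, and the field type name is just an example:

```xml
<fieldType name="text_wordnet" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="wn_s.pl"
            format="wordnet" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```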
A list of synonyms for any language is going to be very specific to the use cases related to a given set of indexed items. For that reason it would not be practical to have any prebuilt language specific versions of these files. Even the synonyms.txt that comes with Solr distribution is only built out enough to show examples of how the synonyms can be constructed.
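For reference, the synonyms.txt format itself is simple; the entries below are just illustrative:

```
# Comma-separated entries are mutually equivalent terms
couch,sofa,divan
GB,gigabyte,gigabytes

# "=>" maps the left-hand terms to the right-hand replacements
small => tiny,little
```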
I am comparing Lucene/Solr, Whoosh, Sphinx, and Xapian for searching documents in DOC, DOCX, HTML, and PDF. Only Solr is documented to have a document parser (Tika) which directly indexes documents, so it seems the clear winner.
But to level the playing field, I would like to consider the alternatives. Do the others have direct document indexing (which I may have missed)? If not, can it be implemented easily? Or is Solr the overwhelming choice?
With Sphinx you can convert files using a PHP script through the xmlpipe_command option. Since PHP has a Tika wrapper, writing the script and the setup itself isn't hard.
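For illustration, a minimal sketch of the relevant sphinx.conf sections, assuming a hypothetical extract_docs.php script that calls the Tika wrapper and prints xmlpipe2 XML:

```
source documents
{
    type            = xmlpipe2
    xmlpipe_command = php /usr/local/bin/extract_docs.php
}

index documents
{
    source = documents
    path   = /var/data/sphinx/documents
}
```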
I am rewriting our company's search functionality to use Solr instead of Compass. Our old code uses CompassQueryBuilder.CompassQueryStringBuilder to build a query out of a list of keywords. The keywords may contain spaces, for example "john smith" and "tom jones".
Is there an existing facility I can use in Solr to replicate this functionality?
The closest thing I know for SolrJ is the solrj-criteria project. It seems to be currently unmaintained though.
Solr offers a wide variety of querying and indexing options. Fields that contain keywords with spaces can be supported by defining a custom field type in the configuration file (see here). Queries with spaced keywords can be handled by specifying a custom QueryParser (see here).
Solr itself doesn't offer a QueryStringBuilder in an API. Actually, Solr itself doesn't offer any API classes at all, since all interaction is done by posting messages over HTTP. There are client libraries for Java, .NET, PHP, etc. In the SolrNet API there is a SolrMultipleCriteriaQuery, which is quite similar to the CompassQueryStringBuilder.
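If you are on Java, you can also just build the query string yourself with SolrJ. A minimal sketch that quotes each keyword so embedded spaces survive; the core URL, field name, and keywords are just examples:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

import java.util.List;
import java.util.stream.Collectors;

public class KeywordQuery {

    // OR together exact phrase matches, quoting each keyword so that
    // embedded spaces are kept ("john smith", "tom jones").
    static String fromKeywords(String field, List<String> keywords) {
        return keywords.stream()
                .map(k -> field + ":\"" + k.replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(" OR "));
    }

    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
            SolrQuery query =
                new SolrQuery(fromKeywords("name", List.of("john smith", "tom jones")));
            QueryResponse response = solr.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());
        }
    }
}
```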
I have a static collection of over 300,000 text and html files. I want to be able to search them for words, exact phrases, and ideally regex patterns. I want the searches to be fast.
I think searching for words and phrases can be done by looking up a dictionary of unique words referencing the files that contain each word, but is there a way to get reasonably fast regex matching?
I don't mind using existing software if such exists.
Consider Lucene: http://lucene.apache.org/java/docs/index.html
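To make that concrete, here is a minimal Lucene sketch that indexes the collection once and then serves phrase and regex queries; the paths, field names, and query terms are just examples, and HTML would ideally be stripped to plain text before indexing (skipped here). Note that Lucene's RegexpQuery matches individual indexed terms, not arbitrary spans of text:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StaticCollectionSearch {
    public static void main(String[] args) throws Exception {
        Path indexDir = Path.of("/tmp/demo-index"); // example locations
        Path docsDir = Path.of("/data/docs");

        // One-off indexing pass; the collection is static, so this cost is paid once.
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
                                                  new IndexWriterConfig(new StandardAnalyzer()));
             Stream<Path> files = Files.walk(docsDir)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try {
                    Document doc = new Document();
                    doc.add(new StringField("path", file.toString(), Field.Store.YES));
                    doc.add(new TextField("body", Files.readString(file), Field.Store.NO));
                    writer.addDocument(doc);
                } catch (Exception e) {
                    // skip files that cannot be read as text
                }
            });
        }

        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(indexDir))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Exact phrase search
            long phraseHits = searcher.search(
                new PhraseQuery("body", "hello", "world"), 10).totalHits.value;
            // Regex search over indexed terms
            long regexHits = searcher.search(
                new RegexpQuery(new Term("body", "col(o|ou)r")), 10).totalHits.value;
            System.out.println(phraseHits + " phrase hits, " + regexHits + " regex hits");
        }
    }
}
```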
There are quite a few options on the market that will help you achieve what you want; some are open source and some are commercial:
Open source:
elasticsearch - based on Lucene
constellio - based on Lucene
Sphinx - written in C++
Solr - built on top of Lucene
You can have a look at Microsoft Search Server Express 2010: http://www.microsoft.com/enterprisesearch/searchserverexpress/en/us/technical-resources.aspx
http://blog.webdistortion.com/2011/05/29/open-source-search-engines/
Let's say we have a list of articles that are indexed by Sunspot/Solr/Lucene (or any other search engine).
How can it be used to find articles similar to a given article?
Should this be done with a keyword extraction tool, like http://www.wordsfinder.com/api_Keyword_Extractor.php, termextract from http://developer.yahoo.com/yql/console, or http://www.alchemyapi.com/api/demo.html?
It seems you're looking for the MoreLikeThis feature.
What you are trying to do is very similar to the task I outlined in this answer.
In brief, you need to generate a summary for each document that you can use as the query to compare it with every other document. A document summary could be as simple as the top N terms in that document (excluding stop words). You can generate the top N terms from a Lucene document pretty easily without using any 3rd-party tools; there are plenty of examples on SO and the web of how to do this.
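For example, Lucene's MoreLikeThis builds exactly such a top-terms query for you. A minimal sketch, assuming an index whose body field was indexed with term vectors; the index path, field name, and doc id are just examples:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;

public class SimilarArticles {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
                 DirectoryReader.open(FSDirectory.open(Path.of("/tmp/demo-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());
            mlt.setFieldNames(new String[] {"body"}); // field(s) to mine for top terms
            mlt.setMinTermFreq(2);  // ignore terms rare within the source doc
            mlt.setMinDocFreq(2);   // ignore terms rare across the index

            int seedDoc = 42; // internal Lucene doc id of the source article
            Query query = mlt.like(seedDoc); // query built from the doc's top terms
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                if (hit.doc != seedDoc) {
                    System.out.println("doc=" + hit.doc + " score=" + hit.score);
                }
            }
        }
    }
}
```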