How do I add a char filter to a Microsoft language analyzer in Azure Search? - azure-cognitive-search

We want to use the language-specific analyzers provided by Azure Search, but add the HTML strip char filter from Lucene. Our idea was to build a custom analyzer that uses the same components (tokenizer, filters) as, for example, the en.microsoft analyzer, but add the additional char filter.
Sadly, we can't find any documentation on what exactly constitutes the en.microsoft analyzer or any other Microsoft analyzer, so we don't know which tokenizers or filters to use to get the same result with a custom analyzer.
Can anyone point us to the right documentation?
The documentation says that the en.microsoft analyzer performs lemmatization instead of stemming, but I can't find any tokenizer or filter that claims to use lemmatization, only stemmers.

To create a customized version of a Microsoft analyzer, start with the Microsoft tokenizer for the given language (there is a stemming and a non-stemming version), and add token filters from the set of available token filters to customize the output token stream. Note that the stemming tokenizer also does lemmatization, depending on the language.
In most cases, a Microsoft language analyzer is a Microsoft tokenizer plus a stopwords token filter and a lowercase token filter, but this varies by language; in some cases there is language-specific character normalization.
We recommend using the above as a starting point. You can then use the Analyze API to test your configuration and see whether it gives you the results you want.
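As a sketch only (the analyzer and tokenizer names here are made up, and there is no guarantee this exactly reproduces the internals of en.microsoft; the component type names come from the Azure Search custom analyzer reference), an index fragment approximating en.microsoft plus HTML stripping could look like:

```json
{
  "analyzers": [
    {
      "name": "en_html",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "charFilters": [ "html_strip" ],
      "tokenizer": "en_stemming_tokenizer",
      "tokenFilters": [ "lowercase", "stopwords" ]
    }
  ],
  "tokenizers": [
    {
      "name": "en_stemming_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
      "language": "english"
    }
  ]
}
```

Use the Analyze API to compare this analyzer's token stream against en.microsoft's on sample text and adjust the filter chain until the output matches.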

Related

Attribute Comparators in Vespa.ai

Does Vespa support comparators for string matching like Levenshtein, Jaro–Winkler, Soundex etc? Is there any way we can implement them as plugins as some are available in Elasticsearch? What are the approaches to do this type of searches?
The match modes supported by Vespa are documented at https://docs.vespa.ai/documentation/reference/schema-reference.html#match, plus regular expressions for attribute fields at https://docs.vespa.ai/documentation/reference/query-language-reference.html#matches.
None of the mentioned string matching/ranking algorithms are supported out of the box. Both edit-distance variants sound more like a text ranking feature, which should be easy to implement. (Open a GitHub issue at https://github.com/vespa-engine/vespa/issues.)
The matching in Vespa happens in a C++ component, so there is no plugin support there yet.
You can, however, deploy a Java plugin in the container by writing a custom searcher (https://docs.vespa.ai/documentation/searcher-development.html). There you can work on the top-k hits, using e.g. regular expression or n-gram matching to retrieve candidate documents. The Soundex algorithm can be implemented accurately using a searcher and a document processor.
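To make the last point concrete, here is a plain-Java sketch of the classic American Soundex encoding. It is self-contained (no Vespa dependencies); inside Vespa you would call something like this from your custom Searcher or document processor:

```java
public class Soundex {
    // Soundex digit for a letter; -1 means h/w (ignored, previous code kept),
    // 0 means vowel or y (ignored, but resets the previous code).
    private static int code(char c) {
        switch (c) {
            case 'b': case 'f': case 'p': case 'v': return 1;
            case 'c': case 'g': case 'j': case 'k':
            case 'q': case 's': case 'x': case 'z': return 2;
            case 'd': case 't': return 3;
            case 'l': return 4;
            case 'm': case 'n': return 5;
            case 'r': return 6;
            case 'h': case 'w': return -1;
            default: return 0;
        }
    }

    public static String encode(String name) {
        String s = name.toLowerCase().replaceAll("[^a-z]", "");
        if (s.isEmpty()) return "";
        StringBuilder out = new StringBuilder();
        out.append(Character.toUpperCase(s.charAt(0)));
        int prev = code(s.charAt(0));
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            int c = code(s.charAt(i));
            if (c > 0 && c != prev) out.append(c);   // skip repeats of the same code
            if (c != -1) prev = c;                    // h/w do not reset the code
        }
        while (out.length() < 4) out.append('0');     // pad to the fixed 4-char form
        return out.toString();
    }
}
```

For example, `Soundex.encode("Robert")` and `Soundex.encode("Rupert")` both yield `R163`, so indexing the encoded form gives you phonetic matching.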

How to ignore accents in Azure Search?

Does Azure Search support some way of ignoring accented characters? For example, if somebody searches for e it should include é characters in the search. Or would we need to add some converting at the moment of building the Azure Search Index?
Any recommendations welcome, thanks.
Yes. Please use the ASCII folding analyzer on your field. To do that, set the analyzer property on your field to
analyzer:"standardasciifolding.lucene"
Alternatively, use a language specific analyzer e.g. analyzer:"fr.microsoft".
To learn more about analyzers in Azure Search take a look here.
Note: different language analyzers treat diacritic marks differently; use the Analyze API to test analyzer behavior.
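For example, you can check how an analyzer folds accents with a request like the following (index name, api-version, and key are placeholders):

```http
POST /indexes/myindex/analyze?api-version=2019-05-06
Content-Type: application/json
api-key: <admin-key>

{
  "text": "café",
  "analyzer": "standardasciifolding.lucene"
}
```

The response lists the tokens the analyzer produced, so you can confirm whether é was folded to e before committing to an analyzer choice.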

Azure Search Analyzer

We need to create a field for an index that is not going to be tokenised but is still searchable. In Azure Search, if you make a field searchable, its contents are tokenised. If you make it filterable (the documentation says it then won't be tokenised), you cannot search it.
In Lucene, a KeywordAnalyzer does this job. Since Azure Search is also using Lucene, I can't understand why we cannot store a field's contents as-is in the index for searching, without splitting the words, removing stop words, etc.
Would appreciate any assistance.
Using the keyword analyzer and other Lucene analyzers is now possible with the custom analyzers feature of Azure Search. Note: this functionality is still in preview.
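For the keyword case specifically, you shouldn't even need a full custom analyzer definition: keyword is one of the predefined analyzer names, so it can be set directly on the field. A sketch (field name illustrative):

```json
{
  "name": "productCode",
  "type": "Edm.String",
  "searchable": true,
  "filterable": false,
  "analyzer": "keyword"
}
```

With this, the entire field value is indexed as a single token, so searches must match it as-is rather than word by word.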
HTH!

Solr : Stemming words Using Solr

I am learning Solr and want to use it for stemming words. I'll pass a word to Solr, and it should send the stemmed word back. I know how to configure a Solr core for different stemming patterns, and I can view the stemmed words in the analyzer (Solr admin UI), but I am not sure how to achieve this using Java code. I am able to index and query using the Java API.
I am using Solr 5.3.0.
If you just need to stem words, I would recommend not using the whole of Solr. Just use the code it uses for stemming, or something similar. E.g. you can use
org.apache.lucene.analysis.en.PorterStemmer.stem(String)
Unfortunately, PorterStemmer has package-level access, so I would copy it from the sources; or you can search the Internet for other stemmer implementations. I hope that helps.
Good luck!
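To illustrate the idea of suffix-stripping stemming in plain Java: this is a toy, NOT the Porter algorithm (copy the real PorterStemmer source for production use), and the rules and class name are invented for the example:

```java
public class SimpleStemmer {
    // Toy suffix stemmer: strips a handful of common English suffixes.
    // The real Porter algorithm has many more rules plus measure conditions.
    public static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("sses"))
            return w.substring(0, w.length() - 2);        // classes -> class
        if (w.endsWith("ies"))
            return w.substring(0, w.length() - 3) + "y";  // ponies  -> pony
        if (w.endsWith("ing") && w.length() > 5)
            return w.substring(0, w.length() - 3);        // walking -> walk
        if (w.endsWith("ed") && w.length() > 4)
            return w.substring(0, w.length() - 2);        // jumped  -> jump
        if (w.endsWith("s") && !w.endsWith("ss"))
            return w.substring(0, w.length() - 1);        // cats    -> cat
        return w;                                         // no rule matched
    }
}
```

This shows why embedding the stemmer directly is so much lighter than routing each word through a Solr round trip: it is a pure string transformation with no server involved.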

Does Solr have an equivalent to CompassQueryBuilder?

I am rewriting our company's search functionality to use Solr instead of Compass. Our old code is using CompassQueryBuilder.CompassQueryStringBuilder to build a query out of a list of keywords. The keywords may have spaces in them: for example: "john smith", "tom jones".
Is there an existing facility I can use in Solr to replicate this functionality?
The closest thing I know for SolrJ is the solrj-criteria project. It seems to be currently unmaintained though.
Solr offers a wide variety of querying and indexing options. Fields that contain keywords with spaces can be supported by defining a custom field type in the configuration file (see here), and queries with spaced keywords can be handled by specifying a custom QueryParser (see here).
Solr itself doesn't offer a QueryStringBuilder in an API. In fact, Solr itself doesn't offer any API classes at all, since all interaction is done by posting messages over HTTP. There are client libraries for Java, .NET, PHP, etc. The SolrNet API has a SolrMultipleCriteriaQuery, which is quite similar to the CompassQueryStringBuilder.
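The core of what the Compass builder did for this use case, quoting keywords that contain spaces, takes only a few lines to replicate. A sketch (the class is hypothetical, not a Solr API; the output string can be passed to SolrJ via new SolrQuery(queryString)):

```java
import java.util.List;

public class PhraseQueryBuilder {
    // Joins keywords into a Lucene/Solr query string, quoting any keyword
    // that contains a space and escaping embedded quotes/backslashes.
    public static String build(List<String> keywords) {
        StringBuilder sb = new StringBuilder();
        for (String k : keywords) {
            if (sb.length() > 0) sb.append(" OR ");
            String escaped = k.replace("\\", "\\\\").replace("\"", "\\\"");
            sb.append(k.contains(" ") ? "\"" + escaped + "\"" : escaped);
        }
        return sb.toString();
    }
}
```

For example, build(List.of("john smith", "tom jones")) produces "john smith" OR "tom jones", which Solr treats as two phrase queries rather than four independent terms.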
