How do I provide an OpenNLP model for tokenization in Vespa? The documentation mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If yes, can I simply use the set_language indexing expression as described in the docs? I did not find any relevant information on how to implement this in https://docs.vespa.ai/en/linguistics.html; could you please help me out with this?
This is required for CJK support.
Yes, the default tokenizer is OpenNLP, and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to use set_language at indexing time (and the language=... parameter in queries), since language detection is unreliable on short text.
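For illustration, a minimal schema sketch (the schema and field names are made up here) where a language field in the document sets the language used when processing the fields that follow it:

schema doc {
    document doc {
        # set_language should come before the fields whose processing depends on it
        field language type string {
            indexing: set_language
        }
        field title type string {
            indexing: index | summary
        }
    }
}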
However, OpenNLP tokenization (not detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where we use kstem instead). So, no CJK.
To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use n-gram matching instead of tokenization, see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
N-gram matching is often a good choice with Vespa because it doesn't suffer from the recall problems of CJK tokenization, and by using a ranking model which incorporates proximity (such as nativeRank) you'll still get good relevance.
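As a sketch, n-gram matching is configured per field in the schema (the field name and gram size here are just examples):

field title type string {
    indexing: index | summary
    match {
        gram
        gram-size: 2   # index and match 2-grams instead of linguistic tokens
    }
}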
Related
Solr has the SnowballPorterFilterFactory that you can use with a language parameter:
<filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
Solr also has some language-specific stemmers, like the PortugueseStemFilterFactory. I have read the documentation but I am unable to find out what the difference between them is.
From the source comments:
Portuguese stemmer implementing the RSLP (Removedor de Sufixos da Lingua Portuguesa) algorithm. This is sometimes also referred to as the Orengo stemmer.
The algorithm used is specifically tailored to the needs of the Portuguese language, and knows about the different word classes and how they should be stemmed in Portuguese.
The Snowball stemmer, however, is a general stemmer engine where you give it a dictionary to work with, i.e. the suffixes that should be stemmed, etc. This does not allow the same kind of knowledge about how to classify and stem specific word classes.
I can't see any reason why you'd want to use the Snowball version when you have the Portuguese RSLP available, but I haven't done any work in Portuguese (I did, however, have to manually update the Norwegian one for certain edge cases that Snowball didn't catch by default).
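For comparison, the two analyzer chains look roughly like this in schema.xml (the fieldType name and tokenizer choice are just examples):

<fieldType name="text_pt" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Portuguese-specific RSLP stemmer -->
    <filter class="solr.PortugueseStemFilterFactory"/>
    <!-- or, instead, the generic Snowball engine with the Portuguese rule set:
         <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/> -->
  </analyzer>
</fieldType>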
In Algolia's documentation the following is stated:
"Our engine is language-agnostic: both alphabet-based and symbol-based languages such as Chinese, Japanese or Korean are supported."
How do you determine the language analyser? For example, how do you determine the difference between es_ES and es_MX?
Which is the better approach for good search results: a single index, or one index per language?
In our specific case we add attributes like:
title_en
title_de
....
I'm trying to index some old documents for searching -- 16th, 17th, 18th century.
Modern stemmers don't seem to handle the antiquated word endings: worketh, liveth, walketh.
Are there stemmers that specialize in the English from the time of Shakespeare and the King James Bible? I'm currently using solr.PorterStemFilterFactory.
It looks like the rule changes are minimal for that.
So, it might be possible to copy/modify the PorterStemmer class and related Factories/Filters.
Or it might be possible to add those specific rules as a regular-expression filter before Porter; a sketch follows below.
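A rough sketch of that second option, assuming a pattern-replace filter in front of the Porter stemmer is enough (the regex below is purely illustrative; it strips the archaic -eth ending from e.g. worketh, liveth and walketh, but a real rule set would need more care to avoid touching ordinary words):

<fieldType name="text_early_en" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- crude rule: liveth -> live, worketh -> worke, which Porter then reduces to work -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(.{3,}e)th$" replacement="$1" replace="all"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>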
When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!
I don't think you can currently, but I haven't seen anything explicitly saying that. I'm inferring from these sentences that the language field is for their own use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
search.Document(
fields = [
...,
search.TextField(name='language', value=lang),
],
language = lang,
)
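With that in place, limiting a query to one language is just a field match on the extra field (the index name and query term here are made up):

from google.appengine.api import search

index = search.Index(name='documents')
# Combine the user's search term with a restriction on the redundant language field.
results = index.search('language: en AND shakespeare')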
I indexed German and English docs with Solr and I want the ability to search just inside the German docs or the English docs. How do I configure this?
Thanks
Some of the options are mentioned at http://lucidworks.lucidimagination.com/display/LWEUG/Multilingual+Indexing+and+Search
You may end up having to implement multiple language fields and language detection.
Have a field indicating the language, and search on that, for example:
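As a sketch of that last option (field and value names are just examples), declare a language field in schema.xml, fill it in when indexing, and add a filter query at search time:

<!-- schema.xml: a simple string field holding the document language -->
<field name="language" type="string" indexed="true" stored="true"/>

A query limited to the German documents then looks something like:

http://localhost:8983/solr/select?q=hund&fq=language:de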