Filter language in Solr

I indexed German and English docs with Solr and I want the ability to search only within the German docs or only within the English docs. How do I configure this?
Thanks!

Some of the options are covered at http://lucidworks.lucidimagination.com/display/LWEUG/Multilingual+Indexing+and+Search
You may end up having to implement multiple language fields and language detection.

Have a field indicating the language, and search on that.
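For example (field and value names here are illustrative), if each document has a string field language holding its language code, a Solr filter query restricts results to one language without affecting scoring:
# return only German documents
&q=searchterm&fq=language:de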

Related

How to provide an OpenNLP model for tokenization in Vespa?

How do I provide an OpenNLP model for tokenization in Vespa? This mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If yes, can I simply use the set_language indexing expression by referring to the doc? I did not find any relevant information on how to implement this feature in https://docs.vespa.ai/en/linguistics.html, could you please help me out with this?
Required for CJK support.
Yes, the default tokenizer is OpenNLP and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to set it explicitly with set_language in the indexing expression (and the language=... parameter in queries), since language detection is unreliable on short text.
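As a minimal sketch (schema and field names are illustrative, assuming the document language is known at feed time):
schema doc {
    document doc {
        # set_language must appear before the text fields it should apply to
        field language type string {
            indexing: set_language
        }
        field title type string {
            indexing: index | summary
        }
    }
}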
However, OpenNLP tokenization (as opposed to language detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where we use kstem instead). So, no CJK.
To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use n-grams instead of tokenization; see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
n-gram matching is often a good choice with Vespa because it doesn't suffer from the recall problems of CJK tokenization, and by using a ranking model which incorporates proximity (such as nativeRank) you'll still get good relevance.
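A sketch of the gram option from the schema reference in a field definition (a gram size of 2 is an assumption here, not a recommendation from the thread):
field title type string {
    indexing: index | summary
    # match on overlapping 2-character grams instead of tokenizing
    match {
        gram
        gram-size: 2
    }
}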

Solr multilingual stemming

I'm using Solr to index documents like .pdf or .docx. These documents are in French or in English and I want to use stemming for both languages.
For example, if I search "chevaux" I want to find "cheval" (French), and if I search "raise" I want to find "raising" (English).
Is there a way to do this without creating 2 cores (one for English and one for French)?
Have two fields, one with the field definition you want for French, and one with the field definition you want for English. Then use the Language Detection feature to submit the content to the correct field.
When searching, query the field that matches the language of the user, or if you don't know it, search both - or use language detection on the query to make a better guess.
You can also index the same content into both fields, but my initial guess is that it'll give you weird results down the road, where someone enters a French word, but due to the processing rules for English, you get hits that wouldn't have happened if you had only indexed to the correct field.
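In schema.xml terms this could look like the following sketch (text_fr and text_en are the French and English field types that ship with Solr's default configuration; the field names are illustrative):
<field name="content_fr" type="text_fr" indexed="true" stored="true"/>
<field name="content_en" type="text_en" indexed="true" stored="true"/>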
By enabling langid.map, you can tell Solr to index the content into fields named fieldname_langcode (where fieldname is picked up from langid.fl).
langid.map: Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
You can use langid.map.replace or langid.map.pattern if you want to change the default fieldname_langcode naming, but I'd leave those alone for now.
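As a sketch, the corresponding update chain in solrconfig.xml could look like this (the processor class and langid.* parameters come from Solr's language detection contrib; the chain and field names are illustrative and match the field example above):
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">content</str>
    <str name="langid.langField">language</str>
    <bool name="langid.map">true</bool>
    <str name="langid.whitelist">fr,en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>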

Spell checking with Solr

I use Solr to index documents (pdf, Word, .txt, etc.). I need to use a spell checker (in French) but I don't know how to do this. I only need this function on the field "content"; the type of this field is text_general.
The spellchecker uses the content of your index to build the terms that are used for suggestions - there is no language configuration: as long as the content that has been indexed is French, the suggestions returned to the user will be based on those terms.
The exception is if you're using the FileBasedSpellChecker, where you provide a dictionary of terms with their correct spelling.
# spellcheck.q is only necessary if you want to use a different query than your actual query
&spellcheck=true&spellcheck.q=foo
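Behind those parameters is a spellcheck component in solrconfig.xml; a sketch against the "content" field could look like this (solr.DirectSolrSpellChecker builds suggestions directly from the indexed terms; the component and spellchecker names are illustrative):
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">content</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>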

Algolia Multilingual Search

In Algolia's documentation the following is stated:
"Our engine is language-agnostic: both alphabet-based and symbol-based languages such as Chinese, Japanese or Korean are supported."
How do you determine the language analyzer? For example, how does it determine the difference between es_ES and es_MX?
Which is the better approach for good search results: a single index or an index per language?
In our specific case we add attributes like:
title_en
title_de
....
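A sketch of this setup with the Algolia Python client (the app ID, API key, and index name are placeholders; restrictSearchableAttributes is Algolia's standard parameter for limiting which attributes a query matches against):
from algoliasearch.search_client import SearchClient

client = SearchClient.create('YourApplicationID', 'YourAPIKey')
index = client.init_index('products')

# one attribute per language on each record
index.save_object({
    'objectID': '42',
    'title_en': 'horse',
    'title_de': 'Pferd',
})

# search only the German title attribute
results = index.search('Pferd', {
    'restrictSearchableAttributes': ['title_de'],
})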

Specify language in queries to Search API

When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!
I don't think you can currently, but I haven't seen anything explicitly saying that. I'm inferring from these sentences that the language field is for their use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
search.Document(
    fields=[
        ...,  # the document's regular fields
        # duplicate the language into an ordinary, queryable field
        search.TextField(name='language', value=lang),
    ],
    language=lang,  # used for tokenization only; not queryable
)
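The query side could then look like this sketch (the index name is hypothetical; the Search API query language supports field:value restrictions combined with AND):
from google.appengine.api import search

index = search.Index(name='my-index')
# match only French documents containing "cheval"
results = index.search('language: fr AND cheval')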
