Algolia Multilingual Search

In Algolia's documentation, the following is stated:
"Our engine is language-agnostic: both alphabet-based and symbol-based languages such as Chinese, Japanese or Korean are supported."
How do you determine the language analyser? For example, how do you determine the difference between es_ES and es_MX?
Which is the better approach for better search results: a single index or an index per language?

In our specific case we add attributes like:
title_en
title_de
....
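For illustration only, here is a minimal sketch (using the algoliasearch Python client) of how a query could be restricted to one of these per-language attributes; the credentials, index name and use of restrictSearchableAttributes are my assumptions, not something stated above:

from algoliasearch.search_client import SearchClient

# Hypothetical credentials and index name
client = SearchClient.create('YourApplicationID', 'YourSearchOnlyAPIKey')
index = client.init_index('articles')

# Restrict the query to the German title attribute only, so matches
# in title_en (or other languages) are not returned
results = index.search('fahrrad', {
    'restrictSearchableAttributes': ['title_de']
})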

Related

How to provide an OpenNLP model for tokenization in Vespa?

How do I provide an OpenNLP model for tokenization in Vespa? This mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If yes, can I simply set the set_language index expression by referring to the doc? I did not find any relevant information on how to implement this feature in https://docs.vespa.ai/en/linguistics.html; could you please help me out with this?
Required for CJK support.
Yes, the default tokenizer is OpenNLP and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to set it explicitly, with set_language in documents and language=... in queries, since language detection is unreliable on short text.
However, OpenNLP tokenization (not detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where we use kstem instead). So, no CJK.
To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use n-grams instead of tokenization; see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
N-grams are often a good choice with Vespa because they don't suffer from the recall problems of CJK tokenization, and by using a ranking model which incorporates proximity (such as nativeRank) you'll still get good relevancy.
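As a rough sketch of the query side (endpoint, document type and query text are placeholders, not from the answer above), passing the language explicitly to Vespa's HTTP query API could look like this:

import requests

# model.language tells the linguistics module which tokenizer to apply
# to the query text instead of relying on language detection
response = requests.get('http://localhost:8080/search/', params={
    'yql': 'select * from sources * where userQuery()',
    'query': '東京 ラーメン',
    'model.language': 'ja',
})
print(response.json())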

Solr multilingual stemming

I'm using Solr to index documents like .pdf or .docx. These documents are in French or in English and I want to use stemming for both languages.
For example, if I search "chevaux" I want to find "cheval" (French), and if I search "raise" I want to find "raising" (English).
Is there a way to do this without creating 2 cores (one in English and one in French)?
Have two fields, one with the field definition you want for French, and one with the field definition you want for English. Then use the Language Detection feature to submit the content to the correct field.
When searching, query the field that matches the language of the user, or if you don't know, search both, or use language detection to make a better guess.
You can also index the same content into both fields, but my initial guess is that it'll give you weird results down the road, where someone enters a French word but, due to the processing rules for English, you get a hit that wouldn't have happened if you had only indexed to the correct field.
By enabling langid.map, you can tell Solr to index the content into fields named fieldname_langcode (where fieldname is picked up from langid.fl).
langid.map: Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
You can use langid.map.replace or langid.map.pattern if you want to change the default fieldname_langcode naming, but I'd leave those alone for now.
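A minimal sketch of the "search both fields" option, assuming the detected-language fields ended up named content_en and content_fr (hypothetical names following the fieldname_langcode convention) and a core called documents:

import requests

# Query both language-specific fields with edismax; each field applies
# its own analysis chain, so French and English stemming both take effect
params = {
    'q': 'chevaux',
    'defType': 'edismax',
    'qf': 'content_fr content_en',
    'wt': 'json',
}
response = requests.get('http://localhost:8983/solr/documents/select', params=params)
print(response.json()['response']['numFound'])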

Simple search in App Engine

I want people to be able to search from a title field and a short description field (max 150 characters), so no real full-text search. Mainly they search for keywords, like "salsa" or "club", but I also want them to be able to search for "salsa" and match words like "salsaclub", so at least some form of partial matching.
Would the new Search API be useful for this kind of search, or would I be better off putting all keywords, including possible partial matches, in a list and filter on this list?
Trying to put all the keywords and partial matches (some sort of support for stemming etc.) might work if you limit yourself to small numbers of query terms (i.e. 1 or 2); anything more complex will become costly. If you want anything more than one or two terms I would look at the alternatives.
You haven't said if you're using Python, Java, Go or PHP. If Python, have a look at Whoosh for App Engine https://github.com/tallstreet/Whoosh-AppEngine or go with the Search API.
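If you go the list-property route instead of the Search API, here is a sketch of the kind of token expansion involved (per-word prefixes, so a query for "salsa" can match "salsaclub"); the minimum prefix length and the model definition are assumptions:

from google.appengine.ext import ndb

def prefix_tokens(text, min_len=3):
    # Build each word plus its prefixes, e.g. 'salsaclub' ->
    # 'sal', 'sals', 'salsa', ..., 'salsaclub'
    # (prefixes only, not arbitrary substrings)
    tokens = set()
    for word in text.lower().split():
        tokens.add(word)
        for i in range(min_len, len(word)):
            tokens.add(word[:i])
    return sorted(tokens)

class Event(ndb.Model):
    title = ndb.StringProperty()
    description = ndb.StringProperty()
    search_tokens = ndb.StringProperty(repeated=True)

# Event.query(Event.search_tokens == 'salsa') then matches documents
# whose title or description contains 'salsaclub'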

Specify language in queries to Search API

When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!
I don't think you can currently, but I haven't seen anything that explicitly says so. I'm inferring from these sentences that the language field is for their use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
search.Document(
    fields = [
        ...,
        search.TextField(name='language', value=lang),
    ],
    language = lang,
)
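With that extra field in place, restricting a query to one language is just a field match in the query string; the index name and the combined query below are my assumption of how it would be used:

from google.appengine.api import search

index = search.Index(name='documents')
# Matches only documents whose duplicated 'language' field is 'en'
# and which also contain the term 'salsa'
results = index.search('language: en AND salsa')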

Filter language in Solr

I indexed German and English docs with Solr, and I want the ability to search just inside the German docs or just inside the English docs. How do I configure this?
Thanks
Some of the options are mentioned at http://lucidworks.lucidimagination.com/display/LWEUG/Multilingual+Indexing+and+Search
You may end up with having to implement multiple language fields and language detection.
Have a field indicating the language, and search on that.
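For the "field indicating the language" option, here is a sketch of filtering on that field with a Solr filter query; the core name and field name are assumptions:

import requests

# fq restricts results to German documents without affecting scoring
params = {
    'q': 'fahrrad',
    'fq': 'language:de',
    'wt': 'json',
}
response = requests.get('http://localhost:8983/solr/docs/select', params=params)
print(response.json()['response']['numFound'])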
