Specify language in queries to Search API - google-app-engine

When creating a document to add to a search index, you can specify the document language. I've done this, but would now like to query only those docs in a specific language. Is this possible? I assumed it would be trivial (and documented), but I can't find how to do it.
Thanks!

I don't think you can currently, but I haven't seen anything explicitly saying that. I'm inferring from these sentences that the language field is for their use and not for querying.
The language parameter for search.TextField:
Two-letter ISO 639-1 language code for the field's content, to assist in tokenization. If None, the language code of the document will be used.
And Building Queries:
Search supports all space-delimited languages as well as some languages not segmented by spaces (specifically, Chinese, Japanese, Korean, and Thai). For these languages, Search segments the text automatically.
They need to know the language so they know how to parse it into words.
My plan is to just add an additional field to my search documents that has the same value as the language field. It's slightly redundant, but simple to do.
search.Document(
    fields=[
        ...,
        search.TextField(name='language', value=lang),
    ],
    language=lang,
)
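With that extra field in place, restricting a query to one language is an ordinary field match. A minimal sketch of the query side, assuming an index named 'articles' and a hypothetical 'body' field alongside the 'language' field added above:

from google.appengine.api import search

index = search.Index(name='articles')  # hypothetical index name

# 'language' is the extra TextField added above; the document's built-in
# language attribute itself is not queryable.
results = index.search('language: fr AND body: cheval')
for doc in results:
    print(doc.doc_id)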

Related

How to provide an OpenNLP model for tokenization in Vespa?

How do I provide an OpenNLP model for tokenization in Vespa? This mentions that "The default linguistics module is OpenNlp". Is this what you are referring to? If yes, can I simply set the set_language index expression by referring to the doc? I did not find any relevant information on how to implement this feature in https://docs.vespa.ai/en/linguistics.html; could you please help me out with this?
Required for CJK support.
Yes, the default tokenizer is OpenNLP and it works with no configuration needed. It will guess the language if you don't set it, but if you know the document language it is better to set it explicitly with set_language (and with language=... in queries), since language detection is unreliable on short text.
However, OpenNLP tokenization (not detection) only supports Danish, Dutch, Finnish, French, German, Hungarian, Irish, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish and English (where we use kstem instead). So, no CJK.
To support CJK you need to plug in your own tokenizer as described in the linguistics doc, or else use n-grams instead of tokenization; see https://docs.vespa.ai/documentation/reference/schema-reference.html#gram
n-grams are often a good choice with Vespa because they don't suffer from the recall problems of CJK tokenization, and by using a ranking model which incorporates proximity (such as nativeRank) you'll still get good relevancy.
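On the query side, the language mentioned above can be passed as a request parameter so the linguistics module knows how to tokenize the query text. A rough sketch against Vespa's HTTP query API (the endpoint, schema and field name are assumptions):

import requests

# Hypothetical local Vespa endpoint; 'language' is the query-API alias for
# model.language and tells the tokenizer which language the query text is in.
response = requests.get(
    "http://localhost:8080/search/",
    params={
        "yql": 'select * from sources * where title contains "東京タワー"',
        "language": "ja",
    },
)
print(response.json()["root"].get("children", []))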

Solr multilingual stemming

I'm using Solr to index documents like .pdf or .docx. These documents are in French or English and I want to use stemming for both languages.
For example, if I search for "chevaux" I want to find "cheval" (French), and if I search for "raise" I want to find "raising" (English).
Is there a way to do this without creating two cores (one for English and one for French)?
Have two fields, one with the field definition you want for French, and one with the field definition you want for English. Then use the Language Detection feature to submit the content to the correct field.
When searching, query the field that has the correct language as the user, or if you don't know, search both - or use language detection to try to do a better guess.
You can also index the same content into both fields, but my initial guess is that it'll give you weird results down the road, where someone enters a French word, but due to the processing rules for English, you get a hit that wouldn't have happened if you had only indexed into the correct field.
By enabling langid.map, you can tell Solr to index the content into fields named fieldname_langcode (where fieldname is picked up from langid.fl).
langid.map: Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
You can use langid.map.replace or langid.map.pattern if you want to change the default fieldname_langcode naming, but I'd leave those alone for now.
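Querying then targets the mapped field for the user's language, or both fields when the language is unknown, as described above. A sketch against Solr's HTTP select API, assuming a core named 'docs' and langid-mapped fields content_en and content_fr (fieldname_langcode):

import requests

# Assumed core name and field names produced by langid.map.
solr_select = "http://localhost:8983/solr/docs/select"

# Known language: query only the matching field.
french = requests.get(
    solr_select, params={"q": "content_fr:chevaux", "wt": "json"}
).json()

# Unknown language: search both fields and let scoring decide.
mixed = requests.get(
    solr_select,
    params={"q": "content_en:raising OR content_fr:chevaux", "wt": "json"},
).json()

print(french["response"]["numFound"], mixed["response"]["numFound"])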

Algolia Multilingual Search

In Algolia's documentation the below is stated:
"Our engine is language-agnostic: both alphabet-based and symbol-based languages such as Chinese, Japanese or Korean are supported."
How do you determine the language analyzer? For example, how do you distinguish between es_ES and es_MX?
Which is the better approach for good search results: a single index or one index per language?
In our specific case we add attributes like:
title_en
title_de
....
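Searches can then be restricted to the attribute matching the user's language on a per-query basis. A rough sketch with the algoliasearch Python client (credentials, index name and record contents are placeholders):

from algoliasearch.search_client import SearchClient

# Placeholder credentials and index name.
client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")
index = client.init_index("products")

# One attribute per language on each record, e.g. title_en, title_de.
index.set_settings({"searchableAttributes": ["title_en", "title_de"]})
index.save_object(
    {"objectID": "1", "title_en": "running shoes", "title_de": "Laufschuhe"}
)

# Only search the German attribute for a German-speaking user.
results = index.search(
    "Laufschuhe", {"restrictSearchableAttributes": ["title_de"]}
)
print(results["hits"])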

Including currency symbols in Solr / Lucene indexes

Is it possible to index a text field considering currency symbols as separate tokens?
For example in a text field I have this:
"16 €"
and I need to build an index with these entries:
16
€
so that searching for "€" finds the document.
Now I'm using StandardTokenizer and it discards currency symbols.
A possible solution could be using a more "trivial" tokenizer such as the WhitespaceTokenizer, but I think it would produce worse tokenization on other text.
Note that the problem is not how to index currencies; this is a trivial example, but the field could contain arbitrary text.
One possible solution, albeit not very pretty, is to replace the euro sign with something the tokenizer you've chosen will leave alone. You can use a MappingCharFilterFactory to replace the euro sign with a string like EUROSIGN before tokenization, and then map it back again afterwards.
Unless you're able to formally express exactly how you want your tokenizer to work, you'll have to go with one of the preset versions that are suitable for most content to give usable search results. If you have a more specific rule set, writing your own tokenizer in Java is an option.
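To sanity-check the idea before wiring up the analyzer chain, the same replacement can be prototyped client-side: apply an identical substitution to documents and queries so the placeholder tokens match. A plain-Python sketch (the mapping table is a hypothetical stand-in for the MappingCharFilterFactory's mapping file, not Solr configuration itself):

# Map symbols the StandardTokenizer would drop to tokens it will keep.
# Hypothetical mapping; in Solr proper this belongs in the charFilter's
# mapping file, not in client code.
CURRENCY_TOKENS = {"€": " EUROSIGN ", "$": " DOLLARSIGN ", "£": " POUNDSIGN "}

def protect_currency_symbols(text):
    """Replace currency symbols with placeholder tokens before indexing or searching."""
    for symbol, token in CURRENCY_TOKENS.items():
        text = text.replace(symbol, token)
    return text

# Apply the same transformation to documents and to queries so they match.
print(protect_currency_symbols("16 €"))        # -> "16  EUROSIGN "
print(protect_currency_symbols("price in €"))  # -> "price in  EUROSIGN "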

Searching for words that are contained in other words

Let's say that one of my fields in the index contains the word entrepreneurial. When I search for the word entrepreneur I don't get that document. But entrepreneur* does.
Is there a mode/parameter in which queries search for document that have words that contain a word token in search text?
Another example would be finding a doc that has Matthew when you're looking for Matt.
Thanks
We don't currently have a mode where all input terms are treated as prefixes. You have a few options depending on what exactly you are looking for:
Set the target searchable field to a language specific analyzer. This is the nicest option from the linguistics perspective. When you do this, if appropriate for the language we'll do stemming which helps with things such as "run" versus "running". It won't help with your specific sample of "entrepreneurial" but generally speaking this helps significantly with recall.
Split the search input before sending it to search and add "*" to each term (see the sketch after these options). Depending on your target language this is relatively easy (i.e. if there are spaces) or very hard. Note that prefixes don't mix well with stemming unless you take them into account and search both forms (e.g. something like search=aa bb -> (aa | aa*) (bb | bb*)).
Lean on suggestions. This is more of a different angle that may or may not match your scenario. Search suggestions are good at partial/prefix matching and they'll help users land on the right terms. You can read more about this here.
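Here is a small sketch of the second option, expanding each whitespace-separated term into a term-or-prefix disjunction for the simple query syntax (purely illustrative string handling, no Azure SDK involved):

def to_prefix_query(user_input):
    """Expand 'aa bb' into '(aa | aa*) (bb | bb*)' for the simple query syntax."""
    terms = user_input.split()
    return " ".join("({t} | {t}*)".format(t=t) for t in terms)

print(to_prefix_query("entrepreneur matt"))
# -> (entrepreneur | entrepreneur*) (matt | matt*)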
Perhaps this page might be of interest:
https://msdn.microsoft.com/en-us/library/azure/dn798927.aspx
search=[string]
Optional. The text to search for. All searchable fields are searched by
default unless searchFields is specified. When searching searchable fields, the search text itself is tokenized, so multiple terms can be separated by white space (e.g.: search=hello world). To match any term, use * (this can be useful for boolean filter queries). Omitting this parameter has the same effect as setting it to *. See Simple query syntax in Azure Search for specifics on the search syntax.
