Applying Language Specific Analyzer Dynamically before Solr Indexing

I want to index multilingual data. I can identify the language of any field with Solr's language detection. Now how can I apply a language-specific analyzer dynamically to that field? I do not want to create language-specific fields (like content_en, content_hi, etc.); I want to apply the language-specific analyzer to the same field at run time.
I am new to search technology. Can anyone help me out?

My suggestion is to use a separate Solr core for each language.
That's what I did, and it was a very elegant and practical solution.
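To make that concrete, each core's schema.xml can define the same field name with a language-specific analysis chain. A minimal sketch, assuming two hypothetical cores for English and Hindi (the filter factories are standard Lucene/Solr ones; the field and type names are made up):

<!-- schema.xml of the English core: one "content" field, English stemming -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="content" type="text_en" indexed="true" stored="true"/>

<!-- schema.xml of the Hindi core: same field name, Hindi analysis chain -->
<fieldType name="text_hi" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.IndicNormalizationFilterFactory"/>
    <filter class="solr.HindiNormalizationFilterFactory"/>
    <filter class="solr.HindiStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="content" type="text_hi" indexed="true" stored="true"/>

Queries then stay on a single field name; you simply route each document and each query to the core for its language.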

Related

Custom analyzer or multi language support for same field

I am trying to see possible solutions to deal with multi-language search functionality using Azure Cognitive Search. For the index below, the Name field has various language-related options for Analyzer, but it supports only one language per field.
Is there a way to have multi-language support in an index?
This article should help. In summary, you'll need multiple fields, one for each language. The article discusses options for how to structure queries over those fields.
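As a rough sketch of what the multi-field approach looks like in an Azure Cognitive Search index definition (the index and field names here are hypothetical; en.microsoft and fr.microsoft are built-in Azure language analyzers):

{
  "name": "products",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true },
    { "name": "name_en", "type": "Edm.String", "searchable": true, "analyzer": "en.microsoft" },
    { "name": "name_fr", "type": "Edm.String", "searchable": true, "analyzer": "fr.microsoft" }
  ]
}

At query time you then target the field(s) matching the user's language, e.g. searchFields=name_fr.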

What is the difference between the suggester component and Carrot2 clustering in Solr?

I need to know, in plain language, what the difference is between them, what exactly they are used for, and how. I have a Solr project that suggests results based on queries as a personalization approach.
Which one can be used?
They're very different features. Carrot2 is a clusterer - i.e. it finds clusters of similar documents that belong together. That means that it attempts to determine which documents describe the same thing, and group them together based on these characteristics.
The suggester component is mainly used for autocomplete-like features, where you're giving the user suggestions on what to search for (i.e. trying to guess what the user wants to accomplish before they have typed their whole query).
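For reference, a minimal suggester configuration in solrconfig.xml looks roughly like this (the title field and handler path are assumptions; the component, lookup, and dictionary classes are Solr's):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <!-- build the suggestion dictionary from an indexed field -->
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">title</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

A request like /suggest?suggest.q=appl would then return completions such as "apple".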
Neither is intended for personalization. You might want to look at Learning to Rank to apply certain models based on what you know about the user's input. You'll have to find out which features you have that describe your users and apply those as external feature information.
There's also a feature to examine semantic knowledge graphs (i.e. "this concept is positively related to this other concept"), but that's probably outside what you're looking for.

Is it possible to get the keywords in the documents returned by Solr

Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents by the keywords inside them. Or do I have to work out the keywords myself and save them in a field?
Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
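A hedged sketch of what that involves (the content field name is an assumption): the field must be indexed with term vectors in schema.xml, and the component wired into a request handler in solrconfig.xml:

<!-- schema.xml: keep term vectors for the field you want tokens from -->
<field name="content" type="text_general" indexed="true" stored="true"
       termVectors="true"/>

<!-- solrconfig.xml: expose them through a handler -->
<searchComponent name="tvComponent" class="solr.TermVectorComponent"/>
<requestHandler name="/tvrh" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>

Querying /tvrh?q=author:joe&tv.fl=content then returns the per-document terms for that field.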
Before implementing it, though, just check the Analysis screen of the Solr (4+) Admin WebUI, as it has a section that shows the terms/tokens a particular field actually generates.
If these are not quite the keywords that you are trying to produce, you may need to have a separate field that generates those keywords, possibly by using UpdateRequestProcessor in the indexing pipeline.
Finally, if you are trying to get a feel for the documents via some sort of clustering, you may want to look at Carrot2, which already does this and integrates with Solr.
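If you go down that route, Solr ships a clustering contrib that wraps Carrot2; a minimal sketch of wiring it up in solrconfig.xml (the mapped title/content fields are assumptions):

<searchComponent name="clustering" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">lingo</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <str name="clustering.engine">lingo</str>
    <bool name="clustering.results">true</bool>
    <!-- tell Carrot2 which of your fields (hypothetical names) to cluster on -->
    <str name="carrot.title">title</str>
    <str name="carrot.snippet">content</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>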
What you are asking for is known as a "topic model". Solr does not have out-of-the-box support for this, but there are other tools that you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such example.
Apache UIMA (Unstructured Information Management Architecture). I won't bother describing it here; instead, here is a brilliant presentation.

Multi-lingual Solr setup

I have a number of documents quite evenly distributed among a number of languages (6 at the moment, perhaps 12 in the near future). There would be no need to guess the language of a document, as that information is available.
Furthermore, the use-cases for search are such that one search will always be in one language and search only for documents in that language.
Now, I want to apply proper language handling, such as stemming, to both the index and queries. What would be the suggested way to go? From my still-limited Solr knowledge, I can imagine:
Just use one core per language. This keeps the indexes small, queries match the language via the core URL, and the configuration stays simple. However, it duplicates lots of the configuration across cores.
Use one core and apply something like Solr: DIH for multilingual index & multiValued field?. The search for a specific language would then be via a field such as title_de:sehen.
I'm sure one core per language is the best solution.
You can share all configuration except schema.xml between cores (using a single conf folder) and specify the schema.xml location per core (check http://wiki.apache.org/solr/CoreAdmin).
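With the legacy solr.xml format described on that CoreAdmin wiki page, the sharing looks roughly like this (core names and paths are hypothetical):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- both cores point at one shared instanceDir/conf; only schema and data differ -->
    <core name="docs_en" instanceDir="shared" schema="schema_en.xml" dataDir="data_en"/>
    <core name="docs_de" instanceDir="shared" schema="schema_de.xml" dataDir="data_de"/>
  </cores>
</solr>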
I went with a single core instead; the duplication of configuration across cores was daunting. A bit of Java magic, and it works perfectly.

Solr indexing approach

I have a scenario where I have to build a multilingual index, specifically for two scripts that are totally different (Hindi and English), so their stemmers and lemmatizers don't affect each other. The index will be huge, containing millions of documents.
Which of the following three approaches should I use for indexing?
Single field for both languages: (a) as the scripts are different, I can use both analyzers on the same field; (b) faster searching, because the number of fields stays small; (c) I will need to take care of relevancy issues. (See the sketch after this list.)
Language-specific fields: (a) possibly slower searching because of the many fields.
Multicore approach: (a) problems handling multilingual documents; (b) administration will be harder; (c) language-specific search will be easy.
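For option 1, because Devanagari and Latin tokens never overlap, a single analysis chain can stack the Hindi filters onto an English one; each filter simply leaves the other script's tokens untouched. A hedged schema.xml sketch (the type name is made up; the filter factories are standard Lucene/Solr ones):

<fieldType name="text_hi_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the Hindi filters only affect Devanagari tokens -->
    <filter class="solr.IndicNormalizationFilterFactory"/>
    <filter class="solr.HindiNormalizationFilterFactory"/>
    <filter class="solr.HindiStemFilterFactory"/>
    <!-- the English stemmer's suffix rules do not match Devanagari tokens -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>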
I suggest separate cores. IMHO, it's simply the right way to go.
You don't have to use Solr's automatic language recognition, since you define analyzers (lemmatizers/stemmers) for each core/language separately.
The only drawback is boilerplate config elements (most settings are the same for both cores).
See this recent, similar post:
Applying Language Specific Analyzer Dynamically before Solr Indexing
Please read this: Apache Solr multilanguage search; it should help.
If I were you, I would go with option 2 (I'm using that option myself).
