I am exploring options for multi-language search with Azure Cognitive Search. In the index below, the Name field offers various language-specific options for its Analyzer, but a field supports only one language.
Is there a way to have multi-language support in an index?
This article should help. In summary, you'll need multiple fields, one for each language. The article discusses options for how to structure queries over those fields.
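For illustration, here is a minimal sketch of that multi-field approach using the Azure Cognitive Search REST API from Python. The service name, index name, key, and the name_en / name_fr fields are all placeholders; the analyzers shown (en.microsoft, fr.microsoft) are the built-in Microsoft language analyzers.

```python
import requests

# Placeholder service endpoint, index name and key -- substitute your own.
SERVICE = "https://my-service.search.windows.net"
INDEX = "products"
HEADERS = {"Content-Type": "application/json", "api-key": "<admin-key>"}

# One searchable field per language, each with its own language analyzer.
index_definition = {
    "name": INDEX,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "name_en", "type": "Edm.String", "searchable": True, "analyzer": "en.microsoft"},
        {"name": "name_fr", "type": "Edm.String", "searchable": True, "analyzer": "fr.microsoft"},
    ],
}
requests.put(f"{SERVICE}/indexes/{INDEX}?api-version=2020-06-30",
             headers=HEADERS, json=index_definition)

# At query time, point the search at the field(s) matching the user's language.
query = {"search": "chaussures", "searchFields": "name_fr"}
resp = requests.post(f"{SERVICE}/indexes/{INDEX}/docs/search?api-version=2020-06-30",
                     headers=HEADERS, json=query)
print(resp.json())
```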
Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents by the keywords inside them. Or do I have to work out the key words myself and save them in a field?
Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
Before implementing it, though, try the Analysis screen of the Solr (4+) Admin WebUI; it has a section that shows the terms/tokens a particular field actually generates.
If those are not quite the keywords that you are trying to produce, you may need a separate field that generates those keywords, possibly by using an UpdateRequestProcessor in the indexing pipeline.
Finally, if you are trying to get a feel for the documents in order to do some sort of clustering, you may want to look at Carrot2, which already does this and integrates with Solr.
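As a quick illustration of the Term Vector Component, here is a sketch of a query against the /tvrh handler that the Solr sample configs define (the core name and content field are placeholders, and the field must be indexed with termVectors="true"):

```python
import requests

SOLR = "http://localhost:8983/solr/mycore"  # placeholder core name

params = {
    "q": 'author:"Joe Blogs"',  # whatever selects the documents of interest
    "fl": "id",
    "tv": "true",        # enable the TermVectorComponent
    "tv.fl": "content",  # field(s) to return term vectors for
    "tv.tf": "true",     # include term frequencies
    "wt": "json",
}
resp = requests.get(f"{SOLR}/tvrh", params=params)
print(resp.json()["termVectors"])
```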
What you are asking for is known as a "topic model". Solr does not have out-of-the-box support for this, but there are other tools that you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such example.
Apache UIMA (Unstructured Information Management Applications). I won't try to describe it here; instead, here is a brilliant presentation.
I would like to use NLP while indexing data with Apache Solr:
Identify synonyms of words and index those as well.
Identify named entities and label them while indexing.
When someone queries the Solr index, I should be able to extract the named entities and the intent from the query and build the query string, so that it can effectively search the indexed documents.
Are there any tools / plugins available to satisfy these requirements? I believe it is a common use case for most content-based websites. How do people handle it?
Here's a tutorial on using Stanford NER with SOLR.
Check out Apache UIMA
Specifically, if you need Solr to do named entity recognition, you can integrate it with UIMA using SolrUIMA.
Check out this talk, which demonstrates UIMA + Solr.
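If you just want a feel for the index-time enrichment these answers describe, here is a rough sketch that runs NER before sending documents to Solr. It uses spaCy purely as a stand-in for Stanford NER or UIMA, and the core name and the person_ss / org_ss fields are hypothetical:

```python
import requests
import spacy  # stand-in NER engine; swap in Stanford NER or a UIMA pipeline

nlp = spacy.load("en_core_web_sm")
SOLR = "http://localhost:8983/solr/mycore"  # placeholder core name

def index_with_entities(doc_id, text):
    # Extract entities at index time and store them in separate labelled fields
    # (person_ss / org_ss are hypothetical multi-valued string fields).
    ents = nlp(text).ents
    doc = {
        "id": doc_id,
        "content": text,
        "person_ss": [e.text for e in ents if e.label_ == "PERSON"],
        "org_ss": [e.text for e in ents if e.label_ == "ORG"],
    }
    requests.post(f"{SOLR}/update?commit=true", json=[doc])

index_with_entities("doc1", "Joe Blogs joined Acme Corp last week.")
```

The same entity fields can then be targeted explicitly at query time, once you extract entities from the user's query with the same NER step.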
I'm wondering if it's possible to use Solr to query more than one index and combine the results.
The concrete problem is a web site based on various PDFs & DOCs as well as Notes documents. The Notes documents are user-restricted and should not appear in search results unless the user is authorised to view the document.
I think the simple docs could be searched for using Solr and Lucene and the Notes documents using Notes search.
Is there a way to extend Solr to search multiple indexes and merge the results?
I don't think that's possible. It sounds like that logic should be in the application layer. One approach to consider would be to have a field in the schema that indicates the type of document (like notes) or the access level (public, private); then you could exclude restricted documents from the search results:
q=search+keywords&fq=-DocType:notes
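For example, a sketch of that query from Python (the core name is a placeholder, and DocType is assumed to be the schema field described above):

```python
import requests

SOLR = "http://localhost:8983/solr/mycore"  # placeholder core name

def search(keywords, can_see_notes=False):
    params = {"q": keywords, "wt": "json"}
    if not can_see_notes:
        # Exclude the restricted Notes documents with a negative filter query.
        params["fq"] = "-DocType:notes"
    return requests.get(f"{SOLR}/select", params=params).json()

print(search("search keywords")["response"]["numFound"])
```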
I want to index multilingual data. I can identify the language of any field with Solr's Language Detection. Now how can I apply a language-specific analyzer dynamically to that field? I do not want to create language-specific fields (like content_en, content_hi, etc.); I want to apply the language-specific analyzer to the same field at run time.
I am new to search technology. Can anyone help me out?
Regards,
Sagar Majumder
My suggestion is to use a separate Solr core for each language.
That's what I did, and it was a very elegant and practical solution.
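A rough sketch of how that can look, assuming hypothetical docs_en / docs_hi cores, the langdetect library standing in for Solr's own language detection, and standalone distributed search via the shards parameter:

```python
import requests
from langdetect import detect  # stand-in for Solr's built-in language detection

SOLR = "http://localhost:8983/solr"
CORES = {"en": "docs_en", "hi": "docs_hi"}  # one core per language (hypothetical names)

def index_doc(doc):
    lang = detect(doc["content"])        # e.g. 'en' or 'hi'
    core = CORES.get(lang, "docs_en")    # route the document to the matching core
    requests.post(f"{SOLR}/{core}/update?commit=true", json=[doc])

def search_all_languages(query):
    # Fan the query out over all language cores with the shards parameter.
    shards = ",".join(f"localhost:8983/solr/{c}" for c in CORES.values())
    params = {"q": query, "shards": shards, "wt": "json"}
    return requests.get(f"{SOLR}/{CORES['en']}/select", params=params).json()

index_doc({"id": "1", "content": "This is an English document."})
print(search_all_languages("document")["response"]["numFound"])
```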
I have a scenario where I have to build a multilingual index, specifically for two scripts that are totally different (Hindi and English), so their stemmers and lemmatizers don't affect each other. My index will be huge, containing millions of documents.
Which of the following three approaches should I use for indexing?
Single field for both languages: a) as the scripts are different, I can use both analyzers on it; b) faster searching because the number of fields is limited; c) relevancy issues will need to be handled.
Language-specific fields: a) possibly slower searching because of the many fields.
Multi-core approach: a) handling multilingual documents is a problem; b) administration will be hard; c) language-specific search will be easy.
I suggest separate cores. IMHO, it's simply the right way to go.
You don't have to use Solr's automatic language recognition, since you define analyzers (lemmatizers/stemmers) for each core/language separately.
The only drawback is boilerplate config elements (most settings are the same for both cores).
See this recent, similar post:
Applying Language Specific Analyzer Dynamically before Solr Indexing
Please read this: Apache Solr multilanguage search; that should help.
If I were you, I would go with option 2 (I'm using that option myself).
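With option 2, a query can target both language fields at once, for example with edismax. A minimal sketch, assuming hypothetical content_en / content_hi fields and a placeholder core name:

```python
import requests

SOLR = "http://localhost:8983/solr/docs"  # placeholder core name

params = {
    "q": "शिक्षा OR education",
    "defType": "edismax",
    "qf": "content_en content_hi",  # each field keeps its own analyzer
    "wt": "json",
}
resp = requests.get(f"{SOLR}/select", params=params)
print(resp.json()["response"]["numFound"])
```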