Solr indexing approach

I have a scenario where I have to build a multilingual index, specifically for two totally different scripts (Hindi and English), so their stemmers and lemmatizers don't affect each other. The index will be huge, containing millions of documents.
Which of the following three approaches should I use for indexing?
1. Single field for both languages: a) as the scripts are different, I can use both analyzers on it; b) faster searching, because the number of fields is limited; c) relevancy issues will need to be taken care of.
2. Language-specific fields: a) possibly slower searching because of the many fields.
3. Multi-core approach: a) handling multilingual documents is a problem; b) administration will be hard; c) language-specific search will be easy.

I suggest separate cores. IMHO, it's simply the right way to go.
You don't have to use Solr's automatic language recognition, since you define analyzers (lemmatizers/stemmers) for each core/language separately.
The only drawback is boilerplate config elements (most settings are the same for both cores).
See this recent, similar post:
Applying Language Specific Analyzer Dynamically before Solr Indexing
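To illustrate why the two scripts don't interfere, here is a minimal sketch using the HindiAnalyzer and EnglishAnalyzer that ship with Lucene's analyzers-common module (the analyzers a per-core schema would typically select); the field name and sample strings are made up:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.hi.HindiAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ScriptTokens {
        // Print the tokens an analyzer produces for the given text.
        static void dump(Analyzer analyzer, String text) throws Exception {
            try (TokenStream ts = analyzer.tokenStream("body", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print(term + " ");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws Exception {
            dump(new EnglishAnalyzer(), "running searches");  // English stemming
            dump(new HindiAnalyzer(), "खोज रहा है");           // Hindi normalization/stemming
        }
    }

Each core then simply points its text fields at the analyzer for its own language.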

Please read this: Apache Solr multilanguage search; it should help.
If I were you, I would go with option 2 (I'm using that option myself).

Related

Is it possible to get the keywords in the documents returned by Solr

Solr provides an easy way to search documents based on keywords, but I was wondering if it had the ability to return the keywords themselves?
For example, I may want to search for all documents created by Joe Blogs last week and then get a feel for the contents of those documents from the keywords inside them. Or do I have to work out the keywords myself and save them in a field?
Assuming by keywords you mean the tokens that Solr generates when parsing a particular field, you may want to review the documentation and examples for the Term Vector Component.
Before implementing it, though, try checking the Analysis screen of the Solr (4+) Admin WebUI, as it has a section that shows the terms/tokens a particular field actually generates.
If those are not quite the keywords you are trying to produce, you may need a separate field that generates them, possibly by using an UpdateRequestProcessor in the indexing pipeline.
Finally, if you are trying to get a feel for the content via some sort of clustering, you may want to look at Carrot2, which already does this and integrates with Solr.
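If you go the Term Vector Component route, a hedged SolrJ sketch of the request looks like this (it assumes a /tvrh request handler is configured and a termVectors="true" field called "content"; both names are illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class TermVectorLookup {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
            SolrQuery query = new SolrQuery("author:\"Joe Blogs\"");
            query.setRequestHandler("/tvrh"); // Term Vector Component handler
            query.set("tv.fl", "content");    // field(s) to return vectors for
            query.set("tv.tf", true);         // include term frequencies
            // The component adds a "termVectors" section to the response.
            System.out.println(solr.query(query).getResponse().get("termVectors"));
            solr.close();
        }
    }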
What you are asking for is known as a "topic model". Solr does not have out-of-the-box support for this, but there are other tools you can integrate to achieve it.
Apache Mahout supports the LDA algorithm, which can be used to model topics. There are several examples of integrating Solr with Mahout; here is one such example.
Apache UIMA (Unstructured Information Management Architecture). I won't bother typing about it; instead, here is a brilliant presentation.

Multi-lingual Solr setup

I have a number of documents quite evenly distributed among a number of languages (6 at the moment, perhaps 12 in the near future). There would be no need to guess the language of a document, as that information is available.
Furthermore, the use-cases for search are such that one search will always be in one language and search only for documents in that language.
Now, I want to apply proper language handling such as stemming to both the index and the queries. What would be the suggested way to go? From my as yet limited Solr knowledge, I can imagine:
Just use one core per language. Keeps the indexes small, the queries match the language by core URL and the configuration is simple. However, it duplicates lots of the configuration.
Use one core and apply something like Solr: DIH for multilingual index & multiValued field?. The search for a specific language would then be via a field such as title_de:sehen
I'm sure one core per language is the best solution.
You can share all configuration except schema.xml between cores (using a single conf folder) and specify the schema.xml location per core (check http://wiki.apache.org/solr/CoreAdmin).
I went with a single core instead, as the duplication of configuration was daunting. Now it's all in one core: a bit of Java magic, and it works perfectly.
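For reference, a minimal SolrJ sketch of what the single-core, field-per-language setup looks like at query time (title_de:sehen comes from the question; the core name and the lang filter field are hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class LanguageFieldSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
            String lang = "de";                   // known up front, never guessed
            SolrQuery query = new SolrQuery("title_" + lang + ":sehen");
            query.addFilterQuery("lang:" + lang); // keep results in one language
            System.out.println(solr.query(query).getResults().getNumFound());
            solr.close();
        }
    }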

Manipulate Solr index with Lucene

I have a Solr core with 100K-1000K documents.
I have a scenario where I need to add or set a field value on most documents.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such a task with the Lucene library and access the Solr index directly (with less overhead).
If needed, I can shutdown the core, run my code and reload the core afterwards (hoping it will take less time than doing it with Solr).
It would be great to hear whether someone has already done such a thing and what the major pitfalls along the way are.
A similar problem has been discussed multiple times on the Lucene Java mailing list. The underlying problem is that you cannot update a document in place in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds the overhead of analysis, index segment merging, etc. Still, the specified number of documents isn't anything major and should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
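A minimal sketch of that Lucene route, under two explicit assumptions: every field you need is stored (retrieved documents contain stored values only, and those fields lose their index-time types, so each document has to be rebuilt field by field), and the analyzer matches the Solr schema. The field names ("id", "body", "status") are hypothetical, and the live-docs check for deleted documents is skipped for brevity:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class BulkFieldUpdate {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("/path/to/core/data/index"));
            Analyzer analyzer = new StandardAnalyzer(); // must match the Solr schema
            try (IndexReader reader = DirectoryReader.open(dir); // point-in-time snapshot
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (int i = 0; i < reader.maxDoc(); i++) {
                    Document stored = reader.document(i); // stored fields only
                    // Rebuild the document: retrieved fields do not keep their
                    // index-time types, so re-create each one explicitly.
                    Document doc = new Document();
                    doc.add(new StringField("id", stored.get("id"), Field.Store.YES));
                    doc.add(new TextField("body", stored.get("body"), Field.Store.YES));
                    doc.add(new StringField("status", "archived", Field.Store.YES)); // the new value
                    // No in-place update exists: this deletes by term and re-adds.
                    writer.updateDocument(new Term("id", doc.get("id")), doc);
                }
                writer.commit();
            }
        }
    }

After the rewrite, reload the core so Solr picks up the new segments.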
I have a scenario where I need to add or set a field value on most documents.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.

Is it advisable to use Lucene for this?

I have a huge XML file, about 2GB in size, containing resumes. There are thousands of resumes in this file, all tagged properly. Right now I am using XPath to query it. Would it be advisable to use Lucene instead of XPath?
It depends on what your requirements are. If you need full-text searching and all the other great features of a full-blown search engine, Lucene is the way to go. I would recommend Solr, which builds on top of Lucene and provides a much better API and abstraction.
Like everything else technology related, it depends.
What Lucene gives you that you're not getting with XPath is the power of a full-text engine that supports, among other things, ranking, phrase queries, wildcard queries, etc.
Based on your use-case, I would say that a full-text search engine makes sense. That's not to say that vanilla Lucene is the best way to go (there are, for example, other alternatives that build on Lucene).
2GB seems small enough that I would construct my own (minimal) inverted index :) That said, there is no problem in using Lucene/Solr; go ahead, it will help you once your records start doubling. At this scale (2GB), or even much larger, many real-life systems run on database full-text search using the SQL LIKE keyword.
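For a sense of what the Lucene route involves, here is a self-contained sketch of the index-then-search flow (the field names and sample resume text are made up; for 2GB of data you would use FSDirectory rather than an in-memory directory):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class ResumeSearch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // in-memory, for the demo
            StandardAnalyzer analyzer = new StandardAnalyzer();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                // In practice: one document per <resume> element parsed from the XML.
                Document doc = new Document();
                doc.add(new StringField("id", "r-1", Field.Store.YES));
                doc.add(new TextField("body", "Java developer, five years of Lucene experience", Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query q = new QueryParser("body", analyzer).parse("lucene AND developer");
                for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("id"));
                }
            }
        }
    }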

Single or multi-core Solr

We are planning to deploy Solr for searching multiple sites published from common CMS platform.
There will be a separate site per language, where the other languages will mostly have content translated from English.
The search requirements include keyword highlighting, suggestions ("did you mean?"), stopwords, and faceting.
We are evaluating using single core vs per-language multi-core Solr option. What is the recommended approach here?
You need multicore because you cannot apply stemming and stopwords correctly to a multilingual index.
Common stopwords in English are "by" and "is" but these words mean "town" and "ice" in many Nordic languages.
If you do multicore, each language can be on its own core with a customized schema.xml that selects the right stemmer, stopwords and protected words. But the same JVM is running it all on the same server, so you are not spending any extra money for servers for one specific language. Then, if the load is too great for one server, you replicate your multicore setup and all of the indexes benefit from the replicas.
You should use the multicore approach.
When you want to query multiple cores at once, you can use the shards parameter:
http://wiki.apache.org/solr/DistributedSearch
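A hedged SolrJ sketch of such a cross-core query (the host and core names are hypothetical):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CrossLanguageSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/core_en").build();
            SolrQuery query = new SolrQuery("title:ice");
            // Fan the query out over both language cores in one request.
            query.set("shards", "localhost:8983/solr/core_en,localhost:8983/solr/core_sv");
            QueryResponse response = solr.query(query);
            System.out.println(response.getResults().getNumFound());
            solr.close();
        }
    }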
