Multi-lingual Solr setup - solr

I have a number of documents quite evenly distributed among a number of languages (6 at the moment, perhaps 12 in the near future). There would be no need to guess the language of a document, as that information is available.
Furthermore, the use-cases for search are such that one search will always be in one language and search only for documents in that language.
Now, I want to apply proper language handling such as stemming to both index and queries. What would be the suggested way to go? From my yet limited Solr knowledge, I can imagine:
Just use one core per language. Keeps the indexes small, the queries match the language by core URL and the configuration is simple. However, it duplicates lots of the configuration.
Use one core and apply something like Solr: DIH for multilingual index & multiValued field?. The search for a specific language would than be via a field such as title_de:sehen

I sure one core per language is the best solution.
You can share all configuration except schema.xml between cores (using single conf folder) and specify schema.xml location per core (check http://wiki.apache.org/solr/CoreAdmin)

I went with a single core instead. The duplication of configuration was daunting. Now it’s all in a single core. A bit of Java magic, and it works perfectly.

Related

Manipulate Solr index with lucene

I have a solr core with 100K-1000k documents.
I have a scenario where I need to add or set a field value on most document.
Doing it through Solr takes too much time.
I was wondering if there is a way to do such task with Lucene library and access the Solr index directly (with less overhead).
If needed, I can shutdown the core, run my code and reload the core afterwards (hoping it will take less time than doing it with Solr).
It will be great to hear if someone already done such a thing and what are the major pitfalls in the way.
Similar problem has been discussed multiple times in Lucene Java mailing list. The underlying problem is that you can not update document in Lucene (and hence Solr).
Instead, you need to delete the document and insert a new one. This obviously adds overhead of analyzing, merging index segments, etc. Yet, the specified amount of documents isn't something major and should not take days (have you tried updating Solr with multiple threads?).
You can of course try doing this via Lucene and see if this makes any difference, but you need to be absolutely sure you will be using the same analyzers as Solr does.
I have a scenario where I need to add or set a field value on most document.
If you have to do it often, maybe you need to look at things like ExternalFileField. There are limitations, but it may be better than hacking around Solr's infrastructure by going directly to Lucene.

Searching over documents stored in Hadoop - which tool to use?

I'm lost in: Hadoop, Hbase, Lucene, Carrot2, Cloudera, Tika, ZooKeeper, Solr, Katta, Cascading, POI...
When you read about the one you can be often sure that each of the others tools is going to be mentioned.
I don't expect you to explain every tool to me - sure not. If you could help me to narrow this set for my particular scenario it would be great. So far I'm not sure which of the above will fit and it looks like (as always) there are more then one way of doing what's to be done.
The scenario is: 500GB - ~20 TB of documents stored in Hadoop. Text documents in multiple formats: email, doc, pdf, odt. Metadata about those documents stored in SQL db (sender, recipients, date, department etc.) Main source of documents will be ExchangeServer (emails and attachments), but not only. Now to the search: User needs to be able to do complex full-text searches over those documents. Basicaly he'll be presented with some search-config panel (java desktop application, not webapp) - he'll set date range, document types, senders/recipients, keywords etc. - fire the search and get the resulting list of the documents (and for each document info why its included in search results i.e. which keywords are found in document).
Which tools I should take into consideration and which not? The point is to develop such solution with only minimal required "glue"-code. I'm proficient in SQLdbs but quite uncomfortable with Apache-and-related technologies.
Basic workflow looks like this: ExchangeServer/other source -> conversion from doc/pdf/... -> deduplication -> Hadopp + SQL (metadata) -> build/update an index <- search through the docs (and do it fast) -> present search results
Thank you!
Going with solr is a good option. I have used it for similar scenario you described above. You can use solr for real huge data as its a distributed index server.
But to get the meta data about all of these documents formats you should be using some other tool. Basically your workflow will be this.
1) Use hadoop cluster to store data.
2) Extract data in hadoop cluster using map/redcue
3) Do document identification( identify document type)
4) Extract meta data from these document.
5) Index metadata in solr server, store other ingestion information in database
6) Solr server is distributed index server, so for each ingestion you could create a new shard or index.
7) When search is required search on all the indexs.
8) Solr supports all the complex searches , so you don't have to make your own search engine.
9) It also does paging for you as well.
We've done exactly this for some of our clients by using Solr as a "secondary indexer" to HBase. Updates to HBase are sent to Solr, and you can query against it. Typically folks start with HBase, and then graft search on. Sounds like you know from the get go that search is what you want, so you can probably embed the secondary indexing in from your pipeline that feeds HBase.
You may find though that just using Solr does everything you need.
Another project to look at is Lily, http://www.lilyproject.org/lily/index.html, which has already done the work of integrating Solr with a distributed database.
Also, I do not see why you would not want to use a browser for this application. You are describing exactly what faceted search is. While you certainly could set up a desktop app that communicates with the server (parses JSON) and displays the results in a thick client GUI, all of this work is already done for you in the browser. And, Solr comes with a free faceted search system out of the box: just follow along the tutorial.
Going with Solr (http://lucene.apache.org/solr) is a good solution, but be ready to have to deal with some non-obvious things. First is planning your indexes properly. Multiple terabytes of data will almost definitely need multiple shards on Solr for any level of reasonable performance and you'll be in charge of managing those yourself. It does provide distributed search (doing the queries off multiple shards), but that is only half the battle.
ElasticSearch (http://www.elasticsearch.org/) is another popular alternative, but i don't have much experience with it regarding scale. It uses the same Lucene engine so i'd expect the search feature-set to be similar.
Another type of solution is something like SenseiDB - open sourced from LinkedIn - which gives the full-text search functionality (also Lucene-based) as well as proven scale for large amounts of data:
http://senseidb.com
They've definitely done a lot of work on search over there and my casual use of it is pretty promising.
Assuming all your data is already in Hadoop, you could write some custom MR jobs that pull the data in a consistent schema-friendly format into SenseiDB. SenseiDB already provides a Hadoop MR indexer which you can look at.
The only caveat is it is a little more complex to setup, but will save you with the scaling issues many times over - especially around indexing performance and faceting functionality. It also provides clustering support if HA is important to you - which is still in Alpha for Solr (Solr 4.x is alpha atm).
Hope that helps and good luck!
Update:
I asked a friend who is more versed in ElasticSearch than me and it does have the advantage of clustering and rebalancing based on the # of machines and shards you have. This is a definite win over Solr - especially if you're dealing with TBs of data. The only downside is the current state of documentation on ElasticSearch leaves a lot to be desired.
As a side note, you can't say the documents are stored in Hadoop, they are stored in a distributed file system (most probably HDFS since you mentioned Hadoop).
Regarding searching/indexing: Lucene is the tool to use for your scenario. You can use it for both indexing and searching. It's a java library. There is also an associated project (called Solr) which allows you to access the indexing/searching system through WebServices. So you should also take a look at Solr as it allows the handling of different types of documents (Lucene puts the responsability of interpreting the document (PDF, Word, etc) on your shoulders but you, probably, can already do that)

Multiple index locations Solr

I am new to Solr, and am trying to figure out the best way to index and search our catalogs.
We have to index multiple manufactures and each manufacturer has a different catalog per country. Each catalog for each manufacture per country is about 8GB of data.
I was thinking it might be easier to have an index per manufacture per country and have some way to tell Solr in the URL which index to search from.
Is that the best way of doing this? If so, how would I do it? Where should I start looking? If not, what would be the best way?
I am using Solr 3.5
In general there are two ways of solving this:
Split each catalog into its own core, running a large multi core setup. This will keep each index physically separated from each other, and will allow you to use different properties (language, etc) and configuration for each core. This might be practical, but will require quite a bit of overhead if you plan on searching through all the core at the same time. It'll be easy to split the different cores into running on different servers later - simply spin the cores up on a different server.
Run everything in a single core - if all the attributes and properties of the different catalogs are the same, add two fields - one containing the manufacturer and one containing the country. Filter on these values when you need to limit the hits to a particular country or manufacturer. It'll allow you to easily search the complete index, and scalability can be implemented by replication or something like SolrCloud (coming in 4.0). If you need multilanguage support you'll have to have a field for each language with the settings you need for that language (such as stemming).
There are a few tidbits of information about this on the Solr wiki, but my suggestion is to simply try one of the methods and see if that solves your issue. Moving to the other solution shouldn't be too much work. The simplest implementation is to keep everything in the same index.

Solr indexing approach

I'm having a scenario where i have to build multilingual index. specially for two scripts , these two scripts are totally different (Hindi and English). so their stemmers and lemmatisers dont affect each other. My indexing will be huge containing millions of documents.
from follwing 3 which approach do i use for indexing?? :
Single field for two languages.
advantage - a) as scripts are different i can use both analysers on it. b) faster searching because fields will be limited. c) will need to take care of relevancy issue.
Language specific fields : a) possibly slower searching because of many fields.
multicore approach : a) problem in handling multilingual docs. b) administration will be hard. c) language specific search will be easy.
I suggest separate cores. IMHO, it's simply the right way to go.
You don't have to use Solr's automatic language recognition, since you define analyzers (lemmatizers/stemmers) for each core/language separately.
The only drawback is boilerplate config elements (most settings are the same for both cores).
See this recent, similar post:
Applying Language Specific Analyzer Dynamically before Solr Indexing
Please read that: Apache Solr multilanguage search, that should help.
If a ware you, I would go with option 2 (I'm using that option).

Single or multi-core Solr

We are planning to deploy Solr for searching multiple sites published from common CMS platform.
There will be separate sites per language where other languages will mostly have content translated from English.
The search requirements include – keyword highlighting, suggestions (“did you mean?”), stopwords, faceting.
We are evaluating using single core vs per-language multi-core Solr option. What is the recommended approach here?
You need multicore because you cannot do stemming and stopwords on a multilingual database.
Common stopwords in English are "by" and "is" but these words mean "town" and "ice" in many Nordic languages.
If you do multicore, each language can be on its own core with a customized schema.xml that selects the right stemmer, stopwords and protected words. But the same JVM is running it all on the same server, so you are not spending any extra money for servers for one specific language. Then, if the load is too great for one server, you replicate your multicore setup and all of the indexes benefit from the replicas.
You should use the multicore approach.
When you want to query multiple cores at once you can use the shards parameter
http://wiki.apache.org/solr/DistributedSearch

Resources