We are planning to deploy Solr for searching multiple sites published from a common CMS platform.
There will be a separate site per language; the non-English sites will mostly carry content translated from English.
The search requirements include keyword highlighting, suggestions ("did you mean?"), stopwords, and faceting.
We are evaluating a single-core setup versus a per-language multi-core setup. What is the recommended approach here?
You need multicore because you cannot apply a single stemmer and stopword list to a multilingual index.
Common stopwords in English are "by" and "is", but these words mean "town" and "ice" in many Nordic languages.
If you do multicore, each language can live in its own core with a customized schema.xml that selects the right stemmer, stopwords and protected words. All of the cores still run in the same JVM on the same server, so you are not spending extra money on servers for any one language. Then, if the load becomes too great for one server, you replicate your multicore setup and all of the indexes benefit from the replicas.
You should use the multicore approach.
When you want to query multiple cores at once, you can use the shards parameter:
http://wiki.apache.org/solr/DistributedSearch
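For illustration, here is a minimal SolrJ sketch of such a cross-core query; the core names core_en and core_de, the host, and the content/category fields are assumptions, and newer SolrJ (5.x) uses HttpSolrClient where older versions use HttpSolrServer:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CrossCoreSearch {
        public static void main(String[] args) throws Exception {
            // The request is sent to one core, but fanned out to every core listed in "shards".
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core_en");
            SolrQuery query = new SolrQuery("sample query");
            query.set("shards", "localhost:8983/solr/core_en,localhost:8983/solr/core_de");
            query.setHighlight(true).addHighlightField("content"); // keyword highlighting (field name assumed)
            query.addFacetField("category");                       // faceting (field name assumed)
            QueryResponse response = client.query(query);
            System.out.println("Hits across cores: " + response.getResults().getNumFound());
            client.close();
        }
    }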
Related
Hello, I already have a working application for searching a database. The database holds around 50M indexed documents. Is there a way to run everything together in a single process, i.e. without exposing Solr over HTTP? What should I do? Is it better to use Lucene or EmbeddedSolrServer? Or maybe you have another solution?
I already have something like the first diagram, and I want to make it run in a single process.
And if I go with Lucene, can I reuse my existing Solr indexes?
solr-5.2.1
Tomcat v8.0
It is not recommended to deploy both the application and Solr in the same Tomcat instance.
If Solr crashes, there is a chance of downtime for the application as well, so it is always better to run Solr independently. Embedding Solr is also not recommended.
The simplest, safest, way to use Solr is via Solr's standard HTTP interfaces. Embedding Solr is less flexible, harder to support, not as well tested, and should be reserved for special circumstances.
For reference: http://wiki.apache.org/solr/EmbeddedSolr
It depends. Solr adds quite a few features on top of Lucene; if you drop down to plain Lucene, you'll end up reimplementing features that you would otherwise get for free.
You can use EmbeddedSolr to have Solr internal to your application, and then use the EmbeddedSolrServer client in SolrJ to talk to it - the rest of your application would still use Solr as if it were a remote instance.
The problem with EmbeddedSolr is that you'll run into scalability issues as the index size grows, since you'll have a harder time scaling out to multiple servers and separating concerns.
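For reference, a minimal sketch of the embedded option with a Solr 5.x-style SolrJ API; the solr home path and core name are assumptions, and the rest of the application talks to the embedded instance exactly as it would to a remote one:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.core.CoreContainer;

    public class EmbeddedExample {
        public static void main(String[] args) throws Exception {
            // Load the cores defined under the given solr home (path is hypothetical).
            CoreContainer container = new CoreContainer("/path/to/solr");
            container.load();
            // Same SolrClient interface as HttpSolrClient, but in-process, no HTTP.
            EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1");
            QueryResponse response = server.query(new SolrQuery("*:*"));
            System.out.println("Documents: " + response.getResults().getNumFound());
            server.close();
        }
    }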
I am using solr version 3.0.1, and I am about to change to solr 4.6.0.
Usually I just use Solr without defining a core (I think Solr 3.0.1 doesn't have cores yet).
Now I want to upgrade my Solr to version 4.6.0, and there are some new concepts in it.
So I have 3 questions:
What exactly is a Solr core?
When should I use a Solr core?
Is it right that each Solr core is like a table in a (relational) database? That is, can I save different types of data in different cores?
Thanks in advance.
A core is basically an index with a given schema and will hold a set of documents.
You should use different cores for different collections of documents, but that doesn't mean you have to store every different kind of document in a different index.
Some examples:
you could have the same documents in different languages stored in different cores and select the core based on the configured language;
you could have different types of documents stored in different cores to keep them physically separate;
but at the same time you could have different kinds of documents stored in the same index and differentiate them by a field value;
it really depends on your use-case.
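As a small illustration of the last point, a SolrJ sketch that keeps two kinds of documents in one core and filters on a discriminator field; the field names (doc_type, title) and the core URL are assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class SingleCoreMixedTypes {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");

            // Two different kinds of documents in the same index, told apart by "doc_type".
            SolrInputDocument article = new SolrInputDocument();
            article.addField("id", "a-1");
            article.addField("doc_type", "article");
            article.addField("title", "An article");
            SolrInputDocument product = new SolrInputDocument();
            product.addField("id", "p-1");
            product.addField("doc_type", "product");
            product.addField("title", "A product");
            client.add(article);
            client.add(product);
            client.commit();

            // At query time, restrict the search to one kind with a filter query.
            SolrQuery query = new SolrQuery("title:article");
            query.addFilterQuery("doc_type:article");
            System.out.println(client.query(query).getResults().getNumFound());
            client.close();
        }
    }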
You have to think up front about what types of queries you are going to execute against your Solr index. You then lay down the schema of a core, or of several cores, accordingly.
If, for example, you execute JOIN queries on your relational DB, those won't be very efficient (if possible at all) with lots of documents in the Solr index, because it is the NoSQL world (read here as: non-relational). In such a case you might need to duplicate the data from several DB tables into one core's schema.
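A small sketch of that duplication (denormalization): columns that would normally come from two related DB tables are flattened into one Solr document before indexing; all table, field and core names here are made up:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DenormalizedIndexing {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/orders");

            // Instead of JOINing "orders" and "customers" at query time,
            // copy the customer columns onto every order document up front.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "order-42");          // from the orders table
            doc.addField("order_total", 99.90);      // from the orders table
            doc.addField("customer_name", "Acme");   // duplicated from the customers table
            doc.addField("customer_country", "US");  // duplicated from the customers table
            client.add(doc);
            client.commit();
            client.close();
        }
    }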
As Francisco has already mentioned, physically a core is represented as an independent entity with its own schema, config and index data.
One caution with a multi-core setup: all the cores configured under the same container instance share the same JVM. This means you should be careful with the amount of data you store in those cores. Lucene, the indexing engine inside Solr, has really neat and fast (de)compression algorithms (in the 4.x versions), so disk space lasts longer, but the JVM heap is something to watch.
The goodies of cores, coupled with the Solr admin UI, are things like:
core reload after schema / solrconfig changes
core hot swap (if you have a live core serving queries, you can hot swap it with a new core that has the same data and some modifications)
core index optimization
core renaming
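Those operations are also scriptable through the CoreAdmin API rather than the admin UI; a minimal SolrJ sketch of a core reload after a schema change, where the core name is hypothetical and older SolrJ versions take a SolrServer instead of a SolrClient:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class ReloadCore {
        public static void main(String[] args) throws Exception {
            // CoreAdmin requests go to the container root URL, not to a specific core.
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
            // Picks up edited schema.xml / solrconfig.xml without restarting the JVM.
            CoreAdminRequest.reloadCore("core_de", client);
            client.close();
        }
    }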
I have a number of documents quite evenly distributed among a number of languages (6 at the moment, perhaps 12 in the near future). There would be no need to guess the language of a document, as that information is available.
Furthermore, the use-cases for search are such that one search will always be in one language and search only for documents in that language.
Now, I want to apply proper language handling such as stemming to both the index and the queries. What would be the suggested way to go? From my still limited Solr knowledge, I can imagine:
Just use one core per language. This keeps the indexes small, queries select the language via the core URL, and each core's configuration is simple. However, it duplicates lots of the configuration.
Use one core and apply something like Solr: DIH for multilingual index & multiValued field?. The search for a specific language would then go through a field such as title_de:sehen
I'm sure one core per language is the best solution.
You can share all configuration except schema.xml between cores (using a single conf folder) and specify the schema.xml location per core (check http://wiki.apache.org/solr/CoreAdmin).
I went with a single core instead. The duplication of configuration was daunting. Now it’s all in a single core. A bit of Java magic, and it works perfectly.
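For completeness, a minimal SolrJ sketch of that single-core, field-per-language approach; it assumes the schema defines per-language fields such as title_de and title_en, and the "Java magic" is just picking the field name from the known language:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class FieldPerLanguageSearch {
        // The language code is known up front; the field naming scheme is an assumption.
        static SolrQuery buildQuery(String lang, String term) {
            return new SolrQuery("title_" + lang + ":" + term);
        }

        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/content");
            // A German search only hits the German-analyzed field.
            System.out.println(client.query(buildQuery("de", "sehen")).getResults().getNumFound());
            client.close();
        }
    }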
I have a scenario where I have to build a multilingual index for two scripts that are totally different (Hindi and English), so their stemmers and lemmatisers don't affect each other. My index will be huge, containing millions of documents.
Which of the following 3 approaches should I use for indexing?
A single field for both languages: a) as the scripts are different, I can use both analysers on it; b) faster searching because the number of fields is limited; c) I will need to take care of relevancy issues.
Language-specific fields: a) possibly slower searching because of the many fields.
Multicore approach: a) problems in handling multilingual docs; b) administration will be harder; c) language-specific search will be easy.
I suggest separate cores. IMHO, it's simply the right way to go.
You don't have to use Solr's automatic language recognition, since you define analyzers (lemmatizers/stemmers) for each core/language separately.
The only drawback is boilerplate config elements (most settings are the same for both cores).
See this recent, similar post:
Applying Language Specific Analyzer Dynamically before Solr Indexing
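To make the separate-cores suggestion concrete: since the language is always known, core selection at query time is just a URL choice. A minimal sketch, with hypothetical core names core_hi and core_en:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class CorePerLanguageSearch {
        public static void main(String[] args) throws Exception {
            String lang = "hi"; // known up front, no language detection needed
            // Each language lives in its own core with its own analyzers and stemmers.
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core_" + lang);
            System.out.println(client.query(new SolrQuery("*:*")).getResults().getNumFound());
            client.close();
        }
    }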
Please read this: Apache Solr multilanguage search; that should help.
If I were you, I would go with option 2 (I'm using that option).
Can I use a MapReduce framework to create an index and somehow add it to a distributed Solr?
I have a burst of information (logfiles and documents) that will be transported over the internet and stored in my datacenter (or Amazon). It needs to be parsed, indexed, and finally searchable by our replicated Solr installation.
Here is my proposed architecture:
Use a MapReduce framework (Cloudera, Hadoop, Nutch, even DryadLinq) to prepare those documents for indexing
Index those documents into a Lucene.NET / Lucene (java) compatible file format
Deploy that file to all my Solr instances
Activate that replicated index
If that above is possible, I need to choose a MapReduce framework. Since Cloudera is vendor supported and has a ton of patches not included in the Hadoop install, I think it may be worth looking at.
Once I choose the MapReduce framework, I need to tokenize the documents (PDF, DOCX, DOC, OLE, etc.), index them, copy the index to my Solr instances, and somehow "activate" them so they are searchable in the running instance. I believe this methodology is better than submitting documents via the REST interface to Solr.
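For concreteness, a minimal sketch of the "index into a Lucene-compatible format" step in plain Java Lucene (5.x-style API); the output path, field names and analyzer are assumptions, and they would have to match the schema of the Solr cores that eventually serve this index (the MapReduce plumbing around it is omitted):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class OfflineIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical output directory; in the proposed architecture this would be
            // produced per batch and later copied to the Solr instances.
            FSDirectory dir = FSDirectory.open(Paths.get("/tmp/offline-index"));
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                Document doc = new Document();
                doc.add(new StringField("id", "doc-1", Field.Store.YES));
                doc.add(new TextField("content", "text extracted from a PDF or DOCX", Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }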
The reason I bring .NET into the picture is that we are mostly a .NET shop. The only Unix / Java piece we will have is Solr, plus a front end that leverages the REST interface via Solrnet.
Based on your experience, how does this architecture look? Do you see any issues/problems? What advice can you give?
What should I avoid doing so that I don't lose faceted search? After reading the Nutch documentation, I believe it said that it does not do faceting, but I may not have enough background in this software to understand what it's saying.
Generally, what you've described is almost exactly how Nutch works. Nutch is a crawling, indexing, index-merging and query-answering toolkit that's based on Hadoop core.
You shouldn't mix up Cloudera, Hadoop, Nutch and Lucene; you'll most likely end up using all of them:
Nutch is the name of the indexing / query-answering (Solr-like) machinery.
Nutch itself runs on a Hadoop cluster (which heavily uses its own distributed file system, HDFS).
Nutch uses the Lucene index format.
Nutch includes a query answering frontend, which you can use, or you can attach a Solr frontend and use Lucene indexes from there.
Finally, Cloudera Hadoop Distribution (or CDH) is just a Hadoop distribution with several dozens of patches applied to it, to make it more stable and backport some useful features from development branches. Yeah, you'd most likely want to use it, unless you have a reason not to (for example, if you want a bleeding edge Hadoop 0.22 trunk).
Generally, if you're just looking for a ready-made crawling / search engine solution, then Nutch is the way to go. Nutch already includes a lot of plugins to parse and index various crazy types of documents, including MS Word documents, PDFs, etc.
I personally don't see much point in using .NET technologies here, but if you feel comfortable with it, you can build the front ends in .NET. However, working with Unix technologies might feel fairly awkward for a Windows-centric team, so if I were managing such a project, I'd consider alternatives, especially if your crawling & indexing task is limited in scope (i.e. you don't want to crawl the whole internet for some purpose).
Have you looked at Lucandra (https://github.com/tjake/Lucandra), a Cassandra-based back end for Lucene/Solr? You can use Hadoop to populate the Cassandra store with the index of your data.