Why do we need an external zookeeper for Solrcloud? - solr

I want to set up SolrCloud on 3 different machines. All the examples I came across advise me to download ZooKeeper as well.
However, Solr comes with a built-in ZooKeeper. Why can't we use that?

Right now the embedded ZooKeeper is intended mostly for testing and development. The reasons are that it is less battle-tested in production, and that if the Solr process goes down, ZooKeeper goes down with it, making the cluster less resilient than one where ZooKeeper runs as a separate process.
That being said, there are some voices saying that setting up a separate ZooKeeper ensemble is too troublesome, so this may change in the future.

Related

Deploy SolrCloud to multiple servers

I am a little bit confused by SolrCloud. How can I deploy SolrCloud on multiple servers? Will it be multiple nodes, one per server, or will it be one SolrCloud node with multiple shards, one per server?
And how will all of this communicate with ZooKeeper (as far as I understand, ZooKeeper also has to be deployed on a separate server, is this correct?)
Can you help me, or maybe give a link to a good tutorial?
The SolrCloud section of the reference manual should be able to help you out about the concepts of Solr Cloud.
You can run multiple nodes on a single server, or you can run one node on each server. That's really up to you - but all the nodes running in a single server will disappear when that server goes down. The use case for running multiple nodes on a single server is usually for experimenting or for very particular requirements to try to get certain speedups from the single threaded parts of Lucene, so unless you're doing low-level optimization, having one node per server is what you want.
The exception to that rule is for development and experimenting - running multiple nodes on a single machine is fine when the data doesn't matter.
All the nodes make up a single SolrCloud cluster - so you'd be running multiple nodes, not multiple clusters.
Zookeeper should (usually) be deployed on three to five servers - depending on what kind of resiliency you want for failovers. While Solr bundles a Zookeeper instance you can use if you don't want to set up Zookeeper yourself, that is not recommended for production. In a production environment you'd run Zookeeper as a separate process - but that may not mean that you'll be running it on separate servers. Depending on how much traffic and use you'll see for Zookeeper for your nodes, running them on the same server as your cloud nodes will work perfectly fine. The point is to avoid using the bundled version to have full control over Zookeeper and its configuration, and to be able to upgrade/manage the instances outside of Solr.
If the need arises later you can move Zookeeper to its own cluster of servers then (at least three).
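The "three to five servers" advice follows from majority voting: a ZooKeeper ensemble stays available only while a strict majority of its members can reach each other. A minimal sketch of that arithmetic (the function names are illustrative, not part of any ZooKeeper or Solr API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with `ensemble_size` voting members."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule, two servers tolerate no failures and four tolerate no more than three do, which is why odd sizes such as three or five are the usual choice.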

Setting up solrcloud with data on two machines

I want to set up solr cloud with data split across 2 machines. For now, I need no replication, load balancing, or fault tolerance. Is there a simple way of achieving this? Most of the tutorials end up talking a lot about external zookeeper dependencies, which I think aren't needed for the barebones configuration I mentioned, and it has been hard to use those to create what I want.
If you do not need any fault tolerance, you can just start two SolrCloud instances and point them to the embedded ZooKeeper. You'll need three ZooKeeper nodes to be able to establish a quorum on failure anyway.
The embedded zookeeper runs on port <solrport + 1000>, and you'd start the nodes with -z host1:port,host2:port. In this case you could just point the latter instance to the first one, since you don't need any fault tolerance.
This is the same configuration as given in Getting Started with Solr Cloud.
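As a rough illustration of the port rule in this answer, the sketch below derives the embedded ZooKeeper port from a Solr port and builds the host1:port,host2:port string passed to -z. The hostname is hypothetical and the helper names are made up for this example; 8983 is Solr's default port.

```python
EMBEDDED_ZK_PORT_OFFSET = 1000  # from the answer above: ZooKeeper port = Solr port + 1000


def embedded_zk_port(solr_port: int) -> int:
    """Port the embedded ZooKeeper listens on for a Solr node on `solr_port`."""
    return solr_port + EMBEDDED_ZK_PORT_OFFSET


def zk_host_string(hosts) -> str:
    """Join (hostname, zk_port) pairs into the host1:port,host2:port form used with -z."""
    return ",".join(f"{host}:{port}" for host, port in hosts)


if __name__ == "__main__":
    # Hypothetical first machine; the second node would simply point at it.
    first_node = ("solr1.example.com", embedded_zk_port(8983))
    print(zk_host_string([first_node]))  # solr1.example.com:9983
```

With no fault tolerance required, the second node only needs the single entry for the first node's embedded ZooKeeper.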

Solr - Multi Core vs Multiple Instance for Many Database Tables

I have a performance concern and would like a suggestion on which is best: multi-core or multi-instance (on different ports)? Let's look at my case first:
Currently I am running Solr with multiple cores and it is running OK. There is only one issue: sometimes it goes "out of heap memory while processing facets fields", and then I have to restart Solr. (To minimize the number of restarts, I start Solr with a high memory limit: java -Xms1000M -Xmx8000M -jar start.jar)
I have an Amazon EC2 instance with 8 cores at 2.8 GHz and 15 GB RAM, with an optimized hard disk.
I have many database tables (about 100) and have to create a different schema for each, which means a different core for each.
Each table has millions of documents, with 7-9 indexed fields and 10-50 stored fields in each document.
My web portals must handle very high traffic (currently I have 10 requests/second, which may increase to 50-100/second). I know Solr can handle that; I mention it only to show that I am concerned about even the smallest performance issue.
I search Solr from PHP with cURL against a specific core, so searching across different Solr instances is no problem either.
Question:
As far as I know, Solr handles one request at a time. So I think that if I create multiple Solr instances and start them on different ports, my web portal can handle more requests at a time (if users search in different tables).
So, what would you suggest? Multiple cores in a single Solr instance, or multiple instances with one or two cores in each?
Is there any problem with having multiple Solr instances running on different ports?
NOTE: I could combine less-searched or small cores in one instance and put heavy-traffic cores in separate instances, or two or three heavy-traffic cores in one instance, and so on, because creating a separate instance for each table (~100 here) would take too many hardware resources.
As I didn't get any answer for more than a week, and I have since tried many cases with Solr (and read some articles), I want to share my experience as an answer to my own question. This may help future viewers. I also tried Server Fault, with no success.
Solr can handle more than one request at a time.
I tested this by running a long query [qTime=7203, approx. 7 sec] followed by several small queries [qTime=30]; Solr responded to the small queries first, even though they were started after the long one.
This point is the main reason for my answer: use a single Solr instance with multiple cores, and just assign a high amount of memory to the JVM.
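The test described in this answer (a long query followed by short ones, then checking which finishes first) can be sketched roughly as follows. This is only an illustration of the method: fake_query is a stand-in that sleeps instead of calling Solr, and against a real server run_query would be whatever function issues the HTTP request to a core.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def completion_order(run_query, queries, max_workers=8):
    """Submit (label, query) pairs in order, concurrently; return labels in finish order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {}
        for label, query in queries:
            futures[pool.submit(run_query, query)] = label
        return [futures[done] for done in as_completed(futures)]


def fake_query(seconds):
    # Stand-in for an HTTP request to Solr: just sleeps for the given time.
    time.sleep(seconds)
    return seconds


if __name__ == "__main__":
    order = completion_order(
        fake_query,
        [("long", 0.5), ("short-1", 0.01), ("short-2", 0.01)],
    )
    # The long query is submitted first; if requests are served concurrently
    # it should still be the last one to finish.
    print(order)
```

If the server handled only one request at a time, the long query would finish first and the short ones would queue behind it.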
Other Points:
1. Each Solr instance requires RAM, so running multiple instances requires more resources, which is expensive. And if you use facets or sort fields, you need to allocate more RAM to each instance.
As you can see, in my case I need to start Solr with a high memory limit (8 GB). You can see a case for the Danish Web Archive, which uses multiple instances with 9 GB RAM allocated to each and 256 GB of total RAM.
2. You can run multiple instances of Solr on different ports with java -Djetty.port=8984 -jar start.jar. Everything ran OK, but I hit one problem.
While indexing, it may give a "not enough memory" error, and then the Solr instance is killed. So you need to start the second instance with high memory as well, which leads to a higher RAM requirement.
3. Solr resource requirements and performance problems are explained here. According to this, a 64-bit environment and 12 GB RAM are recommended for good performance. Solr optimizations are explained here.

How can Slaves automatically detect new Cores on Solr Master?

I have a Solr environment with one master and some slaves. The index consists of multiple Solr Cores that share their Schema but need to be separated from each other.
From time to time, there are new Cores added to the Master via Software. At the moment, for replication I have to add these new Cores to the Slaves manually, which sucks.
Is there a way to have the Slaves automatically detect new Cores on the Master, create them locally and start replication right away? Your help is very much appreciated.
Update: The current setup is Solr 3, but a migration to Solr 4 is already planned. So this is basically a Solr 4 question.
I do not know of any automatic core detection/setup for Slaves in the standard replication settings. You might be able to automate this yourself using the CoreAdmin commands via software. Or since you are migrating towards Solr 4, you should look into using SolrCloud as this may provide some of the functionality you are seeking.
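One way to automate this with CoreAdmin is to compare the core lists of master and slave and issue a CREATE call for each core the slave lacks. The sketch below covers only the comparison and URL building. It assumes the STATUS response has been fetched as JSON and parsed into a dict with a "status" mapping keyed by core name, and that a suitable instance directory (with a replication config pointing at the master) already exists on the slave; the helper names and the example host are hypothetical.

```python
from urllib.parse import urlencode


def core_names(status_response: dict) -> set:
    """Core names from a parsed CoreAdmin STATUS response.

    Assumed shape: {"status": {"core_name": {...}, ...}}.
    """
    return set(status_response.get("status", {}))


def missing_cores(master_status: dict, slave_status: dict) -> list:
    """Cores present on the master but not yet on the slave, sorted by name."""
    return sorted(core_names(master_status) - core_names(slave_status))


def create_core_url(slave_base_url: str, name: str, instance_dir: str) -> str:
    """CoreAdmin CREATE URL for one core on the slave."""
    params = urlencode({"action": "CREATE", "name": name, "instanceDir": instance_dir})
    return f"{slave_base_url.rstrip('/')}/admin/cores?{params}"


if __name__ == "__main__":
    master = {"status": {"a": {}, "b": {}, "c": {}}}
    slave = {"status": {"a": {}}}
    for core in missing_cores(master, slave):
        # Hypothetical slave host; the instance directory is assumed to match the core name.
        print(create_core_url("http://slave1:8983/solr", core, core))
```

A scheduled job on each slave could run this comparison and then request the generated URLs, leaving the actual index copy to the normal replication handler.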

Solr Master Slave Failover setup for High Availability

While using Solr (we are currently using 3.5), how do we set up the masters for failover?
Let's say my setup has two masters and two slaves. The application commits all writes to one active master, and both slaves get their updates from this active master. There is another repeater which serves the same purpose as the master.
Now my question is: if the master goes down for some reason, how can I make the repeater the master without any manual intervention? How can the slaves start getting their updates from the repeater instead of the broken master? Is there a recommended way to do this? Are there any other recommended master/slave setups to ensure high availability of the Solr systems?
At this time, your best option is probably to investigate the SolrCloud functionality present in the current Solr 4.0 alpha, which at the time of this writing is due for its final release within a few months. The goal of SolrCloud is to handle data distribution and master election, using the ZooKeeper distributed database to maintain consensus within the cluster about which nodes are serving in which roles.
There are other more traditional ways to set up failover for Solr 3's replicated master-slave architecture, but I personally wouldn't want to make that investment with Solr 4.0 so near to release.
Edit: See Linux-HA, for one such traditional approach. Personally, I would create a purpose-built daemon that reconfigures your cores and load balancer, using ZooKeeper for presence detection and distributed locks.
If outsourcing is an option, you might consider a hosted service such as my own humble Websolr. We provide this kind of distribution and hot failover by default, so our customers don't have to worry as much about the mechanics of how it's implemented.
I agree with Nick. The way replication works in Solr 3.x is not always handy, especially for master failover. If you are going to consider Solr 4, you might want to have a look at elasticsearch too, which solves these kinds of problems in a really brilliant way!
It uses push replication instead of the pull mechanism used by Solr. That means the document is literally reindexed on all nodes. It might sound strange, but it makes it possible to reduce the network load (due to segment merges, for example). Furthermore, one node is elected as master, and if it crashes another node automatically replaces it as the new master.

Resources