Deploy SolrCloud to multiple servers - solr

I am a little bit confused with solrCloud. But how can I deploy SolrCloud on multiple servers? Will it be multiple nodes one per separate server or maybe will it bee one solrCloud node and multiple shards one per server?
And how all of this will communicate with Zookeeper (as far as I understand Zookeeper has to be also deployed on the separate server, is this correct?)
I am a little bit confused with all of this? Can you help me? Or maybe give a link to a good tutorial?

The SolrCloud section of the reference manual should be able to help you out about the concepts of Solr Cloud.
You can run multiple nodes on a single server, or you can run one node on each server. That's really up to you - but all the nodes running in a single server will disappear when that server goes down. The use case for running multiple nodes on a single server is usually for experimenting or for very particular requirements to try to get certain speedups from the single threaded parts of Lucene, so unless you're doing low-level optimization, having one node per server is what you want.
The exception to that rule is for development and experimenting - running multiple nodes on a single machine is fine when the data doesn't matter.
All the nodes make up a single SolrCloud cluster - so you'd be running multiple nodes, not multiple clusters.
Zookeeper should (usually) be deployed on three to five servers - depending on what kind of resiliency you want for failovers. While Solr bundles a Zookeeper instance you can use if you don't want to set up Zookeeper yourself, that is not recommended for production. In a production environment you'd run Zookeeper as a separate process - but that may not mean that you'll be running it on separate servers. Depending on how much traffic and use you'll see for Zookeeper for your nodes, running them on the same server as your cloud nodes will work perfectly fine. The point is to avoid using the bundled version to have full control over Zookeeper and its configuration, and to be able to upgrade/manage the instances outside of Solr.
If the need arises later you can move Zookeeper to its own cluster of servers then (at least three).

Related

Setting up solrcloud with data on two machines

I want to set up solr cloud with data split across 2 machines. For now, I need no replication, load balancing, or fault tolerance. Is there a simple way of achieving this? Most of the tutorials end up talking a lot about external zookeeper dependencies, which I think aren't needed for the barebones configuration I mentioned, and it has been hard to use those to create what I want.
If you do not need any fault tolerance, you can just start two Solr cloud instances and point them to the embedded Zookeeper. You'll need three nodes for Zookeeper to be able to do establish a quorum on failure anyway.
The embedded zookeeper runs on port <solrport + 1000>, and you'd start the nodes with -z host1:port,host2:port. In this case you could just point the latter instance to the first one, since you don't need any fault tolerance.
This is the same configuration as given in Getting Started with Solr Cloud.

Why do we need an external zookeeper for Solrcloud?

I want to set up a solr cloud using 3 different machines. All the examples i came across are advising me to download zookeeper as well.
However Solr comes with an inbuilt Zookeeper. Why cannot we use that?
Right now the embedded zookeeper is supposed to be mostly for testing/developing. Reasons being that it is less battle tested in production and that if Solr process goes down, zk goes down too, making the cluster less resilient than a separate process.
That being said, there are some voices saying that setting up a separate zk ensemble is too troublesome, so maybe this changes in the future.

Should zookeeper be run on the worker machines or independent machines?

We have several kinds of software that use zookeeper like Solr, Storm, Kafka, Hbase etc.
There are 2 options to install zookeeper cluster (more than 1 nodes):
Embedded cluster: Install ZK on some of the same machines as the other software are installed OR
External cluster: Have a few not very powerful but dedicated zookeeper machines (in the same region, cloud and data-center though) to run zookeeper on.
Which is a better option for cluster stability? Note that in both the cases, we always have an odd number of machines in our zookeeper cluster and not just one machine.
It appears that the embedded option is easier to setup and is a better use of the machines but the external option seems more stable because a loss of single machine means the loss of just one component (Loss of a machine in embedded zookeeper means loss of zookeeper node as well as the worker node of Solr, Storm, Kafka whatever the case maybe).
What is the industry standard to run zookeepers in production for maximum stability?
Zookeeper is a critical component for a Kafka cluster but since the implementation of the new generation of clients the load on ZK has been greatly reduced and is now only used by the cluster itself. Even though the load is usually not very high, it can be sensitive to latency and therefore the best practice is to run a Zookeeper ensemble on dedicated machines and optimally even use dedicated disks for ZK transaction logs to avoid IO contention.
By using larger Zookeeper ensembles you gain resiliency but this also increase communication within the cluster and you could lose some performance. Since Zookeeper works with simple majority voting you need an odd number of nodes for it to make sense. A 3 node ensemble allow losing 1 node without impact, a 5 node ensemble allow losing 2 nodes and so on.
In practice, I´ve seen small, low workload clusters run very well with Zookeeper installed on the same machines as the Kafka nodes but if you aim for maximum stability and have increasing traffic, separate clusters would be recommended.
You should consider yourself discouraged from using internal ZooKeeper in production.
Its good to have external zookeeper, Best if Zookeeper ensemble(two or more)
If you have one zookeeper node and it might create problems when it goes down.
if you have cluster setup of zookeeper nodes and if one zookeeper node goes down the remaining majority nodes are running will continue to work.
More details
For SolrCloud, we strongly recommend that Zookeeper is external, and that you have at least three of them.
This does NOT mean that it cannot run on the same servers as Solr, but it DOES mean that you should NOT use the zookeeper server that Solr itself can start, embedded within itself.
Here's some information related to performance and SolrCloud that touches on zookeeper:
https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud
Whether or not you need completely separate machines, or even separate disks for the zookeeper database when running on the same machine as Solr, is VERY dependent on the characteristics of your SolrCloud install. If your index is very small and your query load is low, it's possible that you can put zookeeper on the same machines and even the same disks.
For the other services you mentioned, I have no idea what the recommendation is.

How to setup Solr Cloud with two search servers?

Hi I'm developing rails project with sunspot solr and configuring Solr Cloud.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need more stable system - oftentimes search server goes on maintenance and web application stop working on production. So, I think about how to make 2 identical search servers instead of one, to make system more stable: if one server will be down, other will continue working.
I cannot find any good turtorial with simple, easy to understand and described in details turtorial...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it is working inside:
synchronize data between two servers (is it automatic action?)
balances search requests between two servers
when one server suddenly stop working other should become a master (is it automatic action?)
is there SolrCloud features other than listed?
Read more about SolrCloud here..! https://wiki.apache.org/solr/SolrCloud
Couple of inputs from my experience.
If your application just reads data from SOLR and does not write to SOLR(in real time but you index using an ETL or so) then you can just go for Master Slave hierarchy.
Define one Master :- Point all writes to here. If this master is down you will no longer be able to index the data
Create 2(or more) Slaves :- This is an feature from SOLR and it will take care of synchronizing data from the master based on the interval we specify(Say every 20 seconds)
Create a load balancer based out of slaves and point your application to read data from load balancer.
Pros:
With above setup, you don't have high availability for Master(Data writes) but you will have high availability for data until the last slave goes down.
Cons:
Assume one slave went down and you bought it back after an hour, this slave will be behind the other slaves by one hour. So its manual task to check for data consistency among other slaves before adding back to ELB.
How about SolrCloud?
No Master here, so you can achieve high availability for Writes too
No need to worry about data inconsistency as I described above, SolrCloud architecture will take care of that.
What Suits Best for you.
Define a external Zookeeper with 3 nodes Quorom
Define at least 2 SOLR severs.
Split your Current index to 2 shards (by default each shard will reside one each in 2 solr nodes defined in step #2
Define replica as 2 (This will create replica for shards in each nodes)
Define an LB to point to above solr nodes.
Point your Solr input as well as application to point to this LB.
By above setup, you can sustain fail over for either nodes.
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Zookeeper and SolrCloud on AWS EC2 instances

I have used Solr for a while, but am new to SolrCloud. I am investigating whether it makes sense in my context to deploy SolrCloud or to have multiple Solr instances (with matching indexed content) sitting behind an ELB.
My deployment will be in AWS on EC2 instances. Our current troubleshooting strategy in AWS is to terminate misbehaving instances and allow them to be automatically recreated by an AutoScaling group (which configures new instances via scripts when they are created). In fact, we do not have access to log on to the instances once they are in production. Everything stored in Solr can be re-indexed, so there is not a concern for data loss.
When trying to understand the SolrCloud infrastructure, however, I had a few questions:
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
AN: In ZooKeeper, you will just have to mention about other ZooKeepers. This is to make the ZooKeepers aware of other running ZooKeepers. You don't need to change this config unless you plan to increase/decrease the number of ZooKeepers. Even if we have to do, we can do without disturbing the cluster by doing one at time. Also we keep hostname in config so that change in ip will have no impact on this.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
AN: In ZooKeeper, we have a leader and followers. We don't need to bother about them as we don't communicate with ZooKeepers
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
AN: When you create a new SOLR node, you will have to start the node under the same cluster (Pass same ZooKeepers). Once you start with this, you will have to split a shard and move it to another node so as to balance the cluster. Not automated as of now.
SOLR Nodes are the one that you have to add in your ELB.
When you start a SOLR node, you will mention the list of ZooKeepers by which SOLR node will understand which cluster is that part of and other nodes serving the cluster

Resources