Load balancer and SolrCloud

I am wondering how a load balancer can be set up on top of SolrCloud, or whether a load balancer is not needed at all.
If the former, do shard leaders need to be added to the load balancer? Then what happens if a shard leader changes for some reason? Or would it be better to add all machines in the cluster (including replicas) to the load balancer?
If the latter, I guess a CNAME needs to point to the SolrCloud cluster and it should be round-robin DNS?
Any advice from actual SolrCloud operation experience would be really appreciated.

Usually SolrCloud is used in combination with ZooKeeper, and the client uses CloudSolrServer to access SolrCloud.
A query is handled in the following flow.
Note that I have only read parts of the Solr source code and there is a lot of guessing here. Also, what I read was the source code of Solr 4.1, so it might be outdated.
ZooKeeper holds the list of IPAddress:Port for all SolrCloud servers.
(Client side) The CloudSolrServer instance retrieves the list of servers from ZooKeeper.
(Client side) The CloudSolrServer instance chooses one of the SolrCloud servers at random and sends the query to it. (The underlying LBHttpSolrServer may choose the server round-robin?)
(Server side) The SolrCloud server that received the query chooses one replica per shard at random from the server list and forwards the query to them. (Note that every SolrCloud server holds the server list, which it receives from ZooKeeper.)
An update is handled in the same manner, but it is also propagated to all replicas.
Note that in SolrCloud the difference between the leader and a replica is small, and you can send a query/update to any of the servers; it is automatically forwarded to the right ones.
In short, load balancing is done on both the client side and the server side,
so you don't need to worry about it.
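To illustrate the client-side part, here is a minimal SolrJ sketch in the spirit of the Solr 4.x API discussed above (the ZooKeeper address and collection name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper address, not the address of any Solr node.
        CloudSolrServer server = new CloudSolrServer("zk1.example.com:2181");
        server.setDefaultCollection("collection1");

        // The client picks a live Solr node from the cluster state held in ZooKeeper
        // and sends the query there; that node then fans out to one replica per shard.
        QueryResponse response = server.query(new SolrQuery("*:*"));
        System.out.println("Found " + response.getResults().getNumFound() + " documents");

        server.shutdown();
    }
}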

Load balancing is needed, but it is effectively provided by ZooKeeper used in conjunction with SolrCloud.
When you use SolrCloud you must set up sharding and replication through ZooKeeper, either using the embedded ZooKeeper server that comes bundled with Solr or a stand-alone ZooKeeper ensemble (which is recommended for redundancy).
Then you use CloudSolrClient to send your queries; it reads the cluster state from ZooKeeper and routes each query to the correct shard in your cluster. CloudSolrClient requires the addresses of all your ZooKeeper instances upon instantiation, and load balancing is handled as appropriate from there.
Please see the following excellent tutorial:
http://www.francelabs.com/blog/tutorial-solrcloud-amazon-ec2/
Solr Docs:
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble

This quote refers to the latest version of Solr, which at the time of writing was 7.1.
SolrCloud - Distributed Requests
When a Solr node receives a search request, the request is routed
behind the scenes to a replica of a shard that is part of the
collection being searched.
The chosen replica acts as an aggregator: it creates internal requests
to randomly chosen replicas of every shard in the collection,
coordinates the responses, issues any subsequent internal requests as
needed (for example, to refine facets values, or request additional
stored fields), and constructs the final response for the client.
SolrCloud - Read Side Fault Tolerance
In a SolrCloud cluster each individual node load balances read
requests across all the replicas in collection. You still need a load
balancer on the 'outside' that talks to the cluster, or you need a
smart client which understands how to read and interact with Solr’s
metadata in ZooKeeper and only requests the ZooKeeper ensemble’s
address to start discovering to which nodes it should send requests.
(Solr provides a smart Java SolrJ client called CloudSolrClient.)
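As a rough sketch of that smart-client approach, a CloudSolrClient is pointed at the ZooKeeper ensemble only (the hosts and collection name below are placeholders, and a SolrJ version around 7.x is assumed):

import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SmartClientExample {
    public static void main(String[] args) throws Exception {
        // Only the ZooKeeper ensemble is given; Solr nodes are discovered from it.
        List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
        CloudSolrClient client = new CloudSolrClient.Builder(zkHosts, Optional.empty()).build();
        client.setDefaultCollection("mycollection");

        // Requests are load balanced across live replicas without an external load balancer.
        QueryResponse rsp = client.query(new SolrQuery("*:*"));
        System.out.println(rsp.getResults().getNumFound());

        client.close();
    }
}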

I am in a similar situation where I can't rely on CloudSolrServer for load balancing. A possible solution that I am evaluating is to use Airbnb's Synapse (http://nerds.airbnb.com/smartstack-service-discovery-cloud/) to dynamically reconfigure an existing HAProxy load balancer based on the status of the SolrCloud cluster, which we get from ZooKeeper.
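The cluster status that such a tool would watch is essentially the /live_nodes znode that SolrCloud maintains in ZooKeeper. A minimal sketch of reading it directly with the plain ZooKeeper client (connection string and session timeout are arbitrary placeholders):

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class LiveNodesExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // SolrCloud registers every live node under /live_nodes (e.g. "10.0.0.1:8983_solr");
        // a load balancer config could be regenerated from this list whenever it changes.
        List<String> liveNodes = zk.getChildren("/live_nodes", false);
        for (String node : liveNodes) {
            System.out.println(node);
        }

        zk.close();
    }
}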

Related

Run SolrCloud on different machines

I am trying to configure SolrCloud on more than one server/machine so that if one server fails, another replica can serve the request.
I can successfully run SolrCloud on a single machine with two nodes on different ports. I am referring to this link.
But how can I run it on different machines? What configuration do I need to achieve this?
Any help is appreciated.
You provide Solr with the address of the ZooKeeper ensemble you'll be using to distribute the workload. It's highly recommended to run ZooKeeper by itself instead of using the one bundled with Solr. In the last example in the link you've provided, you can see the -z parameter that provides the connection list of the available ZooKeeper instances. Solr uses ZooKeeper to keep its cluster state and the set of available servers synced across instances.

Sitecore 8.1 index rebuild strategy for SOLR search provider

I just read through the index update strategies document below but couldn't get a clear answer on which strategy is best for a Solr search implementation:
https://doc.sitecore.net/sitecore_experience_platform/search_and_indexing/index_update_strategies
We have set up master and slave Solr endpoints, where the master will be used for creates/updates and the slave for reading only.
I would appreciate it if you could suggest the indexing strategy to be used for:
Content Authoring
Content Delivery
The solution is hosted in Azure web apps and content delivery can be scaled up or down from 1 to N instances at any time.
I'm planning to configure the following:
Only CA has an onPublishEndAsync strategy.
The CDs will not have any indexing strategy.
I would appreciate it if you could suggest a solution that has worked for you. Also, how do we disable an indexing strategy?
Thanks.
Usually when you use replication in Solr (master + slave Solr servers), it should be configured like this:
Content Authoring (CM server):
connects to the Solr master server.
It runs the syncMaster strategy for the master database, and onPublishEndAsync for the web database.
Content Delivery (CD servers):
connect to a Solr slave server (or to a load balancer if there are multiple Solr slave servers).
have all the indexing strategies set to manual - they should NEVER update the Solr slave servers.
With this solution, CD servers can always get results from Solr, even if a full index rebuild is in progress (this happens on the master Solr server and data is copied to the slaves after it's finished).
You should think about having 2 Solr slave servers and a load balancer in front of them. If you do this:
If the Solr master is down for some reason, the slaves still answer requests from the CD boxes. You can safely restart the master and reindex, and the only thing you lose is that search results on CD are not 100% up to date for some time.
If one of the Solr slave servers is down, the second slave server still answers requests, and the load balancer should redirect all the traffic to the slave server that works.

SolrCloud - Updates to schema or dataConfig

We have a SolrCloud cluster managed by ZooKeeper. One concern that we have is with updating the schema or dataConfig on the fly. All the changes that we are planning to make are on the indexing server node in the SolrCloud. Once the changes to the schema or dataConfig are made, we do a full data import.
The concern is that the replication of the new indexes to the slave nodes in the cloud would not happen immediately, but only after the replication interval. Also, for the different slave nodes the replication will happen at different times, which might cause inconsistent results.
For example:
The index replication interval is 5 mins.
Slave node A started at 10:00 => next index replication would be at 10:05.
Slave node B started at 10:03 => next index replication would be at 10:08.
If we make changes to the schema on the indexing server and re-index the results at 10:04, then the results of this change would be available on node A at 10:05, but on node B only at 10:08. Requests made to the SolrCloud between 10:05 and 10:08 would return inconsistent results depending on which slave node the request gets redirected to.
Please let me know if there is any way to make the results more consistent.
#Wish, what you are stating is not the behavior of a SolrCloud cluster.
In SolrCloud, index updates are routed to the shard leaders, and each leader sends copies to all of its replicas.
At any point in time, if ZooKeeper identifies that a replica is not in sync with its leader, that replica is put into recovering mode. In this mode it will not serve any requests, including queries.
P.S.: In SolrCloud, configs are maintained in ZooKeeper and not at the node level.
I guess you are confusing SolrCloud with master-slave mode a little; please confirm which setup you are in.
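To make the routing concrete: with a true SolrCloud setup, an update sent through CloudSolrClient (or to any node) is forwarded to the owning shard's leader and then to its replicas, so there is no replication interval to wait for. A minimal sketch (ZooKeeper hosts, collection, and field names are placeholders; a SolrJ version around 7.x is assumed):

import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UpdateRoutingExample {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build();
        client.setDefaultCollection("mycollection");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title_t", "hello solrcloud");

        // The client hashes the id to find the owning shard, sends the document to that
        // shard's leader, and the leader forwards it to its replicas.
        client.add(doc);
        client.commit();

        client.close();
    }
}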

How to set up Solr Replication with two search servers?

Hi, I'm developing a Rails project with Sunspot Solr and configuring Solr replication.
My environment: Rails 3.2.1, Ruby 2.1.2, Sunspot 2.1.0, Solr 4.1.6.
Why replication: I need a more stable system - the search server often goes down for maintenance and the web application stops working in production. So I am thinking about how to set up 2 identical search servers instead of one, to make the system more stable: if one server goes down, the other will continue working.
I cannot find any good tutorial that is simple, easy to understand, and described in detail...
I'm trying to set up replication on two servers, but I do not fully understand how replication works inside:
how to synchronize data between the two servers (is it an automatic action?)
how to balance search requests between the two servers
when one server suddenly stops working, should the other become the master (is it an automatic action?)
are there replication features other than those listed?
The answer to this is similar to:
How to setup Solr Cloud with two search servers?
What is the difference between Solr Replication and Solr Cloud?
Can we close this as a duplicate?
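Regarding balancing search requests between two stand-alone (non-Cloud) Solr servers from the application side, SolrJ's LBHttpSolrServer can round-robin across them and skip a server that stops responding; the same idea applies to whatever client library the application actually uses, and replication itself still has to be configured in solrconfig.xml. A minimal sketch with placeholder URLs (Solr 4.x era API):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TwoServerBalancingExample {
    public static void main(String[] args) throws Exception {
        // Both search servers are listed; requests are distributed round-robin,
        // and a server that stops responding is skipped until it comes back.
        LBHttpSolrServer lb = new LBHttpSolrServer(
                "http://search1.example.com:8983/solr/mycore",
                "http://search2.example.com:8983/solr/mycore");

        QueryResponse rsp = lb.query(new SolrQuery("*:*"));
        System.out.println(rsp.getResults().getNumFound());

        lb.shutdown();
    }
}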

Solr cloud sharding

Currently I have a ZooKeeper instance controlling replication on 3 servers. It is the ZooKeeper embedded in Solr. It works well in my web-based application.
I have a new requirement which will require sharding in the cloud and I am not sure how to implement it. Basically, I want to separate the data which can only be updated by me, shard 1, from the data that users can update, shard 2. From time to time I will be completely replacing the data directory in shard 1 - but I don't want to disturb the user-created data in shard 2.
Shard 1 does not need replication, since I can copy the new data to each server when I choose to update it, but shard 2 does need replication.
Currently I run the following command on the server running ZooKeeper -
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=1 -jar start.jar
And the following command on the other 2 non-ZooKeeper servers:
java -Djetty-port=8983 -DzkHost=129.**.30.11:9983 -jar start.jar&
This creates a single-shard Solr setup across the 3 instances.
I think I just need to add 1 static shard to this configuration, but I am not sure of the sequence of commands to accomplish it.
Many thanks
Firstly, you are using ZooKeeper to maintain your shards and leaders/replicas. So if you want to have one shard with two instances and another shard with only a leader, then you will have to modify your command as follows:
1) provide -DnumShards=2 so that ZooKeeper knows that you need two shards
2) specify the -DzkHost parameter for this first Solr instance as well.
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=2 -DzkHost=** -jar start.jar
When you do this you will see some errors on the console, since shard2 has not been created yet.
Now start your other two servers and you should see shard1 with two servers (leader and replica) and shard2 with only one instance, i.e. the leader.
If you want separation of indexes and control over those indexes, you will have to create two collections instead of two shards.
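A rough sketch of creating two such collections from SolrJ (collection names, the config set name, and the ZooKeeper host below are placeholders; a recent SolrJ with the Collections API is assumed - the same can be done via the /admin/collections HTTP endpoint):

import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class TwoCollectionsExample {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient.Builder(
                Arrays.asList("zk1:2181"), Optional.empty()).build();

        // Collection for data only you update: 1 shard, a single copy.
        CollectionAdminRequest.createCollection("admindata", "myConfig", 1, 1)
                .process(client);

        // Collection for user-updated data: 1 shard, replicated to 2 nodes.
        CollectionAdminRequest.createCollection("userdata", "myConfig", 1, 2)
                .process(client);

        client.close();
    }
}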
Explanation
You have 3 servers, right? So when you start SolrCloud using ZooKeeper, the following things happen:
1) Start the first Solr server along with ZooKeeper, and you will get 1 shard for the Solr cloud, shard1.
2) Start the second Solr server and point it to ZooKeeper. Since you have declared -DnumShards=2, ZooKeeper checks that it needs to create 1 more shard, so it creates shard2 for your collection. By now you will be able to see your admin console with 2 shards for 1 collection.
3) Now start your 3rd server and point it to ZooKeeper. ZooKeeper sees that 2 shards are already there, so it will create a replica for shard1 instead of a new shard.
so it will be like
collection--->shard1--->server1,server3
--->shard2--->server2
