SolrCloud - Updates to schema or dataConfig

We have a SolrCloud managed by ZooKeeper. One concern we have is with updating the schema or dataConfig on the fly. All changes we plan to make are on the indexing server node in the SolrCloud. Once the changes to the schema or dataConfig are made, we run a full dataimport.
The concern is that replication of the new index to the slave nodes in the cloud would not happen immediately, but only after the replication interval. Also, replication happens at different times on different slave nodes, which might cause inconsistent results.
For example, suppose the index replication interval is 5 minutes:
Slave node A started at 10:00, so its next index replication is at 10:05.
Slave node B started at 10:03, so its next index replication is at 10:08.
If we change the schema on the indexing server and re-index at 10:04, the results of this change become available on node A at 10:05, but on node B only at 10:08. Requests made to the SolrCloud between 10:05 and 10:08 would return inconsistent results depending on which slave node a request is routed to.
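The width of that inconsistency window follows directly from the polling schedule. A small sketch with the numbers from the example (all times in minutes past 10:00):

```shell
# Each slave polls the master every `interval` minutes, starting from when it
# came up. The next poll at or after the commit determines when it sees the change.
interval=5
start_a=0   # node A started at 10:00
start_b=3   # node B started at 10:03
commit=4    # re-index finished at 10:04
# Next poll on each node at or after the commit (ceiling division):
next_a=$(( start_a + ((commit - start_a + interval - 1) / interval) * interval ))
next_b=$(( start_b + ((commit - start_b + interval - 1) / interval) * interval ))
echo "node A picks up the change at 10:0${next_a}"            # 10:05
echo "node B picks up the change at 10:0${next_b}"            # 10:08
echo "inconsistency window: $(( next_b - next_a )) minutes"   # 3 minutes
```

With unsynchronized poll start times, the window between the first and last slave seeing a change can be up to a full replication interval.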
Please let me know if there is any way to make the results more consistent.

@Wish, what you are describing is not the behavior of SolrCloud.
In SolrCloud, index updates are routed to the shard leaders, and each leader sends copies to all of its replicas.
If, at any point, ZooKeeper identifies that a replica is not in sync with its leader, that replica is put into recovery mode. In this mode it does not serve any requests, including queries.
P.S.: In SolrCloud, configs are maintained in ZooKeeper, not at the node level.
I suspect you are confusing SolrCloud with master/slave mode; please confirm which setup you are using.
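One quick way to confirm which mode you are in (host and port are assumptions for a default local install): the Collections API only answers in SolrCloud mode, because it reads the cluster state from ZooKeeper.

```shell
# In SolrCloud mode this returns the cluster state (collections, shards,
# replicas, live nodes) as JSON; in standalone master/slave mode the
# Collections API is not available and the call returns an error.
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json"
```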

Related

SolrCloud on different machines

I have set up a SolrCloud on two machines. I created a collection, collection1, split into two shards with two replicas each. I added my other Solr machine to the cloud, and on the Solr admin page under Cloud > Tree > live_nodes I can see four live nodes, including the last Solr instance launched. However, my shards are running on the same machine, just on different ports, and even the replica still shows the leader's address.
Now I want to move the replica to the newly launched Solr instance, or just put all of shard 1 or shard 2 on the other machine.
I have tried searching for this, but nothing tells me the exact commands.
This question is rather old, but for the sake of completeness:
In the Solr UI, go to Collections
Select your collection
Click on the shards on the right side
Click add replica
Choose your new node as the target node
Wait for the replica to be ready (watch in Cloud > Graph)
Back in the shards list, delete the old replica
If the old replica was the leader, a leader election will be triggered automatically.
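The same steps can be done with the Collections API instead of the UI. A sketch, where the collection, shard, node, and replica names are placeholders you would substitute from your own CLUSTERSTATUS output:

```shell
# 1. Add a replica of shard1 on the new node (node name format is host:port_solr):
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=newhost:8983_solr"

# 2. Once the new replica reports state "active" (watch Cloud > Graph),
#    delete the old replica by its core_node name:
curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard1&replica=core_node1"
```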

SolrCloud 5.1.0 Replica

I have a SolrCloud running on Solr 5.1.0 which consist of a set of powerful machines for searches and updates.
I have a set of additional and slower machines which are supposed to be only replica. These machines do not receive any direct traffic.
However, the log files of the slower machines show a lot of query traffic originating from the other nodes.
I want the slow replicas for recovery only, they should not process any searches.
Is there a possibility to configure this behaviour in SolrCloud?
No. SolrCloud does not provide a mechanism by which a node participates in replication but is not a candidate for per-shard fan-out queries.
That said...
You can do this with master/slave replication.
If all the shards for your collection(s) are on every node, then the preferLocalShards=true parameter might accomplish the same thing. See https://cwiki.apache.org/confluence/display/solr/Distributed+Requests
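A sketch of that workaround, assuming every fast node hosts a replica of every shard (host and collection names are placeholders): sending queries only to the fast nodes with preferLocalShards=true should keep the per-shard fan-out on local replicas, away from the slow machines.

```shell
# The receiving node prefers its own local replicas for the internal
# per-shard requests, so the slow replicas stay out of the query path:
curl "http://fastnode:8983/solr/collection1/select?q=*:*&preferLocalShards=true"
```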

How to handle solr replication when master goes down

I have a Solr setup configured for master/slave. Indexing happens on the master, and the slave replicates the index from the master at a 2-minute interval, so there is a delay of up to 2 minutes before data reaches the slave. Suppose the master was indexing some data at 10:42 but went down at 10:43 due to a hardware issue. The data indexed at 10:42 was supposed to replicate to the slave by 10:44 (given the 2-minute interval). Since the master is no longer available, how can I identify the last data indexed on the master? Is there a way to track indexing activity in the Solr log?
Thanks in Advance
Solr does log indexing operations if you have the Solr log level set to INFO. Any commit/add will show up in the log, so you can check the log for when the last addition was made. Depending on the setup, it might be hard to get at the log when the server is down, though.
You can reduce the time between replications to get more real time replication, or use SolrCloud instead (which should distribute the documents as they're being indexed).
There are also API endpoints (see which connections the Admin interface makes when browsing to the 'replication' status page) for getting the replication status, but those wouldn't help you if the server is gone.
In general - if the server isn't available, you'll have a hard time telling when it was last indexed to. You can work around a few of the issues by storing the indexing time outside of Solr from the indexing task, for example updating a value in memcache or MySQL every time you send something to be indexed from your application.
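A sketch of scanning an INFO-level Solr log for the last indexing operation. The log path and line format here are assumptions (real Solr log lines vary by version and logging config); a couple of fake lines stand in for a real log:

```shell
# Fake sample of an INFO-level Solr log for illustration:
cat > /tmp/solr_sample.log <<'EOF'
2015-06-01 10:41:57.123 INFO  (qtp-12) [   x:core1] o.a.s.u.p.LogUpdateProcessor [core1] webapp=/solr path=/update params={} {add=[doc1]} 0 12
2015-06-01 10:42:31.456 INFO  (qtp-13) [   x:core1] o.a.s.u.DirectUpdateHandler2 start commit{,optimize=false}
EOF
# Last add/commit entry, i.e. the last moment the master was known to index:
grep -E 'add=|commit' /tmp/solr_sample.log | tail -n 1
```

If the disk survives the crash, running this against the recovered log file tells you the last indexed timestamp; otherwise you need the externally stored timestamp described above.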

How to correctly configure SolrCloud replicas on a two-node / two shards cluster

I'm new to SolrCloud (and Solr).
I need your help understanding collection shard and replicas.
I have two SolrCloud instances running on two different servers.
I have a collection, mycol, with two shards. Each SolrCloud node hosts one shard.
Because I'm running two nodes, I'm thinking about adding redundancy. I have some questions about it:
First way:
Add a new core on each SolrCloud node: on the node hosting mycol shard1, assign the new core to shard2, and on the node hosting mycol shard2, assign it to shard1. The new cores will become replicas, and each node will hold the complete collection in case of hardware failure.
Second way:
Add two more SolrCloud instances on two more servers. They will become replicas automatically.
Third way:
Add two more SolrCloud instances, one on each of the existing servers. They will become replicas automatically.
It's driving me crazy trying to understand which is the correct way.
Can you help me?
Thank you
Regards
Giova
It's a bit hard to dissect what you are looking for based on your question, but the standard practice is to deploy two or more SolrCloud nodes, make sure they can talk to each other and to ZooKeeper, and then create your collections with the numShards and replicationFactor parameters. These parameters determine how many shards are created and how many replicas are created for each shard. Shards break the collection into smaller chunks; shards by themselves don't provide any redundancy. Shard replicas are exact copies of your shards, and they are what actually provides redundancy.
Once you fire off the CREATE command to any of the nodes in the SolrCloud cluster, your collection will be created. The replicas are created on the second server to provide redundancy if the first one goes down. At that point, you should be able to query any replica, and SolrCloud will automatically route the query internally and return results.
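For the two-node, two-shard case in the question, the CREATE call might look like this (collection and configset names are placeholders; maxShardsPerNode is needed so that four cores fit on two nodes):

```shell
# 2 shards x 2 replicas = 4 cores spread across the 2 live nodes;
# each node then holds one replica of each shard, so either node alone
# can serve the whole collection.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycol&numShards=2&replicationFactor=2&maxShardsPerNode=2&collection.configName=myconf"
```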

Loadbalancer and Solrcloud

I am wondering how a load balancer can be set up on top of SolrCloud, or whether a load balancer is even needed.
If the former, do the shard leaders need to be added to the load balancer? And what if a shard leader changes for some reason? Or would it be better to add all machines in the cluster (including replicas) to the load balancer?
If the latter, I guess a CNAME needs to point to the SolrCloud cluster using round-robin DNS?
Any advice from actual SolrCloud operational experience would be really appreciated.
SolrCloud is usually used in combination with ZooKeeper, and the client uses CloudSolrServer to access SolrCloud.
A query is handled in the following flow.
Note that I have only read the Solr source code partially, so there is a lot of guesswork here. Also, what I read was the source code of Solr 4.1, so it might be outdated.
ZooKeeper holds the list of IP address and port of all SolrCloud servers.
(Client side) The CloudSolrServer instance retrieves the list of servers from ZooKeeper.
(Client side) The CloudSolrServer instance chooses one of the SolrCloud servers randomly and sends the query to it. (LBHttpSolrServer may instead choose servers in round-robin fashion.)
(Server side) The SolrCloud server that received the query randomly chooses one replica per shard from the server list and forwards the query to them. (Note that every SolrCloud server holds the server list, which it can receive from ZooKeeper.)
Updates are handled in the same manner, but are also propagated to all replicas.
Note that in SolrCloud the difference between leader and replica is small in this respect: you can send a query or update to any of the servers, and it is automatically forwarded to the right ones.
In short, load balancing is done on both the client side and the server side, so you don't need to worry about it.
A load balancer is needed, and this role is effectively filled by ZooKeeper used in conjunction with SolrCloud.
When you use SolrCloud, you must set up sharding and replication through ZooKeeper, either using the embedded ZooKeeper server that comes bundled with Solr or a stand-alone ZooKeeper ensemble (which is recommended for redundancy).
Then you would use CloudSolrClient, which reads the cluster state from ZooKeeper and forwards your query to the correct shard in your cluster. CloudSolrClient requires the addresses of all your ZooKeeper instances upon instantiation, and your load balancing will be handled appropriately from there.
Please see the following excellent tutorial:
http://www.francelabs.com/blog/tutorial-solrcloud-amazon-ec2/
Solr Docs:
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
The following quotes refer to the latest version of Solr, which at the time of writing was 7.1:
Solrcloud - Distributed Requests
When a Solr node receives a search request, the request is routed
behind the scenes to a replica of a shard that is part of the
collection being searched.
The chosen replica acts as an aggregator: it creates internal requests
to randomly chosen replicas of every shard in the collection,
coordinates the responses, issues any subsequent internal requests as
needed (for example, to refine facets values, or request additional
stored fields), and constructs the final response for the client.
Solrcloud - Read Side Fault Tolerance
In a SolrCloud cluster each individual node load balances read
requests across all the replicas in a collection. You still need a load
balancer on the 'outside' that talks to the cluster, or you need a
smart client which understands how to read and interact with Solr’s
metadata in ZooKeeper and only requests the ZooKeeper ensemble’s
address to start discovering to which nodes it should send requests.
(Solr provides a smart Java SolrJ client called CloudSolrClient.)
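Put another way: without a smart client, any node can accept the request and SolrCloud routes it internally, so an external load balancer only needs to spread requests across all nodes. A sketch (node address and collection name are placeholders):

```shell
# Send the query to any live node; that node acts as the aggregator and
# fans the request out to one replica of each shard on your behalf.
curl "http://anynode:8983/solr/mycollection/select?q=*:*"
```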
I am in a similar situation where I can't rely on CloudSolrServer for load balancing. A possible solution I am evaluating is to use Airbnb's Synapse (http://nerds.airbnb.com/smartstack-service-discovery-cloud/) to dynamically reconfigure an existing HAProxy load balancer based on the status of the SolrCloud cluster, which we get from ZooKeeper.
