SolrCloud on different machines

I have set up a SolrCloud cluster on two machines. I created a collection, collection1, and split it into two shards with 2 replicas. I then added my other Solr machine to the cloud, and in the Solr admin page under cloud->tree->live nodes I can see 4 live nodes, including the most recently launched Solr instance. However, my shards are all running on the same machine, just on different ports, and even the replica still shows the leader's address.
Now I want to move the replica to the newly launched Solr instance, or just put the entire shard 1 or 2 on the other machine.
I have tried searching for this, but nothing gives me the exact commands.

This question is rather old, but for the sake of completeness:
1) In the Solr UI, go to Collections.
2) Select your collection.
3) Click on the shards on the right side.
4) Click "add replica".
5) Choose your new node as the target node.
6) Wait for the replica to be ready (watch in Cloud > Graph).
7) Back in the shards list, delete the old replica.
If the old replica was the leader, a leader election will be triggered automatically.
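The same steps can also be scripted against the Collections API (ADDREPLICA followed by DELETEREPLICA). The following is only a sketch that builds the request URLs and sends nothing. The base URL, the node name and the replica name core_node1 are placeholders, so check the real values under live_nodes and in the cluster state before using them.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL; nothing is sent over the network."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# Hypothetical values: adjust host, node name and replica name to your cluster.
BASE = "http://localhost:8983/solr"

# Step 1: add a replica of shard1 on the new node.
add_url = collections_api_url(
    BASE, "ADDREPLICA",
    collection="collection1", shard="shard1",
    node="192.168.1.2:8983_solr",  # node name as listed under live_nodes
)

# Step 2: once the new replica is active, remove the old one.
delete_url = collections_api_url(
    BASE, "DELETEREPLICA",
    collection="collection1", shard="shard1",
    replica="core_node1",  # replica name as shown in the cluster state
)

print(add_url)
print(delete_url)
```

Each URL can then be opened in a browser or requested with any HTTP client. Wait for the new replica to become active before issuing the delete.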

Related

SolrCloud - Updates to schema or dataConfig

We have a SolrCloud managed by ZooKeeper. One concern that we have is with updating the schema or dataConfig on the fly. All the changes that we are planning to make are on the indexing server node of the SolrCloud. Once the changes to the schema or dataConfig are made, we do a full dataimport.
The concern is that the replication of the new indexes to the slave nodes in the cloud would not happen immediately, but only after the replication interval. Also, replication will happen at different times on the different slave nodes, which might cause inconsistent results.
For example:
The index replication interval is 5 mins.
Slave node A started at 10:00 => next index replication would be at 10:05.
Slave node B started at 10:03 => next index replication would be at 10:08.
If we make changes to the schema in the indexing server and re-index the results at 10:04, then the results of this change would be available on node A at 10:05, but in node B only at 10:08. Requests made to the SolrCloud between 10:05 and 10:08 would have inconsistent results depending on which slave node the request gets redirected to.
Please let me know if there is any way to make the results more consistent.
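The timing in the example above can be checked with a short calculation. This is only a sketch of the arithmetic in the question, assuming each slave polls at a fixed interval counted from its own start time. The helper name first_poll_after and the calendar date are made up for illustration.

```python
from datetime import datetime, timedelta


def first_poll_after(start, interval, event):
    """First poll strictly after `event` for a node polling every `interval` since `start`."""
    polls_done = (event - start) // interval  # polls completed up to the event
    return start + (polls_done + 1) * interval


interval = timedelta(minutes=5)
reindex = datetime(2020, 1, 1, 10, 4)          # re-index finishes at 10:04
node_a_start = datetime(2020, 1, 1, 10, 0)     # slave node A started at 10:00
node_b_start = datetime(2020, 1, 1, 10, 3)     # slave node B started at 10:03

a_sync = first_poll_after(node_a_start, interval, reindex)
b_sync = first_poll_after(node_b_start, interval, reindex)

# The nodes disagree between the earlier and the later sync time.
window = abs(b_sync - a_sync)
print(a_sync.time(), b_sync.time(), window)
```

With the numbers from the question, node A picks up the change at 10:05 and node B at 10:08, so the two nodes can return different results for 3 minutes.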
@Wish, what you are describing is not the behavior of SolrCloud.
In SolrCloud, indexing requests are routed to the shard leaders, and each leader sends copies of the updates to all of its replicas.
If at any point ZooKeeper identifies that a replica is not in sync with its leader, that replica is put into recovering mode. In this mode it will not serve any requests, including queries.
P.S.: In SolrCloud, configs are maintained in ZooKeeper and not at the node level.
I suspect you are confusing SolrCloud with master/slave mode. Please confirm which setup you are using.

How to correctly configure SolrCloud replicas on a two-node / two shards cluster

I'm new to SolrCloud (and Solr).
I need your help understanding collection shard and replicas.
I have two SolrCloud instances running on two different servers.
I have a collection, mycol, with two shards. Each SolrCloud instance hosts one shard.
Because I'm running two nodes, I am thinking of adding redundancy. I have some questions about it:
First way:
add one new core on each SolrCloud instance, assigning it to mycol shard2 on the instance hosting mycol shard1, and to mycol shard1 on the instance hosting mycol shard2. The new cores will become replicas, and each node will hold the complete collection in case of hardware failure.
Second way:
add two SolrCloud instances on two more servers. They will become replicas automatically.
Third way:
add two SolrCloud instances, this time one on each existing server. They will become replicas automatically.
I'm driving myself crazy trying to understand which way is correct.
Can you help me?
Thank you
Regards
Giova
It's a bit hard to dissect what you are looking for based on your question; however, the standard practice is to deploy two or more SolrCloud nodes. Make sure they can talk to each other and to ZooKeeper. Once that is set up, you can create your collections with the numShards and replicationFactor parameters. These parameters determine how many shards are created and how many replicas are created for each shard. Shards are used to break up the collection into smaller chunks; shards by themselves don't provide any redundancy. Shard replicas are exact copies of your shards, and these are what actually provide redundancy.
Once you send the collection creation request to any of the nodes in the SolrCloud cluster, your collection will be created. The replicas are created on the second server to provide redundancy if the first one goes down. At this point, you should be able to query any replica, and SolrCloud will automatically route the query internally and return the results.
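As an illustration of the numShards and replicationFactor parameters, here is a sketch that builds a Collections API CREATE request for the two-node, two-shard case. It only constructs the URL. The host, the collection name mycol and the config name myConfig are assumptions, not values from the original answer.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, replication_factor, config_name):
    """Build a Collections API CREATE URL; nothing is sent over the network."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,                  # how many pieces the index is split into
        "replicationFactor": replication_factor,  # how many copies of each shard exist
        "collection.configName": config_name,     # config set already uploaded to ZooKeeper
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Hypothetical host and names.
url = create_collection_url("http://localhost:8983/solr", "mycol", 2, 2, "myConfig")
print(url)
```

With 2 shards and a replication factor of 2, the cluster holds 4 cores in total, so each of the two nodes ends up with a copy of both shards. Depending on the Solr version, an additional maxShardsPerNode parameter may be needed to allow two cores of the same collection on one node.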

Solr cloud sharding

Currently I have a ZooKeeper instance controlling replication on 3 servers. It is the ZooKeeper that comes integrated with Solr. It works well in my web-based application.
I have a new requirement which will require sharding in the cloud, and I am not sure how to implement it. Basically I want to separate the data which can only be updated by me, shard 1, from the data that users can update, shard 2. From time to time I will be completely replacing the data directory in shard 1, but I don't want to disturb the user-created data in shard 2.
Shard 1 does not need replication, since I can copy the new data to each server when I choose to update it; however, shard 2 does need replication.
Currently I run the following command on the server running ZooKeeper:
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=1 -jar start.jar
And the following command on the other 2 non-ZooKeeper servers:
java -Djetty-port=8983 -DzkHost=129.**.30.11:9983 -jar start.jar&
This gives me a single shard with 3 Solr instances.
I think I just need to add 1 static shard to this configuration; however, I am not sure of the sequence of commands needed to accomplish it.
Many thanks
Firstly, you are using ZooKeeper to maintain your shards and leaders/replicas. So if you want one shard with two instances and another shard with only a leader, you will have to modify your command as follows:
1) Provide -DnumShards=2 so that ZooKeeper knows that you need two shards.
2) Specify the -DzkHost parameter for this first Solr instance as well.
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=2 -DzkHost=** -jar start.jar
When you do this you will see some errors on the console, since shard2 has not been created yet.
Now start your other two servers, and you should see shard1 with two servers (leader and replica), while shard2 will have only one instance, i.e. the leader.
If you want separate indexes and control over those indexes, you will have to create two collections instead of two shards.
Explanation
You have 3 servers, so when you start SolrCloud using ZooKeeper, the following happens:
1) Start the first Solr server along with ZooKeeper, and you get 1 shard for SolrCloud, shard1.
2) Start the second Solr server and point it to ZooKeeper. Since you have declared -DnumShards=2, ZooKeeper sees that one more shard is needed, so shard2 is created for your collection. By now you will be able to see 2 shards for 1 collection in your admin console.
3) Now start your 3rd server and point it to ZooKeeper. ZooKeeper sees that both shards already exist, so the server becomes a replica for shard1 instead of a new shard.
So it will look like this:
collection--->shard1--->server1,server3
          --->shard2--->server2
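The assignment order described above can be modelled with a few lines of code. This is a simplified model of the behavior the answer describes (each newly started node joins the shard with the fewest members), not Solr's actual implementation. The function name assign_nodes is made up for illustration.

```python
def assign_nodes(nodes, num_shards):
    """Simplified model: each starting node joins the shard with the fewest members."""
    shards = {f"shard{i}": [] for i in range(1, num_shards + 1)}
    for node in nodes:
        # min() returns the first shard among ties, so shard1 is filled before shard2.
        target = min(shards, key=lambda name: len(shards[name]))
        shards[target].append(node)
    return shards


# Servers in the order they are started, with -DnumShards=2.
layout = assign_nodes(["server1", "server2", "server3"], num_shards=2)
for shard, members in layout.items():
    print(shard, members)
```

Under this model, server1 and server3 end up in shard1 and server2 in shard2, which matches the layout shown above.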

Loadbalancer and Solrcloud

I am wondering how a load balancer can be set up on top of SolrCloud, or whether a load balancer is needed at all.
If one is needed, do the shard leaders need to be added to the load balancer? What happens if the shard leader changes for some reason? Or is it better to add all machines in the cluster (including replicas) to the load balancer?
If one is not needed, I guess a CNAME needs to point to the SolrCloud cluster, and it should be round-robin DNS?
Any advice from actual SolrCloud operational experience would be really appreciated.
Usually SolrCloud is used in combination with ZooKeeper, and the client uses CloudSolrServer to access SolrCloud.
A query follows the flow below.
Note that I have only read part of the Solr source code, so there is a lot of guesswork here. What I read was the Solr 4.1 source, so it may also be outdated.
1) ZooKeeper holds the list of IPAddress:Port entries for all SolrCloud servers.
2) (Client side) The CloudSolrServer instance retrieves the list of servers from ZooKeeper.
3) (Client side) The CloudSolrServer instance picks one SolrCloud server at random and sends the query to it. (Also, LBHttpSolrServer chooses the server in round-robin fashion?)
4) (Server side) The SolrCloud server that received the query picks, at random, one replica per shard from the server list and forwards the query to them. (Note that every SolrCloud server holds the server list, which it receives from ZooKeeper.)
An update is handled in the same way, but it is also propagated to all servers.
Note that in SolrCloud there is little difference between the leader and a replica: we can send a query or update to any of the servers, and it is automatically forwarded to the others.
In short, load balancing is done on both the client side and the server side.
So you don't need to worry about it.
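To make the client-side part of this flow concrete, here is a toy round-robin picker that skips nodes that are not currently live. It is not SolrJ code and does not talk to ZooKeeper. The class name, the node addresses and the hard-coded live set are assumptions for illustration; in a real setup the live set would come from ZooKeeper.

```python
from itertools import cycle


class RoundRobinPicker:
    """Toy client-side balancer: rotate over known nodes, skipping dead ones."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._cycle = cycle(self._nodes)

    def pick(self, live_nodes):
        # Try each known node at most once per call.
        for _ in range(len(self._nodes)):
            node = next(self._cycle)
            if node in live_nodes:
                return node
        raise RuntimeError("no live Solr node available")


# Hypothetical node addresses; the second node is treated as down.
picker = RoundRobinPicker(["10.0.0.1:8983", "10.0.0.2:8983", "10.0.0.3:8983"])
live = {"10.0.0.1:8983", "10.0.0.3:8983"}
picks = [picker.pick(live) for _ in range(4)]
print(picks)
```

Because any SolrCloud node can forward a query to the right replicas, the picker does not need to know which node is a leader; it only needs to avoid nodes that are down.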
A load balancer is needed, and it would be provided by ZooKeeper used in conjunction with SolrCloud.
When you use SolrCloud you must set up sharding and replication through ZooKeeper, either using the embedded ZooKeeper server that comes bundled with Solr or using a stand-alone ZooKeeper ensemble (which is recommended for redundancy).
Then you would use CloudSolrClient, which reads the cluster state from ZooKeeper and sends your queries to an appropriate node in your cluster. CloudSolrClient requires the addresses of your ZooKeeper instances upon instantiation, and load balancing is handled from there.
Please see the following excellent tutorial:
http://www.francelabs.com/blog/tutorial-solrcloud-amazon-ec2/
Solr Docs:
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble
The following quotes refer to the latest version of Solr, which at the time of writing was 7.1.
Solrcloud - Distributed Requests
When a Solr node receives a search request, the request is routed
behind the scenes to a replica of a shard that is part of the
collection being searched.
The chosen replica acts as an aggregator: it creates internal requests
to randomly chosen replicas of every shard in the collection,
coordinates the responses, issues any subsequent internal requests as
needed (for example, to refine facets values, or request additional
stored fields), and constructs the final response for the client.
Solrcloud - Read Side Fault Tolerance
In a SolrCloud cluster each individual node load balances read
requests across all the replicas in collection. You still need a load
balancer on the 'outside' that talks to the cluster, or you need a
smart client which understands how to read and interact with Solr’s
metadata in ZooKeeper and only requests the ZooKeeper ensemble’s
address to start discovering to which nodes it should send requests.
(Solr provides a smart Java SolrJ client called CloudSolrClient.)
I am in a similar situation where I can't rely on CloudSolrServer for load balancing. A possible solution that I am evaluating is to use Airbnb's synapse (http://nerds.airbnb.com/smartstack-service-discovery-cloud/) to dynamically reconfigure an existing HAProxy load balancer based on the status of the SolrCloud cluster that we get from ZooKeeper.

Add shard replica in SolrCloud

Every time I start a new node in the Solr cluster, a shard or a shard replica is assigned to it automatically.
How can I specify which shard or shards should be replicated on this new node?
I'm trying to get to a configuration with 3 shards and 6 servers, one for each shard master and 3 for the replicas, with shard1 having 3 replicas, one on each of the replica servers, while shard2 and shard3 have only one.
How can this be achieved?
You can go to Core Admin in the SolrCloud web GUI, unload the core that was automatically assigned to that node, and then create a new core, specifying the collection and the shard you want it to be assigned to. After you create that core, you should see in the Cloud view that your node has been added to that specific shard and, after some time, that all documents of that shard have been synchronized to your node.
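The two GUI actions above correspond to the CoreAdmin API actions UNLOAD and CREATE. The sketch below only builds the request URLs for the new node. The host and the core names are hypothetical, and on newer Solr versions the Collections API ADDREPLICA action is the more usual way to place a replica on a chosen node.

```python
from urllib.parse import urlencode


def core_admin_url(base_url, action, **params):
    """Build a CoreAdmin API URL; nothing is sent over the network."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/cores?{query}"


# Hypothetical address of the newly started node and hypothetical core names.
NEW_NODE = "http://10.0.0.6:8983/solr"

# Unload the core that was assigned to this node automatically.
unload_url = core_admin_url(NEW_NODE, "UNLOAD", core="collection1")

# Create a core on this node that joins shard1 of the collection.
create_url = core_admin_url(
    NEW_NODE, "CREATE",
    name="mycol_shard1_replica3",
    collection="mycol",
    shard="shard1",
)

print(unload_url)
print(create_url)
```

Both requests go to the node that should host the replica, since the CoreAdmin API acts on the cores of the node it is called on.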

Resources