Solr cloud sharding

Currently I have a ZooKeeper instance controlling replication on 3 servers. It is the ZooKeeper embedded in Solr, and it works well in my web-based application.
I have a new requirement that calls for sharding in the cloud, and I am not sure how to implement it. Basically, I want to separate the data that only I can update (shard 1) from the data that users can update (shard 2). From time to time I will completely replace the data directory in shard 1, but I don't want to disturb the user-created data in shard 2.
Shard 1 does not need replication, since I can copy the new data to each server whenever I choose to update it; shard 2, however, does need replication.
Currently I run the following command on the server running ZooKeeper:
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=1 -jar start.jar
And the following command on the other 2 non-ZooKeeper servers:
java -Djetty.port=8983 -DzkHost=129.**.30.11:9983 -jar start.jar &
This creates a single-shard collection with a copy on each of the 3 servers.
I think I just need to add 1 static shard to this configuration, but I am not sure of the sequence of commands to accomplish it.
Many thanks

First, you are using ZooKeeper to maintain your shards and leaders/replicas. So if you want one shard with two instances and another shard with only a leader, you will have to modify your command as follows:
1) Provide -DnumShards=2 so that ZooKeeper knows that you need two shards.
2) Specify the -DzkHost parameter for this first Solr instance as well.
java -Dbootstrap_confdir=solr/myApp/conf -Dcollection.configName=myConfig -DzkRun -DnumShards=2 -DzkHost=** -jar start.jar
When you do this you will see some errors on the console, since shard2 has not been created yet.
Now start your other two servers, and you should see shard1 with two servers (leader and replica) and shard2 with only one instance, i.e. the leader.
If you want separate indexes and control over each of them, you will have to create two collections instead of two shards.
Explanation
You have 3 servers, so when you start SolrCloud using ZooKeeper, the following happens:
1) Start the first Solr server along with ZooKeeper, and you get 1 shard for the cloud, shard1.
2) Start the second Solr server and point it to ZooKeeper. Since you declared -DnumShards=2, the cluster state in ZooKeeper shows that one more shard is still needed, so shard2 is created for your collection. By now the admin console shows 2 shards for 1 collection.
3) Now start your 3rd server and point it to ZooKeeper. Both shards already exist, so this server becomes a replica of shard1 instead of a new shard.
So the layout will be:
collection--->shard1--->server1,server3
          --->shard2--->server2
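The start-order behaviour described above can be sketched in a few lines of Python. This is only a simplified model of what the answer describes, not Solr's actual assignment code; the function name assign_nodes and the server names are made up for illustration.

```python
from collections import OrderedDict


def assign_nodes(nodes, num_shards):
    """Simplified model of the start-order behaviour described above.

    The first num_shards nodes each open a new shard; every later node
    is added as a replica, cycling through the shards in order.
    """
    shards = OrderedDict((f"shard{i + 1}", []) for i in range(num_shards))
    shard_names = list(shards)
    for index, node in enumerate(nodes):
        shards[shard_names[index % num_shards]].append(node)
    return dict(shards)


# Hypothetical server names, started in this order.
layout = assign_nodes(["server1", "server2", "server3"], num_shards=2)
for shard, members in layout.items():
    print(shard, "->", ", ".join(members))
```

With three servers and two shards this reproduces the layout in the answer: server1 and server3 end up in shard1, and server2 is alone in shard2.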

Related

SolrCloud on different machines

I have set up a Solr cloud on two machines. I created a collection, collection1, and split it into two shards with 2 replicas. I then added my other Solr machine to the cloud, and in the Solr admin page under cloud->tree->live nodes I can see 4 live nodes, which includes the last Solr instance launched. However, my shards are all running on the same machine, just on different ports; even the replica still shows the leader's address.
Now I want to shift the replica to the newly launched Solr instance, or just put the entire shard 1 or 2 on the other machines.
I have tried searching for this, but nothing gives me the exact commands.

This question is rather old, but for the sake of completeness:
1) In the Solr UI, go to Collections
2) Select your collection
3) Click on the shards on the right side
4) Click add replica
5) Choose your new node as the target node
6) Wait for the replica to be ready (watch in Cloud > Graph)
7) Back in the shards list, delete the old replica
If the old replica was the leader, a leader election will be triggered automatically.
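The same move can probably also be scripted through the Collections API, which has ADDREPLICA and DELETEREPLICA actions. The sketch below only builds the two request URLs; the base URL, node name and replica name are hypothetical placeholders, and the real values should be taken from your own cluster (for example from Cloud > Graph).

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# Hypothetical values: replace with the names shown in your own admin UI.
BASE = "http://localhost:8983/solr"

add_url = collections_api_url(
    BASE, "ADDREPLICA",
    collection="collection1", shard="shard1", node="192.168.1.5:8983_solr",
)
delete_url = collections_api_url(
    BASE, "DELETEREPLICA",
    collection="collection1", shard="shard1", replica="core_node1",
)

print(add_url)     # request this first, then wait for the new replica to become active
print(delete_url)  # request this only after the new replica is ready
```

As with the UI steps, the old replica should only be deleted once the new one is active.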

Multiple Solr environments with one Zookeeper ensemble

We have two Solr environments in production.
One Solr environment holds the latest two years of data. The other holds the last 10 years of archived data.
At the moment, these two Solr environments connect to separate Zookeeper ensembles.
The collections have the same names and configuration in both Solr environments.
We want to reduce the number of servers used for Zookeeper.
Is it feasible to have both production Solr environments connect to one Zookeeper ensemble without overwriting each other's configs?
Or is it mandatory to have a separate Zookeeper ensemble for each Solr environment?

You can use the same Zookeeper ensemble to handle more than one Solr or SolrCloud instance.
However, the data must be kept separate. This is (probably) best done by using the "chroot" functionality in Zookeeper.
Essentially, when you create the "space" in Zookeeper for your Solr instance, you append a /some_thing_unique and keep that in the appropriate config files in Solr - then you should have no trouble.
I haven't experienced moving an existing Solr instance from one Zookeeper to another - I'd guess you would have to take Solr down, change the configs, set up the collection etc. in Zookeeper, and restart Solr. For sure I'd get that all worked out in a test environment before doing it live.
Hope that helps...
Oh, here's how I did it when creating a collection "new" in Zookeeper... You'll note I gave it a name (the name of my collection) as well as noting what version of Solr I was using. This allows me to install later versions of Solr and move my collection to that later version and keep it all in the same Zookeeper ensemble...
/opt/solr/server/scripts/cloud-scripts/zkcli.sh -zkhost 10.196.12.103,10.196.12.104,10.196.22.103 -cmd makepath /myCollectionName_solr6_2
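Once the chroot path exists, each Solr environment is pointed at the shared ensemble with the chroot appended once, after the last host in the connection string. A minimal sketch of building that string follows; the helper name zk_host_string is made up, port 2181 is assumed because it is the usual ZooKeeper client port (the command above omits it), and /archive_solr6_2 is a hypothetical chroot for the second environment.

```python
def zk_host_string(hosts, chroot="", port=2181):
    """Join ZooKeeper hosts into one connection string with an optional chroot.

    The chroot is appended once, after the last host, not after every host.
    """
    if chroot and not chroot.startswith("/"):
        raise ValueError("chroot must start with '/'")
    servers = ",".join(h if ":" in h else f"{h}:{port}" for h in hosts)
    return servers + chroot.rstrip("/")


ENSEMBLE = ["10.196.12.103", "10.196.12.104", "10.196.22.103"]

# One chroot per Solr environment keeps their configs and cluster state apart.
recent = zk_host_string(ENSEMBLE, "/myCollectionName_solr6_2")
archive = zk_host_string(ENSEMBLE, "/archive_solr6_2")  # hypothetical second chroot

print(recent)
print(archive)
```

Each resulting string is what would go into that environment's zkHost setting, so two environments with identically named collections never see each other's znodes.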

SolrCloud DIH implementation with zookeeper

I am going to move my old DataImportHandler configuration from Solr 4.3 to SolrCloud 5.0.
I have already deployed zookeeper on 3 virtual machines, and they all communicate well with each other. I have read about nodes, collections, shards and replicas, but I cannot figure out how to put my old DIH configurations into zookeeper. Currently I have 5 different DIH configurations which I need to put into SolrCloud. Does that mean I have to create 5 nodes or 5 collections? I am confused here.
Thanks in Advance!

There is no need for an extra node per configuration. SolrCloud is built around collections: a collection is sharded across the nodes, and you can create replicas of it.
These are the steps you need for SolrCloud:
1) Run Zookeeper
2) Run the Solr nodes, pointing them at Zookeeper
3) Upload the configuration to Zookeeper
4) Create a collection that refers to the configuration
To upload the configuration to Zookeeper and create the collection:
1) Create a solrlibs directory (/opt/solrlibs in the command below)
2) Copy /opt/solr/server/solr-webapp/webapp/WEB-INF/lib/* to it
3) Copy /opt/solr/server/lib/ext/* to it
4) Run the command:
java -classpath .:/opt/solrlibs/* org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 -confdir /opt/solrconfigs/test/conf -confname testconf
5) Create the collection by requesting the following URL:
http://192.168.1.4:8080/solr/admin/collections?action=CREATE&name=test&collection.configName=testconf&numShards=2&replicationFactor=2
Choose numShards and replicationFactor based on the number of nodes you have.
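As a rough way to check shard and replica counts against the node count: a collection needs numShards x replicationFactor cores in total, and the CREATE call above places at most maxShardsPerNode of them on each node. The sketch below assumes maxShardsPerNode defaults to 1, which I believe is the case for Solr 5; the function names are made up for illustration.

```python
def cores_needed(num_shards, replication_factor):
    """Total number of cores (shard replicas) the collection will create."""
    return num_shards * replication_factor


def fits_cluster(num_nodes, num_shards, replication_factor, max_shards_per_node=1):
    """True if the cluster has enough slots for every core of the collection.

    max_shards_per_node=1 is an assumed default; check your Solr version.
    """
    slots = num_nodes * max_shards_per_node
    return cores_needed(num_shards, replication_factor) <= slots


# The CREATE call above uses numShards=2 and replicationFactor=2.
print(cores_needed(2, 2))                             # 4 cores in total
print(fits_cluster(num_nodes=3, num_shards=2, replication_factor=2))
print(fits_cluster(num_nodes=4, num_shards=2, replication_factor=2))
```

Under these assumptions the example collection needs 4 nodes, or 3 nodes with maxShardsPerNode=2 added to the CREATE URL.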

How to setup solr cloud with 2 shards 1 leader and 1 replica and with zookeeper on different machines?

I'm still confused about setting up a SolrCloud cluster. The ones in the tutorial are set up on localhost, bound to different ports, but I want to know what it would look like using different machines. What do I need? Do I need to extract the downloaded Solr on each machine? Should I set up zookeeper first and set the configuration? Should zookeeper be installed on a separate machine that is not a Solr server?

This tutorial is a lot closer to what you need:
http://solr.pl/en/2013/03/11/solrcloud-howto-2/
If you don't want to run a separate Zookeeper, you can run the embedded Zookeeper on one of your Solr instances by passing -DzkRun to that instance, and -DzkHost to the other instances to point them at the first one.

Loadbalancer and Solrcloud

I am wondering how a load balancer can be set up on top of SolrCloud, or whether a load balancer is needed at all.
If it is needed, do the shard leaders need to be added to the load balancer? What happens if a shard leader changes for some reason? Or is it better to add all machines in the cluster (including replicas) to the load balancer?
If it is not needed, I guess a CNAME needs to point to the SolrCloud cluster using round-robin DNS?
Any advice from actual SolrCloud operational experience would be really appreciated.

Usually SolrCloud is used in combination with ZooKeeper, and the client uses CloudSolrServer to access SolrCloud.
A query is processed in the following flow.
Note that I have only read part of the Solr source code, so much of this is guesswork. Also, what I read was the source code of Solr 4.1, so it might be outdated.
ZooKeeper holds the list of IPAddress:Port entries for all SolrCloud servers.
(Client Side) The instance of CloudSolrServer retrieves the list of servers from ZooKeeper.
(Client Side) The instance of CloudSolrServer chooses one of the SolrCloud servers at random and sends the query to it. (Also LBHttpSolrServer chooses the server in round-robin?)
(Server Side) The SolrCloud server that received the query randomly chooses one replica per shard from the server list and forwards the query to them. (Note that every SolrCloud server holds the server list, which it receives from ZooKeeper.)
An update is handled in the same manner, but it is also propagated to all servers.
Note that in SolrCloud there is little difference between a leader and a replica: we can send a query or update to any of the servers, and it is automatically forwarded to the right ones.
In short, load balancing is done on both the client side and the server side.
So you don't need to worry about it.
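The two levels of selection described above can be illustrated with a toy model. This is not Solr or SolrJ code; the cluster layout, host names and function names are all invented for the example, and the real client and server logic is more involved.

```python
import random

# Hypothetical cluster state, as a client might read it from ZooKeeper.
CLUSTER = {
    "shard1": ["server1:8983", "server3:8983"],
    "shard2": ["server2:8983"],
}


def live_nodes(cluster):
    """All distinct servers that host at least one replica."""
    return sorted({node for replicas in cluster.values() for node in replicas})


def pick_entry_node(cluster, rng):
    """Client side: choose any live server to receive the query."""
    return rng.choice(live_nodes(cluster))


def pick_replicas(cluster, rng):
    """Server side: choose one replica per shard to answer the query."""
    return {shard: rng.choice(replicas) for shard, replicas in cluster.items()}


rng = random.Random(0)  # seeded only so repeated runs pick the same servers
entry = pick_entry_node(CLUSTER, rng)
plan = pick_replicas(CLUSTER, rng)
print("client sends the query to:", entry)
for shard, replica in plan.items():
    print(shard, "is answered by", replica)
```

Whichever server the client picks, every shard is still covered by exactly one replica, which is why the query can be sent to any node.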

Load balancing is needed, and with SolrCloud it is driven by the cluster state kept in Zookeeper.
When you use SolrCloud you must set up sharding and replication through Zookeeper, using either the embedded Zookeeper server that comes bundled with Solr or a stand-alone Zookeeper ensemble (which is recommended for redundancy).
Then you would use CloudSolrClient (called CloudSolrServer in older SolrJ versions) to send your queries: it reads the cluster state from Zookeeper and sends each query to a live Solr node, which routes it to the correct shards in your cluster. The client requires the addresses of your Zookeeper instances upon instantiation, and load balancing is handled from there.
Please see the following excellent tutorial:
http://www.francelabs.com/blog/tutorial-solrcloud-amazon-ec2/
Solr Docs:
https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble

This quote refers to the latest version of Solr, which at the time of writing was 7.1:
Solrcloud - Distributed Requests
When a Solr node receives a search request, the request is routed
behind the scenes to a replica of a shard that is part of the
collection being searched.
The chosen replica acts as an aggregator: it creates internal requests
to randomly chosen replicas of every shard in the collection,
coordinates the responses, issues any subsequent internal requests as
needed (for example, to refine facets values, or request additional
stored fields), and constructs the final response for the client.
Solrcloud - Read Side Fault Tolerance
In a SolrCloud cluster each individual node load balances read
requests across all the replicas in collection. You still need a load
balancer on the 'outside' that talks to the cluster, or you need a
smart client which understands how to read and interact with Solr’s
metadata in ZooKeeper and only requests the ZooKeeper ensemble’s
address to start discovering to which nodes it should send requests.
(Solr provides a smart Java SolrJ client called CloudSolrClient.)

I am in a similar situation where I can't rely on CloudSolrServer for load balancing. A possible solution that I am evaluating is to use Airbnb's synapse (http://nerds.airbnb.com/smartstack-service-discovery-cloud/) to dynamically reconfigure an existing haproxy load balancer based on the status of the SolrCloud cluster that we get from Zookeeper.
