How to scale and distribute SolrCloud nodes

I initially set up SolrCloud with two Solr nodes, as shown below.
I now have to add a new Solr node, i.e. an additional shard with the same number of replicas as the existing cluster nodes.
I have already gone through the Solr scaling and distribution guide: https://cwiki.apache.org/confluence/display/solr/Introduction+to+Scaling+and+Distribution
But that link only covers scaling for Solr standalone mode. That's the sad part.
I started the SolrCloud nodes using the following command:
./bin/solr start -c -s server/solr -p 8983 -z [zkip's] -noprompt
Kindly share the command for creating the new shard when adding a new node.
Thanks in advance.

From my knowledge, I am sharing this answer.
Adding a new SolrCloud/cluster node means getting a copy of all the shards onto the new box (through replication of all the shards).
SHARD : The actual data is split evenly across the number of shards we create (while creating the collection).
So while adding the new SolrCloud node, make sure that every shard is available on the new node (recommended), or as many as required.
Naming Standards of SOLR CORE in SOLR CLOUD MODE/ CLUSTER MODE
Syntax:
<COLLECTION_NAME>_shard<SHARD_NUMBER>_replica<REPLICA_NUMBER>
Example
CORE NAME : enter_2_shard1_replica1
COLLECTION_NAME : enter_2
SHARD_NUMBER : 1
REPLICA_NUMBER : 1
STEPS FOR ADDING THE NEW SOLR CLOUD/CLUSTER NODE
Create a core with the same collection name as used on the existing SolrCloud nodes.
Notes while creating a new core on the new node
Example :
enter_2_shard1_replica1
enter_2_shard1_replica2
In the above example, the maximum replica number for the corresponding shard is 2 (enter_2_shard1_replica2).
So while creating the core on the new node, set the replica number to 3 ("enter_2_shard1_replica3") so that Solr treats it as the third replica of the corresponding shard.
Note : replica numbers should increase in increments of 1.
Give it time to replicate the data from the existing nodes to the new node.
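The manual procedure above corresponds roughly to a CoreAdmin CREATE call on the new node; on Solr 4.8 and later the Collections API ADDREPLICA action is the safer equivalent. A sketch, assuming a running cluster (host, port, and names are taken from this answer's example, not your setup):

```shell
# Option 1: manually create the third replica as a core on the new node
# (core name follows the <collection>_shard<N>_replica<M> convention)
curl "http://newnode:8983/solr/admin/cores?action=CREATE&name=enter_2_shard1_replica3&collection=enter_2&shard=shard1"

# Option 2 (preferred): let the Collections API place the new replica
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=enter_2&shard=shard1&node=newnode:8983_solr"
```

Both calls only work against a live SolrCloud cluster, so they are shown for illustration.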

Related

SolrNet support for failover scenario with SolrCloud cluster

Does SolrNet have built-in support for fail-over scenarios with SolrCloud?
I have 3 nodes in a SolrCloud cluster, with an external ZooKeeper ensemble. I use the SolrNet client to communicate with Solr, but it obviously connects to just one Solr node; when that Solr node goes down, I need to use another node.
I am currently using the ZooKeeperNetEx library to get the list of alive nodes from /live_nodes, but I am wondering whether that is overkill and SolrNet is already SolrCloud-aware and will automatically switch to another Solr node if the current one dies.
According to Basic usage of cloud mode in the SolrNet documentation, you use a SolrCloudStateProvider with the ZooKeeper connection string when creating your instance:
var zookeeperConnectionString = "127.0.0.1:2181";
var collectionName = "collection_name";
Startup.Init<Product>(new SolrCloudStateProvider(zookeeperConnectionString), collectionName);
I'm guessing the connection string follows the regular ZooKeeper format, which means you can give it a list of ZooKeeper instances by separating the hosts/IPs with commas (192.168.0.1:2181,192.168.0.2:2181,...).

Solr AutoScaling - Add replicas on new nodes

Using Solr version 7.3.1
Starting with 3 nodes:
I have created a collection like this:
wget "localhost:8983/solr/admin/collections?action=CREATE&autoAddReplicas=true&collection.configName=my_col_config&maxShardsPerNode=1&name=my_col&numShards=1&replicationFactor=3&router.name=compositeId&wt=json" -O /dev/null
In this way I have a replica on each node.
GOAL:
Each shard should add a replica to new nodes joining the cluster.
When a node is shut down, its replicas should just go away.
Only one replica for each shard on each node.
I know that it should be possible with the new AutoScaling API, but I am having a hard time finding the right syntax. The API is very new and all I can find is the documentation. It's not bad, but I am missing more examples.
This is how it looks today: there are many small shards, each with a replication factor that matches the number of nodes. Right now there are 3 nodes.
This video was uploaded yesterday (2018-06-13), and around 30 min. into the video there is an example of the Solr HttpTriggerListener, which can be used to call any kind of service, for example an AWS Lambda, to add new nodes.
The short answer is that your goals are not achievable today (as of Solr 7.4).
The NodeAddedTrigger only moves replicas from other nodes to the new node in an attempt to balance the cluster. It does not support adding new replicas. I have opened SOLR-12715 to add this feature.
Similarly, the NodeLostTrigger adds new replicas on other nodes to replace the ones on the lost node. It, too, has no support for merely deleting replicas from cluster state. I have opened SOLR-12716 to address that issue. I hope to release both the enhancements in Solr 7.5.
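For context, this is roughly how a nodeAdded trigger is registered through the Autoscaling API in Solr 7.x (the trigger name is arbitrary, and the command assumes a node running on localhost:8983):

```shell
curl -s -X POST 'http://localhost:8983/solr/admin/autoscaling' \
  -H 'Content-Type: application/json' -d '{
  "set-trigger": {
    "name": "node_added_trigger",
    "event": "nodeAdded",
    "waitFor": "5s",
    "enabled": true
  }
}'
```

As explained above, the default plan actions for this trigger only move existing replicas to the new node; they do not add new ones.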
As for the third goal:
Only one replica for each shard on each node.
To achieve this, a policy rule as given in the "Limit Replica Placement" example should suffice. However, looking at the screenshot you've posted, you actually mean a (collection, shard) pair, which is unsupported today. You'd need a policy rule like the following (it does not work because collection:#EACH is not supported):
{"replica": "<2", "collection": "#EACH", "shard": "#EACH", "node": "#ANY"}
I have opened SOLR-12717 to add this feature.
Thank you for these excellent use-cases. I'd recommend asking questions such as these on the solr-user mailing list, because not many Solr developers frequent Stack Overflow. I only found this question because it was posted on the docker-solr project.

Solrcloud - remove a node

I have a Solrcloud setup, which runs 3 ZKs and 3 Solrs (version 4.10.3). I would like to take out one of the Solr servers completely from this setup so that I only have 2 Solrs and 3 ZKs.
I've tried googling, and the only results I can find are about removing replicas or collections, not about removing a node.
Any idea how I can remove a node in SolrCloud?
You can first remove the shards and replicas on that node, using the command below:
<SOLR_URL>/solr/admin/collections?action=DELETEREPLICA&collection=<Collection_name>&shard=<shard_name>&replica=<replica_node_name>
The shard name (shard) and replica node name (coreNodeName) values can be found in the core.properties file located under /solr/server/solr/.
For example:
http://localhost:9501/solr/admin/collections?action=DELETEREPLICA&collection=wblib&shard=shard1&replica=core_node1
Then you can delete the installation files for the removed Solr node.
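For reference, the core.properties file mentioned above typically contains entries like these (values are illustrative, matching the example URL):

```properties
name=wblib_shard1_replica1
collection=wblib
shard=shard1
coreNodeName=core_node1
```

The shard and coreNodeName values are what the DELETEREPLICA call expects.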

Why do my Solr shards have duplicate docs

I use Solr and HBase Lily to store data that comes from HBase. My Solr version is 5.3.0. I created a Solr cluster with two shards. When I put data into HBase, I find that the Solr collection has some duplicate docs on different shards. The method I used to create the collection's shards is:
On one Solr node I used the Solr core admin page to add a core named
HealthProfile1 for the collection HealthProfile. Then on another node
I added a core named HealthProfile2 for the same collection.
My question is: is my shard-creation method correct?
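For comparison, in cloud mode a sharded collection is normally created through the Collections API rather than by adding cores by hand, which lets Solr assign a hash range to each shard so documents are routed to exactly one shard. A sketch (config name is an assumption):

```shell
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=HealthProfile&numShards=2&replicationFactor=1&collection.configName=myconfig"
```

Cores created manually through the core admin page do not get hash ranges assigned, which is one way duplicates can appear across shards.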

Solr data being indexed on all servers [sharding mode]

I created three SolrCloud instances in order to shard data across them and query from all three. I created them using the commands below.
CMD:
solr.cmd start -c -s Node1 -p 8983
solr.cmd start -c -s Node2 -z localhost:9983 -p 8984
solr.cmd start -c -s Node3 -z localhost:9983 -p 8985
Then I created a collection which uses three shards and has a replication factor of 1.
CMD1:
solr.cmd create_collection -c tests -shards 3 -replicationFactor 1
Then I index data into the collection using post jar using following command.
CMD2:
java -jar post.jar *.xml
There were 32 XML files in that location.
As per my understanding, the data would be split and indexed across the three SolrCloud instances.
But what happened was that 32 documents were indexed on all three instances.
I confirmed this by using following URLs
http://localhost:8984/solr/tests/select?indent=on&q=*:*&wt=json
http://localhost:8985/solr/tests/select?indent=on&q=*:*&wt=json
http://localhost:8983/solr/tests/select?indent=on&q=*:*&wt=json
Everything returned the same number of records.
My understanding is that the documents would be split and indexed across the three instances.
Since I want to index 3 billion documents into Solr, and there is a 2-billion-document hard limit per Solr core, I want to make sure they are split and indexed across the three Solr instances.
Let me know if I have made any mistakes.
Versions.
Solr =6.1.0
Windows= 7
When you're querying /solr/tests, you're querying the tests collection. Behind the scenes Solr is fetching all the documents in that collection and returning them for you, from all the shards added to the collection.
You've stumbled upon the idea behind a collection in Solr - regardless of which server you're querying, Solr is returning the result of the collection to you, including all documents added to that collection. The only difference in the three requests you're making, is which server is responsible for returning the result to the client and making the requests to fetch results from the other cores.
If you want to explore the contents of a single core, these cores are named collectionname_shardX_replicaY. You can examine the current cluster state if you download the json file from the Zookeeper instance - this will show you exactly which shards are located where.
You can also use the CoreAdmin API on a single node to examine which cores have been placed on that server. Be aware that you do NOT want to do any mutable actions through the CoreAdmin API when you're running in cloud mode.
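To verify how many documents a single shard actually holds, you can query one core directly and disable distributed search with distrib=false (the core name below follows the collectionname_shardX_replicaY convention and is an example):

```shell
curl "http://localhost:8983/solr/tests_shard1_replica1/select?q=*:*&rows=0&distrib=false&wt=json"
```

The numFound in the response is then the document count of that shard alone, which should be roughly a third of the total if sharding is working as expected.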
