How to resize existing SolrCloud with new shards

I've set up a SolrCloud structure with 3 shards. Each shard consists of 2 nodes: one is the leader and the other is a replica. Each Solr instance (node) runs on a separate machine. Now I need to add more machines as my data volume increases. But if I add a new node without creating a new shard, it will simply add more replicas of the existing shards. I want to create more shards on the new machines and have the data distributed among them.
For testing purposes, I created a SolrCloud with one shard (2 nodes) and tried SPLITSHARD with Solr 4.5.1. Afterwards I see a total of 3 shards (shard1, shard1_0 and shard1_1) in the admin window, and it now shows 6 nodes in total.
In the background, it has created the following folders under each node.
node1 :
solr/collection1
solr/collection1_shard1_0_replica1
solr/collection1_shard1_1_replica1
node2 :
solr/collection1
solr/collection1_shard1_0_replica2
solr/collection1_shard1_1_replica2
That means it created 2 new cores under each instance, but I want to run a single core on each machine.
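For reference, the SPLITSHARD operation tried above is issued through the Collections API. A minimal sketch in Python, assuming the default localhost:8983 endpoint and the collection1/shard1 names from the question:

```python
# Minimal sketch of the SPLITSHARD call described above (Collections API).
# Host/port and names are assumptions taken from the question.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "SPLITSHARD",
        "collection": "collection1",
        "shard": "shard1",   # the shard being split into shard1_0 / shard1_1
        "wt": "json",
    },
    timeout=600,             # splitting a large shard can take a while
)
print(resp.json())
```

Note that, as the question observes, the split creates the two sub-shard cores on the same nodes that hosted the original shard; it does not by itself move data onto new machines.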

We have run into the same problem too. The only solution I can see for the current version of Solr is to add replicas on the new machines, wait for the replication to finish, and then delete the original replicas.
In addition, if you split only one shard in the collection, the cluster will not be uniformly distributed, so you have to split every shard by the same factor.
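A rough sketch of that add-replica-then-delete workaround, assuming a Solr release where the Collections API supports ADDREPLICA and DELETEREPLICA (later 4.x and newer); the host names, collection name and replica core name are placeholders:

```python
# Sketch of the workaround above: add a replica of a shard on the new machine,
# wait until it is active, then remove the replica on the old machine.
# All names here are placeholders, not taken from a real cluster.
import requests

ADMIN = "http://newmachine:8983/solr/admin/collections"

# 1. Add a replica of shard1 on the new node
requests.get(ADMIN, params={
    "action": "ADDREPLICA",
    "collection": "collection1",
    "shard": "shard1",
    "node": "newmachine:8983_solr",
    "wt": "json",
})

# 2. ...wait until CLUSTERSTATUS reports the new replica as "active"...

# 3. Delete the old replica (use the core_node name CLUSTERSTATUS reports)
requests.get(ADMIN, params={
    "action": "DELETEREPLICA",
    "collection": "collection1",
    "shard": "shard1",
    "replica": "core_node1",
    "wt": "json",
})
```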

Once you set the numShards property when creating a collection, what you are asking for becomes impossible. The other answers only describe splitting the original shards into a larger number of shards, but the data will not be distributed evenly. For example, suppose a collection starts with 2 shards, S1 and S2. After splitting S1 you end up with S11, S12 and S2, where S2 holds much more data than S11 or S12. What you want, I think, is for the data in S1 and S2 to be cut evenly across S11, S12 and S2, with those shards running on different nodes on different machines. That is NOT possible in current Solr (even v6), as far as I know.
What you want is what I and many other SolrCloud users want, and I think it is a very reasonable expectation. Let's hope a future version of SolrCloud provides this functionality.

Related

Why do we need sharding in Solr and what is the benefit of it?

I am a beginner in Solr and I have no idea how to do sharding, so my question is: why do we need sharding when we create a collection, and what is the benefit of it? What happens if I don't create shards?
Sharding allows us to have indexes that span more than a single instance of Solr - i.e. multiple servers or multiple running instances of Solr (which could be useful under specific conditions because of some single thread limitations in Lucene, as well as some memory usage patterns).
If we didn't have sharding, the total size of your index would be limited to whatever you could fit on a single server. Sharding means that one part of the index (for example half of all your documents) will be located on one server, while the other half will be located on the other server. When you query Solr for any results, each shard will receive the query, and the results will then be merged before being returned to you.
There are a few features that won't work properly when an index is sharded (and scores are calculated locally on each server, which is why you usually want your documents spread as evenly as possible), but in those cases where sharding is useful (and it very often is!), there really isn't any better solution.
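To illustrate the fan-out-and-merge described above, here is a hedged sketch of an old-style manually distributed query, where the node receiving the request queries every shard listed in the shards parameter and merges the results; the host and core names are made up:

```python
# Illustration of a distributed query across two shards: the receiving node
# fans the request out to every shard in "shards" and merges the results.
# Host and core names are made up for the example.
import requests

resp = requests.get(
    "http://server1:8983/solr/core1/select",
    params={
        "q": "*:*",
        "shards": "server1:8983/solr/core1,server2:8983/solr/core1",
        "wt": "json",
    },
)
print(resp.json()["response"]["numFound"])  # total hits across both shards
```

In SolrCloud the same fan-out happens automatically for a sharded collection, so you normally don't pass the shards parameter yourself.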
Sharding helps us split a collection's data across multiple cores.
e.g. If you have a collection named Employee with 1 shard and 2 replicas,
then assuming there are 100 records,
Employee_shard1_replica1 will have 100 records and
Employee_shard1_replica2 will have 100 records.
The replica is a full copy of the records in another core, which gives you load balancing as well as fault tolerance.
Now, e.g. 2: if you have the same collection Employee with 2 shards and 2 replicas, the data will be split across both shards.
Employee_shard1_replica1 will have 50 records
Employee_shard1_replica2 will have 50 records
Employee_shard2_replica1 will have 50 records
Employee_shard2_replica2 will have 50 records
Note: the shard 1 replicas have the same data, and the shard 2 replicas have the same data.
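A hedged sketch of creating that second Employee layout (2 shards, 2 replicas) via the Collections API; the host/port and the config set name are assumptions:

```python
# Sketch of creating the Employee collection from the example above with
# 2 shards and 2 replicas. Endpoint and config set name are assumptions.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "Employee",
        "numShards": 2,           # documents are hash-split across shard1 and shard2
        "replicationFactor": 2,   # each shard gets two copies
        "collection.configName": "_default",
        "wt": "json",
    },
)
print(resp.json())
```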

Solr storage handling

I have a six-node Solr cluster and every node has 200GB of storage. We created one collection with two shards.
I'd like to know what will happen if my documents reach 400GB (node1: 200GB, node2: 200GB). Will Solr automatically use another free node from my cluster?
What will happen if my documents reach 400GB (node1: 200GB, node2: 200GB)?
Ans: I am not sure exactly what error you will get, but in production you should try not to get into this situation. To avoid/handle such scenarios there are monitoring and autoscaling trigger APIs.
Will Solr automatically use another free node from my cluster?
Ans: No, extra shards will not be added automatically. However, whenever you observe that search is getting slow, or that Solr is approaching the physical limits of the machines, you should go for SPLITSHARD.
Ultimately you can handle this with autoscaling triggers. That is, you can set autoscaling triggers that detect when a shard crosses specified limits on the number of documents or the size of the index, and when a limit is reached the trigger can call SPLITSHARD (see the sketch after the quoted documentation below).
This link mentions
This trigger can be used for monitoring the size of collection shards,
measured either by the number of documents in a shard or the physical
size of the shard’s index in bytes.
When either of the upper thresholds is exceeded the trigger will
generate an event with a (configurable) requested operation to perform
on the offending shards - by default this is a SPLITSHARD operation.
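A hedged sketch of registering such a trigger, assuming Solr 7.x with the autoscaling framework; the threshold value is a placeholder:

```python
# Sketch of an index-size autoscaling trigger: when a shard grows past the
# configured document count, the trigger requests a SPLITSHARD on it.
# The threshold is a placeholder; the endpoint assumes a local Solr 7.x node.
import requests

trigger = {
    "set-trigger": {
        "name": "index_size_trigger",
        "event": "indexSize",
        "aboveDocs": 100000000,   # split when a shard exceeds ~100M documents
        "waitFor": "1m",
        "enabled": True,
        "actions": [
            {"name": "compute_plan", "class": "solr.ComputePlanAction"},
            {"name": "execute_plan", "class": "solr.ExecutePlanAction"},
        ],
    }
}

resp = requests.post("http://localhost:8983/solr/admin/autoscaling", json=trigger)
print(resp.json())
```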

Why is a specific node selected as the leader of all shards?

I have set up a Solr 7.4 cluster with 3 nodes and 3 replicas, and one collection with 5 shards.
I added a collection called posts (with 5 shards and 3 replicas), and by default the leader of all its shards is 196.209.182.40.
Is it appropriate for each shard to have a different node as its leader? Why does Solr choose the same node as the leader for all of them?
Since shards can be located on completely different servers (and usually are), rather than all shards sitting on the same set of three nodes as in your example, yes, each shard can have a different leader.
The election process is described in Shards and indexing in SolrCloud.
In SolrCloud there are no masters or slaves. Instead, every shard consists of at least one physical replica, exactly one of which is a leader. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at https://zookeeper.apache.org/doc/r3.1.2/recipes.html#sc_leaderElection.
Referenced from the URL above:
A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", such that each znode creates a child znode "/election/n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any one previously appended to a child of "/election". The process that created the znode with the smallest appended sequence number is the leader.
In your case the same node was the first to respond in all cases (and possibly the one you submitted the create request to), and thus, was elected the original leader.
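As an illustration of the quoted ZooKeeper recipe (not of Solr's internal election code), here is a hedged sketch using the kazoo Python client; the ZooKeeper address and the /election path are placeholders:

```python
# Illustration of the SEQUENCE|EPHEMERAL leader-election recipe quoted above,
# using the kazoo client. Connection string and znode path are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Each participant creates an ephemeral, sequential "proposal" znode.
my_path = zk.create("/election/n_", ephemeral=True, sequence=True, makepath=True)

# The participant whose znode carries the smallest sequence number leads.
children = sorted(zk.get_children("/election"))
am_leader = children[0] == my_path.rsplit("/", 1)[-1]
print("leader" if am_leader else "follower")

zk.stop()
```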

What to do when nodes in a Cassandra cluster reach their limit?

I am studying up on Cassandra and in the process of setting up a cluster for a project that I'm working on. Consider this example:
Say I set up a 5-node cluster with 200 GB of space on each node. That adds up to 1000 GB (roughly 1 TB) of space overall. Assuming that my partitions are split equally across the cluster, I can easily add nodes and achieve linear scalability. However, what if these 5 nodes start approaching the SSD limit of 200 GB? In that case, I can add 5 more nodes, and the partitions would then be split across 10 nodes. But the older nodes would still be receiving writes, as they are part of the cluster. Is there a way to make these 5 older nodes 'read-only'? I want to send random read queries across the entire cluster, but I don't want to write to the older nodes anymore (as they are capped by the 200 GB limit).
Help would be greatly appreciated. Thank you.
Note: I can say that 99% of the queries will be write queries, with 1% or less for reads. The app has to persist click events in Cassandra.
Usually when a cluster reaches its limit, we add a new node to the cluster. After a new node is added, the old Cassandra nodes distribute their data to the new node, and afterwards we run nodetool cleanup on every old node to remove the data that was moved to the new node. The entire scenario happens within a single DC.
For example:
Suppose you have 3 nodes (A, B, C) in DC1 and 1 node (D) in DC2, and your nodes are reaching their limit. So you decide to add a new node (E) to DC1. Nodes A, B and C will distribute their data to node E, and then we run nodetool cleanup on A, B and C to reclaim the space. A rough sketch of scripting that cleanup step is shown below.
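A rough sketch of scripting the cleanup step, assuming SSH access to the old nodes; host names are placeholders (in practice this is often done by hand or by a configuration-management tool):

```python
# After node E has joined and finished streaming, run "nodetool cleanup"
# on each pre-existing node to drop data for token ranges it no longer owns.
# Host names and SSH access are assumptions.
import subprocess

OLD_NODES = ["node-a", "node-b", "node-c"]   # placeholder host names

for host in OLD_NODES:
    subprocess.run(["ssh", host, "nodetool", "cleanup"], check=True)
```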
I had some trouble understanding the question precisely, so I will address both readings.
I am assuming you know that by adding 5 new nodes, some of the data load is transferred to them, since some token ranges get assigned to the new nodes.
Given that, if your concern is that the old 5 nodes would not be able to accept writes because they have reached their limit, that is not going to happen: the new nodes take over part of the data load, so the old nodes have free space again for further writes.
Isolating reads and writes to particular nodes is a different problem entirely. If you want to isolate reads to the old 5 nodes and writes to the new 5 nodes, the best way is to add the new 5 nodes as another datacenter within the same cluster and then use different consistency levels for reads and writes to make the old datacenter effectively read-only.
But the new datacenter will not lighten the data load of the first one; it takes on the same load itself. So you would need more than 5 new nodes to solve both problems at once: a few nodes to lighten the load and others to isolate reads from writes by forming the new datacenter (and that new datacenter should itself have more than 5 nodes). Best practice is to monitor the data load and fix the problem before it happens, by adding new nodes or increasing the storage limit.
Having done that, you also need to ensure that the nodes you contact for reads and for writes are in different datacenters.
Consider you have the following situation:
dc1(n1, n2, n3, n4, n5)
dc2(n6, n7, n8, n9, n10)
Now, for reads you contact node n1 and for writes you contact node n6.
Read/write isolation can then be achieved by choosing the right consistency level from the options below:
LOCAL_QUORUM
or
LOCAL_ONE
These basically confine the search for replicas to the local datacenter only. A hedged sketch of wiring this up with the DataStax Python driver follows.
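This sketch is not from the original answer; it just shows one way to pin reads and writes to different datacenters with the DataStax Python driver. Contact points, keyspace and table names are placeholders:

```python
# Hedged sketch: route writes to dc2 and reads to dc1 with per-profile
# load-balancing policies and consistency levels. All names are placeholders.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy

write_profile = ExecutionProfile(              # default profile: writes stay in dc2
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc2"),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
read_profile = ExecutionProfile(               # named profile: reads stay in dc1
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="dc1"),
    consistency_level=ConsistencyLevel.LOCAL_ONE,
)

cluster = Cluster(
    contact_points=["n1", "n6"],
    execution_profiles={EXEC_PROFILE_DEFAULT: write_profile, "reads": read_profile},
)
session = cluster.connect("clicks")            # placeholder keyspace

session.execute("INSERT INTO events (id, payload) VALUES (uuid(), 'click')")
rows = session.execute("SELECT count(*) FROM events", execution_profile="reads")
```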
Look at these references for more:
Adding a datacenter to a cluster
and
Consistency Levels

Merging collections split across multiple shards

Brief overview of the setup:
5 x SolrCloud (Solr 4.6.1) node instances (separate machines).
The setup is intended to store the last 48 hours of webapp logs (which are pretty intense... ~3MB/sec)
"logs" collection has 5 shards (one per node instance).
One logline represents one document of "logs" collection
If I keep storing log documents in this "logs" collection, the cores on the shards get really big, and CPU graphs show that the instances spend more and more time waiting for disk I/O.
So my idea is to create a new collection every 15 minutes, named e.g. "logs-201402051400", with shards spread across the 5 instances. Document writers will start writing to the new collection as soon as it is created. At some point I will have a list of collections like this:
...
logs-201402051400
logs-201402051415
logs-201402051430
logs-201402051445
logs-201402051500
...
Since there will be at most 192 collections (~1000 cores) in the SolrCloud at any given time, it seems that search performance would degrade drastically.
So I would like to merge the collections that are no longer being written to into one large collection (still sharded across the 5 instances). I have found information on how to merge cores, but how can I merge collections?
This might NOT be a complete answer to your query - but something tells me that you need to redo the design of your collection.
This is a classic debate between using a Single Collection with Multiple Shards versus Multiple Collections.
I think you ought to set up a single collection and then use SolrCloud's dynamic sharding capability (the implicit router) to add new shards for newer 15-minute intervals and delete the shards for older intervals (a rough sketch follows at the end of this answer).
Managing a single collection means that you will have a single endpoint, and it will save you from the complexity of querying multiple collections.
Take a look at one of the answers on this link that talks about using the implicit router for dynamic sharding in SolrCloud.
How to add shards dynamically to collection in solr?
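A hedged sketch of that implicit-router layout; the endpoint, collection name and shard names are illustrative, not from the original answer:

```python
# Sketch of the implicit-router approach: create the collection with named
# shards, then add a shard per new 15-minute bucket and drop expired ones.
# Endpoint and all names are illustrative.
import requests

ADMIN = "http://localhost:8983/solr/admin/collections"

# Create the collection with explicitly named (implicit-routed) shards
requests.get(ADMIN, params={
    "action": "CREATE",
    "name": "logs",
    "router.name": "implicit",
    "shards": "logs-201402051400,logs-201402051415",
    "maxShardsPerNode": 40,
    "wt": "json",
})

# Every 15 minutes: add a shard for the new bucket...
requests.get(ADMIN, params={
    "action": "CREATESHARD",
    "collection": "logs",
    "shard": "logs-201402051430",
    "wt": "json",
})

# ...and drop the bucket that has fallen out of the 48-hour window
requests.get(ADMIN, params={
    "action": "DELETESHARD",
    "collection": "logs",
    "shard": "logs-201402021430",
    "wt": "json",
})
```

When indexing with the implicit router, each document (or the update request) has to name its target shard, e.g. via the _route_ parameter.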
