How does segment merging in SolrCloud happen on different types of replicas?

We have 3 different types of replicas in SolrCloud: NRT, tlog and pull.
When segment merging happens, does it happen independently on each replica? That would create inconsistencies between the segments on different replicas, however.
Or does the leader merge the segments first and then send them to its replicas to overwrite theirs?
I want to see whether configuring the segment merge policy, in combination with tlog and pull replicas, can improve query/indexing performance.
Any insights would be highly appreciated.
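For context, the replica mix is chosen when the collection is created (Solr 7+, where replica types exist), while the merge policy is per-core configuration in solrconfig.xml. Since PULL replicas copy already-merged segments from the leader's index rather than indexing locally, a TLOG+PULL layout concentrates merge work on the leader. A sketch of such a layout, with a hypothetical collection name:

```shell
# Hypothetical collection: one TLOG replica per shard (the leader, which
# does all the indexing and merging) plus two PULL replicas that only
# pull finished segments from it.
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=products&numShards=1&tlogReplicas=1&pullReplicas=2"
```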

Related

Recovering from single shard loss with replica in SolrCloud

I have a SolrCloud cluster with a collection with RF=2 and numShards=3 on 6 nodes. We want to test how to recover from unexpected situations like shard loss, so we will probably execute an rm -rf on the Solr data directory on one of the replicas.
Now the question is: how will this wiped node recover from the shard loss? Are manual steps required (if yes, what needs to be done), or will it automatically recover from the replica?
You haven't specified a Solr version, but here's a synopsis of some of the concepts:
SolrCloud records cluster state in two places: the local disk of the node, and ZooKeeper. When Solr starts on a node, it scans its local disk for Solr "cores" (replicas, in this case) and, if it finds any, registers itself in ZK as serving those replicas. If, according to ZK, it is not the Leader of the shard for a given replica, it will sync itself from the Leader before it starts serving traffic.
The Leader role for a shard (I avoid Master/Slave terminology here, because that's generally used in a non-SolrCloud setup) is ephemeral. If the Leader goes down, a non-leader is elected the new Leader and life goes on. If the former Leader comes back, it's a non-leader now. Generally you don't need to concern yourself with which replica is the Leader.
SolrCloud does not generally assign replicas automatically. You explicitly tell it where you want things.
Given these things, your intended "failure mode" is a bit interesting. Deleting the files from under a running JVM probably won't do much. The JVM has an open file handle to all the index files, so the OS can't clean them up even though you've deleted the references. Things will probably continue normally until the next time Solr needs to write a new segment file to a directory that no longer exists, at which point things will explode; exactly how, I don't know.
If you stop Solr, delete the directory, and restart Solr, though, you've deleted the knowledge that that Solr node was participating in any index. Solr will come up, join the cluster, and not host any replicas of any shard. You'll probably need to ADDREPLICA to put it back.
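The ADDREPLICA step mentioned above is a Collections API call; a minimal sketch, assuming a collection `mycoll`, shard `shard1`, and the node name of the wiped host (all hypothetical):

```shell
# Register the wiped node as a new replica of shard1; it will then
# full-sync its index from the current leader before serving traffic.
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&node=host2:8983_solr"
```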

How to set up Solr Cloud with two search servers?

Hi, I'm developing a Rails project with sunspot solr and configuring Solr Cloud.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need a more stable system - oftentimes the search server goes down for maintenance and the web application stops working in production. So I'm thinking about how to run 2 identical search servers instead of one, to make the system more stable: if one server goes down, the other will continue working.
I cannot find any good tutorial that is simple, easy to understand, and described in detail...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it works internally:
how data is synchronized between the two servers (is it automatic?)
how search requests are balanced between the two servers
when one server suddenly stops working, the other should become the master (is it automatic?)
are there SolrCloud features other than those listed?
Read more about SolrCloud here: https://wiki.apache.org/solr/SolrCloud
A couple of inputs from my experience.
If your application only reads data from SOLR and does not write to SOLR in real time (but you index using an ETL or so), then you can just go for a master/slave hierarchy.
Define one master: point all writes here. If this master is down, you will no longer be able to index data.
Create 2 (or more) slaves: this is a feature of SOLR, and it will take care of synchronizing data from the master at the interval you specify (say, every 20 seconds).
Create a load balancer over the slaves and point your application to read data from the load balancer.
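The master/slave wiring described above lives in each server's solrconfig.xml via the ReplicationHandler; a sketch using the 20-second poll interval from the text, where the master URL and core name are placeholders:

```xml
<!-- On the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>
```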
Pros:
With the above setup you don't have high availability for the master (data writes), but you will have high availability for data reads until the last slave goes down.
Cons:
Assume one slave went down and you brought it back after an hour; this slave will now be an hour behind the other slaves. So it is a manual task to check data consistency against the other slaves before adding it back to the ELB.
How about SolrCloud?
No master here, so you can achieve high availability for writes too.
No need to worry about the data inconsistency described above; the SolrCloud architecture takes care of that.
What suits you best:
1. Define an external ZooKeeper ensemble with a 3-node quorum.
2. Define at least 2 SOLR servers.
3. Split your current index into 2 shards (by default, each shard will reside on one of the 2 Solr nodes defined in step #2).
4. Set the replication factor to 2 (this will create a replica of each shard on the other node).
5. Define an LB pointing to the above Solr nodes.
6. Point your indexing as well as your application to this LB.
With the above setup, you can sustain failover of either node.
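The setup described above boils down to starting each node against the ZooKeeper ensemble and one Collections API call; a sketch, with the hostnames and collection name as placeholders:

```shell
# Start each Solr node in cloud mode against the 3-node ZooKeeper quorum.
bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181

# Create the collection: 2 shards, 2 copies of each, spread over both nodes.
curl "http://solr1:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&maxShardsPerNode=2"
```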
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Multiple Solr shards on a single machine with a master shard exposed to the outside

Though I am new to distributed search, I have a fair idea of sharding and SolrCloud.
Technically, when we have a bigger index we split it into multiple shards for faster distributed search (not counting the SolrCloud benefits).
I have a huge index (same schema throughout), but there is a logical separation of the data, which I call buckets. Each of these buckets has its own updates/deletions/adds.
1st Approach:
So technically I can create N shards, one per bucket. But this will lead to too many servers and will slow down the search, because the results have to be merged.
2nd Approach:
To reduce the distribution, I can logically combine these buckets into a smaller number of parent buckets, which will reduce the overall shard count. But then I will lose some of the advantages of individual buckets.
3rd Approach:
I was thinking I could create something like a shard of shards. On each machine I would have one parent bucket as a shard (call it the parent shard), which in turn would have multiple child shards hosted on the same Solr instance on the same machine (call them child shards). This would be like multiple cores in a single Solr instance with the same schema. In this scenario, the parent shard would merge the records from each of the child shards by executing queries on the child shards in parallel and merging the results. As it is all on the same Solr instance, I believe it would be faster. I want the parent shard to be empty and just handle merging of results. Is this possible, and if so, will its performance match the 2nd approach? Can someone give me an idea how to implement it? I am fine with customizing Solr for this implementation.
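For what it's worth, plain (non-cloud) distributed search already supports something close to this: an empty aggregator core can fan a query out to sibling cores via the `shards` parameter and merge the results itself. A sketch, assuming local cores named `parent`, `child1`, and `child2` (all hypothetical):

```shell
# Query the empty parent core; it forwards the query to both child cores
# on the same instance and merges their sorted results.
curl "http://localhost:8983/solr/parent/select?q=*:*&shards=localhost:8983/solr/child1,localhost:8983/solr/child2&rows=10"
```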

SolrCloud High Availability during indexing operation

I am testing the High Availability feature of SolrCloud, using the following setup:
8 linux hosts
8 Shards
1 leader, 1 replica / host
Using Curl for update operation
I tried to index 80K documents via the replicas (10K per replica, in parallel). During the indexing process, I stopped 4 leader nodes. Once indexing was done, only 79,808 of the 80K docs were indexed.
Is this expected behaviour? In my opinion, a replica should take over indexing if its leader is down.
If this is expected behaviour, what steps can be taken on the client side to avoid such a situation?
I suggest you use CloudSolrServer to update the SolrCloud index. It is cluster-aware, so down nodes do not receive any update requests, and all further requests are routed to an appropriate node in the cluster. One more thing you need to ensure is that all of your 80K documents have the unique-key field populated, and that its value really is unique across all documents.

Keeping index optimized / merged in SolrCloud

With the master-slave implementation of distributed Solr (prior to Solr 4.x), the design was straightforward: the master takes the load of indexing, merging, and optimizing the index, and the index is then copied to the replicas, which meanwhile are always serving searches.
Could someone explain how this is done now with SolrCloud?
It seems SolrCloud sends indexing commands from the leader to each replica. But how can search performance be maintained then? Indexing and searching on the same replica puts load on each node server (to index and run the merge thread in the background), and since my index is quite big, merging segments or simply optimizing usually takes a lot of time.
Should I now leave all of that to the merge policy and not worry at all? Does TieredMergePolicy provide both good search performance and low resource load (CPU, I/O) at the same time?
I'll try to answer part of your questions: SolrCloud indeed indexes on all nodes, and therefore it has a performance impact on replicas. This is a 'hot replication' model instead of the 'cold replication' you are used to. It solves data-integrity issues and enables real-time search on a cluster: you get consistent data and faster data availability at the price of a performance impact. You can always split the data across more shards (at the price of additional hardware) and get comparable performance.
In either case, it's up to you to decide whether SolrCloud suits your needs. You can use Solr 4 without cloud model and manage it yourself as before.
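On the TieredMergePolicy question: in the Solr 4.x line it is tuned in the <indexConfig> section of solrconfig.xml. A sketch with the most commonly adjusted knobs (the values shown are the Lucene defaults, not recommendations):

```xml
<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- Segments allowed per tier before a merge is triggered -->
    <double name="segmentsPerTier">10.0</double>
    <!-- Segments merged in one merge operation -->
    <int name="maxMergeAtOnce">10</int>
    <!-- Cap on merged segment size; raising it means fewer, larger merges -->
    <double name="maxMergedSegmentMB">5120.0</double>
  </mergePolicy>
</indexConfig>
```

Raising segmentsPerTier reduces merge I/O at the cost of more segments to search; lowering maxMergedSegmentMB keeps individual merges cheaper.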
