SolrCloud High Availability during indexing operation - solr

I am testing High Availability feature of SolrCloud. I am using the following setup
8 linux hosts
8 Shards
1 leader, 1 replica / host
Using Curl for update operation
I tried to index 80K documents on replicas (10K/replica in parallel). During indexing process, I stopped 4 Leader nodes. Once indexing is done, out of 80K docs only 79808 docs are indexed.
Is this an expected behaviour ? In my opinion replica should take care of indexing if leader is down.
If this is an expected behaviour, any steps that can be taken from the client side to avoid such a situation.

I suggest you should use CloudSolrServer to update solrcloud index.As it take cares of down nodes do not receive any update request and routes all further request to an appropriate node in the cluster.One more thing you need to ensure is all your 80k documents have unique field value, and its value is really unique across all documents

Related

Does Solr cloud needs a load balancer e.g. HAPROXY in master failure

I have searched a lot but unfortunately have some simple confusion about solr cloud. Lets say, I have three systems where solrCloud in configured (1 master and 2 slave) and external Zookeeper on same three machines to make a quorum. Systems names are
master
slave1
slave2
Public-Front
The Public-Front is the system where, I have configured HAPROXY. It receives requests from WWW and the send to backend server depending on ACLs.
According to my understanding, If I request to Solr collection (i.e., master), it routes it to slaves and hence load balanced. There is no need to specify slaves here. Isn't ?
Now in Public-Front, should I configured each Solr as a separate slave to load balance or just to master system.
Now if I only configure master system as solr-server in HAPROXY then if solr-server (master) goes down then I think I cannot get service from Solr from HAPROXY (although slaves are till up but not configured in HAPROXY).
Where am I wrong and what is the best approach ?
There is no traditional master or slave in Solr Cloud - there is a set of replicas, one of which is defined as the leader. The leader selection is automagic - i.e. the first replica that says it wants to be the leader, receives that status. This is per collection state. In your example there is three replicas, one which is designed as the leader. If that replica disappears, one of the two remaining replicas becomes the new leader, and everything continues as normal. The role of the leader is to be the up-to-date version of the index and handle any updates - first to its own index, then route those updates to any replicas.
There is also several types of replicas, and not all of them are suited to be promoted to a leader - but in the default configuration they can be.
Here's the thing - since there isn't really a master, all three indexes contain the same data and they all are replicas of the same shard, the request won't have to be routed through the master. If you're using a dumb haproxy, you can safely spread the requests across all three nodes and they should be able to answer the query without contacting any other nodes (as long as they all contain all the shards of the collection).
However, if you're using SolrJ or another Zookeeper capable client (and using the Zookeeper compatible client), the client will keep in touch with Zookeeper instead, and read the state information for your cluster. That allows the client to know which servers are currently replicas for your collection, and contact any of those nodes that it can decide have the required information for your query. In your case the result will be the same, except that your client will know not to connect to any nodes that disappear and will automagically know about nodes that are added to the cluster.
The "one Solr node routing requests to a different node" is only relevant if the node you're contacting doesn't have any replicas for the collection you're querying - i.e. it'll have to contact a different node to fetch that content. In that case an inter cluster request will happen and the load on the cluster will be slightly higher than necessary. When the collection is replicated to all three nodes - or when you're using SolrJ, that inter cluster request should not happen.

SolrCloud - 2 nodes cluster

We are planning to implement SolrCloud in our solution (mainly for data replication reasons and disaster recovery), unfortunately some of our customers have only 2DCs - and one DC may be completely destroyed.
We are aware that running ZK in 2 locations is problematic, as ZK requires quorum. And downtime on any side with 2 ZK nodes would cause cluster failure. And cluster failure would be also triggered by network partition between locations (master will cease to be master due to quorum lost, slave can't elect himself for the same reason).
--
So our current plan A is to go with a single ZK for both sites and backup ZK into the other site. So if the site withou ZK dies, we are OK. If the site with ZK dies, we should be able to start new ZK from backup and reconfigure Solr.
--
We also considered plan B with classic master-slave replication between the sites. BUT we are using Time Routed Aliases, hence we need SolrCloud features, hence we would need also to replicate data/configuratin in ZooKeeper (not only Solr index). So this case seems only as more manual work in Solr, while we would still need to backup/restore ZK. So this plan was rejected.
--
Plan C may be to have 2ZK, but one with with bigger weight. This should survive partition and dead of ZK with lower weight. The first ZK node should be automatically backed up using standard cluster mechanics. But I do not even know about anyone using ZK this way...
--
Is there any smarter way, how to setup SolrCloud in 2 nodes environment? Which solution should we prefer?
We do not expect High Availability; we want to achieve disaster recovery. Administrator intervention is expected in case of node failure, we only need to be resilient to short network glitches.
Edit: CDCR (Cross Data Center Replication) with Time Routed Aliases
We are considering to use TRA, because our data are time based, and customers are usually interested only in latest slice/partition. Without TRA, the index grows and performance degrades, more (unused/old) stuff is in index & RAM...
Here comes a problem with CDCR, according to docs, the source&target collection parameters are required. But with TRA, collections are created with the same solrconfig.xml automatically (every X days/months). This problem in CDCR is known (see comments), but not resolved yet.
Also it seems that CDCR really does not synchronize ZooKeeper (I have not found any mentions of the functionality in docs, jira and in code), which may be ok with static number of collections, but is very problematic with dynamically created collections (especially by some machinery in background outside users/developers code).
Edit: According to David (the main author of TRA), CDCR&TRA combination is not to be supported.

How to setup Solr Cloud with two search servers?

Hi I'm developing rails project with sunspot solr and configuring Solr Cloud.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need more stable system - oftentimes search server goes on maintenance and web application stop working on production. So, I think about how to make 2 identical search servers instead of one, to make system more stable: if one server will be down, other will continue working.
I cannot find any good turtorial with simple, easy to understand and described in details turtorial...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it is working inside:
synchronize data between two servers (is it automatic action?)
balances search requests between two servers
when one server suddenly stop working other should become a master (is it automatic action?)
is there SolrCloud features other than listed?
Read more about SolrCloud here..! https://wiki.apache.org/solr/SolrCloud
Couple of inputs from my experience.
If your application just reads data from SOLR and does not write to SOLR(in real time but you index using an ETL or so) then you can just go for Master Slave hierarchy.
Define one Master :- Point all writes to here. If this master is down you will no longer be able to index the data
Create 2(or more) Slaves :- This is an feature from SOLR and it will take care of synchronizing data from the master based on the interval we specify(Say every 20 seconds)
Create a load balancer based out of slaves and point your application to read data from load balancer.
Pros:
With above setup, you don't have high availability for Master(Data writes) but you will have high availability for data until the last slave goes down.
Cons:
Assume one slave went down and you bought it back after an hour, this slave will be behind the other slaves by one hour. So its manual task to check for data consistency among other slaves before adding back to ELB.
How about SolrCloud?
No Master here, so you can achieve high availability for Writes too
No need to worry about data inconsistency as I described above, SolrCloud architecture will take care of that.
What Suits Best for you.
Define a external Zookeeper with 3 nodes Quorom
Define at least 2 SOLR severs.
Split your Current index to 2 shards (by default each shard will reside one each in 2 solr nodes defined in step #2
Define replica as 2 (This will create replica for shards in each nodes)
Define an LB to point to above solr nodes.
Point your Solr input as well as application to point to this LB.
By above setup, you can sustain fail over for either nodes.
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Combining Solr 3x-style Master/Slave "Repeater" to feed remote 4x SolrCloud instances?

Solr 3x "Repeaters" and Multiple Data Centers:
Solr 3x let a node behave as both a slave and master, pull from one master, and then feed copies downstream to its own slaves. This was so common/useful it even had a name, a "Repeater".
This was useful if you wanted span multiple data centers. You could have the real master in data center A (DCA), and a "repeater" in data center B (DCB). That repeater would then grab content from DCA and feed all of the other nodes in DCB, saving on bandwidth.
Suppose you want to upgrade this setup to Solr 4x and SolrCloud. (Note that Solr 4x still supports Solr 3x-style legacy replication)
It's said that you should NOT have a single SolrCloud cluster span disparate data centers. So data center B should have it's own SolrCloud.
One idea is to have the DCA -> DCB link still use Solr 3x-style Master/Slave replication. And then the "repeater" in DCB, being also a SolrCloud node, would automatically be propagated to other nodes.
Main question:
Can a Solr node participate in both Solr 3x-style master/slave mode (as a slave) and also be part of a SolrCloud cluster? And if so, how is this configured?
Complications:
In the simple case, if it's just 1 shard with replicas, it's easy to see how that might work in terms of data. It's a little less clear if you have multiple shards in DCB, how do I tell each shard to only replicate its own share of data? Note that SolrCloud normally replicates via transactions, whereas 3x uses binary indices.
Another complexity is if you're doing replication. How do you tell just the master node for each shard to pull from the remote DCA node?
Alternatives:
On solution is to upgrade to 4x but continue using 3x-style replication in DCB, so just don't use SolrCloud.
I realize that another solution would be to have the data feed send it's updates to both data centers, or usE something like RabbitMQ. For the sake of this question, let's assume thats not an option (long story...)
Maybe there's some other way I haven't thought of?
Has anybody actually tried having SolrCloud span data centers? How horrible is it?
Somebody must have asked this question before!
But I've looked on Google and, although it finds tons of pages with the keywords, I haven't seen this specific "hybrid" mode fleshed out. I found one thread from 2013 but it didn't really talk about the configuration and complexity.
To answer your first question, a Solr slave in 3.X style cannot be a node in a Solr Cloud. The reason is the slave in a master/slave 3.X Solr config simply replicates, byte for byte, all the index files on the master. That's all it does. It can, in the repeater config, then also be a master for others to replicate from, or be a dedicated query slave or both. But that's it.
A node in a Solr Cloud config is a full participant in a distributed computing cluster where indexing is generally intended to be distributed across all nodes, and all nodes participate in queries. It's a very powerful feature which automatically handles failed nodes and significantly eases the work load of scaling up that was very manual in 3.X style.
However, part of what you pay for that is increased complexity (Zookeeper), requirements for lower latency inter-node communications (because all the nodes now talk to each other and to Zookeeper) and the loss of the simplicity of Master/Slave replication.
At 20M docs you are well within the constraints of a single node master index with an effectively unlimited number of slaves and therefor very high query capacity. I do this today with a production environment where each master has on the order of 60M docs in it with no significant problems.
The question is do you need NRT, multi-node indexing, automated failover, the ability to autoscale well past 100M docs? If so then Master/Slave it probably not going to work for you.
You could take a look at writing the same data to two different Solr Cloud clusters, one in each datacenter. You could do that directly, or use something like Apache Flume to do it for you - in either there are some issues with doing this and so the real question is are dealing with those issues worth it to get the added benefit of Solr Cloud?

Solr appears to block update requests while committing

We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25mins during which time we're unable to add any new documents to Solr.
One of the answers in the following question suggests using 2 cores and swapping them once its fully updated. However this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
According to bounty reason and the problem mentioned at question here is a solution from Solr:
Solr has a capability that is called as SolrCloud beginning with 4.x version of Solr. Instead of previous master/slave architecture there are leaders and replicas. Leaders are responsible for indexing documents and replicas answers queries. System is managed by Zookeeper. If a leader goes down one of its replicas are selected as new leader.
All in all if you want to divide you indexing process that is OK with SolrCloud by automatically because there exists one leader for each shard and they are responsible for indexing for their shard's documents. When you send a query into the system there will be some Solr nodes (of course if there are Solr nodes more than shard count) that is not responsible for indexing however ready to answer the query. When you add more replica, you will get faster query result (but it will cause more inbound network traffic when indexing etc.)
For those who is facing a similar problem, the cause of my problem was i had too many fields in the document, i used automatic fields *_t, and the number of fields grows pretty fast, and when that reach a certain number, it just hogs solr and commit would take forever.
Secondarily, I took some effort to do a profiling, it end up most of the time is consumed by string.intern() function call, it seems the number of fields in the document matters, when that number goes up, the string.intern() seems getting slower.
The solr4 source appears no longer using the string.intern() anymore. But large number of fields still kills the performance quite easily.

Resources