dse enterprise solr re-indexing - solr

Is there a way to re-index a solr core without impacting applications that rely on that core? For example, can we spin up a new replacement core and let it get indexed fully before swapping out and decommissioning the old core?
In our use case, we cannot afford to have partial data available to our applications - which is what will happen if we do an in-place re-index. Currently, it takes anywhere between 24 - 36 hours to fully re-index our core.

If the relevant keyspace is configured with a replication factor of 2 or more, you should be able to do a rolling re-index of your cluster without affecting availability. (i.e. You should be able to use dsetool reload_core <your core name> distributed=false reindex=true.) While a node is re-indexing, it will not service queries for the token ranges it owns, unless there are no other replicas available.

Related

Migrating Solr Cloud cluster over new cloud vendor

We need to move our solr cloud cluster from one cloud vendor to another, the cluster is composed of 8 shards with 2 replica factor spread among 8 servers with roughly a total of 500GB worth of data.
I wonder what are the common approaches to migrate the cluster but specially its data with the less impact in availability and performance etc..
I was thinking in some sort of initial dump copy to then synchronize them catching up the diff (which could be huge) after keeping them in sync just switch whenever everything is ready from the other side.
Is that something doable? what tools should/could I use?
Thanks!
You have multiple choices depending on your existing setup and Solr version:
As mentioned earlier, make use of backup and restore APIs from Collections API
If you have Solr 6 and above, I would recommend exploring the option of CDCR, which is Solr's native Cross Data Centre Replication.
Reindexing onto the new cluster and then leverage Solr Collection Aliasing to change your application end points to the target provider upon the completion of reindexing

How to setup Solr Cloud with two search servers?

Hi I'm developing rails project with sunspot solr and configuring Solr Cloud.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need more stable system - oftentimes search server goes on maintenance and web application stop working on production. So, I think about how to make 2 identical search servers instead of one, to make system more stable: if one server will be down, other will continue working.
I cannot find any good turtorial with simple, easy to understand and described in details turtorial...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it is working inside:
synchronize data between two servers (is it automatic action?)
balances search requests between two servers
when one server suddenly stop working other should become a master (is it automatic action?)
is there SolrCloud features other than listed?
Read more about SolrCloud here..! https://wiki.apache.org/solr/SolrCloud
Couple of inputs from my experience.
If your application just reads data from SOLR and does not write to SOLR(in real time but you index using an ETL or so) then you can just go for Master Slave hierarchy.
Define one Master :- Point all writes to here. If this master is down you will no longer be able to index the data
Create 2(or more) Slaves :- This is an feature from SOLR and it will take care of synchronizing data from the master based on the interval we specify(Say every 20 seconds)
Create a load balancer based out of slaves and point your application to read data from load balancer.
Pros:
With above setup, you don't have high availability for Master(Data writes) but you will have high availability for data until the last slave goes down.
Cons:
Assume one slave went down and you bought it back after an hour, this slave will be behind the other slaves by one hour. So its manual task to check for data consistency among other slaves before adding back to ELB.
How about SolrCloud?
No Master here, so you can achieve high availability for Writes too
No need to worry about data inconsistency as I described above, SolrCloud architecture will take care of that.
What Suits Best for you.
Define a external Zookeeper with 3 nodes Quorom
Define at least 2 SOLR severs.
Split your Current index to 2 shards (by default each shard will reside one each in 2 solr nodes defined in step #2
Define replica as 2 (This will create replica for shards in each nodes)
Define an LB to point to above solr nodes.
Point your Solr input as well as application to point to this LB.
By above setup, you can sustain fail over for either nodes.
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Combining Solr 3x-style Master/Slave "Repeater" to feed remote 4x SolrCloud instances?

Solr 3x "Repeaters" and Multiple Data Centers:
Solr 3x let a node behave as both a slave and master, pull from one master, and then feed copies downstream to its own slaves. This was so common/useful it even had a name, a "Repeater".
This was useful if you wanted span multiple data centers. You could have the real master in data center A (DCA), and a "repeater" in data center B (DCB). That repeater would then grab content from DCA and feed all of the other nodes in DCB, saving on bandwidth.
Suppose you want to upgrade this setup to Solr 4x and SolrCloud. (Note that Solr 4x still supports Solr 3x-style legacy replication)
It's said that you should NOT have a single SolrCloud cluster span disparate data centers. So data center B should have it's own SolrCloud.
One idea is to have the DCA -> DCB link still use Solr 3x-style Master/Slave replication. And then the "repeater" in DCB, being also a SolrCloud node, would automatically be propagated to other nodes.
Main question:
Can a Solr node participate in both Solr 3x-style master/slave mode (as a slave) and also be part of a SolrCloud cluster? And if so, how is this configured?
Complications:
In the simple case, if it's just 1 shard with replicas, it's easy to see how that might work in terms of data. It's a little less clear if you have multiple shards in DCB, how do I tell each shard to only replicate its own share of data? Note that SolrCloud normally replicates via transactions, whereas 3x uses binary indices.
Another complexity is if you're doing replication. How do you tell just the master node for each shard to pull from the remote DCA node?
Alternatives:
On solution is to upgrade to 4x but continue using 3x-style replication in DCB, so just don't use SolrCloud.
I realize that another solution would be to have the data feed send it's updates to both data centers, or usE something like RabbitMQ. For the sake of this question, let's assume thats not an option (long story...)
Maybe there's some other way I haven't thought of?
Has anybody actually tried having SolrCloud span data centers? How horrible is it?
Somebody must have asked this question before!
But I've looked on Google and, although it finds tons of pages with the keywords, I haven't seen this specific "hybrid" mode fleshed out. I found one thread from 2013 but it didn't really talk about the configuration and complexity.
To answer your first question, a Solr slave in 3.X style cannot be a node in a Solr Cloud. The reason is the slave in a master/slave 3.X Solr config simply replicates, byte for byte, all the index files on the master. That's all it does. It can, in the repeater config, then also be a master for others to replicate from, or be a dedicated query slave or both. But that's it.
A node in a Solr Cloud config is a full participant in a distributed computing cluster where indexing is generally intended to be distributed across all nodes, and all nodes participate in queries. It's a very powerful feature which automatically handles failed nodes and significantly eases the work load of scaling up that was very manual in 3.X style.
However, part of what you pay for that is increased complexity (Zookeeper), requirements for lower latency inter-node communications (because all the nodes now talk to each other and to Zookeeper) and the loss of the simplicity of Master/Slave replication.
At 20M docs you are well within the constraints of a single node master index with an effectively unlimited number of slaves and therefor very high query capacity. I do this today with a production environment where each master has on the order of 60M docs in it with no significant problems.
The question is do you need NRT, multi-node indexing, automated failover, the ability to autoscale well past 100M docs? If so then Master/Slave it probably not going to work for you.
You could take a look at writing the same data to two different Solr Cloud clusters, one in each datacenter. You could do that directly, or use something like Apache Flume to do it for you - in either there are some issues with doing this and so the real question is are dealing with those issues worth it to get the added benefit of Solr Cloud?

Solr 4.1 Solr Cloud Shard Structure

If I plan on holding 7TB of data in a Solr Cloud, is it bad practice to begin with 1 server holding 100 shards and then begin populating the collection where once the size grew, each shard ultimately will be peeled off into its own dedicated server (holding ~70GB ea with its own dedicated resources and replicas)?
That is, I would start the collection with 100 shards locally, then as data grew, I could peel off one shard at a time and give it its own server -- dedicated w/plenty of resources.
Is this okay to do -- or would I somehow incur a massive bottleneck internally by putting that many shards in 1 server to start with while data was low?
This seems valid strategy for Solr 4.1 as Mark Miller one of the main contributor to the project also mentioned this approach check here
However if you can upgrade or planning to upgrade. It is possible to add new shrad dynamically in Solr after 4.3
check here

Solr Master Slave Failover setup for High Availability

While using Solr (we are currently using 3.5), how do we setup the Masters for a Failover?
Lets say in my Setup I have Two Masters and Two Slaves. The Application commits all the writes to One Active Master, and both the slaves get the updates from this Active Master. There is another repeater which serves the same purpose of the Master.
Now my question is if the Master for some reason comes down, how can I make the Repeater as a Master without any Manual intervention. How can the slaves start getting the updates from the Repeater instead of the broken Master. Is there a recommended way to do this? Are there any other recommended Master/Slave setup's to ensure High availability of the Solr systems?
At this time, your best option is probably to investigate the SolrCloud functionality present in the current Solr 4.0 alpha, which at the time of this writing is due for its final release within a few months. The goal of SolrCloud is to handle data distribution and master election, using the ZooKeeper distributed database to maintain consensus within the cluster about which nodes are serving in while roles.
There are other more traditional ways to set up failover for Solr 3's replicated master-slave architecture, but I personally wouldn't want to make that investment with Solr 4.0 so near to release.
Edit: See Linux-HA, for one such traditional approach. Personally, I would create a purpose-built daemon that reconfigures your cores and load balancer, using ZooKeeper for presence detection and distributed locks.
If outsourcing is an option, you might consider a hosted service such as my own humble Websolr. We provide this kind of distribution and hot failover by default, so our customers don't have to worry as much about the mechanics of how it's implemented.
I agree with Nick. The way replication works in Solr 3.x is not always handy, especially for master fail-over. If you are going to consider Solr 4 you might want to have a look at elasticsearch too, which solves this kind of problems in a really brilliant way!
It uses push replication instead of the pull mechanism used by Solr. That means the document is literally reindexed on all nodes. It might sound strange but that allows to reduce the network load (due to segment merge for example). Furthermore, a node is elected as master and if it crashes one other node will automatically replace it becoming the new master.

Resources