Need to migrate the whole cluster from one DC to another DC - solr

I have a SolrCloud cluster consisting of 5 hosts in one DC.
The collection is configured with 5 shards, 3 replicas, and a maximum of 3 shards per host.
Solr version used is 5.3.1.
Because of some unforeseen maintenance activity, it needs to be moved to another DC temporarily. To minimize the impact, we need the indexed data to be available in the new setup. Each node has roughly 100GB of indexed data.
I have already tried copying the whole setup to the new DC and restarting it after updating the host information in the config files. It always complains that some shard or another is not available from the hosts while querying data (error code 503).
Note: the backup was taken from a running setup.
I have also tried creating the whole cluster again with the same configuration and copying only the data directory from the backup. That also results in shards being reported as unavailable from the hosts.
I want to understand whether there is something wrong in the process I am following. One thing I suspect is that the backup should be taken after stopping a particular node.
Is there a simpler and better way available? I am using Solr 5.3.1.

The right way to do it is to use the backup and restore feature. This feature was already available in the 5.3 version; check the appropriate doc and follow the steps. It should work just fine.
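For example (a hedged sketch only; host names, core names, and the backup location are assumptions), the core-level ReplicationHandler in 5.x exposes backup and restore commands that can be run against each shard's leader core:

# Back up one shard's leader core in the old DC (repeat per shard)
curl "http://old-dc-host:8983/solr/mycollection_shard1_replica1/replication?command=backup&location=/backups/solr&name=shard1_snap"

# After recreating the collection in the new DC, restore the snapshot into the matching core
curl "http://new-dc-host:8983/solr/mycollection_shard1_replica1/replication?command=restore&location=/backups/solr&name=shard1_snap"

# Check restore progress
curl "http://new-dc-host:8983/solr/mycollection_shard1_replica1/replication?command=restorestatus"

Taking a snapshot this way gives you a consistent, committed copy of the index instead of files copied from a running node, which is the problem you already suspected with your current approach.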

Related

SolrCloud - 2 nodes cluster

We are planning to implement SolrCloud in our solution (mainly for data replication reasons and disaster recovery); unfortunately, some of our customers have only 2 DCs, and one DC may be completely destroyed.
We are aware that running ZK in 2 locations is problematic, as ZK requires a quorum. Downtime on either side with 2 ZK nodes would cause cluster failure, and cluster failure would also be triggered by a network partition between the locations (the master would cease to be master due to lost quorum, and the slave couldn't elect itself for the same reason).
--
So our current plan A is to go with a single ZK for both sites and back ZK up into the other site. So if the site without ZK dies, we are OK. If the site with ZK dies, we should be able to start a new ZK from the backup and reconfigure Solr.
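A minimal sketch of what that ZK backup could look like (paths and host names are assumptions; ZooKeeper keeps its state as snapshot and transaction-log files under dataDir):

# Periodically copy the newest ZooKeeper snapshots and txn logs to the other site
ZK_DATA=/var/lib/zookeeper/version-2   # assumed dataDir location
rsync -av "$ZK_DATA"/snapshot.* "$ZK_DATA"/log.* backup-host:/backup/zookeeper/version-2/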
--
We also considered plan B with classic master-slave replication between the sites. BUT we are using Time Routed Aliases, hence we need SolrCloud features, hence we would also need to replicate the data/configuration in ZooKeeper (not only the Solr index). So this case seems like just more manual work in Solr, while we would still need to backup/restore ZK. So this plan was rejected.
--
Plan C may be to have 2 ZKs, but one with a bigger weight. This should survive a partition and the death of the ZK node with the lower weight. The first ZK node should be automatically backed up using standard cluster mechanics. But I do not even know of anyone using ZK this way...
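If I read the ZooKeeper hierarchical-quorum docs correctly, plan C would be expressed in zoo.cfg roughly like this (host names are assumptions): the node with weight 2 can form a quorum on its own, while the node with weight 1 cannot.

server.1=zk-primary:2888:3888
server.2=zk-secondary:2888:3888
group.1=1:2
weight.1=2
weight.2=1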
--
Is there any smarter way to set up SolrCloud in a 2-node environment? Which solution should we prefer?
We do not expect high availability; we want to achieve disaster recovery. Administrator intervention is expected in case of node failure; we only need to be resilient to short network glitches.
Edit: CDCR (Cross Data Center Replication) with Time Routed Aliases
We are considering using TRA because our data are time based, and customers are usually interested only in the latest slice/partition. Without TRA, the index grows and performance degrades, as more (unused/old) stuff sits in the index and RAM...
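For context, a TRA is created with CREATEALIAS roughly like this (a hedged sketch; the alias name, routing field, interval, and config name are assumptions):

admin/collections?
action=CREATEALIAS&
name=timeData&
router.name=time&
router.field=event_time_dt&
router.start=NOW/MONTH&
router.interval=%2B1MONTH&
create-collection.collection.configName=timeDataConfig&
create-collection.numShards=1&
create-collection.replicationFactor=2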
Here comes the problem with CDCR: according to the docs, the source and target collection parameters are required. But with TRA, collections are created automatically from the same solrconfig.xml (every X days/months). This problem with CDCR is known (see the comments), but not resolved yet.
It also seems that CDCR really does not synchronize ZooKeeper (I have not found any mention of that functionality in the docs, Jira, or the code), which may be OK with a static number of collections, but is very problematic with dynamically created collections (especially ones created by some machinery in the background, outside of the users'/developers' code).
Edit: According to David (the main author of TRA), the CDCR and TRA combination is not going to be supported.

How to setup Solr Cloud with two search servers?

Hi, I'm developing a Rails project with Sunspot Solr and configuring SolrCloud.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need a more stable system. Oftentimes the search server goes down for maintenance and the web application stops working in production. So I am thinking about how to run 2 identical search servers instead of one, to make the system more stable: if one server goes down, the other keeps working.
I cannot find any good tutorial that is simple, easy to understand, and described in detail...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it works inside:
synchronizing data between two servers (is it an automatic action?)
balancing search requests between the two servers
when one server suddenly stops working, the other should become the master (is it an automatic action?)
are there SolrCloud features other than those listed?
Read more about SolrCloud here: https://wiki.apache.org/solr/SolrCloud
A couple of inputs from my experience.
If your application only reads data from Solr and does not write to it in real time (you index using an ETL process or similar), then you can just go for a master-slave hierarchy.
Define one master: point all writes here. If this master is down, you will no longer be able to index data.
Create 2 (or more) slaves: this is a Solr feature that takes care of synchronizing data from the master at the interval you specify (say, every 20 seconds); see the config sketch after this list.
Create a load balancer over the slaves and point your application to read data from the load balancer.
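As a rough illustration (the host name, core name, and conf file names are assumptions), the replication handler is configured in solrconfig.xml along these lines:

<!-- On the master (indexing) core -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave (query) core: poll the master every 20 seconds -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/mycore/replication</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>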
Pros:
With the above setup, you don't have high availability for the master (data writes), but you will have high availability for reads until the last slave goes down.
Cons:
Assume one slave went down and you brought it back after an hour; this slave will be behind the other slaves by one hour. So it's a manual task to check data consistency against the other slaves before adding it back to the load balancer.
How about SolrCloud?
There is no master here, so you can achieve high availability for writes too.
No need to worry about the data inconsistency described above; the SolrCloud architecture takes care of that.
What suits you best:
Define an external ZooKeeper ensemble with a 3-node quorum.
Define at least 2 Solr servers.
Split your current index into 2 shards (by default, the shards will reside one each on the 2 Solr nodes defined in step 2).
Set the replication factor to 2 (this will create a replica of each shard on each node); see the example command after these steps.
Define an LB pointing to the above Solr nodes.
Point both your indexing pipeline and your application to this LB.
With the above setup, you can sustain the failover of either node.
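For reference, a minimal sketch of steps 3 and 4 with the Collections API, assuming the collection is named mycollection and a config set named myconfig has already been uploaded to ZooKeeper:

admin/collections?
action=CREATE&
name=mycollection&
numShards=2&
replicationFactor=2&
maxShardsPerNode=2&
collection.configName=myconfig

With 2 shards and replicationFactor=2 across 2 nodes, each node ends up holding a full copy of the index, so either node can serve all queries while the other is down.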
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Running a weekly update on a live Solr environment

I have a server which has a Solr Environment hosted on it. I want to run a weekly update of the data that our Solr database contains.
I have a couple of solutions, but I was wondering whether one is possible and, if so, which would be better:
My first solution is to have 2 servers, each with a Solr environment, and when one is updating you just switch the URL used to connect to Solr and connect to the other one.
My other solution is the one I am not sure how to do: is there a way to switch the data source that a Solr environment looks at without restarting it or interrupting any current searches?
If anyone has any ideas it would be much appreciated.
Depending on the size of the data, you can probably just keep the Solr core running while doing the update. First issue a delete, then index the data and finally commit the changes. The new index state won't be seen before the commit is issued, which allows you to serve the old data while waiting for the indexing to complete.
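A rough sketch of that flow with the JSON update API (the core name and field names are assumptions):

# 1) Queue a delete of the old data; not visible to searchers until the commit below
curl "http://localhost:8983/solr/mycore/update" -H "Content-Type: application/json" -d '{"delete": {"query": "*:*"}}'

# 2) Index the new data (batch as needed)
curl "http://localhost:8983/solr/mycore/update" -H "Content-Type: application/json" -d '[{"id": "1", "title_s": "example doc"}]'

# 3) Commit once at the end so the switch from old to new data happens in one step
curl "http://localhost:8983/solr/mycore/update?commit=true"

This assumes no autoSoftCommit (or autoCommit with openSearcher=true) is configured, since that would expose the intermediate state early.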
Another option is to use the Core Admin API to switch cores, as you mentioned, similar to copying data into other cores (but skipping the mergeindexes step).
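A hedged sketch of that core-swap variant via the CoreAdmin API, assuming two cores named live and rebuild: reindex into rebuild, then swap so queries hit the fresh index under the old name.

admin/cores?
action=SWAP&
core=live&
other=rebuild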
If you're also talking about updating or upgrading the actual Solr version or application server while still serving content, having a second server that replicates the index from the master is an easy way to get more redundancy. That way you can keep serving queries from the second server while the first one is being maintained, and then do it the other way around. Point your clients to an HTTP load balancer, and take the server under maintenance out of the list of servers serving requests while it's down. This will also make you resilient to single hardware failures, etc.
There's also the option of setting up SolrCloud, but that might require a bit more restructuring.

Solr master-master replication alternatives?

Currently we have 2 servers with a load balancer in front of them. We want to be able to turn 1 machine off and later back on, without the user noticing it.
Our application also uses Solr, and now I want to install and configure Solr on both servers; the question is how do I configure master-master replication?
After my initial research I found out that it's not possible :(
But what are my options here? I want both indices to stay in sync, and when a document is committed on one server it should also go to the other.
Thanks for your help!
I'm not certain of your specific use case (why turn 1 server on and off?), but there is no specific "master-master" replication. Solr does, however, support distributed indexing and querying via SolrCloud. From the documentation for SolrCloud:
Replication ensures redundancy for your data, and enables you to send
an update request to any node in the shard. If that node is a
replica, it will forward the request to the leader, which then
forwards it to all existing replicas, using versioning to make sure
every replica has the most up-to-date version. This architecture
enables you to be certain that your data can be recovered in the event
of a disaster, even if you are using Near Real Time searching.
It's a bit complex, so I'd suggest you spend some time going through the documentation, as it's not quite as simple as setting up a couple of masters and load balancing between them. It is a big step up from the previous master/slave replication that Solr used, so even if it's not a perfect fit, it will be a lot closer to what you need.
https://cwiki.apache.org/confluence/display/solr/SolrCloud
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
You can just create simple master-slave replication as described here:
https://cwiki.apache.org/confluence/display/solr/Index+Replication
Be sure to send your inserts, deletes, and updates directly to the master; selects can go through the load balancer.
The other alternative is to create a third server as the master, with 2 slaves, and put the load balancer in front of the two slaves.

Solr 4 Adding Shard to existing Cluster

Background: I just finished reading the Apache Solr 4 Cookbook. In it the author mentions that setting up shards needs to be done wisely, because new ones cannot be added to an existing cluster. However, this was written for Solr 4.0, and at present I am using 4.1. Is this still the case? I wish I hadn't run into this issue, and I'm hoping someone can tell me otherwise.
Question: Am I expected to know how much data I'll store in the future when setting up shards in a SolrCloud cluster?
I have played with Solandra and read up on elastic search, but quite honestly I am a fan of Solr as it is (and its large community!). I also like Zookeeper. Am I stuck for now or is there a workaround/patch?
Edit: If the answer to the question above is NO, could I build a SolrCloud with a bunch of shards (maybe 100 or more), let them grow (internally), and, as my data grows, start peeling them off one by one and putting them onto larger, faster servers with more resources?
Yes, of course you can. You have to set up a new Solr server pointing to the same ZooKeeper instance. During bootstrap the server connects to the ZK ensemble and registers itself as a cluster member.
Once the registration process is complete, the server is ready to create new cores. You can create replicas of the existing shards using CoreAdmin. You can also create new shards, but they won't be balanced: because of the Lucene index format (not all fields are stored), Solr may not have all the document information needed to rebalance the cluster, so only newly indexed/updated documents will land on this server (doing this is not recommended).
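For illustration (the collection, shard, and core names are assumptions), adding a replica of an existing shard on the new node in Solr 4.x can be done with a CoreAdmin CREATE call on that node; later 4.x releases also expose an ADDREPLICA action in the Collections API for the same purpose.

admin/cores?
action=CREATE&
name=mycollection_shard1_replica2&
collection=mycollection&
shard=shard1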
When you set up your SolrCloud you have to create the cluster taking into account your document growth rate, so if you have 1M documents at first and it grows by 10k docs/day, set up the cluster with 5 shards; at the start you host these shards on your initial two machines, but in the future, as needed, you can add new servers to the cluster and move those shards to the new servers. Be careful not to overgrow your cluster because, in Lucene, a single 20GB index split across 5 shards won't be a 4GB index on every shard. Every shard will take about (single_index_size / num_shards) * 1.1 (due to dictionary compression), i.e. roughly 20GB / 5 * 1.1 ≈ 4.4GB per shard in this example. This may change depending on your term frequencies.
The last option you have is to add the new servers to the cluster and, instead of adding new shards/replicas to the existing collection, set up a new, different collection using your new shards and reindex into this new collection in parallel. Then, once your reindexing process has finished, swap this collection with the old one.
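One common way to do that swap without touching clients is a collection alias (available in later 4.x releases via the Collections API; the alias and collection names here are assumptions): point the alias your application queries at the newly reindexed collection.

admin/collections?
action=CREATEALIAS&
name=myindex&
collections=myindex_v2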
One solution to the problem is to use the "implicit router" when creating your Collection.
Let's say you have to index all the "Audit Trail" data of your application into Solr, and new data gets added every day. You would most probably want to shard by year.
You could do something like the below during the initial setup of your collection:
admin/collections?
action=CREATE&
name=AuditTrailIndex&
router.name=implicit&
shards=2010,2011,2012,2013,2014&
router.field=year
The above command:
a) Creates 5 shards - one each for the current and the last 4 years 2010,2011,2012,2013,2014
b) Routes data to the correct shard based on the value of the "year" field (specified as router.field)
In December 2014, you might add a new shard in preparation for 2015 using the CREATESHARD API (part of the Collections API) - Do something like:
/admin/collections?
action=CREATESHARD&
shard=2015&
collection=AuditTrailIndex
The above command creates a new shard on the same collection.
When it's 2015, all data will automatically get indexed into the "2015" shard, assuming your data has the "year" field correctly populated with 2015.
In 2015, if you decide you no longer need the 2010 shard (based on your data retention requirements), you can use the DELETESHARD API to remove it:
/admin/collections?
action=DELETESHARD&
shard=2010&
collection=AuditTrailIndex
P.S. This solution only works if you used the "implicit" router when creating your collection. It does NOT work when you use the default "compositeId" router, i.e. collections created with the numShards parameter.
This feature is truly a game changer - allows shards to be added dynamically based on growing demands of your business.
