Running a weekly update on a live Solr environment - solr

I have a server which has a Solr Environment hosted on it. I want to run a weekly update of the data that our Solr database contains.
I have a couple solutions but I was wondering whether one is possible and if it is which one would be better:
My first solution is to have 2 Servers with a Solr environment on both and when one is updating you just switch the url using to connect to Solr and connect to the other one.
My other solution is the one I am not sure how to do. Is there a way to switch the datasource that a Solr environment looks at without restarting it or cutting out any current searches.
If anyone has any ideas it would be much appreciated.

Depending on the size of the data, you can probably just keep the Solr core running while doing the update. First issue a delete, then index the data and finally commit the changes. The new index state won't be seen before the commit is issued, which allows you to serve the old data while waiting for the indexing to complete.
Another option is to use the core admin to switch cores as you mentioned, similar to copying data into other cores (drop the mergeindex command).
If you're also talking about updating and upgrading the actual Solr version or application server while still serving content, having a second server that replicates the index from the master is an easy way to get more redundancy. That way you can keep serving queries from the second server while the first one is being maintained and then do it the other way around. Point your clients to an HTTP load balancer, and take the maintained server out of the list of servers serving requests while it's down. This will also make you resistant against single hardware failures, etc.
There's also the option of setting up SolrCloud, but that might require a bit more restructuring.

Related

Which SOLR server should a distributed request be sent to when specifying shards in the URL?

I am setting up a distributed search with shards in SOLR.
Which server should I send this request to? or does it not matter?
host1:8983/solr/core?q=:&shards=host1:8983/solr/core,host2:8983/solr/core
vs
host2:8983/solr/core?q=:&shards=host1:8983/solr/core,host2:8983/solr/core
Similarly, would it be a better idea to have a separate empty solr server to direct these searches to instead of using one of the shards?
Unless you're seeing performance issues I wouldn't be too concerned about the performance difference between those two. The queries will run on both servers anyway, it'll just be a different server that's responsible for merging the end result to the client. If you want to spread this load across both servers, that's fine - in that case I'd go with alternating between both in a round robin manner (for example by placing an HTTP load balancer in front or letting your Solr library load balance between the available servers).
If you start getting replicas into the mix it becomes harder, where a load balancer will be useful. In that case it might be a good idea to look into Solr in cloud mode instead, where Solr will handle all this for you transparently (both load balancing and replica balancing, as long as your library is Zookeeper aware).

Need to migrate the whole cluster from one DC to another DC

I have a SolrCloud cluster consists of 5 hosts in one DC.
The collection configuration is 5 shards and 3 replicas and max 3 shards per host.
Solr version used is 5.3.1.
Because of some unforeseen maintenance activity, it needs to be moved to some other DC temporarily. In order to minimize the impact we need the indexed data to be available with the new setup. All the nodes has roughly 100GB of indexed data.
I already have tried copying the whole setup to the new DC and restarted after after updating the host information in the config files. It always complains some or other shards not available from hosts while querying data. [error code 503]
Note: The back up was taken from a running setup.
I also have tried creating the whole cluster again with the same configuration and copying only the data directory from the back up. It also results in shards not available from the hosts.
I wanted to understand if there is something wrong in the process I am following. One thing I am suspecting is , the back up should be taken after stoping a particular node.
Is there any simple and better way available? I am using Solr-5.3.1.
The right way to do it is using backup and restore feature. This feature was already available in the 5.3 version, check the appropiate doc and follow the steps. Should work just fine.

Solr Cloud Data Import Handler slow with replication

I am setting up a Solr Cloud deployment with 3 nodes and 3 shards. Without replication my data import handler imports very quickly- around 1.2M documents in ~5minutes. This is great, however when I enable replication, i.e. re-create the collection with a replication factor of 2, the data import handler becomes significantly slower, taking around 1hr 30mins for the same 1.2M documents.
I am using solr 5.3.1 in cloud mode on 3 4x16 virtual servers with a zookeeper instance on each node. The data import comes from an MS SQL DB.
Most of my configuration is the defaults that come with Solr, I have tried changing the auto commit for hard and soft commits to being very long but no effect.
Any ideas/pointers would be much appreciated.
Thanks,
Ewen
Maybe not a proper answer but the issue seemed to resolve itself. Of course we must have done something to make this happen, however all we can think of that we did was removing the CONSOLE logging in the log4j properties file and deleting the 11GB log file it had created.
Guess this may at least may give something else for others to try who are having the same issue.
When you send a document to a collection, it first gets proxied to the leader shard for that document, then the leader shard applies it locally, then sends it to all active replicas, then returns to the client.
This means the 'send a document' request is held open until all replicas have either received the document or failed. This means that the time to insert a doc is the max time for any replica to insert the document.
So yes, a collection with a higher replication factor will be slower to insert documents, assuming a fixed number of indexer connections.
With respect to logging, Solr uses synchronous logging by default, so if you're writing logs to a very slow disk or nfs or something, that could certainly influence query time. I highly recommend async logging for everything, but that means messing with the default Solr settings.

How to quickly transfer data between servers using Sunspot / Websolr?

Since I suspect my setup is rather conventional, I'd like to start by providing a little context. Our Solr setup involves three environments:
Production - Solr server hosted on Websolr.
Staging - Also a Solr server hosted on Websolr.
Development - Supported via the sunspot_solr gem which allows us to easily set up our own local Solr server for development.
For the most part, this is working well. We have a lot of records so doing a full reindex takes a few hours (despite eager loading and using background jobs to parallelize the work). But that's not too terrible since we don't need to completely reindex very often.
But there's another scenario which is starting to become very annoying... We very frequently need to populate our local machine (or staging environment) with production data (i.e. basically grab a SQL dump from production and pipe it into our local database). We do this all the time for bugfixes and whatnot.
At this point, because our data has changed, our local Solr index is out of date. So, if we want our search to work correctly, we also need to reindex our local Solr server and that takes a really long time.
So now the question: Rather than doing a full reindex, I would like to simply copy the production index down on to my machine (i.e. conceptually similar to a SQL dump but for a Solr server rather than a database). I've Googled around enough to know that this is possible but have not seen any solutions specific to Websolr / Sunspot. These are such common tools that I figured someone else must have figured this out already.
Thanks in advance for any help!
One of the better kept secrets of Solr (and websolr): You can use the Solr Replication API to copy the data between two indices.
If you're making a copy of the production index "prod54321" into the QA index "qa12345", then you'd initiate the replication with the fetchindex command on the QA index's replication handler. Here's a quick command to approximate that, using cURL.
curl -X POST https://index.websolr.com/solr/qa12345/replication \
-d command=fetchindex \
-d masterUrl=https://index.websolr.com/solr/prod54321/replication
(Note the references to the replication request handler on both URLs.)

Solr master-master replication alternatives?

Currently we have 2 servers with a load-balancer before them. We want to be able to turn 1 machine off and later on, without the user noticing it.
Our application also uses solr and now i wanted to install & configure solr on both servers and the question is how do i configure a master-master replication?
After my initial research i found out that it's not possible :(
But what are my options here? I want both indices to stay in sync and when a document is commited on one server it should also go to the other.
Thanks for your help!
Not certain of your specific use case (why turn 1 server on and off?), there is no specific "master-master" replication. Solr does however support distributed indexing and querying via SolrCloud. From the documentation for SolrCloud:
Replication ensures redundancy for your data, and enables you to send
an update request to any node in the shard. If that node is a
replica, it will forward the request to the leader, which then
forwards it to all existing replicas, using versioning to make sure
every replica has the most up-to-date version. This architecture
enables you to be certain that your data can be recovered in the event
of a disaster, even if you are using Near Real Time searching.
It's a bit complex so I'd suggest you spend some time going thru the documentation as it's not quite as simple as setting up a couple of masters and load balancing between them. It is a big step up from the previous master/slave replication that Solr used, so even if it's not a perfect fit it will be a lot closer to what you need.
https://cwiki.apache.org/confluence/display/solr/SolrCloud
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
You can just create a simple master - slave replication as described here:
https://cwiki.apache.org/confluence/display/solr/Index+Replication
But be sure you send your inserts, deletes, updates directly to the master, but selects can go through the load balancer.
The other alternative is to create a third server as a master, and 2 slaves, and the lode balancer can be in front of the two slaves.

Resources