How to quickly transfer data between servers using Sunspot / Websolr? - solr

Since I suspect my setup is rather conventional, I'd like to start by providing a little context. Our Solr setup involves three environments:
Production - Solr server hosted on Websolr.
Staging - Also a Solr server hosted on Websolr.
Development - Supported via the sunspot_solr gem which allows us to easily set up our own local Solr server for development.
For the most part, this is working well. We have a lot of records so doing a full reindex takes a few hours (despite eager loading and using background jobs to parallelize the work). But that's not too terrible since we don't need to completely reindex very often.
But there's another scenario which is starting to become very annoying... We very frequently need to populate our local machine (or staging environment) with production data (i.e. basically grab a SQL dump from production and pipe it into our local database). We do this all the time for bugfixes and whatnot.
At this point, because our data has changed, our local Solr index is out of date. So, if we want our search to work correctly, we also need to reindex our local Solr server and that takes a really long time.
So now the question: Rather than doing a full reindex, I would like to simply copy the production index down on to my machine (i.e. conceptually similar to a SQL dump but for a Solr server rather than a database). I've Googled around enough to know that this is possible but have not seen any solutions specific to Websolr / Sunspot. These are such common tools that I figured someone else must have figured this out already.
Thanks in advance for any help!

One of the better-kept secrets of Solr (and Websolr): you can use the Solr Replication API to copy data between two indices.
If you're making a copy of the production index "prod54321" into the QA index "qa12345", then you'd initiate the replication with the fetchindex command on the QA index's replication handler. Here's a quick command to approximate that, using cURL.
curl -X POST https://index.websolr.com/solr/qa12345/replication \
-d command=fetchindex \
-d masterUrl=https://index.websolr.com/solr/prod54321/replication
(Note the references to the replication request handler on both URLs.)
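If you want to confirm that the fetch has completed, the same replication handler can report its status. A quick sketch, reusing the hypothetical QA core name from the example above:

```shell
# "details" reports the state of the replication handler, including
# whether a fetchindex is still in progress (look for "isReplicating"
# in the response).
STATUS_URL="https://index.websolr.com/solr/qa12345/replication?command=details"

# "|| true" keeps the sketch runnable even with no reachable server.
curl "$STATUS_URL" || true
```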

Related

Need to migrate the whole cluster from one DC to another DC

I have a SolrCloud cluster consisting of 5 hosts in one DC.
The collection is configured with 5 shards, 3 replicas, and a maximum of 3 shards per host.
The Solr version used is 5.3.1.
Because of some unforeseen maintenance activity, the cluster needs to be moved to another DC temporarily. To minimize the impact, we need the indexed data to be available in the new setup. Each node has roughly 100GB of indexed data.
I have already tried copying the whole setup to the new DC and restarting it after updating the host information in the config files. It always complains that some shards are unavailable from the hosts when querying data (error code 503).
Note: the backup was taken from a running setup.
I have also tried creating the whole cluster again with the same configuration and copying only the data directory from the backup. That also results in shards being unavailable.
I want to understand whether there is something wrong in the process I am following. One thing I suspect is that the backup should be taken after stopping each node.
Is there a simpler, better way? I am using Solr 5.3.1.
The right way to do this is to use the backup and restore feature. It was already available in version 5.3; check the appropriate documentation and follow the steps there. It should work just fine.
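For reference, in 5.x the backup and restore commands live on each core's ReplicationHandler. A minimal sketch; the host, core, and snapshot names below are placeholders:

```shell
# Hypothetical core name -- substitute one of your shard replicas.
SOLR="http://localhost:8983/solr/mycollection_shard1_replica1"

# Take a named snapshot of the index on the old host...
# ("|| true" keeps the sketch runnable with no live server.)
curl "${SOLR}/replication?command=backup&name=dc-move" || true

# ...then, after copying the snapshot files to the new host,
# restore it there under the same name.
curl "${SOLR}/replication?command=restore&name=dc-move" || true
```

This has to be done per core; the snapshot is consistent even while the index is being written to, which avoids the corrupt-copy problem of copying the data directory from a running node.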

Solr - Multi Core vs Multiple Instance for Many Database Tables

I have a performance concern and would like a suggestion on which will work best: multiple cores, or multiple instances (on different ports)? Let's look at my case first:
Currently I am running Solr with multiple cores and it is running OK. There is only one issue: sometimes it runs out of heap memory while processing facet fields, and then I have to restart Solr. (To minimize the number of restarts, I start Solr with a large heap: java -Xms1000M -Xmx8000M -jar start.jar)
I have an Amazon EC2 instance with 8 cores at 2.8GHz, 15GB RAM, and an optimized hard disk.
I have many database tables (about 100) and have to create a different schema for each (which leads to creating a different core for each).
Each table has millions of documents, with 7-9 indexed fields and 10-50 stored fields per document.
My web portals have to handle very high traffic (currently I get 10 requests/second, which may increase to 50-100/second). I know Solr can handle that, but I mention it so you know I care about even the smallest performance issue.
I search Solr from PHP via cURL against a specific core, so there is no problem searching across different Solr instances either.
Question:
As far as I know, Solr handles one request at a time. So I think that if I create multiple instances of Solr and start them on different ports, my web portal can handle more requests at a time (if users search in different tables).
So, what would you suggest? Multiple cores in a single Solr instance, or multiple instances with one or two cores each?
Is there any problem with having multiple Solr instances running on different ports?
NOTE: I can/may/will combine rarely-searched or small cores in one instance AND heavy-traffic cores in a separate instance, OR two or three heavy-traffic cores in one instance, etc., because creating a separate instance for each table (~100 here) would take too much hardware.
As I didn't get any answer for more than a week, and had also tried many cases with Solr (and read some articles), I want to share my experience as an answer to my own question. It may help future viewers. I tried on Server Fault as well, with no success.
Solr can handle more than one request at a time.
I tested this by running a long query [qTime=7203, approx. 7 sec] and several small queries after the long one [qTime=30]; Solr responded to the small queries first, even though they were started after the long one.
This point goes a long way toward answering the question: use a single Solr instance with multiple cores, and just assign plenty of memory to the JVM.
Other points:
1. Each Solr instance requires RAM, so running multiple instances requires more resources, which is expensive. And if you use facets or sort fields, you need to allocate even more RAM to each instance.
As you can see in my case, I need to start Solr with a large heap (8GB). See also the case of the Danish Web Archive, which uses multiple instances, allocates 9GB of RAM to each, and has 256GB of total RAM.
2. You can run multiple instances of Solr on different ports with java -Djetty.port=8984 -jar start.jar. Everything ran OK, but I ran into one problem:
While indexing, it may give a "not enough memory" error, and then the Solr instance is killed. So you again need to start the second instance with a large heap, which leads to an even higher RAM requirement.
3. Solr's resource requirements and performance problems can be understood here. According to this, a 64-bit environment and 12GB of RAM are recommended for good performance. Solr optimization is explained here.

Running a weekly update on a live Solr environment

I have a server which hosts a Solr environment. I want to run a weekly update of the data that our Solr database contains.
I have a couple of solutions, but I was wondering whether one of them is possible, and if so, which one would be better:
My first solution is to have two servers, each with a Solr environment; when one is updating, you switch the URL used to connect to Solr and connect to the other one.
My other solution is the one I am not sure how to do: is there a way to switch the data source that a Solr environment looks at without restarting it or interrupting any current searches?
If anyone has any ideas it would be much appreciated.
Depending on the size of the data, you can probably just keep the Solr core running while doing the update. First issue a delete, then index the data and finally commit the changes. The new index state won't be seen before the commit is issued, which allows you to serve the old data while waiting for the indexing to complete.
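The delete-then-commit sequence can be sketched with plain update requests. The host and core names here are placeholders, and the reindexing step in between is whatever normally feeds your documents in:

```shell
# Hypothetical host and core -- substitute your own.
SOLR="http://localhost:8983/solr/mycore"

# 1. Queue a delete of everything -- not visible to searchers yet.
# ("|| true" keeps the sketch runnable with no live server.)
curl "${SOLR}/update" -H 'Content-Type: text/xml' \
     --data-binary '<delete><query>*:*</query></delete>' || true

# 2. Reindex your data here (still invisible to searchers).

# 3. Commit: searchers switch to the new index state in one step.
curl "${SOLR}/update?commit=true" || true
```

Until step 3, queries keep being served from the old index, which is what makes this safe on a live core.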
Another option is to use the core admin to switch cores as you mentioned, similar to copying data into other cores (drop the mergeindex command).
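The core-switching approach can be done with the CoreAdmin SWAP action: index into an offline core, then exchange it with the live one. A sketch with hypothetical core names:

```shell
ADMIN="http://localhost:8983/solr/admin/cores"

# After fully indexing into "ondeck", swap it with the live core.
# Queries against "live" then hit the freshly built index.
# ("|| true" keeps the sketch runnable with no live server.)
curl "${ADMIN}?action=SWAP&core=live&other=ondeck" || true
```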
If you're also talking about updating and upgrading the actual Solr version or application server while still serving content, having a second server that replicates the index from the master is an easy way to get more redundancy. That way you can keep serving queries from the second server while the first one is being maintained and then do it the other way around. Point your clients to an HTTP load balancer, and take the maintained server out of the list of servers serving requests while it's down. This will also make you resistant against single hardware failures, etc.
There's also the option of setting up SolrCloud, but that might require a bit more restructuring.

Solr master-master replication alternatives?

Currently we have two servers with a load balancer in front of them. We want to be able to turn one machine off, and back on later, without users noticing.
Our application also uses Solr, and now I want to install and configure Solr on both servers. The question is: how do I configure master-master replication?
After my initial research I found out that it's not possible :(
But what are my options here? I want both indices to stay in sync: when a document is committed on one server, it should also go to the other.
Thanks for your help!
I'm not certain of your specific use case (why turn one server off and on?), but there is no specific "master-master" replication. Solr does, however, support distributed indexing and querying via SolrCloud. From the documentation for SolrCloud:
Replication ensures redundancy for your data, and enables you to send an update request to any node in the shard. If that node is a replica, it will forward the request to the leader, which then forwards it to all existing replicas, using versioning to make sure every replica has the most up-to-date version. This architecture enables you to be certain that your data can be recovered in the event of a disaster, even if you are using Near Real Time searching.
It's a bit complex, so I'd suggest you spend some time going through the documentation, as it's not quite as simple as setting up a couple of masters and load balancing between them. It is a big step up from the previous master/slave replication that Solr used, so even if it's not a perfect fit, it will be a lot closer to what you need.
https://cwiki.apache.org/confluence/display/solr/SolrCloud
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud
You can just create a simple master-slave replication, as described here:
https://cwiki.apache.org/confluence/display/solr/Index+Replication
But be sure to send your inserts, deletes, and updates directly to the master; selects can go through the load balancer.
The other alternative is to create a third server as the master with two slaves, with the load balancer in front of the two slaves.

Solr Incremental backup on real-time system with heavy index

I have implemented a search engine with Solr that imports a minimum of 2 million documents per day.
Users must be able to search the imported documents ASAP (near real-time).
I am using 2 dedicated Windows x64 servers with Tomcat 6 (Solr shard mode). Each server indexes about 120 million documents, about 220GB (500GB total).
I want to take an incremental backup of the Solr index files while updates or searches are running.
After searching, I found the rsync tool for UNIX and DeltaCopy for Windows (a GUI rsync for Windows), but I get a "vanished" error during updates.
How can I solve this problem?
Note 1: Copying the files is really slow when they are very large, so I can't go that way.
Note 2: Can I prevent index file corruption during updates if Windows crashes, the hardware resets, or any other problem occurs?
You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:
http://host:8080/solr/replication?command=backup&location=/home/jboss/backup
Obviously you could script that with wget+cron.
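The wget+cron combination could look something like this (the host and backup path are taken from the example URL above and are placeholders):

```shell
# The hot-backup URL from the answer above (placeholder host/path).
BACKUP_URL="http://host:8080/solr/replication?command=backup&location=/home/jboss/backup"

# Run it by hand ("|| true" keeps the sketch runnable with no live server):
wget -qO- "$BACKUP_URL" || true

# Or schedule it nightly at 02:00 by adding a line like this via
# "crontab -e" (inline the full URL there; cron won't expand the
# shell variable defined above):
#   0 2 * * * wget -qO- "http://host:8080/solr/replication?command=backup&location=/home/jboss/backup"
```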
More details can be found here:
http://wiki.apache.org/solr/SolrReplication
The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.
Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.
Some ideas to work around it:
Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.
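Pausing and resuming the passive core's replication around the backup can also be done through the ReplicationHandler. A sketch, assuming a passive core named "passive" (host, core, and path are placeholders):

```shell
PASSIVE="http://localhost:8983/solr/passive"

# Stop the passive core from polling the active one...
# ("|| true" keeps the sketch runnable with no live server.)
curl "${PASSIVE}/replication?command=disablepoll" || true

# ...take the backup while its index sits still...
curl "${PASSIVE}/replication?command=backup&location=/backups" || true

# ...and resume polling afterwards.
curl "${PASSIVE}/replication?command=enablepoll" || true
```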
