Solr Incremental backup on real-time system with heavy index

Solr Incremental backup on real-time system with heavy index - solr

I implement search engine with solr that import minimal 2 million doc per day.
User must can search on imported doc ASAP (near real-time).
I using 2 dedicated Windows x64 with tomcat 6 (Solr shard mode). every server, index about 120 million doc and about 220 GB (total 500 GB).
I want to get backup incremental from solr index file during update or search.
after search it, find rsync tools for UNIX and DeltaCopy for windows (GUI rsync for windows). but get error (vanished) during update.
how to solve this problem.
Note1:File copy really slow, when file size very large. therefore i can't use this way.
Note2: Can i prevent corrupt index files during update, if windows crash or hardware reset or any other problem ?

You can take a hot backup (i.e. while writing to the index) using the ReplicationHandler to copy Solr's data directory elsewhere on the local system. Then do whatever you like with that directory. You can launch the backup whenever you want by going to a URL like this:
http://host:8080/solr/replication?command=backup&location=/home/jboss/backup
Obviously you could script that with wget+cron.
More details can be found here:
http://wiki.apache.org/solr/SolrReplication
The Lucene in Action book has a section on hot backups with Lucene, and it appears to me that the code in Solr's ReplicationHandler uses the same strategy as outlined there. One of that book's authors even elaborated on how it works in another StackOverflow answer.

Don't run a backup while updating the index. You will probably get a corrupt (therefore useless) backup.
Some ideas to work around it:
Batch up your updates, i.e. instead of adding/updating documents all the time, add/update every n minutes. This will let you run the backup in between those n minutes. Cons: document freshness is affected.
Use a second, passive Solr core: Set up two cores per shard, one active and one passive. All queries are issued against the active core. Use replication to keep the passive core up to date. Run the backup against the passive core. You'd have to disable replication while running the backup. Cons: complex, more moving parts, requires double the disk space to maintain the passive core.

Related

Need to migrate the whole cluster from one DC to another DC

I have a SolrCloud cluster consists of 5 hosts in one DC.
The collection configuration is 5 shards and 3 replicas and max 3 shards per host.
Solr version used is 5.3.1.
Because of some unforeseen maintenance activity, it needs to be moved to some other DC temporarily. In order to minimize the impact we need the indexed data to be available with the new setup. All the nodes has roughly 100GB of indexed data.
I already have tried copying the whole setup to the new DC and restarted after after updating the host information in the config files. It always complains some or other shards not available from hosts while querying data. [error code 503]
Note: The back up was taken from a running setup.
I also have tried creating the whole cluster again with the same configuration and copying only the data directory from the back up. It also results in shards not available from the hosts.
I wanted to understand if there is something wrong in the process I am following. One thing I am suspecting is , the back up should be taken after stoping a particular node.
Is there any simple and better way available? I am using Solr-5.3.1.

The right way to do it is using backup and restore feature. This feature was already available in the 5.3 version, check the appropiate doc and follow the steps. Should work just fine.

why does CouchDBs _dbs.couch keep growing when purging/compacting DBs?

The setup:
A CouchDB 2.0 running in Docker on a Raspberry PI 3
A node-application that uses pouchdb, also in Docker on the same PI 3
The scenario:
At any given moment, the CouchDB has at max 4 Databases with a total of about 60 documents
the node application purges (using pouchdbs destroy) and recreates these databases periodically (some of them every two seconds, others every 15 minutes)
The databases are always recreated with the newest entries
The reason for purging the databases, instead of deleting their documents is, that i'd otherwise have a huge amount of deleted documents, and my web-client can't handle syncing all these deleted documents
The problem:
The file var/lib/couchdb/_dbs.couch always keeps growing, it never shrinks. Last time i left it alone for three weeks, and it grew to 37 GB. Fauxten showed, that the CouchDB only contains these up to 60 Documents, but this file still keeps growing, until it fills all the space available
What i tried:
running everything on an x86 machine (osx)
running couchdb without docker (because of this info)
using couchdb 2.1
running compaction manually (which didn't do anything)
googling for about 3 days now
Whatever i do, i always get the same result: the _dbs.couch keeps growing. I also wasn't really able to find out, what that files purpose is. googling that specific filename only yields two pages of search-results, none of which are specific.
The only thing i can currently do, is manually delete this file from time to time, and restart the docker-container, which does delete all my databases, but that is not a problem as the node-application recreates them soon after.

The _dbs database is a meta-database. It records the locations of all the shards of your clustered databases, but since it's a couchdb database too (though not a sharded one) it also needs compacting from time to time.
try;
curl localhost:5986/_dbs/_compact -XPOST -Hcontent-type:application/json
You can enable the compaction daemon to do this for you, and we enable it by default in the recent 2.1.0 release.
add this to the end of your local.ini file and restart couchdb;
[compactions]
_default = [{db_fragmentation, "70%"}, {view_fragmentation, "60%"}]

How to quickly transfer data between servers using Sunspot / Websolr?

Since I suspect my setup is rather conventional, I'd like to start by providing a little context. Our Solr setup involves three environments:
Production - Solr server hosted on Websolr.
Staging - Also a Solr server hosted on Websolr.
Development - Supported via the sunspot_solr gem which allows us to easily set up our own local Solr server for development.
For the most part, this is working well. We have a lot of records so doing a full reindex takes a few hours (despite eager loading and using background jobs to parallelize the work). But that's not too terrible since we don't need to completely reindex very often.
But there's another scenario which is starting to become very annoying... We very frequently need to populate our local machine (or staging environment) with production data (i.e. basically grab a SQL dump from production and pipe it into our local database). We do this all the time for bugfixes and whatnot.
At this point, because our data has changed, our local Solr index is out of date. So, if we want our search to work correctly, we also need to reindex our local Solr server and that takes a really long time.
So now the question: Rather than doing a full reindex, I would like to simply copy the production index down on to my machine (i.e. conceptually similar to a SQL dump but for a Solr server rather than a database). I've Googled around enough to know that this is possible but have not seen any solutions specific to Websolr / Sunspot. These are such common tools that I figured someone else must have figured this out already.
Thanks in advance for any help!

One of the better kept secrets of Solr (and websolr): You can use the Solr Replication API to copy the data between two indices.
If you're making a copy of the production index "prod54321" into the QA index "qa12345", then you'd initiate the replication with the fetchindex command on the QA index's replication handler. Here's a quick command to approximate that, using cURL.
curl -X POST https://index.websolr.com/solr/qa12345/replication \
-d command=fetchindex \
-d masterUrl=https://index.websolr.com/solr/prod54321/replication
(Note the references to the replication request handler on both URLs.)

Running a weekly update on a live Solr environment

I have a server which has a Solr Environment hosted on it. I want to run a weekly update of the data that our Solr database contains.
I have a couple solutions but I was wondering whether one is possible and if it is which one would be better:
My first solution is to have 2 Servers with a Solr environment on both and when one is updating you just switch the url using to connect to Solr and connect to the other one.
My other solution is the one I am not sure how to do. Is there a way to switch the datasource that a Solr environment looks at without restarting it or cutting out any current searches.
If anyone has any ideas it would be much appreciated.

Depending on the size of the data, you can probably just keep the Solr core running while doing the update. First issue a delete, then index the data and finally commit the changes. The new index state won't be seen before the commit is issued, which allows you to serve the old data while waiting for the indexing to complete.
Another option is to use the core admin to switch cores as you mentioned, similar to copying data into other cores (drop the mergeindex command).
If you're also talking about updating and upgrading the actual Solr version or application server while still serving content, having a second server that replicates the index from the master is an easy way to get more redundancy. That way you can keep serving queries from the second server while the first one is being maintained and then do it the other way around. Point your clients to an HTTP load balancer, and take the maintained server out of the list of servers serving requests while it's down. This will also make you resistant against single hardware failures, etc.
There's also the option of setting up SolrCloud, but that might require a bit more restructuring.

Solr indexes are not visible

We're using a training server to create solr indexes and uploading them to another (solr) server via rsync.
Until now, everything has been fine. Now, our index size on one core has increased drastically and our solr instances are refusing to read those indexes on that core. Also, they are ignoring those indexes without any exceptions. (we sure are reloading the cores or restarting tomcat after rsyncs)
ie: in solr stats, numDocs is 0 or /select?q=*:* is not returning any results..
Just to answer the question, are those indexes corrupted, we have regenerated them a couple of times. But nothing has changed. When we try to use smaller indexes, they are being read fine.
our solrconfig.xml in this core is like this; https://gist.github.com/983ebb13c895c9cccbfb

Copying your index using rsync is a bad idea. Your Solr server may not have completed writing files to disc when you initiate the copy operation, and you could end up with corruption. The only safe way to do this is to shut down the master (source index), shut down the slave (destination index), remove the entire content of the slave's index directory, copy the master's index across, and then restart everything.
A better approach is what was suggested by Peer Allan above - use Solr's built-in replication support. See http://wiki.apache.org/solr/SolrReplication.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight