Can I manualy copy SOLR index files? - solr

I have a cluster of 3 sharded SOLR 4.1. There is a replicated cluster but the data is quite out of synced. I have stopped polling on those secondary nodes for a long time.
Now I want to start the replication again but I'm afraid it would take too long to replicate 400GB index data on each node.
If I manually copy over the index files from the master to the slave node, will it work?
Thanks

Yes, that should work just fine - as long as you don't write to the index while copying it (or copy it from a snapshot). In fact, that's what the replication does in the background (by replicating the segment files that needs replicating).
In older versions of Solr the replication was just shell scripts triggered to copy the index to other servers after an update happened.

Related

How to administrate storage of ClickHouse server in a Cluster when disks get full

I'm setting up a ClickHouse server in cluster, but one of the things that doesn't appear in the documentation is how to manage very large amount of data, it says that it can handle up to petabytes of data, but you can't store that much data in single server. You usually will have a few teras in each.
So my question is, how can I handle it to store in a node of the cluster, and then when it requires more space, add another, will it handle the distribution to the new server automatically or will I have to play with the weights in the shard distribution.
When you have more than 1 disk in one server, how can it use them all to store the data?
Is there a way to store very old data in the cloud and download it if needed? For example all data older than 2 years can be stored in Amazon S3 as it will be hardly requested and in case it is, it will take a longer time to retreive the data but wouldn't be a problem.
What solution would you find to this? Handling an ever exapanding database to avoid disk space issues in the future.
Thanks
I will assume that you use standard configuration for the ClickHouse cluster: several shards consisting of 2-3 replica nodes, and on each of these nodes a ReplicatedMergeTree table containing data for its respective shard. There are also Distributed tables created on one or more nodes that are configured to query the nodes of the cluster (relevant section in the docs).
When you add a new shard, old data is not moved to it automatically. Recommended approach is indeed to "play with the weights" as you have put it, i.e. increase the weight of the new node until the volume of data is even. But if you want to rebalance the data immediately, you can use the ALTER TABLE RESHARD command. Read the docs carefully and keep in mind various limitations of this command, e.g. it is not atomic.
When you have more than 1 disk in one server, how can it use them all to store the data?
Please read the section on configuring RAID in the administration tips.
Is there a way to store very old data in the cloud and download it if needed? For example all data older than 2 years can be stored in Amazon S3 as it will be hardly requested and in case it is, it will take a longer time to retreive the data but wouldn't be a problem.
MergeTree tables in ClickHouse are partitioned by month. You can use ALTER TABLE DETACH/ATTACH PARTITION commands to manipulate partitions. You can e.g. at the start of each month detach the partition for some older month and back it up to Amazon S3. Or you can setup a cluster of cheaper machines with ample disk space and manually move old partitions there. If your queries always include a filter on date, irrelevant partitions will be skipped automatically, else you can setup two Distributed tables: table_recent and table_all (with the cluster config including the nodes with old partitions).
Version 19.15 introduced multidisk strorage configuration. 20.1 introduces time-based data rearrangements.

How to handle solr replication when master goes down

I have solr setup, which is configured for Master and slave. The indexing is happening in master and slave is replicating the index at every 2 Min interval from master. So there is a delay of 2 Minutes in getting data from master to slave. Lets assume that my master was indexing at 10:42 some data but due to some hardware issue, master went down at 10:43. So now the data which was indexing at 10:42 was suppose to replicate on Slave by 10:44 (as we have set two minutes interval) Since now the master is not available, how to identify what the last indexed data in solr Master server. Is there way in solr log to track the index activity.
Thanks in Advance
Solr does log the indexing operations if you have the Solr log set to INFO. Any commit/add will show up in the log, so you can check the log for when the last addition was made. Depending on the setup, it might be hard to get the last log when then server is down, though.
You can reduce the time between replications to get more real time replication, or use SolrCloud instead (which should distribute the documents as they're being indexed).
There are also API endpoints (see which connections the Admin interface makes when browsing to the 'replication' status page) for getting the replication status, but those wouldn't help you if the server is gone.
In general - if the server isn't available, you'll have a hard time telling when it was last indexed to. You can work around a few of the issues by storing the indexing time outside of Solr from the indexing task, for example updating a value in memcache or MySQL every time you send something to be indexed from your application.

SolrCloud - Updates to schema or dataConfig

We have a SolrCloud managed by Zookeeper. One concern that we have is with updating the schema or dataConfig on the fly. All changes that we are planning to make is in the indexing server node on the SolrCloud. Once the changes to the schema or dataConfig are made, then we do a full dataimport.
The concern is that the replication of the new indexes on the slave nodes in the cloud would not happen immediately, but only after the replication interval. Also for the different slave nodes the replication will happen at different times, which might cause inconsistent results.
For e.g.
The index replication interval is 5 mins.
Slave node A started at 10:00 => next index replication would be at 10:05.
Slave node B started at 10:03 => next index replication would be at 10:08.
If we make changes to the schema in the indexing server and re-index the results at 10:04, then the results of this change would be available on node A at 10:05, but in node B only at 10:08. Requests made to the SolrCloud between 10:05 and 10:08 would have inconsistent results depending on which slave node the request gets redirected to.
Please let me know if there is any way to make the results more consistent.
#Wish, what you are stating is not the behavior of a SolrCloud.
In SolrCloud indexing are routed to shard leaders and leader sent the copies to all the replicas.
At any point of time, if the ZooKeeper identifies that any of the replica is not in sync with leader, it will brought down to recovering mode. In this mode it will not serve any requests including the query.
P.S: In solr cloud configs are maintained at ZooKeeper and not at the nodes level.
I guess you are little confusing Solr Cloud and Master Slave mode, please confirm which one setup are you in?

Solr Replication Starts periodically

Under what conditions does solr starts replicating from the start, we have noticed that in our master slave setup solr periodically start replicating the entire index from the beginning.
We have not made any changes to schema or config files, in-spite of that full replication get's triggered. How can this be avoided.
Regards,
Ayush

Backup strategy with master-slave solr 3.6 servers

We're using solr 3.6 replication with 2 servers - a master and a slave - and we're currently looking for the way to do clean backups.
As the wiki says so, we can use a HTTP command to create a snapshot of the master like this: http://myMasterHost/solr/replication?command=backup
But we still have some questions:
What is the benefit of the backup command on a classic shell script copying the index files?
The command only backups the indexes; is it possible to copy also the spellchecker folder? is it needed?
Can we create the snapshot while the application is running, so while there are potential index updates?
When we have to restore the servers from the backup, what do we have to do on the slave?
just copy the snapshot in its index folder, and removing the replication.properties file (or not)?
ask for a fetchindex through the HTTP command http://mySlave/solr/replication?command=fetchindex ?
just empty the slave index folder, in order to force a full replication from the master?
You can use the backup command provided by the ReplicationHandler. It's an asynchronous operation and it takes time if your index is big. This way you don't need to shutdown Solr. Then you'll find within the index directory a new directory named backup.yyyymmddHHMMSS with the backup date. You can also configure how many old backups you want to keep.
After that of course it's better if you move the backup to a safe location, probably to a different server.
I don't think it's possible to backup the spellchecker, not completely sure though.
Of course the command is meant to be run while the application is running. The only problem is that you will probably lose in the backup the documents that you committed after you started the backup itself.
You can also have a look at the lucene CheckIndex tool. Once you backed up the index you could check if the index is ok.
I wouldn't personally use the backups to restore the index on the slaves if you already have a good index on the master. The copy of the index would be automatic using the standard replication process (it's really a copy of the index segments), you don't need to copy them manually unless the backup contains better data than the master.

Resources