How does SolrCloud handle host failures?

I am learning how to use SolrCloud's new features, and I can successfully set up an ensemble of ZooKeepers and a set of Solr instances for a sharded index. I wanted to investigate how failures affected my setup. Mostly it worked as expected, except for one case.
I used two machines and started 3 ZooKeepers on each (6 total). I started a Solr instance on one machine (bosmac01), asking for 2 shards, and started a second instance on that machine. I then started two more Solr instances on a second machine (qasolrmaster). The Solr admin showed the configuration I expected, and indexing/querying worked:
Shard1: qasolrmaster:8900 and bosmac01:8983
Shard2: qasolrmaster:8910 and bosmac01:8920
I wanted to test what would happen if one machine crashed, so I shut down qasolrmaster. I expected that, since there would still be 3 ZooKeepers running and a Solr instance serving each shard, everything would keep working. Instead, the two remaining Solr instances (on bosmac01) kept trying to reconnect to the missing ZooKeepers. The admin UI would not display the cloud graph, and I could not add docs or query. The same thing happens if I just stop all the ZooKeepers on qasolrmaster but leave the machine running. Restarting one of the missing ZooKeepers returned things to normal.
Why did the test fail? Three ZooKeepers plus a Solr instance for each shard should allow things to keep working, yes?

ZooKeeper requires that a majority of its ensemble stay up. With 6 nodes the quorum is 4, so if you put 3 on one machine and 3 on another and then kill one machine, the surviving 3 nodes are not a majority. ZooKeeper stops serving, and the remaining Solr instances can no longer read or update cluster state.
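A symmetric 3+3 split across two machines can never survive the loss of either machine, because the 3 survivors are always one short of a quorum. A more failure-tolerant layout (a sketch only; the hostnames and ports are placeholders) is an odd-sized ensemble spread across three hosts, so that any single host can go down and the remaining two still form a majority:
# zoo.cfg, identical on all three ZooKeeper hosts
server.1=zkhost1:2888:3888
server.2=zkhost2:2888:3888
server.3=zkhost3:2888:3888
With this layout, killing any one host leaves 2 of 3 ZooKeepers running, which is still a quorum, so SolrCloud keeps working.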

Related

Sitecore SOLR Errors

I am using Solr with Sitecore in a production environment. I am getting a lot of errors in the Solr log, but the sites are working fine. I have 32 Solr cores, and I am using Solr version 4.10.3.0 with Sitecore 8.1 update 2. Below is a sample of these errors; can anyone explain them to me?
Most of the errors are self-descriptive, like this one:
undefined field: "Reckless"
means that the field in question is not defined in the Solr schema. Try to analyze the queries your system is accepting and track down which system is sending them.
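If the field turns out to be one your application legitimately sends, one way to make the error go away (a sketch; the field type and attributes here are only illustrative) is to declare it in schema.xml, or add a dynamic field pattern for the family of fields it belongs to:
<field name="Reckless" type="string" indexed="true" stored="true"/>
<!-- or, for a whole family of such fields, a dynamic field pattern -->
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
If the field is not legitimate, fix the client that is building the query instead.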
The less obvious one:
Overlapping onDeckSearchers=2
is a warning about warming searchers, in this case 2 of them running concurrently. This means that there were commits to the Solr index in quick succession, each of which triggered a warming searcher. The reason it is wasteful is that even though the first searcher has warmed up and is ready to serve queries, it will be thrown away as soon as the new searcher finishes warming and is ready to serve.
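The usual remedy is to commit less often. A sketch of how that might look in solrconfig.xml (the intervals are only illustrative and should be tuned to your indexing rate):
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit at most once per minute -->
  <openSearcher>false</openSearcher>  <!-- don't open a searcher on hard commits -->
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>             <!-- open a new searcher at most every 5 seconds -->
</autoSoftCommit>
Combined with removing explicit commits from the indexing clients, this keeps searcher warm-ups from overlapping.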

How do I recreate a SolrCloud shard using a compositeId router and known hash range?

We have a SolrCloud setup with 20 shards, each with only 1 replica, served on 8 servers.
After a server went down we are left with 16 shards, which means that some of the compositeId hash ranges aren't hosted by any cores. Somehow the shards/cores didn't come back after the server came up again. I can see the server in /live_nodes.
But it's not all bad: the data in the collection is volatile with a TTL of 30 minutes, and we have a failover in place that tries a new random compositeId whenever an "add" operation fails.
My question is: Is it possible to recreate the missing shards or do I have to delete and create the collection from scratch?
I know which hash ranges are missing, but the CREATESHARD API call doesn't support collections that use the 'compositeId' router. And I cannot use SPLITSHARD, since it only divides up the original shard's hash range.
(We use Solr 5.4.0 and can't upgrade before 6.1 is released, ref. SOLR-8940)
So, I asked the same question on the solr-user mailing list. The answer I got suggests using zkNavigator to edit state.json, adding the missing shard(s) with the correct parameters. An ADDREPLICA API call is needed afterwards.
I haven't gotten around to trying it yet, but I will if the problem recurs.
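For the record, roughly what that would involve (a sketch only; the shard name, hash range, collection name, and node are placeholders, and hand-editing state.json should be done very carefully): add an entry for the missing shard under "shards" in the collection's state.json, e.g.
"shard17": {
  "range": "8ccc0000-9998ffff",
  "state": "active",
  "replicas": {}
}
and then ask Solr to put a core on it:
http://host:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard17&node=host:8983_solr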

Solr nodes' replication is getting stuck

We have standalone Solr servers set up as master and slave, and a full indexer job runs nightly. Generally, when the job completes successfully, everything is fine. But in the last few days we noticed that the indexing node had a different document count than the search node, so the expected products were not available in our production system. We had to restart the nodes and start replication manually, and then the problem went away. We need to prevent this problem from occurring again. What do you suggest we check, or where should I look? I think the essential error for this issue is: "SEVERE: No files to download for index generation"
Regards
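For anyone debugging the same symptom, a quick way to see whether the slave has actually fallen behind (a sketch; host, port, and core name are placeholders) is to compare the index version and generation reported by master and slave, and to trigger a manual pull:
http://master:8983/solr/mycore/replication?command=details
http://slave:8983/solr/mycore/replication?command=details
http://slave:8983/solr/mycore/replication?command=fetchindex
If the generations diverge, fetchindex pulls the latest index from the master, which is the programmatic equivalent of the manual replication restart described above.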

Solr Master/Slave still being followed

I have a SolrCloud cluster with three different machines. Initially, when there was only one machine, I had enabled the replication handler for master/slave in solrconfig.xml. But then I changed the config and commented out the replication handler part. Then I added the other two machines to the cluster, created an ensemble of ZooKeepers (one on each machine), and uploaded the new solrconfig.xml file.
But I still see the master/slave setup on the initial machine (the one that existed from the beginning) and not on the other two machines. They all have the same config now, so why is the first machine still showing the master/slave part? Do I need to reload the config from ZooKeeper for that machine or something?
Any help will be appreciated.
thanks.
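For context, the sequence I would expect to need here (a sketch; the zkcli.sh path varies by Solr version, and the zkhost string, config name, collection name, and conf directory are placeholders) is to re-upload the edited config to ZooKeeper and then reload the collection so every node, including the first machine, picks it up:
./server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 -cmd upconfig -confname myconf -confdir /path/to/conf
http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection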

SolrCloud: Unable to Create Collection, Locking Issues

I have been trying to implement a SolrCloud, and everything works fine until I try to create a collection with 6 shards. My setup is as follows:
5 virtual servers, all running Ubuntu 14.04, hosted by a single company across different data centers
3 servers running ZooKeeper 3.4.6 for the ensemble
2 servers, each running Solr 5.1.0 server (Jetty)
The Solr instances each have a 2TB, ext4 secondary disk for the indexes, mounted at /solrData/Indexes. I set this value in solrconfig.xml via <dataDir>/solrData/Indexes</dataDir>, and uploaded it to the ZooKeeper ensemble. Note that these secondary disks are neither NAS nor NFS, which I know can cause problems. The solr user owns /solrData.
All the intra-server communication is via private IP, since all are hosted by the same company. I'm using iptables for firewall, and the ports are open and all the servers are communicating successfully. Config upload to ZooKeeper is successful, and I can see via the Solr admin interface that both nodes are available.
The trouble starts when I try to create a collection using the following command:
http://xxx.xxx.xxx.xxx:8983/solr/admin/collections?action=CREATE&name=coll1&maxShardsPerNode=6&router.name=implicit&shards=shard1,shard2,shard3,shard4,shard5,shard6&router.field=shard&async=4444
Via the Solr UI logging, I see that multiple index creation commands are issued simultaneously, like so:
6/25/2015, 7:55:45 AM WARN SolrCore [coll1_shard2_replica1] Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
6/25/2015, 7:55:45 AM WARN SolrCore [coll1_shard1_replica2] Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
Ultimately the task gets reported as complete, but in the log, I have locking errors:
Error creating core [coll1_shard2_replica1]: Lock obtain timed out: SimpleFSLock#/solrData/Indexes/index/write.lock
SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
Error closing IndexWriter
If I look at the cloud graph, maybe a couple of the shards will have been created, others are closed or recovering, and if I restart Solr, none of the cores can fire up.
Now, I know what you're going to say: follow this SO post and change solrconfig.xml locking settings to this:
<unlockOnStartup>true</unlockOnStartup>
<lockType>simple</lockType>
I did that, and it had no impact whatsoever. Hence the question. I'm about to have to release a single Solr instance into production, which I hate to do. Does anybody know how to fix this?
Based on the log entry you supplied, it looks like Solr may be creating the data (index) directory for EACH shard in the same folder.
Solr index directory '/solrData/Indexes/index' doesn't exist. Creating new index...
This message was shown for two different cores and it references the same location. What I usually do is change my Solr home to a different directory, under which all of the per-collection "instance" directories will be created. Then I manually edit the core.properties for each shard to specify the location of its index data.
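To keep the index for every core separate on the shared disk (a sketch; the paths are placeholders, though ${solr.core.name} is a standard substitution), you can either reference the core name in the dataDir setting in solrconfig.xml:
<dataDir>/solrData/Indexes/${solr.core.name}/data</dataDir>
or set dataDir explicitly in each core's core.properties:
name=coll1_shard2_replica1
dataDir=/solrData/Indexes/coll1_shard2_replica1
Either way, the cores stop competing for the same write.lock file, which is what was timing out during collection creation.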
