I'm trying to tweak our system status check to see the state of the Solr nodes in our SolrCloud. I'm facing the following problems:
We send a query to each of the Solr nodes separately. If we get a response and the status of the response is 0, we assume the node is running. Unfortunately, we've seen cases in which the node is recovering or even down and select queries are still handled.
Hoping to prevent this, we added a check that sends a ping request to Solr. If the status returned by this request reads 'OK', we assume the node is up. Unfortunately, even this check doesn't fail when the node is recovering or down.
My question is: What is the correct way to check the status of a node in SolrCloud?
If you are using SolrCloud, it's recommended to maintain an external ZooKeeper ensemble as well, because the ZooKeeper ensemble keeps the current status of every node and every shard in the SolrCloud. This is the status that the SolrCloud admin window displays.
Go to the Admin window. Click on "Cloud".
Then click on "Tree" to get a tree view of your SolrCloud architecture.
Click /clusterstate.json to view the SolrCloud status.
The clusterstate.json file holds the SolrCloud status information. If you are running an external ZooKeeper ensemble, follow these steps to get the SolrCloud status:
Go to the path "zookeeper/installation/directory/bin"
Execute ./zkCli.sh -server ZK_IP:ZK_PORT (E.g ./zkCli.sh -server localhost:2181)
Execute get /clusterstate.json
You'll find the SolrCloud status.
Note: ZK_IP is the IP of the host where ZooKeeper is running.
ZK_PORT is ZooKeeper's client port.
You actually don't want /clusterstate.json, as it only covers the case where collections are already present. From ZooKeeper you need /live_nodes.
Because ZooKeeper is the authority on which Solr nodes are members of the SolrCloud cluster, it follows that you should go to it first to discover which members are accessible. This is how all SolrCloud clients work, and it is probably the best way to approach the problem.
/live_nodes contains a file for each live Solr node, regardless of what collections exist or where the replicas are located.
Once you have resolved /live_nodes... you can call clusterstatus on any Solr instance with the address and port from one of the live-nodes.
http://localhost:8983/solr/admin/collections?action=clusterstatus&wt=json
clusterstatus provides a detailed overview of Solr nodes, collections, replicas, etc. Everything you would want to know.
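As a rough sketch of how a status check could use that response: the function below takes the already-parsed JSON from clusterstatus and reports, for one node, whether it is listed in live_nodes and which of its replicas are not active. The field names (cluster, live_nodes, collections, shards, replicas, node_name, state) are my reading of the response layout and may differ between Solr versions. The function name node_health and the sample data are made up for illustration.

```python
def node_health(status, node_name):
    """Summarise one node from a parsed clusterstatus response (a dict).

    Assumes the layout cluster -> collections -> shards -> replicas,
    with each replica carrying "node_name" and "state".
    """
    cluster = status.get("cluster", {})
    live = node_name in cluster.get("live_nodes", [])
    not_active = []
    for coll_name, coll in cluster.get("collections", {}).items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                if replica.get("node_name") != node_name:
                    continue
                if replica.get("state") != "active":
                    not_active.append(
                        (coll_name, shard_name, replica_name, replica.get("state"))
                    )
    # A node counts as healthy only if it is live AND all its replicas are active.
    return {"live": live, "not_active": not_active,
            "healthy": live and not not_active}


# Hypothetical sample data; real node names come from /live_nodes.
sample = {
    "cluster": {
        "live_nodes": ["10.0.0.1:8983_solr", "10.0.0.2:8983_solr"],
        "collections": {
            "products": {
                "shards": {
                    "shard1": {
                        "replicas": {
                            "core_node1": {"node_name": "10.0.0.1:8983_solr",
                                           "state": "active"},
                            "core_node2": {"node_name": "10.0.0.2:8983_solr",
                                           "state": "recovering"},
                        }
                    }
                }
            }
        },
    }
}

print(node_health(sample, "10.0.0.2:8983_solr"))
```

Checking both conditions matters here: a node can still be in live_nodes while one of its replicas is recovering, which is the case the original ping check missed.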
As a final note, it's very wise to set SOLR_HOST inside of solr.in.sh configuration (/etc/default/solr.in.sh) - by default 'localhost' is used to reference the solr node. Setting this value to the public address you want the Solr node identified by will prevent ZooKeeper from returning the address "localhost" to clients when attempting to reach a Solr Node.
Related
I am setting up SolrCloud (3 Solr nodes) with an external ZooKeeper, and I need to expose the ZooKeeper URL to the end user so that ZooKeeper can automatically determine an available Solr node and return it to the end user.
Does anyone know if this is feasible in Solr? This is required for high availability: if one of the Solr nodes goes down, the end user should be automatically redirected to another Solr instance.
Thanks
Shashi
I followed the instructions listed in Getting started with the Retrieve and Rank service to create a Solr cluster, but I received the following message: WRRCSR42:The requesting service instance may not create any more free solar clusters(current limit:1)
My questions: what does this message mean, and what should I do to get the cluster ID?
Thank you,
The error tells you that you've already created a Solr Cluster. IBM Watson R&R only provides one free cluster.
To retrieve the list of existing clusters, you can use the same endpoint as when you attempt to create the cluster, but issue a regular GET request instead of a POST request.
https://gateway.watsonplatform.net/retrieve-and-rank/api/v1/solr_clusters
The response lists your existing Solr clusters, and includes your solr_cluster_id
Or will I need to implement another solution, such as RabbitMQ for message queuing and some other service for load balancing along with it? Please point me in the right direction.
You are setting up Solr in SolrCloud mode. You can use the SolrJ Java client to load-balance indexing and search requests to Solr. CloudSolrServer is the class in the SolrJ client for connecting to SolrCloud.
It connects to ZooKeeper and keeps track of the state of each node in the cluster. With this knowledge, the CloudSolrServer client knows which nodes are the leaders and sends requests only to leaders to save time. Without CloudSolrServer, requests are sent in a round-robin fashion to all the Solr nodes (leaders and replicas). So there is an S/N chance of hitting the current shard leader, where N is the total number of nodes in the cloud (i.e. leaders plus replicas) and S is the number of shard leaders. There is a (1 - S/N) chance of hitting a non-leader node, which is wasteful because the non-leader node then has to pass the request on to its leader. With CloudSolrServer, requests are sent only to shard leaders, which performs much better.
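To put numbers on the S/N argument, here is a tiny worked example. The values S = 2 and N = 4 are made up for illustration (say, two shards with one leader and one replica each), and leader_hit_probability is a hypothetical helper name.

```python
def leader_hit_probability(shard_leaders, total_nodes):
    """Chance that a round-robin request lands on a shard leader (S/N)."""
    if total_nodes <= 0 or not 0 <= shard_leaders <= total_nodes:
        raise ValueError("need 0 <= shard_leaders <= total_nodes and total_nodes > 0")
    return shard_leaders / total_nodes


s, n = 2, 4  # hypothetical cluster: 2 shard leaders among 4 nodes
p = leader_hit_probability(s, n)
print(f"leader: {p:.2f}, non-leader: {1 - p:.2f}")
```

With these numbers, half of the round-robin requests would land on a non-leader and need an extra hop.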
If a node crashes, ZooKeeper notifies CloudSolrServer so that it removes the node from its list of eligible Solr instances. CloudSolrServer clients are also notified when a new leader is elected.
In fact, Solr actually uses the CloudSolrServer internally to communicate with other nodes in the cluster.
You don't need any type of queuing mechanism while working with Solr.
I've got a setup with 3 ZooKeepers and 4 SolrCloud nodes.
This is all working, all nodes are seeing each other and I initially had a default collection.
From there, I used the collections API to create a new collection. That completed successfully, and the collection is sharded across 2 nodes, with the other 2 being used for replicas. I can also successfully save documents to that collection. Browsing the Solr web GUI on any of the boxes all works, with no speed issues.
However, any time I try to use the collections API I get timeouts. Creating a new collection, reloading one of the existing collections, deleting a collection... all of them time out.
Any thoughts on why would be much appreciated
Cheers
I have also faced a similar issue:
Solr process 24214 running on port 8983
Failed to get system information from http://localhost:8983/solr/ due to: org.apache.solr.client.solrj.SolrServerException: clusterstatus the collection time out:180s
at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:537)
at org.apache.solr.util.SolrCLI.getJson(SolrCLI.java:471)
at org.apache.solr.util.SolrCLI$StatusTool.getCloudStatus(SolrCLI.java:721)
at org.apache.solr.util.SolrCLI$StatusTool.reportStatus(SolrCLI.java:704)
at org.apache.solr.util.SolrCLI$StatusTool.runTool(SolrCLI.java:662)
at org.apache.solr.util.SolrCLI.main(SolrCLI.java:215)
I resolved it by following these steps:
Stop all Solr instances
Stop all Zookeeper instances
Start all Zookeeper instances
Start Solr instances one at a time.
Such a timeout can occur when Solr is not able to obtain the cluster state. If the following call results in a timeout, then this is the case:
http://solr-hostname:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json
This may be caused by incorrect entries present in /clusterstate.json
To fix this:
get clusterstate from ZooKeeper by calling
zkcli.sh -zkhost localhost:2181 -cmd get /clusterstate.json > clusterstate.json
edit the extracted clusterstate.json file and remove the sections with wrong IPs or nonexistent hosts
clear the clusterstate in ZooKeeper by calling
zkcli.sh -zkhost localhost:2181 -cmd clear /clusterstate.json
save corrected state in ZooKeeper by sending updated JSON file
zkcli.sh -zkhost localhost:2181 -cmd putfile /clusterstate.json ./clusterstate.json
restart Solr instances
After that, if your clusterstate shows correct info, you should no longer have timeouts when accessing Collections API.
Note
Be careful when editing the clusterstate JSON: limit your changes to removing nonexistent hosts/replicas/shards.
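If the file is large, the edit step can be scripted instead of done by hand. The following is only a sketch under assumptions: it expects the extracted clusterstate.json to be laid out as collection -> shards -> replicas, with a node_name on each replica, and it removes nothing except replicas whose node_name is not in a list you supply. The function name prune_replicas and the sample hosts are made up. Review the result before sending it back with putfile.

```python
import json


def prune_replicas(clusterstate, valid_nodes):
    """Return (pruned_copy, removed) without replicas on unknown nodes.

    clusterstate: parsed content of the extracted clusterstate.json
    valid_nodes:  node names that really exist, e.g. "10.0.0.1:8983_solr"
    """
    pruned = json.loads(json.dumps(clusterstate))  # deep copy of plain JSON data
    removed = []
    for coll_name, coll in pruned.items():
        for shard_name, shard in coll.get("shards", {}).items():
            replicas = shard.get("replicas", {})
            for replica_name in list(replicas):  # copy keys, we delete while looping
                if replicas[replica_name].get("node_name") not in valid_nodes:
                    del replicas[replica_name]
                    removed.append((coll_name, shard_name, replica_name))
    return pruned, removed


# Hypothetical state containing one stale replica on a host that no longer exists.
state = {
    "products": {
        "shards": {
            "shard1": {
                "replicas": {
                    "core_node1": {"node_name": "10.0.0.1:8983_solr", "state": "active"},
                    "core_node9": {"node_name": "192.168.9.9:8983_solr", "state": "down"},
                }
            }
        }
    }
}

cleaned, removed = prune_replicas(state, {"10.0.0.1:8983_solr"})
print(removed)
print(json.dumps(cleaned, indent=2))
```

The function works on a copy, so the original data stays available for comparison against the cleaned version.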
I also had timeout issues with the collections API. To fix this problem, I added the server's IP address to the solr.xml file that you find in /var/solr/data/solr.xml. My setup consists of 3 Ubuntu servers that run ZooKeeper (3.4.6) and SolrCloud (5.2.1) on each server.
It ended up being a ZooKeeper config mismatch.
I was reading this PDF: http://2010.lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf, and there is a section that talks about CloudSolrServer. In particular, this statement is made:
It keeps a list of both live and dead servers. When a request to a server fails, that server is added to the ‘dead’ list, and another ‘live’ server is queried instead.
The ‘dead’ server list is occasionally pinged, and if a server comes back, it is moved back into the ‘live’ list.
This works fine when a SOLR instance or the machine crashes, but for normal maintenance it would be undesirable because requests in progress would be lost. Typically with a normal load balancer, there's a way to shut off traffic to a box, and then normal shutdown can proceed at some interval after that.
Since it appears that CloudSolrServer is intended to replace a traditional load balancer in front of a SOLR cluster, I was wondering about graceful shutdown. What is the recommended way to shut down a SOLR instance without losing requests (while using CloudSolrServer)?
If you want to shut down an instance gracefully, you first need to remove the corresponding node from ZooKeeper and then shut down the instance. You can remove the node from ZK with the DELETEREPLICA command:
/admin/collections?action=DELETEREPLICA&collection=collection&shard=shard&replica=replica
See more in Solr Collections API documentation
Once the ephemeral node is removed from ZooKeeper, CloudSolrServer will stop sending requests to it.
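For scripting this, a small sketch that only builds the request URL from the parameters shown above. The base URL, collection, shard and replica names are placeholders, and delete_replica_url is a hypothetical helper. Note that, per the Collections API documentation, DELETEREPLICA removes the replica from the collection, so check what it does to the replica's data before using it for routine maintenance.

```python
from urllib.parse import urlencode


def delete_replica_url(base_url, collection, shard, replica):
    """Build a Collections API DELETEREPLICA URL; does not send anything."""
    params = urlencode({
        "action": "DELETEREPLICA",
        "collection": collection,
        "shard": shard,
        "replica": replica,
    })
    return f"{base_url.rstrip('/')}/admin/collections?{params}"


# Placeholder values for illustration only.
print(delete_replica_url("http://localhost:8983/solr", "products", "shard1", "core_node2"))
```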