How to make a request in SolrCloud?

I have a Node.js app. We used to run a standalone Solr, but our company decided to move to SolrCloud to provide failover.
With standalone Solr there was only one server, and all my requests looked like http://solr_server:8983/solr/mycore/select?indent=on&q=*:*&wt=json so every request went to the same server.
But now I have 3 machines, each running 1 ZooKeeper and 1 Solr node, and my requests look like this: http://solr_server_1:8983/solr/mycollection/select?q=*:*
And now the question: what if solr_server_1 goes down? How can I still get my results? How should I handle requests in this case?

If you're doing this manually: You'll have to catch the exception when the connection fails, and then retry the next server in your list.
const servers = ['ip1:8983', 'ip2:8983', 'ip3:8983'];
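A minimal sketch of that manual-failover loop (assuming Node 18+ for the global `fetch`; the host names, collection, and helper names are illustrative):

```javascript
// Build the select URL for a given node (kept pure so it's easy to test).
function buildSelectUrl(host, collection, query) {
  return `http://${host}/solr/${collection}/select?q=${encodeURIComponent(query)}&wt=json`;
}

// Try each Solr node in turn; the first one that answers wins.
async function solrSelect(collection, query, hosts) {
  let lastError;
  for (const host of hosts) {
    try {
      const res = await fetch(buildSelectUrl(host, collection, query));
      if (res.ok) return res.json();
      lastError = new Error(`HTTP ${res.status} from ${host}`);
    } catch (err) {
      lastError = err; // connection refused or timed out -> try the next server
    }
  }
  throw lastError; // every node in the list failed
}
```

A call would then look like `solrSelect('mycollection', '*:*', servers)` with the server list above.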
If you're using a library that supports ZooKeeper (i.e. it connects to ZooKeeper to find out what the live nodes are), you give the client a list of ZooKeeper nodes and let it figure out the rest. node-solr-smart-client is one client that supports ZooKeeper.
const options = {
  zkConnectionString: 'ip1:2181,ip2:2181,ip3:2181',
  // etc.
};

solrSmartClient.createClient('my_solr_collection', options, function (err, solrClient) {
  if (err) throw err;
  // issue queries through solrClient here
});

Related

Zookeeper + Solr in TestContainers

I'm using org.testcontainers to perform integration testing with Solr.
[Using SolrJ in my unit tests]
When I start Solr in cloud mode, using an embedded ZooKeeper instance, I'm able to connect to the solr instance from my unit test, but unable to connect to ZooKeeper from my SolrClient.
I think this is because embedded ZooKeeper is bound to IP 127.0.0.1 and inaccessible.
If I start two separate containers [using a shared network], ZooKeeper and Solr, I can connect Solr to ZooKeeper, and I can connect to ZooKeeper from my unit tests, BUT when ZooKeeper returns the active Solr node, it returns the internal server IP, which is not accessible from my unit test [in my SolrJ client].
I'm not sure where to go with this.
Maybe there is a network mode that will do address translation?
Thoughts?
UPDATE:
There is an official Testcontainers Module: https://www.testcontainers.org/modules/solr/
It seems that this problem can't be solved that easily.
One way would be to use fixed ports with Testcontainers. In this case the ports 9983 and 8983 are mapped to the same ports on the host. This makes it possible to use the Solr Cloud client. But this only works if you can ensure that tests run sequentially, which can be a bit tricky, e.g. on Jenkins with feature branches.
A different solution would be to use another client. Since SolrJ provides multiple clients, you can choose which one you want to use. If you only want to search or update, you can use the LBHttp2SolrClient, which load-balances between multiple nodes. If you want to use a specific client for the integration tests, this example could work:
// Create the Solr container.
SolrContainer container = new SolrContainer();
// Start the container. This step might take some time...
container.start();
// Create a client that talks to the container's mapped port.
SolrClient client = new Http2SolrClient.Builder("http://localhost:" + container.getSolrPort() + "/solr").build();
// Do whatever you want with the client ...
SolrPingResponse response = client.ping("dummy");
// Stop the container.
container.stop();
Here is a list of Solr clients in Java: https://lucene.apache.org/solr/guide/8_3/using-solrj.html#types-of-solrclients
I ran into the exact same issue. I solved it using a proxy. In my docker-compose.yml I added:
  squid:
    image: sameersbn/squid:3.5.27-2
    ports:
      - "3128:3128"
    volumes:
      - ./squid.conf:/etc/squid/squid.conf
      - ./cache:/var/spool/squid
    restart: always
    networks:
      - solr
And in the configuration of the SolrClient I added:
[...]
HttpClient httpClient = HttpClientBuilder.create().setProxy(new HttpHost("localhost", 3128)).build();
CloudSolrClient c = new CloudSolrClient.Builder(getZookeeperList(), Optional.empty()).withHttpClient(httpClient).build();
[...]
protected List<String> getZookeeperList() {
    List<String> zookeeperList = new ArrayList<String>();
    for (Zookeepers z : Zookeepers.values()) {
        zookeeperList.add(testcontainer.getServiceHost(z.getServicename(), z.getPort()) + ":"
                + testcontainer.getServicePort(z.getServicename(), z.getPort()));
    }
    return zookeeperList;
}
But I'd still be interested in the workaround that Jeremy mentioned in his comment.

In Solr 5.x, does zookeeper perform the tasks of load balancing and message queuing as well?

Or will I need to implement other solutions, say RabbitMQ for message queuing and some other service for load balancing along with it? Please point me in the right direction.
You are setting up Solr in SolrCloud mode. You can use the SolrJ Java client to load-balance index and search requests across Solr. CloudSolrServer is a class in the SolrJ client for connecting to SolrCloud.
It connects to ZooKeeper and keeps track of the state of each node in the cluster. With this knowledge, the CloudSolrServer client knows which nodes are the leaders and sends requests to leaders only, to save time. Without CloudSolrServer, requests are sent in a round-robin fashion to all the Solr nodes (leaders and replicas). So there is an S/N chance of hitting the current shard leader, where N is the total number of nodes in the cloud (i.e. leaders + replicas = N) and S is the number of shard leaders. There is a (1 - S/N) chance of hitting a non-leader node, which is wasteful, as the non-leader node then has to pass the request on to its leader. With CloudSolrServer, requests are sent only to shard leaders, which performs much better.
If a node crashes, ZooKeeper notifies CloudSolrServer, so that CloudSolrServer removes it from the list of eligible Solr instances. If a new leader is elected, CloudSolrServer clients are notified as well.
In fact, Solr itself uses CloudSolrServer internally to communicate with other nodes in the cluster.
You don't need any type of queuing mechanism while working with Solr.
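To make the S/N arithmetic above concrete (the cluster shape below is a made-up example):

```javascript
// Chance that a blind round-robin request lands on a shard leader,
// per the S/N reasoning above. The numbers are a hypothetical cluster.
function leaderHitChance(shardLeaders, totalNodes) {
  return shardLeaders / totalNodes;
}

// 2 shards -> 2 leaders; each shard has 1 leader + 2 replicas -> 6 nodes total.
// Only 2 of every 6 round-robin requests hit a leader directly.
console.log(leaderHitChance(2, 6));
```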

Distribute Solr Using Replication without Using SolrCloud

I want to use Solr replication without using SolrCloud.
I have three Solr servers, one is master and others are slave.
How can I dispatch search queries to whichever Solr server isn't busy?
Which tools should I use, and how should I set this up?
You can use any load balancer - Solr talks HTTP, which makes any existing load balancing technology available. HAProxy, Varnish, nginx, etc. will all work as you expect, and you'll be able to use all the advanced features that the different packages offer. It'll also be independent of the client, meaning that you're not limited to the LBHttpSolrServer class from SolrJ or whatever your particular client offers. Certain LB solutions also offer high-throughput caching (Varnish) or dynamic real-time failover between live nodes.
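For example, a minimal nginx round-robin setup might look like this (a sketch; host names and the listen port are illustrative, and health checks, timeouts, etc. are left out):

```nginx
# Round-robin search traffic across three Solr servers.
upstream solr_read {
    server solr1:8983;
    server solr2:8983;
    server solr3:8983;
}

server {
    listen 80;
    location /solr/ {
        # Queries go to whichever upstream node is next in rotation;
        # nginx skips nodes that fail to connect.
        proxy_pass http://solr_read;
    }
}
```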
Another option we've also used successfully is to replicate the core to each web node, allowing us to always query localhost for searching.
You have configured Solr in master-slave mode. I think you can use LBHttpSolrServer from the SolrJ API for querying Solr. You need to send update requests to the master node explicitly. The LBHttpSolrServer will provide load balancing among all the specified nodes. In master-slave mode, the slaves are responsible for keeping themselves updated with the master.
Do NOT use this class for indexing in master/slave scenarios since documents must be sent to the correct master; no inter-node routing is done. In SolrCloud (leader/replica) scenarios, this class may be used for updates since updates will be forwarded to the appropriate leader.
I hope this will help.
Apache Camel can be used as a general-purpose load balancer, like this:
import org.apache.camel.CamelContext;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class LoadBalancer {
    public static void main(String[] args) throws Exception {
        CamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            public void configure() {
                from("jetty://http://localhost:8080")
                        .loadBalance().roundRobin()
                        .to("http://172.28.39.138:8080", "http://172.168.20.118:8080");
            }
        });
        context.start();
        Thread.sleep(100000);
        context.stop();
    }
}
There is some other material that may be useful:
Basic Apache Camel LoadBalancer Failover Example
http://camel.apache.org/load-balancer.html
But it seems there is no straightforward way to do Solr-Camel integration, because Camel balances requests over its Java "Bean" components.
http://camel.apache.org/loadbalancing-mina-example.html
There is another useful example:
https://svn.apache.org/repos/asf/camel/trunk/camel-core/src/test/java/org/apache/camel/processor/CustomLoadBalanceTest.java
And you can use camel as a proxy between client and server
http://camel.apache.org/how-to-use-camel-as-a-http-proxy-between-a-client-and-server.html
There are some presentations for getting started with Apache Camel, its approach and architecture:
http://www.slideshare.net/ieugen222/eip-cu-apache-camel

Query solr cluster for state of nodes

I'm trying to tweak our system status check to see the state of the Solr nodes in our SolrCloud. I'm facing the following problems:
We send a query to each of the Solr nodes separately. If we get a response and the status of the response is 0, we assume the node is running. Unfortunately, we've seen cases in which the node was recovering or even down and select queries were still handled.
Hoping to prevent this, we've added a check which sends a ping request to Solr. If the status returned by this request reads 'OK', we assume the node is up. Unfortunately, even with this request, if the node is recovering or down, this check won't fail.
My question is: What is the correct way to check the status of a node in SolrCloud?
If you are using SolrCloud, it's recommended to maintain an explicit ZooKeeper ensemble as well, because the ZooKeeper ensemble maintains the SolrCloud's current status per node and per shard. This status is reflected in the SolrCloud admin window.
Go to the Admin window. Click on "Cloud".
Then click on "Tree" to get a tree view of your SolrCloud architecture.
Click /clusterstate.json to view the SolrCloud status.
This JSON file (clusterstate.json) holds the SolrCloud status information. If you are running an explicit ZooKeeper ensemble, the following are the steps to get the SolrCloud status from the command line:
Go to the path "zookeeper/installation/directory/bin"
Execute ./zkCli.sh -server ZK_IP:ZK_PORT (e.g. ./zkCli.sh -server localhost:2181)
Execute get /clusterstate.json
You'll find the SolrCloud status.
Note: ZK_IP - the host IP where ZooKeeper is running.
ZK_PORT - ZooKeeper's client port.
You actually don't want /clusterstate.json - as this only covers the case where collections are already present. From ZooKeeper you need /live_nodes
Because Zookeeper is the authority for what Solr Nodes are members of the Solr cloud cluster, it follows that you should go to it first, to discover what members are accessible. This is how all Solr cloud clients work, and probably is the best way to approach the problem.
/live_nodes contains a file for each live Solr node, regardless of what collections exist or where the replicas are located.
Once you have resolved /live_nodes... you can call clusterstatus on any Solr instance with the address and port from one of the live-nodes.
http://localhost:8983/solr/admin/collections?action=clusterstatus&wt=json
clusterstatus provides a detailed overview of Solr nodes, collections, replicas, etc. Everything you would want to know.
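A status check built on this could call clusterstatus and flag any replica that isn't active. A sketch (Node 18+ `fetch` assumed; the base URL and the helper names are illustrative, though the nested collections/shards/replicas layout matches the clusterstatus response):

```javascript
// Fetch CLUSTERSTATUS from any live node (use an address from /live_nodes).
async function clusterStatus(baseUrl) {
  const res = await fetch(`${baseUrl}/solr/admin/collections?action=CLUSTERSTATUS&wt=json`);
  return res.json();
}

// Pure helper: flatten the nested collections -> shards -> replicas map
// so a monitor can alert on anything whose state is not "active".
function replicaStates(status) {
  const out = [];
  const collections = (status.cluster && status.cluster.collections) || {};
  for (const [collection, c] of Object.entries(collections)) {
    for (const [shard, s] of Object.entries(c.shards)) {
      for (const [replica, r] of Object.entries(s.replicas)) {
        out.push({ collection, shard, replica, node: r.node_name, state: r.state });
      }
    }
  }
  return out;
}
```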
As a final note, it's very wise to set SOLR_HOST inside of solr.in.sh configuration (/etc/default/solr.in.sh) - by default 'localhost' is used to reference the solr node. Setting this value to the public address you want the Solr node identified by will prevent ZooKeeper from returning the address "localhost" to clients when attempting to reach a Solr Node.

CloudSolrServer, graceful shutdown

I was reading this PDF: http://2010.lucene-eurocon.org/slides/Solr-In-The-Cloud_Mark-Miller.pdf, and there is a section that talks about CloudSolrServer. In particular, this statement is made:
It keeps a list of both live and dead servers. When a request to a server fails, that server is added to the ‘dead’ list, and another ‘live’ server is queried instead.
The ‘dead’ server list is occasionally pinged, and if a server comes back, it is moved back into the ‘live’ list.
This works fine when a SOLR instance or the machine crashes, but for normal maintenance it would be undesirable because requests in progress would be lost. Typically with a normal load balancer, there's a way to shut off traffic to a box, and then normal shutdown can proceed at some interval after that.
Since it appears that CloudSolrServer is intended to replace a traditional load balancer in front of a SOLR cluster, I was wondering about graceful shutdown. What is the recommended way to shutdown a SOLR instance while not losing requests, (while using CloudSolrServer)?
If you want to gracefully shut down an instance, you will need to first remove the corresponding node from ZooKeeper and then shut down the instance. You can remove the node from ZK by using the "DELETEREPLICA" command:
/admin/collections?action=DELETEREPLICA&collection=collection&shard=shard&replica=replica
See more in Solr Collections API documentation
Once the ephemeral node is removed from ZooKeeper, CloudSolrServer will stop sending requests to it.
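The drain step could be scripted along these lines (a sketch; Node 18+ `fetch` assumed, and the collection, shard, and replica names are placeholders):

```javascript
// Build the Collections API call that removes the replica from ZooKeeper.
function deleteReplicaUrl(baseUrl, collection, shard, replica) {
  const params = new URLSearchParams({ action: 'DELETEREPLICA', collection, shard, replica });
  return `${baseUrl}/solr/admin/collections?${params}`;
}

// Drain a node before shutdown: once ZooKeeper drops the replica,
// CloudSolrServer stops routing new requests to it.
async function drainReplica(baseUrl, collection, shard, replica) {
  const res = await fetch(deleteReplicaUrl(baseUrl, collection, shard, replica));
  if (!res.ok) throw new Error(`DELETEREPLICA failed: HTTP ${res.status}`);
  return res.json();
}
```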
