Using solrj and LBHttpSolrClient to access a single solrcloud instance - solr

Is using the LBHttpSolrClient within solrj to access a single solrcloud instance is it less robust than using the default solrj and zookeeper behavior? Can it load balance over a single solrcloud instance correctly?
The solrcloud instance that I have available has a collection with about 9 million documents, spread over three shards with about 3 million documents per shard. There are three nodes (servers) in the solrcloud, with 3 shards, replicationFactor is 2, and maxShardsPerNode of 2. For this solrcloud instance, there are 3 zookeeper nodes also running on these three servers.
Note: The values listed in the following variable named solrUrls should be prefixed with "http://" instead of "http_url_". I am unable to post more than 2 URLs at this time so I must "encode" them. Sorry.
This is the basic code that I've been told to use:
String zkUrls = "solrd1:2181,solrd2:2181,solrd3:2181";
String solrUrls = {"http_url_solrd1:8983", "http_url_solrd2:8983", "http_url_solrd3:8983"};
LBHttpSolrClient.Builder lbclient =
new BHttpSolrClient.Builder().withBaseSolrUrls(solrUrls);
CloudSolrClient solr = new CloudSolrClient.Builder()
.withLBHttpSolrClientBuilder(lbclient)
.withZkHost(zkUrls)
.build();
cloudServer.setDefaultCollection(defaultCollection);
Is this LBHttpSolrClient client able to properly use the provided solrUrls since each node listed in that variable are just nodes within a single solrcloud? Does this load balance client automatically query all the other nodes to ensure the results are complete for the whole collection instead of just the shards that exist on that node?
If the use of the LBHttpSolrClient client is the correct way to access a single solrcloud instance (better than solrj and zookeeper), then is there a better way to let zookeeper provide the base solr urls? I have an impression that the LBHttpSolrClient client predates the whole solrcloud setup and was a way to load balance over multiple standalone instances of solr; if that's the case then would the use of the LBHttpSolrClient client be obsolete compared to solrj and zookeeper?
References:
Is there any loss of functionality if I use load balancer which does not communicate with zookeeper in solrcloud?
This link appears to have an appropriate title that may provide some insight in to the same questions that I'm asking, but it has no answers.
Loadbalancer and Solrcloud
This link discusses how solrj and zookeeper works together, but does not address my questions on if the LBHttpSolrClient client is less robust or if it will work correctly on a single instance of a small solrcloud.
SolrCloud load-balancing
Does not address if solrj and zookeeper is better suited than use of the LBHttpSolrClient client.

I think you are overcomplicating things, you can even totally skip the LBHttpSolrClient in your code, and Solrj will create the needed instance behind the scenes.
In short, CloudSolrClient uses LBHttpSolrClient to send request to right Solr instances. If you want to get the most out of your Solrcloud setup, use CloudSolrClient, if you use just a LBHttpSolrClient (without CloudSolrClient), then you will not know a Solr node has gone down for instance (until you get failed requests).

Related

Load balancing and indexing in SolrCloud

I have some questions regarding SolrCloud:
If I send a request directly to a solr node, which belons to a solr cluster, does it delegate the query to the zookeeper ensemble to handle it?
I want to have a single url to send requests to SolrCloud. Is there a better way of achieving this, than setting up an external load balancer, which balances directly between individual solr nodes? If 1 isn't true, this approach seems like a bad idea. On top I feel like it would somewhat defeat the purpose of zookeeper ensemble.
There is an option to break up a collection in shards. If I do so, how exactly does SolrCloud decide which document goes to which shard? Is there a need and/or an option to configure this process?
What happens if I send a collection of documents directly to one of the solr nodes? Would the data set somehow distribute itself across the shards evenly? If so, how does it happen?
Thanks a lot!
Zookeeper "just" keeps configuration data available for all nodes - i.e. the state of the cluster, etc. It does not get any queries "delegated" to it; it's just a way for Solr nodes and clients to know which collections are handled by which nodes in the cluster, and have that information be stored in resilient and available manner (i.e. dedicate the hard part out of managing a cluster to Zookeeper).
The best is to use a cloud aware Solr client - it will connect to any of the available Zookeeper nodes given in its configuration, retrieve the cluster state and connect directly to one the nodes that has the information it needs (i.e. the collection it needs to query). If you can't do that, you can either load balance with an external load balancer across all nodes in your cluster or let the client load balance if the client you use supports round robin, etc. - but having an external load balancer gives you other gains (such as being able to remove a node from load balancing for all clients at the same time, having dedicated http caching in front of th enodes, etc.) for a bit more administration.
It will use the unique id field to decide which node a given document should be routed to. You don't have to configure anything, but you can tell Solr to use a specific field or a specific prefix of a field, etc. as the route key. See Document Routing. for specific information. It allows you to make sure that all documents that belong to a specific client/application is placed on the same node (which is important for some calculations and possible operations).
It gets routed to the correct node. Whether that is evenly depends on your routing key, but by default, it'll be about as even as you can get it.

Solr Cloud: Distribution of Shards across nodes

I'm currently using Solr Cloud 6.1, the following behavior can also be observed until 7.0.
I'm trying to create a Solr collection with 5 shards and a replication factor of 2. I have 5 physical servers. Normally, this would distribute all 10 replicas evenly among the available servers.
But, when starting Solr Cloud with a -h (hostname) param to give every Solr instance an individual, but constant hostname, this doesn't work any more. The distribution then looks like this:
solr-0:
wikipedia_shard1_replica1 wikipedia_shard2_replica1 wikipedia_shard3_replica2 wikipedia_shard4_replica1 wikipedia_shard4_replica2
solr-1:
solr-2:
wikipedia_shard3_replica1 wikipedia_shard5_replica1 wikipedia_shard5_replica2
solr-3:
wikipedia_shard1_replica2
solr-4:
wikipedia_shard2_replica2
I tried using Rule-based Replica Placement, but the rules seem to be ignored.
I need to use hostnames, because Solr runs in a Kubernetes cluster, where IP adresses change frequently and Solr won't find it's cores after a container restart. I first suspected a newer Solr version to be the cause of this, but I narrowed it down to the hostname problem.
Is there any solution for this?
The solution was actually quite simple (but not really documented):
When creating a Service in OpenShift/Kubernetes, all matching Pods get backed by a load balancer. When all Solr instances get assigned an unique hostname, this hostnames would all resolve to one single IP address (that of the load balancer).
Solr somehow can't deal with that and fails to distribute its shards evenly.
The solution is to use headless services from Kubernetes. Headless services aren't backed by a load balancer and therefore every hostname resolves to an unique IP address.

Zookeeper and SolrCloud on AWS EC2 instances

I have used Solr for a while, but am new to SolrCloud. I am investigating whether it makes sense in my context to deploy SolrCloud or to have multiple Solr instances (with matching indexed content) sitting behind an ELB.
My deployment will be in AWS on EC2 instances. Our current troubleshooting strategy in AWS is to terminate misbehaving instances and allow them to be automatically recreated by an AutoScaling group (which configures new instances via scripts when they are created). In fact, we do not have access to log on to the instances once they are in production. Everything stored in Solr can be re-indexed, so there is not a concern for data loss.
When trying to understand the SolrCloud infrastructure, however, I had a few questions:
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
AN: In ZooKeeper, you will just have to mention about other ZooKeepers. This is to make the ZooKeepers aware of other running ZooKeepers. You don't need to change this config unless you plan to increase/decrease the number of ZooKeepers. Even if we have to do, we can do without disturbing the cluster by doing one at time. Also we keep hostname in config so that change in ip will have no impact on this.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
AN: In ZooKeeper, we have a leader and followers. We don't need to bother about them as we don't communicate with ZooKeepers
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
AN: When you create a new SOLR node, you will have to start the node under the same cluster (Pass same ZooKeepers). Once you start with this, you will have to split a shard and move it to another node so as to balance the cluster. Not automated as of now.
SOLR Nodes are the one that you have to add in your ELB.
When you start a SOLR node, you will mention the list of ZooKeepers by which SOLR node will understand which cluster is that part of and other nodes serving the cluster

Load balance solr search

I am trying to implement search in datastax cassandra using solr. I have two nodes running both cassandra and solr. I am able to perform solr search using solrj. However I have hardcoded solr url of one of the node. I would like to know what configuration/code change I need to change so that solr nodes can be chosen directly.
At this stage, I am reading solrUrl from an external file and passing it as an argument to HttpSolrServer.
HttpSolrServer solrServer = new HttpSolrServer(solrUrl);
External file contains solrUrl
Solr.URL=http://192.168.100.12:8983/solr/
Also what improvements I can do to existing approach?
You can use the LBHttpSolrServer (remember: only use it for querying), which allows you to provide several servers that SolrJ will use to distribute its queries.
If you have Solr Cloud cluster, you can use the ZooKeeper-aware server in SolrJ to get your queries automagically distributed.
Third, you can set up a regular HTTP load balancer (such as haproxy, varnish, etc.) to distribute the requests for you and handle new servers coming online and servers disappearing.
You could also read a random line in the file instead of one specific server, or use a separator for the configuration line and split on that separator and pick a server on random. It won't allow you to dynamically adjust the weights depending on query times (which a HTTP Load Balancer could do), but it would probably work Good Enough.

SolrCloud load-balancing

i'm working on a .NET application that uses Solr as Search Engine. I had configured a SolrCloud installation with two server (one for Replica) and i didn't split the index in shards (number of shards = 1). I have read that SolrCloud (via Zookeeper) can do some load balancing, but i didn't understand how. If a call a specific address where an instance of solr is deployed, the query appears only on the logs of that specific server.
On the documentation of SolrCloud i've found that:
Explicitly specify the addresses of shards you want to query, giving alternatives (delimited by |) used for load balancing and fail-over:
http://www.ipaddress.com:8983/solr/collection1/select?shards=www.ipaddress.com:8983/solr|www.ipaddress.com:8900/solr,www.ipaddress.com:7574/solr|www.ipaddress.com:7500/solr
I'm wondering if i can use this notation to force load balancing also if a have an entire index (only one shard) and in that case how the load-balancer works.
UPDATE: I've tested this solution and it works. Adding the various shard addresses in the field "shards" separated by the character "|" forces Solr to call the internal load balancer (LBHttpSolrServer) that performs a simple round robin balancing.
Thanks for your help.
Regards,
Jacopo
I've tested this solution and it works. Adding the various shard addresses in the field "shards" separated by the character "|" forces Solr to call the internal load balancer (LBHttpSolrServer) that performs a simple round robin balancing.
Since you only have a single shard, the server that is receiving the request will respond with the result, it will not perform another request to the other replica when it has the data locally. The Java CloudSolrServer client connects to ZooKeeper and knows which servers are up or down and will perform load balancing appropriately across all active servers. I don't believe there are any ports .NET ports available for this specific client.

Resources