Solr Cloud: Distribution of Shards across nodes

I'm currently using Solr Cloud 6.1; the following behavior can also be observed up to 7.0.
I'm trying to create a Solr collection with 5 shards and a replication factor of 2. I have 5 physical servers. Normally, this would distribute all 10 replicas evenly among the available servers.
But when starting Solr Cloud with a -h (hostname) param to give every Solr instance an individual but constant hostname, this doesn't work any more. The distribution then looks like this:
solr-0:
wikipedia_shard1_replica1 wikipedia_shard2_replica1 wikipedia_shard3_replica2 wikipedia_shard4_replica1 wikipedia_shard4_replica2
solr-1:
solr-2:
wikipedia_shard3_replica1 wikipedia_shard5_replica1 wikipedia_shard5_replica2
solr-3:
wikipedia_shard1_replica2
solr-4:
wikipedia_shard2_replica2
I tried using Rule-based Replica Placement, but the rules seem to be ignored.
I need to use hostnames because Solr runs in a Kubernetes cluster, where IP addresses change frequently and Solr won't find its cores after a container restart. I first suspected that a newer Solr version was the cause, but I narrowed it down to the hostname problem.
Is there any solution for this?

The solution was actually quite simple (but not really documented):
When creating a Service in OpenShift/Kubernetes, all matching Pods get backed by a load balancer. So even though every Solr instance is assigned a unique hostname, those hostnames all resolve to one single IP address (that of the load balancer).
Solr somehow can't deal with that and fails to distribute its shards evenly.
The solution is to use headless services in Kubernetes. Headless services aren't backed by a load balancer, and therefore every hostname resolves to a unique IP address.
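For reference, a minimal sketch of such a headless Service (the service name, label selector, and port here are hypothetical; the essential line is clusterIP: None). Combined with a StatefulSet, every Solr pod then gets a stable DNS name that resolves to its own pod IP:
apiVersion: v1
kind: Service
metadata:
  name: solr-headless
spec:
  clusterIP: None        # headless: no load-balancer VIP, DNS returns the individual pod IPs
  selector:
    app: solr
  ports:
    - name: solr
      port: 8983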

Related

Deploy SolrCloud to multiple servers

I am a little bit confused by SolrCloud. How can I deploy SolrCloud on multiple servers? Will it be multiple nodes, one per server, or will it be one SolrCloud node with multiple shards, one per server?
And how will all of this communicate with ZooKeeper (as far as I understand, ZooKeeper also has to be deployed on a separate server; is this correct)?
I am a little bit confused by all of this. Can you help me, or maybe give a link to a good tutorial?
The SolrCloud section of the reference manual should be able to help you out with the concepts of SolrCloud.
You can run multiple nodes on a single server, or you can run one node on each server. That's really up to you - but all the nodes running on a single server will disappear when that server goes down. The use case for running multiple nodes on a single server is usually experimenting, or very particular requirements to try to get certain speedups from the single-threaded parts of Lucene, so unless you're doing low-level optimization, having one node per server is what you want.
The exception to that rule is for development and experimenting - running multiple nodes on a single machine is fine when the data doesn't matter.
All the nodes make up a single SolrCloud cluster - so you'd be running multiple nodes, not multiple clusters.
Zookeeper should (usually) be deployed on three to five servers - depending on what kind of resiliency you want for failovers. While Solr bundles a Zookeeper instance you can use if you don't want to set up Zookeeper yourself, that is not recommended for production. In a production environment you'd run Zookeeper as a separate process - but that may not mean that you'll be running it on separate servers. Depending on how much traffic and use you'll see for Zookeeper for your nodes, running them on the same server as your cloud nodes will work perfectly fine. The point is to avoid using the bundled version to have full control over Zookeeper and its configuration, and to be able to upgrade/manage the instances outside of Solr.
If the need arises later you can move Zookeeper to its own cluster of servers then (at least three).
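As a rough illustration (the ZooKeeper hostnames are placeholders), every Solr node is then started in cloud mode and pointed at the same external ZooKeeper ensemble:
bin/solr start -c -p 8983 -z zk1:2181,zk2:2181,zk3:2181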

Using solrj and LBHttpSolrClient to access a single solrcloud instance

Is using the LBHttpSolrClient within SolrJ to access a single SolrCloud instance less robust than using the default SolrJ and ZooKeeper behavior? Can it load balance over a single SolrCloud instance correctly?
The SolrCloud instance that I have available has a collection with about 9 million documents, spread over three shards with about 3 million documents per shard. There are three nodes (servers) in the SolrCloud, with 3 shards, a replicationFactor of 2, and a maxShardsPerNode of 2. For this SolrCloud instance, there are also 3 ZooKeeper nodes running on these three servers.
This is the basic code that I've been told to use:
String zkUrls = "solrd1:2181,solrd2:2181,solrd3:2181";
String[] solrUrls = {"http://solrd1:8983/solr", "http://solrd2:8983/solr", "http://solrd3:8983/solr"};
// Build an explicit load balancer over the three nodes...
LBHttpSolrClient.Builder lbBuilder =
    new LBHttpSolrClient.Builder().withBaseSolrUrls(solrUrls);
// ...and hand it to the ZooKeeper-aware CloudSolrClient.
CloudSolrClient solr = new CloudSolrClient.Builder()
    .withLBHttpSolrClientBuilder(lbBuilder)
    .withZkHost(zkUrls)
    .build();
solr.setDefaultCollection(defaultCollection);
Is this LBHttpSolrClient able to properly use the provided solrUrls, given that each entry in that variable is just a node within a single SolrCloud? Does this load-balancing client automatically query all the other nodes to ensure the results are complete for the whole collection, instead of covering just the shards that exist on the node it hits?
If the LBHttpSolrClient is the correct way to access a single SolrCloud instance (better than SolrJ and ZooKeeper), is there a better way to let ZooKeeper provide the base Solr URLs? I have the impression that the LBHttpSolrClient predates the whole SolrCloud setup and was a way to load balance over multiple standalone instances of Solr; if that's the case, would using the LBHttpSolrClient be obsolete compared to SolrJ and ZooKeeper?
References:
Is there any loss of functionality if I use load balancer which does not communicate with zookeeper in solrcloud?
This link appears to have an appropriate title that may provide some insight into the same questions that I'm asking, but it has no answers.
Loadbalancer and Solrcloud
This link discusses how SolrJ and ZooKeeper work together, but does not address my questions about whether the LBHttpSolrClient is less robust or whether it will work correctly on a single instance of a small SolrCloud.
SolrCloud load-balancing
Does not address whether SolrJ and ZooKeeper are better suited than using the LBHttpSolrClient.
I think you are overcomplicating things; you can even skip the LBHttpSolrClient in your code entirely, and SolrJ will create the needed instance behind the scenes.
In short, CloudSolrClient uses LBHttpSolrClient to send requests to the right Solr instances. If you want to get the most out of your SolrCloud setup, use CloudSolrClient; if you use just an LBHttpSolrClient (without CloudSolrClient), you will not know that a Solr node has gone down, for instance, until you get failed requests.
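A minimal sketch of that simpler approach, assuming the same SolrJ builder API as in the question and a hypothetical collection name ("mycollection"); CloudSolrClient creates and manages the load balancer internally:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Connect via ZooKeeper only; SolrJ discovers live nodes and load balances itself.
String zkUrls = "solrd1:2181,solrd2:2181,solrd3:2181";
CloudSolrClient solr = new CloudSolrClient.Builder()
    .withZkHost(zkUrls)
    .build();
solr.setDefaultCollection("mycollection"); // hypothetical collection name
QueryResponse rsp = solr.query(new SolrQuery("*:*")); // exception handling omitted
solr.close();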

Zookeeper and SolrCloud on AWS EC2 instances

I have used Solr for a while, but am new to SolrCloud. I am investigating whether it makes sense in my context to deploy SolrCloud or to have multiple Solr instances (with matching indexed content) sitting behind an ELB.
My deployment will be in AWS on EC2 instances. Our current troubleshooting strategy in AWS is to terminate misbehaving instances and allow them to be automatically recreated by an AutoScaling group (which configures new instances via scripts when they are created). In fact, we do not have access to log on to the instances once they are in production. Everything stored in Solr can be re-indexed, so there is not a concern for data loss.
When trying to understand the SolrCloud infrastructure, however, I had a few questions:
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
AN: In ZooKeeper's configuration you just list the other ZooKeepers, so that each ZooKeeper is aware of the rest of the ensemble. You don't need to change this config unless you plan to increase or decrease the number of ZooKeepers, and even then you can do it without disturbing the cluster by changing one node at a time. Also, keep hostnames in the config so that a change of IP address has no impact.
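For illustration (hostnames and paths are placeholders), the ensemble part of each server's zoo.cfg simply lists all members by hostname; each server additionally has a myid file in its dataDir matching its own server.N entry:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888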
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
AN: In ZooKeeper there is a leader and there are followers. You don't need to bother about which is which, since you don't communicate with the ZooKeepers directly.
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
AN: When you create a new Solr node, you have to start it as part of the same cluster (pass the same ZooKeepers). Once it is up, you have to split a shard and move it to the new node so as to balance the cluster. This is not automated as of now.
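For illustration (collection, shard, and node names are placeholders), that manual rebalancing goes through the Collections API, e.g. by splitting a shard and then adding a replica of one of the resulting sub-shards on the new node:
http://solr1:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1
http://solr1:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1_0&node=newnode:8983_solr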
The Solr nodes are the ones that you have to add to your ELB.
When you start a Solr node, you pass it the list of ZooKeepers, which is how the node knows which cluster it is part of and which other nodes are serving that cluster.

Solr - Multi Core vs Multiple Instance for Many Database Tables

I have performance concerns and want a suggestion about which will be best: multiple cores or multiple instances (on different ports)? Let's have a look at my case first:
Currently I am running Solr with multiple cores and it's running OK. There is only one issue: sometimes it runs out of heap memory while processing facet fields, and then I have to restart Solr. (To minimize the number of restarts, I start Solr with a lot of memory: java -Xms1000M -Xmx8000M -jar start.jar)
I have an Amazon EC2 instance with 8 cores at 2.8 GHz, 15 GB RAM, and an optimized hard disk.
I have many database tables (about 100) and have to create a different schema for each (which leads to creating a different core for each).
Each table has millions of documents, with 7-9 indexed fields and 10-50 stored fields per document.
My web portals have to handle very high traffic (currently about 10 requests/second, which may increase to 50-100/second). I know Solr can handle that, but I mention it to make clear that I am concerned about even the smallest performance issue.
I search Solr from PHP via cURL against a specific core, so there is no problem in searching in a different Solr instance either.
Question:
As far as I know, Solr handles one request at a time. So I think that if I create multiple instances of Solr and start them on different ports, my web portal can handle more requests at a time (if users search in different tables).
So, what would you suggest? Multiple cores in a single Solr instance, or multiple instances with one or two cores each?
Is there any problem with having multiple Solr instances running on different ports?
NOTE: Here, I could combine less-searched/small cores in one instance and heavy-traffic cores in a separate instance, or two or three heavy-traffic cores in one instance, etc., because creating a separate instance for each table (~100 here) would take too many hardware resources.
As I didn't get any answer for more than a week, and I had also tried many cases with Solr (and read some articles), I want to share my experience as an answer to my own question. This may help future viewers. I tried on Server Fault as well, with no success.
Solr can handle more than one request at a time.
I have tested this by running a long query [qTime=7203, approx. 7 sec] and several small queries after the long one [qTime=30]; Solr responded to the small queries first, even though they were issued after the long one.
This point goes a long way towards the answer: use a single Solr instance with multiple cores, and just assign plenty of memory to the JVM.
Other Points:
1. Each Solr instance will require RAM, so running multiple instances will require more resources, which will be expensive. And if you are using facets or sort fields, then you need to allocate even more RAM to each instance.
As you can see, in my case I need to start Solr with a lot of memory (8 GB). You can also look at the case of the Danish Web Archive, which uses multiple instances, allocates 9 GB of RAM to each, and has 256 GB of total RAM.
2. You can run multiple instances of Solr on different ports with java -Djetty.port=8984 -jar start.jar. Everything ran OK, but I hit one problem.
While indexing, it may give a "not enough memory" error and then the Solr instance gets killed. So you again need to start the second instance with high memory, which leads to an even higher RAM requirement (see the example commands after this list).
3. Solr resource requirements and performance problems can be understood here. According to this, a 64-bit environment and 12 GB of RAM are recommended for good performance. Solr optimizations are explained here.
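To illustrate point 2 (heap sizes and ports here are only placeholders), each extra instance is a separate JVM with its own heap, so the total RAM requirement grows with every instance you add:
java -Xms1000M -Xmx4000M -Djetty.port=8983 -jar start.jar
java -Xms1000M -Xmx4000M -Djetty.port=8984 -jar start.jar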

SolrCloud load-balancing

I'm working on a .NET application that uses Solr as a search engine. I have configured a SolrCloud installation with two servers (one for a replica) and I didn't split the index into shards (number of shards = 1). I have read that SolrCloud (via ZooKeeper) can do some load balancing, but I didn't understand how. If I call a specific address where an instance of Solr is deployed, the query appears only in the logs of that specific server.
In the SolrCloud documentation I've found this:
Explicitly specify the addresses of shards you want to query, giving alternatives (delimited by |) used for load balancing and fail-over:
http://www.ipaddress.com:8983/solr/collection1/select?shards=www.ipaddress.com:8983/solr|www.ipaddress.com:8900/solr,www.ipaddress.com:7574/solr|www.ipaddress.com:7500/solr
I'm wondering if I can use this notation to force load balancing even when I have the entire index in one shard, and in that case, how the load balancer works.
UPDATE: I've tested this solution and it works. Adding the various shard addresses to the "shards" parameter, separated by the character "|", forces Solr to call the internal load balancer (LBHttpSolrServer), which performs simple round-robin balancing.
I've tested this solution and it works. Adding the various shard addresses to the "shards" parameter, separated by the character "|", forces Solr to call the internal load balancer (LBHttpSolrServer), which performs simple round-robin balancing.
Since you only have a single shard, the server receiving the request will respond with the result itself; it will not perform another request to the other replica when it has the data locally. The Java CloudSolrServer client connects to ZooKeeper, knows which servers are up or down, and will perform load balancing appropriately across all active servers. I don't believe there are any .NET ports available for this specific client.
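For illustration only (hostnames are placeholders), with one shard and two replicas the shards parameter lists the two replicas of that single shard separated by |, and Solr round-robins between them:
http://solr1:8983/solr/collection1/select?q=*:*&shards=solr1:8983/solr/collection1|solr2:8983/solr/collection1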
