Do I need httpd to have single endpoint to Infinispan cluster - apache-camel

I have a RedHat DataGrid cluster with two nodes on different servers and I use it from Camel route. So, when I define endpoint to cache I set one of the node host (i.e.):
<to uri="infinispan://node1.some.com:11222" />
DataGrid Cluster works fine in terms of caches. they are replicated, distributed etc.
But if node1 is down then I have no connection to cache.
So question:
Do I need to have httpd with mod_cluster upfront as load balancer or there is a way to setup cache cluster level endpoint to do not care about what node is up and how many nodes are there?
BTW: I tried to find an answer, but did not get clear answer so far.
Thanks.

The Hot Rod protocol automatically receives server topology information (i.e. joiners / leavers) as they happen. The connection string specifies the initial hosts, i.e. those that the client will attempt to connect to initially. As long as one of those is up and running, the clients will be able to talk to the whole cluster. To specify multiple initial hosts separate them with semicolons: host1:port1;host2:port2;...

Related

Load balancing and indexing in SolrCloud

I have some questions regarding SolrCloud:
If I send a request directly to a solr node, which belons to a solr cluster, does it delegate the query to the zookeeper ensemble to handle it?
I want to have a single url to send requests to SolrCloud. Is there a better way of achieving this, than setting up an external load balancer, which balances directly between individual solr nodes? If 1 isn't true, this approach seems like a bad idea. On top I feel like it would somewhat defeat the purpose of zookeeper ensemble.
There is an option to break up a collection in shards. If I do so, how exactly does SolrCloud decide which document goes to which shard? Is there a need and/or an option to configure this process?
What happens if I send a collection of documents directly to one of the solr nodes? Would the data set somehow distribute itself across the shards evenly? If so, how does it happen?
Thanks a lot!
Zookeeper "just" keeps configuration data available for all nodes - i.e. the state of the cluster, etc. It does not get any queries "delegated" to it; it's just a way for Solr nodes and clients to know which collections are handled by which nodes in the cluster, and have that information be stored in resilient and available manner (i.e. dedicate the hard part out of managing a cluster to Zookeeper).
The best is to use a cloud aware Solr client - it will connect to any of the available Zookeeper nodes given in its configuration, retrieve the cluster state and connect directly to one the nodes that has the information it needs (i.e. the collection it needs to query). If you can't do that, you can either load balance with an external load balancer across all nodes in your cluster or let the client load balance if the client you use supports round robin, etc. - but having an external load balancer gives you other gains (such as being able to remove a node from load balancing for all clients at the same time, having dedicated http caching in front of th enodes, etc.) for a bit more administration.
It will use the unique id field to decide which node a given document should be routed to. You don't have to configure anything, but you can tell Solr to use a specific field or a specific prefix of a field, etc. as the route key. See Document Routing. for specific information. It allows you to make sure that all documents that belong to a specific client/application is placed on the same node (which is important for some calculations and possible operations).
It gets routed to the correct node. Whether that is evenly depends on your routing key, but by default, it'll be about as even as you can get it.

Does Solr cloud needs a load balancer e.g. HAPROXY in master failure

I have searched a lot but unfortunately have some simple confusion about solr cloud. Lets say, I have three systems where solrCloud in configured (1 master and 2 slave) and external Zookeeper on same three machines to make a quorum. Systems names are
master
slave1
slave2
Public-Front
The Public-Front is the system where, I have configured HAPROXY. It receives requests from WWW and the send to backend server depending on ACLs.
According to my understanding, If I request to Solr collection (i.e., master), it routes it to slaves and hence load balanced. There is no need to specify slaves here. Isn't ?
Now in Public-Front, should I configured each Solr as a separate slave to load balance or just to master system.
Now if I only configure master system as solr-server in HAPROXY then if solr-server (master) goes down then I think I cannot get service from Solr from HAPROXY (although slaves are till up but not configured in HAPROXY).
Where am I wrong and what is the best approach ?
There is no traditional master or slave in Solr Cloud - there is a set of replicas, one of which is defined as the leader. The leader selection is automagic - i.e. the first replica that says it wants to be the leader, receives that status. This is per collection state. In your example there is three replicas, one which is designed as the leader. If that replica disappears, one of the two remaining replicas becomes the new leader, and everything continues as normal. The role of the leader is to be the up-to-date version of the index and handle any updates - first to its own index, then route those updates to any replicas.
There is also several types of replicas, and not all of them are suited to be promoted to a leader - but in the default configuration they can be.
Here's the thing - since there isn't really a master, all three indexes contain the same data and they all are replicas of the same shard, the request won't have to be routed through the master. If you're using a dumb haproxy, you can safely spread the requests across all three nodes and they should be able to answer the query without contacting any other nodes (as long as they all contain all the shards of the collection).
However, if you're using SolrJ or another Zookeeper capable client (and using the Zookeeper compatible client), the client will keep in touch with Zookeeper instead, and read the state information for your cluster. That allows the client to know which servers are currently replicas for your collection, and contact any of those nodes that it can decide have the required information for your query. In your case the result will be the same, except that your client will know not to connect to any nodes that disappear and will automagically know about nodes that are added to the cluster.
The "one Solr node routing requests to a different node" is only relevant if the node you're contacting doesn't have any replicas for the collection you're querying - i.e. it'll have to contact a different node to fetch that content. In that case an inter cluster request will happen and the load on the cluster will be slightly higher than necessary. When the collection is replicated to all three nodes - or when you're using SolrJ, that inter cluster request should not happen.

Zookeeper and SolrCloud on AWS EC2 instances

I have used Solr for a while, but am new to SolrCloud. I am investigating whether it makes sense in my context to deploy SolrCloud or to have multiple Solr instances (with matching indexed content) sitting behind an ELB.
My deployment will be in AWS on EC2 instances. Our current troubleshooting strategy in AWS is to terminate misbehaving instances and allow them to be automatically recreated by an AutoScaling group (which configures new instances via scripts when they are created). In fact, we do not have access to log on to the instances once they are in production. Everything stored in Solr can be re-indexed, so there is not a concern for data loss.
When trying to understand the SolrCloud infrastructure, however, I had a few questions:
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
Is Zookeeper able to automatically add a new instance if I destroy one of them? Everything I have seen seems to have static IP addresses in the configurations, which would require the configs to be updated (and Zookeeper restarted) if an instance was terminated and replaced.
AN: In ZooKeeper, you will just have to mention about other ZooKeepers. This is to make the ZooKeepers aware of other running ZooKeepers. You don't need to change this config unless you plan to increase/decrease the number of ZooKeepers. Even if we have to do, we can do without disturbing the cluster by doing one at time. Also we keep hostname in config so that change in ip will have no impact on this.
Is there a "master" Zookeeper instance that I should call, or can I call any of them? If I can call any of them, we would likely put an ELB in front of Zookeeper.
AN: In ZooKeeper, we have a leader and followers. We don't need to bother about them as we don't communicate with ZooKeepers
If we hit heavy usage and allow the AWS AutoScaling group to create additional servers that serve as SolrCloud shards, will SolrCloud gracefully add the instances and terminate them without problems? (This appears to be true, and the whole point of using SolrCloud.)
AN: When you create a new SOLR node, you will have to start the node under the same cluster (Pass same ZooKeepers). Once you start with this, you will have to split a shard and move it to another node so as to balance the cluster. Not automated as of now.
SOLR Nodes are the one that you have to add in your ELB.
When you start a SOLR node, you will mention the list of ZooKeepers by which SOLR node will understand which cluster is that part of and other nodes serving the cluster

Distributed systems where to send requests?

How do different components in a distributed system know where to send messages to access certain services?
For example, lets say I have a service which handles authentication, and a service which handles searching. How does the component which handles searching know where to send an authentication request? Are subdomains more commonly used? If so, how does replication work in this scenario? Is there some registry of local IP addresses which handles all this routing?
The problem you are describing is called service lookup / service registry / resource lookup / .. and it depends. It depends on how large your system is and how dynamic it is.
If you only have few components, it might be feasible enough to store the necessary information in a config file, or pass it as parameter. Generally, many use DNS as a lookup system, but it’s not considered to be a good one, due to the caching and long latency.
I think most distributed systems use Zookeeper to store this information for them. This way, all the services only need to know the IP-addresses of the Zookeeper cluster. If you have replication, you just store multiple addresses inside Zookeeper, and depending on which system you are using, you’ll need to choose an address on your own, or the driver does it (in case you’re connecting to a replicated database for instance).
Another way to do this, is to use a message queue, like ZMQ which will forward the messages to the correct instances. ZMQ can deal with replications and load balancing as well.

SolrCloud load-balancing

i'm working on a .NET application that uses Solr as Search Engine. I had configured a SolrCloud installation with two server (one for Replica) and i didn't split the index in shards (number of shards = 1). I have read that SolrCloud (via Zookeeper) can do some load balancing, but i didn't understand how. If a call a specific address where an instance of solr is deployed, the query appears only on the logs of that specific server.
On the documentation of SolrCloud i've found that:
Explicitly specify the addresses of shards you want to query, giving alternatives (delimited by |) used for load balancing and fail-over:
http://www.ipaddress.com:8983/solr/collection1/select?shards=www.ipaddress.com:8983/solr|www.ipaddress.com:8900/solr,www.ipaddress.com:7574/solr|www.ipaddress.com:7500/solr
I'm wondering if i can use this notation to force load balancing also if a have an entire index (only one shard) and in that case how the load-balancer works.
UPDATE: I've tested this solution and it works. Adding the various shard addresses in the field "shards" separated by the character "|" forces Solr to call the internal load balancer (LBHttpSolrServer) that performs a simple round robin balancing.
Thanks for your help.
Regards,
Jacopo
I've tested this solution and it works. Adding the various shard addresses in the field "shards" separated by the character "|" forces Solr to call the internal load balancer (LBHttpSolrServer) that performs a simple round robin balancing.
Since you only have a single shard, the server that is receiving the request will respond with the result, it will not perform another request to the other replica when it has the data locally. The Java CloudSolrServer client connects to ZooKeeper and knows which servers are up or down and will perform load balancing appropriately across all active servers. I don't believe there are any ports .NET ports available for this specific client.

Resources