When does a distributed system need ZooKeeper?

Why do some distributed systems like Solr or Kafka need ZooKeeper, but some distributed systems like Cassandra don't?

ZooKeeper provides a strongly consistent store for critical system state. Many systems, e.g. Storm and Kafka, rely on ZooKeeper for service discovery and leader election. Because ZooKeeper's ZAB protocol falls on the CP side of the CAP theorem, it can guarantee that two clients will not see different views of the same system. So, for instance, Kafka will not mistakenly believe that both node A and node C are the leader for the same partition.
These systems use ZooKeeper simply because it is a very well tested and proven technology for storing this type of critical metadata; ZooKeeper acts as a central point of coordination. Cassandra, however, has a more decentralized architecture and implements its own consensus algorithm (Paxos) rather than relying on an external CP store like ZooKeeper. Depending on how Cassandra uses its gossip and consensus protocols, it may simply make some concessions that systems like Kafka and Solr do not. This lets Cassandra avoid a dependency on an external system like ZooKeeper, which can generally tolerate fewer failures than highly available systems can.
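For a concrete picture of the kind of coordination ZooKeeper provides, here is a minimal leader-election sketch using Apache Curator's LeaderLatch recipe. This is not how Kafka implements its controller election internally; the connection string, znode path and participant id below are hypothetical.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderElectionSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical ensemble; replace with your own connection string.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Every contender creates a latch on the same path; ZooKeeper's
            // ordered ephemeral znodes guarantee at most one leader at a time.
            LeaderLatch latch = new LeaderLatch(client, "/myapp/leader", "node-A");
            latch.start();

            latch.await();  // blocks until this process is elected
            System.out.println("Leadership acquired: " + latch.hasLeadership());

            latch.close();  // gives up leadership, another contender takes over
            client.close();
        }
    }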

Systems that need ZooKeeper rely on it for cluster coordination. Cassandra's architecture is different because it is a peer-to-peer system; as a consequence, coordination is distributed among the nodes themselves.

In Kafka, consumers of topics register themselves in ZooKeeper in order to coordinate with each other and balance the consumption of data.
Consumers can also store their offsets in ZooKeeper by setting offsets.storage=zookeeper.
Solr embeds and uses ZooKeeper as a repository for cluster configuration and coordination - think of it as a distributed filesystem that contains information about all of the Solr servers.
Apart from these, ZooKeeper is used in many other systems such as Hadoop high availability and HBase.
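To see what these systems actually keep in ZooKeeper, you can browse the znodes with the plain ZooKeeper Java client. A minimal sketch, assuming a locally reachable ensemble; the exact paths depend on the Kafka/Solr version and on any chroot in use:

    import java.util.List;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkInspect {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection string.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 10000, event -> { });

            // Kafka brokers register ephemeral znodes under /brokers/ids.
            List<String> brokerIds = zk.getChildren("/brokers/ids", false);
            System.out.println("Kafka broker ids: " + brokerIds);

            // SolrCloud keeps per-collection state under /collections
            // (below the chroot, if one is configured).
            List<String> collections = zk.getChildren("/collections", false);
            System.out.println("Solr collections: " + collections);

            zk.close();
        }
    }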

Related

How to use NATS Streaming Server with Apache flink?

I want to use NATs streaming server to streaming data and using Flink want to process on data.
how I can use apache flink to process real-time streaming data with NATS streaming server?
You'll need to either find or develop a Flink/NATS connector, or mirror the data into some other stream storage service that already has Flink support. There is no NATS connector among the connectors that ship with Flink or Apache Bahir, nor in the collection of Flink community packages, but if you search around you will find some relevant projects on GitHub and elsewhere.
When evaluating a connector implementation, look at these factors in addition to the usual considerations:
does it provide both consumer and producer interfaces?
does it do checkpointing?
what processing guarantees does it provide? (at least once, exactly once)
how good is the error handling?
performance: e.g., is it somehow batching writes?
how does it handle serialization?
does it expose any metrics?
If you decide to write your own connector, there are existing connectors for similar systems you can use as a reference, e.g., NiFi, Pulsar, etc. You should also be aware that the interfaces used by data sources are currently being refactored under the umbrella of FLIP-27.
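As a rough illustration of what the consumer side of such a connector involves, here is a minimal sketch of a Flink SourceFunction wrapping a NATS subscription. It uses the core NATS Java client (jnats) rather than the NATS Streaming client, does no checkpointing (so it only gives at-most-once guarantees), and the URL and subject are hypothetical:

    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    import io.nats.client.Connection;
    import io.nats.client.Dispatcher;
    import io.nats.client.Nats;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    public class NatsSource implements SourceFunction<String> {
        private final String url;      // e.g. "nats://localhost:4222"
        private final String subject;
        private volatile boolean running = true;

        public NatsSource(String url, String subject) {
            this.url = url;
            this.subject = subject;
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
            Connection nc = Nats.connect(url);
            Dispatcher dispatcher = nc.createDispatcher(
                    msg -> buffer.offer(new String(msg.getData(), StandardCharsets.UTF_8)));
            dispatcher.subscribe(subject);

            while (running) {
                String value = buffer.poll(100, TimeUnit.MILLISECONDS);
                if (value == null) {
                    continue;
                }
                // Emit under the checkpoint lock so records and any future
                // checkpointed state stay consistent.
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(value);
                }
            }
            nc.close();
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

A production-quality connector would also implement CheckpointedFunction (or, going forward, the new FLIP-27 Source interface) to track and replay positions, which is exactly where the checkpointing and processing-guarantee questions above come in.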

What is the difference between p2p file system and distributed file system?

When I googled for a distributed storage tool for my app,
I found two types of technologies:
Some present themselves as P2P file systems (IPFS..) and the others as distributed file systems (Ceph ..).
So what is the difference between P2P systems and distributed systems?
What I believe (and I may be wrong) is that P2P systems don't assume trust between nodes, whereas in distributed systems all nodes have to trust each other, or at least trust a "master" node.
P2P is a Distributed System Architecture.
What I believe (and I may be wrong) is that P2P systems don't assume trust between nodes, whereas in distributed systems all nodes have to trust each other, or at least trust a "master" node.
It depends on your definition of trust. If by 'trust' you mean that each node can operate as a standalone computer, then you are correct.
P2P involves a component called a peer. In P2P, each peer has the same power and capabilities as any other peer in the network, and one peer can work alone without the others.
Another example of a distributed system architecture is the client-server architecture.
A client has limited capability compared to a peer: it must connect to a server to perform a specific task, and it can do little without one.
A distributed file system (DFS) combines the storage of several nodes (possibly a large number) in such a way that the end user sees it as a single storage space; a middleware layer manages all of the disk space and takes care of the data. Such a system can rely either on servers or on plain workstations. If the nodes are workstations we are talking about a P2P DFS, and if they are servers we just say distributed file system. Even a P2P file system may involve nodes that act as servers for indexing files, mapping locations, etc. A P2P DFS is affected by the churn of its peers (join/leave behaviour), while server-based systems don't have this problem.
The best approach is to analyze several P2P distributed file systems such as Freenet, CFS, OceanStore (interesting since it uses untrusted servers that act as peers), Farsite, etc.; take a look here for more.
And some DFSs like Ceph, Hadoop, Riak, etc.; some of them you can find here.
Hope this helped.

Should zookeeper be run on the worker machines or independent machines?

We have several kinds of software that use ZooKeeper, like Solr, Storm, Kafka, HBase, etc.
There are 2 options to install a ZooKeeper cluster (more than one node):
Embedded cluster: Install ZK on some of the same machines as the other software, OR
External cluster: Have a few not very powerful but dedicated zookeeper machines (in the same region, cloud and data-center though) to run zookeeper on.
Which is a better option for cluster stability? Note that in both cases we always have an odd number of machines in our ZooKeeper cluster, and not just one machine.
It appears that the embedded option is easier to set up and makes better use of the machines, but the external option seems more stable because the loss of a single machine means the loss of just one component (the loss of a machine in an embedded setup means losing a ZooKeeper node as well as a worker node of Solr, Storm, Kafka, or whatever the case may be).
What is the industry standard to run zookeepers in production for maximum stability?
Zookeeper is a critical component of a Kafka cluster, but since the introduction of the new generation of clients the load on ZK has been greatly reduced and it is now only used by the cluster itself. Even though the load is usually not very high, it can be sensitive to latency, and therefore the best practice is to run a Zookeeper ensemble on dedicated machines, optimally even with dedicated disks for the ZK transaction logs to avoid IO contention.
By using larger Zookeeper ensembles you gain resiliency, but this also increases communication within the ensemble and you could lose some performance. Since Zookeeper works with simple majority voting, you need an odd number of nodes for it to make sense. A 3-node ensemble allows losing 1 node without impact, a 5-node ensemble allows losing 2 nodes, and so on.
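The arithmetic behind those numbers is plain majority voting; a tiny sketch (not a ZooKeeper API, just the quorum math):

    public class QuorumMath {
        public static void main(String[] args) {
            for (int n = 1; n <= 7; n++) {
                int quorum = n / 2 + 1;        // votes needed to make progress
                int tolerated = n - quorum;    // nodes that may fail
                System.out.printf("ensemble=%d quorum=%d tolerated failures=%d%n",
                        n, quorum, tolerated);
            }
            // 3 and 4 nodes both tolerate only 1 failure, 5 and 6 both tolerate 2,
            // which is why odd-sized ensembles are recommended.
        }
    }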
In practice, I've seen small, low-workload clusters run very well with Zookeeper installed on the same machines as the Kafka nodes, but if you aim for maximum stability and have increasing traffic, separate clusters would be recommended.
Using the internal (embedded) ZooKeeper in production is discouraged.
It's good to have an external ZooKeeper, and best to have a ZooKeeper ensemble (three or more nodes).
If you have only one ZooKeeper node, it can create problems when it goes down.
If you have a cluster of ZooKeeper nodes and one node goes down, the remaining majority of nodes will keep running and continue to work.
More details
For SolrCloud, we strongly recommend that ZooKeeper be external, and that you have at least three ZooKeeper nodes.
This does NOT mean that it cannot run on the same servers as Solr, but it DOES mean that you should NOT use the zookeeper server that Solr itself can start, embedded within itself.
Here's some information related to performance and SolrCloud that touches on zookeeper:
https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud
Whether or not you need completely separate machines, or even separate disks for the zookeeper database when running on the same machine as Solr, is VERY dependent on the characteristics of your SolrCloud install. If your index is very small and your query load is low, it's possible that you can put zookeeper on the same machines and even the same disks.
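To make the external-ensemble setup concrete from the client side, here is a minimal SolrJ sketch (SolrJ 7+ API; the hostnames, chroot and collection name are hypothetical):

    import java.util.Arrays;
    import java.util.Optional;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ExternalZkSolrClient {
        public static void main(String[] args) throws Exception {
            // Three dedicated ZooKeeper hosts plus a /solr chroot.
            CloudSolrClient solr = new CloudSolrClient.Builder(
                    Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                    Optional.of("/solr"))
                .build();

            // The client reads the cluster state from ZooKeeper and routes
            // requests to the right Solr nodes; no Solr URL is hard-coded.
            QueryResponse rsp = solr.query("mycollection", new SolrQuery("*:*"));
            System.out.println("Docs found: " + rsp.getResults().getNumFound());

            solr.close();
        }
    }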
For the other services you mentioned, I have no idea what the recommendation is.

Combining Solr 3x-style Master/Slave "Repeater" to feed remote 4x SolrCloud instances?

Solr 3x "Repeaters" and Multiple Data Centers:
Solr 3.x let a node behave as both a slave and a master: it could pull from one master and then feed copies downstream to its own slaves. This was so common/useful it even had a name, a "Repeater".
This was useful if you wanted to span multiple data centers. You could have the real master in data center A (DCA), and a "repeater" in data center B (DCB). That repeater would then grab content from DCA and feed all of the other nodes in DCB, saving on bandwidth.
Suppose you want to upgrade this setup to Solr 4.x and SolrCloud. (Note that Solr 4.x still supports Solr 3.x-style legacy replication.)
It's said that you should NOT have a single SolrCloud cluster span disparate data centers. So data center B should have its own SolrCloud.
One idea is to have the DCA -> DCB link still use Solr 3.x-style Master/Slave replication. The "repeater" in DCB, being also a SolrCloud node, would then automatically propagate the content to the other nodes.
Main question:
Can a Solr node participate in both Solr 3x-style master/slave mode (as a slave) and also be part of a SolrCloud cluster? And if so, how is this configured?
Complications:
In the simple case, if it's just 1 shard with replicas, it's easy to see how that might work in terms of data. It's a little less clear if you have multiple shards in DCB: how do I tell each shard to replicate only its own share of the data? Note that SolrCloud normally replicates via transactions, whereas 3.x ships binary indices.
Another complication is the replication itself: how do you tell just the master node for each shard to pull from the remote DCA node?
Alternatives:
One solution is to upgrade to 4.x but continue using 3.x-style replication in DCB, i.e. just don't use SolrCloud there.
I realize that another solution would be to have the data feed send its updates to both data centers, or use something like RabbitMQ. For the sake of this question, let's assume that's not an option (long story...).
Maybe there's some other way I haven't thought of?
Has anybody actually tried having SolrCloud span data centers? How horrible is it?
Somebody must have asked this question before!
But I've looked on Google and, although it finds tons of pages with the keywords, I haven't seen this specific "hybrid" mode fleshed out. I found one thread from 2013 but it didn't really talk about the configuration and complexity.
To answer your first question, a Solr slave in the 3.x style cannot be a node in a SolrCloud cluster. The reason is that the slave in a 3.x master/slave config simply replicates, byte for byte, all the index files on the master. That's all it does. In the repeater config it can then also be a master for others to replicate from, or be a dedicated query slave, or both. But that's it.
A node in a SolrCloud config is a full participant in a distributed computing cluster where indexing is generally intended to be distributed across all nodes, and all nodes participate in queries. It's a very powerful feature which automatically handles failed nodes and significantly eases the scaling work that was very manual in the 3.x style.
However, part of what you pay for that is increased complexity (Zookeeper), requirements for lower latency inter-node communications (because all the nodes now talk to each other and to Zookeeper) and the loss of the simplicity of Master/Slave replication.
At 20M docs you are well within the constraints of a single-node master index with an effectively unlimited number of slaves, and therefore very high query capacity. I do this today with a production environment where each master has on the order of 60M docs in it, with no significant problems.
The question is: do you need NRT, multi-node indexing, automated failover, or the ability to autoscale well past 100M docs? If so, then Master/Slave is probably not going to work for you.
You could take a look at writing the same data to two different SolrCloud clusters, one in each data center. You could do that directly, or use something like Apache Flume to do it for you; in either case there are some issues with doing this, so the real question is whether dealing with those issues is worth it to get the added benefits of SolrCloud.
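A minimal sketch of that dual-write idea, with two independent SolrCloud clusters, one per data center. It uses the newer SolrJ CloudSolrClient API (in Solr 4.x the class was CloudSolrServer), and all hostnames and the collection name are hypothetical:

    import java.util.Arrays;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class DualDataCenterWriter {
        private final CloudSolrClient dca;
        private final CloudSolrClient dcb;

        public DualDataCenterWriter() {
            // One ZooKeeper ensemble per data center.
            dca = new CloudSolrClient.Builder(
                    Arrays.asList("zk-dca-1:2181", "zk-dca-2:2181", "zk-dca-3:2181"),
                    Optional.empty()).build();
            dcb = new CloudSolrClient.Builder(
                    Arrays.asList("zk-dcb-1:2181", "zk-dcb-2:2181", "zk-dcb-3:2181"),
                    Optional.empty()).build();
        }

        public void index(SolrInputDocument doc) throws Exception {
            // Naive dual write: if the second add fails the clusters diverge,
            // which is exactly the kind of issue a queue (Flume, RabbitMQ, ...)
            // in front of both clusters is meant to absorb.
            dca.add("mycollection", doc);
            dcb.add("mycollection", doc);
        }
    }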

Solr Master Slave Failover setup for High Availability

While using Solr (we are currently using 3.5), how do we set up the Masters for failover?
Let's say in my setup I have two Masters and two Slaves. The application commits all writes to one active Master, and both slaves get their updates from this active Master. There is another repeater which serves the same purpose as the Master.
Now my question is: if the Master for some reason goes down, how can I make the Repeater the Master without any manual intervention? How can the slaves start getting their updates from the Repeater instead of the broken Master? Is there a recommended way to do this? Are there any other recommended Master/Slave setups to ensure high availability of the Solr systems?
At this time, your best option is probably to investigate the SolrCloud functionality present in the current Solr 4.0 alpha, which at the time of this writing is due for its final release within a few months. The goal of SolrCloud is to handle data distribution and master election, using the ZooKeeper distributed database to maintain consensus within the cluster about which nodes are serving in which roles.
There are other more traditional ways to set up failover for Solr 3's replicated master-slave architecture, but I personally wouldn't want to make that investment with Solr 4.0 so near to release.
Edit: See Linux-HA for one such traditional approach. Personally, I would create a purpose-built daemon that reconfigures your cores and load balancer, using ZooKeeper for presence detection and distributed locks.
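As a sketch of the presence-detection half of such a daemon, using plain ZooKeeper ephemeral znodes (the ensemble address and znode path are hypothetical, and the active master is assumed to register an EPHEMERAL znode for as long as it lives):

    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class FailoverDaemonSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, e -> { });

            Watcher onMasterGone = event -> {
                if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                    // The master's session expired or it crashed: this is where
                    // the daemon would promote the repeater to master and
                    // repoint the slaves / load balancer.
                    System.out.println("Master znode deleted - starting failover");
                }
            };

            // exists() both checks presence and registers a one-shot watch.
            if (zk.exists("/solr/master-alive", onMasterGone) == null) {
                System.out.println("No master registered yet");
            }

            // A real daemon would re-register the watch after every event and
            // take a distributed lock so only one daemon performs the promotion.
            Thread.sleep(Long.MAX_VALUE);
        }
    }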
If outsourcing is an option, you might consider a hosted service such as my own humble Websolr. We provide this kind of distribution and hot failover by default, so our customers don't have to worry as much about the mechanics of how it's implemented.
I agree with Nick. The way replication works in Solr 3.x is not always handy, especially for master failover. If you are going to consider Solr 4 you might want to have a look at Elasticsearch too, which solves this kind of problem in a really brilliant way!
It uses push replication instead of the pull mechanism used by Solr. That means the document is literally reindexed on all nodes. It might sound strange, but it reduces the network load (caused by segment merges, for example). Furthermore, a node is elected as master, and if it crashes another node will automatically replace it, becoming the new master.
