statsd architecture for a distributed system

I am studying how to use the graphite - statsd - collectd stack to monitor a distributed system.
I have tested the components (graphite-web, carbon, whisper, statsd, collectd and grafana) in a local instance.
However, I'm confused about how I should distribute these components in a distributed system:
- A monitor node with graphite-web (and grafana), carbon and whisper.
- In every worker node: statsd and collectd sending data to the carbon backend in the remote monitor node.
Is this scheme right? How should I configure statsd and collectd to get acceptable network usage (tcp/udp, packets per second...)?

Assuming you have a relatively light workload, having a node that manages graphite-web, grafana, and carbon (which itself manages the whisper database) should be fine.
Then you should have a separate node for your statsd. Each of your machines/applications should have statsd client code that sends your metrics to this statsd node. This statsd node should then forward these metrics onto your carbon node.
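To make the "statsd client code" part concrete, here is a minimal sketch of what such a client does on each worker. It assumes the plain StatsD text format (name:value|type) sent over UDP to the usual default port 8125. The function names and metric names are made up for illustration; in practice you would probably use an existing client library.

```python
import socket


def format_metric(name: str, value: float, metric_type: str = "c") -> bytes:
    """Encode one metric in the StatsD line format: <name>:<value>|<type>.

    Common types are "c" (counter), "g" (gauge) and "ms" (timer).
    """
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(payload: bytes, host: str, port: int = 8125) -> None:
    """Send one metric as a single fire-and-forget UDP datagram.

    host is the address of the statsd node (an assumption of this sketch).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, send_metric(format_metric("worker1.requests", 1), "127.0.0.1") sends one counter increment; replace the address with your statsd node. Because UDP is fire-and-forget, a worker never blocks waiting on the monitoring node.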
For larger workloads that stress a single node, you'll need to either scale vertically (get a more powerful node to host your carbon/statsd instances), or start clustering those services.
Carbon clusters tend to use some kind of relay: you send your metrics to the relay, and it forwards them to the cluster (usually using consistent hashing). You could use a similar setup to consistently hash metrics to a cluster of statsd servers.
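As a rough illustration of the consistent hashing idea (this is not carbon's or statsd's actual relay code), here is a minimal hash ring sketch. The class name, the node names and the choice of SHA-256 with 100 virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual points per node."""

    def __init__(self, nodes, points_per_node=100):
        # Each node is placed on the ring many times to spread the load evenly.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, metric_name: str) -> str:
        """Return the node owning the first ring point after the metric's hash."""
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect(self._keys, self._hash(metric_name)) % len(self._ring)
        return self._ring[idx][1]
```

HashRing(["statsd-a", "statsd-b", "statsd-c"]).node_for("app.requests") returns the same node every time for the same metric name, so all samples of one metric are aggregated in one place. If a node is removed, only the metrics that lived on it are remapped.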

Related

Does Solr cloud needs a load balancer e.g. HAPROXY in master failure

I have searched a lot but unfortunately still have some simple confusion about Solr Cloud. Let's say I have three systems where SolrCloud is configured (1 master and 2 slaves), with an external Zookeeper on the same three machines to make a quorum. The system names are
master
slave1
slave2
Public-Front
Public-Front is the system where I have configured HAPROXY. It receives requests from the WWW and then sends them to a backend server depending on ACLs.
According to my understanding, if I send a request to a Solr collection (i.e., to master), it routes it to the slaves and hence the load is balanced. There is no need to specify the slaves here. Is that right?
Now in Public-Front, should I configure each Solr system as a separate backend to load balance, or just the master system?
If I only configure the master system as solr-server in HAPROXY and solr-server (master) goes down, then I think I cannot get service from Solr through HAPROXY (although the slaves are still up, they are not configured in HAPROXY).
Where am I wrong and what is the best approach?
There is no traditional master or slave in Solr Cloud - there is a set of replicas, one of which is defined as the leader. The leader selection is automagic - i.e. the first replica that says it wants to be the leader receives that status. This state is kept per collection. In your example there are three replicas, one of which is designated as the leader. If that replica disappears, one of the two remaining replicas becomes the new leader, and everything continues as normal. The role of the leader is to be the up-to-date version of the index and handle any updates - first to its own index, then by routing those updates to the other replicas.
There are also several types of replicas, and not all of them are suited to be promoted to a leader - but in the default configuration they can be.
Here's the thing - since there isn't really a master, all three indexes contain the same data and they all are replicas of the same shard, the request won't have to be routed through the master. If you're using a dumb haproxy, you can safely spread the requests across all three nodes and they should be able to answer the query without contacting any other nodes (as long as they all contain all the shards of the collection).
However, if you're using SolrJ or another Zookeeper capable client (and using the Zookeeper compatible client), the client will keep in touch with Zookeeper instead, and read the state information for your cluster. That allows the client to know which servers are currently replicas for your collection, and contact any of those nodes that it can decide have the required information for your query. In your case the result will be the same, except that your client will know not to connect to any nodes that disappear and will automagically know about nodes that are added to the cluster.
The "one Solr node routing requests to a different node" is only relevant if the node you're contacting doesn't have any replicas for the collection you're querying - i.e. it'll have to contact a different node to fetch that content. In that case an inter cluster request will happen and the load on the cluster will be slightly higher than necessary. When the collection is replicated to all three nodes - or when you're using SolrJ, that inter cluster request should not happen.
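As a sketch of what a "dumb" load balancer does in this setup: it rotates over the nodes and skips any that are marked as down. This is only an illustration with made-up names (RoundRobinPool, solr1, ...), not how haproxy or SolrJ is implemented, and the health checking that would call mark_down is left out.

```python
from itertools import cycle


class RoundRobinPool:
    """Rotate over a fixed list of nodes, skipping nodes marked as down."""

    def __init__(self, nodes):
        self._nodes = list(nodes)
        self._down = set()
        self._cycle = cycle(self._nodes)

    def mark_down(self, node):
        self._down.add(node)

    def mark_up(self, node):
        self._down.discard(node)

    def next_node(self):
        # Try each node at most once per call before giving up.
        for _ in range(len(self._nodes)):
            node = next(self._cycle)
            if node not in self._down:
                return node
        raise RuntimeError("no Solr node available")
```

Since every node holds a replica of the collection, any node returned by next_node() can answer the query on its own; losing one node only shrinks the rotation.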

AWS Elasticache Redis failover

I am using Redis on ElastiCache for a Node application and today the node went down which means our app stopped working. It took 20 minutes for a new node to be provisioned.
From reading the documentation it seems I can set up a cluster which automatically promotes a slave to primary in case of a failure. The big gotcha seems to be you have to set your client to write to the primary node and read from the slave nodes.
This means in the case of a failure, you have to reconfigure your app to point to the newly created 'read' nodes. It also takes a few minutes for a slave to be promoted to primary.
Is there no way to set this up so if the primary fails, a slave will automatically take over for read/write operations?
I'm not storing much data in redis and the read/write volume is low, but it is required to run the app (live video sessions!).
If I can't have a seamless failover in redis, is there something I can use which provides this functionality? I'm hoping I don't have to move to a traditional DBMS as everything works perfectly but I need to be able to handle failure well.
Thanks
Multi-AZ deployments should automatically switch over with minimal downtime. Once you have created one of these instances, you will get an endpoint for the cluster. Amazon will point that DNS entry to the proper failover node and handle the promotion of a slave if the master instance dies.

Do I need httpd to have single endpoint to Infinispan cluster

I have a RedHat DataGrid cluster with two nodes on different servers and I use it from a Camel route. So, when I define the endpoint to the cache, I set the host of one of the nodes, e.g.:
<to uri="infinispan://node1.some.com:11222" />
The DataGrid cluster works fine in terms of caches: they are replicated, distributed etc.
But if node1 is down then I have no connection to the cache.
So the question is:
Do I need to put httpd with mod_cluster in front as a load balancer, or is there a way to set up a cluster-level cache endpoint so that I don't have to care which node is up and how many nodes there are?
BTW: I tried to find an answer, but have not found a clear one so far.
Thanks.
The Hot Rod protocol automatically receives server topology information (i.e. joiners / leavers) as they happen. The connection string specifies the initial hosts, i.e. those that the client will attempt to connect to initially. As long as one of those is up and running, the clients will be able to talk to the whole cluster. To specify multiple initial hosts separate them with semicolons: host1:port1;host2:port2;...

Should zookeeper be run on the worker machines or independent machines?

We have several kinds of software that use zookeeper like Solr, Storm, Kafka, Hbase etc.
There are 2 options to install a zookeeper cluster (more than one node):
Embedded cluster: Install ZK on some of the same machines where the other software is installed OR
External cluster: Have a few not very powerful but dedicated zookeeper machines (in the same region, cloud and data-center though) to run zookeeper on.
Which is the better option for cluster stability? Note that in both cases, we always have an odd number of machines in our zookeeper cluster and not just one machine.
It appears that the embedded option is easier to set up and is a better use of the machines, but the external option seems more stable because the loss of a single machine means the loss of just one component (loss of a machine with embedded zookeeper means loss of a zookeeper node as well as the worker node of Solr, Storm, Kafka, whatever the case may be).
What is the industry standard to run zookeepers in production for maximum stability?
Zookeeper is a critical component of a Kafka cluster, but since the new generation of clients was introduced, the load on ZK has been greatly reduced and ZK is now only used by the cluster itself. Even though the load is usually not very high, ZK can be sensitive to latency, so the best practice is to run a Zookeeper ensemble on dedicated machines and ideally even use dedicated disks for the ZK transaction logs to avoid IO contention.
Larger Zookeeper ensembles give you more resiliency, but they also increase communication within the cluster, so you could lose some performance. Since Zookeeper works with simple majority voting, you need an odd number of nodes for it to make sense. A 3-node ensemble allows losing 1 node without impact, a 5-node ensemble allows losing 2 nodes, and so on.
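The majority arithmetic behind those numbers can be written out as a small sketch (the function names are made up for the example). It also shows why an even-sized ensemble buys nothing: 4 nodes tolerate no more failures than 3.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of nodes that can be lost while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 6):
    print(f"{n} nodes: quorum {quorum_size(n)}, can lose {tolerated_failures(n)}")
# 3 nodes: quorum 2, can lose 1
# 4 nodes: quorum 3, can lose 1
# 5 nodes: quorum 3, can lose 2
# 6 nodes: quorum 4, can lose 2
```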
In practice, I've seen small, low-workload clusters run very well with Zookeeper installed on the same machines as the Kafka nodes, but if you aim for maximum stability and have increasing traffic, separate clusters are recommended.
You should consider yourself discouraged from using internal ZooKeeper in production.
It's good to have an external zookeeper, and best to have a Zookeeper ensemble (two or more nodes).
If you have only one zookeeper node, it will create problems when it goes down.
If you have a cluster of zookeeper nodes and one of them goes down, the cluster will continue to work as long as a majority of the nodes are still running.
For SolrCloud, we strongly recommend that Zookeeper is external, and that you have at least three of them.
This does NOT mean that it cannot run on the same servers as Solr, but it DOES mean that you should NOT use the zookeeper server that Solr itself can start, embedded within itself.
Here's some information related to performance and SolrCloud that touches on zookeeper:
https://wiki.apache.org/solr/SolrPerformanceProblems#SolrCloud
Whether or not you need completely separate machines, or even separate disks for the zookeeper database when running on the same machine as Solr, is VERY dependent on the characteristics of your SolrCloud install. If your index is very small and your query load is low, it's possible that you can put zookeeper on the same machines and even the same disks.
For the other services you mentioned, I have no idea what the recommendation is.

How to setup Solr Cloud with two search servers?

Hi, I'm developing a rails project with sunspot solr and configuring Solr Cloud.
My environment: rails 3.2.1, ruby 2.1.2, sunspot 2.1.0, Solr 4.1.6.
Why SolrCloud: I need a more stable system - the search server often goes down for maintenance and the web application stops working in production. So I am thinking about how to run 2 identical search servers instead of one to make the system more stable: if one server is down, the other will continue working.
I cannot find any good tutorial that is simple, easy to understand and described in detail...
I'm trying to set up SolrCloud on two servers, but I do not fully understand how it is working inside:
synchronize data between two servers (is it automatic action?)
balances search requests between two servers
when one server suddenly stops working, the other should become the master (is it an automatic action?)
is there SolrCloud features other than listed?
Read more about SolrCloud here: https://wiki.apache.org/solr/SolrCloud
A couple of inputs from my experience.
If your application just reads data from Solr and does not write to Solr in real time (but you index using an ETL or so), then you can just go for a Master-Slave hierarchy.
Define one Master: point all writes here. If this master is down, you will no longer be able to index data.
Create 2 (or more) Slaves: this is a feature of Solr, and it will take care of synchronizing data from the master based on the interval you specify (say every 20 seconds).
Create a load balancer over the slaves and point your application to read data from the load balancer.
Pros:
With the above setup, you don't have high availability for the Master (data writes), but you will have high availability for reads until the last slave goes down.
Cons:
Assume one slave went down and you brought it back after an hour: this slave will be behind the other slaves by one hour. So it's a manual task to check for data consistency with the other slaves before adding it back to the ELB.
How about SolrCloud?
No Master here, so you can achieve high availability for writes too.
No need to worry about the data inconsistency I described above; the SolrCloud architecture will take care of that.
What suits you best:
1. Define an external Zookeeper with a 3-node quorum.
2. Define at least 2 Solr servers.
3. Split your current index into 2 shards (by default, one shard will reside on each of the 2 Solr nodes defined in step #2).
4. Set the replication factor to 2 (this creates a replica of each shard on each node).
5. Define an LB that points to the above Solr nodes.
6. Point your Solr indexing input as well as your application to this LB.
With the above setup, you can sustain the failure of either node.
Let me know if you need more info on this.
Regards,
Aneesh N
-Let us learn together.

Resources