I set up the heapster + influxdb + grafana combination for my Minikube Kubernetes cluster. The Heapster metrics API documentation mentions filesystem metrics along with the CPU-, memory-, and network-related APIs. I can get CPU and memory metrics by using the Heapster API, but I am not able to access the filesystem metrics. Any help, guys?
I need to track the memory usage of my Linux operating system through a web application. In the application I want to display the memory utilization and the available free memory. Is there any API that I can make use of that gives me this status? How can this be implemented?
You could create a Spring Boot application and expose the memory usage data as a REST API.
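A minimal sketch of such an endpoint, assuming Spring Boot is on the classpath (the class name and route are illustrative, not an existing API); it reads host-level memory figures through the JDK's com.sun.management.OperatingSystemMXBean:

// Sketch only: class and route names are made up for illustration.
import java.lang.management.ManagementFactory;
import java.util.Map;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class MemoryApiApplication {

    @GetMapping("/api/memory")
    public Map<String, Long> memory() {
        // com.sun.management.OperatingSystemMXBean exposes physical memory on HotSpot JDKs;
        // on Linux you could also parse /proc/meminfo directly instead.
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        long total = os.getTotalPhysicalMemorySize();
        long free = os.getFreePhysicalMemorySize();
        return Map.of(
                "totalBytes", total,
                "freeBytes", free,
                "usedBytes", total - free);
    }

    public static void main(String[] args) {
        SpringApplication.run(MemoryApiApplication.class, args);
    }
}

Your web frontend could then poll /api/memory and render the utilization and free memory from the returned JSON.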
There are, though, better solutions like Prometheus and Grafana. You could take a look at
MONITORING LINUX HOST METRICS WITH THE NODE EXPORTER, which exposes metrics at http://localhost:9100/metrics. You could then consume them either through your React web app or through a Prometheus deployment. Grafana has really cool out-of-the-box dashboards that are worth taking into consideration.
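If you want your own backend to read a value straight from the node_exporter text output instead of going through Prometheus, a rough sketch could look like this (node_memory_MemAvailable_bytes is a standard node_exporter gauge; the rest is illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class NodeExporterClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9100/metrics"))
                .build();

        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // The exposition format is plain text: one "<name>{labels} <value>" pair per line.
        body.lines()
            .filter(line -> line.startsWith("node_memory_MemAvailable_bytes"))
            .forEach(System.out::println);
    }
}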
We are using Fargate to deploy our ECS containers. It gives us CPU and memory utilization dashboards. We need to capture heap usage patterns, GC, and thread usage info for ECS tasks. How could we collect this info and use CloudWatch to set alarms and a monitoring dashboard on it?
Thanks
We use Micrometer to collect our application metrics and JVM metrics.
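A rough sketch of wiring up Micrometer's built-in JVM binders; the SimpleMeterRegistry here is just a stand-in, and in a Fargate/ECS setup it would be swapped for a CloudWatch-backed registry (micrometer-registry-cloudwatch2) so heap, GC, and thread metrics land in CloudWatch for alarms and dashboards:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics;
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class JvmMetricsSetup {
    public static void main(String[] args) {
        // Stand-in registry; replace with a CloudWatchMeterRegistry to publish to CloudWatch.
        MeterRegistry registry = new SimpleMeterRegistry();

        // Heap and non-heap memory pools (jvm.memory.used, jvm.memory.max, ...).
        new JvmMemoryMetrics().bindTo(registry);
        // Garbage collection pauses and counts (jvm.gc.pause, ...).
        new JvmGcMetrics().bindTo(registry);
        // Live / daemon / peak thread counts (jvm.threads.live, ...).
        new JvmThreadMetrics().bindTo(registry);
    }
}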
The Flink Web UI has a brilliant backpressure section. But I cannot see any metrics exposed by the Prometheus reporter that could be used to detect backpressure in the same way for a Grafana dashboard.
Is there some way to get the same metrics outside of the Flink Web UI, using the metrics described here: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html? Or even having a Prometheus scraper scraping the web API?
The back pressure monitoring that appears in the Flink dashboard isn't using the metrics system, so those values aren't available via a MetricsReporter. But you can access this info via the REST API at
/jobs/:jobid/vertices/:vertexid/backpressure
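A quick sketch of hitting that endpoint from Java (the JobManager address and the job/vertex IDs are placeholders; the response is a JSON document describing the sampled back-pressure status per subtask):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BackpressureCheck {
    public static void main(String[] args) throws Exception {
        // Placeholders: take the IDs from /jobs and /jobs/:jobid on your JobManager.
        String jobId = "your-job-id";
        String vertexId = "your-vertex-id";
        String url = "http://localhost:8081/jobs/" + jobId
                + "/vertices/" + vertexId + "/backpressure";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder().uri(URI.create(url)).build();

        // Print the JSON response as-is; a scraper could parse it and feed Grafana instead.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}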
While this back pressure detection mechanism is useful, it does have its limitations. It works by calling Thread.getStackTrace(), which is expensive, and some operators (such as AsyncFunction) do critical activities in threads that aren't being sampled.
Another way to investigate back pressure is to set this configuration option in flink-conf.yaml:
taskmanager.network.detailed-metrics: true
and then you can look at the metrics measuring inbound/outbound network queue lengths.
The Flink documentation suggests that Ceph can be used as persistent storage for state: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/checkpointing.html
Considering that Ceph is a transactional database, wouldn't it have an adverse effect on Flink's performance?
Ceph describes itself as a "unified, distributed storage system" and provides a network file system API. As such, it should work seamlessly with Flink's state backends that persist checkpoints to a remote file system.
I'm not aware of people using Ceph (HDFS and S3 are more commonly used) and have no information about its performance. However, note that Flink is able to write checkpoints asynchronously, so that the performance of the storage system does not affect the processing speed of a Flink application. It might, however, constrain the interval at which checkpoints are taken.
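As a rough sketch (the URI and bucket name are placeholders, and it assumes the matching file system plugin is on the classpath), pointing a file-system state backend at Ceph looks like pointing it at any other remote file system, e.g. via Ceph's S3-compatible gateway or a CephFS mount, with asynchronous snapshots enabled so checkpointing does not block processing:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CephCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // Placeholder URI: with Ceph this would typically be the S3 gateway (s3://...)
        // or a CephFS mount exposed as a file:// path.
        // The second argument enables asynchronous snapshots, so writing the checkpoint
        // does not stall record processing.
        env.setStateBackend(new FsStateBackend("s3://flink-checkpoints/my-job", true));

        // ... build and execute the job ...
    }
}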
Update:
(Feb. 2018) I noticed that multiple users reported on Flink's user mailing list that they are using Ceph with Flink.
Update 2:
Flink works fine with the S3 protocol, and both of Flink's S3 FileSystem plugins (Presto and Hadoop) work fine with it.
Apart from Amazon Elastic MapReduce, what other options do I have to process a large amount of data?
Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP; however, you can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/
The Developer Preview for the Apache Hadoop- based Services for Windows Azure is available by invitation.
Besides that, you can also try Google BigQuery, in which case you will have to move your data to Google's proprietary storage first and then run BigQuery on it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster due to its column-based processing.
Another option is to use Mortar Data; they have used Python and Pig intelligently to let you write jobs easily and visualize the results. I found it very interesting, please have a look:
http://mortardata.com/#!/how_it_works
DataStax Brisk is good.
Full-on distributions
Apache Hadoop
Cloudera’s Distribution including Apache Hadoop (that’s the official name)
IBM Distribution of Apache Hadoop
DataStax Brisk
Amazon Elastic MapReduce
HDFS alternatives
MapR
Appistry CloudIQ Storage Hadoop Edition
IBM Global Parallel File System (GPFS)
CloudStore
Hadoop MapReduce alternatives
Pervasive DataRush
Cascading
Hive (an Apache subproject, included in Cloudera’s distribution)
Pig (a Yahoo-developed language, included in Cloudera’s distribution)
Refer to: http://gigaom.com/cloud/as-big-data-takes-off-the-hadoop-wars-begin/
If you want to process large amounts of data in real time (Twitter feeds, click streams from a website, etc.) using a cluster of machines, then check out Storm, which was recently open-sourced by Twitter.
Standard Apache Hadoop is good for batch processing of petabytes of data where latency is not a problem.
Brisk from DataStax, as mentioned above, is quite unique in that you can use MapReduce parallel processing on live data.
There are other efforts like Hadoop Online, which allows processing via pipelining.
Google BigQuery is obviously another option, where you have CSV (delimited records) and you can slice and dice without any setup. It's extremely simple to use, but it is a premium service where you pay by the number of bytes processed (the first 100 GB per month is free, though).
If you want to stay in the cloud, you can also spin up EC2 instances to create a permanent Hadoop cluster. Cloudera has plenty of resources about setting up such a cluster here.
However, this option is less cost-effective than Amazon Elastic MapReduce, unless you have lots of jobs to run throughout the day, keeping your cluster fairly busy.
The other option is to build your own cluster. One of the nice features of Hadoop is that you can cobble together heterogeneous hardware into a cluster with decent computing power, the kind that can live in a rack in your server room. Considering that the older hardware lying around is already paid for, the only costs of getting such a cluster going are new drives and perhaps enough memory sticks to maximize the capacity of those boxes. The cost-effectiveness of such an approach is then much better than Amazon's. The only caveat is whether you have the bandwidth necessary for pulling all the data down into the cluster's HDFS on a regular basis.
Google App Engine does MapReduce as well (at least the map part for now). http://code.google.com/p/appengine-mapreduce/