Mesos as a giant Linux box - distributed

Is it possible to see all the mesos resources as a giant Linux box without custom code for the framework?
I am wondering: if I want to run a program using 2500 TB of RAM, can Mesos abstract the master/slave architecture away? Do I have to write custom code?

You have to write custom code. Mesos offers resources on a per-agent (slave) basis, and it is up to you to coordinate the binaries of your app running on different machines.

Q1: Mesos is a resource manager. Yes, it's a giant pool of resources, although at any given time it will offer you only a subset of all resources, since other users might need some of them (don't worry, there's a way to utilize almost the whole cluster).
Q2: Mesos is designed for commodity hardware (many nodes, not a single giant HPC computer). A framework running on Mesos is given a list of resources (and slaves - worker nodes), and Mesos executes a task within the bounds of those resources. This way you can start an MPI job or run a task on top of Apache Spark, which will handle the communication between nodes for you (Mesos itself will not).
Q3: You haven't specified what kind of task you'd like to compute. Spark comes with quite a few examples; you can run any of those without writing your own code.
(Image credits: Malte Schwarzkopf, Google talk EuroSys 2013 in Prague)

Related

Flink JobManager or TaskManager instances

I have a few questions about the Flink stream processing framework. Please let me know your thoughts on them.
Let's say I build a cluster with n nodes, of which m nodes are job managers (for HA); are the remaining (n-m) nodes then the task managers?
If each node has n cores, how can we control/use a specific number of cores per task manager/job manager?
If we add a new node as a task manager, does the job manager automatically assign tasks to the newly added task manager?
Does Flink have a concept of partitions and data skew?
If Flink connects to Pulsar and needs to read data from a partitioned topic, what is the parallelism here? (Is the parallelism equal to the number of partitions, or does it depend entirely on the Flink task managers' number of task slots?)
Does Flink have any built-in optimization of the job graph? (For example, my job graph has many filter, map, flatMap, etc. operators.) Can you suggest any docs/materials on Flink job optimization?
Do we have any option to dedicate one core to Prometheus metrics scraping?
Yes
Configuring the number of slots per TM: https://nightlies.apache.org/flink/flink-docs-stable/docs/concepts/flink-architecture/#task-slots-and-resources although each operator runs in its own thread and you have no control on which core they run, so you don't really have a fine-grained control of how cores are used. Configuring resource groups also allows you to distribute operators across slots: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/#task-chaining-and-resource-groups
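For example, the per-TaskManager slot count is a setting in flink-conf.yaml (4 is just an illustrative value):

```yaml
# flink-conf.yaml
taskmanager.numberOfTaskSlots: 4
```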
Not for currently running jobs; you'd need to rescale them. New jobs will use it, though.
Yes. https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/sources/
It will depend on the Flink source parallelism.
It automatically optimizes the graph as it sees fit. You have some control through rescaling and chaining/splitting operators: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/overview/ (towards the end). As a rule of thumb, I would start by deploying a full job per slot and then, once you properly understand where the bottlenecks are, try to optimize the graph. Most of the time it is not worth it due to the increased serialization and shuffling of data.
You can export Prometheus metrics, but not have a core dedicated to it: https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/metric_reporters/#prometheus
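Enabling the Prometheus reporter is also a flink-conf.yaml change; a sketch per the metric reporters docs (key names vary by Flink version — older releases use metrics.reporter.prom.class instead of factory.class, and the port range is an example):

```yaml
# flink-conf.yaml -- expose metrics for Prometheus to scrape
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9250-9260
```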

How to control Flink jobs to be distributed/load-balanced properly amongst task-managers in a cluster?

How can Flink jobs be distributed/load-balanced properly (evenly, or some other way where we can set threshold limits on free slots, physical memory, CPU cores, JVM heap size, etc.) amongst the task managers in a cluster?
For example, I have 3 task-managers in a cluster where one task-manager is heavily loaded even though there are many Free Slots and other resources are available in other task-managers in a cluster.
So if a particular task manager is heavily loaded, it may cause many problems, e.g. memory issues, heap issues, high back-pressure, Kafka lag (which may slow down the source and sink operations), etc., which could lead a container to restart many times.
Note: I may not have mentioned all the possible issues caused by this limitation, but in general, distributed systems should not have such limitations.
It sounds like cluster.evenly-spread-out-slots is the option you're looking for. See the docs. With this option set to true, Flink will try to always use slots from the least used TM when there aren’t any other preferences. In other words, sources will be placed in the least used TM, and then the rest of the topology will follow (consumers will try to be co-located with their producers, to keep communication local).
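The option is a cluster-level setting in flink-conf.yaml:

```yaml
# flink-conf.yaml -- prefer slots on the least-used TaskManager
cluster.evenly-spread-out-slots: true
```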
This option is only going to be helpful if you have a static set of TMs (e.g., a standalone cluster, rather than a cluster which is dynamically starting and stopping TMs as needed).
For what it's worth, in many ways per-job (or application mode) clusters are easier to manage than session clusters.

Flink: What does it mean to embed flink on other programs?

What does it mean to embed flink on other programs?
The second paragraph of https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/api_concepts.html#basic-api-concepts says Flink can be embedded in other programs.
I would like to know more about this. Like how to achieve it. A sample program would be very helpful.
Using the above is it possible to achieve the following?
Can we run flink programs as individual Actors?
Can we route data between two flink programs?
Reason: I am asking the above two questions because my requirement is as below
I have a set of Flink jobs/programs. Based on a config file, I want only certain jobs/programs to process the input data, and this keeps changing as the config file changes. So the Flink jobs/programs (or the code in those jobs) need to be always available, and they need to pass data and communicate.
Kindly share your insights.
Running Flink embedded in other programs refers to Flink's local execution mode, which runs a Flink program inside your JVM. This entails that the job won't be executed in a distributed fashion.
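A minimal sketch of local (embedded) execution, assuming the Flink streaming API is on the classpath (the class name and job name here are arbitrary):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EmbeddedFlinkExample {
    public static void main(String[] args) throws Exception {
        // Create a local environment: the job runs inside this JVM,
        // not on a remote cluster.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment();

        env.fromElements(1, 2, 3)
           .map(x -> x * 2)
           .print();

        env.execute("embedded-job");
    }
}
```

Calling createLocalEnvironment() instead of getExecutionEnvironment() is what pins execution to the embedding JVM.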
What is currently not possible out of the box is to let Flink jobs control other Flink jobs. However, it is possible to build a Flink application which takes as input job descriptions and executes them. RBEA is an example of such a Flink application. The conceptual difference is that you don't have multiple Flink jobs but a single one which processes programs as input records.
Alternatively, you could take a look at Stateful functions which is a virtual actor framework built on top of Apache Flink. The idea is to provide a framework for building distributed stateful applications with strong consistency guarantees. With stateful functions, you would also build a single Flink application which processes events which could represent a form of computation.

How to stop mesos from offering resources to a framework?

I have a use case with 20-30 frameworks running on a Mesos cluster of over 200 nodes. A lot of the time, Mesos offers resources to frameworks that do not want any offers at all, while offering few resources to frameworks that actually need them.
I know there's a function requestResources that a framework can call to ask for resources. However, I couldn't find a function that a framework can use to tell Mesos to stop sending it offers. Is there a way to do that? My frameworks keep getting offers every 100 milliseconds, which is far too frequent!
When you declineOffer, you can set an optional Filters message with refuse_seconds longer than the default 5s. This means that after you decline an offer from a node, Mesos will not offer those resources back to your framework for refuse_seconds.
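A sketch of what that looks like with the Java scheduler API (assumes the usual org.apache.mesos bindings; 300 seconds is an arbitrary example value):

```java
import java.util.List;

import org.apache.mesos.Protos.Filters;
import org.apache.mesos.Protos.Offer;
import org.apache.mesos.SchedulerDriver;

public class OfferDecliner {
    // Ask Mesos not to re-offer declined resources for 300 seconds
    // instead of the default 5.
    static final Filters LONG_REFUSAL =
            Filters.newBuilder().setRefuseSeconds(300).build();

    // Call this from your Scheduler's resourceOffers() callback.
    public static void declineAll(SchedulerDriver driver, List<Offer> offers) {
        for (Offer offer : offers) {
            driver.declineOffer(offer.getId(), LONG_REFUSAL);
        }
    }
}
```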
Alternatively, if your framework temporarily doesn't want any offers from any nodes, it can call driver.stop(true), and the scheduler will unregister from Mesos but its tasks will keep running for FrameworkInfo.failover_timeout. Once the framework has work to do, it can start/run the driver again to start getting offers again.
(FYI, requestResources doesn't actually do anything yet.)

Running C simulations on EC2

I've been running parallel, independent simulations on an SGE cluster and would like to transition to using EC2. I've been looking at the documentation for StarCluster but, due to my inexperience, I'm still missing a few things.
1) My code is written in C and uses GSL -- do I need to install GSL on the virtual machines and compile there, or can I precompile the code? Are there any tutorials that cover this exact usage of EC2?
2) I need to run maybe 10,000 CPU hours of code, but I could easily set this up as many short instances or fewer, longer jobs. Given these requirements, is EC2 really the best choice? If so, is StarCluster the best interface for my needs?
Thanks very much.
You can create an AMI (basically an image of your virtual machine) with all your dependencies installed. Then all you need to do is configure job specific parameters at launch.
You can run as many instances as you want on EC2. You may be able to take advantage of spot instances to save money (you just need to tolerate the instance getting shut down if the spot price exceeds your bid).
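As a sketch, assuming an Amazon Linux instance and placeholder IDs/names, the build-then-snapshot flow might look like:

```shell
# On a running instance: install GSL and build the simulation binary
sudo yum install -y gcc gsl-devel
gcc -O2 -o sim sim.c -lgsl -lgslcblas -lm

# From your workstation: snapshot the configured instance as an AMI
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "gsl-sim-worker" \
    --description "C simulation environment with GSL preinstalled"
```

New instances launched from that AMI then come up with GSL and the compiled binary already in place.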
