Kubernetes cpu requests/limits in heterogeneous cluster

Kubernetes allows you to specify the cpu limit and/or request for a Pod.
Limits and requests for CPU resources are measured in cpu units. One cpu, in Kubernetes, is equivalent to:
1 AWS vCPU
1 GCP Core
1 Azure vCore
1 IBM vCPU
1 Hyperthread on a bare-metal Intel processor with Hyperthreading
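For reference, cpu values can be fractional; a minimal container resources stanza, with illustrative values:

resources:
  requests:
    cpu: 500m      # half of one cpu unit
  limits:
    cpu: "1"       # one full cpu unit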
Unfortunately, when using a heterogeneous cluster (for instance, one with different processors), the appropriate cpu limit/request depends on the node to which the Pod is assigned; this matters especially for real-time applications.
If we assume that:
we can compute a fine-tuned cpu limit for the Pod for each kind of hardware in the cluster
we want to let the Kubernetes scheduler choose a matching node in the whole cluster
Is there a way to launch the Pod so that its cpu limit/request depends on the node chosen by the Kubernetes scheduler (or on a Node label)?
The obtained behavior should be (or equivalent to):
Before assigning the Pod, the scheduler chooses the node by checking different cpu requests depending on the Node (or a Node label)
At runtime, the kubelet enforces a specific cpu limit depending on the Node (or a Node label)

No, you can't have different requests per node type. What you can do is create a pod manifest with a node affinity for a specific kind of node, and requests that make sense for that node type. For a second kind of node, you will need a second pod manifest with requests that make sense for it. These pod manifests will differ only in their affinity spec and resource requests, so it would be handy to parameterize them. You could do this with Helm, or write a simple script to do it.
This approach would let you launch a pod within a subset of your nodes with resource requests that make sense on those nodes, but there's no way to globally adjust its requests/limits based on where it ends up.
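For illustration, here is a minimal Helm template sketch of that idea; the hardware-type label, chart values and image are hypothetical, not something from the question:

apiVersion: v1
kind: Pod
metadata:
  name: myapp-{{ .Values.nodeType }}
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: hardware-type                # hypothetical label applied to your nodes
            operator: In
            values:
            - "{{ .Values.nodeType }}"
  containers:
  - name: app
    image: myapp:latest                       # hypothetical image
    resources:
      requests:
        cpu: "{{ .Values.cpuRequest }}"       # the fine-tuned value for this hardware type
      limits:
        cpu: "{{ .Values.cpuLimit }}"

You would then install it once per node type, e.g. helm install myapp-fast ./chart --set nodeType=fast --set cpuRequest=500m --set cpuLimit=1.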

Before assigning the Pod, the scheduler chooses the node by checking different cpu requests depending on the Node (or a Node label)
Not with the default scheduler; the closest option is using node affinity, as Marcin suggested, so you can pick the node based on a node label. Like below:
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: podname
    image: k8s.gcr.io/pause:2.0
In this case, you would tag the Nodes with labels to identify their type or purpose, e.g. db, cache, web and so on. Then you set the affinity to match the node types you expect.
requiredDuringSchedulingIgnoredDuringExecution means the pod won't be scheduled on a node if the conditions are not met.
preferredDuringSchedulingIgnoredDuringExecution means the scheduler will try to find a node that also matches that condition, but will schedule the pod anywhere possible if no nodes fit the condition specified.
Your other alternative is writing a custom scheduler:
apiVersion: v1
kind: Pod
metadata:
  name: annotation-default-scheduler
  labels:
    name: multischeduler-example
spec:
  schedulerName: default-scheduler
  containers:
  - name: pod-with-default-annotation-container
    image: k8s.gcr.io/pause:2.0
Kubernetes ships with a default scheduler that is described here. If the default scheduler does not suit your needs, you can implement your own scheduler. This way you can write complex scheduling logic to decide where each Pod should go; it is only recommended for requirements that are not possible with the default scheduler.
Keep in mind that the scheduler is one of the most important components in Kubernetes; the default scheduler is battle-tested and flexible enough to handle most applications. Writing your own scheduler means losing the features provided by the default one, like load balancing, policies and filtering. To know more about the features provided by the default scheduler, check the docs here.
If you are willing to take the risks and want to write a custom scheduler, take a look at the docs here.
At runtime, the kubelet enforces a specific cpu limit depending on the Node (or a Node label)
Before a pod is allocated, the scheduler checks for resource availability on the nodes and then assigns the pod to a node.
Each node has its own kubelet, which watches for pods assigned to that node. The only thing the kubelet does is start those pods; it does not decide which node a pod should go to.
The kubelet also checks for resources before initializing a Pod. If the kubelet can't initialize the pod, the pod will simply fail, and the scheduler will take action to schedule it elsewhere.

Related

K8s which access mode should be used in a persistentvolumeclaim for a database deployment?

I want to store the data from a PostgreSQL database in a persistentvolumeclaim.
(on a managed Kubernetes cluster on Microsoft Azure)
And I am not sure which access mode to choose.
Looking at the available access modes:
ReadWriteOnce
ReadOnlyMany
ReadWriteMany
ReadWriteOncePod
I would say, I should choose either ReadWriteOnce or ReadWriteMany.
Thinking about the fact that I might want to migrate the database pod to another node pool at some point, I would intuitively choose ReadWriteMany.
Is there any disadvantage if I choose ReadWriteMany instead of ReadWriteOnce?
You are correct about the migration case: there the access mode should be set to ReadWriteMany.
Generally, if you have access mode ReadWriteOnce on a multi-node cluster on Microsoft Azure and multiple pods need to access the database, Kubernetes will force all of those pods to be scheduled on the node that mounts the volume first. Your node can become overloaded with pods. And if you have a DaemonSet, where one pod is scheduled on each node, this poses a problem. In these scenarios, you are best off setting the PVC and PV access mode to ReadWriteMany.
Therefore:
if you want multiple pods, scheduled on multiple nodes, to have read and write access to the DB, use access mode ReadWriteMany
if you logically need to have the pods/DB on one node and know for sure that you will keep that logic on the one node, use access mode ReadWriteOnce
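For reference, the access mode is declared on the claim; a minimal PVC sketch (the name and size are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
  - ReadWriteMany        # or ReadWriteOnce, per the trade-off above
  resources:
    requests:
      storage: 10Gi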
You should choose ReadWriteOnce.
I'm a little more familiar with AWS, so I'll use it as a motivating example. In AWS, the easiest kind of persistent volume to get is backed by an Amazon Elastic Block Storage (EBS) volume. This can be attached to only one node at a time, which is the ReadWriteOnce semantics; but, if nothing is currently using the volume, it can be detached and reattached to another node, and the cluster knows how to do this.
Meanwhile, in the case of PostgreSQL database storage (and most other database storage), only one process can be using the physical storage at a time, whether on one node or several. In the best case, a second copy of the database pointing at the same storage will fail to start up; in the worst case, you'll corrupt the data.
So:
It never makes sense to have the volume attached to more than one pod at a time
So it never makes sense to have the volume attached to more than one node at a time
And ReadWriteOnce volumes are very easy to come by, but ReadWriteMany may not be available by default
This logic probably applies to most use cases, particularly in a cloud environment, where you'll also have your cloud provider's native storage system available (AWS S3 buckets, for example). Sharing files between processes is fraught with peril, especially across multiple nodes. I'd almost always pick ReadWriteOnce absent a really specific need to use something else.

What is the interval of kube-controller-manager control loop?

I see this in the Kubernetes docs,
In Kubernetes, controllers are control loops that watch the state of
your cluster, then make or request changes where needed. Each
controller tries to move the current cluster state closer to the
desired state.
Also this,
The Deployment controller and Job controller are examples of
controllers that come as part of Kubernetes itself ("built-in"
controllers).
But I couldn't find how the control loop works. Does it check the current state of the cluster every few seconds? If so, what is the default value?
I also found something interesting here,
What is the deployment controller sync period for kube-controller-manager?
I would like to start by explaining that the kube-controller-manager is a collection of individual control processes tied together to reduce complexity.
That being said, the control process responsible for monitoring the nodes' health and a few other parameters is the Node Controller, and it does that by reading the heartbeats sent by the kubelet agent on the nodes.
According to the Kubernetes documentation:
For nodes there are two forms of heartbeats:
updates to the .status of a Node
Lease objects within the kube-node-lease namespace. Each Node has an associated Lease object.
Compared to updates to .status of a Node, a Lease is a lightweight
resource. Using Leases for heartbeats reduces the performance impact
of these updates for large clusters.
The kubelet is responsible for creating and updating the .status of
Nodes, and for updating their related Leases.
The kubelet updates the node's .status either when there is change in status or if there has been no update for a configured interval.
The default interval for .status updates to Nodes is 5 minutes,
which is much longer than the 40 second default timeout for
unreachable nodes.
The kubelet creates and then updates its Lease object every 10 seconds (the default update interval). Lease updates occur independently of updates to the Node's .status. If the Lease update fails, the kubelet retries, using exponential backoff that starts at 200 milliseconds and is capped at 7 seconds.
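These intervals correspond to kubelet settings; a hedged sketch of the relevant KubeletConfiguration fields, with what I understand to be the documented defaults:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
nodeStatusUpdateFrequency: "10s"    # how often the kubelet computes node status
nodeStatusReportFrequency: "5m"     # forces a .status update even if nothing changed
nodeLeaseDurationSeconds: 40        # Lease duration; the kubelet renews the Lease every ~10s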
As for the Kubernetes Objects running on the nodes:
Kubernetes objects are persistent entities in the Kubernetes system.
Kubernetes uses these entities to represent the state of your cluster.
Specifically, they can describe:
What containerized applications are running (and on which nodes)
The resources available to those applications
The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance
A Kubernetes object is a "record of intent"--once you create the
object, the Kubernetes system will constantly work to ensure that
object exists. By creating an object, you're effectively telling the
Kubernetes system what you want your cluster's workload to look like;
this is your cluster's desired state.
Depending on the Kubernetes Object, a controller mechanism is responsible for maintaining its desired state. The Deployment Object, for example, uses a ReplicaSet underneath to maintain the described desired state of the Pods, while the StatefulSet Object uses its own controller for the same purpose.
To see a complete list of Kubernetes Objects managed by your cluster, you can run the command: kubectl api-resources

Kubernetes volumes: when is Statefulset necessary?

I'm approaching k8s volumes and best practices, and reading the documentation it seems that you always need to use the StatefulSet resource if you want to implement persistence in your cluster:
"StatefulSet is the workload API object used to manage stateful
applications."
I've worked through some tutorials; some of them use StatefulSet, some others don't.
In fact, let's say I want to persist some data: I can have my stateless Pods (even MySQL server pods!) in which I use a PersistentVolumeClaim that persists the state. If I stop and rerun the cluster, I can resume the state from the volume with no need for a StatefulSet.
I attach here an example GitHub repo in which there is a stateful app with MySQL and no StatefulSet at all:
https://github.com/shri-kanth/kuberenetes-demo-manifests
So do I really need to use a StatefulSet resource for databases in k8s? Or are there some specific cases it could be a necessary practice?
PVCs are not the only reason to use StatefulSets over Deployments.
As the Kubernetes manual states:
StatefulSets are valuable for applications that require one or more of the following:
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.
You can read more about database considerations when deployed on Kubernetes here: To run or not to run a database on Kubernetes.
StatefulSet is not the same as PV+PVC.
A StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of their Pods. These pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
In other words, it manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of those Pods.
So do I really need to use a StatefulSet resource for databases in k8s?
It depends on what you would like to achieve.
StatefulSet gives you:
the possibility to have a stable network ID (so your pods will always be named $(statefulset name)-$(ordinal))
the possibility to have stable storage, so when a Pod is (re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with its PersistentVolumeClaims (see the sketch below)
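A minimal StatefulSet sketch illustrating both points; the names, image and sizes are illustrative, not taken from the question:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
spec:
  serviceName: mysql            # headless Service providing the stable network IDs
  replicas: 2
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
      - name: mysql
        image: mysql:8.0
        volumeMounts:
        - name: data
          mountPath: /var/lib/mysql
  volumeClaimTemplates:         # one PVC per pod: data-mysql-0, data-mysql-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi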
...MySql and no StatefulSet...
As you can see, if your goal is just to run a single RDBMS Pod (for example, MySQL) that stores all its data (the DB itself) on a PV+PVC, then a StatefulSet is definitely overkill.
However, if you need to run a Redis cluster (a distributed DB) :-D it'll be close to impossible to do that without a StatefulSet (to the best of my knowledge, and based on numerous threads about the same topic on StackOverflow).
I hope that this info helps you.

Task distribution in Apache Flink

Consider a Flink cluster with some nodes, where each node has a multi-core processor. If we configure the number of slots based on the number of cores, with an equal share of memory per slot, how does Apache Flink distribute the tasks between the nodes and the free slots? Are they treated fairly?
Is there any way to make/configure Flink to treat the slots equally when we configure the task slots based on the number of cores available on a node?
For instance, assume that we partition the data equally and run the same task over the partitions. Flink uses all the slots from some nodes, while at the same time other nodes are totally free. The node with fewer CPU cores involved outputs the result much faster than the node with more CPU cores involved in the process. Apart from that, the speedup ratio is not proportional to the number of used cores in each node. In other words, if one core is occupied on one node and two cores are occupied on another, then, if each core were fairly treated as a slot, each slot should output the result of the same task in almost the same amount of time, irrespective of which node it belongs to. But this is not the case here.
With this assumption, I would say that the nodes are not treated equally. This in turn produces result times that are not proportional to the number of available nodes. We cannot say that increasing the number of slots necessarily decreases the time cost.
I would appreciate any comment from the Apache Flink Community!!
Flink's default strategy as of version >= 1.5 considers every slot to be resource-wise the same. Under this assumption, it should not matter, resource-wise, where you place the tasks, since all slots are the same. Given this, the main objective for placing tasks is to colocate them with their inputs in order to minimize network I/O.
If we are now in a standalone setup where we have a fixed number of TaskManagers running, Flink will pick slots in an arbitrary fashion (no guarantee given) for the sources and then colocate their consumers in the same slots if possible.
When running Flink on Yarn or Mesos where Flink can start new TaskManagers, Flink will first use up all slots of an existing TaskManager before it requests a new one. In this case, you will see that all sources will end up on as few TaskManagers as possible.
Since CPUs are not isolated wrt slots (they are a shared resource), the above-mentioned assumption does not hold true in all cases. Hence, in some cases where you have a fixed set of TaskManagers it is actually beneficial to spread the tasks out as much as possible to make use of the shared CPU resources.
In order to support this kind of scheduling strategy, the Flink community added the task spread-out strategy via FLINK-12122. In order to use a scheduling strategy that is more similar to the pre-FLIP-6 behaviour, where Flink tries to spread out the workload across all available TaskExecutors, one needs to set cluster.evenly-spread-out-slots: true in flink-conf.yaml, as shown below.
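For reference, that is a single entry in the configuration file:

# flink-conf.yaml
cluster.evenly-spread-out-slots: true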
Very old thread, but there is a newer thread that answers this question for current versions.
with Flink 1.5 we added resource elasticity. This means that Flink is now able to allocate new containers on a cluster management framework like Yarn or Mesos. Due to these changes (which also apply to the standalone mode), Flink no longer reasons about a fixed set of TaskManagers, because if needed it will start new containers (this does not work in standalone mode). Therefore, it is hard for the system to make any decisions about spreading the slots belonging to a single job out across multiple TMs. It gets even harder when you consider that some jobs, like yours, might benefit from such a strategy, whereas others would benefit from co-locating their slots. It gets even more complicated if you want to do scheduling with respect to multiple jobs, which the system does not have full knowledge about because they are submitted sequentially. Therefore, Flink currently assumes that slot requests can be fulfilled by any TaskManager.

statsd architecture for a distributed system

I am studying how to use the graphite - statsd - collectd stack to monitor a distributed system.
I have tested the components (graphite-web, carbon, whisper, statsd, collectd and grafana) in a local instance.
However, I'm confused about how I should distribute these components across a distributed system:
- A monitor node with graphite-web (and grafana), carbon and whisper.
- On every worker node: statsd and collectd sending data to the carbon backend on the remote monitor node.
Is this scheme right? How should I configure statsd and collectd to get acceptable network usage (tcp/udp, packets per second...)?
Assuming you have a relatively light workload, having a node that manages graphite-web, grafana, and carbon (which itself manages the whisper database) should be fine.
Then you should have a separate node for your statsd. Each of your machines/applications should have statsd client code that sends your metrics to this statsd node. This statsd node should then forward these metrics onto your carbon node.
For larger workloads that stress a single node, you'll need to either scale vertically (get more powerful node to host your carbons/statsd instances), or start clustering those services.
Carbon clusters tend to use some kind of relay that you send metrics to, which manages forwarding those metrics to the cluster (usually using consistent hashing). You could use a similar setup to consistently hash metrics across a cluster of statsd servers.
