Kubernetes volumes: when is a StatefulSet necessary? - database

I'm approaching k8s volumes and best practices, and when reading the documentation it seems that you always need the StatefulSet resource if you want persistence in your cluster:
"StatefulSet is the workload API object used to manage stateful
applications."
I've followed some tutorials; some of them use StatefulSet, others don't.
In fact, say I want to persist some data: I can run ordinary, "stateless" Pods (even MySQL server Pods!) that mount a PersistentVolumeClaim, which persists the state. If I stop and restart the cluster, I can recover the state from the volume with no need for a StatefulSet.
Here is an example GitHub repo with a stateful app using MySQL and no StatefulSet at all:
https://github.com/shri-kanth/kuberenetes-demo-manifests
So do I really need to use a StatefulSet resource for databases in k8s? Or are there specific cases where it is necessary?

PVCs are not the only reason to use StatefulSets over Deployments.
As the Kubernetes manual states:
StatefulSets are valuable for applications that require one or more of the following:
Stable, unique network identifiers.
Stable, persistent storage.
Ordered, graceful deployment and scaling.
Ordered, automated rolling updates.
You can read more about the considerations for running databases on Kubernetes here: To run or not to run a database on Kubernetes

StatefulSet is not the same as PV+PVC.
A StatefulSet manages Pods that are based on an identical container spec. Unlike a Deployment, a StatefulSet maintains a sticky identity for each of its Pods. These Pods are created from the same spec, but are not interchangeable: each has a persistent identifier that it maintains across any rescheduling.
In other words, it manages the deployment and scaling of a set of Pods, and provides guarantees about the ordering and uniqueness of those Pods.
So do I really need to use a StatefulSet resource for databases in k8s?
It depends on what you would like to achieve.
StatefulSet gives you:
A stable network identity (your Pods will always be named $(statefulset name)-$(ordinal)).
Stable storage: when a Pod is (re)scheduled onto a node, its volumeMounts mount the PersistentVolumes associated with its PersistentVolumeClaims (see the sketch below).
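For illustration, here is a minimal sketch of that pattern (the name mysql, the image, and the storage size are placeholders; adjust for your cluster):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql                     # placeholder name
spec:
  serviceName: mysql              # headless Service that gives each Pod a stable DNS name
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: mysql:8.0        # placeholder image/tag
          env:
            - name: MYSQL_ROOT_PASSWORD   # placeholder; use a Secret in practice
              value: "changeme"
          volumeMounts:
            - name: data
              mountPath: /var/lib/mysql
  volumeClaimTemplates:           # one PVC per Pod, re-attached on rescheduling
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

With volumeClaimTemplates, each replica gets its own PVC (data-mysql-0, data-mysql-1, ...) that follows the Pod across reschedules; a Deployment sharing a single PVC cannot give you that per-replica mapping.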
...MySQL and no StatefulSet...
As you can see, if your goal is just to run a single RDBMS Pod (for example MySQL) that stores all of its data (the DB itself) on a PV+PVC, then a StatefulSet is definitely overkill.
However, if you need to run a Redis cluster (a distributed DB), it will be close to impossible to do that without a StatefulSet (to the best of my knowledge, and based on numerous threads about this on Stack Overflow).
I hope that this info helps you.

Related

K8s: which access mode should be used in a PersistentVolumeClaim for a database deployment?

I want to store the data from a PostgreSQL database in a PersistentVolumeClaim.
(on a managed Kubernetes cluster on Microsoft Azure)
And I am not sure which access mode to choose.
Looking at the available access modes:
ReadWriteOnce
ReadOnlyMany
ReadWriteMany
ReadWriteOncePod
I would say, I should choose either ReadWriteOnce or ReadWriteMany.
Thinking about the fact that I might want to migrate the database pod to another node pool at some point, I would intuitively choose ReadWriteMany.
Is there any disadvantage if I choose ReadWriteMany instead of ReadWriteOnce?
You are correct about the migration case: there the access mode should be set to ReadWriteMany.
Generally, if you have access mode ReadWriteOnce on a multi-node cluster on Microsoft Azure, and multiple pods need to access the database, then Kubernetes will force all of those pods to be scheduled on the node that mounted the volume first. That node can end up overloaded with pods. If you also have a DaemonSet, where one pod is scheduled on each node, this poses a problem. In that scenario you are best off creating the PVC and PV with access mode ReadWriteMany.
Therefore:
if you want multiple pods scheduled on multiple nodes to have read and write access to the DB, use access mode ReadWriteMany;
if you logically need the pods/DB on one node and know for sure that they will stay on that one node, use access mode ReadWriteOnce.
You should choose ReadWriteOnce.
I'm a little more familiar with AWS, so I'll use it as a motivating example. In AWS, the easiest kind of persistent volume to get is backed by an Amazon Elastic Block Storage (EBS) volume. This can be attached to only one node at a time, which is the ReadWriteOnce semantics; but, if nothing is currently using the volume, it can be detached and reattached to another node, and the cluster knows how to do this.
Meanwhile, for PostgreSQL database storage (and most other database storage), only one process can be using the physical storage at a time, whether on one node or several. In the best case, a second copy of the database pointing at the same storage will fail to start up; in the worst case, you'll corrupt the data.
So:
It never makes sense to have the volume attached to more than one pod at a time.
Therefore it never makes sense to have the volume attached to more than one node at a time.
And ReadWriteOnce volumes are very easy to come by, while ReadWriteMany may not be available by default.
This logic probably applies to most use cases, particularly in a cloud environment, where you'll also have your cloud provider's native storage system available (AWS S3 buckets, for example). Sharing files between processes is fraught with peril, especially across multiple nodes. I'd almost always pick ReadWriteOnce absent a really specific need to use something else.
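For reference, a minimal PVC sketch along the lines of this answer (the storageClassName is an assumption for a managed Azure cluster; omit it to use the cluster default):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data            # placeholder name
spec:
  accessModes:
    - ReadWriteOnce              # one node at a time; fits a single-writer database
  storageClassName: managed-csi  # assumed Azure disk-backed class; adjust for your cluster
  resources:
    requests:
      storage: 20Gi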

How to auto scale up/down Flink Stateful Functions on K8s

My Current Flink Application
Based on Flink Stateful Functions 3.1.1; it reads messages from Kafka, processes them, and then sinks the results to a Kafka egress.
The application has been deployed on K8s following this guide and is running well: Stateful Functions Deployment
On top of the standard deployment, I have turned on Kubernetes HA.
My Objectives
I want to auto scale the stateful functions up and down.
I also want to know how to create more standby job managers.
My Observations about the HA
I tried to set kubernetes.jobmanager.replicas in the flink-config ConfigMap:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: shadow-fn
data:
  flink-conf.yaml: |+
    kubernetes.jobmanager.replicas: 7
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
I see no standby job managers in K8s.
Then I directly adjusted the replicas of the Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: statefun-master
spec:
  replicas: 7
Standby job managers show up. I checked the pod logs, and the leader election completes successfully. However, when I access the UI in a web browser, it says:
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
What's wrong with my approach?
My Questions about the scaling
Reactive Mode is exactly what I need. I tried it but failed; the job manager has this error message:
Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Reactive mode is configured for an unsupported cluster type. At the moment, reactive mode is only supported by standalone application clusters (bin/standalone-job.sh).
It seems that Stateful Functions auto scaling shouldn't be done this way.
What is the correct way to do the auto scaling, then?
Potential Approach (probably incorrect)
After some research, my current direction is:
The Job Manager has nothing to do with auto scaling; it is related to HA on K8s. I just need to make sure the Job Manager has correct failover behavior.
My stateful functions are Flink remote services, i.e., they are regular K8s services. They can be deployed as Knative Services to achieve auto scaling, so replicas of a service only go up when HTTP requests come in from Flink's workers (see the sketch after this list).
The most important part is Flink's workers (the Task Managers); I have no idea how to auto scale them yet. Maybe I should use Knative to deploy the Flink workers as well?
If it doesn't work with Knative, maybe I should change the Flink runtime deployment entirely, e.g. try the original reactive demo. But I'm afraid Stateful Functions are not intended to work like that.
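For illustration, a rough sketch of what such a Knative Service for one remote function endpoint might look like (the name, image, and port are placeholders; I have not verified this works with StateFun):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-remote-function        # placeholder name
spec:
  template:
    spec:
      containers:
        - image: example.registry/my-statefun-function:latest   # placeholder image
          ports:
            - containerPort: 8000 # port the remote function's HTTP endpoint listens on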
Finally
I have read the Flink documentation and the GitHub samples over and over but cannot find any more information on this. Any hints/instructions/guidelines are appreciated!
Since Reactive Mode is a new, experimental feature, not all features supported by the default scheduler are also available with Reactive Mode (and its adaptive scheduler). The Flink community is working on addressing these limitations.
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
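As the error message in the question says, Reactive Mode is only supported for standalone application clusters started via bin/standalone-job.sh. Purely as an illustration of the setup documented in the elastic scaling docs linked above (whether the StateFun runtime can be packaged that way is exactly the open question here), the relevant configuration is roughly:

# flink-conf.yaml excerpt, standalone application cluster only (illustrative)
scheduler-mode: reactive
# the job is then started with bin/standalone-job.sh, and scaling happens by
# changing the replica count of the TaskManager Deployment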

Is it advisable to use cloud redis with noeviction policy to act as persistent database?

I am thinking of using a Cloud Memorystore Redis database with the eviction policy set to noeviction, as a sort of persistent database to serve clients. What could be the downsides of this?
Of course we will keep the instance memory on the higher side to make sure incoming keys can be accommodated. Is there any chance keys can be lost when some kind of infrastructure restructuring, failover, or patching happens at the cloud provider's end?
Thanks in advance
There is still a chance that keys will be lost in the case of unplanned restarts. Failover only kicks in for instance crashes or scheduled maintenance, and will not help with manual restarts. GCP also offers two Redis tiers, and only the Standard tier supports failover.
Both offer a maximum instance size of 300 GB and maximum network bandwidth of 12 Gbps. The advantage of the Standard tier is that it provides redundancy and availability through replication, cross-zone replication, and automatic failover.
noeviction is only a policy that ensures keys are never evicted or replaced, regardless of how old they are; Redis simply returns an error when the instance reaches maxmemory. It does not provide persistence features such as point-in-time snapshots (RDB) or AOF persistence, which unfortunately Memorystore does not support yet.
Since Memorystore does not cover your entire use case, my suggestion is to use open source Redis instead. You can quickly provision and deploy a Redis VM instance from the GCP Marketplace.
You can check out the full features in the documentation.
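If you do self-manage Redis, persistence is just configuration. As an illustration, here is a sketch as a Kubernetes Pod (since the rest of this thread is Kubernetes-focused; on a Marketplace VM you would set the same directives in redis.conf). The name, image tag, and PVC name are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: redis-persistent            # placeholder name
spec:
  containers:
    - name: redis
      image: redis:6.2              # example tag
      # config directives can be passed to redis-server as command-line flags
      args:
        - "--maxmemory-policy"
        - "noeviction"              # same eviction behaviour as the Memorystore setting
        - "--appendonly"
        - "yes"                     # AOF persistence, which Memorystore lacks
      volumeMounts:
        - name: data
          mountPath: /data          # Redis writes its AOF/RDB files here
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: redis-data       # assumes an existing PVC with this name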

RDBMS scale on kubernetes

One simple question. I would like to containerize a PostgreSQL RDBMS and use a volume for persistent storage.
Say it is time to scale because traffic is rising. Since PostgreSQL does not support multi-master, how is this going to work? One request might go to pod x and another to pod y.
Just like when you are running Postgres anywhere else, you scale it mostly either by manual sharding (i.e. breaking the data into discrete silos) or by throwing more horsepower (read: money) at it. Kubernetes doesn't particularly help or hinder either approach; Postgres is still just Postgres no matter how you run it.
Shoutout to Citus though, which does allow some level of load-sharing, even if it's not a true distributed system.
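For the "more horsepower" route, vertical scaling in Kubernetes just means raising the container's resource requests and limits. A minimal sketch (the name, image, and numbers are placeholder examples; volume configuration omitted for brevity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres                    # placeholder; a StatefulSet is equally valid here
spec:
  replicas: 1                       # keep a single writable primary
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:14        # placeholder image/tag
          env:
            - name: POSTGRES_PASSWORD   # placeholder; use a Secret in practice
              value: "changeme"
          resources:
            requests:
              cpu: "4"              # scale these up as traffic grows...
              memory: 16Gi
            limits:
              cpu: "8"              # ...rather than adding replicas, since
              memory: 32Gi          # Postgres has a single writable primary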

Will using a Cloud PaaS automatically solve scalability issues?

I'm currently looking for a Cloud PaaS that will allow me to scale an application to handle anything between 1 user and 10 million+ users. I've never worked on anything this big, and the big question I can't seem to get a clear answer to is: if you develop, let's say, a standard application with a relational database and SOAP web services, will this application scale automatically when deployed on a PaaS solution, or do you still need to build the application with failover, redundancy, and all those things in mind?
Let's say I deploy a Spring Hibernate application to Amazon EC2 and create a single Ubuntu Server instance with Tomcat installed. Will this application just scale indefinitely, or do I need more Ubuntu instances? If more than one Ubuntu instance is needed, does Amazon take care of running the application across both instances, or is this the developer's responsibility? What about database storage: can I install a database on EC2 that will scale as the database grows, or do I need to use one of their APIs instead if I want it to scale indefinitely?
CloudFoundry allows you to build locally and deploy straight to their PaaS, but since it's in beta, there's a limit on the amount of resources you can use and databases are limited to 128 MB, if I remember correctly, so this is a no-go for now. Some have suggested installing CloudFoundry on Amazon EC2; how does it scale, and how is the database layer handled then?
GAE (Google App Engine): will this allow me to just deploy an app and not have to worry about how it scales and implements redundancy? There appear to be some limitations on what you can and can't run on GAE, and their recent price increase upset quite a large number of developers. Is it really that expensive compared to other providers?
So basically, will it scale and what needs to be done to make it scale?
That's a lot of questions for one post. Anyway:
Amazon EC2 does not scale automatically with load. EC2 is basically just a virtual machine. You can achieve scaling of EC2 instances with Auto Scaling and Elastic Load Balancing.
SQL databases scale poorly. That's why people started using NoSQL databases in the first place. It's best to see which database your cloud provider offers as a managed service: Datastore on GAE and DynamoDB on Amazon.
Installing your own database on EC2 instances is very impractical, as EC2 has ephemeral storage (it loses all data on "disk" when it reboots).
GAE Datastore is actually one big database for all applications running on it, so it's pretty scalable - your millions of users should not be a problem for it.
http://highscalability.com/blog/2011/1/11/google-megastore-3-billion-writes-and-20-billion-read-transa.html
Yes, App Engine scales automatically, both frontend instances and the database. There is nothing special you need to do to make it scale; just use their APIs.
There are limitations on what you can do with App Engine:
A. No local storage (filesystem) - you need to use the Datastore or Blobstore.
B. Comet is only supported via their proprietary Channels API
C. Datastore is a NoSQL database: no JOINs, limited queries, limited transactions.
The cost of GAE is not bad. We do 1M requests a day for about 5 dollars a day. The biggest saving comes from the fact that you do not need a system admin on GAE (but you do need one for EC2). Compared to the cost of manpower, GAE is incredibly cheap.
Some hints to save money on (and speed up) GAE:
A. Use get instead of query in the Datastore (this requires carefully crafting natural keys).
B. Use memcache to cache data you got from the Datastore. This can be done automatically with Objectify and its @Cached annotation.
C. Denormalize data, meaning you write data redundantly in various places in order to read it back in as few operations as possible.
D. If you have a lot of REST requests from devices where you do not use cookies, switch off session support (or roll your own, as we did). Sessions use the Datastore under the hood, and every request does a get and a put.
E. Read about adjusting app settings. Try different settings (depending on how tolerant your app is to request delay, and on your traffic patterns/spikes). We were able to cut down frontend instances by 70% (a sketch of the kind of settings involved follows below).
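As an illustration of point E, these are the sort of knobs involved. The original answer's Java app would have used appengine-web.xml, but the equivalent settings in an app.yaml look roughly like this (the values are arbitrary examples, not recommendations):

# app.yaml excerpt (hypothetical example values)
automatic_scaling:
  max_idle_instances: 1        # fewer idle instances -> lower cost
  min_pending_latency: 500ms   # let requests queue a bit before spinning up new instances
  max_pending_latency: automatic
  max_concurrent_requests: 40  # pack more requests onto each instance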
