Scaling down old ReplicaSets while deploying requires data migration - database

What happens when ReplicaSet_B and ReplicaSet_A update the same database? I had hoped the pods in ReplicaSet_A would be stopped after taking a snapshot, but there is no explanation of anything like this in https://kubernetes.io/docs/concepts/workloads/controllers/deployment/. It seems to be assumed that the containers in the pods are running online applications. What if they are batch applications? The old pods belonging to the old ReplicaSet will keep updating the databases in the old manner, which also raises a data migration issue.

Yes. ReplicaSets (managed by Deployments) make two assumptions: 1. your workload is stateless, and 2. all pods are identical clones (other than their IP addresses). Now, StatefulSets address some aspects of this - for example, you can assign pods a certain identity (say, leader or follower) - but they really only work for specific workloads. The Job abstraction in Kubernetes won't help you much with stateful workloads either. What you are likely looking at is a custom controller or operator. We're collecting good practices and tooling via stateful.kubernetes.sh; maybe there's something there that can be of help?
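For a concrete picture of where that rollout behaviour is configured: a Deployment's spec.strategy field controls it, and the built-in Recreate type terminates all old pods before any new ones start, which is the closest standard behaviour to "stop the old pods first" that the question hopes for. A minimal sketch, written as a plain TypeScript object mirroring the manifest; the name and image are hypothetical placeholders:

    // Sketch of a Deployment manifest expressed as a plain TypeScript object.
    // "batch-writer" and the image tag are hypothetical placeholders.
    const deployment = {
      apiVersion: "apps/v1",
      kind: "Deployment",
      metadata: { name: "batch-writer" },
      spec: {
        replicas: 3,
        // Recreate: the old ReplicaSet is scaled to zero before the new one
        // starts, so old and new pods never write to the database at the same
        // time (at the cost of downtime during the rollout).
        strategy: { type: "Recreate" },
        selector: { matchLabels: { app: "batch-writer" } },
        template: {
          metadata: { labels: { app: "batch-writer" } },
          spec: {
            containers: [{ name: "app", image: "example/batch-writer:v2" }],
          },
        },
      },
    };

    export default deployment;

Note that Recreate only avoids concurrent old and new writers; it does nothing about migrating the data itself, which is where the custom controller or operator route mentioned above comes in.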

Related

Can you horizontally scale a StatefulSet running a relational database in Kubernetes?

Why would I want multiple replicas of my DB?
Redundancy: I have > 1 replica of my app code. Why? In case one node fails, another can fill its place when run behind a load balancer.
Load: a load balancer can distribute traffic to multiple instances of the app.
A/B testing: I can have one node serve one version of the app, and another serve a different one.
Maintenance: I can bring down one instance for maintenance and keep the other one up with zero downtime.
So, I assume I'd want to do the same with the backing db if possible too.
I realize that many NoSQL DBs are better suited to running multiple instances, but I am interested in relational DBs.
I've played with operators like this and this but have found problems with the docs, have not been able to get them up and running, and found the community a bit lacking. Relying on this kind of thing in production makes me nervous. The MySQL operator even has a note saying it's not for production use.
I see that native k8s StatefulSets support scaling, but those docs aren't specific to DBs at all. I assume the complication is that DBs need to write persistently to disk via a volume, and that data has to be synced and routed somehow if you have more than one instance.
So, is this something that's non-trivial to do myself? Or am I better off having a dev environment that uses a one-replica DB image in the cluster in order to save on billing, and a prod environment that uses a fully managed DB, something like this that takes care of the scaling/HA for me? Then I'd use kustomize to manage the YAML variances.
Edit:
I actually found a postgres operator that worked great. Followed the docs one time through and it all worked, and it's from postgres docs.
I have created this community wiki answer to summarize the topic and to make pertinent information more visible.
As Turing85 rightly mentioned in the comments:
Do NOT share a PVC across multiple DB instances. Even if you use the right backing volume (it must be object-based storage in order to be read-write-many), with enough scaling, performance will take a hit (after all, everything goes to one file system, and this will stress the FS). The proper way would be to configure clustering. All major relational databases (MSSQL, MySQL, PostgreSQL, Oracle, ...) support clustering. To be on the safe side, however, I would recommend buying a scalable database "as a service" unless you know exactly what you are doing.
A good solution might be to use a single-replica StatefulSet for development, to keep billing down, and a fully managed cloud-based SQL solution in prod, unless you have the knowledge or a sufficiently mature operator to deploy a clustered DBMS.
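To make the single-replica development setup concrete, here is a minimal sketch of such a StatefulSet, written as a plain TypeScript object mirroring the manifest; the name, image, password handling and storage size are placeholder assumptions. volumeClaimTemplates gives each replica its own PersistentVolumeClaim, so no volume is shared the way the quoted comment warns against.

    // Sketch only: a one-replica PostgreSQL StatefulSet for a dev environment.
    // "dev-postgres", the password, and "2Gi" are hypothetical placeholders.
    const devPostgres = {
      apiVersion: "apps/v1",
      kind: "StatefulSet",
      metadata: { name: "dev-postgres" },
      spec: {
        serviceName: "dev-postgres",   // a matching headless Service is assumed to exist
        replicas: 1,                   // single replica: no clustering to configure
        selector: { matchLabels: { app: "dev-postgres" } },
        template: {
          metadata: { labels: { app: "dev-postgres" } },
          spec: {
            containers: [{
              name: "postgres",
              image: "postgres:16",
              env: [{ name: "POSTGRES_PASSWORD", value: "dev-only-password" }],
              ports: [{ containerPort: 5432 }],
              volumeMounts: [{ name: "data", mountPath: "/var/lib/postgresql/data" }],
            }],
          },
        },
        // Each replica gets its *own* PVC; the volume is never shared.
        volumeClaimTemplates: [{
          metadata: { name: "data" },
          spec: {
            accessModes: ["ReadWriteOnce"],
            resources: { requests: { storage: "2Gi" } },
          },
        }],
      },
    };

    export default devPostgres;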
Another solution may be to use a different operator as Aaron did:
I actually found a postgres operator that worked great. Followed the docs one time through and it all worked, and it's from postgres: https://www.kubegres.io/doc/getting-started.html
See also this similar question.

Enabling MongoDB transactions without replica sets, or with the least possible configuration

[Some background information - possibly skippable]
To begin with, I have barely any understanding of database management and only shallow experience with mongoose and Node on the backend (a couple of Udemy courses). Those courses made me believe that MongoDB was still a viable choice for a database with relational properties, and off I went working on a backend for a forum-like website. After learning about transactions, I attempted to implement them in my backend, since a rollback feature seemed absolutely necessary when executing an array of queries. However, it turns out that transactions are only possible on replica sets, which also appeared to require a minimum of 2 members; 3 databases for a startup MVP was obviously considered overkill.
[The question]
Is it possible to implement transactions with only 1 database? If so, how?
If the above is not possible, how would one launch a MongoDB database with minimal configuration while still supporting transactions, keeping in mind that the database is for a startup MVP?
(For anyone with experience implementing a production-level Mongo database in a similar scenario) If forgoing transactions were considered, how would one safely send queries that edit or create multiple documents without transactions, without spraying every bit of code with try-catches full of rollback queries for every point of failure (which I considered way too much overhead)?
I have a tight deadline and have already done a substantial bit of groundwork and a couple of routes using mongoose, which means ditching MongoDB for a relational database would be a difficult option at the moment.
I think I've googled everything related to the subject at hand and even tried blogs/articles on the second page of a Google search (which many, including myself, consider the dark web /s).
Yet I do think I may have missed what I was looking for, and an answer consisting of just links is also welcome.
Thank you for reading!
You need a replica set[*] to use transactions, but you can create a single-node replica set for testing purposes.
The full procedure is described in the documentation; for a single-node RS you follow it as written but only configure a single member.
Briefly, you need to pass the --replSet argument to mongod, then connect to it through the mongo shell and run rs.initiate().
Note that transactions aren't a magic solution to all problems. There are scenarios where they are appropriate and scenarios where MongoDB provides other functionality that would be a better fit.
[*] or a sharded cluster with MongoDB 4.2+, but this involves more setup work.
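Once the single-node replica set is up (mongod started with --replSet rs0 and rs.initiate() run once), transactions work from mongoose, which the question is already using. A rough sketch, where the connection string, models and fields are assumptions for illustration:

    // Sketch: a multi-document write wrapped in a transaction with mongoose.
    // The database name, Post/Comment models and their fields are hypothetical.
    import mongoose from "mongoose";

    const Post = mongoose.model(
      "Post",
      new mongoose.Schema({ title: String, commentCount: Number })
    );
    const Comment = mongoose.model(
      "Comment",
      new mongoose.Schema({ postId: mongoose.Schema.Types.ObjectId, body: String })
    );

    async function addComment(postId: mongoose.Types.ObjectId, body: string) {
      const session = await mongoose.startSession();
      try {
        // withTransaction commits on success, aborts on error, and retries
        // transient failures, so no hand-written rollback queries are needed.
        await session.withTransaction(async () => {
          await Comment.create([{ postId, body }], { session });
          await Post.updateOne({ _id: postId }, { $inc: { commentCount: 1 } }, { session });
        });
      } finally {
        await session.endSession();
      }
    }

    async function main() {
      // replicaSet=rs0 matches a mongod started with --replSet rs0.
      await mongoose.connect("mongodb://localhost:27017/forum?replicaSet=rs0");
      // ... call addComment(...) from your route handlers ...
      await mongoose.disconnect();
    }

    main().catch(console.error);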

Kubernetes and Cloud Databases

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook Presto that AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you?
Now, let's say you want or need to use Apache Drill or Apache Impala. Amazon doesn't offer it. Nor do any of the other big public cloud providers at the time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As the previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to the previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (though why would you, as already said):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted/managed one.
But there is more to it than just convenience/control, which is what this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how you handle storage (persistent volumes) you have two additional considerations:
Access speed (depending on your use case this can be negligible or a showstopper).
The storage you have at hand might not be optimized for relational-database-style I/O (or might restrict how efficiently you can schedule pods) - the very same reason you are not advised to run a DB on NFS, for example.
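On the Kubernetes side, the main knob for that second consideration is the storage class the database's volume claim requests. A minimal sketch, again as a plain TypeScript object mirroring the manifest; the claim name, the "fast-ssd" class and the size are placeholders that would have to exist in, and suit, your cluster:

    // Sketch: a PersistentVolumeClaim that explicitly requests an SSD-backed
    // storage class for database data. "db-data" and "fast-ssd" are hypothetical.
    const dbVolumeClaim = {
      apiVersion: "v1",
      kind: "PersistentVolumeClaim",
      metadata: { name: "db-data" },
      spec: {
        accessModes: ["ReadWriteOnce"],    // a single-node block volume, not shared NFS
        storageClassName: "fast-ssd",      // must map to a provisioner in the cluster
        resources: { requests: { storage: "20Gi" } },
      },
    };

    export default dbVolumeClaim;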
There are several recent conference talks on Kubernetes pointing out that databases are a big no-no for Kubernetes (although this is highly opinionated; we do run average-load MySQL and PostgreSQL databases in k8s), and heavy load / fast I/O is somewhat of a challenge to get right on k8s, as opposed to a managed cloud solution where somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, control, and capabilities.

Database for a java application in cluster

I'd like to play around with Kubernetes. I'm able to start a simple app, but now I'd like to design something more complex. Nevertheless, I can't figure out how to handle database access in such an architecture.
Let's say I have 100 pod replicas of some simple chat application. They all need to access the same database (or rather the same data set) and perform CRUD operations on it. How do I design this so that the data stays consistent and the risk of deadlocks is eliminated?
If possible, I'd like to use an SQL-like database, so I can comfortably use Hibernate and other tools I'm familiar with.
Is this even possible, or do I have to use a totally different approach? What is the name of the technology or architecture I'm searching for?
1) You can use a connection pool to cap the number of open database connections and make the connection settings more aggressive/elastic (see the sketch after this list);
2) Split your microservices in such a way that access to persistence goes through a dedicated microservice exposing your CRUD service to your persistence layer (MySQL/RDBMS/NoSQL/etc.). That way you most likely don't need hundreds of replicas of the pods that talk to the database.
3) Deadlocks / locking strategies - as Andrew mentioned in the comments, this is more related to your software architecture than to K8s itself. There are plenty of ways to deal with it, each with its own pros and cons.
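As an illustration of point 1, here is a minimal connection-pool sketch. The question is about Java/Hibernate, where a pool such as HikariCP plays this role; purely to show the idea of capping and reusing connections, the sketch uses node-postgres, and all connection details, table and column names are placeholders.

    // Sketch: a shared pool that caps and reuses database connections.
    import { Pool } from "pg";

    const pool = new Pool({
      host: "db.internal",            // hypothetical host
      database: "chat",
      user: "app",
      password: process.env.DB_PASSWORD,
      max: 10,                        // cap connections per pod: 100 pods * 10 = 1000 max
      idleTimeoutMillis: 30_000,      // release idle connections ("elastic")
      connectionTimeoutMillis: 2_000, // fail fast instead of piling up waiters
    });

    export async function saveMessage(roomId: string, body: string): Promise<void> {
      // pool.query checks a connection out, runs the statement, and returns it.
      await pool.query(
        "INSERT INTO messages (room_id, body) VALUES ($1, $2)",
        [roomId, body],
      );
    }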

How to setup deployments in Azure so that they use different databases depending on the environment?

You can easily swap two deployments between the staging and production environments in the Azure Management Portal by swapping their VIP. When working on a staging version of the services we want to use a staging database as well, so we don't risk clobbering actual customer data. However, after swapping staging and production services, the now-production (and formerly staging) deployment should obviously work on the production database.
So essentially the database to use would depend on whether the instance runs in the Staging or Production environment. Is there a good way of achieving that? Relying on the VIP and hard-coding the database switching based on that is probably not the best idea, I guess.
My recommendation would be to stop using the "staging slot" of a service for the function a traditional "staging environment" serves. When I'm speaking to folks about Windows Azure, I strongly recommend they use the staging slot only to smoke test a new deployment before it goes live. If they want a more protracted sort of testing, the kind many of us are used to having on-premises, then use a separate service and possibly even a separate subscription (the latter is great if you want cost transparency).
All this said, your only real options are to have a second service configuration, specific to production, that you switch to before you execute the VIP swap, or to write some code that allows the service to detect which slot it's in and pull the appropriate one of two configuration settings.
However, as I outlined in the first paragraph, I think there's a better way to do things. :)
In a recent release of Azure Websites, the story here has changed. You may now specify that any app setting or connection string is a "slot setting", pinning it to the particular slot. To solve your issue, you would simply set the connection string(s) in each slot and take care to check 'Slot Setting'.
I'm less clear on whether this is an advisable approach now. Database schema migration and rollback aren't baked in, and I'm unsure how to handle that correctly. Also, only app settings and connection strings work this way, so, for example, system.net.mail settings cannot be pinned to a slot. For that, you'd need to change code to get the mail server info, etc. from app settings, or else use some other approach.
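With a slot setting in place, the application simply reads its connection string from configuration and gets the value pinned to whichever slot it is running in; a VIP swap does not move it. In Azure App Service, connection strings configured in the portal are surfaced to the process as environment variables with a type prefix (SQLAZURECONNSTR_, SQLCONNSTR_, MYSQLCONNSTR_, CUSTOMCONNSTR_). A sketch for a Node/TypeScript app, where the setting name "Db" is a placeholder configured in each slot with "Slot Setting" checked:

    // Sketch: resolve the connection string pinned to the current slot.
    // "Db" is a hypothetical connection-string name set in both slots.
    function getConnectionString(name: string): string {
      const prefixes = ["SQLAZURECONNSTR_", "SQLCONNSTR_", "MYSQLCONNSTR_", "CUSTOMCONNSTR_"];
      for (const prefix of prefixes) {
        const value = process.env[prefix + name];
        if (value) return value;
      }
      throw new Error(`Connection string "${name}" is not configured for this slot`);
    }

    // Staging resolves to the staging database, production to the production one.
    const dbConnectionString = getConnectionString("Db");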
Re: "When working on a staging version of the services we want to use a staging database as well so we don't risk clobbering actual customer data." There is not a built-in way to do this.
If you wish to test without risk to production data, consider doing this testing in another Azure account - one that doesn't even have access to the production database. Then, when you think the system is tested and ready to go live, only then bring it up into the staging slot next to your production instance for a final smoke test.
I can imagine scenarios where you'd also want to run through a few scenarios on the staging instance before doing a VIP swap, but don't want to pollute production data. For this, many companies use special accounts - data associated with these accounts is known (or marked somehow) to be not from real customers, so it can be skipped in reporting, billing, and the like.
Re: "Relying on the VIP and hard-coding the database switching based on that is probably not the best idea, I guess." If by hard-coding you mean reading it from a config file, that is probably not a bad idea if you use an approach like the one mentioned above. I have heard of some folks going with a "figure out if we are in a staging slot and do something different in the code" approach, but I'd rather recommend what I described above.

Resources