Kubernetes and Cloud Databases - database

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?

It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook's Presto that AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you?
Now, let's say you want to or need to use Apache Drill or Apache Impala. Amazon doesn't offer them, and neither do any of the other big public cloud providers at time of writing, as far as I know, so running them yourself is the only option.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.

Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As the previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to the previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (why would you, as said already):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted/self-managed one.
But there is more to it than just convenience/control, which is what this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how you handle storage (persistent volumes) you can have two additional considerations:
Access speed (depending on your use case this can be negligible or a show stopper).
The storage that you have at hand might not be optimized for relational-database-style I/O (or might restrict your ability to schedule pods efficiently). These are the very same reasons you are advised not to run a db on NFS, for example.
Several recent conference talks on Kubernetes argue that databases are a big no-no on Kubernetes. That is highly opinionated (we do run MySQL and PostgreSQL databases under average load in k8s), but heavy load and fast I/O are somewhat challenging to get right on k8s, whereas in a managed cloud solution somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, controls and capabilities.

Related

Can you scale a StatefulSet horizontally running a relational database in Kubernetes?

Why would I want to have multiple replicas of my DB?
Redundancy: I have > 1 replicas of my app code. Why? In case one node fails, another can fill its place when run behind a load balancer.
Load: A load balancer can distribute traffic to multiple instances of the app.
A/B testing. I can have one node serve one version of the app, and another serve a different one.
Maintenance. I can bring down one instance for maintenance, and keep the other one up with 0 down-time.
So, I assume I'd want to do the same with the backing db if possible too.
I realize that many nosql dbs are better configured for multiple instances, but I am interested in relational dbs.
I've played with operators like this and this but have found problems with the docs, have not been able to get them up and running, and found the community a bit lacking. Relying on this kind of thing in production makes me nervous. The MySQL operator even has a note saying it's not for production use.
I see that native k8s statefulsets have scaling but these docs aren't specific to dbs at all. I assume the complication is that dbs need to write persistently to disk via a volume and that data has to be synced and routed somehow if you have more than one instance.
So, is this something that's non-trivial to do myself? Or, am I better off having a dev environment that uses a one-replica db image in the cluster in order to save on billing, and a prod environment that uses a fully managed db, something like this that takes care of the scaling/HA for me? Then I'd use kustomize to manage the yaml variances.
Edit:
I actually found a postgres operator that worked great. Followed the docs one time through and it all worked, and it's from postgres docs.
I have created this community wiki answer to summarize the topic and to make pertinent information more visible.
As Turing85 rightly mentioned in a comment:
Do NOT share a pvc to multiple db instances. Even if you use the right backing volume (it must be an object-based storage in order to be read-write many), with enough scaling, performance will take a hit (after all, everything goes to one file system, this will stress the FS). The proper way would be to configure clustering. All major relational databases (mssql, mysql, postgres, oracle, ...) do support clustering. To be on the secure side, however, I would recommend to buy a scalable database "as a service" unless you know exactly what you are doing.
A good solution might be to use a single-replica StatefulSet for development, to avoid billing, and a fully managed cloud-based SQL solution in prod, unless you have the knowledge or a sufficiently professional operator to deploy a clustered DBMS.
Another solution may be to use a different operator as Aaron did:
I actually found a postgres operator that worked great. Followed the docs one time through and it all worked, and it's from postgres: https://www.kubegres.io/doc/getting-started.html
See also this similar question.
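As a rough illustration of the single-replica StatefulSet idea for a development cluster, here is a minimal Python sketch that builds the manifest as a plain dictionary and prints it as JSON. The names (`dev-postgres`, the `-secret` suffix), the `postgres:16` image tag, the storage size and the `PGDATA` subdirectory are all assumptions made for the example, not something from the question. It also assumes that a headless Service and a Secret with those names already exist. The point it shows is that `volumeClaimTemplates` gives each pod its own claim, so instances never share a PVC.

```python
import json


def postgres_statefulset(name: str, storage: str = "1Gi") -> dict:
    """Build a single-replica StatefulSet manifest for a development database."""
    labels = {"app": name}
    container = {
        "name": "postgres",
        "image": "postgres:16",  # assumed image tag
        "ports": [{"containerPort": 5432}],
        "env": [
            {
                "name": "POSTGRES_PASSWORD",
                # Assumes a Secret named "<name>-secret" with a "password" key.
                "valueFrom": {
                    "secretKeyRef": {"name": f"{name}-secret", "key": "password"}
                },
            },
            # Keep the data in a subdirectory of the mounted volume.
            {"name": "PGDATA", "value": "/var/lib/postgresql/data/pgdata"},
        ],
        "volumeMounts": [{"name": "data", "mountPath": "/var/lib/postgresql/data"}],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service assumed to exist
            "replicas": 1,  # one instance only: no clustering is configured
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
            # One PersistentVolumeClaim is created per pod from this template,
            # so database instances never share a volume.
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "data"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(postgres_statefulset("dev-postgres"), indent=2))
```

Since kubectl accepts JSON as well as YAML, the output can be piped into `kubectl apply -f -`. Raising `replicas` here would only start more independent, unsynchronized databases; replication still has to come from the database's own clustering or from an operator.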

Is it possible to run Postgres in Google App Engine Flexible?

Is it possible to run postgres (essentially, a non-HTTP service) in a custom Google App Engine Flexible container? Or will I be forced to use Google's Cloud SQL solution?
TL;DR: You could do that, but don’t. It’s better to externalize the persistent data storage.
Yes, it is possible to run a PostgreSQL database as a microservice (named simply a 'service' in Google Cloud Platform) in a custom Google App Engine Flexible container. However, that raises another important question, namely why would you like to run an SQL database inside a container. This is a risky solution, unless you are perfectly sure about what you are doing and how to manage that.
Typical container orchestration is based on stateless services, which means that containers are not intended to store persistent data. Containers of this kind do sometimes have some form of storage, such as NoSQL databases for caching or user session information, but that data is not persistent: it can be lost when instances are restarted or destroyed in an agile containerized application environment. PostgreSQL databases are typically used as stateful services and do not suit this model. Putting such a database into a container, one can run into problems like data corruption or concurrent access to a shared data directory. Also, in Google App Engine Flexible it’s not possible to add a shared persistent disk; the volumes are attached to instances and destroyed together with them. A much safer solution is to keep the SQL database in external, durable storage, such as the Cloud SQL that you mentioned. There are numerous blog posts and articles that elaborate on this issue with stateless/stateful services, like this one.
It should be mentioned that if you are going to use the container in a local environment or for test/development (and you are not looking for a durable state of the database), putting PostgreSQL inside a container should be perfectly OK. Also, if you design a special way of splitting your data across instances this could work fine, as the authors of this article did with their MySQL servers. So once again, the idea of putting a PostgreSQL database in a container should be carefully thought out, especially since there are so many options for safely externalizing such a service.
And just as a side note, you are not forced to use Cloud SQL. The database can be hosted on Compute Engine, another cloud provider, on premises, or can be managed by a third-party vendor. In case of hosting it in Compute Engine the application is able to communicate with the database inside the same project using the internal IP of the Compute Engine instance. Using Cloud Launcher you can quickly deploy PostgreSQL and other popular databases to Compute Engine. Check these Google docs for more information about using third-party databases.
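To make the "externalize the database" advice a bit more concrete, here is a minimal sketch of how an application might assemble its PostgreSQL connection URI from environment variables, so that the same container can point at Cloud SQL, a Compute Engine instance reached over its internal IP, or a local test database. The variable names (`DB_HOST`, `DB_PORT`, `DB_USER`, `DB_PASSWORD`, `DB_NAME`) and the default values, including the `10.128.0.2` address, are hypothetical and chosen only for illustration.

```python
import os
from typing import Mapping
from urllib.parse import quote


def postgres_dsn(env: Mapping[str, str] = os.environ) -> str:
    """Assemble a PostgreSQL connection URI from environment variables."""
    host = env.get("DB_HOST", "10.128.0.2")  # hypothetical internal IP
    port = env.get("DB_PORT", "5432")
    # Percent-encode the parts that may contain reserved characters.
    user = quote(env.get("DB_USER", "app"), safe="")
    password = quote(env.get("DB_PASSWORD", ""), safe="")
    name = quote(env.get("DB_NAME", "appdb"), safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    print(postgres_dsn())
```

The resulting string is in the standard `postgresql://` URI form, so it can be handed to whichever PostgreSQL client library the application uses. Keeping the host out of the image means that moving the database is a configuration change rather than a rebuild.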

GoogleApps Datastore Cons and Pros

I've been reading more about Google AppEngine and learned Python in the past couple of weeks, including working with MongoDB. What I need the most is a scalable database solution. Before discovering Google AppEngine, the only three DB solutions I found useful were DynamoDB, MongoDB and BigCouch.
I found out that I really like the Python language, and as someone coming from ASP.NET development, I've decided to switch and develop my app using Python. My first choice was to develop my application using Python + bottle + MongoDB. The problem is that DynamoDB is very expensive, and the lack of easy-to-use backup/restore options made me pass on Amazon's offering.
Google AppEngine datastore is much more affordable. However, I still can't find information regarding some specific questions on Google's website.
Here are some of the questions I need answers to:
Does Google Datastore support backup/restore within the administration console?
If I want to backup/restore 50TB of data, how much time it takes to backup/restore the data? Where it is stored? what are the costs?
How much time it takes to backup 1TB of data for example?
Does DataStore support caching in the database layer
Any cons that I should be aware of?
Those are some of the questions that I need answers to. MongoDB is an excellent product and developing a web app using Mongo + Python + bottle is fun fun fun. However, I prefer a fully hosted DB solution like the one offered by Google. But before I do that, I need to be sure that I'm not missing anything.
Here are some of the questions I need answer to:
Does Google Datastore support backup/restore within the administration console?
Yes. You can back up and restore data from within the Administration Console by enabling datastore_admin for an application (Thanks to Idan Shechter for pointing this out!) More info can be found here: https://developers.google.com/appengine/docs/adminconsole/datastoreadmin
You can also download the data through the command line. See: https://developers.google.com/appengine/docs/python/tools/uploadingdata
If I want to backup/restore 50TB of data, how much time it takes to backup/restore the data?
It depends on where you back the data up to. Backing up to the Blobstore or Google Cloud Storage will probably take much less time than backing up to your local machine. Transferring 50 TB to your local machine will take a long time, and how long depends on many factors including network speed.
Where it is stored?
If you use the Datastore Administration, you can backup to the Blobstore or to Google Cloud Storage. If you use the command line tools, it will be stored where you choose to download the data to.
what are the costs?
The Blobstore costs $0.13/GB/Month and gives you 5GB free. Google Cloud Storage is $0.12 per GB/Month up to the first TB. You can see more pricing info for Cloud Storage here:
https://developers.google.com/storage/docs/pricingandterms
Bandwidth costs are $0.12 per GB (The first GB is free). More details on pricing can be seen here:
https://cloud.google.com/pricing/
How much time it takes to backup 1TB of data for example?
Again, it depends on where you back up to and your transfer speeds.
Does DataStore support caching in the database layer? Any cons that I should be aware of?
No, it does not support database layer caching.
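Since the answers on time and cost all come down to "it depends on size, price and transfer speed", a back-of-the-envelope calculation may help. The sketch below uses only the prices quoted in this answer ($0.13/GB/month for Blobstore with 5 GB free, $0.12/GB for bandwidth with the first GB free), which are dated and should be checked against current pricing. The link speed is an assumed input, decimal units (1 TB = 1000 GB) are an assumption, and the estimate ignores protocol overhead and any Datastore read costs.

```python
GB_PER_TB = 1000  # decimal units, an assumption for this estimate


def transfer_hours(size_gb: float, link_mbit_per_s: float) -> float:
    """Ideal time to move size_gb over a link, ignoring overhead."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8000 megabits
    return megabits / link_mbit_per_s / 3600


def bandwidth_cost(size_gb: float, price_per_gb: float = 0.12, free_gb: float = 1) -> float:
    """Outgoing bandwidth cost in dollars at the price quoted above."""
    return max(size_gb - free_gb, 0) * price_per_gb


def blobstore_monthly_cost(size_gb: float, price_per_gb: float = 0.13, free_gb: float = 5) -> float:
    """Monthly Blobstore storage cost in dollars at the price quoted above."""
    return max(size_gb - free_gb, 0) * price_per_gb


if __name__ == "__main__":
    link = 100  # assumed sustained download speed in Mbit/s
    for tb in (1, 50):
        size_gb = tb * GB_PER_TB
        days = transfer_hours(size_gb, link) / 24
        print(f"{tb} TB: {days:.1f} days to download at {link} Mbit/s, "
              f"${bandwidth_cost(size_gb):,.2f} bandwidth, "
              f"${blobstore_monthly_cost(size_gb):,.2f}/month in Blobstore")
```

Even under these idealized assumptions, pulling tens of terabytes down to a local machine is measured in weeks, which is why backing up to Blobstore or Cloud Storage is the more practical target.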

Back up AppEngine database (Google cloud storage?)

I have an AppEngine application that currently has about 15GB of data, and it seems to me that it is impractical to use the current AppEngine bulk loader tools to back up datasets of this size. Therefore, I am starting to investigate other ways of backing up, and would be interested in hearing about practical solutions that people may have used for backing up their AppEngine Data.
As an aside, I am starting to think that the Google Cloud Storage might be a good choice. I am curious to know if anyone has experience using the Google Cloud Storage as a backup for their AppEngine data, and what their experience has been, and if there are any pointers or things that I should be aware of before going down this path.
No matter which solution I end up with, I would like a backup solution to meet the following requirements:
1) Reasonably fast to backup, and reasonably fast to restore (ie. if a serious error/data deletion/malicious attack hits my website, I don't want to have to bring it down for multiple days while restoring the database - by fast I mean hours, as opposed to days).
2) A separate location and account from my AppEngine data - ie. I don't want someone with admin access to my AppEngine data to necessarily have write/delete access to the backup data location - for example if my AppEngine account is compromised by a hacker, or if a disgruntled employee were to decide to delete all my data, I would like to have backups that are separate from the AppEngine administrator accounts.
To summarize, given that getting the data out of the cloud seems slow/painful, what I would like is a cloud-based backup solution that emulates the role that tape backups would have served in the past - if I were to have a backup tape, nobody else could modify the contents of that tape - but since I can't get a tape, can I store a secure copy of my data somewhere, that only I have access to?
Kind Regards
Alexander
There are a few options here, though none are (currently) quite what you're looking for.
With the latest release of version 1.5.5 of the SDK, we now support interfacing with Google Storage directly - you can see how, here. With this you can write data to Google Storage, but to the best of my knowledge there's no way to write a file that the app will then be unable to delete.
To actually gather the data, you could use the App Engine mapreduce API. It has built in support for writing to the App Engine blobstore; writing to Google Storage would require you to implement your own output writer, currently.
Another option, as WoLpH suggests, is to use the Datastore Admin tool to back up data to another app. With a little extra effort you could modify the remote_api stub to prohibit deletes to the target (backup) app.
One thing you should definitely do regardless is to enable two-factor authentication for your Google account; this makes it a lot harder for anyone to get control of your account, even if they discover your password.
The bulkloader is probably one of the fastest ways to backup/restore your data.
The problem with AppEngine is that you have to do everything through views, so you have the restrictions that views have. The result is that a fast backup/restore still has to use the same APIs as the rest of your app. So the bulkloader (possibly with a few modifications) is definitely your best option here.
Perhaps though... (haven't tried it yet), you can use the new Datastore Admin to copy the data to another app. One which only you control. That way you can copy it back from the other app when needed.

When should one use the following: Amazon EC2, Google App Engine, Microsoft Azure and Salesforce.com?

I am asking this in a very general sense, from both the cloud provider's and the cloud consumer's perspective. Also, the question is not about any specific kind of application (in fact the intention is to know which types of applications/domains fit into which cloud layer: SaaS, PaaS or IaaS).
My understanding so far is:
IaaS: Raw Hardware (Processors, Networks, Storage).
PaaS: OS, System Software, Development Framework, Virtual Machines.
SaaS: Software Applications.
It would be great if Stack Overflow users could share their understanding and experience of cloud computing concepts.
EDIT: Ok, I will put it in more specific way -
Amazon EC2: You don't have control over the hardware layer. But you can take your choice of OS image, Dev Framework (.NET, J2EE, LAMP) and Application and put it on EC2 hardware. Can you deploy an application built with Google App Engine or Azure on EC2?
Google App Engine: You don't have control over hardware and OS and you get a specific Dev Framework to build your application. Can you take any existing Java or Python application and port it to GAE? Or vice versa, can applications that were built on GAE be taken out of GAE and ported to any Application Server like Websphere or Weblogic?
Azure: You don't have control over hardware and OS and you get a specific Dev Framework to build your application. Can you take any existing .NET application and port it to Azure? Or vice versa, can applications that were built on Azure be taken out of Azure and ported to any Application Server like Biztalk?
Good question! As you point out, the different offerings fit into different categories:
EC2 is Infrastructure as a Service; you get VM instances, and do with them as you wish. Rackspace Cloud Servers are more or less the same.
Azure, App Engine, and Salesforce are all Platform as a Service; they offer different levels of integration, though: Azure pretty much lets you run arbitrary background services, while App Engine is oriented around short lived request handler tasks (though it also supports a task queue and scheduled tasks). I'm not terribly familiar with Salesforce's offering, but my understanding is that it's similar to App Engine in some respects, though more specialized for its particular niche.
Cloud offerings that fall under Software as a Service are everything from infrastructure pieces like Amazon's Simple Storage Service and SimpleDB through to complete applications like Fog Creek's hosted FogBugz and, of course, StackExchange.
A good general rule is that the higher level the offering, the less work you'll have to do, but the more specific it is. If you want a bug tracker, using FogBugz is obviously going to be the least work; building one on top of App Engine or Azure is more work, but provides for more versatility, while building one on top of raw VMs like EC2 is even more work (quite a lot more, in fact), but provides for even more versatility. My general advice is to pick the highest level platform that still meets your requirements, and build from there.
This is an excellent question. Full disclosure: I am partial to Azure but have experience with the others.
Where I think Azure stands out from the others is the quick transition from on prem to the cloud. For example -
SQL Azure - change connection string, upload DB, go!
Queues work a lot like MSMQ.
Blobs are pretty much blobs any way you shake them but they scale like crazy.
The table storage component is good because it provides incredible scalability for name/value pairs - but takes some getting used to.
Service Bus is my favorite of the services because it allows for a variety of communications paradigms. Two SB endpoints first try to connect to each other, if they cannot, then they route through the cloud - makes for very secure and scalable processing when firewalls tend to get in the way.
Access control list - paired typically with the service bus to make sure the right people access the right things - think SAML in the cloud.
I hope that helps!
My cloud experience is currently limited to Salesforce.com
For standard business operations and automation it provides a significant number of features that allow us to get apps up and running very quickly. We are particularly benefitting from the following:
Security (Administrators can control access to objects and fields)
Workflow & Approvals
Automatic UI generation
Built in reporting and dashboards
Entire system (including our custom changes) is accessible via web services
Ability to make the data in the system available through public sites (e.g. eCommerce)
Large library of third party apps to solve standard problems
The platform does NOT solve every problem.
I would not use the platform to model a nuclear power station or build the next twitter.
The major points of cloud computing are to save on costs by paying for usage and to enable immediate deployment of computing resources.
The costs are not purely x amount of cents per instance per hour. The costs include maintenance, development, administration, etc. The huge benefit of cloud, in my mind, is to liberate customers from having to manage anything that is not within the realm of their core business competency. If I am an insurance business, I want my developers to concentrate on my insurance problems, the ones that help solve the needs of my claims, rates, etc. I would rather avoid dealing with the problems of email servers, file servers, document repositories, and administering OS patches, service packs, etc.
Thus, in my opinion, the biggest benefits are derived from the SaaS and PaaS cloud offerings. One should go to IaaS only when PaaS or SaaS seriously restrict specific needs (e.g. I need to install a set of proprietary COM components and Azure does not support them).
SaaS is good for commodity-type applications that are not the core line of business for the client, but are more of a utility. These are your typical Messaging systems, Portals, Document Repositories, Email systems, CRMs, ERPs, Accounting, etc. Why reinvent the wheel by writing your own when you can customize a well-supported third-party product?
PaaS is great for core line-of-business software that supports a company's main business offering. It frees clients from having to deal with OS management and lets them concentrate on developing the business system, something that no one else can do for the client.
One can also take advantage of the benefits of PaaS (let's say, Google App Engine) and extend it, at times and if necessary, by pulling out some virtual machines from IaaS providers (e.g. Amazon) to do some number crunching then just send back the output to Google App Engine.
This way, you get the best of both worlds -- you can rapidly develop scalable apps in GAE, then you can always augment it by running any program you want from Amazon virtual machines.
This keeps changing: Windows Azure now also supports VMs, so it is an IaaS provider as well.
Now how about free Amazon EC2 for a year to do a better comparison? Check this out.
http://www.buzzingup.com/2010/10/amazon-announces-free-cloud-services-for-new-developers/

Resources