Back up AppEngine database (Google cloud storage?)

I have an AppEngine application that currently has about 15GB of data, and it seems to me that it is impractical to use the current AppEngine bulk loader tools to back up datasets of this size. Therefore, I am starting to investigate other ways of backing up, and would be interested in hearing about practical solutions that people may have used for backing up their AppEngine Data.
As an aside, I am starting to think that the Google Cloud Storage might be a good choice. I am curious to know if anyone has experience using the Google Cloud Storage as a backup for their AppEngine data, and what their experience has been, and if there are any pointers or things that I should be aware of before going down this path.
No matter which solution I end up with, I would like a backup solution to meet the following requirements:
1) Reasonably fast to back up, and reasonably fast to restore (i.e. if a serious error/data deletion/malicious attack hits my website, I don't want to have to bring it down for multiple days while restoring the database - by fast I mean hours, as opposed to days).
2) A separate location and account from my AppEngine data - i.e. I don't want someone with admin access to my AppEngine data to necessarily have write/delete access to the backup data location - for example if my AppEngine account is compromised by a hacker, or if a disgruntled employee were to decide to delete all my data, I would like to have backups that are separate from the AppEngine administrator accounts.
To summarize, given that getting the data out of the cloud seems slow/painful, what I would like is a cloud-based backup solution that emulates the role that tape backups would have served in the past - if I were to have a backup tape, nobody else could modify the contents of that tape - but since I can't get a tape, can I store a secure copy of my data somewhere, that only I have access to?
Kind Regards
Alexander

There are a few options here, though none are (currently) quite what you're looking for.
With the latest release of version 1.5.5 of the SDK, we now support interfacing with Google Storage directly - you can see how, here. With this you can write data to Google Storage, but to the best of my knowledge there's no way to write a file that the app will then be unable to delete.
To actually gather the data, you could use the App Engine mapreduce API. It has built in support for writing to the App Engine blobstore; writing to Google Storage would require you to implement your own output writer, currently.
Another option, as WoLpH suggests, is to use the Datastore Admin tool to back up data to another app. With a little extra effort you could modify the remote_api stub to prohibit deletes to the target (backup) app.
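The idea of prohibiting deletes on the backup target can be illustrated with a generic wrapper. This is only a minimal sketch and not the real remote_api stub: `NoDeleteProxy`, `InMemoryStore`, `DeleteProhibitedError` and the `put`/`get`/`delete` method names are hypothetical stand-ins chosen for illustration. It also only guards callers that go through the wrapper; it does nothing against someone with direct admin access to the target app.

```python
class DeleteProhibitedError(Exception):
    """Raised when a caller tries a blocked operation on the backup target."""


class InMemoryStore:
    """Hypothetical stand-in for a datastore client; not an App Engine API."""

    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        self._rows.pop(key, None)


class NoDeleteProxy:
    """Forwards every call to the wrapped store except the blocked ones."""

    def __init__(self, store, blocked=("delete",)):
        self._store = store
        self._blocked = frozenset(blocked)

    def __getattr__(self, name):
        # __getattr__ only runs for names not found on the proxy itself,
        # so _store and _blocked are resolved through normal lookup.
        if name in self._blocked:
            raise DeleteProhibitedError(
                "%s is disabled on the backup target" % name)
        return getattr(self._store, name)


# Writes and reads pass through; deletes are refused.
backup = NoDeleteProxy(InMemoryStore())
backup.put("user:1", {"name": "example"})
```

Calling `backup.delete("user:1")` raises `DeleteProhibitedError`, while `backup.get("user:1")` still returns the stored value.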
One thing you should definitely do regardless is to enable two-factor authentication for your Google account; this makes it a lot harder for anyone to get control of your account, even if they discover your password.

The bulkloader is probably one of the fastest ways to backup/restore your data.
The problem with the AppEngine is that you have to do everything through views. So you have the restrictions that views have... the result is that a fast backup/restore still has to use the same APIs as the rest of your app. So the bulkloader (possibly with a few modifications) is definitely your best option here.
Perhaps though... (haven't tried it yet), you can use the new Datastore Admin to copy the data to another app. One which only you control. That way you can copy it back from the other app when needed.

Related

Kubernetes and Cloud Databases

Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook Presto which AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you.
Now, let's say you want to or need to use Apache Drill or Apache Impala. Amazon doesn't offer it. Nor do any of the other big public cloud providers at time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (why would you, as said already):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted/self-managed one.
But there is more to it than just convenience/control, which is what this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how you handle storage (persistent volumes) you can have two additional considerations:
Access speed (depending on your use case, this can be negligible or a showstopper).
The storage you have at hand might not be optimized for relational-database I/O (or might limit how efficiently you can schedule pods). These are the same reasons you are advised not to run a database on NFS, for example.
Several recent conference talks on Kubernetes point out that databases are a big no-no for Kubernetes (although this is highly opinionated; we do run average-load MySQL and PostgreSQL databases in k8s). Heavy load or fast I/O is somewhat challenging to get right on k8s, whereas in a managed cloud solution somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, controls and capabilities.

Create App Engine project via API

I would need to automate the creation of new App Engine projects. Is this possible? I see there is a Google Cloud SQL Admin API which can create new Cloud SQL instances, but what about App Engine? Is there anything similar?
Update:
We have developed an application that runs on GAE and uses Cloud SQL and plenty of API integration with most of Google Apps. We foresee dozens, if not hundreds, of customers in a near future. All of them will be using their own Google domain and Google Apps.
While we could actually just deploy the application in our App Engine and modify the Cloud SQL tables to include the id of the customer who owns the record, we thought it would be better if we deploy an app instance and Cloud SQL for every one of them (on our own account). The main reasons coming to mind are that we can track how much every customer spends in terms of billing, and speed up the database since Cloud SQL is just a MySQL instance.
Steps for the creation would require editing a properties file in the packaged .war file, adding the certificate used to log in as a service account, and probably something that I am missing at this moment :-P
This question is somewhat related: Create an App Engine Project ID command line
As far as I know this is not possible (and is unlikely to be possible anytime soon).
Update:
I can see why splitting into separate projects for billing purposes would be really nice (multi-tenancy is great, but getting one bill per customer from Google sounds easier), but unfortunately I don't think that it's going to be your best option.
For AppEngine, you may want to look into the multi-tenancy features (or in Python) and how to get stats for billing.
Keep in mind however, CloudSQL is not simply a MySQL instance. It happens to speak MySQL but is not the same as running MySQL on Compute Engine for example. I would recommend that you run some benchmarks to be sure that the "adding the customer ID to the table" idea you had won't work.
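As a rough illustration of the "add the customer ID to the table" approach, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Cloud SQL. The `invoices` table, its columns, and the helper names are hypothetical, and SQLite will not behave like Cloud SQL under load, so this only shows the shape of the schema and queries, not their performance.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for Cloud SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " customer_id TEXT NOT NULL,"  # which tenant owns the row
    " amount_cents INTEGER NOT NULL)"
)
# An index on the tenant column keeps per-customer lookups from scanning
# every customer's rows.
conn.execute("CREATE INDEX idx_invoices_customer ON invoices (customer_id)")


def add_invoice(conn, customer_id, amount_cents):
    """Insert a row tagged with the owning customer."""
    conn.execute(
        "INSERT INTO invoices (customer_id, amount_cents) VALUES (?, ?)",
        (customer_id, amount_cents),
    )


def invoices_for(conn, customer_id):
    """Return one customer's amounts; every query filters on customer_id."""
    rows = conn.execute(
        "SELECT amount_cents FROM invoices WHERE customer_id = ? ORDER BY id",
        (customer_id,),
    )
    return [row[0] for row in rows]


add_invoice(conn, "customer-a", 1200)
add_invoice(conn, "customer-b", 500)
add_invoice(conn, "customer-a", 300)
```

Here `invoices_for(conn, "customer-a")` sees only that customer's two rows. A benchmark of the real setup would run the same kind of tenant-filtered queries against a Cloud SQL instance loaded with realistic data volumes.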
Lastly, a possibly relevant read: http://signalvnoise.com/posts/1509-mr-moore-gets-to-punt-on-sharding
I guess the conclusion is that there’s no use in preempting the technological progress of tomorrow. Machines will get faster and cheaper all the time, but you’ll still only have the same limited programming resources that you had yesterday.
If you can spend them on adding stuff that users care about instead of prematurely optimizing for the future, you stand a better chance of being in business when that tomorrow finally rolls around.

GoogleApps Datastore Cons and Pros

I've been reading more about Google AppEngine and learned Python in the past couple of weeks, including working with MongoDB. What I need the most is a scalable database solution. Before discovering Google AppEngine, the only three DB solutions I found useful were DynamoDB, MongoDB and BigCouch.
I found out that I really like the Python language, and as someone coming from ASP.NET development, I've decided to switch and develop my app using Python. My first choice was to develop my application using Python + bottle + MongoDB. The problem is that DynamoDB is very expensive, and the lack of easy-to-use backup/restore options made me pass on Amazon's offering.
Google AppEngine datastore is much more affordable. However, I still can't find information regarding some specific questions on Google's website.
Here are some of the questions I need answer to:
Does Google Datastore support backup/restore within the administration console?
If I want to backup/restore 50TB of data, how much time it takes to backup/restore the data? Where it is stored? what are the costs?
How much time it takes to backup 1TB of data for example?
Does DataStore support caching in the database layer
Any cons that I should be aware of?
Those are some of the questions that I need to get answers to. MongoDB is an excellent product and developing a web app using Mongo + Python + bottle is fun fun fun. However, I prefer a fully hosted DB solution like the one offered by Google. But before I do that, I need to be sure that I'm not missing anything.
Here are some of the questions I need answer to:
Does Google Datastore support backup/restore within the administration console?
Yes. You can back up and restore data from within the Administration Console by enabling datastore_admin for an application. (Thanks to Idan Shechter for pointing this out!) More info can be found here: https://developers.google.com/appengine/docs/adminconsole/datastoreadmin
You can also download the data through the command line. See: https://developers.google.com/appengine/docs/python/tools/uploadingdata
If I want to backup/restore 50TB of data, how much time it takes to backup/restore the data?
It depends on where you back the data up to. Backing up to the Blobstore or Google Cloud Storage will probably take much less time than backing up to your local machine. Transferring 50TB to your local machine will take a long time and depend on many factors, including network speed.
Where it is stored?
If you use the Datastore Administration, you can backup to the Blobstore or to Google Cloud Storage. If you use the command line tools, it will be stored where you choose to download the data to.
what are the costs?
The Blobstore costs $0.13/GB/Month and gives you 5GB free. Google Cloud Storage is $0.12 per GB/Month up to the first TB. You can see more pricing info for Cloud Storage here:
https://developers.google.com/storage/docs/pricingandterms
Bandwidth costs are $0.12 per GB (The first GB is free). More details on pricing can be seen here:
https://cloud.google.com/pricing/
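The prices quoted above can be turned into a quick estimate. This is a minimal sketch that uses only the rates stated in this answer (which may be out of date); the function names, the 500 GB example size, and treating 1 TB as 1024 GB are assumptions made for illustration. The answer does not give the Cloud Storage rate beyond the first TB, so the sketch refuses to guess it.

```python
BLOBSTORE_PER_GB_MONTH = 0.13   # from the answer above
BLOBSTORE_FREE_GB = 5           # from the answer above
GCS_PER_GB_MONTH = 0.12         # from the answer above, first TB only
GCS_FIRST_TIER_GB = 1024        # assumption: 1 TB counted as 1024 GB
BANDWIDTH_PER_GB = 0.12         # from the answer above
BANDWIDTH_FREE_GB = 1           # from the answer above


def blobstore_monthly_cost(size_gb):
    """Monthly Blobstore cost in dollars after the free quota."""
    return max(size_gb - BLOBSTORE_FREE_GB, 0) * BLOBSTORE_PER_GB_MONTH


def gcs_monthly_cost(size_gb):
    """Monthly Cloud Storage cost in dollars, first pricing tier only."""
    if size_gb > GCS_FIRST_TIER_GB:
        raise ValueError("rate beyond the first TB is not given in the answer")
    return size_gb * GCS_PER_GB_MONTH


def download_cost(size_gb):
    """One-off bandwidth cost in dollars for downloading the data once."""
    return max(size_gb - BANDWIDTH_FREE_GB, 0) * BANDWIDTH_PER_GB


# Hypothetical 500 GB dataset.
for label, cost in [
    ("Blobstore per month", blobstore_monthly_cost(500)),
    ("Cloud Storage per month", gcs_monthly_cost(500)),
    ("One download", download_cost(500)),
]:
    print("%s: $%.2f" % (label, cost))
```

At these rates a 500 GB dataset comes to roughly $64 per month in the Blobstore and $60 per month in Cloud Storage, plus about $60 each time it is downloaded in full.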
How much time it takes to backup 1TB of data for example?
Again, it depends on where you back up to and your transfer speeds.
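To get a feel for how much transfer speed matters when downloading to a local machine, here is a back-of-the-envelope sketch. The function name and the link speeds are assumptions for illustration; it uses decimal units (1 TB = 10^12 bytes) and assumes the link is fully saturated with no protocol or API overhead, so real transfers will be slower.

```python
def transfer_hours(size_tb, megabits_per_second):
    """Idealized hours needed to move size_tb terabytes over a link."""
    size_bits = size_tb * 10**12 * 8                      # decimal terabytes
    seconds = size_bits / float(megabits_per_second * 10**6)
    return seconds / 3600.0


# Hypothetical link speeds: 100 Mbit/s and 1 Gbit/s.
for size_tb in (1, 50):
    for mbps in (100, 1000):
        print("%d TB at %d Mbit/s: %.1f hours"
              % (size_tb, mbps, transfer_hours(size_tb, mbps)))
```

Under these assumptions 1 TB takes about 22 hours at 100 Mbit/s, and 50 TB takes about 1,111 hours (roughly 46 days) at the same speed, or about 111 hours at 1 Gbit/s.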
Does DataStore support caching in the database layer? Any cons that I should be aware of?
No, it does not support database layer caching.

Is copying entities from one app to another to backup your data recommended?

I'm referring to the datastore admin tool explained here: http://code.google.com/appengine/docs/adminconsole/datastoreadmin.html
One way of backing up locally is by using bulkloader.py, but I'm liking this solution better as your data stays in Google's cloud and can be easily transferred from one app to the other using a button in the admin console on an entity by entity basis. I'm thinking of having two apps, one that I can manually back up to every week, and another that actually serves users. The backup app might incur some storage costs but overall costs would be minimal as no front/backend instances would be used except as needed for the backups...
Backing up one GAE app's data to another app is certainly not recommended by me. To me, there are a handful of reasons to backup:
To safeguard against a catastrophic outage on Google's part.
To safeguard against a programming error on your part that results in significant data loss.
To safeguard against your Google account being revoked.
Backing up to another GAE app does relatively little on each of these.
If you use the High Replication datastore, you're already distributing your data all across Google's cloud, so doing that again just seems redundant -- but not in the high availability sense of the word. The only way Google is going to lose your data is through some catastrophic disaster, in which case both of your apps may be in peril.
If you backup your data locally, you can store historic snapshots. Whereas if you're simply backing up to another app, you aren't storing historic data, so you have little protection against programmer error unless you catch it in between the time the error happens and when you're going to do your next backup.
In the event that Google, for whatever reason, kills your account, you lose both apps and all your data.
Ultimately, by backing up to another GAE, you still have all your eggs in one basket. You've just partitioned your basket. If your data is important enough to be backed up outside of your app, it's important enough to backup locally or to another provider entirely. That's my opinion anyway.
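The point about historic snapshots can be sketched in a few lines. This is a minimal illustration with made-up dates and a hypothetical helper name: given the dates of the snapshots you have kept and the date a bug started corrupting data, it picks the newest snapshot taken before the bug.

```python
from datetime import date


def latest_snapshot_before(snapshot_dates, error_date):
    """Return the newest snapshot taken strictly before error_date, or None."""
    candidates = [d for d in snapshot_dates if d < error_date]
    return max(candidates) if candidates else None


# Hypothetical weekly snapshots kept locally.
weekly_history = [date(2012, 1, 1), date(2012, 1, 8), date(2012, 1, 15)]
# A single overwritten copy, as when mirroring to one backup app.
single_copy = [date(2012, 1, 15)]

# A bug introduced on this date went unnoticed until after the next backup.
bug_introduced = date(2012, 1, 10)

clean_from_history = latest_snapshot_before(weekly_history, bug_introduced)
clean_from_single = latest_snapshot_before(single_copy, bug_introduced)
```

With the snapshot history, the 8 January copy is still available and predates the bug. With a single overwritten copy, the only backup already contains the bad data, so the helper returns `None`.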
We backup our data periodically to another application to use as our development environment, but as others pointed out that's not really protecting your data against a major appengine catastrophe (as unlikely as this is...).
The best solution I've found for archiving data for disaster recovery is to pull it down using the python scripts google provides either onto EC2 or local disk.

How difficult is it to migrate away from Google App Engine?

I am thinking of making an (initially) small Web Application, which would eventually have a potential to grow. All things considered Google App Engine seems like a very attractive option. Say, user base and complexity grows and for one or other reason I needed to leave GAE behind. How difficult would it be to migrate away?
1) Does GAE provide a way to export the database? What format would it be? Would it be difficult to put it under MySQL (or similar)?
2) In which areas (ex. database access, others?) would I have to use GAE API? I.e. which parts of implementation would have to be abstracted away / interfaced?
Edit: 3) Alternatively, is it even worth to abstract away GAE API?
For question #1: I don't know if GAE specifically supports exports of a database but you can always roll your own, worst case scenario. If you are in a position where you need to, you'll probably have the resources to do it, too.
For question #2: You can and should always encapsulate those kinds of outside dependencies anyway. It doesn't matter whether or not they provide interfaces. Coupling to those interfaces should be kept to an absolute minimum.
For question #3: This question is not really super-clear so I cannot answer it.
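The encapsulation suggested for question #2 could look something like the following. This is a minimal sketch under stated assumptions: `EntityStore`, `InMemoryEntityStore` and `register_user` are hypothetical names, and no GAE API is used. A real setup would add one implementation that wraps the GAE datastore and another for whatever you migrate to, while application code depends only on the interface.

```python
from abc import ABC, abstractmethod


class EntityStore(ABC):
    """The only storage interface the application code is allowed to use."""

    @abstractmethod
    def save(self, kind, key, entity):
        """Store a dict-like entity under (kind, key)."""

    @abstractmethod
    def load(self, kind, key):
        """Return the entity stored under (kind, key), or None."""


class InMemoryEntityStore(EntityStore):
    """Stand-in backend; a GAE-backed or SQL-backed class would replace it."""

    def __init__(self):
        self._data = {}

    def save(self, kind, key, entity):
        # Copy so later changes to the caller's dict do not leak in.
        self._data[(kind, key)] = dict(entity)

    def load(self, kind, key):
        entity = self._data.get((kind, key))
        return dict(entity) if entity is not None else None


def register_user(store, username, email):
    """Application logic: knows about EntityStore, not about any backend."""
    store.save("User", username, {"email": email})
    return store.load("User", username)


store = InMemoryEntityStore()
user = register_user(store, "alice", "alice@example.com")
```

Migrating away then means writing one new `EntityStore` implementation and moving the data, rather than touching every place the application reads or writes entities.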
I'm speaking strictly from a java webapp point of view...
Google App Engine for python has a backup/restore utility:
http://code.google.com/appengine/articles/gae_backup_and_restore.html
There is huge interest in porting this to the Java flavor.
You can use the higher-level standard database APIs (JDO/JPA) to allow you to move your app away from Google's database services. I suggest purchasing the DataNucleus tools in order to smooth the transition from Bigtable to something like MySQL or Oracle.
The packaged services GAE provides are enumerated at
http://code.google.com/appengine/docs/java/javadoc/
The stock JRE should handle porting of the urlfetch, mail, and memcache api packages.
You'll have to find a substitute technology for the users, blobstore, xmpp, and taskqueue packages.

Resources