We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data that we're storing in the GAE datastore to a system running in Amazon's cloud.
Ideally we could do this without having to transfer the entire data store on every sync. The replication does not need to be in anything close to real time, so something like a once or twice a day sync would work just fine.
Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:
Use the Google provided bulkloader.py to export the data to CSV and somehow transfer the CSV to Amazon and process there
Create a Java app that runs on GAE, reads the data from the data store and sends the data to another Java app running on Amazon.
Do those options work? What would be the gotchas with those? What other options are there?
You could use logic similar to what the App Engine HRD migration and backup tools do:
1. Mark modified entities with a child marker entity.
2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those marker entities with a Datastore Input Reader.
3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker.
4. Ping the remote host to import those entities from the Google Storage URL.
As an alternative to steps 3 and 4, you could make multiple urlfetch (POST) calls to send each serialized entity to the remote host directly, but that is more fragile, as a single failure could compromise the integrity of your data import.
You could look at the datastore admin source code for inspiration.
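To illustrate the marker idea independently of the mapreduce library, here is a minimal sketch in plain Python. It is not the App Engine API: the in-memory `datastore` dict, the `dirty_markers` set, and the `write_batch` callback are hypothetical stand-ins for the datastore, the child marker entities, and the Google Storage writer. The point it shows is that markers are removed only after the batch has been written successfully, so a failed run is simply retried next time.

```python
import json

# Hypothetical in-memory stand-ins for the datastore and its change markers.
datastore = {}         # key -> entity dict
dirty_markers = set()  # keys modified since the last successful export


def put(key, entity):
    """Save an entity and mark it as modified."""
    datastore[key] = entity
    dirty_markers.add(key)


def export_modified(write_batch):
    """Serialize every marked entity and hand the batch to write_batch.

    Markers are cleared only after write_batch returns without raising,
    so a failed export is retried in full on the next run.
    Returns the number of entities exported.
    """
    keys = sorted(dirty_markers)
    lines = [json.dumps({"key": k, "entity": datastore[k]}, sort_keys=True)
             for k in keys]
    if lines:
        write_batch("\n".join(lines))
    dirty_markers.difference_update(keys)
    return len(keys)
```

Deletions are not handled in this sketch; a real sync would also need to record removed keys so the remote side can drop them.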
My application currently runs on App Engine. It writes records (for logging and reporting) continuously.
Scenario: view counts for the website. Each time the website is opened, it hits the server to add a record with the time and type of view. These counts are shown in the user's dashboard.
The volume of these requests is now large, currently about 40/sec. Datastore writes are getting heavy and the cost keeps increasing.
Is there any way to reduce this, or is there another database better suited to logging the views?
Google App Engine's Datastore is NOT suitable for such a requirement where you have to continuously write to datastore and read less often.
You need to offload this task to a third-party service (either write one yourself or use an existing one).
A better option for user tracking and analytics is Google Analytics (although you won't be able to show hit counters on the website directly using Analytics).
If you want to show your users a page hit count, use a page hit counter: https://www.google.com/search?q=hit+counter
In this case you should avoid Datastore.
For this kind of analytics it's best to do the following:
Dump the data to the GAE log (yes, this sounds counter-intuitive, but it's actually advice from Google engineers). The GAE log is persistent and is guaranteed not to lose data you write to it.
Periodically parse the log for your data and then export it to BigQuery.
BigQuery has a quite powerful query language so it's capable of doing complex analytics reports.
Luckily this was already done before: see the Mache framework. Also see related video.
Note: there is now a new BigQuery feature called streaming inserts, which could potentially replace the cumbersome middle step (files on Cloud Storage) used in Mache.
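As a rough sketch of the log-then-parse idea (not Mache's actual implementation), the code below writes each view as a single tagged JSON log line and later recovers the events from raw log text. The `VIEW_EVENT ` prefix, the field names, and the function names are assumptions made for this example; the export to BigQuery is left out, with a simple per-page count standing in for it.

```python
import json
from collections import Counter

PREFIX = "VIEW_EVENT "  # hypothetical marker used to find our lines in the log


def format_view_event(page, view_type, timestamp):
    """Build the log line to emit instead of doing a datastore write."""
    payload = {"page": page, "type": view_type, "ts": timestamp}
    return PREFIX + json.dumps(payload, sort_keys=True)


def parse_view_events(log_lines):
    """Yield the event dicts found in an iterable of raw log lines."""
    for line in log_lines:
        start = line.find(PREFIX)
        if start == -1:
            continue  # unrelated log output
        try:
            yield json.loads(line[start + len(PREFIX):])
        except ValueError:
            continue  # truncated or corrupted line; skip it


def count_views(events):
    """Aggregate view counts per page from parsed events."""
    return Counter(event["page"] for event in events)
```

In the request handler you would pass the result of `format_view_event(...)` to your normal logging call; the periodic job then feeds the fetched log lines through `parse_view_events`.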
I have two applications connected through a REST API (JSON messages). One is written in Django and the other in PHP. I have an exact database replica on both sides (using MySQL).
My question is, how can I keep the databases of these two applications synchronized?
In other words, when I press "submit" on one of them, I want that data to be saved in the current app's database and, via REST, in the remote database of the other app.
Is there a Django app that does that? I read about django-synchro but didn't see anything REST-related.
I would also like to keep things asynchronous: the user must be able to keep using the app while this process runs in the background and keeps the data consistent.
I had a look at Celery and Redis, and it seems like a cron job will do what I need.
I'm new to GAE development and have some questions about extracting data.
I have an app that collects data from end users and stores it in the High Replication datastore, and I need to send a subset of the collected data to business partners on a regular basis.
Here are my questions:
1. How can I back up the data in the datastore on a regular basis, say a daily incremental backup and a weekly full backup?
2. What are the best practices for generating daily data dump files that can be downloaded by, or sent to, my partners securely? I expect data files of a few hundred MB every day, eventually growing into the range of a few GB.
3. Can my business partners be authenticated through basic HTTP auth, or do they have to use OAuth?
Google is in effect backing up your data by storing it in multiple data centres.
You can however use the bulk loader if desired and back it up manually:
Uploading and Downloading Data
You can authenticate users however you choose; it's totally up to you. The "users" service is integrated directly into App Engine, however, so if everybody has or could have a Google account, that's even easier for you to use.
The users service
Due to the size of your files, unless you want to piece them together from the datastore, you'll have to use something else, as the datastore has a 1 MB limit per entity. It's perfectly possible to do that, however.
But you should probably look at the Google Cloud Storage API instead, as it has no such file size limit.
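If you do decide to piece files together from the datastore, the core of it is splitting the data into pieces that fit under the 1 MB limit and joining them again in order. Here is a minimal sketch; the 900 KB chunk size is an assumed safety margin, and storing or fetching each chunk as its own entity is left to your code.

```python
CHUNK_SIZE = 900 * 1024  # assumed margin below the 1 MB limit


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Return a list of byte strings, each at most chunk_size bytes long."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def join_chunks(chunks):
    """Reassemble chunks that were stored (and fetched) in order."""
    return b"".join(chunks)
```

Each chunk would need an index stored alongside it so the pieces can be fetched back in the right order.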
I am looking into using Google App Engine for a project and would like make sure I have a way to export all my data if I ever decide to leave GAE (or GAE shuts down).
Everything I search about exporting data from GAE points to https://developers.google.com/appengine/docs/python/tools/uploadingdata. However, that page contains this note:
Note: This document applies to apps that use the master/slave
datastore. If your app uses the High Replication datastore, it is
possible to copy data from the app, but Google does not currently
support this use case. If you attempt to copy from a High Replication
datastore, you'll see a high_replication_warning error in the Admin
Console, and the downloaded data might not include recently saved
entities.
The problem is that the master/slave datastore was recently deprecated in favor of the High Replication datastore. I understand that the master/slave datastore is still supported for a little while, but I don't feel comfortable using something that has officially been deprecated and is on its way out. That leaves me with the High Replication datastore, and the only way to export the data seems to be the method above, which is not officially supported (and thus gives me no guarantee that I can get my data out).
Is there any other (officially supported) way of exporting data from the High Replication datastore? I don't feel comfortable using Google App Engine if it means my data could be locked in there forever.
It took me quite a long time to set up the download of data from GAE, as the documentation is not as clear as it should be.
If you are extracting data from a Unix server, you may be able to reuse the script below.
Also, if you do not provide the "config_file" parameter, it will extract all your data for this kind, but in a proprietary format that can only be used for restoring data afterwards.
#!/bin/sh
#------------------------------------------------------------------
#-- Param 1 : Namespace
#-- Param 2 : Kind (table id)
#-- Param 3 : Directory in which the csv file should be stored
#-- Param 4 : output file name
#------------------------------------------------------------------
appcfg.py download_data --secure --email="$BACKUP_USERID" \
  --config_file=configClientExtract.yml --filename="$3/$4.csv" --kind="$2" \
  --url="$BACKUP_WEBSITE/remote_api" --namespace="$1" --passin <<-EOF
$BACKUP_PASSWORD
EOF
The App Engine datastore currently supports another option as well. The data backup feature can be used to copy selected data into the Blobstore or Google Cloud Storage. It is available in the Datastore Admin area of the App Engine console. If required, the backed-up data can then be downloaded from the blob viewer or Cloud Storage. When backing up a High Replication datastore, it is recommended that datastore writes be disabled before taking the backup.
You need to configure a builtin called remote_api. This article has all the information and guidance you need to be able to download all your data, today and in the future.
My friend and I are working on a GWT-Google App Engine project, using TortoiseSVN and Google Code to synchronize the code.
We also synchronize the local_db.bin file in the appengine-generated folder, but we can't get it to work. After synchronizing the db file, our local datastore is not updated as we expected.
That is a pain. I'm worried about the future, when our database gets bigger and more complicated.
Can anyone please give me some advice? What should I do to synchronize our local datastore?
I have two suggestions:
1) Use remote api : https://developers.google.com/appengine/articles/remote_api to share a GAE hosted db locally.
2) Maybe you can use Gdrive to sync folders.
This is a really bad idea. Even if you weren't having trouble making both ends read from the same datastore file, the local datastore is in a binary format, and thus you won't both be able to work on the app at the same time, or you'll get merge conflicts you will be unable to resolve.
Instead, both for collaboration purposes and for testing and deployment, you should provide a set of test data you can easily load into the datastore. Store the test data in version control, and load it in using bulkloader or your own code.
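If you go the "your own code" route, the loader can be very small. Below is a minimal sketch that reads test entities from a JSON fixture kept in version control. The fixture layout (`kind`, `id`, `properties`) and the `put` callback are assumptions for this example, not an App Engine API; `put` would wrap whatever datastore write your app uses.

```python
import json


def load_fixtures(fixture_text, put):
    """Load test entities from JSON text and store each one via put().

    fixture_text: a JSON array of {"kind": ..., "id": ..., "properties": {...}}
                  objects (a hypothetical layout chosen for this sketch).
    put:          callable taking (kind, id, properties).
    Returns the number of entities loaded.
    """
    records = json.loads(fixture_text)
    for record in records:
        put(record["kind"], record["id"], record["properties"])
    return len(records)
```

Because the fixture is plain text, both developers can edit it and resolve merge conflicts normally, which is not possible with the binary local_db.bin file.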