I'm new to GAE development and have some questions about extracting data.
I have an app that collects data from end users and stores it in the High Replication Datastore, and I need to send a subset of the collected data to business partners on a regular basis.
Here are my questions:
1. How can I back up data in the datastore on a regular basis, say a daily incremental backup and a weekly full backup?
2. What are the best practices for generating daily data dump files that can be downloaded by or sent to my partners securely? I expect a few hundred MB of data files every day, eventually growing to the few-GB range.
3. Can my business partners be authenticated through basic HTTP auth, or do they have to use OAuth?
Google is in effect backing up your data by storing it in multiple data centres.
You can, however, use the bulk loader to back it up manually if desired:
Uploading and Downloading Data
You can authenticate users however you choose; it's totally up to you. The "users" service is integrated directly into App Engine, however, so if everybody has or could have Google accounts, that is even easier for you to use.
The users service
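If you do go with basic HTTP auth for your partners, the check itself is small. Here is a minimal sketch using only the Python standard library; the `PARTNERS` table and the function name are made up for illustration, and it assumes requests arrive over HTTPS, since Basic credentials are only base64-encoded, not encrypted.

```python
import base64
import hmac

# Hypothetical partner credentials for illustration only.
# In a real app, store salted password hashes rather than plaintext.
PARTNERS = {"partner-a": "s3cret"}


def check_basic_auth(header_value, credentials=PARTNERS):
    """Return the username if the Authorization header is valid, else None."""
    if not header_value or not header_value.startswith("Basic "):
        return None
    try:
        raw = base64.b64decode(header_value[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return None
    username, sep, password = decoded.partition(":")
    if not sep:
        return None
    expected = credentials.get(username)
    if expected is None:
        return None
    # Constant-time comparison to avoid leaking information via timing.
    if hmac.compare_digest(password.encode("utf-8"), expected.encode("utf-8")):
        return username
    return None
```

Your request handler would pass in the value of the `Authorization` header and respond with 401 plus a `WWW-Authenticate: Basic` header whenever this returns `None`.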
Given the size of your files, you'll have to use something other than the datastore unless you want to piece them together from multiple entities, as the datastore has a 1 MB limit per entity. Piecing them together is perfectly possible, however.
But you should probably look at the Google Cloud Storage API instead, as it does not have that size restriction.
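To illustrate what piecing a file together from the datastore would involve, here is a rough sketch in plain Python. It only shows the splitting and reassembly; storing each piece as its own entity is left out, and the 900 KB chunk size is an assumed safety margin under the 1 MB limit, not an official figure.

```python
# Assumed margin: stay below the 1 MB entity limit, leaving room for metadata.
CHUNK_SIZE = 900 * 1024


def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    """Return a list of (index, bytes) pieces, each at most chunk_size long."""
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


def reassemble(chunks):
    """Rebuild the original bytes from (index, bytes) pieces in any order."""
    ordered = sorted(chunks, key=lambda chunk: chunk[0])
    return b"".join(piece for _, piece in ordered)
```

Each `(index, bytes)` pair would map to one stored entity. A dump of a few GB means thousands of entities per file, which is the main reason Cloud Storage is the more comfortable choice.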
I'm a little confused about this because the docs say I can use Stackdriver for "Request logs and application logs for App Engine applications". Does that mean web requests? Does that mean millions of web requests?
Stackdriver's pricing is per resource, so does that mean I can log all of my web servers' request logs (which would be HUGE) at no extra cost, meaning I would not be charged for the volume of storage the logs use?
Does Stackdriver use GCP Cloud Storage as a backend, and do I have to pay for the storage? It looks like I can get hundreds of gigabytes of log aggregation for virtually no money, and I just want to make sure I'm understanding this correctly.
I bring up ELK because Elastic just partnered with Google, so Stackdriver must not do everything Elasticsearch does (for almost no money); otherwise it would be a competitor?
Things definitely seem to be moving quickly at Google's cloud division and documentation does seem to suffer a bit.
Having said that, the document you linked to also details the limitations:
The request and application logs for your app are collected by a Cloud
Logging agent and are kept for a maximum of 90 days, up to a maximum
size of 1GB. If you want to store your logs for a longer period or
store a larger size than 1GB, you can export your logs to Cloud
Storage. You can also export your logs to BigQuery and Pub/Sub for
further processing.
It should work out of the box for small to medium-sized projects. The built-in log viewer is also pretty basic.
From your description, it sounds like you may have specific needs, so you should not assume this will be free. You should factor in the cost of Cloud Storage for the logs you want to retain and, depending on how you need to crunch the logs, the cost of BigQuery.
I would like to know a few things about how Google handles the data within Google App Engine:
Where is the data located - is it possible to find out?
How long will the data be stored? And what happens when the data gets deleted?
Can you back up lost or broken data?
What kind of security is added to protect the data?
You can find most of the answers here. More on ACID
Where? Google data centers across the globe
For how long? Forever is too much to say, so let's say for as long as you need the data to be there.
If the data gets deleted?
If you delete it yourself it'll be propagated within a sensible timeframe infrastructurally speaking
If there are accidental errors on Google's side, the data is replicated and backed up.
To protect against accidental errors on your side, you can use the Datastore Admin console to back up your data.
Data is secured on Google's side. If you need to transfer sensitive information, App Engine recently enabled SSL for custom domains. If you don't want to go through the hassle of buying and registering certificates, you can always use SSL through the default appspot.com domain for free. More here
My application currently runs on App Engine. It writes records (for logging and reporting) continuously.
Scenario: view counts on the website. Every time the website is opened, it hits the server to add a record with the time and type of view. These counts are shown in the user's dashboard.
The volume of these requests is now large, currently about 40 per second. The App Engine datastore writes are heavy and the cost keeps increasing.
Is there any way to reduce this, or another database I could use to log the views?
Google App Engine's Datastore is NOT suitable for such a requirement where you have to continuously write to datastore and read less often.
You need to offload this task to a third-party service (either write one yourself or use an existing one).
A better option for user tracking and analytics is Google Analytics (although you won't be able to show the hit counters directly on the website using Analytics).
If you want to show your user page hit count use a page hit counter: https://www.google.com/search?q=hit+counter
In this case you should avoid Datastore.
For this kind of analytics it's best to do the following:
Dump data to the GAE log (yes, this sounds counter-intuitive, but it's actually advice from Google engineers). The GAE log is persistent and is guaranteed not to lose data you write to it.
Periodically parse the log for your data and then export it to BigQuery.
BigQuery has a quite powerful query language so it's capable of doing complex analytics reports.
Luckily this was already done before: see the Mache framework. Also see related video.
Note: there is now a new BigQuery feature called streaming inserts, which could potentially replace the cumbersome middle step (files on Cloud Storage) used in Mache.
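To give an idea of the "parse the log" step, here is a rough sketch in plain Python. The log line format, the regular expression and the function names are all hypothetical (this is not how Mache does it), and the actual BigQuery load is left out; the sketch only turns raw log lines into aggregated rows that could then be loaded.

```python
import re
from collections import Counter

# Hypothetical log line format written by the request handler, e.g.:
#   VIEW page=/home type=mobile ts=2014-01-01T12:00:00Z
VIEW_RE = re.compile(r"VIEW page=(?P<page>\S+) type=(?P<type>\S+) ts=(?P<ts>\S+)")


def aggregate_views(log_lines):
    """Count views per (page, view type, day), ignoring unrelated log lines."""
    counts = Counter()
    for line in log_lines:
        match = VIEW_RE.search(line)
        if match:
            day = match.group("ts")[:10]  # keep only YYYY-MM-DD
            counts[(match.group("page"), match.group("type"), day)] += 1
    return counts


def to_rows(counts):
    """Convert the counters into row dicts suitable for a bulk load."""
    return [
        {"page": page, "type": view_type, "day": day, "views": n}
        for (page, view_type, day), n in sorted(counts.items())
    ]
```

At 40 views per second this replaces roughly 3.4 million datastore writes per day with log lines, and the dashboard reads from the aggregated rows instead.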
My GAE application collects large amounts of numerical data. Instead of having users download it, is it possible to create a Google Docs spreadsheet and save the outgoing bandwidth?
The idea is to create a google docs spreadsheet with the data which the user can then access and if he downloads the data to his computer, it would not count as bandwidth used by my application.
To call external APIs over HTTP, you (or the library that you use) would need to make URLFetch calls, which count towards the outgoing quota.
So you would only save on outgoing bandwidth cost if all users downloaded the same data, i.e. no per-user generated data. Even then, the limits for Google Docs apply: a spreadsheet can be at most 20 MB in size and hold at most 400k cells.
Also, with one shared spreadsheet you would not be able to control access to it. Everyone with the URL would be able to download it.
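As a quick sanity check against those limits, something like the following could tell you whether a dataset would fit at all. The constants are simply the figures quoted above, with 20 MB taken as 20 * 1024 * 1024 bytes (an assumption), and the function names are made up.

```python
MAX_CELLS = 400_000            # cell limit quoted above
MAX_BYTES = 20 * 1024 * 1024   # size limit quoted above, taken as 20 MB


def fits_in_spreadsheet(rows, columns, approx_bytes):
    """True if the data stays within both the cell and the size limit."""
    return rows * columns <= MAX_CELLS and approx_bytes <= MAX_BYTES


def max_rows(columns):
    """Largest number of rows that fits the cell limit for a column count."""
    return MAX_CELLS // columns
```

With 10 columns, for example, `max_rows(10)` gives 40000 rows, which "large amounts of numerical data" will probably exceed quickly.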
We have an application that we're deploying on GAE. I've been tasked with coming up with options for replicating the data that we're storing in the GAE datastore to a system running in Amazon's cloud.
Ideally we could do this without having to transfer the entire data store on every sync. The replication does not need to be in anything close to real time, so something like a once or twice a day sync would work just fine.
Can anyone with some experience with GAE help me out here with what the options might be? So far I've come up with:
Use the Google provided bulkloader.py to export the data to CSV and somehow transfer the CSV to Amazon and process there
Create a Java app that runs on GAE, reads the data from the data store and sends the data to another Java app running on Amazon.
Do those options work? What would be the gotchas with those? What other options are there?
You could use logic similar to what the App Engine HRD migration and backup tools do:
1. Mark modified entities with a child entity marker.
2. Run a MapperPipeline using the App Engine mapreduce library, iterating over those markers with a Datastore Input Reader.
3. In your map function, fetch the parent entity, serialize it to Google Storage using a File Output Writer, and remove the marker.
4. Ping the remote host to import those entities from the Google Storage URL.
As an alternative to 3 and 4, you could make multiple urlfetch (POST) calls to send each serialized entity to the remote host directly, but this is more fragile, as a single failure could compromise the integrity of your data import.
You could look at the datastore admin source code for inspiration.
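To make the marker idea concrete, here is a small in-memory sketch of the flow. It does not use the mapreduce library or Google Storage at all: `Store`, `export_modified` and `import_batch` are hypothetical stand-ins, and the "file" is just a newline-delimited JSON string.

```python
import json


class Store:
    """Stand-in for the datastore: entities plus a set of 'modified' markers."""

    def __init__(self):
        self.entities = {}
        self.markers = set()

    def put(self, key, value):
        self.entities[key] = value
        self.markers.add(key)  # mark the entity as modified on every write


def export_modified(store):
    """Serialize only the marked entities, then clear their markers."""
    lines = [
        json.dumps({"key": key, "value": store.entities[key]}, sort_keys=True)
        for key in sorted(store.markers)
    ]
    # A real job should remove a marker only once its data is safely written.
    store.markers.clear()
    return "\n".join(lines)


def import_batch(remote, payload):
    """What the remote host would do with an exported file."""
    for line in payload.splitlines():
        record = json.loads(line)
        remote[record["key"]] = record["value"]
```

Because only marked entities are exported, each sync transfers just what changed since the previous run, which matches the once-or-twice-a-day requirement in the question.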