Automatically retain latest datastore backup - google-app-engine

I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fix location for each kind, where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron-job to perform backups automatically and regularly as described in the docs Scheduled Backups and I can also write a bit of Python code to execute backup tasks which is triggered manually as described in this SO answer. I plan to write a small Python cron-job that would perform the task to find the most recent backup_info file of a given kind and copy/rename it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders during a day, especially if there is more than one backup for a certain kind. For example in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume, it contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
For every kind in the backup, e.g. "Comment". It seems it contains a list of actual backup data for this kind and this backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
Data folder for each backup and kind. Here for kind "Comment", backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) is not possible: Which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.

I have found two solutions for the problem, one is in GA and one is in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup
and its API for long-running operations allows to get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events would allow to handle just specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). The job allows to specify the path and produces predictable full paths, so existing backup files could be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron-jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). Also the format of the export is different from the old export. It can be imported to BigQuery, too, just like the old [KIND].backup_info files.
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket and if that file exists (file.resourceState === 'not_exists'), is new (file.metageneration === '1') and in fact is one of the [KIND].backup_info files we want, it will be copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite more recent backup file with older file). Copying or moving actual backup payload will break the references inside the [KINDNAME].backup_info files though.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)

Related

Snowpipe: Importing historical data that may be modified without causing re-imports

To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have a S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't reimport files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... However, if I use that, then if those old files get modified, they get imported via Snowflake since the metadata preventing COPY INTO ... from re-importing the S3 files and the metadata for Snowpipe are different, so I can end up with that file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?
If you're not opposed to a scripting solution for this, one solution would be to write a script to pull the set of in scope object names from AWS S3 and feed them to the Snowpipes REST API. The code you'd use for this is very similar to what is required if you're using an AWS Lambda to call the Snowpipe REST API when triggered via an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST STAGE statement to pull them.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we've enabled Snowpipe ingestion after data had already been written there. Even in the scenario where you don't have to worry about a file being updated in place, this can still be an advantage over just falling back to a direct COPY INTO because you don't have to worry if there's any overlap between when the PIPE was first enabled and the set of files you push to the Snowpipe REST API since the PIPE load history will take care of that for you..

Google Drive, auto delete files older than 7 days in specific folder

I'm using Google drive to store my daily made backup from my linux machine.
But i need a script that auto delete's the files in a specific folder after 7 days.
So that there are 7 backups in the folder.
The file that get's backup is called world-$(date +%d-%m-%Y).tar.gz
it replaces the %d %m and %Y with the day month and year it created the backup.
so let's say it created one today it would be called world-14-09-2018.tar.gz
it get's stored inside a folder called backups
Is there any way to have it auto delete the files so not store the inside the trash but delete's them completely after 7 days.
I'm not really familair with those kind of scripts. So if anyone could help me that would be really awesome.
You'll want to use the REST API for Google Drive. There are official client libraries for several languages, listed here
Authenticate your account via OAuth2. Depending on the client library you use, there are different tools to do this. I'm most familiar with the Python SDK, and I use Google's oauth2client. The run_flow() command is a simple way to get an OAuth2 refresh token that you can then use to authenticate API calls. Here's the full documentation for authenticating to Google Drive via OAuth2.
Once you're authenticated, you can call the files list endpoint. By default, this will list all files in your my drive. You can limit the search to just those files so you don't have to iterate through all your files each time using a search query. If you have more backups than can fit on a single page (it doesn't seem like it, esp. with the max pageSize of 1000), you will have to paginate your calls.
You can then filter your results by either the filename (as you indicated) or by the createdTime parameter in files.list in your code. Make sure to include createdTime in your fields, by setting the fields parameter to a comma-separated list of parameters, ie "files(id,createdTime,name,mimeType)" or, simply, "*" to get every field. Get a list of all files older than 7 days, then call files.delete. You can then run this script on a cron job every night, however you want to deploy it.
Alternatively, you could use the unofficial Drive command line tool, which will take care of a lot of this for you.

Recover image files accidentally deleted from Google Cloud bucket

Is it possible to recover lost files in Google Cloud?
Only deleted this evening by a quirk of fate, I ran a task that should not have been run.
All the data is image data, so before I write some code to stub the lost images with a template image I'd like to know if it is at all possible to recover them?
Google Cloud Storage provides Object Versioning which lets you do exactly what you're asking for - be able to recover previous versions of objects (i.e. files) including deleted ones similar to a version control system.
However, you need to turn on Object versioning for a GCS bucket before you can go over the different versions of objects within the bucket. Without that it is not possible to recover any files deleted from your GCS bucket.
How Object Versioning works
Cloud Storage allows you to enable Object Versioning at the bucket
level. Once enabled, a history of modifications (overwrite / delete)
of objects is kept for all objects in the bucket. You can list
archived versions of an object, restore an object to an older state,
or permanently delete a version, as needed.
All objects have generation numbers that allow you to perform safe
read-modify-write updates and conditional operations on them. Note
that there is no guarantee of ordering between generations.
When an object is overwritten or deleted in a bucket which has
versioning enabled, a copy of the object is automatically saved with
generation properties that identify it. You can turn versioning on or
off for a bucket at any time. Turning versioning off leaves existing
object versions in place, and simply causes the bucket to stop
accumulating new object versions. In this case, if you upload to an
existing object, the current version is overwritten instead of
creating a new version.

Monitoring for changes in folder without continuously running

This question has been asked around several time. Many programs like Dropbox make use of some form of file system api interaction to instantaneously keep track of changes that take place within a monitored folder.
As far as my understanding goes, however, this requires some daemon to be online at all times to wait for callbacks from the file system api. However, I can shut Dropbox down, update files and folders, and when I launch it again it still gets to know what the changes that I did to my folder were. How is this possible? Does it exhaustively search the whole tree in search for updates?
Short answer is YES.
Let's use Google Drive as an example, since its local database is not encrypted, and it's easy to see what's going on.
Basically it keeps a snapshot of the Google Drive folder.
You can browse the snapshot.db (typically under %USER%\AppData\Local\Google\Drive\user_default) using DB browser for SQLite.
Here's a sample from my computer:
You see that it tracks (among other stuff):
Last write time (looks like Unix time).
checksum.
Size - in bytes.
Whenever Google Drive starts up, it queries all the files and folders that are under your "Google Drive" folder (you can see that using Procmon)
Note that changes can also sync down from the server
There's also Change Journals, but I don't think that Dropbox or GDrive use it:
To avoid these disadvantages, the NTFS file system maintains an update sequence number (USN) change journal. When any change is made to a file or directory in a volume, the USN change journal for that volume is updated with a description of the change and the name of the file or directory.

Where should Map put temporary files when running under Hadoop

I am running Hadoop 0.20.1 under SLES 10 (SUSE).
My Map task takes a file and generates a few more, I then generate my results from these files. I would like to know where I should place these files, so that performance is good and there are no collisions. If Hadoop can delete the directory automatically - that would be nice.
Right now, I am using the temp folder and task id, to create a unique folder, and then working within subfolders of that folder.
reduceTaskId = job.get("mapred.task.id");
reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir+File.separator+reduceTaskId+ File.separator;
File diseaseParent = new File(myTemporaryFoldername+File.separator +REDUCE_WORK_FOLDER);
The problem with this approach is that I am not sure it is optimal, also I have to delete each new folder or I start to run out of space.
Thanks
akintayo
(edit)
I found that the best place to keep files that you don't want beyond the life of map would be job.get("job.local.dir") which provides a path that will be deleted when the map tasks finishes. I am not sure if the delete is done on a per key basis or for each tasktracker.
The problem with that approach is that the sort and shuffle is going to move your data away from where that data was localized.
I do not know much about your data but the distributed cache might work well for you
${mapred.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache. Thus localized distributed cache is shared among all the tasks and jobs
http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.
The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."

Resources