Recover image files accidentally deleted from Google Cloud bucket - google-app-engine

Is it possible to recover lost files in Google Cloud?
The files were deleted only this evening: by a quirk of fate, I ran a task that should not have been run.
All the data is image data, so before I write some code to stub the lost images with a template image, I'd like to know whether it is at all possible to recover them.

Google Cloud Storage provides Object Versioning, which lets you do exactly what you're asking for: recover previous versions of objects (i.e. files), including deleted ones, much like a version control system.
However, Object Versioning needs to have been turned on for a GCS bucket before you can browse the different versions of objects within it. Without that, it is not possible to recover any files deleted from your GCS bucket.
How Object Versioning works
Cloud Storage allows you to enable Object Versioning at the bucket level. Once enabled, a history of modifications (overwrite / delete) is kept for all objects in the bucket. You can list archived versions of an object, restore an object to an older state, or permanently delete a version, as needed.
All objects have generation numbers that allow you to perform safe read-modify-write updates and conditional operations on them. Note that there is no guarantee of ordering between generations.
When an object is overwritten or deleted in a bucket which has versioning enabled, a copy of the object is automatically saved with generation properties that identify it. You can turn versioning on or off for a bucket at any time. Turning versioning off leaves existing object versions in place, and simply causes the bucket to stop accumulating new object versions. In this case, if you upload to an existing object, the current version is overwritten instead of creating a new version.
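For future reference, once versioning is enabled, recovering a deleted object amounts to listing the archived generations and copying the desired one back over the live name. A minimal sketch with the google-cloud-storage Python client (bucket name, object name, and generation number are placeholder values), assuming versioning was already on when the deletion happened:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-image-bucket")  # hypothetical bucket name

# Enable versioning (this only protects objects deleted *after* this point).
bucket.versioning_enabled = True
bucket.patch()

# List all versions, including archived (noncurrent) ones.
for blob in client.list_blobs(bucket, versions=True):
    print(blob.name, blob.generation, blob.time_deleted)

# Restore a deleted object by copying a specific archived generation
# back to its original name.
archived = bucket.blob("images/cat-42.png")  # hypothetical object name
bucket.copy_blob(
    archived, bucket, "images/cat-42.png",
    source_generation=1234567890123456,  # generation found in the listing above
)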

Related

Consistency for file listing on Google Cloud Storage

I am going to develop an application that creates files on Google Cloud Storage, and other processes will read them.
File creation may be delayed for various reasons (for example, because a file is big), so incomplete files (where the write is still ongoing) may exist on Cloud Storage.
I have to prevent reading incomplete files. However, according to this page, bucket listing is strongly consistent, so newly created files can be listed immediately after they are created.
From the document above, my guess is that newly created files will not be listed until their creation has completed, so incomplete files will not be listed.
Is my guess correct? If not, how should I prevent reading incomplete files?
Your guess is correct: writes to the bucket are atomic (when you upload a file, the content is cached before being pushed to your bucket). You can see this in the documentation:
Read-after-write (i.e. atomic operation, no transient state)
Thus, you don't need to worry about incomplete files.
I downloaded a large file from the internet (512 MB) and then uploaded it to GCS.
I tested by listing the bucket objects (during the upload) using the command
gsutil ls gs://bucket_name
The new object was not listed until the upload had completed successfully.
Therefore, your guess is correct.
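The same check can be scripted. A minimal sketch with the google-cloud-storage Python client (bucket and file names are placeholders): the object only shows up in listings once the upload call returns, i.e. once the write has been finalized.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-test-bucket")  # hypothetical bucket name

# Upload a large local file; the call blocks until the object is finalized.
blob = bucket.blob("large-file.bin")
blob.upload_from_filename("/tmp/large-file.bin")

# Listing is strongly consistent: the object is visible immediately after
# the upload returns, and never while the write is still in progress.
names = [b.name for b in client.list_blobs(bucket)]
assert "large-file.bin" in names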

Removing External Stage Without Deleting S3 Files

I am loading CSV files from Amazon S3 into Snowflake by first loading them into a Snowflake external stage that points to Amazon S3, followed by a COPY command. From what I understand, the purge option either clears the stage or leaves it intact once the load is finished. I am using the same stage for subsequent loads of the same kind, and with purge disabled the files would keep stacking up in the same stage and could create duplicates. The REMOVE command seems to clear the stage, but it also deletes my S3 files.
Is there a way I can purge the stage while leaving the S3 files intact?
The answer to your initial question, "Is there a way that I can purge the stage while leaving the S3 files intact?", is no. An external stage is a reference to a file location (and the files in that location), so purging a stage (i.e. deleting the files in the referenced location, which is what 'purging' means) while keeping the files in that location is not logically possible.
As mentioned in the comments, if you want to keep a copy of the files in S3, then when you copy them to the stage location, just copy them to another S3 location at the same time.
I don't entirely understand what you mean by "I am using the same stage for subsequent calls of the same nature". I assume you are not trying to load the same files again, so if this is a different set of files, why don't you just use a different stage referencing a different S3 location?
As also mentioned in the comments, even if you keep loading data from the same stage (without purging), you won't create duplicates, as Snowflake recognises files it has already processed and won't reload them.
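To illustrate that last point, here is a minimal sketch using the snowflake-connector-python package (connection parameters, stage name, and table name are placeholders): repeating the same COPY INTO against an unpurged stage does not reload files that are already recorded in the table's load history.

import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="public",
)
cur = conn.cursor()

# First run loads the staged CSV files into the table.
cur.execute("COPY INTO my_table FROM @my_s3_stage FILE_FORMAT = (TYPE = CSV)")

# Running the same statement again skips files already in the load history,
# so no duplicates are created as long as FORCE = TRUE is not specified.
cur.execute("COPY INTO my_table FROM @my_s3_stage FILE_FORMAT = (TYPE = CSV)")
print(cur.fetchall())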

Snowpipe: Importing historical data that may be modified without causing re-imports

To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have a S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't reimport files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation makes for importing historical data is to use COPY INTO .... If I use that, then when those old files later get modified they get imported again, because the load metadata that prevents COPY INTO ... from re-importing the S3 files is separate from the metadata Snowpipe keeps, so I can end up with the same file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe for this?
If you're not opposed to a scripting solution, one option would be to write a script that pulls the set of in-scope object names from AWS S3 and feeds them to the Snowpipe REST API. The code you'd use for this is very similar to what is required if you're using an AWS Lambda to call the Snowpipe REST API when triggered by an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST command against the stage to pull them.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we enabled Snowpipe ingestion after data had already been written there. Even in a scenario where you don't have to worry about a file being updated in place, this can still be an advantage over falling back to a direct COPY INTO, because you don't have to worry about any overlap between when the pipe was first enabled and the set of files you push to the Snowpipe REST API; the pipe's load history takes care of that for you.
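A rough sketch of such a backfill script, using boto3 and the snowflake-ingest Python SDK (bucket, prefix, pipe name, and key path are placeholders; this assumes key-pair authentication is already set up for the user):

import boto3
from snowflake.ingest import SimpleIngestManager, StagedFile  # pip install snowflake-ingest

# Collect the in-scope object names from S3 (hypothetical bucket and prefix).
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
staged_files = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="historical/"):
    for obj in page.get("Contents", []):
        # Paths may need to be made relative to the stage referenced by the pipe.
        staged_files.append(StagedFile(obj["Key"], None))  # file size is optional

# Feed the file list to the Snowpipe REST API; the pipe's load history
# prevents duplicates for files it has already ingested.
ingest_manager = SimpleIngestManager(
    account="my_account",
    host="my_account.snowflakecomputing.com",
    user="my_user",
    pipe="MY_DB.PUBLIC.MY_PIPE",
    private_key=open("rsa_key.pem").read(),  # PEM text of the unencrypted private key
)
response = ingest_manager.ingest_files(staged_files)
print(response)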

Automatically retain latest datastore backup

I'm looking for the best strategy to collect the kind-specific *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fixed location for each kind where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron job to perform backups automatically and regularly as described in the docs under Scheduled Backups, and I can also write a bit of Python code to execute backup tasks, triggered manually, as described in this SO answer. I plan to write a small Python cron job that would find the most recent backup_info file of a given kind and copy/rename it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders over the course of a day, especially if there is more than one backup for a certain kind. For example, in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume each such file contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
There is one of these for every kind in the backup, e.g. "Comment". Each seems to contain a list of the actual backup data for that kind and that backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
There is a data folder for each backup and kind. Here, kind "Comment" was backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in an HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something after the backup has been completed?
If (1) and (2) are not possible: which approach would be best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions for the problem: one is in GA and one is in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup,
and its API for long-running operations lets you get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events would allow you to handle just the specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). A job lets you specify the path and produces predictable full paths, so existing backup files can be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). The format of the export also differs from the old one, but it can be imported into BigQuery, too, just like the old [KIND].backup_info files.
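As an illustration, the cron handler behind that URL could call the Datastore export API directly. A minimal sketch using google-auth and requests (project, bucket path, and kinds are placeholders; error handling and operation polling omitted):

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Application Default Credentials (e.g. the App Engine service account).
credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/datastore"]
)
session = AuthorizedSession(credentials)

# Kick off a managed export of selected kinds to a predictable GCS path.
body = {
    "outputUrlPrefix": "gs://MY-PROJECT.appspot.com/latest",  # hypothetical path
    "entityFilter": {"kinds": ["Comment"]},
}
resp = session.post(
    "https://datastore.googleapis.com/v1/projects/{}:export".format(project_id),
    json=body,
)
# The response is a long-running operation; its metadata eventually contains
# the outputUrl of the finished export.
print(resp.json())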
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket. If the changed file exists (file.resourceState === 'exists'), is new (file.metageneration === '1'), and is in fact one of the [KIND].backup_info files we want, it gets copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite a more recent backup file with an older one). Note that copying or moving the actual backup payload would break the references inside the [KIND].backup_info files.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)
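The linked answers cover the Node.js details; as a rough Python equivalent (Cloud Functions also offers Python runtimes), the same logic could look like the sketch below. Bucket names and the kind filter are placeholders, and the event fields follow the Python background-function format for a google.storage.object.finalize trigger, which only fires for new or overwritten objects:

from google.cloud import storage

client = storage.Client()
LATEST_BUCKET = "latest_backups"  # hypothetical destination bucket

def on_backup_info(data, context):
    """Triggered by google.storage.object.finalize on the backup bucket."""
    name = data["name"]
    # Only act on newly created backup_info files for the kind we care about.
    if data.get("metageneration") != "1" or not name.endswith(".Comment.backup_info"):
        return
    src_bucket = client.bucket(data["bucket"])
    dst_bucket = client.bucket(LATEST_BUCKET)
    # Overwrite the fixed "latest" file for this kind; compare timeCreated
    # against custom metadata here if late-arriving events are a concern.
    src_bucket.copy_blob(src_bucket.blob(name), dst_bucket, "Comment.backup_info")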

Delta changes in GAE

I'm wondering, when I press "deploy" in the Google App Engine Launcher, how it syncs my changes to the actual instance... maybe it would be better to ask specific questions :)
1) Does it only upload the delta changes (as opposed to the entire file) for changed files?
2) Does it only upload new and changed files (i.e. does not copy pre-existing, unchanged files)?
3) Does it delete remote files that do not exist locally?
4) Does all of this happen instantaneously for the end user once the app has finished deploying? (i.e. let's say I accidentally uploaded an insecure file that sits on example.com/passwords.txt - if #3 is true, then once I remove it from the local directory and re-deploy it should be gone- but can I be sure it is really gone and not cached on some edge somewhere?)
If you use only the launcher or the appcfg util, as opposed to managing your code with git, App Engine will keep only one 'state' of that particular version of your app and will not store any past states. So,
1) Yes, it uploads only deltas, not full files.
2) Yes, only new, modified or deleted files.
3) Yes, it deletes them if you delete locally and deploy. As Ibrahim Arief suggested, it is a good idea to use appcfg so you can prove it to yourself.
4) Here there are some caveats. With your new deploy, your old instances are sent a kill signal, and until it actually gets executed there is a time span (seconds to minutes) during which new requests could still hit your previous version.
The point Port Pleco has made is also very important. You have to be careful with caching of static files. If a file is served with Expires or Cache-Control headers, it can be cached in various places, so the existence of old copies of it is completely out of your control.
Happy coding!
I'm not a Google employee, so I don't have guaranteed accurate answers, but I can speak a little about your questions from my experience:
1) From what I've seen, it does upload all files each time
2) See 1, I'm fairly sure everything is uploaded
3) I'm not entirely sure whether it "deletes" the files, but I'm 99% sure that they're inaccessible if they don't exist in your current version. If you want to ensure that a file is inaccessible, you can deploy your project with a new version number and switch your app to the new version in the admin panel. That will force Google to use only your most recent files in that new version.
4) From what I've seen, changes that are rendered/executed, like hard-coded HTML text or controller changes, appear instantly. Static files might be cached, as is normal in web development, which means users might have old versions of files saved on their machines. You can append a query string with the version number to the file name to force an update.
For example, if I had a javascript file that I knew I would want to redeploy regularly, I would reference it like this:
<script type="text/javascript" src="../javascript/file.js?version=1.2"></script>
Then I just increment the version number manually when I need to force the new JavaScript out to my users.
