Consistency for file listing on Google Cloud Storage

I am going to develop an application that creates files on Google Cloud Storage, which are then read by other processes.
File creation may take a while (for example, when the file is big), so incomplete files (whose write is still in progress) may exist on Cloud Storage.
I have to prevent readers from picking up those incomplete files. According to this page, bucket listing is strongly consistent: newly created files can be listed immediately after they are created.
From the document above, my guess is that newly created files will not be listed until creation has completed, i.e. incomplete files will not be listed.
Is my guess correct? If not, what should I do to prevent reading incomplete files?

Your guess is correct; writes to a bucket are atomic (when you upload a file, the content is staged before being committed to your bucket). You can see this in the documentation:
Read-after-write (i.e. atomic operation, no transient state)
Thus, you don't need to worry about incomplete files.

I downloaded a large file from the internet (512 MB) and then uploaded it to GCS.
While the upload was in progress, I listed the bucket objects using the command
gsutil ls gs://bucket_name
The new object was not listed until the upload had completed successfully.
Therefore, your guess is correct.
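The same check can be reproduced programmatically. Below is a minimal sketch (not from the original answers) using the google-cloud-storage client library, with a hypothetical bucket "my-bucket" and object "big-file.bin": the object only shows up in the listing once the upload call has returned.

import threading
import time
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")              # hypothetical bucket name

def upload():
    # blocks until the object is fully written and committed
    bucket.blob("big-file.bin").upload_from_filename("big-file.bin")

t = threading.Thread(target=upload)
t.start()
while t.is_alive():
    names = [b.name for b in client.list_blobs("my-bucket")]
    print("listed during upload?", "big-file.bin" in names)   # stays False
    time.sleep(5)
names = [b.name for b in client.list_blobs("my-bucket")]
print("listed after upload?", "big-file.bin" in names)        # True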

Related

Removing External Stage Without Deleting S3 Files

I am loading CSV files from Amazon S3 into Snowflake by first staging them in a Snowflake external stage pointing to Amazon S3 and then running a COPY command. From what I understand, the PURGE option determines whether the stage is cleared or left intact once the load is finished. I am using the same stage for subsequent loads of the same nature, and with purge disabled the files would create duplicates and continue to stack up in the same stage. The REMOVE command seems to clear the stage, but it also deletes my S3 files.
Is there a way that I can purge the stage while leaving the s3 files intact?
The answer to your initial question "Is there a way that I can purge the stage while leaving the s3 files intact?" is no. An external stage is a reference to a file location (and the files in that location), so purging a stage (i.e. deleting the files in the referenced location, which is what 'purging' means) while keeping the files in that location is not logically possible.
As mentioned in the comments, if you want to keep a copy of the files in S3 then when you copy them to the Stage location just copy them to another S3 location at the same time.
I don't entirely understand what you mean by "I am using the same stage for subsequent calls of the same nature". I assume you are not trying to load the same files again, so if this is a different set of files, why not just use a different stage referencing a different S3 location?
As also mentioned in the comments, even if you keep loading data from the same stage (without purging) you won't create duplicates, as Snowflake recognises files it has already processed and won't reload them.
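For illustration only (none of these names come from the question or answer): a minimal sketch of the load step using the snowflake-connector-python package, with hypothetical table, stage and connection parameters. PURGE = FALSE keeps the staged S3 files in place, and Snowflake's per-table load history is what prevents the duplicates discussed above on later runs.

import snowflake.connector

# Hypothetical connection parameters.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="public")
try:
    cur = conn.cursor()
    # Files already recorded in the load history are skipped automatically,
    # so re-running this against the same stage does not duplicate rows.
    cur.execute("""
        COPY INTO my_table
        FROM @my_external_stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
        PURGE = FALSE
    """)
finally:
    conn.close()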

Google cloud storage create zip file with all files from a bucket folder with app engine

I want to create a zip file with all files present inside a bucket folder and write this zip file back to google cloud storage.
I want to do this with the App Engine standard environment, but I didn't find a good example of how to do it.
If the writable temporary file normally needed during zip creation can fit in the available memory of your instance class, you may be able to use the StringIO facility and avoid writing to the filesystem. See, for example, How to zip or tar a static folder without writing anything to the filesystem in python?
It may also be possible to directly write the zip file to GCS, basically using the GAE app as a pipeline, which might circumvent the available instance memory limitation mentioned above, but you'd have to try it out, I don't have an actual example. The tricks to watch for would be picking the right file handler arguments and maybe the buffering options. An example of directly accessing a GCS file (only you'd want to write to it instead of reading from it) would be How to open gzip file on gae cloud?
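A rough sketch of the in-memory variant (my own illustration, not taken from the linked questions), assuming the GoogleAppEngineCloudStorageClient library and a hypothetical "/my-bucket/my-folder/" prefix, and assuming the whole archive fits in instance memory as discussed above:

import io
import zipfile
import cloudstorage as gcs

def zip_folder_to_gcs(prefix='/my-bucket/my-folder/',
                      dest='/my-bucket/archives/my-folder.zip'):
    buf = io.BytesIO()                                   # build the zip in memory
    archive = zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED)
    for stat in gcs.listbucket(prefix):                  # every object under the prefix
        src = gcs.open(stat.filename)
        archive.writestr(stat.filename[len(prefix):], src.read())
        src.close()
    archive.close()
    dst = gcs.open(dest, 'w', content_type='application/zip')
    dst.write(buf.getvalue())                            # write the archive back to GCS
    dst.close()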

Automatically retain latest datastore backup

I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fixed location for each kind where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron job to perform backups automatically and regularly as described in the docs Scheduled Backups, and I can also write a bit of Python code to execute backup tasks triggered manually, as described in this SO answer. I plan to write a small Python cron job that finds the most recent backup_info file of a given kind and copies/renames it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders during a day, especially if there is more than one backup for a certain kind. For example in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume it contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
There is one such file for every kind in the backup, e.g. "Comment". It seems to contain a list of the actual backup data for this kind and this backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
Data folder for each backup and kind. Here for kind "Comment", backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in an HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) are not possible: which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions for the problem, one is in GA and one is in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup,
and its API for long-running operations lets you get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events lets you handle just the specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). The job lets you specify the path and produces predictable full paths, so existing backup files can be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). Also, the format of the export is different from the old one. It can still be imported into BigQuery, just like the old [KIND].backup_info files.
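As an illustration of the export call itself (a sketch under my own assumptions, not code from the answer): the handler behind a URL like /cloud-datastore-export would essentially POST to the Datastore Admin export endpoint and get back a long-running operation whose result contains the output URL. The bucket path and kind below are hypothetical.

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/datastore"])
session = AuthorizedSession(credentials)

body = {
    "outputUrlPrefix": "gs://MY-PROJECT.appspot.com/latest",   # predictable path
    "entityFilter": {"kinds": ["Comment"]},                    # hypothetical kind
}
resp = session.post(
    "https://datastore.googleapis.com/v1/projects/%s:export" % project, json=body)
resp.raise_for_status()
# A long-running operation is returned; polling
# https://datastore.googleapis.com/v1/<operation name> until it is done
# yields the outputUrl of the finished export.
print(resp.json()["name"])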
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket; if the file exists (file.resourceState !== 'not_exists'), is new (file.metageneration === '1') and in fact is one of the [KIND].backup_info files we want, it is copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite a more recent backup file with an older one). Copying or moving the actual backup payload will break the references inside the [KIND].backup_info files, though.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)
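The answer above describes a Node.js function; a rough Python equivalent (my own sketch, with a hypothetical destination bucket "latest_backups" and without the timeCreated comparison mentioned above) for a background function deployed with a google.storage.object.finalize trigger could look like this:

import re
from google.cloud import storage

client = storage.Client()

def on_backup_info(data, context):
    # Triggered when an object is finalized in the backup bucket.
    name = data["name"]
    if str(data.get("metageneration")) != "1":
        return                                    # not a newly created object
    match = re.search(r"\.([^.]+)\.backup_info$", name)
    if not match:
        return                                    # not a per-kind backup_info file
    kind = match.group(1)
    src_bucket = client.bucket(data["bucket"])
    dst_bucket = client.bucket("latest_backups")  # hypothetical bucket
    src_bucket.copy_blob(src_bucket.blob(name), dst_bucket,
                         new_name="%s.backup_info" % kind)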

Recover image files accidentally deleted from Google Cloud bucket

Is it possible to recover lost files in Google Cloud?
The files were only deleted this evening; by a quirk of fate, I ran a task that should not have been run.
All the data is image data, so before I write some code to stub the lost images with a template image I'd like to know if it is at all possible to recover them?
Google Cloud Storage provides Object Versioning which lets you do exactly what you're asking for - be able to recover previous versions of objects (i.e. files) including deleted ones similar to a version control system.
However, you need to turn on Object versioning for a GCS bucket before you can go over the different versions of objects within the bucket. Without that it is not possible to recover any files deleted from your GCS bucket.
How Object Versioning works
Cloud Storage allows you to enable Object Versioning at the bucket level. Once enabled, a history of modifications (overwrite / delete) of objects is kept for all objects in the bucket. You can list archived versions of an object, restore an object to an older state, or permanently delete a version, as needed.
All objects have generation numbers that allow you to perform safe read-modify-write updates and conditional operations on them. Note that there is no guarantee of ordering between generations.
When an object is overwritten or deleted in a bucket which has versioning enabled, a copy of the object is automatically saved with generation properties that identify it. You can turn versioning on or off for a bucket at any time. Turning versioning off leaves existing object versions in place, and simply causes the bucket to stop accumulating new object versions. In this case, if you upload to an existing object, the current version is overwritten instead of creating a new version.
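To make the mechanics concrete, here is a minimal sketch (my own, using the google-cloud-storage client and hypothetical bucket/object names) of enabling versioning and restoring a deleted object from an archived generation. As stated above, this only works if versioning was already enabled when the object was deleted.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-image-bucket")        # hypothetical bucket

# Enable versioning so future overwrites/deletes keep archived generations.
bucket.versioning_enabled = True
bucket.patch()

# List all generations of an object, including archived (deleted) ones.
versions = list(client.list_blobs(bucket, prefix="photos/cat.jpg", versions=True))
for blob in versions:
    print(blob.name, blob.generation, blob.time_deleted)

# Restore the most recent archived generation by copying it back as the live object.
archived = max((b for b in versions if b.time_deleted),
               key=lambda b: int(b.generation), default=None)
if archived:
    bucket.copy_blob(archived, bucket, archived.name,
                     source_generation=archived.generation)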

Processing a large (>32mb) xml file over appengine

I'm trying to process large (~50 MB) XML files to store in the datastore. I've tried using backends, sockets (to pull the file via urlfetch), and even straight up uploading the file within my source code, but I keep running into limits (i.e. the 32 MB limit).
So, I'm really confused (and a little angry/frustrated). Does App Engine really have no way to process a large file? There does seem to be one potential workaround, which would involve remote_api, Amazon (or Google Compute, I guess) and a security/setup nightmare...
HTTP ranges were another thing I considered, but it will be painful to somehow stitch the different split parts together (unless I can manage to split the file at exact points).
This seems crazy, so I thought I'd ask Stack Overflow... am I missing something?
update
Tried using range requests and it looks like the server I'm trying to stream from doesn't support them. So right now I'm thinking: download the file, host it on another server, then use App Engine to access it via ranged HTTP requests on backends, AND then automate the entire process so I can run it as a cron job :/ (the craziness of having to do all this work for something so simple... sigh)
What about storing it in Cloud Storage and reading it incrementally? You can access it line by line (in Python anyway), so it won't consume all your resources.
https://developers.google.com/appengine/docs/python/googlecloudstorageclient/
https://developers.google.com/storage/
The GCS client library lets your application read files from and write files to buckets in Google Cloud Storage (GCS). This library supports reading and writing large amounts of data to GCS, with internal error handling and retries, so you don't have to write your own code to do this. Moreover, it provides read buffering with prefetch so your app can be more efficient.
The GCS client library provides the following functionality: an open method that returns a file-like buffer on which you can invoke standard Python file operations for reading and writing; a listbucket method for listing the contents of a GCS bucket; a stat method for obtaining metadata about a specific file; and a delete method for deleting files from GCS.
I've processed some very large CSV files in exactly this way - read as much as I need to, process, then read some more.
import os
import cloudstorage as gcs

# read_file is a method of a webapp2.RequestHandler (hence self.response)
def read_file(self, filename):
    self.response.write('Truncated file content:\n')
    gcs_file = gcs.open(filename)
    self.response.write(gcs_file.readline())   # read just the first line
    gcs_file.seek(-1024, os.SEEK_END)          # jump to the last 1 KB of the file
    self.response.write(gcs_file.read())
    gcs_file.close()
Incremental reading with standard python!
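To take the incremental idea one step further for XML specifically, here is a sketch (my own, not from the original answer) that feeds the GCS file buffer straight into xml.etree.ElementTree.iterparse, so only a small part of the document is held in memory at a time; the file path, element tag and handle_record() helper are hypothetical.

import xml.etree.ElementTree as ET
import cloudstorage as gcs

def process_large_xml(filename='/my-bucket/big.xml'):
    gcs_file = gcs.open(filename)                      # file-like, read-buffered
    for event, elem in ET.iterparse(gcs_file, events=('end',)):
        if elem.tag == 'record':                       # hypothetical element of interest
            handle_record(elem)                        # e.g. write it to the datastore
        elem.clear()                                   # free the parsed subtree
    gcs_file.close()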
