Snowpipe: Importing historical data that may be modified without causing re-imports - snowflake-cloud-data-platform

To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have an S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't reimport files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to load files staged within the last seven days, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... However, if I use that and those old files later get modified, they get imported again via Snowpipe, because the metadata that prevents COPY INTO ... from re-importing the S3 files and the metadata for Snowpipe are tracked separately, so I can end up with the same file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?

If you're not opposed to a scripting solution, one option is to write a script that pulls the set of in-scope object names from AWS S3 and feeds them to the Snowpipe REST API. The code is very similar to what you'd write for an AWS Lambda that calls the Snowpipe REST API when triggered by an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just run Snowflake's LIST command against the stage.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we enabled Snowpipe ingestion after data had already been written there. Even when you don't have to worry about a file being updated in place, this can still be an advantage over falling back to a direct COPY INTO, because you don't have to worry about any overlap between the files the PIPE has already loaded and the set of files you push to the Snowpipe REST API; the PIPE's load history takes care of that for you.
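Here is a rough sketch of such a script, assuming the snowflake-ingest Python package (Snowpipe REST API SDK, key-pair auth) and boto3. The bucket, prefix, pipe name, key path and connection details are all placeholders.

```python
# Backfill sketch: list historical objects in S3 and push them to a Snowpipe
# via the Snowpipe REST API. All names below are placeholders.
import boto3
from snowflake.ingest import SimpleIngestManager, StagedFile

BUCKET = "my-bucket"            # hypothetical bucket
PREFIX = "historical/"          # hypothetical prefix covered by the pipe's stage

# 1. Collect the in-scope object keys from S3.
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

# 2. Push them to the Snowpipe REST API in batches.
with open("rsa_key.p8") as f:   # key-pair auth is required by the REST API
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account="my_account",
    host="my_account.snowflakecomputing.com",
    user="INGEST_USER",
    pipe="MY_DB.MY_SCHEMA.MY_PIPE",
    private_key=private_key,
)

BATCH = 500
for i in range(0, len(keys), BATCH):
    # File paths must be relative to the stage location referenced by the PIPE;
    # here the stage is assumed to point at s3://my-bucket/historical/.
    staged = [StagedFile(k[len(PREFIX):], None) for k in keys[i:i + BATCH]]
    ingest_manager.ingest_files(staged)
```

Files already recorded in the pipe's load history are simply skipped, which is what makes this safe to run over a range of keys that partially overlaps what auto-ingest has already loaded.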

Related

What would be the best way to store JSON files for use within a React application?

I have a python script that fetches data twice a day from a server of mine. The script returns around 40 JSON files containing various data. The files aren't particularly big and the combined size of all the files is around 250KB.
Alongside my script I am developing a dashboard in React that renders the data from each file into a table, allowing me a visual representation of the data.
I have been looking at what would be the best way to store these files, something that allows me to upload and fetch them twice a day.
Someone suggested using MongoDB to store the files, but after some research I feel like Mongo is better at storing the contents of a file rather than the file itself. I tried to develop a solution but couldn't figure out how it would work when each object is stored as a document with no clear way (to me) to tell which document came from which file.
Other options I have considered are:
Storing the files on the server that is hosting my React project and rendering them locally as I am doing now during development
Storing the files using a provider such as AWS/Firebase
Storing them in a different database (I see SQL databases now support storing JSON)
Are there any other solutions that you think would work best for this scenario? If so, why?
Hello,
Consider using an FTP server.
We have clients that send us data every 10 minutes via FTP as XML files, and a Node.js back-end that reads those files.
You could use the same approach for your scenario with JSON files.
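If you go that route, here is a minimal sketch of the upload side using Python's built-in ftplib, since the fetch script is already Python. The host, credentials and directory names are placeholders.

```python
# Upload the JSON files produced by the twice-daily fetch script to an FTP
# server, where the dashboard back-end can read them. Placeholder names only.
from ftplib import FTP
from pathlib import Path

def upload_json_files(local_dir: str) -> None:
    """Send every *.json file in local_dir to the FTP drop folder."""
    with FTP("ftp.example.com") as ftp:        # hypothetical host
        ftp.login(user="dashboard", passwd="secret")
        ftp.cwd("/dashboard-data")             # hypothetical remote folder
        for path in Path(local_dir).glob("*.json"):
            with path.open("rb") as f:
                ftp.storbinary(f"STOR {path.name}", f)

if __name__ == "__main__":
    upload_json_files("./output")
```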

Removing External Stage Without Deleting S3 Files

I am loading CSV files from Amazon S3 into Snowflake by first loading them into a Snowflake external stage pointing to Amazon S3, followed by a COPY command. From what I understand, the PURGE option controls whether the staged files are removed or left intact once the load is finished. I am using the same stage for subsequent calls of the same nature, and having purge disabled would create duplicates that continue to stack up in the same stage. The REMOVE command seems to clear the stage, but it also deletes my S3 files.
Is there a way that I can purge the stage while leaving the s3 files intact?
The answer to your initial question "Is there a way that I can purge the stage while leaving the s3 files intact?" is no. An external stage is a reference to a file location (and the files in that location) so purging a stage (i.e. deleting the files in the referenced location; this is what 'purging' means) but keeping the files in that location is not logically possible.
As mentioned in the comments, if you want to keep a copy of the files in S3 then when you copy them to the Stage location just copy them to another S3 location at the same time.
I don't entirely understand when you say "I am using the same stage for subsequent calls of the same nature". I assume you are not trying to load the same files again so if this is a different set of files why don't you just use a different stage referencing a different S3 location?
As also mentioned in the comments, even if you keep loading data from the same stage (without purging) you won't create duplicates, as Snowflake recognises files it has already processed and won't reload them.
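To illustrate the "keep a copy elsewhere" suggestion, here is a boto3 sketch that copies each staged file to a separate archive prefix before the staged copy is deleted by COPY INTO ... PURGE = TRUE (or by a REMOVE command). Bucket and prefix names are placeholders.

```python
# Archive staged files to a prefix outside the stage location so the staged
# copies can be purged safely. All names are placeholders.
import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "my-stage-bucket"    # bucket referenced by the external stage
STAGE_PREFIX = "incoming/"        # prefix the stage points at
ARCHIVE_PREFIX = "archive/"       # durable copy kept here

for page in s3.get_paginator("list_objects_v2").paginate(
    Bucket=SRC_BUCKET, Prefix=STAGE_PREFIX
):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        archive_key = ARCHIVE_PREFIX + key[len(STAGE_PREFIX):]
        # Copy the file out of the stage location before it gets purged.
        s3.copy_object(
            Bucket=SRC_BUCKET,
            Key=archive_key,
            CopySource={"Bucket": SRC_BUCKET, "Key": key},
        )
```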

Snowpipe Infrastructure & S3 subfolders

I am trying to set up scalable snowpipe infrastructure. I have one AWS lambda function pulling data and putting the raw json files into their corresponding folders below.
Ideally I'd like to set up Snowpipe to read the data from each folder into its own Snowflake table.
Ex)
The leads JSON file living in the leads folder is piped into a leads_json table within Snowflake.
The opportunities JSON file living in the opportunities folder is piped into an opportunities_json table within Snowflake.
How do I go about setting up the pipelines and stages to reduce the number of pipelines and stages needed?
Will I need one pipeline and stage per sub folder in the bucket?
I'm going to make use of the AUTO_INGEST=TRUE feature with SQS notifications.
You will need one PIPE for each TABLE that you are loading via Snowpipe. You could have a single STAGE pointing to the top folder of your S3 bucket if you wish, or you could create one per table at a lower-level folder. I hope that answers your question.
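For illustration, a minimal sketch of the "one stage at the bucket root, one pipe per subfolder/table" layout, run through snowflake-connector-python. All object names, the storage integration and the credentials are placeholders.

```python
# Create one stage for the bucket and one pipe per target table/subfolder.
# Placeholder names throughout.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="me", password="...",   # placeholder credentials
    database="RAW", schema="PUBLIC", warehouse="LOAD_WH",
)
cur = conn.cursor()

# A single external stage pointing at the top of the bucket.
cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_stage
      URL = 's3://my-bucket/'
      STORAGE_INTEGRATION = my_s3_integration
      FILE_FORMAT = (TYPE = JSON)
""")

# One pipe per table, each scoped to its own subfolder of the stage.
for folder, table in [("leads", "leads_json"), ("opportunities", "opportunities_json")]:
    cur.execute(f"""
        CREATE PIPE IF NOT EXISTS {table}_pipe
          AUTO_INGEST = TRUE
          AS COPY INTO {table}
             FROM @raw_stage/{folder}/
             FILE_FORMAT = (TYPE = JSON)
    """)
```

After creating the pipes, SHOW PIPES exposes the notification_channel (an SQS ARN) to configure on the bucket's event notifications; Snowflake then routes each S3 event to the pipe whose COPY path matches the file's folder.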

Automatically retain latest datastore backup

I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fixed location for each kind where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron-job to perform backups automatically and regularly as described in the docs Scheduled Backups and I can also write a bit of Python code to execute backup tasks which is triggered manually as described in this SO answer. I plan to write a small Python cron-job that would perform the task to find the most recent backup_info file of a given kind and copy/rename it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders during a day, especially if there is more than one backup for a certain kind. For example in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume it contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
One file per kind in the backup, e.g. "Comment". Each seems to contain a list of the actual backup data for this kind and this backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
Data folder for each backup and kind. Here for kind "Comment", backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) is not possible: Which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions to the problem, one in GA and one in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup, and its API for long-running operations lets you get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events can handle just the specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). A job lets you specify the output path and produces predictable full paths, so existing backup files can be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron-jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). Also the format of the export is different from the old export. It can be imported to BigQuery, too, just like the old [KIND].backup_info files.
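For reference, here is a hedged Python sketch of calling the Datastore Admin export API directly (the kind of call a /cloud-datastore-export cron handler would make). It assumes google-auth is installed and default credentials are available; the project, bucket path and kind are placeholders.

```python
# Trigger a Datastore export to a fixed Cloud Storage prefix so the latest
# backup always lands under a predictable path. Placeholder names only.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/datastore"]
)
session = AuthorizedSession(credentials)

body = {
    # Fixed prefix => predictable output location for the export files.
    "outputUrlPrefix": "gs://MY-PROJECT.appspot.com/latest",
    "entityFilter": {"kinds": ["Comment"]},
}
resp = session.post(
    f"https://datastore.googleapis.com/v1/projects/{project}:export", json=body
)
resp.raise_for_status()
operation = resp.json()
# The returned long-running operation can be polled; once done, its metadata
# contains the outputUrl of the finished export.
print(operation["name"])
```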
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket. If the file exists (file.resourceState === 'not_exists' would indicate a deletion), is new (file.metageneration === '1') and is in fact one of the [KIND].backup_info files we want, it is copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite a more recent backup file with an older one). Copying or moving the actual backup payload would break the references inside the [KINDNAME].backup_info files, though.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)
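The answer above describes a Node.js function; purely as an illustration, the same logic sketched as a Python background Cloud Function on a Cloud Storage finalize trigger. The target bucket name and metadata key are placeholders.

```python
# Copy each new [KIND].backup_info file to a "latest_backups" bucket under a
# fixed per-kind name, guarded by a custom-metadata timestamp. Placeholders only.
import re
from google.cloud import storage

LATEST_BUCKET = "latest_backups"                       # hypothetical target bucket
BACKUP_INFO = re.compile(r"\.([A-Za-z0-9_]+)\.backup_info$")

client = storage.Client()

def copy_latest_backup_info(event, context):
    """Triggered for each new object in the backup bucket."""
    name = event["name"]
    match = BACKUP_INFO.search(name)
    if event.get("metageneration") != "1" or not match:
        return                                         # not a new [KIND].backup_info file
    kind = match.group(1)

    src_bucket = client.bucket(event["bucket"])
    dst_bucket = client.bucket(LATEST_BUCKET)
    dst_name = f"{kind}.backup_info"

    # Skip if the copy we already hold is newer (RFC 3339 timestamps of the
    # same format compare lexicographically).
    existing = dst_bucket.get_blob(dst_name)
    if existing and existing.metadata and \
            existing.metadata.get("source_time_created", "") >= event["timeCreated"]:
        return

    copied = src_bucket.copy_blob(src_bucket.blob(name), dst_bucket, dst_name)
    copied.metadata = {"source_time_created": event["timeCreated"]}
    copied.patch()
```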

Recover image files accidentally deleted from Google Cloud bucket

Is it possible to recover lost files in Google Cloud?
Only deleted this evening by a quirk of fate, I ran a task that should not have been run.
All the data is image data, so before I write some code to stub the lost images with a template image I'd like to know if it is at all possible to recover them?
Google Cloud Storage provides Object Versioning, which lets you do exactly what you're asking for: recover previous versions of objects (i.e. files), including deleted ones, similar to a version control system.
However, you need to turn on Object Versioning for a GCS bucket before you can go back over the different versions of objects within it. Without that, it is not possible to recover files deleted from your GCS bucket.
How Object Versioning works
Cloud Storage allows you to enable Object Versioning at the bucket level. Once enabled, a history of modifications (overwrite / delete) of objects is kept for all objects in the bucket. You can list archived versions of an object, restore an object to an older state, or permanently delete a version, as needed.
All objects have generation numbers that allow you to perform safe read-modify-write updates and conditional operations on them. Note that there is no guarantee of ordering between generations.
When an object is overwritten or deleted in a bucket which has versioning enabled, a copy of the object is automatically saved with generation properties that identify it. You can turn versioning on or off for a bucket at any time. Turning versioning off leaves existing object versions in place, and simply causes the bucket to stop accumulating new object versions. In this case, if you upload to an existing object, the current version is overwritten instead of creating a new version.
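As an illustration (not from the quoted docs), a google-cloud-storage sketch of enabling versioning and later restoring a deleted object from a noncurrent generation. Remember this only helps for deletions made after versioning was enabled; bucket and object names are placeholders.

```python
# Enable Object Versioning, then restore a deleted object by copying its most
# recent noncurrent generation back to the live name. Placeholder names only.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-image-bucket")

# Enable Object Versioning (protects objects deleted after this point).
bucket.versioning_enabled = True
bucket.patch()

# Later: find archived generations of a deleted object and restore the newest.
OBJECT_NAME = "images/cat.jpg"
versions = [
    b for b in client.list_blobs(bucket, prefix=OBJECT_NAME, versions=True)
    if b.name == OBJECT_NAME
]
if versions:
    latest = max(versions, key=lambda b: b.generation)
    bucket.copy_blob(
        latest, bucket, latest.name, source_generation=latest.generation
    )
```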
