I have setup snowpipe to continuously load data in tables from an S3 bucket. This has been running about a month now (i.e. > 14 days). There is data in the bucket from before snowpipe was setup and we need to load those files into snowflake also. Snowpipe apparently only maintains copy history data for 14 days. What would be a good way to identify the files that have not yet been ingested into tables and bulk import them?
Did you try below view
SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY . It stores last one year load history data from both copy into command as well as the snowpipe load history
Get the list of files loaded using snowpipe then you can plan for all remaining files load .
Please check the usage note on latency as well
https://docs.snowflake.com/en/sql-reference/account-usage/copy_history.html
Related
I have loaded data in internal/names stage in snowflake and now want to copy data from internal stage to snowflake table continuously. I read that continuous load via snowpipe is supported only for external stage and not for internal stage. So what is the way to load data from internal stage to snowflake table continuously?
Thanks in advance.
Amrita
I think you can use Snowpipe but you'll need to call the REST API for Snowpipe to load the data.
Have you considered simply running a COPY statement immediately after you run the PUT command to get the file into the table?
Try using Tasks and Schedule it for every 5 mins.
// Task DDL
CREATE OR REPLACE TASK MASTER_TASK
WAREHOUSE = LOAD_WH
SCHEDULE = '5 MINUTE'
AS
COPY INTO STAGE_TABLE
FROM #STAGE;
You can always change the frequency based on your requirement. Refer to this Documentation
For External Stage, please refer this link
I am relatively new to AWS Glue, but after creating my crawler and running it successfully, I can see that a new table has been created but I can't see any columns in that table. It is absolutely blank.
I am using a .csv file from a S3 bucket as my data source.
Is your file UTF8 encoded... Glue has a problem if it’s not.
Does your file have at least 2 records
Does the file have more than one column.
There are various factors that impact the crawler from identifying a csv file
Please refer to this documentation that talks about the built in classifier and what it needs to crawl a csv file properly
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
I am trying to set up scalable snowpipe infrastructure. I have one AWS lambda function pulling data and putting the raw json files into their corresponding folders below.
Ideally I'd like to set up snowpipe to read in the data from each folder into it's own Snowflake table.
Ex)
The leads json file living in the leads folder is now piped into a
leads_json table within snowflake.
The opportunities json file living in the opportunities folder is now piped into a opportunitie_json table within snowflake.
How do I go about setting up the pipelines and stages to reduce the number of pipelines and stages needed?
Will I need one pipeline and stage per sub folder in the bucket?
I'm going to make use out of the AUTO_INGEST=true feature using SQS notifications.
You will need 1 PIPE for each TABLE that you are loading via Snowpipe. You could have a single STAGE pointing to the top folder of your S3 bucket, if you wish, or you could create 1 per table at a lower level folder. I hope that is answering your question.
To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have a S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't reimport files that have been modified" aspect is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... However, if I use that, then if those old files get modified, they get imported via Snowflake since the metadata preventing COPY INTO ... from re-importing the S3 files and the metadata for Snowpipe are different, so I can end up with that file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?
If you're not opposed to a scripting solution for this, one solution would be to write a script to pull the set of in scope object names from AWS S3 and feed them to the Snowpipes REST API. The code you'd use for this is very similar to what is required if you're using an AWS Lambda to call the Snowpipe REST API when triggered via an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST STAGE statement to pull them.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we've enabled Snowpipe ingestion after data had already been written there. Even in the scenario where you don't have to worry about a file being updated in place, this can still be an advantage over just falling back to a direct COPY INTO because you don't have to worry if there's any overlap between when the PIPE was first enabled and the set of files you push to the Snowpipe REST API since the PIPE load history will take care of that for you..
I'm looking for the best strategy to collect specific datastore *.backup_info files stored in Cloud Storage and copy them as the "latest" backup_info files per kind, so I have a fix location for each kind, where the most recent backup_info file is found, e.g.
gs://MY-PROJECT.appspot.com/latest/Comment.backup_info
Basically, I have a Google App Engine app (Python standard) with data in Cloud Datastore. I can run a cron-job to perform backups automatically and regularly as described in the docs Scheduled Backups and I can also write a bit of Python code to execute backup tasks which is triggered manually as described in this SO answer. I plan to write a small Python cron-job that would perform the task to find the most recent backup_info file of a given kind and copy/rename it to the desired location.
Either way, the original backup location will be crowded with lots of files and folders during a day, especially if there is more than one backup for a certain kind. For example in gs://MY-PROJECT.appspot.com/ I will find:
VeryLoooooongRandomLookingString.backup_info
OtherStringForSecondBackup.backup_info
OtherStringForThirdBackup.backup_info
The string seems to be a unique identifier for every backup execution. I assume, it contains a list of *.backup_info files, one for each kind in the backup.
VeryLoooooongRandomLookingString.Comment.backup_info
OtherStringForSecondBackup.Comment.backup_info
OtherStringForThirdBackup.Comment.backup_info
For every kind in the backup, e.g. "Comment". It seems it contains a list of actual backup data for this kind and this backup.
datastore_backup_CUSTOM_PREFIX_2017_09_20_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_1_Comment/
datastore_backup_CUSTOM_PREFIX_2017_09_20_2_Comment/
Data folder for each backup and kind. Here for kind "Comment", backed up three times on 9/20.
My questions are related to Datastore and/or Storage:
Is it possible to explicitly specify a custom UID as a query parameter (or in HTTP header) when calling /_ah/datastore_admin/backup.create?
If not, is it possible to send a message with the UID to a hook or something, after the backup has been completed?
If (1) and (2) is not possible: Which approach would be the best in Storage to find the latest *.backup_info file for a given kind? It seems that listbucket() doesn't allow filtering, and I don't think that iterating through hundreds or thousands of files looking for certain name patterns would be efficient.
I have found two solutions for the problem, one is in GA and one is in Beta.
The answers in short:
The GA Datastore Export & Import service allows custom and predictable paths to the backup
and its API for long-running operations allows to get the output URL of a backup job (e.g. for paths with timestamps).
A Cloud Function triggered by Cloud Storage events would allow to handle just specific [KIND].backup_info files as soon as they are added to a bucket, instead of paging through thousands of files in the bucket each time.
Datastore Export & Import
This new service has an API to run export jobs (manually or scheduled). The job allows to specify the path and produces predictable full paths, so existing backup files could be overwritten if only the latest backup is needed at any time, e.g.:
gs://[YOUR_BUCKET]/[PATH]/[NAMESPACE]/[KIND]/[NAMESPACE]_[KIND].export_metadata
For cron-jobs, the App Engine handler URL is /cloud-datastore-export (instead of the old /_ah/datastore_admin/backup.create). Also the format of the export is different from the old export. It can be imported to BigQuery, too, just like the old [KIND].backup_info files.
Cloud Function
Deploy a Cloud Function (JavaScript / Node.js) that is triggered by any change in the backup bucket and if that file exists (file.resourceState === 'not_exists'), is new (file.metageneration === '1') and in fact is one of the [KIND].backup_info files we want, it will be copied to a different bucket ("latest_backups" or so). Custom metadata on the copy can be used to compare timeCreated in later executions of the function (so we don't accidentally overwrite more recent backup file with older file). Copying or moving actual backup payload will break the references inside the [KINDNAME].backup_info files though.
Background Cloud Function with a Cloud Storage trigger
How to copy files in Cloud Functions (Node.JS)