Snowflake Ingest Pipeline - Automatically Remove Ingest Files From Source?

We want to ingest events into Snowflake from an S3 bucket. I know this is possible from this documentation: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html
But after the files have been ingested, we'd like to either delete them or remove them from the ingest bucket.
1: Events loaded into ingest bucket via Firehose (or direct)
2: Snowflake automatically ingests events from this bucket
3: Snowflake either A) moves the files to a new bucket or B) sends a message (SNS?) to a process (lambda) to move the processed file out of the ingest bucket.
Is this possible with Snowflake? I'd like to use the automatic pipeline feature, but it looks like the only way to get this behavior would be to write the pipeline ourselves.
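If it helps, here is a minimal sketch of how an external process (e.g. the Lambda in step 3B) could decide which files are safe to move or delete, by asking Snowflake which files it has already loaded. The table name and time window are placeholders, not anything from your setup:
-- Files Snowpipe has loaded into the target table over the last 24 hours (placeholder names)
select file_name, status, last_load_time
from table(information_schema.copy_history(
  table_name => 'my_events_table',
  start_time => dateadd(hour, -24, current_timestamp())));
The external process could then move or expire only the files that show up here with a loaded status.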

Related

Snowflake Azure Event Grid Integration - Does it do a file scan?

I am trying to figure out how Snowpipes execute when you set up Snowflake to automatically import data using Azure Event Grid notifications, as this document describes - https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-azure.html
So say I have an Azure Data Lake Gen2 container attached to Snowflake as an external stage, and this container has three folders (FolderA, FolderB, and FolderC), and I have a Snowpipe set up for each folder. Then I add a file to FolderA, so Snowflake gets a message from Azure Event Grid saying that the file has been added (and the Event Grid message has the full file name). Does Snowflake know to run just the Snowpipe set up for FolderA, or will it run all three of the Snowpipes? And when the Snowpipe runs, does it scan for files, or does it just import the specific file named in the Event Grid message?
Setting up Event Grid, Snowpipe, and the overall handshake between Azure and Snowflake is a bit tricky, and I have never tried it with multiple folders and Snowpipes. But I prefer to give each pipe a folder path and file pattern, so that even if the Snowpipe is triggered it only picks up the files targeted by the COPY command that the Snowpipe wraps.
In AWS, all Snowpipes with the auto-ingest flag set to true generate the same ARN, and SNS (the equivalent of Event Grid) also takes the file pattern on each folder and calls the same ARN. So I assume the pipe runs but does not copy anything when the file does not match.
But I will certainly try to simulate this and see how it works.
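For what it's worth, a minimal sketch of scoping a pipe to one folder and a file pattern (all names, including the notification integration, are placeholders): a notification for a file outside FolderA would still reach the pipe, but the COPY it wraps would simply match nothing.
create pipe my_db.my_schema.pipe_folder_a
  auto_ingest = true
  integration = 'MY_EVENT_GRID_NOTIFICATION_INT'
  as
  copy into my_db.my_schema.folder_a_table
  from @my_db.my_schema.adls_stage/FolderA/
  file_format = (type = 'JSON')
  pattern = '.*[.]json';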

Snowflake Snowpipe files are not auto-ingesting after adding a file to AWS S3

I have created a Snowpipe for one of my Snowflake tables. Source files land in an AWS S3 bucket periodically, so I followed the steps below to create the Snowpipe:
Created external stage
Listed the files in the stage using the LIST command (able to see the available files in the result panel)
Created snowpipe
Configured SQS notification on top of S3 bucket
Added one sample file, but it was not loaded automatically
Altered the Snowpipe using the following command:
alter pipe snowpipe_content refresh;
After the refresh, the file was loaded into the Snowflake target table after some time.
Can someone please help me figure out what I missed in the Snowpipe setup?
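For context, a minimal sketch of the kind of stage and pipe described in the steps above (the bucket, storage integration, table, and file format are placeholder assumptions, not the asker's actual objects):
create or replace stage my_db.my_schema.content_stage
  url = 's3://my-ingest-bucket/content/'
  storage_integration = my_s3_integration;

create or replace pipe my_db.my_schema.snowpipe_content
  auto_ingest = true
  as
  copy into my_db.my_schema.content_table
  from @my_db.my_schema.content_stage
  file_format = (type = 'CSV' skip_header = 1);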
Follow the steps below to troubleshoot your Snowpipe:
Step I: Check the status of your Snowpipe:
SELECT SYSTEM$PIPE_STATUS('pipe_name');
Make sure your pipe status is RUNNING
Step II: Check the copy history for the table associated with the Snowpipe:
select *
from table(information_schema.copy_history(
  table_name => 'table_name',
  start_time => dateadd(hours, -1, current_timestamp())));
Confirm whether your file appears in the list, and whether it was loaded or errored.
Step III: Validate your Snowpipe load:
select *
from table(information_schema.validate_pipe_load(
  pipe_name => 'pipe_name',
  start_time => dateadd(hour, -1, current_timestamp())));
If the above steps look good, the issue might be with your SQS notification setup.
Follow the Snowflake article linked below:
Snowflake KB
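One thing worth double-checking (an assumption about the setup, since the question does not show it): the S3 bucket event notification must target the SQS queue ARN that Snowflake created for the pipe. You can read that ARN from the pipe itself; the notification_channel column holds the SQS ARN.
show pipes;
desc pipe snowpipe_content;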

Mapping S3 bucket object's audio metadata to DynamoDB

Let's say I have imported a large number of audio files into S3. I need to map my audio files' metadata (including artist, track name, duration, release date, ...) to a DynamoDB table in order to query them using a GraphQL API in a React app. However, I can't yet figure out how to extract this metadata so it can be mapped into DynamoDB.
In the DynamoDB developer guide, it is mentioned (p.914) that the S3 object identifier can be stored in the DynamoDB item.
It is also mentioned that S3 object metadata support can provide a link back to the parent item in DynamoDB (by storing the primary key value of the table item as the S3 metadata).
However, the process is not really detailed; the closest approach I found is from J. Beswick who uses a lambda function to load a large amount of data from a JSON file stored in an S3 bucket.
(https://www.youtube.com/watch?v=f0sE_dNrimU&feature=emb_logo).
S3 object metadata is something different from audio metadata.
Think of it this way: everything that you put on S3 is an object. This object has a key (name), some metadata that S3 attaches to it by default, and other metadata that you can attach to it yourself. All of these things are explained here.
Audio file metadata is a different thing. It lives inside the file (let's suppose it is an MP3 file). To access this data you need to read the file using an API that knows the file format and how to extract the data.
When you upload a file to S3, S3 does not extract any data from inside it (artist, track number, etc. from MP3 files) and attach it to your object metadata. You need to do that yourself.
A suggested solution: for every file that you upload to S3, the upload triggers a Lambda function that knows how to extract the audio metadata from the file. It extracts this metadata and saves it in DynamoDB together with the key of the object in S3. After that you can query your table with the search you planned for and, once the record is found, point back to the correct object in S3.
You can also run the same function over all objects already in the S3 bucket, so the existing files do not need to be re-uploaded.
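If it helps to picture the end state, here is a rough sketch in PartiQL (DynamoDB's SQL-compatible query language) of the kind of item such a Lambda might write; the table name and attributes are made up for illustration, not an existing schema:
INSERT INTO "AudioTracks" VALUE {
  'trackId': 'artist-name#track-name',
  's3Key': 'uploads/track-name.mp3',
  'artist': 'Artist Name',
  'trackName': 'Track Name',
  'durationSeconds': 215,
  'releaseDate': '2020-01-01'
};
Later, a query such as SELECT * FROM "AudioTracks" WHERE artist = 'Artist Name'; would find the record, and the s3Key attribute points back to the object in S3 (note that filtering on a non-key attribute like artist results in a scan unless a suitable index is used).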

Snowpipe not working after uploading the same file twice

Just playing around with Snowpipe. I had it working. I would drop a file onto S3 and Snowpipe loaded the data into a Snowflake table.
However, when I copied the same file into the S3 bucket a second time, Snowpipe didn't pick it up, nor did it pick up any subsequent files that were not duplicates.
To illustrate:
Uploaded file1.txt into the S3 bucket - success
Uploaded file2.txt into the S3 bucket - success
Uploaded file3.txt into the S3 bucket - success
Re-Uploaded file1.txt into the S3 bucket - no result - table was not updated
Uploaded file4.txt into the S3 bucket - no result - table was not updated
How do I go about troubleshooting or fixing this issue?
Thanks
A few clarifications:
Yes, Snowpipe will not load a file again. If there is an error in the file and you need to modify it, you will need to rename it (e.g. file1v2.txt).
The behavior you noticed regarding the next file not being loaded is unexpected and requires troubleshooting. Is there any issue with the next file (since it is showing up as a pending file count of 1)? Are you able to access it otherwise from outside Snowflake? Can you run COPY on it to load it into, say, another table?
Snowpipe behaves similarly on Azure and AWS except for queue ownership (Azure blob store will not deliver to a queue in another subscription).
Multiple pipes share the same queue on AWS and we use the bucket/prefix to demultiplex to different pipes.
Dinesh Kulkarni
(PM, Snowflake)
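To make the checks above concrete, a hedged sketch with placeholder table, stage, and file-format names: COPY_HISTORY shows whether the re-uploaded file1.txt was skipped as a duplicate and whether file4.txt errored or is still pending, and a manual COPY with FORCE = TRUE reloads a file that the load history already knows about.
-- Per-file load outcomes for the last 24 hours
select file_name, status, first_error_message, last_load_time
from table(information_schema.copy_history(
  table_name => 'my_table',
  start_time => dateadd(hour, -24, current_timestamp())));

-- Manually reload an already-loaded file, bypassing load-history de-duplication
copy into my_table
from @my_stage/file1.txt
file_format = (type = 'CSV')
force = true;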

Using S3 as a sink (StreamingFileSink)

We are using Flink to process input events, aggregate them, and write the output of our streaming job to S3 using StreamingFileSink. But whenever we try to restore the job from a savepoint, the restoration fails with a missing part files error. As per my understanding, S3 deletes those in-progress part files, which can then no longer be found on S3. Is there a workaround for this, so that we can use S3 as a sink?
