Snowpipe not working after uploading the same file twice

Just playing around with Snowpipe. I had it working: I would drop a file onto S3 and Snowpipe loaded the data into a Snowflake table.
However, when I copied the same file into the S3 bucket a second time, Snowpipe didn't pick it up, and it also stopped picking up subsequent files that were not duplicates.
To illustrate:
Uploaded file1.txt into the S3 bucket - success
Uploaded file2.txt into the S3 bucket - success
Uploaded file3.txt into the S3 bucket - success
Re-Uploaded file1.txt into the S3 bucket - no result - table was not updated
Uploaded file4.txt into the S3 bucket - no result - table was not updated
How do I go about troubleshooting or fixing this issue?
Thanks

A few clarifications:
Yes, Snowpipe will not load the same file again. If there is an error in the file and you need to modify it, you will need to rename it (e.g. file1v2.txt).
The behavior you noticed regarding the next file not being loaded is unexpected and requires troubleshooting. Is there any issue with the next file (since it is showing up as a pending file count of 1)? Are you able to access it otherwise from outside Snowflake? Can you run COPY on it to load it into, say, another table?
Snowpipe behaves similarly on Azure and AWS except for queue ownership (Azure Blob Storage will not deliver to a queue in another subscription).
Multiple pipes share the same queue on AWS, and we use the bucket/prefix to demultiplex to different pipes.
Dinesh Kulkarni
(PM, Snowflake)
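For reference, one way to see from the Snowflake side which files Snowpipe actually loaded is the COPY_HISTORY table function. A minimal sketch, assuming the target table is MY_TABLE and the stage is MY_STAGE (both placeholders):

-- Per-file load results for this table over the last day
select file_name, status, last_load_time, first_error_message
from table(information_schema.copy_history(
    table_name => 'MY_TABLE',
    start_time => dateadd(day, -1, current_timestamp())));

-- To reload a file that Snowpipe treats as already loaded, either rename it
-- (as suggested above) or run a one-off COPY that bypasses the load history:
copy into my_table
from @my_stage
files = ('file1.txt')
force = true;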

Related

Snowflake Azure Event Grid Integration - Does it do a file scan?

I am trying to figure out how Snowpipes execute when you set up Snowflake to automatically import data using Azure Event Grid notifications, as this document describes - https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-azure.html
So say I have an Azure Data Lake Gen2 container attached to Snowflake as an external stage, and this container has three folders (FolderA, FolderB, and FolderC), and I have a Snowpipe set up for each folder. Then I add a file to FolderA. Snowflake gets a message from Azure Event Grid saying that the file has been added (and the Event Grid message has the full file name). Does Snowflake know to run just the Snowpipe set up for FolderA? Or will it run all three of the Snowpipes? And when the Snowpipe runs, does it scan for files? Or does the Snowpipe just import the specific file named in the Event Grid message?
Setting up Event Grid, Snowpipe, and the overall handshake process in an Azure/Snowflake combination is a bit tricky, and I have never tried it with multiple folders and Snowpipes, but I prefer to give a folder and file pattern so that even if a Snowpipe is triggered, it only picks up the files targeted by the COPY command that the Snowpipe wraps.
In AWS, every Snowpipe with the auto_ingest = true flag generates the same ARN key, and SNS (the equivalent of Event Grid) also takes the file pattern on each folder and calls the same ARN. So I assume the pipe runs but does not copy anything.
But I will surely try to simulate how it works.
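For what it's worth, a minimal sketch of the one-pipe-per-folder setup on Azure. The notification integration name AZURE_NOTIF_INT, the stage MY_ADLS_STAGE, and the table names are placeholders; the idea is that each pipe's COPY statement only references its own folder path:

create pipe pipe_folder_a
  auto_ingest = true
  integration = 'AZURE_NOTIF_INT'
  as
  copy into folder_a_table
  from @my_adls_stage/FolderA/
  file_format = (type = 'CSV');

-- Repeat for FolderB and FolderC with their own target tables.
-- When an Event Grid message arrives for a blob under FolderA/, only the pipe
-- whose stage path matches that blob path should ingest it, and the pipe loads
-- the specific file named in the notification rather than rescanning the container.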

Snowflake Ingest Pipeline - Automatically Remove Ingest Files From Source?

We want to ingest events into Snowflake from an S3 bucket. I know this is possible from this documentation: https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html
But after the files have been ingested, we'd like to either delete them or remove them from the ingest bucket.
1: Events loaded into ingest bucket via Firehose (or direct)
2: Snowflake automatically ingests events from this bucket
3: Snowflake either A) moves the files to a new bucket or B) sends a message (SNS?) to a process (lambda) to move the processed file out of the ingest bucket.
Is this possible with Snowflake? I'd like to use the automatic pipeline feature, but it looks like the only way we can get this behavior would be to write the pipeline ourselves.
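Snowpipe itself does not delete or move files after loading. One common pattern (a sketch, not an official feature) is to have a scheduled job outside Snowflake, e.g. a Lambda, query COPY_HISTORY for successfully loaded files and then delete or archive the corresponding S3 objects. The table name EVENTS_TABLE below is a placeholder:

-- Files this table loaded in the last 24 hours; an external job can use
-- stage_location/file_name to delete or move the matching S3 objects.
select stage_location, file_name, last_load_time
from table(information_schema.copy_history(
    table_name => 'EVENTS_TABLE',
    start_time => dateadd(hour, -24, current_timestamp())))
where status = 'Loaded';

Alternatively, an S3 lifecycle rule that expires objects after a retention window avoids custom code entirely, at the cost of files lingering in the bucket until they age out.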

Snowflake Snowpipe files are not auto-ingesting after adding a file into AWS S3

I have created a snowpipe for one of my Snowflake tables. Source files land in an AWS S3 bucket periodically, so I followed the steps below to create the snowpipe:
Created external stage
Queried the files using "PUT" command (Able to see the list of available files in result panel)
Created snowpipe
Configured SQS notification on top of S3 bucket
Added one sample file and it was not loaded automatically
Altered snowpipe using following command:
alter pipe snowpipe_content refresh;
The file got added into the Snowflake target table after some time.
Can someone please help me figure out what I missed in the snowpipe setup?
Follow the steps below to troubleshoot your snowpipe:
Step I: Check the status of your snowpipe:
SELECT SYSTEM$PIPE_STATUS('pipe_name');
Make sure your pipe status is RUNNING.
Step II: Check the copy history for the table associated with the snowpipe:
select *
from table(information_schema.copy_history(
    table_name => 'table_name',
    start_time => dateadd(hours, -1, current_timestamp())));
Check whether the file appears in the list and whether it was loaded or errored.
Step III: Validate your snowpipe load:
select *
from table(information_schema.validate_pipe_load(
    pipe_name => 'pipe_name',
    start_time => dateadd(hour, -1, current_timestamp())));
If the above steps look good, the issue might be with your SQS notification setup. Follow the Snowflake article at the link below:
Snowflake KB
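If the notification setup is the suspect, one thing worth double-checking (a sketch; the pipe name comes from the question above) is that the S3 bucket's event notification targets the SQS queue ARN that Snowflake generated for the pipe:

-- The notification_channel column is the SQS ARN that the S3 bucket's
-- object-created event notification must point at.
show pipes like 'snowpipe_content';

SYSTEM$PIPE_STATUS also reports a notificationChannelName field, which should match that ARN.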

Using S3 as a sink (StreamingFileSink)

We are using Flink to process input events, aggregate them, and write the output of our streaming job to S3 using StreamingFileSink, but whenever we try to restore the job from a savepoint, the restoration fails with a missing part files error. As per my understanding, S3 deletes those part (intermediate) files and they can no longer be found on S3. Is there a workaround for this, so that we can use S3 as a sink?

Uploading to Google Cloud Storage using Blobstore: Blobstore doesn't retain file name upon upload

I'm trying to upload to GCS using the Blobstore. I have set the GCS bucket name while generating the upload url, and the file gets uploaded successfully.
In the upload handler, blobInfo.getFilename() returns the right file name, but the file actually gets saved in the GCS bucket under a different file name. Each time, the file name is a random hash like this one:
L2FwcGhvc3RpbmdfcHJvZC9ibG9icy9BRW5CMlVvbi1XNFEyWEJkNGlKZHNZRlJvTC0wZGlXVS13WTF2c0g0LXdzcEVkaUNEbEEyc3daS3Vham1MVlZzNXlCSk05ZnpKc1RudDJpajF1TmxwdWhTd2VySVFLdUw3US56ZXFHTEZSLVoxT3lablBI
Is this how it will work? Is this an anomaly?
I store the file name to the datastore based on the data returned from blobInfo.getFilename(), which is the correct value of file name. But I'm unable to access the file using the GcsFilename since the file is stored in GCS with that random hash as file name.
Any pointers would be greatly helpful.
Thanks!
PS: The Blobstore page says that BlobInfo is currently not available for GCS objects, but BlobInfo.getFilename returns the right value for me. Is something wrong on my end?
It's how it works, see https://cloud.google.com/appengine/docs/python/blobstore/fileinfoclas ...:
FileInfo metadata is not persisted to datastore [...] You must save the gs_object_name yourself in your upload handler or this data will be lost
I personally recommend that new applications use https://cloud.google.com/appengine/docs/python/googlecloudstorageclient/ directly, rather than the blobstore emulation on top of it.
The latter is currently provided essentially only for (limited, partial) backwards compatibility: it's not really all that suitable for new applications.
