I am trying to set up scalable Snowpipe infrastructure. I have one AWS Lambda function pulling data and putting the raw JSON files into their corresponding folders below.
Ideally I'd like to set up Snowpipe to read the data from each folder into its own Snowflake table.
Ex)
The leads JSON file living in the leads folder is piped into a leads_json table within Snowflake.
The opportunities JSON file living in the opportunities folder is piped into an opportunities_json table within Snowflake.
How do I go about setting up the pipes and stages so as to minimize the number needed?
Will I need one pipe and stage per subfolder in the bucket?
I'm going to make use of the AUTO_INGEST = TRUE feature with SQS notifications.
You will need one PIPE for each TABLE that you are loading via Snowpipe. You could have a single STAGE pointing to the top folder of your S3 bucket if you wish, or you could create one per table at a lower-level folder; a sketch of that layout is below. I hope that answers your question.
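If it helps, here is a minimal sketch of that layout using the Python connector: a single stage at the top of the bucket and one auto-ingest pipe per folder/table. Every name in it (stage, storage integration, pipes, tables, warehouse) is a placeholder, and the target <folder>_json tables are assumed to already exist.

```python
# Minimal sketch (placeholder names throughout): one stage at the top of the
# bucket and one auto-ingest pipe per folder, each loading its own table.
# Assumes the <folder>_json tables (e.g. a single VARIANT column) and the
# storage integration already exist.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    role="SYSADMIN", warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
)
cur = conn.cursor()

# Single external stage pointing at the top folder of the S3 bucket.
cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_stage
      URL = 's3://my-bucket/'
      STORAGE_INTEGRATION = my_s3_int
""")

# One pipe per folder/table.
for folder in ["leads", "opportunities"]:
    cur.execute(f"""
        CREATE PIPE IF NOT EXISTS {folder}_pipe
          AUTO_INGEST = TRUE
        AS
        COPY INTO {folder}_json
        FROM @raw_stage/{folder}/
        FILE_FORMAT = (TYPE = 'JSON')
    """)

# SHOW PIPES lists each pipe's notification_channel (an SQS ARN) to use when
# configuring the bucket's S3 event notifications.
cur.execute("SHOW PIPES")
for row in cur.fetchall():
    print(row)
```

With this layout, adding a new feed is just another folder, table, and pipe; the stage itself never changes.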
Related
I'm trying to use a SageMaker ProcessingJob to process a huge S3 bucket on multiple instances.
The S3 bucket is structured so that the multiple input files for each job live in the same folder, e.g.
job1/
    a.jpg
    b.json
    c.proto
job2/
    a.jpg
    b.json
    c.proto
...
where a.jpg, b.json, and c.proto are required together for processing.
How can I force SageMaker to shard jobs according to the folder structure instead of by individual files?
I tried looking for an appropriate sharding strategy, but found only ShardedByS3Key and FullyReplicated.
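For reference, this is roughly how that distribution type is set on a ProcessingInput with the SageMaker Python SDK (image URI, role, prefix, and script name are placeholders, and this is not a fix for the per-folder grouping). ShardedByS3Key distributes individual object keys across instances, which is why the files of one job folder are not guaranteed to stay together.

```python
# Illustrative sketch only; image URI, role, bucket prefix, and script are placeholders.
from sagemaker.processing import ProcessingInput, ScriptProcessor

processor = ScriptProcessor(
    image_uri="<processing-image-uri>",      # placeholder container image
    command=["python3"],
    role="<execution-role-arn>",             # placeholder IAM role
    instance_count=4,
    instance_type="ml.m5.xlarge",
)

processor.run(
    code="process.py",                       # placeholder processing script
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/jobs/",   # placeholder prefix holding job1/, job2/, ...
            destination="/opt/ml/processing/input",
            # Shards individual objects by key across the 4 instances; the
            # alternative is "FullyReplicated" (every instance gets everything).
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
)
```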
I am relatively new to AWS Glue, but after creating my crawler and running it successfully, I can see that a new table has been created, but I can't see any columns in that table. It is absolutely blank.
I am using a .csv file from an S3 bucket as my data source.
Is your file UTF-8 encoded? Glue has a problem if it's not.
Does your file have at least 2 records?
Does the file have more than one column?
There are various factors that can prevent the crawler from identifying a CSV file.
Please refer to this documentation, which covers the built-in classifier and what it needs to crawl a CSV file properly (a rough example of defining a custom classifier follows the link):
https://docs.aws.amazon.com/glue/latest/dg/add-classifier.html
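Not from the answer above, but as an illustration of what that page describes: if the built-in classifier can't infer the schema (for example a headerless or single-column file), you can register a custom CSV classifier and attach it to the crawler. A rough boto3 sketch with placeholder names and headers:

```python
# Rough sketch: define a custom CSV classifier and attach it to the crawler.
# Classifier/crawler names and the header list are made up.
import boto3

glue = boto3.client("glue")

glue.create_classifier(
    CsvClassifier={
        "Name": "my_csv_classifier",             # placeholder
        "Delimiter": ",",
        "QuoteSymbol": '"',
        "ContainsHeader": "PRESENT",             # or "ABSENT" / "UNKNOWN"
        "Header": ["id", "name", "created_at"],  # placeholder column names
        "AllowSingleColumn": True,
    }
)

# Attach the classifier to the existing crawler and re-run it.
glue.update_crawler(Name="my_crawler", Classifiers=["my_csv_classifier"])
glue.start_crawler(Name="my_crawler")
```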
I have set up Snowpipe to continuously load data into tables from an S3 bucket. It has been running for about a month now (i.e. > 14 days). There is data in the bucket from before Snowpipe was set up, and we need to load those files into Snowflake as well. Snowpipe apparently only maintains copy history for 14 days. What would be a good way to identify the files that have not yet been ingested into tables and bulk import them?
Did you try the view below?
SNOWFLAKE.ACCOUNT_USAGE.COPY_HISTORY stores the last year of load history, covering both COPY INTO statements and Snowpipe loads.
Get the list of files already loaded via Snowpipe, and then you can plan the load of the remaining files (a sketch follows the link).
Please check the usage notes on latency as well:
https://docs.snowflake.com/en/sql-reference/account-usage/copy_history.html
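As a rough sketch of that (placeholders for the account, warehouse, and target table), you could pull the already-loaded file names from the view and diff them against a listing of the bucket:

```python
# Sketch using snowflake-connector-python; account, warehouse, and the target
# table/schema names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", warehouse="LOAD_WH",
)
cur = conn.cursor()
cur.execute("""
    SELECT file_name, last_load_time, status
    FROM snowflake.account_usage.copy_history
    WHERE table_name = 'LEADS_JSON'          -- placeholder table
      AND table_schema_name = 'PUBLIC'
      AND last_load_time >= DATEADD(year, -1, CURRENT_TIMESTAMP())
    ORDER BY last_load_time
""")
loaded = {row[0] for row in cur.fetchall()}
print(f"{len(loaded)} files already loaded")
# Diff this set against a listing of the bucket (or LIST @stage) to find the
# files that still need a manual COPY INTO.
```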
I am new to Camel and need some guidance. I need to read some files from an S3 bucket. The structure is like so.
S3 Bucket
```
Incoming
  +xls
    -file1.xls
    -file2.xls
    -file3.xls
  +doc
    -file1.doc
    -file2.doc
    -file3.doc
Processed
  +xls
    ...
  +doc
    ...
```
When a particular Excel file is dropped into the incoming/xls folder (say file1.xls), I need to pick up all the files, do some processing, and drop them into the processed folder with the same directory structure.
What components do I need to use for this? I tried reading the documentation, but it's a little difficult to figure out which components I need. I understand that I will use the camel-aws-s3 component, but there are not many examples of it out there.
On https://camel.apache.org/components/latest/aws-s3-component.html there are some examples of writing to and reading from an S3 bucket.
Besides reading and writing to S3, you might need a custom processor that uses Apache POI to transform the xls files.
To start with, I'm not sure if this is possible with the existing features of Snowpipe.
I have an S3 bucket with years of data, and occasionally some of those files get updated (the contents change, but the file name stays the same). I was hoping to use Snowpipe to import these files into Snowflake, as the "we won't re-import files that have been modified" behavior is appealing to me.
However, I discovered that ALTER PIPE ... REFRESH can only be used to import files staged no earlier than seven days ago, and the only other recommendation Snowflake's documentation has for importing historical data is to use COPY INTO .... However, if I use that and one of those old files later gets modified, it gets ingested again by Snowpipe, since the metadata that prevents COPY INTO ... from re-importing the S3 files and the metadata Snowpipe keeps are separate, so I can end up with that file imported twice.
Is there any approach, short of "modify all those files in S3 so they have a recent modified-at timestamp", that would let me use Snowpipe with this?
If you're not opposed to a scripting solution, one option would be to write a script that pulls the set of in-scope object names from AWS S3 and feeds them to the Snowpipe REST API. The code you'd use for this is very similar to what is required when an AWS Lambda calls the Snowpipe REST API after being triggered by an S3 event notification. You can either use the AWS SDK to get the set of objects from S3, or just use Snowflake's LIST command against the stage to pull them; a sketch of this is below.
I've used this approach multiple times to backfill historical data from an AWS S3 location where we enabled Snowpipe ingestion after data had already been written there. Even when you don't have to worry about a file being updated in place, this can still be an advantage over falling back to a direct COPY INTO, because you don't have to worry about any overlap between when the PIPE was first enabled and the set of files you push to the Snowpipe REST API; the PIPE's load history takes care of that for you.
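Here is a rough sketch of that script, assuming the snowflake-ingest Python SDK and key-pair authentication (which the Snowpipe REST API requires). The bucket, prefix, account, user, pipe name, and key path are all placeholders, and the keys pushed to the API need to be expressed relative to the pipe's stage location.

```python
# Rough sketch, not production code. Assumes key-pair auth (required by the
# Snowpipe REST API) and that the stage points at the bucket root, so the S3
# keys below double as stage-relative paths. All names/paths are placeholders.
import boto3
from snowflake.ingest import SimpleIngestManager, StagedFile

# 1. Collect the in-scope object keys from S3 (a LIST @stage in Snowflake
#    would work just as well as the source of this list).
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
keys = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="historical/"):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

# 2. Push the file names to the Snowpipe REST API in batches; the pipe's own
#    load history de-duplicates anything it has already ingested.
ingest_manager = SimpleIngestManager(
    account="my_account",
    host="my_account.snowflakecomputing.com",
    user="my_user",
    pipe="RAW.PUBLIC.LEADS_PIPE",
    private_key=open("rsa_key.p8").read(),  # unencrypted PEM private key
)

BATCH = 100
for i in range(0, len(keys), BATCH):
    batch = [StagedFile(key, None) for key in keys[i:i + BATCH]]
    resp = ingest_manager.ingest_files(batch)
    print(resp["responseCode"])
```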