Snowpipe Continuous Ingest From S3 Best Practices

I'm expecting to stream 10,000 (small, ~ 10KB) files per day into Snowflake via S3, distributed evenly throughout the day. I plan on using the S3 event notification as outlined in the Snowpipe documentation to automate. I also want to persist these files on S3 independent of Snowflake. I have two choices on how to ingest from S3:
s3://data-lake/2020-06-02/objects
s3://data-lake/2020-06-03/objects
...
s3://data-lake/2020-06-24/objects

or

s3://snowpipe-specific-bucket/objects
From a best practices / billing perspective, should I ingest directly from my data lake - meaning my 'CREATE OR REPLACE STORAGE INTEGRATION' and 'CREATE OR REPLACE STAGE' statements reference the top-level 's3://data-lake' above? Or should I create a dedicated S3 bucket for the Snowpipe ingestion and expire the objects in that bucket after a day or two?
Does Snowpipe have to do more work (and hence bill me more) to ingest if I give it a top-level folder that holds many thousands of objects, than if I give it a small, tightly controlled, dedicated folder with only a few objects in it? Does the S3 notification service tell Snowpipe what is new when the notification goes out, or does Snowpipe have to do a LIST and compare it against the list of objects already ingested?
The documentation at https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html doesn't offer any specific guidance for this case.

The INTEGRATION receives a message from AWS whenever a new file is added. If that file matches the file format, file path, etc. of your STAGE, then the COPY INTO statement from your pipe is run on that file.
There is minimal overhead for the integration to receive extra messages that do not match your STAGE filters, and no overhead that I know of for other files in that source.
So I am fairly certain that this will work fine either way as long as your STAGE is set up correctly.
We have been running a similar setup for the last six months: roughly 5,000 permanent files per day land in a single Azure storage account, divided into directories that correspond to different Snowflake STAGEs, with no noticeable extra lag on the copying.
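For reference, here is a minimal sketch of the objects involved, issued through the Snowflake Python connector. The account, role ARN, and table names are placeholders rather than values from the question; the stage URL is what scopes Snowpipe to a bucket or prefix, and SHOW PIPES exposes the SQS channel that the S3 event notification must target.

    # Sketch only: placeholder account, role ARN, and table names.
    import snowflake.connector

    ddl = [
        # The integration governs which S3 locations Snowflake may read from.
        """
        CREATE OR REPLACE STORAGE INTEGRATION data_lake_int
          TYPE = EXTERNAL_STAGE
          STORAGE_PROVIDER = 'S3'
          ENABLED = TRUE
          STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-access'
          STORAGE_ALLOWED_LOCATIONS = ('s3://data-lake/')
        """,
        # The stage can point at the whole lake or a narrower prefix; only files
        # under its URL (and matching its file format) are candidates for the pipe.
        """
        CREATE OR REPLACE STAGE data_lake_stage
          URL = 's3://data-lake/'
          STORAGE_INTEGRATION = data_lake_int
          FILE_FORMAT = (TYPE = JSON)
        """,
        # AUTO_INGEST ties the pipe to an SQS queue; the COPY runs per notified file.
        """
        CREATE OR REPLACE PIPE data_lake_pipe AUTO_INGEST = TRUE AS
          COPY INTO raw_events FROM @data_lake_stage
        """,
    ]

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        role="SYSADMIN", database="RAW", schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        for stmt in ddl:
            cur.execute(stmt)
        # SHOW PIPES exposes the notification_channel (an SQS ARN) that the
        # S3 bucket's event notification needs to target.
        cur.execute("SHOW PIPES LIKE 'data_lake_pipe'")
        print(cur.fetchall())
    finally:
        conn.close()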

Related

When should we use SNOWPIPE?

We have files between 500 KB and 20 MB in a SharePoint portal. We would like to convert those files to CSV and then stage them for Snowflake. There is no real need for real-time ingestion. I am thinking of two options. Which option would be better?
Load the files (CSV) into the cloud provider's object storage, create an external stage, and then have a Python program scheduled every hour to ingest the data from the stage into a Snowflake table
Use SNOWPIPE
I am more inclined toward #1, primarily because I will have control over the warehouse. It also allows me to batch up the files and then load them into Snowflake.
If you don't need to load your source data in real time, option 1 makes more sense, but you need to manage and maintain it.
Option 2 is set up once and loads the files automatically, but it can be more costly because you don't have control over the warehouse usage.
I have a similar situation and use an option 1 style load.
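For what it's worth, here is a rough sketch of what option 1 could look like with the Python connector, assuming the CSV files are already landing on an external stage; the warehouse, stage, and table names are placeholders.

    import snowflake.connector

    def hourly_load():
        # Placeholder credentials, warehouse, stage, and table names.
        conn = snowflake.connector.connect(
            account="my_account", user="loader", password="...",
            warehouse="LOAD_WH", database="STAGING", schema="PUBLIC",
        )
        try:
            cur = conn.cursor()
            cur.execute("ALTER WAREHOUSE LOAD_WH RESUME IF SUSPENDED")
            # COPY INTO keeps load metadata and skips files it has already
            # loaded, so re-running over the same stage each hour is safe.
            cur.execute("""
                COPY INTO sharepoint_docs
                FROM @sharepoint_stage
                FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
                ON_ERROR = 'CONTINUE'
            """)
            # Suspend explicitly, or rely on the warehouse's AUTO_SUSPEND setting.
            cur.execute("ALTER WAREHOUSE LOAD_WH SUSPEND")
        finally:
            conn.close()

    if __name__ == "__main__":
        hourly_load()   # run from cron / your scheduler every hour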

Is it possible to use Snowflake user and named stages for long term data file storage?

Instead of creating our own S3 bucket, I'm wondering if I can just leverage user and named stages as ways to store data files (that may not be loaded into tables). Or are files in these stages automatically purged by Snowflake at times?
Internal stages are not managed by users, so anything you upload into an internal stage eventually has to be copied into a table if you want to query it.
If you are using the PUT command, you can LIST the files and then use the COPY INTO <table> command.
https://docs.snowflake.com/en/sql-reference/sql/put.html
Files kept in internal stages via the PUT operation will incur storage charges as applicable.
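A minimal sketch of that flow through the Python connector, with placeholder names: PUT uploads the local file to a named internal stage, LIST shows what is sitting there (and being billed as storage), and COPY INTO is what finally makes the data queryable.

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Create a named internal stage and upload a local file to it.
    cur.execute("CREATE STAGE IF NOT EXISTS my_internal_stage")
    cur.execute("PUT file:///tmp/customers.csv @my_internal_stage AUTO_COMPRESS = TRUE")

    # The staged files sit here (and are billed as storage) until loaded or removed.
    cur.execute("LIST @my_internal_stage")
    print(cur.fetchall())

    # To query the data, it has to be copied into a table.
    cur.execute("""
        COPY INTO customers
        FROM @my_internal_stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Optionally clean up the staged files once loaded.
    cur.execute("REMOVE @my_internal_stage PATTERN = '.*customers.*'")
    conn.close()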

AWS DynamoDB to S3 pipeline

The goal I wish to achieve is to generate a file export of the table so that it can afterwards be checked for data (monthly calculations). What I have done so far is create a backup using the Data Pipeline option from DynamoDB to an S3 bucket, but:
It is taking too long: the pipeline has been running for more than 24 hours, since the table I am exporting is 7 GB in DynamoDB size (which is compressed, so the backup will take even longer to finish);
I will need to do this monthly, which means I will only need the data between the first and last day of the month; while the pipeline can create a backup, I could not find an option to export only the changes made to the table within a specific time window;
The files that the pipeline exports are around 10 MB each, which means hundreds of files instead of a few larger ones (for example 100 MB or 1 GB files).
In this case I am interested in whether there is a different way to make a full backup of the current data and afterwards export only the month-to-month changes that were performed (something like a monthly incremental), and not end up with millions of 10 MB files.
Any comments, clarifications, code samples, corrections are appreciated.
Thanks for your time.
You have, basically, two options:
Implement your own logic with DynamoDB Streams and process the data yourself (a sketch of this follows below).
Use a combination of AWS Glue for ETL processing and, possibly, AWS Athena to query your data from S3. Be careful to use the Apache Parquet format for better query performance, and cache your results somewhere else.
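Here is a minimal sketch of option 1, assuming the table's stream is wired to a Lambda function; the bucket name and key layout are made up for illustration. Each invocation appends its batch of changes as a JSON-lines object under a year/month prefix, so a month's worth of changes can later be compacted or queried (e.g. by a Glue/Athena job, per option 2) instead of exporting the whole 7 GB table again.

    import json
    import os
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    BUCKET = os.environ.get("EXPORT_BUCKET", "my-ddb-changes-bucket")  # placeholder

    def handler(event, context):
        # Lambda handler attached to the table's DynamoDB stream.
        lines = []
        for record in event.get("Records", []):
            image = record.get("dynamodb", {}).get("NewImage")
            if image is None:
                continue  # e.g. REMOVE events carry no new image
            lines.append(json.dumps({
                "event": record.get("eventName"),   # INSERT / MODIFY
                "item": image,                      # DynamoDB attribute-value format
            }))
        if not lines:
            return

        # One JSON-lines object per invocation, partitioned by year/month so a
        # monthly job only has to touch that month's prefix.
        now = datetime.now(timezone.utc)
        key = (f"changes/{now:%Y/%m}/"
               f"{now:%Y%m%dT%H%M%S}-{context.aws_request_id}.jsonl")
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))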

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel like a waste to pay 24/7 for a bigger EC2 instance, so I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second temporary big instance whenever analysis needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive, and considering my limited knowledge I'm not 100% sure it is the ideal solution.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common use case for AWS, so I'm hoping to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
The answers by previous users are great. Let's break them down into options. It sounds like your initial stack is a custom SQL database you installed on an EC2 instance.
Option 1 - RDS Read Replicas
Move your DB to RDS. This would give you a lot of goodies, but the main one we are looking for is read replicas: if your reads per second grow, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and doesn't require many code changes.
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the underlying storage. EFS is a network service and will add some latency to every read/write operation. Depending on the database distribution you installed, it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in 'operation cost', but it means that you have to get acquainted with S3 and, more importantly, Athena (a rough sketch follows the option list). You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and later move to Apache Parquet for lower cost. (For more info on why that means savings, see https://aws.amazon.com/athena/pricing/)
Option 4 - RedShift
Redshift is for big data. I would wait until querying regular SQL becomes a problem (multiple seconds per query), and then I would start looking into it. Sure, it would let you query very cheaply, but you would probably have to set up a pipeline that listens to your SQL database (or is triggered by it) and then updates Redshift. The reason is that Redshift scales with your querying needs, and you can easily spin up multiple machines to make querying faster.
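To make option 3 a bit more concrete, here is a rough sketch using boto3's Athena client; the bucket names, table names, and column list are placeholders and assume the scrapers write JSON-lines objects under s3://my-scrape-bucket/raw/.

    import boto3

    athena = boto3.client("athena")

    def run(query):
        # Fire-and-forget for brevity; real code would poll get_query_execution().
        return athena.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": "default"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
        )["QueryExecutionId"]

    # External table over the raw JSON-lines objects the workers write.
    run("""
        CREATE EXTERNAL TABLE IF NOT EXISTS raw_scrapes (
            scraped_at string,
            url        string,
            payload    string
        )
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION 's3://my-scrape-bucket/raw/'
    """)

    # One-off conversion to Parquet: same data, far fewer bytes scanned per query.
    run("""
        CREATE TABLE scrapes_parquet
        WITH (format = 'PARQUET',
              external_location = 's3://my-scrape-bucket/parquet/')
        AS SELECT * FROM raw_scrapes
    """)

    # Analysis queries then run against the Parquet copy.
    run("SELECT count(*) FROM scrapes_parquet WHERE scraped_at >= '2020-06-01'")

The CTAS step is the "move to Parquet later" part: Athena bills by bytes scanned, so converting the raw JSON or CSV to Parquet is usually where the savings come from.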
As far as I can see, S3 and Athena are a good option for this. I am not sure about your concern about NOT using S3, but once you save the scraped data in S3, you can analyse it with Athena (pay-per-query model).
Alternatively, you can use Redshift to store and analyse the data, which has an on-demand option with a pricing model similar to EC2 on-demand.
Also, you could use Kinesis Data Firehose, which can be used to analyse data in near real time as you ingest it.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (e.g. what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly from S3, either by downloading and using them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when it is not being used. Better yet, launch it as a Spot instance to save money. (It means your system will need to cope with potentially being stopped mid-way.)
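A small sketch of that split (scraped data in S3, process state in DynamoDB), with placeholder bucket and table names:

    import json
    import time

    import boto3

    s3 = boto3.client("s3")
    progress = boto3.resource("dynamodb").Table("scrape-progress")  # placeholder table

    def store_scrape(source_id, payload):
        # The scraped payload goes to S3, keyed by source and timestamp, so the
        # workers stay stateless and can be scaled out or turned off freely.
        key = f"raw/{source_id}/{int(time.time())}.json"
        s3.put_object(
            Bucket="my-scrape-bucket",   # placeholder bucket
            Key=key,
            Body=json.dumps(payload).encode("utf-8"),
        )
        # Only the small process state (what was scraped, when, where it went)
        # lives in DynamoDB.
        progress.put_item(Item={
            "source_id": source_id,
            "last_scraped_at": int(time.time()),
            "last_object_key": key,
        })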

Database for large data files and streaming

I have a "database choice" and arhitecture question.
Use-case:
Clients will upload large .json files (or another format like .tsv; it is irrelevant) where each line is data about their customers (e.g. name, address, etc.).
We need to stream this data later on to process it and store the results, which will also be a large file where each line is data about a customer (approximately the same as the uploaded file).
My requirements:
Streaming should be as fast as possible (e.g. > 1000 rows per second), and we could have multiple processes running in parallel (for multiple clients).
The database should be scalable and fault tolerant. Because many GB of data could easily be uploaded, it should be easy for me to automatically add new commodity instances (using AWS) if storage runs low.
The database should have some kind of replication, because we don't want to lose data.
No index is required since we are just streaming data.
What would you suggest as the database for this problem? We tried uploading the data to Amazon S3 and letting it take care of scaling etc., but there is a problem with slow reads/streaming.
Thanks,
Ivan
Initially uploading the files to S3 is fine, but then pick them up and push each line to Kinesis (or MSK or even Kafka on EC2s if you prefer); from there, you can hook up the stream processing framework of your choice (Flink, Spark Streaming, Samza, Kafka Streams, Kinesis KCL) to do transformations and enrichment, and finally you’ll want to pipe the results into a storage stack that will allow streaming appends. A few obvious candidates:
HBase
Druid
Keyspaces for Cassandra
Hudi (or maybe LakeFS?) on top of S3
Which one you choose is kind of up to your needs downstream in terms of query flexibility, latency, integration options/standards, etc.
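As a sketch of the front half of that pipeline (the stream name, bucket, and partition-key choice are placeholders), a small job triggered by the S3 upload can read the file line by line and push the lines into Kinesis in batches:

    import boto3

    s3 = boto3.client("s3")
    kinesis = boto3.client("kinesis")

    def push_file_to_stream(bucket, key, stream_name="customer-records"):
        # Read the uploaded file line by line and push the lines to Kinesis,
        # batching into PutRecords calls of up to 500 records (the API limit).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]
        batch = []
        for line in body.iter_lines():
            if not line:
                continue
            batch.append({
                "Data": line,
                # The partition key spreads records across shards; using a
                # customer id here would keep each customer's records ordered.
                "PartitionKey": str(hash(line) % 1000),
            })
            if len(batch) == 500:
                _flush(stream_name, batch)
                batch = []
        if batch:
            _flush(stream_name, batch)

    def _flush(stream_name, records):
        # Real code should check FailedRecordCount in the response and retry
        # the individual records that were throttled.
        kinesis.put_records(StreamName=stream_name, Records=records)

From the stream onwards, the consumer (Flink, Spark Streaming, KCL, etc.) does the transformation and writes to whichever append-friendly store you pick from the list above.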
