Database for large data files and streaming - database

I have a "database choice" and architecture question.
Use-case:
Clients will upload large .json files (or another format such as .tsv; the format is irrelevant) where each line is data about one of their customers (e.g. name, address, etc.).
We need to stream this data later to process it and store the results, which will also be a large file where each line is data about one customer (approximately the same size as the uploaded file).
My requirements:
Streaming should be as fast as possible (e.g. > 1000 rps), and we could have multiple processes running in parallel (for multiple clients).
The database should be scalable and fault tolerant. Because many GB of data could easily be uploaded, it should be easy for me to automate adding new commodity instances (using AWS) when storage gets low.
The database should have some kind of replication because we don't want to lose data.
No index is required since we are just streaming data.
Which database would you suggest for this problem? We tried uploading the files to Amazon S3 and letting AWS take care of scaling etc., but reading/streaming from it was slow.
Thanks,
Ivan

Initially uploading the files to S3 is fine, but then pick them up and push each line to Kinesis (or MSK or even Kafka on EC2s if you prefer); from there, you can hook up the stream processing framework of your choice (Flink, Spark Streaming, Samza, Kafka Streams, Kinesis KCL) to do transformations and enrichment, and finally you’ll want to pipe the results into a storage stack that will allow streaming appends. A few obvious candidates:
HBase
Druid
Keyspaces for Cassandra
Hudi (or maybe LakeFS?) on top of S3
Which one you choose is kind of up to your needs downstream in terms of query flexibility, latency, integration options/standards, etc.
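A minimal sketch of the "pick them up and push each line" step, assuming the uploaded file is JSON lines. It only groups lines into batches shaped like the `Records` argument of a Kinesis `PutRecords` call; the `customer_id` partition-key field and the 500-record / 5 MiB limits are assumptions here, so check them against the current Kinesis quotas before relying on them.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Assumed per-request limits; verify against the current Kinesis quotas.
MAX_RECORDS_PER_BATCH = 500
MAX_BYTES_PER_BATCH = 5 * 1024 * 1024


def batch_lines(
    lines: Iterable[str],
    partition_key_field: str = "customer_id",  # hypothetical field name
    max_records: int = MAX_RECORDS_PER_BATCH,
    max_bytes: int = MAX_BYTES_PER_BATCH,
) -> Iterator[List[Dict[str, object]]]:
    """Group JSON lines into batches of {"Data", "PartitionKey"} records."""
    batch: List[Dict[str, object]] = []
    batch_bytes = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        key = str(record.get(partition_key_field, "unknown"))
        data = raw.encode("utf-8")
        size = len(data) + len(key.encode("utf-8"))
        # Start a new batch when either limit would be exceeded.
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append({"Data": data, "PartitionKey": key})
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be handed to something like boto3's `put_records(StreamName=..., Records=batch)`; retrying failed records is left out of this sketch.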

Related

Snowpipe Continuous Ingest From S3 Best Practices

I'm expecting to stream 10,000 (small, ~ 10KB) files per day into Snowflake via S3, distributed evenly throughout the day. I plan on using the S3 event notification as outlined in the Snowpipe documentation to automate. I also want to persist these files on S3 independent of Snowflake. I have two choices on how to ingest from S3:
s3://data-lake/2020-06-02/objects
/2020-06-03/objects
.
.
/2020-06-24/objects
or
s3://snowpipe specific bucket/objects
From a best practices / billing perspective, should I ingest directly from my data lake, meaning my 'CREATE or replace STORAGE INTEGRATION' and 'CREATE or replace STAGE' statements reference the top-level 's3://data-lake' above? Or should I create a dedicated S3 bucket for the Snowpipe ingestion, and expire the objects in that bucket after a day or two?
Does Snowpipe have to do more work (and hence bill me more) to ingest if I give it a top-level folder that has many thousands of objects in it than if I give it a small, tightly controlled, dedicated folder with only a few objects in it? Does the S3 notification service tell Snowpipe what is new when the notification goes out, or does Snowpipe have to do a LIST and compare it to the list of objects already ingested?
Documentation at https://docs.snowflake.com/en/user-guide/data-load-snowpipe-auto-s3.html doesn't offer up any specific guidance in this case.
The INTEGRATION receives a message from AWS whenever a new file is added. If that file matches the fileformat, file path, etc. of your STAGE, then the COPY INTO statement from your pipe is run on that file.
There is minimal overhead for the integration to receive extra messages that do not match your STAGE filters, and no overhead that I know of for other files in that source.
So I am fairly certain that this will work fine either way as long as your STAGE is set up correctly.
We have been using a similar setup with ~5000 permanent files per day into a single Azure storage account with files divided into different directories that correspond to different Snowflake STAGEs for the last 6 months with no noticeable extra lag on the copying.
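To illustrate why no LIST is needed: an S3 event notification already names the object that was created. The sketch below is not Snowpipe's implementation; it only shows, under that assumption, how a consumer could pick out new keys under a stage path from the standard S3 event shape. The function name and the `stage_prefix` parameter are hypothetical.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def new_keys_for_stage(event: Dict[str, Any], stage_prefix: str) -> List[str]:
    """Return keys of newly created objects that fall under stage_prefix."""
    keys: List[str] = []
    for record in event.get("Records", []):
        # Only creation events matter for ingestion.
        if not record.get("eventName", "").startswith("ObjectCreated"):
            continue
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(stage_prefix):
            keys.append(key)
    return keys
```

Messages for keys outside the prefix are simply skipped, which matches the "minimal overhead" described above.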

Continuously updated database shared between multiple AWS EC2 instances

For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100GB EBS storage. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel a waste to pay 24/7 to have a bigger EC2 instance, so I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second temporary big instance whenever analysis is needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive and, considering my limited knowledge, I'm not 100% sure this is the ideal solution.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common use case for AWS, so I'm hoping to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
Answers by previous users are great. Let's break them down in options. It sounds to me that your initial stack is a Custom SQL Database you installed in EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS. This would give you a lot of goodies, but the main one we are looking for is Read Replicas: if your read load grows, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and needs few code changes.
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the hard drive. EFS is a service and will add some lag to every read/write operation. Depending on how your database distribution is installed, it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 -> PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3. This option is excellent, and it's likely the cheapest in 'operation cost,' but it means that you have to be acquainted with S3, and more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and then later move to Apache Parquet for lower cost. (For more info on how that statement means savings see here: https://aws.amazon.com/athena/pricing/)
Option 4 - RedShift
RedShift is for big data. I would wait until querying regular SQL becomes a problem (multiple seconds per query) before looking into it. Sure, it would allow you to query very cheaply, but you would probably have to set up a pipeline that listens to SQL (or is triggered by it) and then updates RedShift. The reason is that RedShift scales depending on your querying needs, and you can spin up multiple machines easily to make querying faster.
As far as I can see, S3 and Athena are a good option for this. I am not sure why you are concerned about using S3, but once you save the scraped data in S3, you can analyse it with Athena (pay-per-query model).
Alternatively, you can use RedShift to store and analyse the data; it has on-demand pricing similar to the EC2 on-demand model.
Also, you may use Kinesis Firehose, which can be used to analyse data in real time as you ingest it.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (eg what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly from S3, either by downloading and using them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when it is not being used. Better yet, launch it as a Spot instance to save money. (This means your system will need to cope with potentially being stopped midway.)
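One way to reconcile S3 with data that arrives every 5 minutes is to write each scrape as its own object rather than updating one file. The sketch below only builds the object key; the `scrapes` prefix and the `dt=` date-partition naming are assumptions chosen for illustration, not something the answers above prescribe.

```python
from datetime import datetime, timezone


def scrape_object_key(scraped_at: datetime, prefix: str = "scrapes") -> str:
    """Build a date-partitioned key so every scrape is a new, append-only object.

    scraped_at must be timezone-aware; it is normalised to UTC.
    """
    ts = scraped_at.astimezone(timezone.utc)
    return f"{prefix}/dt={ts:%Y-%m-%d}/{ts:%H%M%S}.json"


key = scrape_object_key(datetime(2020, 6, 2, 12, 5, 0, tzinfo=timezone.utc))
print(key)
```

Grouping keys by day keeps each query's scan limited to the dates it actually needs, which matters when the query engine bills by data scanned.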

Metrics collection and analysis architecture

We are working on HomeKit-enabled IoT devices. HomeKit is designed for consumer use and does not have the ability to collect metrics (power, temperature, etc.), so we need to implement it separately.
Let's say we have 10 000 devices. They send one collection of metrics every 5 seconds, so each second we need to receive 10000/5 = 2000 collections. The end user needs to see graphs of each metric over a specified period of time (1 week, month, year, etc.). Each day the system will therefore receive 172.8 million records. This raises a lot of questions.
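The load estimate above can be checked with a few lines; the function name is only for illustration.

```python
from typing import Tuple


def records_per_day(devices: int, interval_seconds: int) -> Tuple[float, int]:
    """Return (collections per second, collections per day)."""
    per_second = devices / interval_seconds
    # Multiply before dividing so the daily total stays an exact integer.
    per_day = devices * 24 * 60 * 60 // interval_seconds
    return per_second, per_day


per_second, per_day = records_per_day(devices=10_000, interval_seconds=5)
print(per_second, per_day)
```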
First of all, there's no need to store all the data, as the user only needs graphs for the specified period, so some aggregation is needed. What database solution fits this? I believe no RDBMS will handle such an amount of data. Then, how do we get average values of the metrics to present to the end user?
AWS has shared a time-series data processing architecture:
Very simplified, I think of it this way:
Devices push data directly to DynamoDB using HTTP API
Metrics are stored in one table per 24 hours
At the end of the day some procedure runs on Elastic Map Reduce and produces ready JSON files with data required to show graphs per time period.
Old tables are stored in RedShift for further applications.
Has anyone already done something similar before? Maybe there is simpler architecture?
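The aggregation step in the question (reducing raw readings to averages per time period) can be sketched independently of the storage choice. This is a minimal in-memory version; the `(device_id, metric, unix_timestamp, value)` tuple layout and the one-hour bucket are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Reading = Tuple[str, str, int, float]  # (device_id, metric, unix_ts, value)
BucketKey = Tuple[str, str, int]       # (device_id, metric, bucket_start_ts)


def average_per_bucket(
    readings: Iterable[Reading], bucket_seconds: int = 3600
) -> Dict[BucketKey, float]:
    """Average each device's metric over fixed-size time buckets."""
    totals: Dict[BucketKey, list] = defaultdict(lambda: [0.0, 0])
    for device_id, metric, ts, value in readings:
        bucket_start = ts - ts % bucket_seconds  # floor to the bucket boundary
        acc = totals[(device_id, metric, bucket_start)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

At 5-second intervals, hourly buckets turn 720 raw readings per metric into one stored value, which is what makes week, month, or year graphs cheap to serve.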
This requires bigdata infrastructure like
1) Hadoop cluster
2) Spark
3) HDFS
4) HBase
You can use Spark to read the data as a stream. The streamed data can be stored in HDFS, which lets you store large files across the Hadoop cluster. You can use a map-reduce job to get the required data set from HDFS and store it in HBase, which is the Hadoop database. HDFS is a distributed, scalable big data store for the records. Finally, you can use query tools to query HBase.
IoT data --> Spark --> HDFS --> Map/Reduce --> HBase --> Query HBase.
The reason I am suggesting this architecture is scalability. The input data can grow with the number of IoT devices. In the above architecture, the infrastructure is distributed and the number of nodes in the cluster can grow without limit.
This is a proven architecture in big data analytics applications.

Store large IoT data at high frequency to the cloud

I am building an IoT device that will be producing 200Kb of data per second, and I need to save this data to storage. I currently have about 500 devices. I am trying to figure out the best way to store the data and the best database for this purpose. In the past I have stored data in GCP's BigQuery and done the processing using Compute Engine instance groups, but the size of the data was much smaller.
This is my best answer based upon the limited information in your question.
The first step is to document / describe what type of data you are processing. Is it structured data (SQL) or unstructured (NoSQL)? What type of queries do you need to make? How long do you need to store the data, and what is the expected total data size? This will determine the choice of the backend performing the query processing and analytics.
Next you need to look at the rate of data being transmitted. At 200 Kbits (or is it 200 KBytes?) times 500 devices, this is 100 Mbits (or 800 Mbits) per second. How valuable is the data, and how tolerant is your design of data loss? What is the data transfer rate for each device (cellular, wireless, etc.), and how reliable is the connection?
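The two readings of "200Kb" work out as follows; this small sketch uses decimal prefixes (1 Mbit = 1000 kbit), and the function name is only illustrative.

```python
def aggregate_mbit_per_s(per_device_kilo: float, devices: int, unit_is_bytes: bool) -> float:
    """Total ingest rate in Mbit/s for a per-device rate given in kbit/s or kB/s."""
    kbits_per_device = per_device_kilo * (8 if unit_is_bytes else 1)
    return kbits_per_device * devices / 1000


print(aggregate_mbit_per_s(200, 500, unit_is_bytes=False))  # if 200 means kilobits
print(aggregate_mbit_per_s(200, 500, unit_is_bytes=True))   # if 200 means kilobytes
```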
To push the data into the cloud I would use Pub/Sub. Then process the data to merge, combine, compress, purge, etc and push to Google Cloud Storage or to BigQuery (but other options may be better such as Cloud SQL or Cloud Datastore / BigTable). The answer for the intermediate processor depends on the previous questions but you will need some horsepower to process that rate of data stream. Options might be Google Cloud Dataproc running Spark or Google Cloud Dataflow.
There is a lot to consider for this type of design. My answer has created a bunch of questions, hopefully this will help you architect a suitable solution.
You could also look at IoT Core as a possible way to handle the load balancing piece (it auto-scales). There would be some up front overhead registering all your devices, but it also then handles secure connection as well (TLS stack + JWT encryption for security on devices using IoT Core).
With 500 devices and 200KB/s, that sounds well within the capabilities of the system to handle. Pub/Sub is the limiter, and it handles between 1-2M messages per second, so it should be fine.

Storage in Apache Flink

After processing those millions of events, where is the best place to store the information, assuming it is worth saving millions of events? I saw a pull request closed by this commit mentioning Parquet formats, but is HDFS the default? My concern is whether, after saving (where?), it is easy (fast!) to retrieve that data.
Apache Flink is not coupled with specific storage engines or formats. The best place to store the results computed by Flink depends on your use case.
Are you running a batch or streaming job?
What do you want to do with the result?
Do you need batch (full scan), point, or continuously streaming access to the data?
What format does the data have? flat structured (relational), nested, blob, ...
Depending on the answer to these questions, you can choose from various storage backends such as
- Apache HDFS for batch access (with different storage formats such as Parquet, ORC, custom binary)
- Apache Kafka if you want to access the data as a stream
- a key-value store such as Apache HBase and Apache Cassandra for point access to data
- a database such as MongoDB, MySQL, ...
Flink provides OutputFormats for most of these systems (some through a wrapper for Hadoop OutputFormats). The "best" system depends on your use case.
