Partitioning data for AWS Athena results in a lot of small files in S3

I have a large dataset (>40G) which I want to store in S3 and then use Athena for query.
As suggested by this blog post, I could store my data in the following hierarchical directory structure to enable using MSCK REPAIR to automatically add partitions when creating a table from my dataset.
s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/<PARTITION_COLUMN_NAME>=<VALUE>/
However, this requires me to split my dataset into many smaller data files, each stored under a nested folder determined by the partition keys.
Although using partitions could reduce the amount of data scanned by Athena and therefore speed up a query, would managing a large number of small files cause performance issues for S3? Is there a tradeoff here I need to consider?
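For concreteness, this is roughly what I have in mind, as a minimal sketch with hypothetical bucket, table, and partition names: write each file under its partition "folder" and then let Athena discover the partitions with MSCK REPAIR.
```python
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Hypothetical bucket, table prefix, and partition values.
bucket = "yourBucket"
partition_path = "pathToTable/year=2023/month=01"

# Each data file lands under its partition "folder".
s3.upload_file("part-0000.parquet", bucket, f"{partition_path}/part-0000.parquet")

# After the files are in place, ask Athena to pick up the new partitions.
athena.start_query_execution(
    QueryString="MSCK REPAIR TABLE my_table",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},
)
```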

Yes, you may experience a significant decrease in efficiency with small files and lots of partitions.
Here is a good explanation and suggestion on file sizes and the number of partitions; files should be larger than 128 MB to compensate for the overhead.
Also, I performed some experiments on a very small dataset (1 GB), partitioning my data by minute, hour, and day. The scanned data decreases as you make the partitions smaller, but the time spent on the query increases a lot (40 times slower in some experiments).
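As a rough sketch of the kind of compaction that helps (assuming Parquet files; the paths are hypothetical), you can merge the many small files in each partition into a few larger ones, aiming for roughly 128 MB or more per output file:
```python
import glob
import pandas as pd

# Hypothetical partition directory full of small Parquet files.
partition_dir = "data/year=2023/month=01"
small_files = glob.glob(f"{partition_dir}/*.parquet")

# Read all the small files and rewrite them as a single larger file.
merged = pd.concat((pd.read_parquet(f) for f in small_files), ignore_index=True)
merged.to_parquet(f"{partition_dir}/compacted-0000.parquet", index=False)
```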

I will try to get into it without veering too much into the realm of opinion.
For the use cases in which I have used Athena, 40 GB is actually a very small dataset by the standards of what the underlying technology (Presto) is designed to handle. According to the Presto web page, Facebook uses the underlying technology to query their 300 PB data warehouse. I routinely use it on datasets between 500 GB and 1 TB in size.
Considering the underlying S3 technology, S3 has been used to host Dropbox and Netflix, so I doubt most enterprises could come anywhere near taxing the storage infrastructure. Where you may have heard about performance issues with S3, it relates to websites storing many small pieces of static content in files scattered across S3. In that case, a delay in retrieving one of those small pieces of content might affect the user experience on the larger site.
Related Reading:
Presto

Related

Scale database that receives streaming data with small resources

My use case is the following: I run about 60 websockets from 7 data sources in parallel that record stock tickers (so time-series data). Currently, I'm writing the data into a mongodb that is hosted on a Google Cloud VM such that every data source has its own collection and all collections are hosted inside the same database.
However, the database has grown to 0.6 GB and ~10 million rows after only five days of data. I'm pretty new to such questions, but I have a feeling that this is not a viable long-term solution. I will never need all of the data at once, but I need all of the data in order to query by date / currency. However, as I understand it, those queries might become impossible once the dataset is bigger than my RAM; is that true?
Moreover, this is a research project, but unfortunately I'm currently not able to use a university cluster, so I'm hosting the data on a private VM. However, this is subject to a budget constraint, and highly performant machines quickly become very expensive. That's why I'm questioning my design choice. Currently, I'm thinking of either switching to another kind of database (but I fear I would run into the same issues again) or exporting the database once per week / month / whatever to CSV and wiping it. That would be quite a hassle, though, and I'm also scared of losing data.
So my question is, how can I design this database such that I can subset the data per one of the keys (either datetime or ticker_id) even when the database grows larger than my machine's RAM? Diskspace is not an issue.
On top of what Alex Blex already commented about storage and performance.
Query response time (in 5 days you already have close to 10M rows) will worsen as the dataset grows. You can look at sharding to break the collection down into reasonable chunks while still having access to all the data for query purposes.
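As a minimal sketch (database, collection, and connection string are made up), sharding on ticker_id plus datetime via pymongo would look roughly like this; a compound index on the same keys also keeps date/ticker queries usable once the data exceeds RAM:
```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection string
db = client["ticks"]

# Compound index so queries by ticker and date range avoid full collection scans.
db["quotes"].create_index([("ticker_id", ASCENDING), ("datetime", ASCENDING)])

# On a sharded cluster (connected via mongos), split the collection across
# shards using the same keys as the shard key.
client.admin.command("enableSharding", "ticks")
client.admin.command(
    "shardCollection", "ticks.quotes",
    key={"ticker_id": 1, "datetime": 1},
)
```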

Immutable database for a huge write volume

I'm building an application that needs to be created using an immutable database. I know about Datomic, but it's not recommended for huge data volumes (my application will have thousands of writes per second, or more).
I have already done some searching and couldn't find a similar database that does not have this "issue".
My application will use the event sourcing and microservices patterns.
Any suggestions about which database I should use?
Greg Young's Event Store appears to fit your criteria.
Stores your data as a series of immutable events over time.
Claims to be benchmarked at 15,000 writes per second and 50,000 reads per second.
Amazon's DynamoDB can scale to meet very high TPS demands. It can certainly handle tens to hundreds of thousands of writes per second sustained if your schema is designed properly, but it is not cheap.
Your question is a bit vague about whether you need to be able to sustain tens of thousands of writes per second, or you need to be able to burst to tens of thousands of writes. It's also not clear how you intend to read the data.
How big is a typical event/record?
Could you batch the writes?
Could you partition your writes?
Have you looked into something like Amazon's Kinesis Firehose? With small events you could have a relatively cheap ingestion pipeline and then perhaps use S3 for long term storage. It would certainly be cheaper than DynamoDB.
Azure offers similar services as well but I'm not as familiar with their offerings.
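If DynamoDB turns out to be a fit, batching the writes is a cheap win; here is a minimal boto3 sketch (the table name and item shape are hypothetical):
```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")  # hypothetical table name

# batch_writer buffers items and sends BatchWriteItem calls of up to 25 items,
# retrying unprocessed items for you.
with table.batch_writer() as batch:
    for i in range(1000):
        batch.put_item(Item={
            "stream_id": "orders",   # partition key (hypothetical schema)
            "sequence": i,           # sort key
            "payload": f"event-{i}",
        })
```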

Database and large Timeseries - Downsampling - OpenTSDB InfluxDB Google DataFlow

I have a project where we sample a "large" amount of data on a per-second basis. Some operations, such as filtering, are performed on it, and it then needs to be accessible at second, minute, hour, or day intervals.
We currently do this with an SQL-based system and software that updates different tables (daily averages, hourly averages, etc.).
We are currently looking at whether other solutions could fit our needs, and I came across several, such as OpenTSDB, Google Cloud Dataflow, and InfluxDB.
All seem to address time-series needs, but it is difficult to get information about the internals. OpenTSDB does offer downsampling, but it is not clearly specified how.
The concern is that since we can query a vast amount of data, for instance a year, a query may take a very long time if the DB downsamples at query time rather than using pre-computed aggregates.
As well, the downsampled data needs to be "updated" whenever "delayed" data points are added.
On top of that, upon data arrival we perform some processing (outlier filtering, calibration), and those intermediate results should not be written to disk. Several solutions could be used, like a RAM-based DB, but perhaps a more elegant solution exists that works together with the previous requirements.
I believe this application is not "extravagant" and that tools must exist to do this; I'm thinking of stock tickers, monitoring, and so forth.
Perhaps you have some good suggestions as to which technologies / DBs I should look into.
Thanks.
You can accomplish such use cases pretty easily with Google Cloud Dataflow. Data preprocessing and query optimization are among the major scenarios for Cloud Dataflow.
We don't provide a "downsample" primitive built-in, but you can write such a data transformation easily. If you are simply looking at dropping unnecessary data, you can just use a ParDo. For really simple cases, the Filter.byPredicate primitive can be even simpler.
Alternatively, if you are looking at merging many data points into one, a common pattern is to window your PCollection to subdivide it according to the timestamps. Then, you can use a Combine to merge elements per window.
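As an illustration of that windowing pattern, here is a minimal sketch using the Apache Beam Python SDK (which Cloud Dataflow runs); the element shape and the one-minute window size are assumptions:
```python
import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue

# Downsample per-second samples to per-minute means.
with beam.Pipeline() as p:
    (
        p
        # Hypothetical input: (timestamp_in_seconds, value) pairs.
        | beam.Create([(0, 1.0), (30, 2.0), (70, 4.0)])
        # Attach the event timestamp to each element.
        | beam.Map(lambda tv: TimestampedValue(tv[1], tv[0]))
        # Subdivide the collection into fixed one-minute windows...
        | beam.WindowInto(FixedWindows(60))
        # ...and combine the elements of each window into their mean.
        | beam.CombineGlobally(beam.combiners.MeanCombineFn()).without_defaults()
        | beam.Map(print)
    )
```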
Additional processing that you mention can easily be tacked along to the same data processing pipeline.
In terms of comparison, Cloud Dataflow is not really comparable to databases. Databases are primarily storage solutions with processing capabilities. Cloud Dataflow is primarily a data processing solution, which connects to other products for its storage needs. You should expect your Cloud Dataflow-based solution to be much more scalable and flexible, but that also comes with higher overall cost.
Dataflow is for inline processing as the data comes in. If you are only interested in summary and calculations, dataflow is your best bet.
If you want to later take that data and access it via time (time-series) for things such as graphs, then InfluxDB is a good solution though it has a limitation on how much data it can contain.
If you're OK with a 2-25 second delay on large data sets, then you can just use BigQuery along with Dataflow. Dataflow will receive, summarize, and process your numbers. Then you submit the result into BigQuery. Hint: divide your tables by DAYS to reduce costs and make re-calculations much easier.
We process 187 GB of data each night. That equals 478,439,634 individual data points (each with about 15 metrics and an average of 43,000 rows per device) for about 11,512 devices.
Secrets to BigQuery:
LIMIT your column selection. Don't ever do a select * if you can help it.
;)
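A minimal sketch of that advice with the google-cloud-bigquery client (project, dataset, table, and column names are made up): select only the columns you need and restrict the scan to the days you care about.
```python
from google.cloud import bigquery

client = bigquery.Client()

# Select only the needed columns and only the relevant days; on a
# day-partitioned table this keeps the scanned bytes small.
# _PARTITIONDATE assumes an ingestion-time partitioned table.
query = """
    SELECT device_id, metric, value
    FROM `my_project.telemetry.readings`
    WHERE _PARTITIONDATE BETWEEN '2023-01-01' AND '2023-01-07'
"""
for row in client.query(query).result():
    print(row.device_id, row.metric, row.value)
```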

Do Amazon S3 GET Object and PUT Object commands slow down at high object counts?

I am considering using S3 for back-end persistent storage.
However, depending on architecture choices, I predict some buckets may need to store billions of small objects.
How will GET Object and PUT Object perform under these conditions, assuming I am using UUIDs as keys? Can I expect O(1), O(logN), or O(n) performance?
Will I need to rethink my architecture and subdivide bigger buckets in some way to maintain performance? I need object lookups (GET) in particular to be as fast as possible.
Though it is probably meant for S3 customers with truly outrageous request volume, Amazon does have some tips for getting the most out of S3, based on the internal architecture of S3:
Performing PUTs against a particular bucket in alphanumerically increasing order by key name can reduce the total response time of each individual call. Performing GETs in any sorted order can have a similar effect. The smaller the objects, the more significantly this will likely impact overall throughput.
When executing many requests from a single client, use multi-threading to enable concurrent request execution.
Consider prefacing keys with a hash utilizing a small set of characters. Decimal hashes work nicely.
Consider utilizing multiple buckets that start with different alphanumeric characters. This will ensure a degree of partitioning from the start. The higher your volume of concurrent PUT and GET requests, the more impact this will likely have.
If you'll be making GET requests against Amazon S3 from within Amazon EC2 instances, you can minimize network latency on these calls by performing the PUT for these objects from within Amazon EC2 as well.
Source: http://aws.amazon.com/articles/1904
Here's a great article from AWS that goes into depth about the hash prefix strategy and explains when it is and isn't necessary:
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Bottom line: Your plan to put billions of objects in a single bucket using UUIDs for the keys should be fine. If you have outrageous request volume, you might split it into multiple buckets with different leading characters for even better partitioning.
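To make the hash-prefix idea concrete for keys that are not already random (the bucket and key layout below are hypothetical), a small sketch with boto3:
```python
import hashlib
import uuid

import boto3

s3 = boto3.client("s3")
bucket = "my-objects"  # hypothetical bucket


def make_key(object_id: str) -> str:
    """Preface the key with a short hash so keys spread across prefixes."""
    prefix = hashlib.md5(object_id.encode()).hexdigest()[:2]
    return f"{prefix}/{object_id}"


object_id = str(uuid.uuid4())
s3.put_object(Bucket=bucket, Key=make_key(object_id), Body=b"payload")

# Lookups recompute the same prefix from the id, so GETs need no listing.
obj = s3.get_object(Bucket=bucket, Key=make_key(object_id))
```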
If you are going to be spending a lot of money with AWS, consider getting in touch with Amazon and talking through the approach with them.
S3 is like an external disk, so read/write (GET or PUT) performance will depend on the object size, regardless of the number of other objects in the bucket.
From the FAQ page:
Since Amazon S3 is highly scalable and you only pay for what you use, developers can start small and grow their application as they wish, with no compromise on performance or reliability. It is designed to be highly flexible: Store any type and amount of data that you want; read the same piece of data a million times or only for emergency disaster recovery; build a simple FTP application, or a sophisticated web application such as the Amazon.com retail web site. Amazon S3 frees developers to focus on innovation, not figuring out how to store their data.
If you want to know the time complexity of an object lookup in S3, it is difficult to say, since I don't know how it is implemented. But it is at least better than O(n): O(1) if it uses hashes, or O(log n) if it uses trees. Either way, it is very scalable.
Bottom line: don't worry about it.

Need a storage solution that is scalable, distributed and can read data extremely fast and works with .NET

I currently have a data solution in RDBMS. The load on the server will grow by 10x, and I do not believe it will scale.
I believe what I need is a data store that is fault tolerant, scalable, and able to retrieve data extremely fast.
The Stats
Records: 200 million
Total Data Size (not including indexes): 381 GB
New records per day: 200,000
Queries per Sec: 5,000
Query Result: 1 - 2000 records
Requirements
Very fast reads
Scalable
Fault tolerant
Able to execute complex queries (conditions across many columns)
Range Queries
Distributed
Partition – Is this required for 381 GB of data?
Able to Reload from file
In-Memory (not sure)
Not Required
ACID - Transactions
The primary purpose of the data store is retrieve data very fast. The queries that will access this data will have conditions across many different columns (30 columns and probably many more). I hope this is enough info.
I have read about many different types of data stores that include NoSQL, In-Memory, Distributed Hashed, Key-Value, Information Retrieval Library, Document Store, Structured Storage, Distributed Database, Tabular and others. And then there are over 2 dozen products that implement these database types. This is a lot of stuff to digest and figure out which would provide the best solution.
It would be preferred that the solution run on Windows and is compatible with Microsoft .NET.
Based on the information above, does anyone have any suggestions, and why?
Thanks
So, what is your problem? I do not really see anything even nontrivial here.
Fast and scaling: grab a database (sorry, complex queries and columns = database) and get a NICE SAN - an HP EVA is great. I have seen one deliver 800 MB of random IO reads per second for a database, using 190 SAS discs. Fast enough for you? Sorry, but THIS is scalability.
A 400 GB database size is not remarkable by any means.
Grab a decent server. Supermicro has one with space for 24 discs in 2 rack units of height.
Grab a higher-end SAS RAID controller - Adaptec.
Plug in RealSSD drives in a RAID 10 configuration. You will be surprised - you will saturate the IO bus faster than you can say "ouch". Scalability is there with 24 disc slots and an IO bus that can handle 1.2 gigabytes per second.
Finally, get a pro to tune your database server(s). It's that simple. SQL Server is a lot more complicated to use properly than "OK, I just know how a SELECT should look" (without really knowing).
