I am new to AWS and need to decide between AWS DynamoDB and AWS S3.
I have a use case in which I need to fetch multiple items from the data source, update them, and put them back. I have searched and found that we can't perform a multi-get in S3.
Any suggestions would be helpful!
AWS DynamoDB and S3 serve different purposes.
DynamoDB is good for storing structured or semi-structured data. It has a storage size limit (each item must be less than 400 KB) but offers very fast access (single-digit millisecond latency).
S3 is good for storing files. Files can be read over HTTP with its REST API. It lets you store very large files (up to 5 TB each) with reasonable access speeds.
For some requirements, both services can be used together.
For example, if you need to store user profiles with a profile image, you can upload the image to S3 and store its link as an attribute in a user profiles table in DynamoDB.
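A minimal sketch of that pattern with boto3, assuming a hypothetical bucket named profile-images and a hypothetical table named UserProfiles:

    import boto3

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table("UserProfiles")  # hypothetical table name

    def save_profile(user_id, image_path):
        # Upload the image to S3 (bucket name and key layout are assumptions)
        key = f"avatars/{user_id}.jpg"
        s3.upload_file(image_path, "profile-images", key)

        # Store the profile record in DynamoDB with a pointer to the S3 object
        table.put_item(Item={
            "user_id": user_id,
            "avatar_s3_key": key,
        })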
I have used both DynamoDB and S3.
It depends on your application and the type of data. If you are going to use it for a real-time application, I would suggest DynamoDB: latency is better on DynamoDB compared to S3, and you can update data based on your key. If you are going to store and update images or other kinds of files, you can use S3 and save some money.
While I understand that you want to do CRUD operations (create, read, update and delete), it is crucial to consider the following factors to decide whether S3 or DynamoDB fits your use case.
[1] Data structure -> Are you storing whole objects, like documents, or tuples of data?
[2] Evolution of data -> How frequently will your data get updated?
[3] Concurrency -> How many concurrent reads or writes at a time? How many clients will be reading from and writing to the data store?
[4] Scalability -> Does your use case involve billions of objects which need to be retrieved with sub-second response times?
AWS S3 is a scalable storage service which not only helps you store data in a secure and organized way but also helps you manage its lifecycle in a cost-efficient way through other storage classes such as Infrequent Access and Glacier.
AWS DynamoDB is a NoSQL storage service which gives you highly concurrent reads/writes (which you can provision) as needed, and you pay only for those reads/writes. With primary and secondary (local or global) indexes, a highly distributed cluster pattern can be obtained for sub-second queries.
In some cases you can also use S3 and DynamoDB together by fronting them with Lambda functions, distributing your compute requirements across both storage services. Hope it helps!
I actually did a similar kind of experiment, and I would say that DynamoDB is the best choice because it has much faster read and write speeds than S3.
If you plan on updating items regularly, then it's best to use DynamoDB. DynamoDB will be able to update and delete records faster than S3. S3 has many facilities (backups) for your files in case your bucket gets corrupted or deleted. However, it takes time for updates and deletes to become visible in S3, so if the S3 bucket is regularly updated, you risk customers seeing information that was not meant to be seen. Only use S3 if you want an easy way to upload and download files with minimal need for deleting and updating, or if you don't care whether users see old data. Summary: reads of newly written objects are immediate in S3; updates and deletes take time to propagate across facilities.
DynamoDB is very fast and you can expect predictable performance. If your requirement is performance-oriented, then go for DynamoDB. You can also query or scan the data as required.
If you require more storage space and want to read files over REST, then go for S3. S3 is cheaper than DynamoDB, and you can set a lifecycle policy for files that you do not access frequently.
For use cases like the one you described, go with DynamoDB.
However, there is an exception to that.
The maximum item size in DynamoDB is 400 KB. For items bigger than 400 KB, it is recommended to store part of the information in S3 and add an attribute to your DynamoDB entry that references the S3 object.
For concurrent updates you can use a conditional update or an atomic counter, as sketched below.
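A minimal sketch of both techniques with boto3, assuming a hypothetical Items table with primary key id and a numeric version attribute:

    import boto3
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("Items")  # hypothetical table name

    # Conditional update: only write if the item still has the version we read earlier
    try:
        table.update_item(
            Key={"id": "item-1"},
            UpdateExpression="SET payload = :p, version = version + :one",
            ConditionExpression="version = :expected",
            ExpressionAttributeValues={":p": "new data", ":one": 1, ":expected": 3},
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            pass  # someone else updated the item first; re-read and retry

    # Atomic counter: increment a number without a read-modify-write cycle
    table.update_item(
        Key={"id": "item-1"},
        UpdateExpression="ADD update_count :one",
        ExpressionAttributeValues={":one": 1},
    )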
I would recommend DynamoDB instead of S3. DynamoDB works on key-value pairs, like a hashmap: if you look up data using the correct key (partition key or index), the lookup is extremely fast; otherwise DynamoDB will scan the complete table, and you may also hit a provisioned-throughput exception if the table is large or the provisioned read capacity is low.
One read capacity unit lets you fetch 4 KB of data per second, so scanning a large table requires a high provisioned rate.
Please use DynamoDB in such a way that your data is fully indexed, meaning you look up data by partition and range key so that scans can be avoided (see the sketch below).
Each unit of read and write throughput costs money.
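To make the difference concrete, here is a minimal boto3 sketch, assuming a hypothetical DeviceData table with partition key device_id and numeric sort key ts:

    import boto3
    from boto3.dynamodb.conditions import Key, Attr

    table = boto3.resource("dynamodb").Table("DeviceData")  # hypothetical table name

    # Fast: a Query uses the partition (and range) key, so only matching items are read
    resp = table.query(
        KeyConditionExpression=Key("device_id").eq("sensor-42") & Key("ts").between(1000, 2000)
    )

    # Slow and throughput-hungry: a Scan reads every item in the table, then filters
    resp = table.scan(
        FilterExpression=Attr("device_id").eq("sensor-42")
    )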
Amazon S3 is an object store capable of storing very large objects. S3 is typically used for storing files such as images, logs, etc. DynamoDB is a NoSQL database that can be used as a key-value (schemaless) store. For simple data storage, S3 is the cheapest service.
DynamoDB has better performance, low cost, and higher scalability and availability.
DynamoDB is meant for metadata. It is extremely fast when you search by key, because internally it uses hashing to find the item in a collection (table). But its read and write operations depend on provisioned throughput: 1 read capacity unit means you can read 4 KB of data per second, and 1 write capacity unit means you can write 1 KB of data per second. The more throughput capacity you provision, the higher the AWS cost.
Hence I suggest using DynamoDB when you query data by key; it is not at all recommended for scanning data without a key (you will get throughput exceptions).
S3 is object-based storage, like a disk, not a database. Data sits in buckets; you can search it, but read-after-update is always much slower in S3.
Related
For a small personal project, I've been scraping some data every 5 minutes and saving it in a SQL database. So far I've been using a tiny EC2 AWS instance in combination with a 100 GB EBS volume. This has been working great for the scraping, but is becoming unusable for analysing the resulting data, as the EC2 instance doesn't have enough memory.
The data analysis only happens irregularly, so it would feel like a waste to pay 24/7 for a bigger EC2 instance, so I'm looking for something more flexible. From reading around I've learned:
You can't connect EBS to two EC2 instances at the same time, so spinning up a second temporary big instance whenever analysis is needed isn't an option.
AWS EFS seems like a solution, but it is quite a lot more expensive, and considering my limited knowledge, I'm not 100% sure it is the ideal solution.
The serverless options like Amazon Athena look great, but this is based on S3 which is a no-go for data that needs continuous updating (?).
I assume this is quite a common use case for AWS, so I'm hoping to get some pointers in the right direction. Are there options I'm overlooking that fit my problem? Is EFS the right way to go?
Thanks!
Answers by previous users are great. Let's break them down into options. It sounds to me like your initial stack is a custom SQL database you installed on EC2.
Option 1 - RDS Read Replicas
Move your DB to RDS. This gives you a lot of goodies, but the main one we are looking for is read replicas: if your reads per second grow, you can create additional read replicas and put them behind a load balancer. This setup is the lowest-hanging fruit and doesn't require many code changes.
Option 2 - EFS to Share Data between EC2 Instances
Using EFS is not straightforward, through no fault of EFS. Some databases save unique IDs to the filesystem, meaning you can't share the drive. EFS is a network service and will add some lag to every read/write operation. Depending on how your database distribution is installed, it might not even be possible.
Option 3 - Athena and S3
Having the workers save to S3 instead of SQL is also doable, but it means rewriting your web scraping tool. You can call S3 PutObject on the same key multiple times, and it will overwrite the previous object. Then you would need to rewrite your analytics tool to query S3 with Athena. This option is excellent, and it's likely the cheapest in operating cost, but it means you have to become acquainted with S3 and, more importantly, Athena. You would also need to figure out how you will save new data and the best file format for your application. You can start with regular JSON or CSV blobs and later move to Apache Parquet to lower cost. (For more info on why that saves money, see here: https://aws.amazon.com/athena/pricing/)
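A minimal sketch of that flow with boto3, assuming a hypothetical bucket scrape-data, an Athena database scrapes with a table results already defined over that bucket, and a hypothetical query-results location:

    import json
    import boto3

    s3 = boto3.client("s3")
    athena = boto3.client("athena")

    # Worker side: write each scrape as a JSON object (key layout is an assumption)
    record = {"ts": "2020-01-01T00:00:00Z", "value": 42}
    s3.put_object(
        Bucket="scrape-data",
        Key="raw/2020/01/01/0000.json",
        Body=json.dumps(record),
    )

    # Analysis side: run an SQL query over the bucket through Athena
    resp = athena.start_query_execution(
        QueryString="SELECT avg(value) FROM scrapes.results WHERE ts > timestamp '2020-01-01'",
        ResultConfiguration={"OutputLocation": "s3://scrape-data/athena-results/"},
    )
    # Poll get_query_execution / get_query_results with resp["QueryExecutionId"] for the output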
Option 4 - RedShift
Redshift is for big data. I would wait until querying regular SQL becomes a problem (multiple seconds per query), and then start looking into it. Sure, it would let you query very cheaply, but you would probably have to set up a pipeline that listens to SQL (or is triggered by it) and then updates Redshift. The reason is that Redshift scales with your querying needs, and you can easily spin up multiple machines to make querying faster.
As far as I can see, S3 and Athena are a good option for this. I am not sure about your concern about NOT using S3, but once you save the scraped data in S3 you can analyse it with Athena (a pay-per-query model).
Alternatively, you can use Redshift to store and analyse the data; it has an on-demand pricing model similar to EC2 on-demand pricing.
You may also use Kinesis Firehose, which can be used to analyse data in real time as you ingest it.
Your scraping workers should store data in Amazon S3. That way, worker instances can be scaled (and even turned off) without having to worry about data storage. Keep process data (e.g. what has been scraped, where to scrape next) in a database such as DynamoDB.
When you need to query the data saved to Amazon S3, Amazon Athena is ideal if it is stored in a readable format (CSV, ORC, etc).
However, if you need to read unstructured data, your application can access the files directly in S3, either by downloading them or by reading them as streams. For this type of processing, you could launch a large EC2 instance with plenty of resources, then turn it off when it is not being used. Better yet, launch it as a Spot instance to save money. (This means your system will need to cope with potentially being stopped mid-way.)
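A minimal sketch of the streaming approach with boto3, assuming a hypothetical bucket and key; iter_chunks avoids loading the whole object into memory:

    import boto3

    s3 = boto3.client("s3")

    # Stream a large object instead of reading it all into memory at once
    obj = s3.get_object(Bucket="scrape-data", Key="raw/2020/01/01/0000.json")
    for chunk in obj["Body"].iter_chunks(chunk_size=1024 * 1024):
        process(chunk)  # process() is a placeholder for your own handling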
Trying to use Google Cloud Datastore to store streaming data from IoT devices. Currently getting data from 10,000 devices at a rate of 2 rows (entities) per minute per device. Data entities would never be updated, but purged at regular intervals. The backend code is in PHP.
Do I need to partition my data to get better performance, as I do at present in MySQL? Currently I use table partitions based on key.
If yes, should I use namespaces, with one namespace per device, or should I create one kind per device, such as "device_data_1", "device_data_2"?
Thanks
No, you do not need partitioning; Datastore performance is not impacted by the number of entities being written or read (as long as they are not in the same entity group, which has an overall write limit of about 1 write per second).
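Your backend is PHP, but as an illustration in Python, here is a minimal sketch of writing device entities as root entities (no ancestor key), so the per-entity-group write limit does not apply; the kind and property names are assumptions:

    from google.cloud import datastore

    client = datastore.Client()

    # Each entity gets its own root key, so writes are not serialized into one entity group
    key = client.key("DeviceData")  # hypothetical kind; Datastore assigns the ID
    entity = datastore.Entity(key=key)
    entity.update({
        "device_id": "device-123",
        "reading": 21.5,
        "timestamp": "2020-01-01T00:00:00Z",
    })
    client.put(entity)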
See also these somewhat related answers: Does Google Datastore have a provisioned capacity system like DynamoDB?
I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication, all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right one for storing logs. Each log entry would be saved as a single document; each client uploads its documents on a daily basis and can produce 100K log entries per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions - how many log entries per second will I be able to insert? If 100K won't fit into the 60-second frame, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - is a transaction considered a single insert?
Post analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real-time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write our own server which fetches data from ES.
Thanks!
It's a bad idea to use Datastore, and even worse if you use entity groups with parent/child relationships, as a comment mentions when comparing performance.
Those numbers do not apply, but Datastore is simply not designed for what you want.
BigQuery is what you want. It is designed for this, especially if you later want to analyze the logs in an SQL-like fashion. Any more detail requires a more specific question, as it seems you haven't read much about either service.
I do not agree. Datastore is a fully managed NoSQL document store database; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB of a log-analysis use case on Google Cloud.
I am considering using S3 for back-end persistent storage.
However, depending on architecture choices, I predict some buckets may need to store billions of small objects.
How will GET Object and PUT Object perform under these conditions, assuming I am using UUIDs as keys? Can I expect O(1), O(logN), or O(n) performance?
Will I need to rethink my architecture and subdivide bigger buckets in some way to maintain performance? I need object lookups (GET) in particular to be as fast as possible.
Though it is probably meant for S3 customers with truly outrageous request volume, Amazon does have some tips for getting the most out of S3, based on the internal architecture of S3:
Performing PUTs against a particular bucket in alphanumerically increasing order by key name can reduce the total response time of each individual call. Performing GETs in any sorted order can have a similar effect. The smaller the objects, the more significantly this will likely impact overall throughput.
When executing many requests from a single client, use multi-threading to enable concurrent request execution.
Consider prefixing keys with a hash utilizing a small set of characters. Decimal hashes work nicely.
Consider utilizing multiple buckets that start with different alphanumeric characters. This will ensure a degree of partitioning from the start. The higher your volume of concurrent PUT and GET requests, the more impact this will likely have.
If you'll be making GET requests against Amazon S3 from within Amazon EC2 instances, you can minimize network latency on these calls by performing the PUT for these objects from within Amazon EC2 as well.
Source: http://aws.amazon.com/articles/1904
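To illustrate the key-prefix tip above, here is a minimal sketch, assuming keys are UUIDs as in the question; the two-character hex prefix length is an arbitrary choice:

    import hashlib
    import uuid

    def prefixed_key(object_id: str, prefix_len: int = 2) -> str:
        # Derive a short, evenly distributed hash prefix from the object id
        prefix = hashlib.md5(object_id.encode()).hexdigest()[:prefix_len]
        return f"{prefix}/{object_id}"

    key = prefixed_key(str(uuid.uuid4()))
    # yields keys like "3f/0b2c6d7e-..." so requests spread across many key prefixes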
Here's a great article from AWS that goes into depth about the hash prefix strategy and explains when it is and isn't necessary:
http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
Bottom line: Your plan to put billions of objects in a single bucket using UUIDs for the keys should be fine. If you have outrageous request volume, you might split it into multiple buckets with different leading characters for even better partitioning.
If you are going to be spending a lot of money with AWS, consider getting in touch with Amazon and talking through the approach with them.
S3 is like an external disk. Reads and writes (GET or PUT) depend on the object's size, regardless of the number of other objects in the bucket.
From the FAQ page:
Since Amazon S3 is highly scalable and you only pay for what you use, developers can start small and grow their application as they wish, with no compromise on performance or reliability. It is designed to be highly flexible: Store any type and amount of data that you want; read the same piece of data a million times or only for emergency disaster recovery; build a simple FTP application, or a sophisticated web application such as the Amazon.com retail web site. Amazon S3 frees developers to focus on innovation, not figuring out how to store their data.
If you want to know the time complexity of a key lookup in S3, it is difficult to say, since I don't know how it is implemented internally. But it is at least better than O(n): O(1) if it uses hashing, or O(log n) if it uses trees. Either way it is very scalable.
Bottom line: don't worry about that.
I am working on a financial database that I need to develop caching for. I have a MySQL database with a lot of raw, real-time data. This data is then provided over an HTTP API using Flask (Python).
Before the raw data is returned it is manipulated by my python code. This manipulation can involve a lot of data, therefore a caching system is in order.
The cached data never changes. For example, if someone queries data for the time range 2000-01-01 till now, the data will get manipulated, returned, and stored in the cache as the specifically manipulated data for 2000-01-01 till now. If the same manipulated data is queried again later, the cache will return the values from 2000-01-01 till the last time it was queried, eliminating the need to re-manipulate that entire period. Then it will manipulate only the new data from that point till now and add that to the cache as well.
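As an illustration of that scheme (not a recommendation of one particular store), here is a minimal sketch using Redis sorted sets, with each member's score set to the data point's timestamp so date-range retrieval becomes a range query; the key name and JSON encoding are assumptions:

    import json
    import redis

    r = redis.Redis()

    def cache_points(series_key, points):
        # points: list of (timestamp, manipulated_value); score = timestamp
        r.zadd(series_key, {json.dumps({"ts": ts, "v": v}): ts for ts, v in points})

    def cached_range(series_key, start_ts, end_ts):
        # Retrieve everything cached between the two timestamps
        return [json.loads(m) for m in r.zrangebyscore(series_key, start_ts, end_ts)]

    def last_cached_ts(series_key):
        # The highest-scored member tells us where the cache ends, so only newer
        # data needs to be manipulated and appended
        newest = r.zrange(series_key, -1, -1, withscores=True)
        return newest[0][1] if newest else None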
The data size shouldn't be enormous (under 5GB I would say at max).
I need to be able to retrieve from the cache using date ranges.
Which DB should I be looking at? MongoDB? Redis? CouchDB?
Thanks!
Using a big-data solution for such a small data set seems like a waste and might still not yield the required latency.
It seems like what you need is not one of the big-data solutions like MongoDB or CouchDB but a distributed cache (or in-memory data grid).
One of the leading solutions (which I'm one of the contributors to) that seems like a perfect match for your needs is XAP Elastic Caching.
For more details see: http://www.gigaspaces.com/datagrid
And you can find a post describing exactly this case on how you can use DataGrid to scale MySQL: "Scaling MySQL" - http://www.gigaspaces.com/mysql