google cloud datastore partitioning strategy - google-app-engine

I'm trying to use Google Cloud Datastore to store streaming data from IoT devices. I'm currently receiving data from 10,000 devices at a rate of 2 rows (entities) per minute per device. The entities are never updated, only purged at a regular interval. The backend code is in PHP.
Do I need to partition my data to get better performance, as I currently do with my MySQL table? At the moment I use table partitions based on a key.
If yes, should I use one namespace per device, or should I create one kind per device, such as "device_data_1", "device_data_2"?
Thanks

No, you do not need partitioning; Datastore performance is not impacted by the number of entities being written or read (as long as they're not in the same entity group, which has an overall write limit of about 1 write per second).
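For illustration, here is a minimal sketch of that pattern with the Python client library (the kind name and properties are made up; the same idea applies to the PHP client). Each entity gets its own root key with no ancestor, so the per-entity-group write limit never comes into play.

```python
from datetime import datetime, timezone

from google.cloud import datastore  # pip install google-cloud-datastore

client = datastore.Client()

def write_reading(device_id, payload):
    # Incomplete key with no ancestor: every entity is its own root,
    # so the ~1 write/sec per entity group limit does not apply.
    key = client.key("DeviceData")  # hypothetical kind name
    entity = datastore.Entity(key=key)
    entity.update({
        "device_id": device_id,
        "payload": payload,
        "created_at": datetime.now(timezone.utc),
    })
    client.put(entity)

write_reading("device-42", {"temp": 21.5})
```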
See also this somewhat related answer: Does Google Datastore have a provisioned capacity system like DynamoDB?

Related

Metrics collection and analysis architecture

We are working on HomeKit-enabled IoT devices. HomeKit is designed for consumer use and does not have the ability to collect metrics (power, temperature, etc.), so we need to implement metrics collection separately.
Let's say we have 10,000 devices. They send one collection of metrics every 5 seconds, so each second we need to receive 10,000/5 = 2,000 collections. The end user needs to see graphs of each metric over a specified period of time (1 week, month, year, etc.). So each day the system will receive 172.8 million records. This raises a lot of questions.
First of all, there's no need to store all the raw data, since the user only needs graphs over the specified period, so some aggregation is required. What database solution fits this? I believe no RDBMS will handle such an amount of data. And how do we compute average metric values to present to the end user?
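On the aggregation question specifically, here is a minimal sketch (my own illustration, not from any of the answers below) of downsampling raw samples into per-minute averages with pandas; the column names and values are placeholders, and the same resampling idea applies whatever storage backend ends up holding the raw data.

```python
import pandas as pd

# Hypothetical raw samples: one row per device reading, indexed by timestamp.
raw = pd.DataFrame(
    {
        "device_id": ["dev-1", "dev-1", "dev-1"],
        "power_w": [12.0, 14.0, 13.0],
        "temp_c": [21.1, 21.3, 21.2],
    },
    index=pd.to_datetime(
        ["2024-01-01 00:00:00", "2024-01-01 00:00:05", "2024-01-01 00:00:10"]
    ),
)

# Downsample to per-minute averages per device; only the aggregates
# need to be kept long-term to draw the graphs.
per_minute = (
    raw.groupby("device_id")[["power_w", "temp_c"]]
    .resample("1min")
    .mean()
)
print(per_minute)
```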
AWS has shared a time-series data processing architecture:
Very much simplified, I think of it this way:
Devices push data directly to DynamoDB using an HTTP API.
Metrics are stored in one table per 24 hours.
At the end of the day, a procedure runs on Elastic MapReduce and produces ready-made JSON files with the data required to show graphs per time period.
Old tables are stored in Redshift for further applications.
Has anyone already done something similar before? Maybe there is a simpler architecture?
This requires big data infrastructure such as:
1) Hadoop cluster
2) Spark
3) HDFS
4) HBase
You can use Spark to read the data as a stream. The streamed data can be stored in HDFS, a file system that allows you to store large files across the Hadoop cluster. You can use a MapReduce job to derive the required data set from HDFS and store it in HBase, the Hadoop database. HDFS is a distributed, scalable big data store for the raw records. Finally, you can use query tools to query HBase.
IoT data --> Spark --> HDFS --> MapReduce --> HBase --> query HBase.
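As a rough sketch of the ingest step of that pipeline, here is Spark Structured Streaming writing the incoming stream to HDFS as Parquet; the socket source, paths, and schema are placeholders I chose for illustration (in practice the source would more likely be Kafka or MQTT):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

# Placeholder schema for one metrics record.
schema = StructType([
    StructField("device_id", StringType()),
    StructField("metric", StringType()),
    StructField("value", DoubleType()),
    StructField("ts", TimestampType()),
])

# Placeholder source for the sketch.
raw = (
    spark.readStream
    .format("socket")
    .option("host", "ingest-host")
    .option("port", 9999)
    .load()
)

parsed = raw.select(from_json(col("value"), schema).alias("r")).select("r.*")

# Append the stream to HDFS as Parquet; a later batch job (MapReduce/Spark)
# can aggregate these files and load the results into HBase.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/iot/raw")
    .option("checkpointLocation", "hdfs:///data/iot/checkpoints")
    .start()
)
query.awaitTermination()
```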
The reason I am suggesting this architecture is scalability: the input data can grow with the number of IoT devices, and in the above architecture the infrastructure is distributed, so nodes can be added to the cluster without limit.
This is a proven architecture for big data analytics applications.

Store large IoT data at high frequency to the cloud

I am building an IoT device that will produce 200 KB of data per second, and I need to save this data to storage. I currently have about 500 devices, and I am trying to figure out the best way to store the data, and the best database for this purpose. In the past I have stored data in GCP's BigQuery and done processing with Compute Engine instance groups, but the size of the data was much smaller.
This is my best answer based upon the limited information in your question.
The first step is to document/describe what type of data you are processing. Is it structured data (SQL) or unstructured (NoSQL)? What type of queries do you need to make? How long do you need to store the data, and what is the expected total data size? This will determine the choice of the backend performing the query processing and analytics.
Next you need to look at the rate of data being transmitted. At 200 Kbits (or is it 200 KBytes?) times 500 devices, this is 100 Mbits (or 800 Mbits) per second. How valuable is the data, and how tolerant is your design of data loss? What is the data transfer rate for each device (cellular, wireless, etc.) and how reliable is the connection?
To push the data into the cloud I would use Pub/Sub. Then process the data to merge, combine, compress, purge, etc., and push it to Google Cloud Storage or to BigQuery (though other options may be better, such as Cloud SQL or Cloud Datastore/Bigtable). The answer for the intermediate processor depends on the previous questions, but you will need some horsepower to process that rate of data stream. Options might be Google Cloud Dataproc running Spark or Google Cloud Dataflow.
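As a sketch of the first hop, here is a device (or gateway) publishing a reading to Pub/Sub with the Python client; the project, topic, and payload fields are placeholders:

```python
import json

from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "iot-readings")  # placeholders

def publish_reading(device_id, reading):
    payload = json.dumps(reading).encode("utf-8")
    # Attributes let the downstream pipeline (Dataflow/Dataproc) filter
    # or route messages without parsing the payload.
    future = publisher.publish(topic_path, data=payload, device_id=device_id)
    return future.result()  # blocks until the message is accepted

publish_reading("sensor-001", {"ts": 1700000000, "temp_c": 21.4})
```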
There is a lot to consider for this type of design. My answer has created a bunch of questions, hopefully this will help you architect a suitable solution.
You could also look at IoT Core as a possible way to handle the load balancing piece (it auto-scales). There would be some up front overhead registering all your devices, but it also then handles secure connection as well (TLS stack + JWT encryption for security on devices using IoT Core).
With 500 devices at 200 KB/s, that sounds well within the capabilities of the system. Pub/Sub is the limiting factor, and it handles between 1-2M messages per second, so it should be fine.

AWS dynamodb over AWS S3 [closed]

I am new to AWS and need to decide between AWS DynamoDB and AWS S3.
I have a use case in which I need to fetch multiple items from the data source, update them, and put them back. I have searched and found that we can't perform a multi-get in S3.
Any suggestions would be helpful!
AWS DynamoDB and S3 serve different purposes.
DynamoDB is good for storing structured or semi-structured data. It has a limit on item size (each record must be less than 400 KB) but offers very high access speeds (single-digit milliseconds).
S3 is good for storing files. Files can be read over HTTP with its REST API. It lets you store very large files (up to 5 TB) with reasonable access speeds.
For some requirements both services can be used together.
e.g.
If you need to store user profiles with a profile image, you can upload the image to S3 and store the link as an attribute in the user profiles table in DynamoDB.
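A minimal sketch of that combined pattern with boto3; the bucket, table, and attribute names are made up for illustration:

```python
import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
profiles = dynamodb.Table("UserProfiles")  # hypothetical table

def save_profile(user_id, name, image_path):
    # Store the large binary in S3...
    image_key = f"avatars/{user_id}.png"
    s3.upload_file(image_path, "my-profile-images", image_key)  # hypothetical bucket

    # ...and keep only the small, queryable attributes (plus the S3 link) in DynamoDB.
    profiles.put_item(Item={
        "user_id": user_id,
        "name": name,
        "avatar_s3_key": f"s3://my-profile-images/{image_key}",
    })

save_profile("u-123", "Ada", "/tmp/ada.png")
```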
I have used both Dynamodb and S3.
It depends on your application and the type of data. If you are going to use it for a real-time application, I would suggest DynamoDB. Latency is better on DynamoDB compared to S3, and you can update data based on your key. If you are going to update images or other kinds of files, you can use S3 and save some money with it.
While I understand that you want to do CRUD operations (create, read, update, and delete), it is crucial to understand the following factors to decide whether S3 or DynamoDB fits your use case.
[1] Data structure -> Are you storing whole objects, like documents, or tuples of data?
[2] Evolution of data -> How frequently will your data get updated?
[3] Concurrency -> How many concurrent reads or writes at a time? How many clients will be reading and writing the data store?
[4] Scalability -> Does your use case involve billions of objects that need to be retrieved with sub-second response times?
AWS S3 is a scalable storage service which not only helps you store data in a secure and organized way but also helps manage its lifecycle in a cost-efficient way through other storage classes such as IA and Glacier.
AWS DynamoDB is a NoSQL storage service which gives you high-concurrency reads/writes (which you can provision as needed), and you pay only for those reads/writes. With primary and secondary (local or global) indexes, a highly distributed cluster pattern can be obtained for sub-second-response queries.
In some cases you can also use S3 and DynamoDB together, fronted by Lambda services that distribute your compute requirements across these storage services. Hope it helps!
I actually did a similar kind of experiment, and I would say that DynamoDB is the best choice because it has much faster read and write speeds than S3.
If you plan on updating items regularly, then it's best to use DynamoDB. DynamoDB will be able to update and delete records faster than S3. S3 replicates your files across facilities (backups) in case your bucket gets corrupted or deleted. However, it takes time for updates and deletes to become visible in S3. So, if the S3 bucket is regularly updated, you risk customers seeing information that was not meant to be seen. Only use S3 if you want an easy way to upload and download files with minimal need for deleting and updating, or if you don't care whether users see old data. Summary: writes and reads are instant in S3; updates and deletes take time to propagate across facilities.
DynamoDB is very fast and you can expect predictable performance. If your requirement is performance-oriented, go for DynamoDB. You can also query/scan the data as needed.
If you require more storage space and file reads over REST, go for S3. S3 is cheaper than DynamoDB, and you can set a lifecycle policy for files you do not access frequently.
For use cases like the one you described, go with DynamoDB.
However, there is an exception to that.
The maximum item size in DynamoDB is 400 KB, so for items bigger than 400 KB it is recommended to store part of the information in S3 and add an attribute to your DynamoDB entry that references the S3 object.
For concurrent updates you can use a conditional update or apply an atomic counter.
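As a sketch of what those two techniques look like with boto3, against a hypothetical table keyed by item_id with a numeric version attribute:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Items")  # hypothetical table

# Conditional update: only apply the change if the item still has the
# version we read earlier; otherwise DynamoDB rejects it and we can retry.
def update_if_unchanged(item_id, expected_version, new_status):
    try:
        table.update_item(
            Key={"item_id": item_id},
            UpdateExpression="SET #st = :new, #v = #v + :one",
            ConditionExpression="#v = :expected",
            ExpressionAttributeNames={"#st": "status", "#v": "version"},
            ExpressionAttributeValues={
                ":new": new_status,
                ":expected": expected_version,
                ":one": 1,
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # someone else updated the item first
        raise

# Atomic counter: ADD is applied server-side, so concurrent writers don't clash.
def increment_counter(item_id):
    table.update_item(
        Key={"item_id": item_id},
        UpdateExpression="ADD hit_count :inc",
        ExpressionAttributeValues={":inc": 1},
    )
```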
I recommend DynamoDB over S3. DynamoDB works on key-value pairs, like a hash map: if you look up data using the correct key (partition key or index), the lookup is extremely fast; otherwise DynamoDB will scan the complete table, and you may also hit a provisioned-throughput exception if the table is large or the provisioned read capacity is low.
One read capacity unit lets you fetch 4 KB of data per second, so scanning a large table needs a high provisioned rate.
Please use DynamoDB in such a way that the data is fully indexed, meaning you look up data by partition and range key so that scans can be avoided.
Each unit of read and write throughput costs several dollars.
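As a sketch of that key-based access pattern with boto3, assuming a hypothetical table with device_id as the partition key and ts as the sort key; a Query like this touches only the matching items, whereas a Scan reads the whole table:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("DeviceData")  # hypothetical table

# Query by partition key + range key: DynamoDB reads only the matching
# items, so it stays fast and cheap regardless of total table size.
response = table.query(
    KeyConditionExpression=(
        Key("device_id").eq("device-42") & Key("ts").between(1700000000, 1700003600)
    )
)
for item in response["Items"]:
    print(item)
```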
Amazon S3 is an object store capable of storing very large objects. S3 is typically used for storing files like images, logs, etc. DynamoDB is a NoSQL database that can be used as a key-value (schemaless record) store. For simple data storage, S3 is the cheapest service.
DynamoDB has better performance, low cost, and higher scalability and availability.
DynamoDB is meant for metadata. It is extremely fast when you search by key, since internally it uses hashing to find the item in the collection (table). But its read and write operations depend on provisioned throughput: 1 read capacity unit means you can read 4 KB of data per second, and 1 write capacity unit means you can write 1 KB of data per second. The more throughput capacity you provision, the higher the AWS cost.
Hence I suggest using DynamoDB when you query the data by key; it is not at all recommended for scanning data without a key (you will get throughput exceptions).
S3 is object-based storage, like a disk, not a database. Data lives in buckets; you can search it, but reading after an update is always much slower in S3.

Is google Datastore recommended for storing logs?

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication, all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right fit for storing logs. Each log entry would be saved as a single document; each client uploads its documents on a daily basis, which can amount to 100K log entries per client per day.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions - How many log entries per second will I be able to insert? If 100K won't fit into the 60-second window, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - Is a transaction considered a single insert?
Post-analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real-time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching data from ES.
Thanks!
It's a bad idea to use Datastore, and even worse if you use entity groups with parent/child relationships, as a comment mentions, when comparing performance.
Those numbers do not apply here, but Datastore simply is not designed for what you want.
BigQuery is what you want. It's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
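For reference, a minimal sketch of streaming log rows into BigQuery with the Python client; the dataset, table, and row fields are placeholders, and the table schema would have to be defined up front:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()
table_id = "my-project.logs.client_logs"  # hypothetical dataset/table

rows = [
    {"client_id": "c-17", "ts": "2024-01-01T12:00:00Z", "level": "ERROR",
     "message": "connection reset"},
]

# Streaming insert; rows become available for SQL-style analysis almost immediately.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("insert failed:", errors)
```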
I do not agree. Datastore is a fully managed NoSQL document-store database; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB-style option for a log analysis use case on Google Cloud.

Is GAE optimized for database-heavy applications?

I'm writing a very limited-purpose web application that stores about 10-20k user-submitted articles (typically 500-700 words). At any time, any user should be able to perform searches on tags and keywords, edit any part of any article (metadata, text, or tags), or download a copy of the entire database that is recent up to the hour. (It can be from a cache as long as it is updated hourly.) Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity. This usage pattern is set in stone.
Is GAE a wise choice for this application? It appeals to me for its low cost (hopefully free), elasticity of scale, and professional management of most of the stack. I like the idea of an app engine as an alternative to a host. However, the excessive limitations and quotas on all manner of datastore usage concern me, as does the trade-off between strong and eventual consistency imposed by the datastore's distributed architecture.
Is there a way to fit this application into GAE? Should I use the ndb API instead of the plain datastore API? Or are the requirements so data-intensive that GAE is more expensive than hosts like Webfaction?
As long as you don't require full text search on the articles (which is currently still marked as experimental and limited to ~1000 queries per day), your usage scenario sounds like it would fit just fine in App Engine.
stores about 10-20k user-submitted articles (typically 500-700 words)
The maximum entity size in App Engine is 1 MB, so as long as the total size of the article is lower than that, it should not be a problem. Also, the cost of reading data is not tied to the size of the entity but to the number of entities being read.
At any time, any user should be able to perform searches on tags and keywords.
Again, as long as the searches on tags and keywords are not full-text searches, App Engine's datastore queries can handle these kinds of searches efficiently. If you want to search on both tags and keywords at the same time, you would need to build a composite index on both fields. This could increase your write cost.
download a copy of the entire database that is recent up-to-the-hour.
You could use a cron/scheduled task to schedule an hourly dump to the blobstore. The cron job could be targeted at a backend instance if your dump takes more than 60 seconds to finish. Do remember that with each dump you would need to read all entities in the database, which means 10-20k read ops per hour. You could instead add a timestamp field to your entity and have your dump servlet query only for anything newer than the last dump, to save read ops.
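A sketch of that incremental approach with the Python ndb API; the model and property names are made up for illustration:

```python
from google.appengine.ext import ndb


class Article(ndb.Model):  # hypothetical model used by the dump job
    updated = ndb.DateTimeProperty(auto_now=True)
    # other properties omitted for the sketch


def changed_since(last_dump_time):
    # Read only the entities touched since the previous hourly dump,
    # instead of all 10-20k entities every time.
    return Article.query(Article.updated > last_dump_time).fetch()
```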
Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity.
This is where GAE shines, you could have very efficient instance usages with GAE in this case.
I don't think your application is particularly "database-heavy".
500-700 words is only a few KB of data.
I think GAE is a good fit.
You could store each article as a TextProperty on an entity, with tags in a ListProperty. For searching text you could use the Search service https://developers.google.com/appengine/docs/python/search/ (which currently has quota limits).
Not 100% sure about downloading all the data, but I think you could store all the data in the blobstore (possibly as a PDF?) and then allow users to download that blob.
I would choose NDB over regular datastore, mostly for the built-in async functionality and caching.
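A minimal ndb sketch of the model described above (names are illustrative, not from the answer):

```python
from google.appengine.ext import ndb


class Article(ndb.Model):  # illustrative model
    title = ndb.StringProperty()
    text = ndb.TextProperty()                 # article body (not indexed)
    tags = ndb.StringProperty(repeated=True)  # indexed, so tag queries work


def articles_with_tag(tag):
    # An equality filter on a repeated property matches any element of the list.
    return Article.query(Article.tags == tag).fetch(limit=100)

# Full-text keyword search over `text` would still go through the Search API,
# since the datastore only supports indexed-property queries like the one above.
```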
Regarding staying below quota, it depends on how many people are accessing the site and how much data they download/upload.
