Is Google Datastore recommended for storing logs? - google-app-engine

I am investigating what might be the best infrastructure for storing log files from many clients.
Google App Engine offers a nice solution that doesn't make the process an IT nightmare: load balancing, sharding, servers, user authentication - all in one place with almost zero configuration.
However, I wonder if the Datastore model is the right fit for storing logs. Each log entry should be saved as a single document; each client uploads its documents on a daily basis, and a day's upload can consist of 100K log entries.
Plus, there are some limitations and questions that could break the requirements:
60-second timeout on bulk transactions - how many log entries per second will I be able to insert? If 100K won't fit into the 60-second frame, this will affect the design and the work that needs to be put into the server.
5 inserts per entity per second - is a transaction considered a single insert?
Post analysis - text search, searching for similar log entries across clients. How flexible and efficient is Datastore with these queries?
Real-time data fetch - getting all the recent log entries.
The other option is to deploy an Elasticsearch cluster on Google Compute Engine and write the server ourselves, fetching data from ES.
Thanks!

It's a bad idea to use Datastore, and even worse if you use entity groups with parent/child relationships, as a comment mentions when comparing performance.
Those numbers do not apply here, but Datastore is simply not designed for what you want.
BigQuery is what you want. It's designed for this, especially if you later want to analyze the logs in a SQL-like fashion. Any more detail requires that you ask a specific question, as it seems you haven't read much about either service.
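As an illustration only (not from the original answer), a minimal sketch of what streaming log rows into BigQuery could look like with the google-cloud-bigquery Python client, assuming a logs.entries table with a matching schema already exists; the project, dataset, and field names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # One JSON row per log entry; the table schema must already define
    # these columns (BigQuery is not schemaless).
    rows = [
        {
            "client_id": "client-42",
            "timestamp": "2015-06-01T12:00:00Z",
            "severity": "ERROR",
            "message": "disk quota exceeded",
        },
    ]

    # Streaming insert: rows become queryable within seconds.
    errors = client.insert_rows_json("my-project.logs.entries", rows)
    if errors:
        print("Some rows failed to insert:", errors)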

I do not agree. Datastore is a fully managed NoSQL document-store database; you can store the logs you want in this type of storage and query them directly in Datastore. The benefit of using it instead of BigQuery is that it is schemaless: in BigQuery you have to define the schema before inserting the logs, which is not necessary with Datastore. Think of Datastore as the MongoDB log-analysis use case in Google Cloud.
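As a rough sketch of that schemaless style (the kind and property names are invented, and this assumes the google-cloud-datastore Python client), a log entry could be written and queried like this:

    from datetime import datetime, timezone

    from google.cloud import datastore

    client = datastore.Client()

    # No schema to declare: properties can vary from entity to entity.
    entity = datastore.Entity(client.key("LogEntry"))  # incomplete key -> auto ID
    entity.update({
        "client_id": "client-42",
        "timestamp": datetime.now(timezone.utc),
        "severity": "ERROR",
        "message": "disk quota exceeded",
    })
    client.put(entity)

    # Direct querying, but only equality/range filters - no full-text search.
    query = client.query(kind="LogEntry")
    query.add_filter("severity", "=", "ERROR")
    recent_errors = list(query.fetch(limit=20))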

Related

google cloud datastore partitioning strategy

I am trying to use Google Cloud Datastore to store streaming data from IoT devices. Currently I am getting data from 10,000 devices at a rate of 2 rows (entities) per minute per device. Data entities would never be updated but purged at regular intervals. The backend code is in PHP.
Do I need to partition my data to get better performance, as I do at present in a MySQL table? Currently I am using table partitions based on a key.
If yes, should I use namespaces, with one namespace per device, or should I create one kind per device, such as "device_data_1", "device_data_2"?
Thanks
No, you do not need partitioning; Datastore performance is not impacted by the number of entities being written or read (as long as they're not in the same entity group, which has an overall write rate of 1/sec).
See also this somewhat related answer: Does Google Datastore have a provisioned capacity system like DynamoDB?
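To make that concrete, here is a hedged Python sketch (the question's backend is PHP, but the pattern is the same): keep every reading a root entity of a single kind, with the device ID as an ordinary property, instead of one namespace or kind per device. The kind and property names are invented:

    from datetime import datetime, timezone

    from google.cloud import datastore

    client = datastore.Client()

    def store_reading(device_id, payload):
        # Root entity (no parent): writes from different devices never
        # contend on a shared entity group.
        entity = datastore.Entity(client.key("DeviceData"))
        entity.update({
            "device_id": device_id,  # a filterable property, not a kind per device
            "created": datetime.now(timezone.utc),
            "payload": payload,
        })
        client.put(entity)

    def purge_older_than(cutoff):
        # Periodic purge: keys-only query, then batched delete.
        query = client.query(kind="DeviceData")
        query.add_filter("created", "<", cutoff)
        query.keys_only()
        keys = [entity.key for entity in query.fetch(limit=500)]
        if keys:
            client.delete_multi(keys)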

database to be used for log storing

I am starting up with a log monitoring tool which captures audit logs, firewall logs and many other logs. I have an issue choosing the right kind of database for this project, as at least 500 log entries are generated per second and have to be stored.
Let's assume that this should be able to support 1BN+ log entries per month. The two factors that will likely come into play most are the ability to write quickly and the ability to display reports quickly.
A common stack for log storage is the ELK stack, composed of Elasticsearch, Logstash, and Kibana.
Elasticsearch stores the documents and executes queries.
Logstash monitors and parses the logs.
Kibana creates reports based on the data in Elasticsearch.
There are other options out there. Splunk is a paid solution that does all of the above. Graylog is another solution that is similar to Splunk.
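As a quick illustration of the Elasticsearch side, a minimal sketch with the official elasticsearch Python client (recent client versions; the host, index names, and document fields are placeholders):

    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index one log document into a daily index.
    es.index(
        index="logs-2015.06.01",
        document={
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "source": "firewall",
            "message": "DROP TCP 10.0.0.5:443 -> 192.168.1.7:51234",
        },
    )

    # Full-text search across all daily indices.
    result = es.search(index="logs-*", query={"match": {"message": "DROP"}})
    for hit in result["hits"]["hits"]:
        print(hit["_source"]["message"])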

AppEngine & BigQuery - Where would you put stat/monitoring data?

I have an AppEngine application that processes files from Cloud Storage and inserts them into BigQuery.
Because I would like to know the health/performance of the application, both now and in the future, I would like to store stats data in either Cloud Datastore or a Cloud SQL instance.
I have two questions I would like to ask:
Cloud Datastore vs Cloud SQL - what would you use and why? What downsides have you experienced so far?
Would you use a task or a direct call to insert data, and why? Would you add a task and then have some consumer insert the data, or would you do a direct insert [regardless of the solution chosen above]? What downsides have you experienced so far?
Thank you.
Cloud SQL is better if you want to perform JOINs or SUMs later; Cloud Datastore will scale better if you have a lot of data to store. Also, in the Datastore, if you want to update a stats entity transactionally, you will need to shard it or you will be limited to 5 updates per second.
If the data to insert is small (one row in BigQuery or one entity in the Datastore), you can do it with a direct call, but you must accept that the call may fail. If you want to retry in case of failure, or if the data to insert is big and will take time, it is better to run it asynchronously in a task. Note that with tasks you must be cautious, because they can be run more than once, so the insert itself should be idempotent.
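One hedged sketch of how the insert inside such a task could be made safe to re-run, assuming BigQuery streaming inserts: the streaming API accepts a per-row insertId that it uses for best-effort de-duplication. The dataset, table, and field names here are invented:

    from google.cloud import bigquery

    client = bigquery.Client()

    def insert_stat(work_id, row):
        # Derive the insertId from something unique to this unit of work
        # (e.g. the name of the processed file), so a retried task reuses
        # it and BigQuery drops the duplicate row (best effort).
        errors = client.insert_rows_json(
            "my-project.stats.app_metrics",
            [row],
            row_ids=[work_id],
        )
        if errors:
            # Raise so the task queue retries; the insertId keeps it safe.
            raise RuntimeError("BigQuery insert failed: %s" % errors)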

Is GAE optimized for database-heavy applications?

I'm writing a very limited-purpose web application that stores about 10-20k user-submitted articles (typically 500-700 words). At any time, any user should be able to perform searches on tags and keywords, edit any part of any article (metadata, text, or tags), or download a copy of the entire database that is recent up-to-the-hour. (It can be from a cache as long as it is updated hourly.) Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity. This usage pattern is set in stone.
Is GAE a wise choice for this application? It appeals to me for its low cost (hopefully free), elasticity of scale, and professional management of most of the stack. I like the idea of an app engine as an alternative to a host. However, the excessive limitations and quotas on all manner of datastore usage concern me, as does the trade-off between strong and eventual consistency imposed by the datastore's distributed architecture.
Is there a way to fit this application into GAE? Should I use the ndb API instead of the plain datastore API? Or are the requirements so data-intensive that GAE is more expensive than hosts like Webfaction?
As long as you don't require full text search on the articles (which is currently still marked as experimental and limited to ~1000 queries per day), your usage scenario sounds like it would fit just fine in App Engine.
"stores about 10-20k user-submitted articles (typically 500-700 words)"
The maximum entity size in App Engine is 1 MB, so as long as the total size of an article stays below that, it should not be a problem. Also, the cost of reading data is not tied to the size of the entity but to the number of entities being read.
"At any time, any user should be able to perform searches on tags and keywords."
Again, as long as the searches on tags and keywords are not full-text searches, App Engine's datastore queries can handle these kinds of searches efficiently. If you want to search on both tags and keywords at the same time, you would need to build a composite index on both fields, which could increase your write cost.
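As an illustration (the model and names are invented, and whether the composite index is strictly required can depend on the query planner), the ndb query and the matching index.yaml entry might look like:

    from google.appengine.ext import ndb

    class Article(ndb.Model):
        text = ndb.TextProperty()
        tags = ndb.StringProperty(repeated=True)
        keywords = ndb.StringProperty(repeated=True)

    # Filtering on both properties at once may need a composite entry
    # in index.yaml, roughly:
    #
    #   indexes:
    #   - kind: Article
    #     properties:
    #     - name: tags
    #     - name: keywords
    #
    articles = Article.query(
        Article.tags == "gae",
        Article.keywords == "datastore",
    ).fetch(20)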
"download a copy of the entire database that is recent up-to-the-hour."
You could use a cron/scheduled task to schedule an hourly dump to the blobstore. The cron could be targeted to a backend instance if your dump takes more than 60 seconds to finish. Do remember that with each dump you would need to read all entities in the database, which means 10-20k read ops per hour. You could instead add a timestamp field to your entity and have your dump servlet query only for anything newer than the last dump, to save read ops.
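A rough sketch of that incremental dump, with an invented Article kind and helper; the cron.yaml snippet is shown as a comment:

    from google.appengine.ext import ndb

    # cron.yaml, roughly:
    #
    #   cron:
    #   - description: hourly article dump
    #     url: /tasks/dump
    #     schedule: every 1 hours
    #     target: dump-backend   # a backend instance if the dump runs > 60s

    class Article(ndb.Model):
        text = ndb.TextProperty()
        updated = ndb.DateTimeProperty(auto_now=True)  # the timestamp field

    def dump_changed_since(last_dump_time):
        # Read only entities touched since the previous dump, then merge
        # them into the cached full dump instead of re-reading everything.
        return Article.query(Article.updated > last_dump_time).fetch()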
"Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity."
This is where GAE shines: you could have very efficient instance usage with GAE in this case.
I don't think your application is particularly "database-heavy".
500-700 words is only a few KB of data.
I think GAE is a good fit.
You could store each article as a TextProperty on an entity, with the tags in a ListProperty (a repeated property in ndb). For searching text you could use the Search service https://developers.google.com/appengine/docs/python/search/ (which currently has quota limits).
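For reference, a minimal sketch of that Search service (the Python API; the index name and document fields are invented):

    from google.appengine.api import search

    index = search.Index(name="articles")

    # Index a search document alongside the datastore entity.
    index.put(search.Document(
        doc_id="article-123",
        fields=[
            search.TextField(name="body", value="full article text ..."),
            search.AtomField(name="tag", value="gae"),
        ],
    ))

    # Full-text query over the indexed fields.
    for doc in index.search('body:"entity group" tag:gae'):
        print(doc.doc_id)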
Not 100% sure about downloading all the data, but I think you could store all the data in the blobstore (possibly as a PDF?) and then allow users to download that blob.
I would choose NDB over regular datastore, mostly for the built-in async functionality and caching.
Regarding staying below quota, it depends on how many people are accessing the site and how much data they download/upload.

Preventing duplicates with MapReduce to BigQuery pipeline

I was reading the answer by Michael to this post here, which suggests using a pipeline to move data from Datastore to Cloud Storage to BigQuery:
Google App Engine: Using Big Query on datastore?
I want to use this technique to append data to a BigQuery table. That means I need some way of knowing whether the entities have already been processed, so they don't get repeatedly submitted to BigQuery during MapReduce runs. I don't want to rebuild my table each time.
The way I see it, I have two options: I can put a flag on the entities, update it when each entity is processed, and filter on it in subsequent runs - or - I can save each entity to a new table and delete it from the source table. The second way seems superior, but I wanted to ask for opinions or see if there are any gotchas.
Assuming you have some stream of activity represented as entities, you can use query cursors to start one query where a prior one left off. Query cursors are perfect for the kind of incremental situation you've described, because they avoid the overhead of marking entities as having been processed.
I'd have to poke around a bit to see if App Engine MapReduce supports cursors (I suspect that it doesn't, yet).
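A hedged ndb sketch of that cursor pattern; both entity kinds here are invented, with a Checkpoint kind persisting the cursor between runs:

    from google.appengine.datastore.datastore_query import Cursor
    from google.appengine.ext import ndb

    class LogEntity(ndb.Model):
        created = ndb.DateTimeProperty(auto_now_add=True)

    class Checkpoint(ndb.Model):
        cursor = ndb.StringProperty()

    def process_new_entities(batch_size=500):
        # Resume from wherever the previous run stopped.
        checkpoint = Checkpoint.get_or_insert("bq-export")
        start = Cursor(urlsafe=checkpoint.cursor) if checkpoint.cursor else None

        entities, next_cursor, more = LogEntity.query().order(
            LogEntity.created
        ).fetch_page(batch_size, start_cursor=start)

        # ... hand `entities` to the BigQuery load step here ...

        if next_cursor:
            checkpoint.cursor = next_cursor.urlsafe()
            checkpoint.put()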
