Google App Engine Database : disk usage - google-app-engine

I have a Google App Engine application storing 135 MBytes in its datastore; however, when I check my quotas, it tells me that I'm using 76% of my free 1 GB of stored data.
Is it because of the indexes? How can it use so much disk space?
Thanks

It could be due to indexes. Every property (with the exception of some types) has "single property" indexes unless you explicitly disable indexing of that property. Since the indexes store the property name and the value, the impact on storage space can be quite significant. If you would like statistics on your index usage, star issue 2740.
If you are using a lot of tasks, your stored task bytes also count against your storage usage. Also note that blobstore usage counts against your storage quota.
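To get a feel for how index rows can dwarf the entities themselves, here is a rough back-of-the-envelope sketch. It is not the datastore's real accounting: the function name, the byte sizes, and the assumption of two index rows per indexed property (one ascending, one descending) are all illustrative, and 1 GB is taken as 1024 MB.

```python
def index_overhead_bytes(num_entities, indexed_props, name_bytes,
                         value_bytes, key_bytes, rows_per_property=2):
    """Very rough size of the built-in single-property index rows.

    Assumption: each index row stores the property name, the value and
    the entity key, and each indexed property produces
    `rows_per_property` rows per entity.
    """
    row_bytes = name_bytes + value_bytes + key_bytes
    return num_entities * indexed_props * rows_per_property * row_bytes


# Numbers from the question: 135 MB of entities, 76% of a 1 GB quota used.
entity_mb = 135
reported_mb = 0.76 * 1024
overhead_ratio = reported_mb / entity_mb  # reported usage vs. raw entity size

# Hypothetical workload, only to show how the estimate scales.
estimate = index_overhead_bytes(
    num_entities=500000, indexed_props=10,
    name_bytes=12, value_bytes=20, key_bytes=60)
```

With these made-up sizes, ten indexed properties on half a million small entities already account for several hundred megabytes of index rows, which is the same order of magnitude as the gap in the question.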

Related

Caching/storage strategies for implementing a word dictionary in App Engine

I am looking at using a spell checker for my GAE app and we have an algorithm already for spell checking, but I'm trying to figure out how to best store and load dictionary files for best performance.
I am considering the following strategies:
1) Place the dictionary data in one or more text files in local App Engine storage and read them using standard IO methods (open(), read(), etc.)
2) Place the dictionary data in GCS and read it using GCS IO methods
3) Place the dictionary data in an ndb.Model and load/cache the information
One cache I don't quite understand is the context cache: is this a cache that is attached to a given instance? I.e., if I have a resident instance that is spun up, can I go ahead and load the dictionary data into the instance's RAM, so that accessing the data is extremely fast (microsecond vs. millisecond seek/get times)? The dictionary data will probably be a sharded list of some sort that we'll optimize for performance. Are there other data storage methods/structures I'm not considering here that may be more appropriate? Thanks.
Cache (or, to give it its full name, memcache) isn't exactly RAM, but it is similar. When used with NDB it acts like a buffer. When you write, the data goes to memcache first and then to the DB. Though this may sound slower, it's not, as writes to the DB take a while before they are accessible. When NDB reads, it checks memcache first: if the data exists there it uses it; otherwise it pulls the data from the DB, stores it in memcache, and then gives it to you. Just like RAM, though, it's volatile, so you cannot guarantee that information is always available; it is limited in size (depending on what type of instance you have) and can be flushed with no warning or reason. You can read more here:
https://developers.google.com/appengine/docs/python/memcache/
https://developers.google.com/appengine/articles/scaling/memcache
Ultimately, memcache will be the fastest and most accessible option, as it is shared amongst all your instances, so if one instance pulls some data from the datastore then all of them can access it quickly. Even if the data is not in memcache, it is still the fastest of all the options, as the other ones will fill up your memory and may cause errors and performance issues.
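The read path described in this answer (check the cache, fall back to the datastore, then populate the cache) can be sketched in plain Python. This is only an illustration: the class below is hypothetical, and ordinary dicts stand in for both memcache and the datastore, so none of it is the real App Engine API.

```python
class ReadThroughCache(object):
    """Toy read-through cache; dicts stand in for memcache and the datastore."""

    def __init__(self, backing_store):
        self._cache = {}             # stand-in for memcache (volatile)
        self._store = backing_store  # stand-in for the datastore
        self.store_reads = 0         # counts the slow lookups

    def get(self, key):
        if key in self._cache:       # cache hit: no datastore read
            return self._cache[key]
        self.store_reads += 1        # cache miss: go to the backing store
        value = self._store.get(key)
        if value is not None:
            self._cache[key] = value  # populate the cache for later readers
        return value

    def flush(self):
        """Simulate the cache being evicted without warning."""
        self._cache.clear()


words = ReadThroughCache({"apple": True, "banana": True})
first = words.get("apple")    # miss, read from the backing store
second = words.get("apple")   # hit, served from the cache
reads_before_flush = words.store_reads
words.flush()
third = words.get("apple")    # miss again after the flush
```

Because the cache can be flushed at any time, the code never treats a miss as an error; it simply falls back to the backing store and repopulates the cache.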

Reduce the size of Google app engine Datastore Stored Data

I'm using Google App Engine, and the size of the "Datastore Stored Data" is now close to exceeding the free quota limit, so I want to reduce the size of the data in the datastore by removing some entities.
I have tried deleting some entities that took up about 100 MB (about 10% of the 1 GB limit), but the dashboard still shows the earlier usage, which is still close to exceeding the free quota limit.
Please advise me how to reduce the datastore size.
Thanks in advance.
Nalaka
To reduce the size in your case:
1) NDB can compress properties, so you can create an object for the non-indexed properties and compress it: https://developers.google.com/appengine/docs/python/ndb/properties?hl=nl
2) I do not know your models, but an option is to distribute your models across several app IDs and create a web service to fetch entities from the other app IDs.
3) If it is only one model, you can keep the indexed properties in your primary app ID and fetch the rest of the data from the secondary app ID.
But of course, everything has a price: performance, URL fetches, CPU... So it is easy to run from one bottleneck into another.
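As a rough illustration of option 1, the sketch below bundles the non-indexed properties of an entity into one JSON blob and compresses it with zlib from the standard library. The helper names and the sample data are hypothetical, and the sketch only mimics the idea of a compressed property; it does not use NDB itself.

```python
import json
import zlib


def pack_unindexed(props):
    """Serialize the non-indexed properties to JSON and compress them."""
    return zlib.compress(json.dumps(props, sort_keys=True).encode("utf-8"))


def unpack_unindexed(blob):
    """Reverse of pack_unindexed."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical entity payload; repetitive text compresses very well,
# real data will usually compress less.
sample = {"body": "lorem ipsum dolor sit amet " * 200, "notes": "n/a"}

raw_size = len(json.dumps(sample, sort_keys=True).encode("utf-8"))
packed = pack_unindexed(sample)
packed_size = len(packed)
restored = unpack_unindexed(packed)
```

The trade-off mentioned above applies here too: the packed properties can no longer be queried, and every read pays a small CPU cost for decompression.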

Is GAE optimized for database-heavy applications?

I'm writing a very limited-purpose web application that stores about 10-20k user-submitted articles (typically 500-700 words). At any time, any user should be able to perform searches on tags and keywords, edit any part of any article (metadata, text, or tags), or download a copy of the entire database that is recent up-to-the-hour. (It can be from a cache as long as it is updated hourly.) Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity. This usage pattern is set in stone.
Is GAE a wise choice for this application? It appeals to me for its low cost (hopefully free), elasticity of scale, and professional management of most of the stack. I like the idea of an app engine as an alternative to a host. However, the excessive limitations and quotas on all manner of datastore usage concern me, as does the trade-off between strong and eventual consistency imposed by the datastore's distributed architecture.
Is there a way to fit this application into GAE? Should I use the ndb API instead of the plain datastore API? Or are the requirements so data-intensive that GAE is more expensive than hosts like Webfaction?
As long as you don't require full text search on the articles (which is currently still marked as experimental and limited to ~1000 queries per day), your usage scenario sounds like it would fit just fine in App Engine.
stores about 10-20k user-submitted articles (typically 500-700 words)
The maximum entity size in App Engine is 1 MB, so as long as the total size of an article is lower than that, it should not be a problem. Also, the cost of reading data is tied not to the size of the entity but to the number of entities read.
At any time, any user should be able to perform searches on tags and keywords.
Again, as long as the searches on tags and keywords are not full text searches, App Engine's datastore queries can handle this kind of search efficiently. If you want to search on both tags and keywords at the same time, you would need to build a composite index for both fields. This could increase your write cost.
download a copy of the entire database that is recent up-to-the-hour.
You could use a cron/scheduled task to schedule an hourly dump to the blobstore. The cron job could be targeted at a backend instance if your dump takes more than 60 seconds to finish. Do remember that with each dump you would need to read all entities in the database, which means 10-20k read ops per hour. You could add a timestamp field to your entity and have your dump servlet query only for anything newer than the last dump, to save read ops.
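The timestamp idea can be sketched in plain Python as follows. The in-memory list of dicts stands in for the datastore, and the field names ("id", "updated") and helper functions are hypothetical; in a real app the filter would be a datastore query on the timestamp property rather than a list comprehension.

```python
from datetime import datetime, timedelta


def changed_since(articles, last_dump):
    """Return only the articles edited after the previous dump."""
    return [a for a in articles if a["updated"] > last_dump]


def merge_dump(previous_dump, changed):
    """Overlay the changed articles on a copy of the previous dump, keyed by id."""
    merged = dict(previous_dump)
    for article in changed:
        merged[article["id"]] = article
    return merged


now = datetime(2013, 1, 1, 12, 0)
last_dump = now - timedelta(hours=1)

articles = [
    {"id": 1, "text": "old", "updated": now - timedelta(hours=5)},
    {"id": 2, "text": "edited", "updated": now - timedelta(minutes=10)},
]
previous_dump = {
    1: articles[0],
    2: {"id": 2, "text": "stale", "updated": now - timedelta(hours=3)},
}

changed = changed_since(articles, last_dump)       # only article 2 is re-read
current_dump = merge_dump(previous_dump, changed)  # the full, refreshed dump
```

Note that this sketch does not handle deleted articles; a real dump job would need some way to track those, for example a "deleted" flag that also updates the timestamp.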
Activity tends to happen in a few unpredictable spikes over a day (wherein many users download the entire database simultaneously, requiring 100% availability and fast downloads) and intermittent weeks of low activity.
This is where GAE shines: with this usage pattern, instances are only spun up when the load requires them, so your instance usage can be very efficient.
I don't think your application is particularly "database-heavy".
500-700 words is only a few KB of data.
I think GAE is a good fit.
You could store each article as a TextProperty on an entity, with tags in a ListProperty. For searching text you could use the search service https://developers.google.com/appengine/docs/python/search/ (which currently has quota limits).
I'm not 100% sure about downloading all the data, but I think you could store all the data in the blobstore (possibly as a PDF?) and then allow users to download that blob.
I would choose NDB over regular datastore, mostly for the built-in async functionality and caching.
Regarding staying below quota, it depends on how many people are accessing the site and how much data they download/upload.

GAE: About the Usage of High Replication Data

I'm using Google App Engine and High Replication Datastore.
I checked the Dashboard of one of my GAE apps today and found that High Replication Data had reached 52% (0.26 of 0.50 GBytes) in the Billing Status.
I don't use so much data for the app, so I also checked Datastore Statistics and Total number of entities is about 60,000 and Size of all entities is only 42 MBytes, which is far from 0.26 GBytes.
What is the difference between the Usage in the Dashboard and in the Datastore Statistics? And how can I reduce the former Usage?
Thank you.
That is because the datastore creates automatic indexes for your entities. In addition, if you have custom indexes, they will also need storage.
You can reduce this by removing unused indexes and by not indexing properties that are not needed for queries (setting indexed=false).
In general however, you need to get used to the idea that the storage for your entities is not the same as total storage needed for the datastore ;)

What does the "Blobstore Stored Data" in the quota details of Google Appengine refer to exactly?

I am trying to understand what the "Blobstore Stored Data" refers to. My app has about 4 GB of uploaded images into the blobstore (not datastore). However when I look at my quota details in appengine, I notice that the quota being used up is the "Total Stored Data". I was expecting to see the "Blobstore Stored Data" being used up instead (which in my case is still at 0%). Why is that the case?
Blobstore Stored Data represents what you think it does: the amount of data stored in the blobstore. There are two types of quota in App Engine, though: billing quotas and limits. "Total Stored Data" is a billing quota; "Blobstore Stored Data" is a limit. Limits tend to be set very high, mostly to prevent runaway apps and abuse; if you run out of one of them, we'll generally extend them for you. Storage in blobstore is counted towards both quotas.
It's likely that you're seeing 0% on "Blobstore Stored Data" because the limit is set high enough that you're not even using 1% of it. What is the actual value of that limit, as opposed to the percentage value?
