Appengine NDB: Putting 880 rows, exceeding datastore write ops quota. Why?

I have an application which imports 880 rows into an NDB datastore, using put_async(). Whenever I run this import it exceeds the daily quota of 50,000 write ops to the datastore.
I'm trying to understand why this operation is so expensive and what can be done to stay under quota.
There are 13 properties, like so:
from google.appengine.ext import ndb

stringbool = ['true', 'false']

class BeerMenu(ndb.Model):
    name = ndb.StringProperty()
    brewery = ndb.StringProperty()
    origin = ndb.StringProperty()
    abv = ndb.FloatProperty()
    size = ndb.FloatProperty()
    meas = ndb.StringProperty()
    price = ndb.FloatProperty()
    active = ndb.StringProperty(default="false", choices=stringbool)
    url = ndb.StringProperty()
    bartender = ndb.StringProperty()
    lineno = ndb.IntegerProperty()
    purdate = ndb.DateProperty()
    costper = ndb.FloatProperty()
I've trimmed the indexing back to one:
- kind: BeerMenu
  properties:
  - name: brewery
  - name: name
According to the SDK datastore viewer, each row is 29 write ops, so that would generate 25520 writes! I'm assuming that the indexes consume the rest of the write ops, but I don't know exactly how many because AppEngine just says I've exceeded the quota.
What are the best strategies for reducing the number of write ops?

All properties except text and blob properties are indexed by default. So if you only deindex the string properties, all the float, int, and date properties are still indexed. You should add indexed=False to every property you don't need to query on to cut the writes.
Indexes listed in index.yaml are composite indexes, on top of the built-in per-property indexes; they are needed for more complex queries, such as a query that filters on one property and orders by another, or filters on several properties at once.
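If I remember the legacy pricing correctly, a put of a new entity costs roughly 2 writes, plus 2 writes per indexed property value, plus 1 write per composite index entry; with all 13 properties indexed and one composite index, that works out to 2 + 26 + 1 = 29, which matches what the SDK viewer reports. A minimal sketch of the deindexed model, assuming only brewery and name ever need to be filtered or sorted on:

# Sketch: keep only the queried properties indexed (assumption: queries
# only ever filter/sort on brewery and name).
class BeerMenu(ndb.Model):
    name = ndb.StringProperty()                     # stays indexed
    brewery = ndb.StringProperty()                  # stays indexed
    origin = ndb.StringProperty(indexed=False)
    abv = ndb.FloatProperty(indexed=False)
    size = ndb.FloatProperty(indexed=False)
    meas = ndb.StringProperty(indexed=False)
    price = ndb.FloatProperty(indexed=False)
    active = ndb.StringProperty(default="false", choices=stringbool,
                                indexed=False)
    url = ndb.StringProperty(indexed=False)
    bartender = ndb.StringProperty(indexed=False)
    lineno = ndb.IntegerProperty(indexed=False)
    purdate = ndb.DateProperty(indexed=False)
    costper = ndb.FloatProperty(indexed=False)

Under the same assumptions that should bring each row down to roughly 2 + 4 + 1 = 7 writes, or about 6,160 writes for the whole 880-row import.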

Here, check out the "Costs for Datastore Calls" part for a better idea of how these operations are billed:
Paid Apps: Budgeting, Billing, and Buying Resources
Hope this helps too.

Related

ndb Models are not saved in memcache when using MapReduce

I've created two MapReduce Pipelines for uploading CSV files to create Categories and Products in bulk. Each product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so based on the documentation, I would think they'd be automatically cached in Memcache when retrieved from the Datastore.
I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.
However, it doesn't seem like the Product upload is using Memcache to get the Categories. When I check the Memcache viewer in the portal, it says the hit count is around 180 and the miss count around 60. If I were uploading 3000 products and retrieving the category each time, shouldn't I have around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).
Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:
import csv

def product_bulk_import_map(data):
    """Product Bulk Import map function."""
    result = {"status": "CREATED"}
    product_data = data
    try:
        # parse input parameter tuple
        byteoffset, line_data = data
        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data
        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)
        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id=p_id,
                category=category.key,
                product_type=p_type,
                description=p_description
            )
        product.put()
        result["entity"] = product.to_dict()
    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)
    # return results
    yield (str(product_data), result)
MapReduce intentionally disables memcache for NDB.
See mapreduce/util.py ln 373, _set_ndb_cache_policy() (as of 2015-05-01):
def _set_ndb_cache_policy():
    """Tell NDB to never cache anything in memcache or in-process.

    This ensures that entities fetched from Datastore input_readers via NDB
    will not bloat up the request memory size and Datastore Puts will avoid
    doing calls to memcache. Without this you get soft memory limit exits,
    which hurts overall throughput.
    """
    ndb_ctx = ndb.get_context()
    ndb_ctx.set_cache_policy(lambda key: False)
    ndb_ctx.set_memcache_policy(lambda key: False)
You can force get_by_id() and put() to use memcache, e.g.:
product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However I don't know enough to say whether this has other undesired effects:
ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache were enabled (i.e., no in-context cache).
It actually seems like you want memcache to be enabled for puts, otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally fetches the entity from the datastore if it wasn't found in either of them. The in-context cache is just a Python dictionary whose lifetime and visibility are limited to the current request, but MapReduce makes multiple calls to product_bulk_import_map() within a single request.
You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
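To make that lookup order concrete, a small sketch (Category and c_id come from the question's code; the comments describe the default NDB behaviour per the caching docs):

category = Category.get_by_id(c_id)        # in-context cache -> memcache -> datastore
category_again = Category.get_by_id(c_id)  # same request: normally served from the
                                           # in-context cache with no RPC at all,
                                           # unless a cache policy disabled it, as
                                           # the MapReduce library does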

Google App Engine ndb performance on repeated property

Do I pay a penalty on query performance if I choose to query repeated property? For example:
class User(ndb.Model):
    user_name = ndb.StringProperty()
    login_providers = ndb.KeyProperty(repeated=True)

fbkey = ndb.Key("ProviderId", 1, "ProviderName", "FB")

for entry in User.query(User.login_providers == fbkey):
    # Do something with entry.key
vs
class User(ndb.Model):
    user_name = ndb.StringProperty()

class UserProvider(ndb.Model):
    user_key = ndb.KeyProperty(kind=User)
    login_provider = ndb.KeyProperty()

for entry in UserProvider.query(
        UserProvider.user_key == auserkey,
        UserProvider.login_provider == fbkey):
    # Do something with entry.user_key
Based on the documentation from GAE, it seems that Datastore takes care of the indexing, and that the first, less verbose option would use that index. However, I failed to find any documentation to confirm this.
Edit
The sole purpose of UserProvider in the second example is to create a one-to-many relationship between a user and its login_provider. I wanted to understand whether it's worth the trouble of creating a second entity instead of querying on the repeated property. Also, assume that all I need is the key of the User.
No, there's no query-time penalty. But the repeated property raises your write costs, because each value in it gets its own index entries, and write costs are based on the number of index entries updated.
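To put a rough number on that, a back-of-the-envelope sketch (assumption: the legacy pricing, where a new-entity put costs 2 writes plus 2 per indexed property value, and every item of a repeated property counts as one value):

# Hypothetical estimate of put cost for the first User model above
# (legacy pricing assumed: 2 writes + 2 per indexed property value).
def estimated_new_user_put_writes(num_login_providers):
    indexed_values = 1 + num_login_providers  # user_name + each repeated key
    return 2 + 2 * indexed_values

print(estimated_new_user_put_writes(1))   # 6 writes
print(estimated_new_user_put_writes(10))  # 24 writes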

reduce google datastore read operation fee

I have a Kind XXX_account with 1000 entities; the Kind is about 3 MB. Whenever I handle a request, a query has to be run to find a certain entity in the Kind. As a result, I think the Google fee comes to almost 4 USD in just 20 hours.
Is there any way to reduce the datastore read operations? I'm thinking of storing the 1000 entities in a txt file so that I don't need to read the datastore every time.
Datastore Read Operations: 5.01 million ops used, 4.96 million billable, at $0.70/million ops = $3.48
My model.py
class MyUser(DatastoreUser):
    pass

class XXXAccount(db.Model):
    user = db.ReferenceProperty(MyUser,
                                collection_name='xxx_accounts')
    id = db.StringProperty(required=True)
    created = db.DateTimeProperty(auto_now_add=True)
    updated = db.DateTimeProperty(auto_now=True)
    name = db.StringProperty(required=True)
    username = db.StringProperty(required=True)
    profile_url = db.StringProperty(required=True)
    aaa = db.StringProperty(required=True)
    bbb = db.StringProperty(required=True)
view.py
#login_required
def updateprofile(request):
    number_form = NumberForm()
    if request.method == "POST" and number_form.validate(request.form):
        acc_num_str = number_form['nb']
        acc_num = int(acc_num_str)
        current_user = request.user
        xxx_account = current_user.xxx_accounts[acc_num]  # Query
        # DO SOMETHING WHICH IS NOT RELATED TO DATASTORE READ/WRITE OPERATIONS
        return ......
UPDATE:
Code was posted
OMG, 0.32 USD for just 1000 requests.
You should post your model definition and the code where you query entities.
Common recommendations:
If you want to find a certain entity (or entities), there is only one right way to do it: fetch it by key (id number or key_name string). Datastore automatically assigns an id to an entity when it saves it, or you can set a meaningful key_name yourself when you create the entity.
To get an entity's id or key_name, use Model.key().id() or Model.key().name() with the DB API, or Model.key.id() with NDB.
Then you can get the entity by id or key_name with the Model.get_by_id() or Model.get_by_key_name() methods if you're using the old DB API, or with the Key.get() method if you're using the new NDB API. You can pass the id or key_name in the URL, e.g. http://example.com/getentity/[id].
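A minimal sketch against the asker's XXXAccount model (assumption: the form is changed to send the account's key_name or numeric id rather than an index into current_user.xxx_accounts):

# Sketch (old DB API): fetch directly by key instead of running a query.
xxx_account = XXXAccount.get_by_key_name(acc_key_name)  # one get, no query

# or, with the auto-assigned numeric id:
xxx_account = XXXAccount.get_by_id(acc_id)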
Also, use Memcache to cache entities. Caching can dramatically reduce Datastore usage. By the way, NDB does this caching automatically.
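With the old DB API that caching is manual; a minimal sketch using the standard memcache service (the cache key scheme here is just an illustrative assumption):

from google.appengine.api import memcache

# Check memcache first, fall back to the datastore and cache the result.
cache_key = 'xxx_account:%s' % acc_key_name   # hypothetical key scheme
xxx_account = memcache.get(cache_key)
if xxx_account is None:
    xxx_account = XXXAccount.get_by_key_name(acc_key_name)
    memcache.set(cache_key, xxx_account, time=3600)  # cache for an hour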
p.s. Sorry, I cannot post more than 2 links.

Should the size of entities be as small as possible when I count them by "count()" method?

I'm wondering if I should have a kind only for counting entities.
For example
There is a model like the following.
class Message(db.Model):
    title = db.StringProperty()
    message = db.StringProperty()
    created_on = db.DateTimeProperty()
    created_by = db.ReferenceProperty(User)
    category = db.StringProperty()
And there are 100,000,000 entities of this model.
I want to count the entities whose category equals 'book'.
In this case, should I create the following model for counting them?
class Category(db.Model):
    category = db.StringProperty()
    look_message = db.ReferenceProperty(Message)
Does this smaller model make counting faster?
And does it use less memory?
By the way, I'm planning to count them like this:
q = db.Query(Message).filter('category =', 'book')
count = q.count(10000)
Counting 100,000,000 entities is a very expensive operation on a NoSQL database such as the App Engine datastore. You'll probably want to count as you update, or run a map-reduce job to count after the fact.
App Engine also offers a simple way to query how many entities of each type you have:
https://developers.google.com/appengine/docs/python/datastore/stats
For example, to count all Messages:
from google.appengine.ext.db import stats
kind_stats = stats.KindStat().all().filter("kind_name =", "Message").get()
count = kind_stats.count
Note that stats are updated asynchronously, so they'll lag the actual count.
I think you have to create another entity like this, one that just counts the number of messages per category.
Change your Category model to this:
class Category(db.Model):
    category = db.StringProperty()
    totalOfMessages = db.IntegerProperty(default=0)
In the Message class, change the category property to reference the Category class:
category = db.ReferenceProperty(Category)
Whenever you create or delete a Message, update the counter: increment it when you create a message and decrement it when you delete one.
The best way to work with counters on GAE is using sharded counters; a rough sketch follows.
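A minimal sketch of the sharded-counter idea (the class and function names here are illustrative, not from the linked article; increments go to a random shard so writes don't contend on a single entity):

import random
from google.appengine.ext import db

NUM_SHARDS = 20

class MessageCounterShard(db.Model):
    category = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)

def increment(category):
    """Pick a random shard and bump it inside a transaction."""
    def txn():
        shard_name = '%s-%d' % (category, random.randint(0, NUM_SHARDS - 1))
        shard = MessageCounterShard.get_by_key_name(shard_name)
        if shard is None:
            shard = MessageCounterShard(key_name=shard_name, category=category)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)

def get_count(category):
    """Sum all shards for a category (cache this in memcache in practice)."""
    return sum(s.count for s in
               MessageCounterShard.all().filter('category =', category))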
Count is implemented as an index scan that discards all data except the number of records seen. It never looks up the entity, so the size of the entity does not matter.
That being said, counting like this does not scale and is quite costly in a system without a fixed schema. It would likely be better to use another method like a Sharded Counter, MapReduce or Materialized View/Fork Join. If you really want it to scale, this talk is pretty informative: http://www.google.com/events/io/2010/sessions/high-throughput-data-pipelines-appengine.html

How many Datastore reads do Fetch, Count and Query operations consume?

I'm reading on the Google App Engine groups about many users (Fig1, Fig2, Fig3) who can't figure out where the high number of Datastore reads in their billing reports comes from.
As you might know, Datastore reads are capped at 50K operations/day; above this budget you have to pay.
50K operations sounds like a lot, but unfortunately it seems that each operation (query, entity fetch, count, ...) hides several Datastore reads.
Is it possible to know, via an API or some other approach, how many Datastore reads are hidden behind the common RPC.get and RPC.runquery calls?
Appstats seems useless in this case because it gives just the RPC details and not the hidden reads cost.
Having a simple Model like this:
class Example(db.Model):
    foo = db.StringProperty()
    bars = db.ListProperty(str)
and 1000 entities in the datastore, I'm interested in the cost of these kinds of operations:
items_count = Example.all(keys_only = True).filter('bars=','spam').count()
items_count = Example.all().count(10000)
items = Example.all().fetch(10000)
items = Example.all().filter('bars=','spam').filter('bars=','fu').fetch(10000)
items = Example.all().fetch(10000, offset=500)
items = Example.all().filter('foo>=', filtr).filter('foo<', filtr+ u'\ufffd')
See http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost .
A query costs you 1 read plus 1 read for each entity returned. "Returned" includes entities skipped by offset or count.
So that is 1001 reads for each of these:
Example.all(keys_only = True).filter('bars=','spam').count()
Example.all().count(1000)
Example.all().fetch(1000)
Example.all().fetch(1000, offset=500)
For these, the number of reads charged is 1 plus the number of entities that match the filters:
Example.all().filter('bars=','spam').filter('bars=','fu').fetch()
Example.all().filter('foo>=', filtr).filter('foo<', filtr+ u'\ufffd').fetch()
Instead of using count you should consider storing the count in the datastore, sharded if you need to update the count more than once a second. http://code.google.com/appengine/articles/sharding_counters.html
Whenever possible you should use cursors instead of an offset.
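A minimal sketch of the cursor approach with the question's Example model (the paging logic is illustrative; the point is that a cursor resumes where the previous fetch ended, whereas an offset makes you pay a read for every skipped entity):

# Sketch: paging with a query cursor (DB API) instead of an offset.
query = Example.all().filter('bars =', 'spam')
page = query.fetch(500)                 # first page
cursor = query.cursor()                 # opaque string, safe to pass in a URL

# later / in the next request:
query = Example.all().filter('bars =', 'spam')
query.with_cursor(cursor)               # resume after the last fetched entity
next_page = query.fetch(500)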
Just to be sure about one of them: I'm fairly certain that
Example.all().count(10000)
uses small datastore operations (there is no need to fetch the entities, only their keys), so it would be billed as 1 read + up to 10,000 small operations.
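Along the same lines, an explicit keys-only query should be billed the same way (a hedged note: this matches the pricing where keys-only results count as small operations rather than reads):

# Keys-only query: results are keys, billed as small ops rather than entity reads.
keys = Example.all(keys_only=True).filter('bars =', 'spam').fetch(10000)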
