I've created two MapReduce pipelines for uploading CSV files to create Categories and Products in bulk. Each Product gets tied to a Category through a KeyProperty. The Category and Product models are built on ndb.Model, so, based on the documentation, I would expect them to be automatically cached in Memcache when retrieved from the Datastore.
I've run these scripts on the server to upload 30 categories and, afterward, 3000 products. All the data appears in the Datastore as expected.
However, it doesn't seem like the Product upload is using Memcache to get the Categories. When I check the Memcache viewer in the portal, the hit count is around 180 and the miss count around 60. If I were uploading 3000 products and retrieving the category each time, shouldn't I see around 3000 hits + misses from fetching the category (i.e., Category.get_by_id(category_id))? And likely 3000 more misses from attempting to retrieve the existing product before creating a new one (the algorithm handles both entity creation and updates).
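For reference, the models are roughly as follows (simplified; Category's name property here is just illustrative, while the Product properties match the mapper code below):

from google.appengine.ext import ndb

class Category(ndb.Model):
    # Keyed by the category id from the CSV; other properties omitted.
    name = ndb.StringProperty()

class Product(ndb.Model):
    # Keyed by the product id from the CSV.
    category = ndb.KeyProperty(kind='Category')
    product_type = ndb.StringProperty()
    description = ndb.StringProperty()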
Here's the relevant product mapping function, which takes in a line from the CSV file in order to create or update the product:
def product_bulk_import_map(data):
    """Product Bulk Import map function."""
    result = {"status" : "CREATED"}
    product_data = data

    try:
        # parse input parameter tuple
        byteoffset, line_data = data

        # parse base product data
        product_data = [x for x in csv.reader([line_data])][0]
        (p_id, c_id, p_type, p_description) = product_data

        # process category
        category = Category.get_by_id(c_id)
        if category is None:
            raise Exception(product_import_error_messages["category"] % c_id)

        # store in datastore
        product = Product.get_by_id(p_id)
        if product is not None:
            result["status"] = "UPDATED"
            product.category = category.key
            product.product_type = p_type
            product.description = p_description
        else:
            product = Product(
                id = p_id,
                category = category.key,
                product_type = p_type,
                description = p_description
            )
        product.put()
        result["entity"] = product.to_dict()

    except Exception as e:
        # catch any exceptions, and note failure in output
        result["status"] = "FAILED"
        result["entity"] = str(e)

    # return results
    yield (str(product_data), result)
MapReduce intentionally disables memcache for NDB.
See mapreduce/util.py ln 373, _set_ndb_cache_policy() (as of 2015-05-01):
def _set_ndb_cache_policy():
    """Tell NDB to never cache anything in memcache or in-process.

    This ensures that entities fetched from Datastore input_readers via NDB
    will not bloat up the request memory size and Datastore Puts will avoid
    doing calls to memcache. Without this you get soft memory limit exits,
    which hurts overall throughput.
    """
    ndb_ctx = ndb.get_context()
    ndb_ctx.set_cache_policy(lambda key: False)
    ndb_ctx.set_memcache_policy(lambda key: False)
You can force get_by_id() and put() to use memcache, e.g.:
product = Product.get_by_id(p_id, use_memcache=True)
...
product.put(use_memcache=True)
Alternatively, you can modify the NDB context if you are batching puts together with mapreduce.operation. However, I don't know enough to say whether this has other undesired effects:
ndb_ctx = ndb.get_context()
ndb_ctx.set_memcache_policy(lambda key: True)
...
yield operation.db.Put(product)
As for the docstring about "soft memory limit exits", I don't understand why that would occur if only memcache were enabled (i.e., no in-context cache).
It actually seems like you want memcache to be enabled for puts, otherwise your app ends up reading stale data from NDB's memcache after your mapper has modified the data underneath.
As Slawek Rewaj already mentioned, this is caused by the in-context cache. When retrieving an entity, NDB tries the in-context cache first, then memcache, and finally it retrieves the entity from the datastore if it wasn't found in either the in-context cache or memcache. The in-context cache is just a Python dictionary, and its lifetime and visibility are limited to the current request, but MapReduce makes multiple calls to product_bulk_import_map() within a single request.
You can find more information about the in-context cache here: https://cloud.google.com/appengine/docs/python/ndb/cache#incontext
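If you do re-enable NDB caching inside the mapper (or run into this outside of MapReduce), one way to keep the per-request in-context cache from growing across the many map calls is to clear it explicitly. A minimal sketch using the documented Context.clear_cache() method:

from google.appengine.ext import ndb

def product_bulk_import_map(data):
    # Drop entities cached by earlier map calls in this same request,
    # so the in-context cache (a per-request Python dict) stays small.
    ndb.get_context().clear_cache()
    # ... rest of the map function as above ...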
Related
In Datastore in Firestore mode, the recommended way to store a counter with a high write rate (such as profile views on a website) is to use sharded/distributed counters.
The problem I have is that with distributed counters you need to pick how many shards you want to have. This is addressed here as well. For example, some profiles may get a lot more views per second than others (one profile may be a famous person while another is a regular person), and therefore need more shards.
Is there a way to write a distributed counter that can scale its shards up if the page is getting a lot of views per second?
I was thinking of detecting a datastore contention error and then adding more shards if that happens.
I noticed there is a new extension for Cloud Firestore that seems to do what I am asking for. However, I am not using Cloud Firestore, I am using Datastore in Firestore mode - similar under the hood but still different.
The original Datastore distributed counters example:
NUM_SHARDS = 20

class SimpleCounterShard(ndb.Model):
    """Shards for the counter"""
    count = ndb.IntegerProperty(default=0)

def get_count():
    """Retrieve the value for a given sharded counter.

    Returns:
        Integer; the cumulative count of all sharded counters.
    """
    total = 0
    for counter in SimpleCounterShard.query():
        total += counter.count
    return total

@ndb.transactional
def increment():
    """Increment the value for a given sharded counter."""
    shard_string_index = str(random.randint(0, NUM_SHARDS - 1))
    counter = SimpleCounterShard.get_by_id(shard_string_index)
    if counter is None:
        counter = SimpleCounterShard(id=shard_string_index)
    counter.count += 1
    counter.put()
This uses a fixed number of shards, but the Firestore example uses a separate entity to keep track of the number of shards. So you can update the code above with something like:
class RootCounter(ndb.Model):
    count = ndb.IntegerProperty(default=0)
    num_shards = ndb.IntegerProperty(default=0)

    def get_count(self):
        if self.num_shards > 0:
            return sum([e.count for e in SimpleCounterShard.query(ancestor=self.key)])
        return self.count

    def increment(self):
        try:
            self._increment()
        except:  # transaction contention: add another shard and retry
            self.num_shards += 1
            self.increment()
            self.put()

    @ndb.transactional(retries=1)
    def _increment(self):
        if self.num_shards > 0:
            SimpleCounterShard.increment(parent=self.key, num_shards=self.num_shards)
        else:
            self.count += 1
            self.put()
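Note that the sketch above calls SimpleCounterShard.increment(parent=..., num_shards=...), which is not defined in the original example; the shard model would need to be reworked into something like the following classmethod (my assumption about the intended shape, untested):

class SimpleCounterShard(ndb.Model):
    """Shards for the counter, created as children of a RootCounter."""
    count = ndb.IntegerProperty(default=0)

    @classmethod
    def increment(cls, parent, num_shards):
        # Pick a random shard under the given root counter and bump it.
        shard_string_index = str(random.randint(0, num_shards - 1))
        key = ndb.Key(cls, shard_string_index, parent=parent)
        counter = key.get()
        if counter is None:
            counter = cls(key=key)
        counter.count += 1
        counter.put()

A request handler could then do something like root = RootCounter.get_or_insert('profile-%s' % profile_id) followed by root.increment() (the key name is illustrative).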
The important difference since Firestore in Datastore mode was released is that it is strongly consistent and you are likely not using entity groups. Thus a query will give an exact answer, and the sharded counters can fit nicely in the hierarchy under the root counter.
I currently have an application running in the Google App Engine Standard Environment which, among other things, contains a large database of weather data and a frontend endpoint that generates graphs of this data. The database lives in Google Cloud Datastore, and the Python Flask application accesses it via the NDB library.
My issue is as follows: when I try to generate graphs for WeatherData spanning more than about a week (the data is stored for every 5 minutes), my application exceeds GAE's soft private memory limit and crashes. However, stored in each of my WeatherData entities are the relevant fields that I want to graph, in addition to a very large json string containing forecast data that I do not need for this graphing application. So, the part of the WeatherData entities that is causing my application to exceed the soft private memory limit is not even needed in this application.
My question is thus as follows: is there any way to query only certain properties in the entity, as can be done for specific columns in a SQL-style query? Again, I don't need the entire forecast JSON string for graphing, only a few other fields stored in the entity. The other approach I tried was to fetch only a couple of entities at a time, splitting the query into multiple API calls, but it ended up taking so long that the page would time out and I couldn't get it to work properly.
Below is my code for how it is currently implemented and breaking. Any input is much appreciated:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
for acct in qry.fetch():
    d = [acct.time.strftime(date_string)]
    for attr in wData.keys():
        d.append(str(acct.dict_access(attr)))
        wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
    wDataCsv += '\n' + ','.join(d)
# Children entity - log of the weather at a parent location
class WeatherData(ndb.Model):
    # model for data to save
    ...

    # Function for querying data below a given ancestor between two optional
    # times
    @classmethod
    def time_ordered_query(cls, ancestor_key, start=None, end=None):
        return cls.query(cls.time >= start, cls.time <= end, ancestor=ancestor_key).order(-cls.time)
EDIT: I tried the iterative page fetching strategy described in the link from the answer below. My code was updated to the following:
wDataCsv = 'Time,' + ','.join(wData.keys())
qry = WeatherData.time_ordered_query(ndb.Key('Location', loc), start=start_date, end=end_date)
cursor = None
while True:
    gc.collect()
    fetched, next_cursor, more = qry.fetch_page(FETCHNUM, start_cursor=cursor)
    if fetched:
        for acct in fetched:
            d = [acct.time.strftime(date_string)]
            for attr in wData.keys():
                d.append(str(acct.dict_access(attr)))
                wData[attr].append([acct.time.strftime(date_string), acct.dict_access(attr)])
            wDataCsv += '\n' + ','.join(d)
    if more and next_cursor:
        cursor = next_cursor
    else:
        break
where FETCHNUM=500. In this case, I am still exceeding the soft private memory limit for queries of the same length as before, and the query takes much, much longer to run. I suspect the problem may be that Python's garbage collector is not freeing the already-used entities because they are still referenced, but even when I include gc.collect() I see no improvement.
EDIT:
Following the advice below, I fixed the problem using projection queries. Rather than build a separate projection for each custom query, I simply ran the same projection each time: querying all properties of the entity except the JSON string. While this is not ideal, since it still pulls unneeded data from the database each time, defining an individual projection for each specific query is not scalable due to the exponential growth of necessary indexes. For this application, since each additional property adds negligible memory (aside from that JSON string), it works!
You can use projection queries to fetch only the properties of interest from each entity (see the sketch after this list). Watch out for the limitations, though, and this still can't scale indefinitely.
You can split your queries across multiple requests (more scalable), but use bigger chunks, not just a couple (you can fetch 500 at a time), and use cursors. Check out the examples in How to delete all the entries from google datastore?
You can bump your instance class to one with more memory (if not done already).
You can prepare intermediate results (also in the datastore) from the big entities ahead of time and use these intermediate pre-computed values in the final stage.
Finally, you could try to create and store just portions of the graphs and stitch them together at the end (only if it comes down to that; I'm not sure exactly how it would be done, and I imagine it wouldn't be trivial).
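As a concrete illustration of the projection suggestion (a sketch only: the temperature and humidity properties are assumptions for this model, and the large forecast JSON property is simply left out of the projection list):

qry = WeatherData.query(
    WeatherData.time >= start_date,
    WeatherData.time <= end_date,
    ancestor=ndb.Key('Location', loc)).order(-WeatherData.time)

# Only the projected (indexed) properties are materialized, so the big
# forecast JSON string is never loaded into memory.
rows = qry.fetch(projection=[WeatherData.time,
                             WeatherData.temperature,
                             WeatherData.humidity])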
I would like to rewrite the example from the GAE djangoforms article so that it shows the most up-to-date data after submitting a form (e.g. when updating or adding a new entry) on Google App Engine using the High Replication Datastore.
The main recurring query in this article is:
query = db.GqlQuery("SELECT * FROM Item ORDER BY name")
which we will translate to:
query = Item.all().order('name')  # datastore request
For this query I would like to get the latest data from the High Replication Datastore after submitting the form (only on these occasions; I assume I can redirect to a specific URL after submission which uses the query for the latest data, and in all other cases I would not do this).
Validating the form and storing the results happens like this:
data = ItemForm(data=self.request.POST)
if data.is_valid():
    # Save the data, and redirect to the view page
    entity = data.save(commit=False)
    entity.added_by = users.get_current_user()
    entity.put()  # datastore request
and getting the latest entry from the datastore to populate a form (for editing) happens like this:
id = int(self.request.get('id'))
item = Item.get(db.Key.from_path('Item', id))  # datastore request
data = ItemForm(data=self.request.POST, instance=item)
So how do I add entity groups/ancestor keys to these datastore queries so that they reflect the latest data after form submission? Please note, I don't want all queries to return the latest data, only the ones used when populating a form (for editing) and after submitting a form.
Who can help me with practical code examples?
If it is in the same block, you have a reference to the current instance.
Then once you put() it, you can get its id by:
if data.is_valid():
    entity = data.save(commit=False)
    entity.added_by = users.get_current_user()
    entity.put()
    id = entity.key().id()  # this gives you the inserted data's id
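To address the entity group part of the question directly (a sketch that goes beyond the answer above; the 'ItemList'/'item_list' parent key is an assumption): create every Item under a common parent key and use an ancestor query, which is strongly consistent on the High Replication Datastore:

from google.appengine.ext import db

# One well-known parent key defines the entity group.
list_key = db.Key.from_path('ItemList', 'item_list')

# Create new items inside that group; note that with djangoforms the
# parent has to be supplied when the entity is constructed, not after
# save(), so this may need some rework of the form handling.
item = Item(parent=list_key, name=self.request.get('name'))
item.put()

# Ancestor queries are strongly consistent, so this sees the put()
# above immediately.
query = Item.all().ancestor(list_key).order('name')

Keep in mind that a single entity group is limited to roughly one write per second, so this only makes sense if Item writes are infrequent.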
I'm using the App Engine datastore, and would like to make sure that row IDs behave similarly to "auto-increment" fields in a MySQL DB.
I've tried several generation strategies, but I can't seem to take control over what happens:
the IDs are not consecutive, there seem to be several "streams" growing in parallel.
the ids get "recycled" after old rows are deleted
Is such a thing possible at all?
I really would like to refrain from keeping (indexed) timestamps for each row.
It sounds like you can't rely on IDs being sequential without a fair amount of extra work. However, there is an easy way to achieve what you are trying to do:
We'd like to delete old items (older than two months' worth, for example)
Here is a model that automatically keeps track of both its creation and its modification times. Simply using the auto_now_add and auto_now parameters makes this trivial.
from google.appengine.ext import db

class Document(db.Model):
    user = db.UserProperty(required=True)
    title = db.StringProperty(default="Untitled")
    content = db.TextProperty(default=DEFAULT_DOC)
    created = db.DateTimeProperty(auto_now_add=True)
    modified = db.DateTimeProperty(auto_now=True)
Then you can use cron jobs or the task queue to schedule your maintenance task of deleting old stuff. Finding the oldest stuff is as easy as sorting by created or modified date:
db.Query(Document).order("modified")
# or
db.Query(Document).order("created")
What I know is that auto-generated IDs as long integers are available in Google App Engine, but there's no guarantee that the values are increasing and also no guarantee that the numbers are real one-increments.
So, if you need timestamping and increments, add a DateTime field with milliseconds, but then you don't know that the values are unique.
So, the best thing to do (what we are using) is (sorry, but this is indeed IMHO the best option):
use an auto-generated ID as a long (we use Objectify in Java)
use a timestamp on each entity and a (descending) index to query the entities and get the top X (see the sketch below)
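In the Python db API used elsewhere in this thread, that approach looks roughly like this (a sketch; the Entry model is hypothetical):

from google.appengine.ext import db

class Entry(db.Model):
    # The auto-generated long id serves as the key; 'created' provides ordering.
    created = db.DateTimeProperty(auto_now_add=True)

# Top 10 most recent entries via a descending sort on 'created'.
latest = Entry.all().order('-created').fetch(10)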
I think this is probably a fairly good solution, however be aware that I have not tested it in any way, shape or form. The syntax may even be incorrect!
The principle is to use memcache to generate a monotonic sequence, using the datastore to provide a fall-back if memcache fails.
class IndexEndPoint(db.Model):
    # Datastore ids must be > 0, so start the sequence at 1.
    index = db.IntegerProperty(indexed=False, default=1)


def find_next_index(cls):
    """Finds the next free index for an entity type."""
    index_name = 'seqindex-%s' % (cls.kind())

    def _from_ds():
        """A very naive way to find the next free key.

        We just take the last known end point and loop until it's free.
        """
        tmp_index = IndexEndPoint.get_or_insert(index_name).index
        index = None
        while index is None:
            key = db.Key.from_path(cls.kind(), tmp_index)
            free = db.get(key) is None
            if free:
                index = tmp_index
            tmp_index += 1
        return index

    index = None
    while index is None:
        index = memcache.incr(index_name)
        if index is None:  # Our index might have been evicted
            index = _from_ds()
            if memcache.add(index_name, index):  # if false someone beat us to it
                index = None

    # ToDo:
    # Use a named task to update IndexEndPoint so if the memcache index gets evicted
    # we don't have too many items to cycle over to find the end point again.

    return index


def make_new(cls):
    """Makes a new entity with an incrementing ID."""
    result = None
    while result is None:
        index = find_next_index(cls)

        def txn():
            """Makes a new entity if index is free.

            This should only fail if we had a memcache miss
            (does not have to be on this instance).
            """
            key = db.Key.from_path(cls.kind(), index)
            if db.get(key) is not None:
                return None
            result = cls(key=key)
            result.put()
            return result

        result = db.run_in_transaction(txn)
    return result
I'm reading on the Google App Engine groups about many users (Fig1, Fig2, Fig3) who can't figure out where the high number of Datastore reads in their billing reports comes from.
As you might know, Datastore reads are capped at 50K operations/day; above this budget you have to pay.
50K operations sounds like a lot, but unluckily it seems that each operation (query, entity fetch, count, etc.) hides several Datastore reads.
Is it possible to know, via an API or some other approach, how many Datastore reads are hidden behind the common RPC.get and RPC.runquery calls?
Appstats seems useless in this case because it gives just the RPC details and not the hidden reads cost.
Having a simple Model like this:
class Example(db.Model):
    foo = db.StringProperty()
    bars = db.ListProperty(str)
and 1000 entities in the datastore, I'm interested in the cost of these kinds of operations:
items_count = Example.all(keys_only = True).filter('bars=','spam').count()
items_count = Example.all().count(10000)
items = Example.all().fetch(10000)
items = Example.all().filter('bars=','spam').filter('bars=','fu').fetch(10000)
items = Example.all().fetch(10000, offset=500)
items = Example.all().filter('foo>=', filtr).filter('foo<', filtr+ u'\ufffd')
See http://code.google.com/appengine/docs/billing.html#Billable_Resource_Unit_Cost .
A query costs you 1 read plus 1 read for each entity returned. "Returned" includes entities skipped by offset or count.
So that is 1001 reads for each of these:
Example.all(keys_only = True).filter('bars=','spam').count()
Example.all().count(1000)
Example.all().fetch(1000)
Example.all().fetch(1000, offset=500)
For these, the number of reads charged is 1 plus the number of entities that match the filters:
Example.all().filter('bars=','spam').filter('bars=','fu').fetch()
Example.all().filter('foo>=', filtr).filter('foo<', filtr+ u'\ufffd').fetch()
Instead of using count you should consider storing the count in the datastore, sharded if you need to update the count more than once a second. http://code.google.com/appengine/articles/sharding_counters.html
Whenever possible you should use cursors instead of an offset.
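For example, with the db API, paging with cursors instead of an offset looks roughly like this (a sketch):

query = Example.all().filter('bars =', 'spam')
batch = query.fetch(1000)
cursor = query.cursor()  # remember where this batch ended

# Later (e.g. in the next request or task), resume from the cursor
# instead of re-reading, and paying for, the skipped entities:
query = Example.all().filter('bars =', 'spam')
query.with_cursor(cursor)
next_batch = query.fetch(1000)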
Just to make sure (I'm almost certain about this):
Example.all().count(10000)
This one uses small datastore operations (no need to fetch the entities, only keys), so this would count as 1 read + 10,000 (max) small operations.