What is the best way to do AppEngine Model Memcaching? - google-app-engine

Currently my application caches models in memcache like this:
memcache.set("somekey", aModel)
But Nick's post at http://blog.notdot.net/2009/9/Efficient-model-memcaching suggests that first converting the model to a protocol buffer is a lot more efficient. After running some tests I found that the protobuf version is indeed smaller in size, but actually slower (~10%).
Do others have the same experience or am I doing something wrong?
Test results: http://1.latest.sofatest.appspot.com/?times=1000
import pickle
import time
import uuid

from google.appengine.ext import webapp
from google.appengine.ext import db
from google.appengine.ext.webapp import util
from google.appengine.datastore import entity_pb
from google.appengine.api import memcache


class Person(db.Model):
    name = db.StringProperty()

times = 10000


class MainHandler(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        m = Person(name='Koen Bok')

        t1 = time.time()
        for i in xrange(int(self.request.get('times', 1))):
            key = uuid.uuid4().hex
            memcache.set(key, m)
            r = memcache.get(key)
        self.response.out.write('Pickle took: %.2f' % (time.time() - t1))

        t1 = time.time()
        for i in xrange(int(self.request.get('times', 1))):
            key = uuid.uuid4().hex
            memcache.set(key, db.model_to_protobuf(m).Encode())
            r = db.model_from_protobuf(entity_pb.EntityProto(memcache.get(key)))
        self.response.out.write('Proto took: %.2f' % (time.time() - t1))


def main():
    application = webapp.WSGIApplication([('/', MainHandler)], debug=True)
    util.run_wsgi_app(application)


if __name__ == '__main__':
    main()

The memcache call still pickles the object with or without protobuf; pickling is faster with a protobuf-encoded object because the encoded protobuf is just a simple string.
Plain pickled objects are larger than protobuf+pickle objects, so they cost more time on the memcache transfer, while the protobuf conversion costs extra processor time.
Therefore, in general, either method works out about the same... but
The reason you should use protobuf is that it can handle changes between versions of your models, whereas pickle will raise errors. This problem will bite you one day, so it's best to handle it sooner.
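For reference, a small helper sketch (mine, not from the answer) wrapping the protobuf round trip already shown in the question's code above:

def cache_model(key, model, time=0):
    # store the protobuf encoding instead of the model instance itself
    memcache.set(key, db.model_to_protobuf(model).Encode(), time=time)

def get_cached_model(key):
    data = memcache.get(key)
    if data is None:
        return None
    return db.model_from_protobuf(entity_pb.EntityProto(data))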

Both pickle and protobufs are slow in App Engine since they're implemented in pure Python. I've found that writing my own, simple serialization code using methods like str.join tends to be faster since most of the work is done in C. But that only works for simple datatypes.
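As a rough sketch of that idea (the Contact model and the separator are hypothetical, and this only works if the separator never appears in the data):

SEP = u'\x1e'  # ASCII record separator, assumed absent from the values

def serialize_contact(contact):
    # fixed field order by convention: name, then email
    return SEP.join([contact.name or u'', contact.email or u''])

def deserialize_contact(data):
    name, email = data.split(SEP)
    return Contact(name=name, email=email)

# memcache.set(key, serialize_contact(c))
# c2 = deserialize_contact(memcache.get(key))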

One way to do it more quickly is to turn your model into a dictionary and use the native eval / repr functions as your (de)serializers -- with caution of course, as always with the evil eval, but it should be safe here given that there is no external step.
Below is an example of a class Fake_entity implementing exactly that.
You first create your dictionary through fake = Fake_entity(entity), then you can simply store your data via memcache.set(key, fake.serialize()). serialize() is a simple call to the native dictionary repr, with some additions if you need them (e.g. adding an identifier at the beginning of the string).
To fetch it back, simply use fake = Fake_entity(memcache.get(key)). The Fake_entity object is a simple dictionary whose keys are also accessible as attributes. You can access your entity properties normally, except that ReferenceProperties give keys instead of fetching the object (which is actually quite useful). You can also get() the actual entity with fake.get(), or, more interestingly, change it and then save it with fake.put().
It does not work with lists (if you fetch multiple entities from a query), but could easily be adjusted with join/split functions using an identifier like '### FAKE MODEL ENTITY ###' as the separator. Use with db.Model only; it would need small adjustments for Expando.
class Fake_entity(dict):
    def __init__(self, record):
        # simple case: a string, we eval it to rebuild our fake entity
        if isinstance(record, basestring):
            import datetime  # <----- put all relevant eval imports here
            from google.appengine.api import datastore_types
            # strip the identifier line added by serialize() before evaluating
            record = record.replace('### FAKE MODEL ENTITY ###\n', '', 1)
            self.update(eval(record))  # careful with external sources, eval is evil
            return None
        # serious case: we build the instance from the actual entity
        for prop_name, prop_ref in record.__class__.properties().items():
            self[prop_name] = prop_ref.get_value_for_datastore(record)  # avoids fetching referenced entities
        self['_cls'] = record.__class__.__module__ + '.' + record.__class__.__name__
        try:
            self['key'] = str(record.key())
        except Exception:  # the key may not exist if the entity has not been stored
            pass

    def __getattr__(self, k):
        return self[k]

    def __setattr__(self, k, v):
        self[k] = v

    def key(self):
        from google.appengine.ext import db
        return db.Key(self['key'])

    def get(self):
        from google.appengine.ext import db
        return db.get(self['key'])

    def put(self):
        _cls = self.pop('_cls')  # gets and removes the class name from the stored properties
        # import xxxxxxx ---> put your model imports here if necessary
        Cls = eval(_cls)  # make sure your model declarations are in scope here
        real_entity = Cls(**self)  # creates the entity
        real_entity.put()  # self explanatory
        self['_cls'] = _cls  # puts the class name back afterwards
        return real_entity

    def serialize(self):
        return '### FAKE MODEL ENTITY ###\n' + repr(self)
        # or simply repr, but I use the initial identifier to test and eval directly when getting from memcache
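A short usage sketch based on the description above (the Login model name is taken from the example output further below, and the memcache key naming is my own):

from google.appengine.api import memcache

entity = Login.all().get()                      # any db.Model instance
fake = Fake_entity(entity)
memcache.set('login_' + fake['key'], fake.serialize())

data = memcache.get('login_' + fake['key'])
fake = Fake_entity(data)                        # rebuilt from the cached string
print fake.first_name                           # attribute-style access
real = fake.get()                               # fetch the real entity if needed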
I would welcome speed tests on this; I would assume it is quite a bit faster than the other approaches. Plus, you do not run any risks if your models have somehow changed in the meantime.
Below is an example of what the serialized fake entity looks like. Take a particular look at the datetime (created) as well as the reference properties (subdomain):
### FAKE MODEL ENTITY ###
{'status': u'admin', 'session_expiry': None, 'first_name': u'Louis', 'last_name': u'Le Sieur', 'modified_by': None, 'password_hash': u'a9993e364706816aba3e25717000000000000000', 'language': u'fr', 'created': datetime.datetime(2010, 7, 18, 21, 50, 11, 750000), 'modified': None, 'created_by': None, 'email': u'chou@glou.bou', 'key': 'agdqZXJlZ2xlcgwLEgVMb2dpbhjmAQw', 'session_ref': None, '_cls': 'models.Login', 'groups': [], 'email___password_hash': u'chou@glou.bou+a9993e364706816aba3e25717000000000000000', 'subdomain': datastore_types.Key.from_path(u'Subdomain', 229L, _app=u'jeregle'), 'permitted': [], 'permissions': []}
Personally I also use static variables (faster than memcache) to cache my entities in the short term, and fetch from the datastore when the server has changed or its memory has been flushed for some reason (which happens quite often in fact).
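A minimal sketch of such a module-level ("static variable") cache, with names of my own choosing:

_local_cache = {}  # module-level dict, lives as long as the instance does

def get_fake(key_str):
    fake = _local_cache.get(key_str)
    if fake is None:
        data = memcache.get(key_str)
        if data is not None:
            fake = Fake_entity(data)
            _local_cache[key_str] = fake
    return fake  # may still be None -> caller falls back to the datastore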

Related

parallel code execution python2.7 ndb

In my app, for one of the handlers I need to get a bunch of entities and execute a function for each one of them.
I have the keys of all the entities I need. After fetching them I need to execute 1 or 2 instance methods for each one of them, and this slows my app down quite a bit: doing this for 100 entities takes around 10 seconds, which is way too slow.
I'm trying to find a way to get the entities and execute those functions in parallel to save time, but I'm not really sure which way is best.
I tried the _post_get_hook, but there I have a Future object and need to call get_result() and execute the function in the hook. That works kind of OK in the SDK, but produces a lot of 'maximum recursion depth exceeded while calling a Python object' errors, and I can't really understand why; the error message is not very elaborate.
Is the Pipeline API or ndb.Tasklets what I'm searching for?
At the moment I'm going by trial and error, but I would be happy if someone could point me in the right direction.
EDIT
My code is something similar to a filesystem: every folder contains other folders and files. The path of a Collection is set on another entity, so to serialize a Collection entity I need to get the referenced entity and read its path. On a Collection, the serialized_assets() function gets slower the more entities it contains. If I could execute a serialize function for each contained asset side by side, it would speed things up quite a bit.
class Index(ndb.Model):
    path = ndb.StringProperty()


class Folder(ndb.Model):
    label = ndb.StringProperty()
    index = ndb.KeyProperty()
    # contents is a list of keys of contained Folders and Files
    contents = ndb.StringProperty(repeated=True)

    def serialized_assets(self):
        assets = ndb.get_multi(self.contents)
        serialized_assets = []
        for a in assets:
            kind = a._get_kind()
            assetdict = a.to_dict()
            if kind == 'Collection':
                assetdict['path'] = a.path
                # other operations ...
            elif kind == 'File':
                assetdict['another_prop'] = a.another_property
                # ...
            serialized_assets.append(assetdict)
        return serialized_assets

    @property
    def path(self):
        return self.index.get().path


class File(ndb.Model):
    filename = ndb.StringProperty()
    # other properties....

    @property
    def another_property(self):
        # compute something here
        return computed_property
EDIT2:
@ndb.tasklet
def serialized_assets(self, keys=None):
    assets = yield ndb.get_multi_async(keys)
    raise ndb.Return([asset.serialized for asset in assets])
Is this tasklet code OK?
Since most of the execution time of your functions is spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably to use the query's map() function, like this (from the docs):
@ndb.tasklet
def callback(msg):
    acct = yield msg.author.get_async()
    raise ndb.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))

qry = Message.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
    print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
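Applied to the question's models, a hedged sketch (assuming contents holds ndb.Key objects and that 'Collection' entities carry the index KeyProperty from the question) might look like this:

@ndb.tasklet
def serialize_asset(key):
    # one tasklet per asset, so the extra index lookup can overlap with the others
    asset = yield key.get_async()
    assetdict = asset.to_dict()
    if asset._get_kind() == 'Collection':
        index = yield asset.index.get_async()   # async version of the path property
        assetdict['path'] = index.path
    raise ndb.Return(assetdict)

@ndb.tasklet
def serialized_assets_async(keys):
    # yielding a list of futures lets NDB run (and batch) them in parallel
    results = yield [serialize_asset(k) for k in keys]
    raise ndb.Return(results)

Calling serialized_assets_async(keys).get_result() from the handler would then fan out the per-asset work in parallel instead of doing it one entity at a time.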
The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key, having the task execute the two functions for that entity. The concurrency will then be based on the number of concurrent requests configured for that task queue.
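A rough sketch of that approach (the handler URL, queue name and parameter name are assumptions of mine):

from google.appengine.api import taskqueue

def enqueue_per_entity_tasks(keys):
    # one task per key; the worker handler runs the 1-2 per-entity methods
    for key in keys:
        taskqueue.add(url='/worker/serialize',
                      params={'key': key.urlsafe()},
                      queue_name='serialize')  # concurrency is set in queue.yaml

# inside the worker handler (sketch):
#   key = ndb.Key(urlsafe=self.request.get('key'))
#   asset = key.get()
#   ... run the per-entity methods here ...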

How transactions influence read consistency for next non ancestor query in NDB

The apply phase of a save may fail and/or may still be running asynchronously before the next not-strongly-consistent read (a non-ancestor query).
Based on the local testing article, I have written a test that should simulate inconsistent reads:
import dev_appserver
dev_appserver.fix_sys_path()

import unittest

from google.appengine.ext import ndb
from google.appengine.ext import testbed
from google.appengine.datastore import datastore_stub_util


class SomeModel(ndb.Model):
    pass


class SingleEntityConsistency(unittest.TestCase):
    def setUp(self):
        # Setup AppEngine env
        self.testbed = testbed.Testbed()
        self.testbed.activate()
        self.policy = datastore_stub_util.PseudoRandomHRConsistencyPolicy(probability=0)
        self.testbed.init_datastore_v3_stub(consistency_policy=self.policy)
        self.testbed.init_memcache_stub()
        # A test key
        self.key = ndb.Key('SomeModel', 'test')

    def tearDown(self):
        self.testbed.deactivate()

    def test_tx_get_or_insert(self):
        p = SomeModel.get_or_insert('test')
        self.assertEqual(0, SomeModel.query().count(1), "Shouldn't be applied yet")
        self.assertEqual(1, SomeModel.query(ancestor=self.key).count(1), "Ancestor query read should be consistent")

    def test_no_tx_insert(self):
        p = SomeModel(id='test')
        p.put()
        self.assertEqual(0, SomeModel.query().count(2), "Shouldn't be applied yet")
        self.assertEqual(1, SomeModel.query(ancestor=self.key).count(1), "Ancestor query read should be consistent")

    def test_with_ancestor(self):
        p = SomeModel(id='test')
        p.put()
        self.assertEqual(p, SomeModel.query(ancestor=self.key).get())

    def test_key(self):
        p = SomeModel(id='test')
        p.put()
        self.assertEqual(p, self.key.get())


if __name__ == '__main__':
    unittest.main()
Actual questions…
Does wrapping put() in a transaction change the behaviour described at the beginning? Do I still need a strongly consistent query to make sure that I'll read what was written in the transaction? (The tests suggest that I do still need a strongly consistent query.)
Is key.get() considered to be strongly consistent? (The tests suggest that it is.)
UPDATE
I have updated the test code as Guido mentioned; now all tests pass:
self.testbed.init_datastore_v3_stub(consistency_policy=self.policy)
I believe you must do something to activate the policy; that would explain the test failures. Also, I believe only queries are affected and a lone put is effectively a transaction. Finally, beware of NDB's caches.
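For instance, a minimal sketch (my addition, not from the answer) of turning off NDB's caches in a test like the ones above, so reads really hit the datastore stub:

from google.appengine.ext import ndb

ctx = ndb.get_context()
ctx.set_cache_policy(False)      # disable the in-context cache
ctx.set_memcache_policy(False)   # stop NDB from consulting memcache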

Django - Are model save() methods lazy?

Are model save() methods lazy in django?
For instance, at what line in the following code sample will django hit the database?
my_model = MyModel()
my_model.name = 'Jeff Atwood'
my_model.save()
# Some code that is independent of my_model...
model_id = my_model.id
print (model_id)
It does not make much sense to have a lazy save, does it? Django's QuerySets are lazy; the model's save() method is not.
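A quick sanity check, assuming MyModel has an automatic integer primary key and a name field as in the question:

my_model = MyModel()
my_model.name = 'Jeff Atwood'
assert my_model.id is None       # nothing written yet, no primary key assigned
my_model.save()                  # the INSERT runs here, synchronously
assert my_model.id is not None   # the database has assigned the id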
From the django source:
django/db/models/base.py, lines 424–437:
def save(self, force_insert=False, force_update=False, using=None):
    """
    Saves the current instance. Override this in a subclass if you want to
    control the saving process.

    The 'force_insert' and 'force_update' parameters can be used to insist
    that the "save" must be an SQL insert or update (or equivalent for
    non-SQL backends), respectively. Normally, they should not be set.
    """
    if force_insert and force_update:
        raise ValueError("Cannot force both insert and updating in "
                         "model saving.")
    self.save_base(using=using, force_insert=force_insert,
                   force_update=force_update)

save.alters_data = True
Then, save_base does the heavy lifting (same file, lines 439–545):
...
transaction.commit_unless_managed(using=using)
...
And in django/db/transaction.py, lines 167–178, you'll find:
def commit_unless_managed(using=None):
    """
    Commits changes if the system is not in managed transaction mode.
    """
    ...
P.S. All line numbers apply to django version (1, 3, 0, 'alpha', 0).

Is there some performance issue between leaving empty ListProperties or using dynamic (expando) properties?

Is there a datastore performance difference between adding dynamic properties of the Expando class only when they are needed for an entity, and the simpler (for me) approach of just setting up all possible properties I might need from the start, even though most instances will leave them empty?
In my specific case I would have 5-8 empty ReferenceList properties as 'overhead' that will stay empty if I skip using the Expando class.
There is a penalty for setting up all the possible properties you might need from the start.
If you use a regular db.Model, then every property will be serialized whenever you put() it. This includes overhead for the name of the property as well as the value. This overhead is present even if the property's value is not required and is set to None! (Though setting the value to None seems to result in a slightly smaller protobuf representation).
On the other hand, if you use db.Expando and don't specify the properties which might not appear, then only the dynamic properties which are actually present on a model will be serialized. Dynamic properties which are not present are not serialized at all => no overhead. However, if you explicitly declare (fixed) properties on the model then you will have the exact same overhead as a regular db.Model (there is no difference in the serialization of fixed properties between regular models and expando models).
In practice, I don't know if the overhead of using fixed properties will be enough to noticeably impact performance, but it will most certainly eat up a little more storage space and CPU time to serialize even empty fixed fields.
If I were you, I'd go with an expando model and dynamic properties.
Example app which demonstrates what I described above:
from google.appengine.ext import db, webapp
from google.appengine.ext.webapp.util import run_wsgi_app

# all model names equal length because they get serialized too
class EmptyModel(db.Model):
    pass

class TestVModel(db.Model):
    value = db.IntegerProperty(required=False)

class TestExpand(db.Expando):
    pass

class MainPage(webapp.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'

        # create an empty model, one with a prop = None, and one with a prop set
        tEmpty = EmptyModel()
        tNone = TestVModel()
        tVal = TestVModel(value=5)

        # do the same but using an expando model with a dynamic property
        eEmpty = TestExpand()
        eNone = TestExpand()
        eNone.value = None
        eVal = TestExpand()
        eVal.value = 5

        # determine the serialized size of each model (note: no keys assigned)
        fEncodedSz = lambda o: len(db.model_to_protobuf(o).Encode())
        szEmpty = fEncodedSz(tEmpty)
        szNone = fEncodedSz(tNone)
        szVal = fEncodedSz(tVal)
        szEEmpty = fEncodedSz(eEmpty)
        szENone = fEncodedSz(eNone)
        szEVal = fEncodedSz(eVal)

        # output the results
        self.response.out.write("Comparison of model sizes with fixed props with expando models\nwith dynamic props:\n\n")
        self.response.out.write("Model: empty=>%dB prop=None=>%dB prop=Val=>%dB\n" %
                                (szEmpty, szNone, szVal))
        self.response.out.write("Expando: empty=>%dB prop=None=>%dB prop=Val=>%dB\n\n" %
                                (szEEmpty, szENone, szEVal))
        self.response.out.write("Note that the expando property which specifies *no* value for the\ndynamic property 'value' is smaller than if 'None' is assigned.")

application = webapp.WSGIApplication([('/', MainPage)])

def main():
    run_wsgi_app(application)

if __name__ == '__main__':
    main()
Output:
Comparison of model sizes with fixed props with expando models
with dynamic props:
Model: empty=>30B prop=None=>43B prop=Val=>45B
Expando: empty=>30B prop=None=>43B prop=Val=>45B
Note that the expando property which specifies *no* value for the
dynamic property 'value' is smaller than if 'None' is assigned.
Taking into account that you can swap Model for Expando, I would say that they are only client-side facades for datastore entities, which in turn have no fixed schema. So in the datastore there are neither Models nor Expandos, and the number of properties per object instance doesn't really matter (other than in the usual way, i.e. the bigger the object, the more time it takes to transfer it, etc.).
But if I'm wrong here, please downvote and correct me :)

How do I dynamically determine if a Model class exist in Google App Engine?

I want to be able to take a dynamically created string, say "Pigeon", and determine at runtime whether Google App Engine has a Model class defined in this project named "Pigeon". If "Pigeon" is the name of an existing model class, I would like to get a reference to the Pigeon class so defined.
Also, I don't want to use eval at all, since the dynamic string "Pigeon" in this case comes from outside.
You could try the following, although it is probably very, very bad practice:

def get_class_instance(nm):
    try:
        return eval(nm + '()')
    except Exception:
        return None

Also, to make that safer, you could give eval a namespace dict as its second argument: eval(nm + '()', {'Pigeon': Pigeon})
I'm not sure if that would work, and it definitely has an issue: if there is a function whose name is the value of nm, it would return that instead:

def Pigeon():
    return "Pigeon"

print(get_class_instance('Pigeon'))  # >> 'Pigeon'
EDIT: Another way of doing it is possibly (untested), if you know the module:
(Sorry, I keep forgetting it's not obj.hasattr, it's hasattr(obj, name)!)

import models as m

def get_class_instance(nm):
    if hasattr(m, nm):
        return getattr(m, nm)()
    else:
        return None
EDIT 2: Yes, it does work! Woo!
Actually, looking through the source code and the interweb, I found an undocumented method that seems to fit the bill.
from google.appengine.ext import db
key = "ModelObject" #This is a dynamically generated string
klass = db.class_for_kind(key)
This method will throw a descriptive exception if the class does not exist, so you should probably catch it if the key string comes from the outside.
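For example, a small wrapper sketch (my own) that returns None for unknown kinds; as far as I know the exception raised here is db.KindError, but treat that as an assumption:

from google.appengine.ext import db

def model_class_for(kind_name):
    try:
        return db.class_for_kind(kind_name)
    except db.KindError:  # raised when no model class is registered for this kind
        return None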
There are two fairly easy ways to do this without relying on internal details:
Use the google.appengine.api.datastore API, like so:
from google.appengine.api import datastore
q = datastore.Query('EntityType')
if q.get(1):
    print "EntityType exists!"
The other option is to use the db.Expando class:
def GetEntityClass(entity_type):
    class Entity(db.Expando):
        @classmethod
        def kind(cls):
            return entity_type
    return Entity

cls = GetEntityClass('EntityType')
if cls.all().get():
    print "EntityType exists!"
The latter has the advantage that you can use GetEntityClass to generate an Expando class for any entity type, and interact with it the same way you would a normal class.
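A quick usage sketch (the kind name and the dynamic property are made up for illustration):

Pigeon = GetEntityClass('Pigeon')
bird = Pigeon.all().get()        # query it like any other model class
if bird is not None:
    bird.ring_number = 42        # dynamic property, Expando-style
    bird.put()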
