How to optimize one-to-many queries in the datastore - google-app-engine

I have a latency problem in my application because the datastore issues additional queries for referenced entities. I have received good advice on how to handle this for single-value properties by using the get_value_for_datastore() function. However, my application also has one-to-many relationships, as shown in the code below, and I have not found a way to prefetch these entities. The result is unacceptable latency when trying to show a table of 200 documents and their associated DocumentFiles (>6000 ms).
(There will probably never be more than 10,000 Documents or DocumentFiles.)
Is there a way to solve this?
models.py
class Document(db.Expando):
    title = db.StringProperty()
    lastEditedBy = db.ReferenceProperty(DocUser, collection_name='documentLastEditedBy')
    ...

class DocUser(db.Model):
    user = db.UserProperty()
    name = db.StringProperty()
    hasWriteAccess = db.BooleanProperty(default=False)
    isAdmin = db.BooleanProperty(default=False)
    accessGroups = db.ListProperty(db.Key)
    ...

class DocumentFile(db.Model):
    description = db.StringProperty()
    blob = blobstore.BlobReferenceProperty()
    created = db.DateTimeProperty()  # needs to be stored here in relation to upload / download of everything
    document = db.ReferenceProperty(Document, collection_name='files')

    @property
    def link(self):
        # the anchor markup was stripped from the original post; the href below is a placeholder
        return '<a href="/file/%s">%s</a>' % (self.key().id(), self.blob.filename)
    ...
main.py
docUsers = DocUser.all()
docUsersNameDict = dict([(i.key(), i.name) for i in docUsers])

documents = Document.all()
for d in documents:
    out += '<td>%s</td>' % d.title
    docUserKey = Document.lastEditedBy.get_value_for_datastore(d)
    out += '<td>%s</td>' % docUsersNameDict.get(docUserKey)
    out += '<td>'
    # Creates a new query for each document, resulting in unacceptable latency
    for file in d.files:
        out += file.link + '<br>'
    out += '</td>'

Denormalize and store the link in your Document, so that getting the link will be fast.
Be careful, though: whenever you update a DocumentFile, you also need to update the associated Document. This works on the assumption that you read the link from the datastore far more often than you update it.
Denormalizing is often the fix for poor performance on App Engine.
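For example, the denormalized Document might carry copies of its file links (a minimal sketch; the file_links property and the refresh helper are assumptions, not code from the question):

class Document(db.Expando):
    title = db.StringProperty()
    lastEditedBy = db.ReferenceProperty(DocUser, collection_name='documentLastEditedBy')
    # denormalized copies of DocumentFile.link, kept in sync on write
    file_links = db.StringListProperty()

def refresh_file_links(document):
    # call this whenever a DocumentFile belonging to the document is added or changed
    document.file_links = [f.link for f in document.files]
    document.put()

Rendering the table then touches only the Document entities, since each one already carries its links.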

Load your files asynchronously. Use get_value_for_datastore on d.files, which should return a collection of keys that you can then pass to db.get_async(key) to get back a future object. You will not be able to write out your result procedurally as you have done, but it should be trivial to assemble a partial request/dictionary for all documents, each with a collection of pending future gets(), and then, when you iterate to build the results, finalize the futures, which will have finished without blocking (~0 ms latency).
Basically, you need two iterations: the first goes through and asynchronously requests the files you need, and the second goes through, finalizes your gets, and builds your response.
https://developers.google.com/appengine/docs/python/datastore/async
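A rough sketch of that two-pass pattern, using asynchronous query iteration (Query.run() starts fetching in the background); the structure is illustrative rather than the answer's exact code:

documents = Document.all().fetch(200)

# pass 1: kick off one asynchronous files query per document
pending = [(d, d.files.run()) for d in documents]

# pass 2: build the response; by now most result batches have already arrived
out = ''
for d, files in pending:
    links = ''.join(f.link + '<br>' for f in files)
    out += '<td>%s</td><td>%s</td>' % (d.title, links)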

Related

Datastore: Is it possible to only save to memcache when using ndb API?

Dear all,
Currently I'm using the ndb API to store some statistics. Unfortunately, this has become the major source of my cost. I'm thinking it should be much cheaper if I only save them to memcache; it doesn't matter if data is lost when the cache expires.
After reading the manual, I assume the _use_datastore class variable can be used to configure this behaviour:
class StaticModel(ndb.Model):
    _use_datastore = False

    userid = ndb.StringProperty()
    created_at = ndb.DateTimeProperty(auto_now_add=True)
May I know if the above is the right solution?
Cheers!
I think there are three ways to achieve what you want.
The first is to set _use_datastore = False on the NDB model class as per your question.
The second would be to pass use_datastore=False whenever you put / get / delete a StaticModel. An example would be:
model = StaticModel(userid="foo")
key = model.put(use_datastore=False)
n = key.get(use_datastore=False)
The third option would be to set a datastore policy in the NDB Context which returns false for any StaticModel keys. Something like:
context.set_datastore_policy(lambda key: False if key.kind() == 'StaticModel' else True)
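For the third option, a minimal sketch of installing such a policy (ndb.get_context() returns the current context; the function name is illustrative):

from google.appengine.ext import ndb

def install_static_model_policy():
    ctx = ndb.get_context()
    # skip the datastore for StaticModel keys; memcache stays enabled by default,
    # so these entities live only in cache until they expire or are evicted
    ctx.set_datastore_policy(lambda key: key.kind() != 'StaticModel')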

parallel code execution python2.7 ndb

In my app, one of the handlers needs to get a bunch of entities and execute a function for each one of them.
I have the keys of all the entities I need. After fetching them, I need to execute one or two instance methods for each of them, and this slows my app down quite a bit: doing this for 100 entities takes around 10 seconds, which is way too slow.
I'm trying to find a way to get the entities and execute those functions in parallel to save time, but I'm not really sure which way is best.
I tried the _post_get_hook, but there I have a future object and need to call get_result() and execute the function in the hook. This works kind of OK in the SDK, but I get a lot of 'maximum recursion depth exceeded while calling a Python object' errors, and I can't really understand why; the error message is not very elaborate.
Is the Pipeline API or ndb.Tasklets what I'm searching for?
At the moment I'm going by trial and error, but I would be happy if someone could point me in the right direction.
EDIT
My code is something similar to a filesystem: every folder contains other folders and files. The path of a Collection is set on another entity, so to serialize a Collection entity I need to get the referenced entity and read its path. On a Collection, the serialized_assets() function gets slower the more entities it contains. If I could execute a serialize function for each contained asset side by side, it would speed things up quite a bit.
class Index(ndb.Model):
    path = ndb.StringProperty()

class Folder(ndb.Model):
    label = ndb.StringProperty()
    index = ndb.KeyProperty()
    # contents is a list of keys of contained Folders and Files
    contents = ndb.StringProperty(repeated=True)

    def serialized_assets(self):
        assets = ndb.get_multi(self.contents)
        serialized_assets = []
        for a in assets:
            kind = a._get_kind()
            assetdict = a.to_dict()
            if kind == 'Collection':
                assetdict['path'] = a.path
                # other operations ...
            elif kind == 'File':
                assetdict['another_prop'] = a.another_property
                # ...
            serialized_assets.append(assetdict)
        return serialized_assets

    @property
    def path(self):
        return self.index.get().path

class File(ndb.Model):
    filename = ndb.StringProperty()
    # other properties....

    @property
    def another_property(self):
        # compute something here
        return computed_property
EDIT2:
@ndb.tasklet
def serialized_assets(self, keys=None):
    assets = yield ndb.get_multi_async(keys)
    raise ndb.Return([asset.serialized for asset in assets])
is this tasklet code ok?
Since most of the execution time of your functions is spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably the query's map() method, like this (adapted from the docs):
@ndb.tasklet
def callback(msg):
    acct = yield msg.author.get_async()
    raise ndb.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))

qry = Message.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
    print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, then enqueue a task for each key, with each task executing the two functions for its entity. The concurrency will then be determined by the number of concurrent requests configured for that task queue.
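A rough sketch of that fan-out (the handler URL and queue name are assumptions, not part of the question):

from google.appengine.api import taskqueue

# initial request: collect keys only, then enqueue one task per entity
keys = Folder.query().fetch(keys_only=True)
for key in keys:
    taskqueue.add(url='/tasks/serialize_asset',
                  params={'key': key.urlsafe()},
                  queue_name='serialize')

Each task handler would then rebuild its key with ndb.Key(urlsafe=...), get the entity, and run the two instance methods.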

How can I better use filters in appengine to save me filtering by looping through a long list of entities?

The following bit of code runs regularly as a cron job and is turning out to be very computationally expensive. The main problem is in the for loop; I think it could be made more efficient with better filtering, but I'm at a loss as to how to do that.
free_membership_type = MembershipType.all().filter("membership_class =", "Free").filter("live =", True).get()

all_free_users = UserMembershipType.all().filter("membership_active =", True)
all_free_users = all_free_users.filter("membership_type =", free_membership_type).fetch(limit=999999)

if all_free_users:
    for free_user in all_free_users:
        activation_status = ActivationStatus.all().filter("user = ", free_user.user).get()
        if activation_status and activation_status.activated:
            weekly_limits = WeeklyLimits.all().filter("user = ", free_user.user).get()
            if weekly_limits and weekly_limits.documents_left > 0:
                # do something...
The models which the code uses are:
class MembershipType(db.Model):
    membership_class = db.StringProperty()
    membership_code = db.StringProperty()
    live = db.BooleanProperty(default=False)

class UserMembershipType(db.Model):
    user = db.ReferenceProperty(UserModel)
    membership_type = db.ReferenceProperty(MembershipType)
    membership_active = db.BooleanProperty(default=False)

class ActivationStatus(db.Model):
    user = db.ReferenceProperty(UserModel)
    activated = db.BooleanProperty(default=False)

class WeeklyLimits(db.Model):
    user = db.ReferenceProperty(UserModel)
    membership_type = db.ReferenceProperty(MembershipType)
    documents_left = db.IntegerProperty(default=0)
The code I'm using in production does make better use of caching for the various entities, however the for loop still has to cycle through a bunch of users to finally find the few that it needs to do the operation on. Ideally I'd filter out all of the users that don't fulfil the criteria and only then start looping through the list of users - is there some kind of magic bullet that I can use here to achieve this?
The magic that you are probably looking for is denormalization. It looks to me like these classes can all be meaningfully combined into a single model:
class Membership(db.Model):
    user = db.ReferenceProperty(UserModel)
    membership_class = db.StringProperty()
    membership_code = db.StringProperty()
    live = db.BooleanProperty(default=False)
    membership_active = db.BooleanProperty(default=False)
    activated = db.BooleanProperty(default=False)
    documents_left = db.IntegerProperty(default=0)
Then, you can use one query to do all of your filtering.
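For example, the original loop collapses into a single query against the combined model (a sketch; it requires a composite index, and the filter values mirror the original code):

free_active_users = (Membership.all()
                     .filter("membership_class =", "Free")
                     .filter("live =", True)
                     .filter("membership_active =", True)
                     .filter("activated =", True)
                     .filter("documents_left >", 0)
                     .fetch(limit=1000))
for membership in free_active_users:
    # do something with membership.user ...
    pass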
Over-normalization is a common anti-pattern in App Engine development. The models that you posted look like they might as well be table definitions for a relational database (although it's arguable whether they are more compartmentalized than needed even for that scenario), and App Engine's datastore is very much not a relational database.
Can you see any downside to storing all of those fields in a single model?
You could improve this by storing the data closer together in a single model. For example, a single entity kind of UserMembership could have all of the fields you need, and you could do a single query:
.filter("membership_type =", "FREE").filter("status =", "ACTIVE").filter("documentsLeft >", 0)
This would require an extra index to be defined, but will run much, much faster.
If you want to avoid denormalizing your data as suggested in the other two answers, you could also consider using Google's new SQL service instead of the normal datastore: http://googleappengine.blogspot.com/2011/10/google-cloud-sql-your-database-in-cloud.html
With SQL you could do all of this in a single query, even with separate entities.

Python - read with key in App Engine

I have a Python program on Google App Engine.
When finding an object in the datastore where I have the key as a string, how can I do a direct read? Below is my code, which performs a loop; not good...
class Opportunity(db.Model):
    customer = db.ReferenceProperty(Customer, collection_name='opportunitys')
    BNusername = db.StringProperty()
    opportunity_no = db.StringProperty()
    # etc etc etc.....

# BnPresets holds the object key as a string
opportunitys = Opportunity.all()
opportunitys.filter('BNusername =', BnPresets.myusername)
for oprec in opportunitys:
    if str(oprec.key()) == BnPresets.recordkey:
        opportunity = oprec
        # I have the object here and can process etc etc
You can instantiate db.Key from a string by passing it directly to the constructor:
opportunity_key = db.Key(BnPresets.recordkey)
Once you have that, simply call db.get() to obtain the entity identified by this key:
opportunity = db.get(opportunity_key)
I guess (by looking at the query you use) that you also want to verify the username of the object you got:
if opportunity.BNusername == BnPresets.myusername:
    process_opportunity(opportunity)
That should be pretty much it. The bottom line is that you should use the key first - as it uniquely identifies your entity - rather than querying for some other property and iterating through results.
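Putting the pieces together, the original loop collapses to something like this (a sketch; process_opportunity stands in for whatever processing you do with the entity):

opportunity = db.get(db.Key(BnPresets.recordkey))
if opportunity and opportunity.BNusername == BnPresets.myusername:
    process_opportunity(opportunity)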

Google App Engine - Is this also a Put method? Or something else

I was wondering whether I'm unconsciously using the put() method in the last line of my code (please have a look). Thanks.
class User(db.Model):
    name = db.StringProperty()
    total_points = db.IntegerProperty()
    points_activity_1 = db.IntegerProperty(default=100)
    points_activity_2 = db.IntegerProperty(default=200)

    def calculate_total_points(self):
        self.total_points = self.points_activity_1 + self.points_activity_2

# initialize a user (this is obviously a put)
User(key_name="key1", name="person1").put()

# get user by key name
user = User.get_by_key_name("key1")

# QUESTION: is this also a put? It worked and updated my user entity's total points.
User.calculate_total_points(user)
While that method will certainly update the copy of the object that is in memory, I see no reason to believe that the change will be persisted to the datastore. Datastore write operations are costly, so they do not happen implicitly.
After running this code, use the Datastore Viewer to look at the copy of the object in the datastore. I think you may find that it does not have the changed total_points value.
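To actually persist the recalculated value, you would follow the calculation with an explicit put(), for example (a sketch based on the question's code):

user = User.get_by_key_name("key1")
user.calculate_total_points()   # updates only the in-memory copy
user.put()                      # this write is what stores total_points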
