Is it possible to bulk load an NDB child Entity in GAE? - google-app-engine

At some point in the future I may need to bulk load migration data (i.e. from a CSV). Has anyone had exceptions raised doing the following? Also is there any change in behaviour if the ndb.put_multi() function is used?
from google.appengine.ext import ndb

class Y(ndb.Model):
    pass

class X(ndb.Model):
    id = ndb.StringProperty()
    name = ndb.StringProperty()

def read_csv_row(line):
    """Returns an (id, name) tuple parsed from one CSV line."""

while True:
    id, name = read_csv_row(readline())
    if not id:
        break
    x = X(parent=ndb.Key('Y', 'static_id'))
    x.id, x.name = id, name
    x.put()
From my research, and thanks to the comments, it seems that the code above (once made into valid code) creates problems which would eventually lead to google.appengine.api.datastore_errors.Timeout exceptions being raised.
See another question:
Datastore write limit tests - trying to break app engine, but it won't break ;)
The best suggestion I have so far is to use a Task Queue to rate-limit this. More information at:
blog.notdot.net/tag/deferred
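For what it's worth, here is a minimal sketch of the rate-limited approach, assuming the X and Y models above, the deferred library from the blog post, and an illustrative batch size:
from google.appengine.ext import deferred, ndb

BATCH_SIZE = 100  # illustrative; tune to stay under the write limits

def load_batch(rows):
    # rows is a list of (id, name) tuples already parsed from the CSV.
    entities = [X(parent=ndb.Key('Y', 'static_id'), id=row_id, name=name)
                for row_id, name in rows]
    ndb.put_multi(entities)  # one batched RPC instead of one put() per row

def enqueue_csv(rows):
    for i in range(0, len(rows), BATCH_SIZE):
        # Each batch runs in its own task, so the task queue paces the writes.
        deferred.defer(load_batch, rows[i:i + BATCH_SIZE])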

Related

How to disable BadValueError (required field) in Google App Engine while scanning all records?

I want to scan all records to check whether there are any errors in the data.
How can I disable BadValueError so that the scan does not break when a required field is missing?
Consider that I cannot simply make the StringProperty non-required, and there can be tens of such properties in the real code, so that workaround is not useful.
class A(db.Model):
    x = db.StringProperty(required=True)

for instance in A.all():
    # check something
    if something(instance):
        instance.delete()
Can I use some function to read the low-level datastore.Entity directly, so that no validation is needed and these problems are avoided?
The solution I found for this problem was to use a resilient query that ignores any exception raised while iterating over the results. You can try this:
def resilient_query(query):
    query_iter = iter(query)
    while True:
        try:
            next_result = query_iter.next()
            # check something
            yield next_result
        except StopIteration:
            return
        except Exception, e:
            # if the check on next_result failed, drop the broken entity
            next_result.delete()

query = resilient_query(A.query())
If you use ndb, you can load all your models as an ndb.Expando, then modify the values. This doesn't appear to be possible in db because you cannot specify a kind for a Query in db that differs from your model class.
Even though your model is defined in db, you can still use ndb to fix your entities:
# Set up a new ndb connection with ndb.Expando as the default model.
conn = ndb.make_connection(default_model=ndb.Expando)
# Use this connection in our context.
ndb.set_context(ndb.make_context(conn=conn))

# Query for all A kinds.
for a in ndb.Query(kind='A'):
    if a.x is None:
        a.x = 'A more appropriate value.'
        # Re-put the broken entity.
        a.put()
Also note that this (and other solutions listed) will be subject to whatever time limits you are restricted to (i.e. 60 seconds on an App Engine frontend). If you are dealing with large amounts of data you will most likely want to write a custom map reduce job to do this.
Try setting a default property option to some distinct value that does not exist otherwise.
class A(db.Model):
    x = db.StringProperty(required=True, default=<distinct value>)
Then load properties and check for this value.
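A minimal sketch of that check, using a hypothetical sentinel string as the distinct default value:
SENTINEL = '__missing__'  # hypothetical distinct value

class A(db.Model):
    x = db.StringProperty(required=True, default=SENTINEL)

for instance in A.all():
    if instance.x == SENTINEL:
        # The stored entity had no real value for x; fix or delete it here.
        pass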
You can override the _check_initialized(self) method of ndb.Model in your own Model subclass and replace the default logic with your own (or skip the check altogether as needed).
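A minimal sketch of that override (the subclass name is hypothetical, and the body simply skips the check):
from google.appengine.ext import ndb

class LenientModel(ndb.Model):
    def _check_initialized(self):
        # The default implementation raises BadValueError when required
        # properties are unset; doing nothing here skips that validation.
        pass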

Datastore: Is it possible to only save to memcache when using ndb API?

Dear all,
Currently I'm using the ndb API to store some statistical information. Unfortunately, this has become the major source of my cost. I'm thinking it should be much cheaper if I only save the data to memcache. It doesn't matter if data is lost due to cache expiry.
After reading the manual, I assume the _use_datastore class variable can be used to configure this behaviour:
class StaticModel(ndb.Model):
    _use_datastore = False

    userid = ndb.StringProperty()
    created_at = ndb.DateTimeProperty(auto_now_add=True)
May I know if the above is the right solution?
Cheers!
I think there are three ways to achieve what you want.
The first is to set _use_datastore = False on the NDB model class as per your question.
The second would be to pass use_datastore=False whenever you put / get / delete a StaticModel. An example would be:
model = StaticModel(userid="foo")
key = model.put(use_datastore=False)
n = key.get(use_datastore=False)
The third option would be to set a datastore policy in the NDB Context which returns false for any StaticModel keys. Something like:
context.set_datastore_policy(lambda key: True if key.kind() == 'StaticModel' else False)
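A minimal sketch of that third option, assuming the policy is installed on the default context at startup:
from google.appengine.ext import ndb

ctx = ndb.get_context()
# Skip the datastore for StaticModel keys; everything else behaves as before.
ctx.set_datastore_policy(lambda key: key.kind() != 'StaticModel')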

parallel code execution python2.7 ndb

In my app, one of the handlers needs to get a bunch of entities and execute a function for each of them.
I have the keys of all the entities I need. After fetching them I need to execute one or two instance methods for each of them, and this slows my app down quite a bit: doing this for 100 entities takes around 10 seconds, which is way too slow.
I'm trying to find a way to get the entities and execute those functions in parallel to save time, but I'm not really sure which way is best.
I tried the _post_get_hook, but there I have a Future object and need to call get_result() and execute the function in the hook. That works kind of OK in the SDK, but it produces a lot of 'maximum recursion depth exceeded while calling a Python object' errors, and I can't really understand why; the error message is not very elaborate.
Is the Pipeline API or ndb tasklets what I'm searching for?
At the moment I'm going by trial and error, but I would be happy if someone could point me in the right direction.
EDIT
My code is something similar to a filesystem: every folder contains other folders and files. The path of a Collection is stored on another entity, so to serialize a collection entity I need to get the referenced entity and read its path. On a Collection, the serialized_assets() function gets slower the more entities it contains. If I could execute a serialize function for each contained asset side by side, it would speed things up quite a bit.
class Index(ndb.Model):
    path = ndb.StringProperty()

class Folder(ndb.Model):
    label = ndb.StringProperty()
    index = ndb.KeyProperty()
    # contents is a list of keys of contained Folders and Files
    contents = ndb.StringProperty(repeated=True)

    def serialized_assets(self):
        assets = ndb.get_multi(self.contents)
        serialized_assets = []
        for a in assets:
            kind = a._get_kind()
            assetdict = a.to_dict()
            if kind == 'Collection':
                assetdict['path'] = a.path
                # other operations ...
            elif kind == 'File':
                assetdict['another_prop'] = a.another_property
                # ...
            serialized_assets.append(assetdict)
        return serialized_assets

    @property
    def path(self):
        return self.index.get().path

class File(ndb.Model):
    filename = ndb.StringProperty()
    # other properties....

    @property
    def another_property(self):
        # compute something here
        return computed_property
EDIT2:
@ndb.tasklet
def serialized_assets(self, keys=None):
    assets = yield ndb.get_multi_async(keys)
    raise ndb.Return([asset.serialized for asset in assets])
Is this tasklet code OK?
Since most of the execution time of your functions is spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest approach for your requirements is probably to use the query's map() method, like this (from the docs):
@ndb.tasklet
def callback(msg):
    acct = yield msg.author.get_async()
    raise ndb.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))

qry = Message.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
    print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
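A minimal sketch of how the same idea could be applied to the models above, assuming contents holds ndb.Key objects (the question uses a repeated StringProperty, so keys may need to be rebuilt first) and using the Folder kind from the posted code:
@ndb.tasklet
def serialize_asset(key):
    # Asynchronous get; NDB batches these RPCs across parallel tasklets.
    asset = yield key.get_async()
    assetdict = asset.to_dict()
    if asset._get_kind() == 'Folder':
        # The path lives on the referenced Index entity, so fetch it async too.
        index = yield asset.index.get_async()
        assetdict['path'] = index.path
    raise ndb.Return(assetdict)

def serialized_assets(folder):
    # Start all the tasklets first, then collect the results.
    futures = [serialize_asset(key) for key in folder.contents]
    return [f.get_result() for f in futures]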
The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, then enqueue a task for each key, having each task execute the two functions for its entity. The concurrency will then depend on the number of concurrent requests configured for that task queue.
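A minimal sketch of that approach using the deferred library (the per-entity method names and queue name are hypothetical):
from google.appengine.ext import deferred, ndb

def process_asset(urlsafe_key):
    # Runs in its own task request, so many of these execute concurrently,
    # up to the limits configured for the queue.
    asset = ndb.Key(urlsafe=urlsafe_key).get()
    asset.do_first_thing()    # hypothetical per-entity instance methods
    asset.do_second_thing()

# In the initial request, enqueue one task per key:
for key in keys:
    deferred.defer(process_asset, key.urlsafe(), _queue='serialize')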

What approach is best for mapping a legacy application tables named after years in Django?

It's easier to just show what the table names look like:
2009_articles
2010_articles
2011_articles
2009_customers
2010_customers
2011_customers
2009_invoices
2010_invoices
2011_invoices
The developers simulated some kind of partitioning (long before MySQL supported it), but now it breaks any attempt to build a quick frontend where customers can see their invoices and switch between years.
After a couple of months I have the following results:
Changing Invoice._meta.db_table is useless because any other relation deduced by the ORM will be wrong
models.py cannot get request variables
Option a:
Use abstract models, so Invoice10 inherits from the Invoice model and sets meta.db_table to the 2010 table, and Invoice11 sets it to the 2011 table (see the sketch below, after the options). Not DRY, although the app shouldn't need to support more than two or three years at the same time, but I will still have to check if ...
Option b:
Duplicate the models and change the imports in my views:
if year == 2010:
    from models import Article10 as Article
and so on
Option c:
Dynamic models, as described in several places on the net; but why have a 100% dynamic model when I only need 1% of the model to be dynamic?
Option d:
Wow, I'm just going crazy with frustration. What about multiple database settings and a database router?
Any help will be much appreciated.
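A minimal sketch of option (a), with hypothetical field names, just to illustrate the abstract-base pattern:
from django.db import models

class InvoiceBase(models.Model):
    number = models.CharField(max_length=20)   # hypothetical fields
    total = models.DecimalField(max_digits=10, decimal_places=2)

    class Meta:
        abstract = True

class Invoice10(InvoiceBase):
    class Meta:
        db_table = '2010_invoices'

class Invoice11(InvoiceBase):
    class Meta:
        db_table = '2011_invoices'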
Option e: create new, relevant models/database structure, and import the old data into the new structure.
Ugly, but worth mentioning.
I achieved something useful by using the class_prepared Django signal and a ThreadLocal middleware to get the session:
from django.conf import settings
from django.db.models.signals import class_prepared
from myapp.middlewares import ThreadLocal

apps = ('oneapp', 'otherapp',)

def add_table_prefix(sender, *args, **kwargs):
    if sender._meta.app_label in apps:
        request = ThreadLocal.get_current_request()
        try:
            year = request.session['current_year']
        except:
            year = settings.CURRENT_YEAR
        prefix = getattr(settings, 'TABLE_PREFIX', '')
        sender._meta.db_table = '%s_%s_%s_%s' % (prefix, year, sender._meta.coolgest_prefix, sender._meta.db_table)

class_prepared.connect(add_table_prefix)
So one model class maps to several identical database tables (invoices_01_2013, invoices_02_2013, ...) depending on which month and year the application user is browsing.
Working fine in production.

Google App Engine - Is this also a Put method? Or something else

I was wondering if I'm unknowingly using the put() method in my last line of code (please have a look). Thanks.
class User(db.Model):
    name = db.StringProperty()
    total_points = db.IntegerProperty()
    points_activity_1 = db.IntegerProperty(default=100)
    points_activity_2 = db.IntegerProperty(default=200)

    def calculate_total_points(self):
        self.total_points = self.points_activity_1 + self.points_activity_2

# initialize a user (this is obviously a Put method)
User(key_name="key1", name="person1").put()

# get user by key name
user = User.get_by_key_name("key1")

# QUESTION: is this also a Put method? It worked and updated my user entity's total points.
User.calculate_total_points(user)
While that method will certainly update the in-memory copy of the object, I see no reason to believe that the change will be persisted to the datastore. Datastore write operations are costly, so they are not going to happen implicitly.
After running this code, use the datastore viewer to look at the copy of the object in the datastore. I think you may find that it does not have the changed total_points value.
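A minimal sketch of what an explicit save would look like, continuing the question's code:
user = User.get_by_key_name("key1")
user.calculate_total_points()   # updates only the in-memory copy
user.put()                      # this explicit put() is what persists total_points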
