Will this transaction method work? - google-app-engine

I'm trying to write a transactional method for the app engine datastore but it's hard to test if it's working so I'm trying to validate my approach first. I have a post request that checks if a property is true, and if not true then do something else and set it to true.
def post(self):
key = self.request.get('key')
obj = db.get(key)
if obj.property is False:
update_obj(ojb.key()) // transactional method to update obj and set value to True
if obj.property is True:
// do something else

I am posting your code with some added comments
def post(self):
key = self.request.get('key')
# this gets the most recent entity using the key
obj = db.get(key)
if not obj.property:
# You should do the most recent check inside the transaction.
# After the above if-check the property might have changed by
# a faster request.
update_obj(ojb.key()) # transactional method to update obj and set value to True
if obj.property:
# do something else
Consider transactions as a group of actions on an entity that will all execute or all fail.
Transactions ensure that anything inside them will remain consistent. If something alters an entity and becomes different than it was, the transaction will fail and then repeat again.
Another approach if I understand what you need:
def post(self):
key = self.request.get('key')
self.check_obj_property(key)
# continue your logic
#db.transctional
def check_obj_property(key):
obj = db.get(key)
if obj.property:
#it's set already continue with other logic
return
# Its not set so set it and increase once the counter.
obj.property = True
# Do something else also?
obj.count += 1
# Save of course
obj.put()
As you see I've put all my checks inside a transaction.
The above code, if run concurrently, will only increase the count once.
Imagine it like a counter that counts how many times the obj.property has been set to True

Related

Dynamically change a #Core.periodic method's repeat time

Assuming I have a method X with a Core.periodic decorator initially set to 60 seconds, is there a way to change the repeat time of the method X to say 45 seconds from another method (call it Y) while the agent is running?
class SomeAgent(Agent)
...
#Core.periodic(settings.HEARTBEAT_PERIOD)
method X():
#Do stuff
method Y():
#Change method X's repeat time
If you want to change a periodic you must set it up using a call to self.core.periodic.
self.core.periodic returns a reference to the greenlet which is running the periodic method. Call the kill method on that to stop the greenlet before you start a new one. You'll want to setup the periodic in an onsetup method. Unless your periodic function uses the message bus, in which case you will want to put in it an onstart method.
class SomeAgent(Agent):
def __init__(self, **kwargs):
super(SomeAgent, self).__init__(**kwargs)
self.periodic_greenlet = None
#Core.receiver('onstart')
def onstart(self, sender, **kwargs):
self.periodic_greenlet = self.core.periodic(settings.HEARTBEAT_PERIOD, self.X)
def X(self):
#Do stuff
def Y(self, new_period):
#Checking for None may seem superfluous, but there are some possible race
#conditions at startup that cannot be completely eliminated.
if self.periodic_greenlet is not None:
self.periodic_greenlet.kill()
self.periodic_greenlet = self.core.periodic(new_period, self.X)

How to disable BadValueError (required field) value in Google App Engine during scanning all records?

I want to scan all records to check if there is not errors inside data.
How can I disable BadValueError to no break scan on lack of required field?
Consider that I can not change StringProperty to not required and such properties can be tenths in real code - so such workaround is not useful?
class A(db.Model):
x = db.StringProperty(required = True)
for instance in A.all():
# check something
if something(instance):
instance.delete()
Can I use some function to read datastore.Entity directly to avoid such problems with not need validation?
The solution I found for this problem was to use a resilient query, it ignores any exception thrown by a query, you can try this:
def resilient_query(query):
query_iter = iter(query)
while True:
next_result = query_iter.next()
#check something
yield next_result
except Exception, e:
next_result.delete()
query = resilient_query(A.query())
If you use ndb, you can load all your models as an ndb.Expando, then modify the values. This doesn't appear to be possible in db because you cannot specify a kind for a Query in db that differs from your model class.
Even though your model is defined in db, you can still use ndb to fix your entities:
# Setup a new ndb connection with ndb.Expando as the default model.
conn = ndb.make_connection(default_model=ndb.Expando)
# Use this connection in our context.
ndb.set_context(ndb.make_context(conn=conn))
# Query for all A kinds
for a in ndb.Query(kind='A'):
if a.x is None:
a.x = 'A more appropriate value.'
# Re-put the broken entity.
a.put()
Also note that this (and other solutions listed) will be subject to whatever time limits you are restricted to (i.e. 60 seconds on an App Engine frontend). If you are dealing with large amounts of data you will most likely want to write a custom map reduce job to do this.
Try setting a default property option to some distinct value that does not exist otherwise.
class A(db.Model):
x = db.StringProperty(required = True, default = <distinct value>)
Then load properties and check for this value.
you can override the _check_initialized(self) method of ndb.Model in your own Model subclass and replace the default logic with your own logic (or skip altogether as needed).

Google AppEngine Pipelines API

I would like to rewrite some of my tasks as pipelines. Mainly because of the fact that I need a way of detecting when a task finished or start a tasks in specific order. My problem is that I'm not sure how to rewrite the recursive tasks to pipelines. By recursive I mean tasks that call themselves like this:
class MyTask(webapp.RequestHandler):
def post(self):
cursor = self.request.get('cursor', None)
[set cursor if not null]
[fetch 100 entities form datastore]
if len(result) >= 100:
[ create the same task in the queue and pass the cursor ]
[do actual work the task was created for]
Now I would really like to write it as a pipeline and do something similar to:
class DoSomeJob(pipeline.Pipeline):
def run(self):
with pipeline.InOrder():
yield MyTask()
yield MyOtherTask()
yield DoSomeMoreWork(message2)
Any help with this one will be greatly appreciated. Thank you!
A basic pipeline just returns a value:
class MyFirstPipeline(pipeline.Pipeline):
def run(self):
return "Hello World"
The value has to be JSON serializable.
If you need to coordinate several pipelines you will need to use a generator pipeline and the yield statement.
class MyGeneratorPipeline(pipeline.Pipeline):
def run(self):
yield MyFirstPipeline()
You can treat the yielding of a pipeline as if it returns a 'future'.
You can pass this future as the input arg to another pipeline:
class MyGeneratorPipeline(pipeline.Pipeline):
def run(self):
result = yield MyFirstPipeline()
yield MyOtherPipeline(result)
The Pipeline API will ensure that the run method of MyOtherPipeline is only called once the result future from MyFirstPipeline has been resolved to a real value.
You can't mix yield and return in the same method. If you are using yield the value has to be a Pipeline instance. This can lead to a problem if you want to do this:
class MyRootPipeline(pipeline.Pipeline):
def run(self, *input_args):
results = []
for input_arg in input_args:
intermediate = yield MyFirstPipeline(input_arg)
result = yield MyOtherPipeline(intermediate)
results.append(result)
yield results
In this case the Pipeline API just sees a list in your final yield results line, so it doesn't know to resolve the futures inside it before returning and you will get an error.
They're not documented but there is a library of utility pipelines included which can help here:
https://code.google.com/p/appengine-pipeline/source/browse/trunk/src/pipeline/common.py
So a version of the above which actually works would look like:
import pipeline
from pipeline import common
class MyRootPipeline(pipeline.Pipeline):
def run(self, *input_args):
results = []
for input_arg in input_args:
intermediate = yield MyFirstPipeline(input_arg)
result = yield MyOtherPipeline(intermediate)
results.append(result)
yield common.List(*results)
Now we're ok, we're yielding a pipeline instance and Pipeline API knows to resolve its future value properly. The source of the common.List pipeline is very simple:
class List(pipeline.Pipeline):
"""Returns a list with the supplied positional arguments."""
def run(self, *args):
return list(args)
...at the point that this pipeline's run method is called the Pipeline API has resolved all of the items in the list to actual values, which can be passed in as *args.
Anyway, back to your original example, you could do something like this:
class FetchEntitites(pipeline.Pipeline):
def run(self, cursor=None)
if cursor is not None:
cursor = Cursor(urlsafe=cursor)
# I think it's ok to pass None as the cursor here, haven't confirmed
results, next_curs, more = MyModel.query().fetch_page(100,
start_cursor=cursor)
# queue up a task for the next page of results immediately
future_results = []
if more:
future_results = yield FetchEntitites(next_curs.urlsafe())
current_results = [ do some work on `results` ]
# (assumes current_results and future_results are both lists)
# this will have to wait for all of the recursive calls in
# future_results to resolve before it can resolve itself:
yield common.Extend(current_results, future_results)
Further explanation
At the start I said we can treat result = yield MyPipeline() as if it returns a 'future'. This is not strictly true, obviously we are actually just yielding the instantiated pipeline. (Needless to say our run method is now a generator function.)
The weird part of how Python's yield expressions work is that, despite what it looks like, the value that you yield goes somewhere outside the function (to the Pipeline API apparatus) rather than into your result var. The value of the result var on the left side of the expression is also pushed in from outside the function, by calling send on the generator (the generator being the run method you defined).
So by yielding an instantiated Pipeline, you are letting the Pipeline API take that instance and call its run method somewhere else at some other time (in fact it will be passed into a task queue as a class name and a set of args and kwargs and re-instantiated there... this is why your args and kwargs need to be JSON serializable too).
Meanwhile the Pipeline API sends a PipelineFuture object into your run generator and this is what appears in your result var. It seems a bit magical and counter-intuitive but this is how generators with yield expressions work.
It's taken quite a bit of head-scratching for me to work it out to this level and I welcome any clarifications or corrections on anything I got wrong.
When you create a pipeline, it hands back an object that represents a "stage". You can ask the stage for its id, then save it away. Later, you can reconstitute the stage from the saved id, then ask the stage if it's done.
See http://code.google.com/p/appengine-pipeline/wiki/GettingStarted and look for has_finalized. There's an example that does most of what you need.

parallel code execution python2.7 ndb

in my app i for one of the handler i need to get a bunch of entities and execute a function for each one of them.
i have the keys of all the enities i need. after fetching them i need to execute 1 or 2 instance methods for each one of them and this slows my app down quite a bit. doing this for 100 entities takes around 10 seconds which is way to slow.
im trying to find a way to get the entities and execute those functions in parallel to save time but im not really sure which way is the best.
i tried the _post_get_hook but the i have a future object and need to call get_result() and execute the function in the hook which works kind of ok in the sdk but gets a lot of 'maximum recursion depth exceeded while calling a Python objec' but i can't really undestand why and the error message is not really elaborate.
is the Pipeline api or ndb.Tasklets what im searching for?
atm im going by trial and error but i would be happy if someone could lead me to the right direction.
EDIT
my code is something similar to a filesystem, every folder contains other folders and files. The path of the Collections set on another entity so to serialize a collection entity i need to get the referenced entity and get the path. On a Collection the serialized_assets() function is slower the more entities it contains. If i could execute a serialize function for each contained asset side by side it would speed things up quite a bit.
class Index(ndb.Model):
path = ndb.StringProperty()
class Folder(ndb.Model):
label = ndb.StringProperty()
index = ndb.KeyProperty()
# contents is a list of keys of contaied Folders and Files
contents = ndb.StringProperty(repeated=True)
def serialized_assets(self):
assets = ndb.get_multi(self.contents)
serialized_assets = []
for a in assets:
kind = a._get_kind()
assetdict = a.to_dict()
if kind == 'Collection':
assetdict['path'] = asset.path
# other operations ...
elif kind == 'File':
assetdict['another_prop'] = asset.another_property
# ...
serialized_assets.append(assetdict)
return serialized_assets
#property
def path(self):
return self.index.get().path
class File(ndb.Model):
filename = ndb.StringProperty()
# other properties....
#property
def another_property(self):
# compute something here
return computed_property
EDIT2:
#ndb.tasklet
def serialized_assets(self, keys=None):
assets = yield ndb.get_multi_async(keys)
raise ndb.Return([asset.serialized for asset in assets])
is this tasklet code ok?
Since most of the execution time of your functions are spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably to use the ndb.map function, like this (from the docs):
#ndb.tasklet
def callback(msg):
acct = yield ndb.get_async(msg.author)
raise tasklet.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))
qry = Messages.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key having the task execute the 2 functions per-entity. The concurrency will be based then on the number of concurrent requests as configured for that taskqueue.

Django - Are model save() methods lazy?

Are model save() methods lazy in django?
For instance, at what line in the following code sample will django hit the database?
my_model = MyModel()
my_model.name = 'Jeff Atwood'
my_model.save()
# Some code that is independent of my_model...
model_id = model_instance.id
print (model_id)
It does not make much sense to have a lazy save, does it? Django's QuerySets are lazy, the model's save method is not.
From the django source:
django/db/models/base.py, lines 424–437:
def save(self, force_insert=False, force_update=False, using=None):
"""
Saves the current instance. Override this in a subclass if you want to
control the saving process.
The 'force_insert' and 'force_update' parameters can be used to insist
that the "save" must be an SQL insert or update (or equivalent for
non-SQL backends), respectively. Normally, they should not be set.
"""
if force_insert and force_update:
raise ValueError("Cannot force both insert and updating in \
model saving.")
self.save_base(using=using, force_insert=force_insert,
force_update=force_update)
save.alters_data = True
Then, save_base does the heavy lifting (same file, lines 439–545):
...
transaction.commit_unless_managed(using=using)
...
And in django/db/transaction.py, lines 167–178, you'll find:
def commit_unless_managed(using=None):
"""
Commits changes if the system is not in managed transaction mode.
"""
...
P.S. All line numbers apply to django version (1, 3, 0, 'alpha', 0).

Resources