NDB tasklets not visible in appstats - google-app-engine

Why is it that the line numbers from methods decorated with @ndb.tasklet are not present in appstats?
In our app we have a convention to include both a synchronous and an asynchronous version of functions, something like:
def do_something(self, param=None):
    return self.do_something_async(param=param).get_result()

@ndb.tasklet
def do_something_async(self, param=None):
    stuff = yield self.do_something_else_async(stuff=param)
    # ...
    raise ndb.Return(stuff)
But even after setting appengine_config.appstats_MAX_STACK to something huge and emptying appengine_config.appstats_RE_STACK_SKIP, the appstats reports still stop tracing my application code at the first point where some_tasklet.get_result() is called.
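For reference, those two knobs live in appengine_config.py; a minimal sketch of the kind of tuning described above (the values are only illustrative, not recommendations):

# appengine_config.py
appstats_MAX_STACK = 100        # keep far more frames per trace than the default
appstats_RE_STACK_SKIP = r'(?!)'  # a regex that never matches, so no frames are filtered out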
Here's an example from appstats:
The expanded stack frame at learn.get_list_of_cards_to_learn() simply returns self.get_list_of_cards_to_learn_async().get_result(), which is a tasklet that in turn calls a bunch of other tasklets. However, none of those tasklets are visible in appstats; all I see is ndb internals.
I'm not sure exactly how ndb executes those decorators, but if I put a pdb trace in one of them and run my test suite, I can see the stack frames all the way down to the pdb line I put in the tasklet, so I don't understand why that is not shown in appstats.
Some of the requests cause a large number of RPC calls, but I'm not sure how to figure out which part of my app is making them, as I cannot trace past the first tasklet in appstats.
Is there something maybe I need to fine-tune in appengine_config?

This has to do with the way tasklets are managed by NDB's scheduler, and there's not much you can do about it. The underlying RPCs are typically issued later, when NDB's event loop flushes its auto-batcher, so by that point your tasklet's frames are no longer on the stack that appstats records.

Related

Google App Engine ndb: Order of mixed sync and async operations

When using Google App Engine ndb, do I have to worry about mixing synchronous and asynchronous put operations in the same function?
e.g. Say that I have some code like this:
class Entity(ndb.Model):
    some_flag = ndb.BooleanProperty()

def set_flag():
    ent = Entity()
    ent.some_flag = False
    ent.put_async()
    ent.some_flag = True
    ent.put()
Does the datastore take care of ensuring that all pending async writes are applied before the synchronous write (so that after set_flag runs, it is guaranteed that the flag will be True)? Or is there a race condition because the async put might complete after the synchronous put?
No, the datastore does not take care of this for you.
Even with synchronous puts, calls from different threads can overwrite each other.
I recommend that you read up a bit on transactions, and when and why they are helpful.
For sample code, and a practical solution, you may have a look at Dan McGrath's reply to the "Cloud Datastore: ways to avoid race conditions" question.
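For illustration, here is a hedged sketch (not taken from the linked reply) of two ways the snippet above could be made safe: explicitly waiting on the future returned by put_async(), or wrapping the update in a transaction so concurrent requests are serialized as well. The Entity model is copied from the question; set_flag_txn and the entity_id parameter are hypothetical names.

from google.appengine.ext import ndb

class Entity(ndb.Model):
    some_flag = ndb.BooleanProperty()

def set_flag():
    ent = Entity()
    ent.some_flag = False
    future = ent.put_async()
    future.get_result()   # wait for the async write before issuing the next one
    ent.some_flag = True
    ent.put()             # this write can no longer be overtaken by the earlier one

@ndb.transactional
def set_flag_txn(entity_id):
    # A transaction also protects against other requests racing on the same entity.
    ent = Entity.get_by_id(entity_id) or Entity(id=entity_id)
    ent.some_flag = True
    ent.put()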

How is SourceFunction#run supposed to work in Flink?

I have implemented a source for our message queue (which Flink doesn't support out of the box) by extending RichSourceFunction.
The run method I implemented has this signature:
override def run(sc: SourceFunction.SourceContext[String]): Unit = {
  val msg = read_from_mq
  sc.collect(msg)
}
When the run method is called and there is no new message in the message queue, should I return without calling sc.collect, or can I wait until new data arrives (in which case the run method will block)?
I would prefer the second option, but I'm not sure if this is the correct usage.
The run method of a Flink source should loop, endlessly producing output until its cancel method is called. When there's nothing to produce, then it's best if you can find a way to do a blocking wait.
The Apache NiFi source connector is another reasonable example to use as a model. You will note that it sleeps for a configurable interval when there's nothing for it to do.
As you probably know, both options are functionally correct and will yield correct results.
That said, the second one is preferred because you're not spinning the thread in a busy loop. In fact, if you take a look at the RabbitMQ connector implementation, you'll notice that this is exactly how it is implemented: inside its run it indirectly waits for messages to be placed on a BlockingQueue.

Making db.put() failsafe

I would like to make a db.put() operation in my Google App Engine service as resilient as possible, trying to maximize the likelihood of success even in the event of infrastructure issues or overload. What I have come up with at the moment is to catch every possible exception that could occur and to create a task that retries the commit if the first attempt fails:
try:
    db.put(new_user_record)
except DeadlineExceededError:
    deferred.defer(db.put, new_user_record)
except:
    deferred.defer(db.put, new_user_record)
Does this code trap all possible error paths? Or are there other ways db.put() can fail that would not be caught by this code?
Edit on March 28, 2013 - To clarify when failure is expected
It seems that the answers so far assume that if db.put() fails then it is because the datastore is down. In my experience of running fairly high-workload applications, this is not necessarily the case. Sometimes you run into workload-specific API bottlenecks, and sometimes the slowness of one API causes the request deadline to expire in another. Even though such events have a low frequency, their number can be sizable if traffic is high. These are the situations I am trying to cover.
I wouldn't say this is the best approach - whatever caused the original exception is likely to happen again. What I would do for extra resilience is first load the record to be saved into memcache and, in the event of an exception from the put (any exception), attempt a certain number of retries (for example 3) with a short sleep between each attempt. Depending on your application, this could either be a synchronous operation or, using deferred tasks, it could be done asynchronously with the data in memcache.
Finally, I'd actually do a query on the record in the datastore, even if there wasn't an exception, to confirm the row has actually been written.
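A minimal sketch of that idea, assuming a hypothetical helper put_with_retries and a plain db.Model entity; memcache, deferred, db and time.sleep are standard App Engine / Python APIs:

import time

from google.appengine.api import memcache
from google.appengine.ext import db, deferred

def put_with_retries(entity, attempts=3, delay=1):
    # Stage a serialized copy in memcache so an asynchronous retry could
    # rebuild the entity later if every synchronous attempt fails.
    memcache.set('pending_put', db.model_to_protobuf(entity).Encode(), time=600)
    for attempt in range(attempts):
        try:
            key = entity.put()
            # Optional check from the answer above: re-read to confirm the write.
            if db.get(key) is not None:
                return key
        except db.Error:
            pass
        time.sleep(delay)
    # Still failing after the retries: hand the work off to a deferred task.
    deferred.defer(db.put, entity)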
Well, I don't think it's a good idea to attempt such a fallback at all. If the datastore is down, it's down and you're out of luck (this shouldn't happen frequently :)
Some thoughts on your code:
There are many more exceptions that could be raised during a put operation (like InternalError, Timeout, CommittedButStillApplying, TransactionFailedError).
Some of them don't mean that the put has failed (i.e. CommittedButStillApplying just means the put operation is delayed). With your approach, you would end up having that entry twice in the datastore after your deferred call succeeds.
Tasks are limited to ~100KB (total size, not payload). If your payload is close to or above that limit, the deferred API will automatically try to serialize your payload to the datastore in order to keep the task itself below that limit. If the datastore is really unavailable, this will fail, too.
So it's probably better to catch datastore errors and inform your user that their request failed.
It's fine to retry, but use exponential backoff and, most importantly, proper transactions so that a failure doesn't end up as a partial write.
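As an illustration of that last point, here is a hedged sketch of retrying a transactional write with exponential backoff; retry_in_transaction is a hypothetical helper, while db.run_in_transaction and the exception classes are standard App Engine APIs:

import time

from google.appengine.ext import db

def retry_in_transaction(txn_func, max_attempts=5):
    delay = 0.1
    for attempt in range(max_attempts):
        try:
            # run_in_transaction makes the write all-or-nothing, so a failed
            # attempt can never leave a partial update behind.
            return db.run_in_transaction(txn_func)
        except (db.Timeout, db.InternalError, db.TransactionFailedError):
            time.sleep(delay)
            delay *= 2  # exponential backoff
    raise db.Error('put still failing after %d attempts' % max_attempts)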

How to prevent ndb from batching a put_async() call and make it issue the RPC immediately?

I have a request handler that updates an entity, saves it to the datastore, then needs to perform some additional work before returning (like queuing a background task and json-serializing some results). I want to parallelize this code, so that the additional work is done while the entity is being saved.
Here's what my handler code boils down to:
class FooHandler(webapp2.RequestHandler):
    @ndb.toplevel
    def post(self):
        foo = yield Foo.get_by_id_async(some_id)
        # Do some work with foo

        # Don't yield, as I want to perform the code that follows
        # while foo is being saved to the datastore.
        # I'm in a toplevel, so the handler will not exit as long as
        # this async request is not finished.
        foo.put_async()

        taskqueue.add(...)
        json_result = generate_result()
        self.response.headers["Content-Type"] = "application/json; charset=UTF-8"
        self.response.write(json_result)
However, Appstats shows that the datastore.Put RPC is being done serially, after taskqueue.Add:
A little digging around in ndb.context.py shows that a put_async() call ends up being added to an AutoBatcher instead of the RPC being issued immediately.
So I presume that the _put_batcher ends up being flushed when the toplevel waits for all async calls to be complete.
I understand that batching puts has real benefits in certain scenarios, but in my case here I really want the put RPC to be sent immediately, so I can perform other work while the entity is being saved.
If I do yield foo.put_async(), then I get the same waterfall in Appstats, but with datastore.Put being done before the rest:
This is to be expected, as yield makes my handler wait for the put_async() call to complete before executing the rest of the code.
I also have tried adding a call to ndb.get_context().flush() right after foo.put_async(), but the datastore.Put and taskqueue.BulkAdd calls are still not being made in parallel according to Appstats.
So my question is: how can I force the call to put_async() to bypass the auto batcher and issue the RPC immediately?
There's no supported way to do it. Maybe there should be. Can you try if this works?
loop = ndb.eventloop.get_event_loop()
while loop.run_idle():
    pass
You may have to look at the source code of ndb/eventloop.py to see what else you could try -- basically you want to try most of what run0() does except waiting for RPCs. In particular, it's possible that you would have to do this:
while loop.current:
    loop.run0()
while loop.run_idle():
    pass
(This still isn't supported, because there are other conditions you may have to handle too, but those don't seem to occur in your example.)
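To make that concrete, here is a hedged sketch of the (unsupported) event-loop draining wrapped in a small helper; flush_ndb_rpcs is a hypothetical name, and it would be called right after foo.put_async() in the handler from the question, before taskqueue.add():

from google.appengine.ext import ndb

def flush_ndb_rpcs():
    # Drain ndb's event loop idlers so the AutoBatcher flushes and any
    # pending datastore RPCs are actually sent, without waiting for their
    # results. Unsupported, per the answer above.
    loop = ndb.eventloop.get_event_loop()
    while loop.current:
        loop.run0()
    while loop.run_idle():
        pass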
Try this, I'm not 100% certain it will help:
foo = yield Foo.get_by_id_async(some_id)
future = foo.put_async()
future.done()
ndb requests get put into the autobatcher; the batch gets sent as an RPC when you need a result. Since you don't need the result of foo.put_async(), it doesn't get sent until you make another ndb call (you don't) or until the @ndb.toplevel ends.
Calling future.done() does not block, but I'm guessing it might trigger the request.
Another thing to try to force the operation is:
ndb.get_context().flush()

Fail-safe datastore updates on app engine

The app engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). It seems like the task queue is an obvious place to defer writes when the datastore is unavailable. I don't know of any other solutions though (other than shipping off the data to a third-party via urlfetch).
Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, perhaps some side effect has taken place which can't easily be undone (perhaps some interaction with a third-party site).
I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)
import logging

from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.datastore import entity_pb
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError

def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
    """Tries to e.put(). On success, 1 is returned. If this raises a db.Error
    or CapabilityDisabledError, then a task will be enqueued to try to put the
    entity (the task will execute after retry_countdown seconds) and 2 will be
    returned. If the task cannot be enqueued, then 0 will be returned. Thus a
    falsey value is only returned on complete failure.

    Note that since the taskqueue payloads are limited to 10kB, if the protobuf
    representing e is larger than 10kB then the put will be unable to be
    deferred to the taskqueue.

    If a put is deferred to the taskqueue, then it won't necessarily be
    completed as soon as the datastore is back up. Thus it is possible that
    e.put() will occur *after* other, later puts when 1 is returned.

    Ensure e's model is imported in the code which defines the task which tries
    to re-put e (so that e can be deserialized).
    """
    try:
        e.put(rpc=db.create_rpc(deadline=db_put_deadline))
        return 1
    except (db.Error, CapabilityDisabledError), ex1:
        try:
            taskqueue.add(queue_name=queue_name,
                          countdown=retry_countdown,
                          url='/task/retry_put',
                          payload=db.model_to_protobuf(e).Encode())
            logging.info('failed to put to db now, but deferred put to the '
                         'taskqueue e=%s ex=%s' % (e, ex1))
            return 2
        except (taskqueue.Error, CapabilityDisabledError), ex2:
            return 0
Request handler for the task:
from google.appengine.datastore import entity_pb
from google.appengine.ext import db, webapp

# IMPORTANT: This task deserializes entity protobufs. To ensure that this is
#            successful, you must import any db.Model that may need to be
#            deserialized here (otherwise this task may raise a KindError).

class RetryPut(webapp.RequestHandler):
    def post(self):
        e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
        e.put()  # failure will raise an exception => the task will be retried
I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).
Your approach is reasonable, but has several caveats:
By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline.
There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals.
You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.
One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once pickled.
