When using Google App Engine ndb, do I have to worry about mixing synchronous and asynchronous put operations in the same function?
e.g. Say that I have some code like this:
class Entity(ndb.Model):
    some_flag = ndb.BooleanProperty()

def set_flag():
    ent = Entity()
    ent.some_flag = False
    ent.put_async()
    ent.some_flag = True
    ent.put()
Does the datastore take care of ensuring that all pending async writes are applied before the synchronous write (so that after set_flag runs, it is guaranteed that the flag will be True)? Or is there a race condition because the async put might complete after the synchronous put?
No, the datastore does not take care of this for you.
Even with synchronous puts, calls from different threads can overwrite each other.
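If ordering matters in the original example, one option is to block on the async put's future before issuing the synchronous one. A minimal sketch using the standard ndb Future API (it defeats the point of the async call, but it illustrates the ordering):

def set_flag():
    ent = Entity()
    ent.some_flag = False
    future = ent.put_async()
    future.get_result()  # block until the async put has actually completed
    ent.some_flag = True
    ent.put()  # now guaranteed to be applied after the first write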
I recommend that you read up a bit on transactions, and when and why they are helpful.
For sample code, and a practical solution, you may have a look at Dan McGrath's reply to the "Cloud Datastore: ways to avoid race conditions" question.
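For the read-modify-write flavour of the problem, a transaction is the usual fix; a minimal sketch, assuming the Entity model above:

@ndb.transactional
def set_flag(key):
    ent = key.get()
    ent.some_flag = True
    ent.put()

Inside the transaction, concurrent writers to the same entity group are serialized instead of silently overwriting each other.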
I have a Flink job with the classic shape of datasource-operator1-operatorN-sink.
From what I can observe, the open() method of operator1 is invoked before the open() method of the datasource.
In the open() method of operator1 I need to handle some business logic that depends on things which get resolved in datasource.open().
1- Is there any way to ensure that operator1.open() is not invoked until datasource.open() has run?
2- Is there any way to communicate/signal from the datasource.open() method to the operator1.open() method?
Trying to establish some sort of out-of-band communication between operators often gets folks into trouble. At best it can screw up performance, and at worst it can lead to deadlocks.
What you might try instead is to rely on the signaling pathway that already exists between the data source and operator1 -- in other words, emit a specially encoded event from the data source that tells operator1 it can start now, and have operator1 wait for that special record before doing any other processing.
I am working on a CoProcessFunction that uses a third party library for detecting certain patterns of events based on some rules. So, in the end, the ProcessElement1 method is basically forwarding the events to this library and registering a callback so that, when a match is detected, the CoProcessFunction can emit an output event. For achieving this, the callback relies on a reference to the out: Collector[T] parameter in ProcessElement1.
Having said that, I am not sure whether this use case is well-supported by Flink, since:
There might be multiple threads spawned by the third-party library (let's say I have no control over the number of threads spawned; this is decided by the library)
I am not sure whether out might be recreated by Flink at some point, invalidating the references held by the callbacks and making them crash
So far I have not observed any issues, but I have only run my program at small scale. It would be great to hear from the experts whether my approach is correct, and how this could be approached otherwise.
As an update based on Arvid's comments: since my current process function already works well for me, except for the fact that I don't have access to the mailbox executor, I have simply created a custom operator for injecting that:
class MyOperator(myFunction: MyFunction)
    extends KeyedCoProcessOperator(myFunction) {

  private lazy val mailboxExecutor = getContainingTask
    .getMailboxExecutorFactory
    .createExecutor(getOperatorConfig.getChainIndex)

  override def open(): Unit = {
    super.open()
    userFunction.asInstanceOf[MyFunction].mailboxExecutor = mailboxExecutor
  }
}
This way I can register callbacks that will send mails to be processed one by one. In the main application I use it like this:
.transform("wrapping function in operator", new MyOperator(new MyFunction()))
So far everything looks good to me, but if you see problems or know a better way, it would be great to hear your thoughts on this again. In particular, the way of getting access to the mailbox executor is definitely a bit clumsy...
If you have asynchronous callbacks, you really should use asyncIO. So use your CoProcessFunction to emit a Tuple2 and have an asyncIO stage directly following it.
The OP has now added that he may not get a result back at all, which makes asyncIO difficult to use. You could rely on the timeout to trigger so that the element gets removed, but that may slow down processing, as asyncIO has a limited queue of "active" elements.
So the way to go in Flink 1.10 would probably be to implement a custom operator using the MailboxExecutor.
Getting the executor is still a bit clumsy, but you could check AsyncWaitOperator and the AsyncWaitOperatorFactory.
Code sketch for using the executor:
// setup is optional, but if you use timestamped records, you usually do this
void setup(StreamTask<?, ?> containingTask, StreamConfig config, Output<StreamRecord<OUT>> output) {
    super.setup(containingTask, config, output);
    this.timestampedCollector = new TimestampedCollector<>(output);
}

void processElement(record) {
    externalLib.addElement(record, (match) -> {
        mailboxExecutor.execute(() -> {
            timestampedCollector.collect(match);
        });
    });
}
Note that this involves quite a bit of @PublicEvolving code, and we already have some changes on our agenda. So be prepared to adjust code for 1.11.
Why is it that the line numbers from methods decorated with @ndb.tasklet are not present in appstats?
In our app we have a convention to include both a synchronous and an asynchronous version of functions, something like:
def do_something(self, param=None):
    return self.do_something_async(param=param).get_result()

@ndb.tasklet
def do_something_async(self, param=None):
    stuff = yield self.do_something_else_async(stuff=param)
    # ...
    raise ndb.Return(stuff)
…but even after setting appengine_config.appstats_MAX_STACK to something huge and emptying appengine_config.appstats_RE_STACK_SKIP, the reports in appstats still lose sight of my application code the first time some_tasklet.get_result() is called.
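In appengine_config.py that looks something like this (the names are the documented appstats knobs; the values are just illustrative of "huge" and "empty"):

# appengine_config.py
appstats_MAX_STACK = 1000  # something huge
appstats_RE_STACK_SKIP = ''  # skip nothing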
Here's an example from appstats:
The expanded stack frame at learn.get_list_of_cards_to_learn() simply returns self.get_list_of_cards_to_learn_async().get_result(), which is a tasklet that in turn calls a bunch of other tasklets. However, none of those tasklets are visible in appstats; all I see is ndb internals.
I'm not sure how exactly ndb executes those decorators, but if I put a pdb trace in one of them and run my test suite, I can see the stack frames all the way down to the pdb line I put in the tasklet, so I don't understand why that is not there in appstats.
Some of the requests cause a large amount of RPC calls, but I'm not sure how to figure out which part of my app is making them, as I cannot trace it past the first tasklet in appstats.
Is there something maybe I need to fine-tune in appengine_config?
This has to do with the way tasklets are managed by NDB's scheduler. There's not much you can do about it.
The put(...) method of GAE's memcache API accepts (in one of its overloaded implementations) a SetPolicy argument. The Javadocs here say that if you choose "ADD_ONLY_IF_NOT_PRESENT" as the policy, it is, and I quote:
"useful to avoid race conditions."
My questions are:
What happens with an expired value that was set with the same key? If I add something like (key=1, value=whatever) to memcache, this entry expires, and I then try to add (key=1, value=whatever2) using ADD_ONLY_IF_NOT_PRESENT, is whatever2 added to the cache or not?
What does "useful to avoid race conditions" mean? More specifically, does it mean that if I use put(...) with the ADD_ONLY_IF_NOT_PRESENT SetPolicy, I am no longer required to use getIdentifiable and putIfUntouched in order to avoid race conditions when adding stuff concurrently to memcache?
If the value expires, it's not in memcache anymore, so the RPC will set it.
If you do a get, then do a put only if nothing was there, you've introduced a race condition whereby someone else might've put the data while you were checking. Doing a single operation avoids this.
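To make that concrete, here is the same contrast sketched with the Python memcache API, whose add() carries the ADD_ONLY_IF_NOT_PRESENT semantics (put() with that policy in the Java API behaves the same way):

from google.appengine.api import memcache

# Racy: another request can store the key between the get and the set.
if memcache.get('key1') is None:
    memcache.set('key1', 'whatever2')

# Atomic: add() stores the value only if the key is absent (or expired),
# and returns False if a value was already there.
added = memcache.add('key1', 'whatever2')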
The app engine datastore, of course, has downtime. However, I'd like to have a "fail-safe" put which is more robust in the face of datastore errors (see motivation below). It seems like the task queue is an obvious place to defer writes when the datastore is unavailable. I don't know of any other solutions though (other than shipping off the data to a third-party via urlfetch).
Motivation: I have an entity which really needs to be put in the datastore - simply showing an error message to the user won't do. For example, perhaps some side effect has taken place which can't easily be undone (perhaps some interaction with a third-party site).
I've come up with a simple wrapper which (I think) provides a reasonable "fail-safe" put (see below). Do you see any problems with this, or have an idea for a more robust implementation? (Note: Thanks to suggestions posted in the answers by Nick Johnson and Saxon Druce, this post has been edited with some improvements to the code.)
import logging
from google.appengine.api.labs.taskqueue import taskqueue
from google.appengine.datastore import entity_pb
from google.appengine.ext import db
from google.appengine.runtime.apiproxy_errors import CapabilityDisabledError
def put_failsafe(e, db_put_deadline=20, retry_countdown=60, queue_name='default'):
    """Tries e.put(). On success, 1 is returned. If this raises a db.Error
    or CapabilityDisabledError, then a task will be enqueued to try to put the
    entity (the task will execute after retry_countdown seconds) and 2 will be
    returned. If the task cannot be enqueued, then 0 will be returned. Thus a
    falsey value is only returned on complete failure.

    Note that since taskqueue payloads are limited to 10kB, if the protobuf
    representing e is larger than 10kB then the put will be unable to be
    deferred to the taskqueue.

    If a put is deferred to the taskqueue, then it won't necessarily be
    completed as soon as the datastore is back up. Thus it is possible that
    e.put() will occur *after* other, later puts when 2 is returned.

    Ensure e's model is imported in the code which defines the task which
    tries to re-put e (so that e can be deserialized).
    """
    try:
        e.put(rpc=db.create_rpc(deadline=db_put_deadline))
        return 1
    except (db.Error, CapabilityDisabledError), ex1:
        try:
            taskqueue.add(queue_name=queue_name,
                          countdown=retry_countdown,
                          url='/task/retry_put',
                          payload=db.model_to_protobuf(e).Encode())
            logging.info('failed to put to db now, but deferred put to the '
                         'taskqueue e=%s ex=%s' % (e, ex1))
            return 2
        except (taskqueue.Error, CapabilityDisabledError):
            return 0
Request handler for the task:
from google.appengine.datastore import entity_pb
from google.appengine.ext import db, webapp

# IMPORTANT: This task deserializes entity protobufs. To ensure that this is
#            successful, you must import any db.Model that may need to be
#            deserialized here (otherwise this task may raise a KindError).

class RetryPut(webapp.RequestHandler):
    def post(self):
        e = db.model_from_protobuf(entity_pb.EntityProto(self.request.body))
        e.put()  # failure will raise an exception => the task will be retried
I don't expect to use this for every put - most of the time, showing an error message is just fine. It is tempting to use it for every put, but I think sometimes it might be more confusing for the user if I tell them that their changes will appear later (and continue to show them the old data until the datastore is back up and the deferred puts execute).
Your approach is reasonable, but has several caveats:
By default, a put operation will retry until it runs out of time. Since you have a backup strategy, you may want to give up sooner - in which case you should supply an rpc parameter to the put method call, specifying a custom deadline.
There's no need to set an explicit countdown - the task queue will retry failing operations for you at increasing intervals (see the sketch after this list).
You don't need to use pickle - Protocol Buffers have a natural string encoding which is much more efficient. See this post for a demonstration of how to use it.
As Saxon points out, task queue payloads are limited to 10 kilobytes, so you may have trouble with large entities.
Most importantly, this changes the datastore consistency model from 'strongly consistent' to 'eventually consistent'. That is, the put that you enqueued to the task queue could be applied at any time in the future, overwriting any changes that were made in the interim. Any number of race conditions are possible, essentially rendering transactions useless if there are puts pending on the task queue.
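As a sketch of the countdown point above: the enqueue inside put_failsafe can simply drop the countdown argument and let the task queue's built-in backoff drive the retries:

taskqueue.add(queue_name=queue_name,
              url='/task/retry_put',
              payload=db.model_to_protobuf(e).Encode())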
One potential issue is that tasks are limited to 10kb of data, so this won't work if you have an entity which is larger than that once serialized.
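A simple guard for that case, sketched with the same serialization call put_failsafe already uses (10kB being the task payload limit mentioned above):

MAX_TASK_PAYLOAD_BYTES = 10 * 1024  # taskqueue payload limit

payload = db.model_to_protobuf(e).Encode()
if len(payload) > MAX_TASK_PAYLOAD_BYTES:
    # Too large to defer via the task queue; fall back to reporting an
    # error (or stash the payload somewhere else until the datastore
    # recovers).
    logging.error('entity too large to defer: %d bytes', len(payload))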