Why does the NDB context cache set an environment variable? - google-app-engine

Looking through the Google NDB code, I can't quite work out why the context cache sets an environment variable.
The code in question:
https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/ndb/tasklets.py
_CONTEXT_KEY = '__CONTEXT__'

def get_context():
    # XXX Docstring
    ctx = None
    if os.getenv(_CONTEXT_KEY):
        ctx = _state.current_context
    if ctx is None:
        ctx = make_default_context()
        set_context(ctx)
    return ctx
(...)
def set_context(new_context):
    # XXX Docstring
    os.environ[_CONTEXT_KEY] = '1'
    _state.current_context = new_context
I know what it does, but why? (Speculation on my side removed; I don't want to mislead answers.)
Update:
The _state is based on this code:
class _State(utils.threading_local):
    """Hold thread-local state."""
    current_context = None
(...)
https://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/ext/ndb/utils.py
# Define a base class for classes that need to be thread-local.
# This is pretty subtle; we want to use threading.local if threading
# is supported, but object if it is not.
if threading.local.__module__ == 'thread':
    logging_debug('Using threading.local')
    threading_local = threading.local
else:
    logging_debug('Not using threading.local')
    threading_local = object

On App Engine, environment variables are scoped to the request, so they provide a way of getting at the current context anywhere in your code without passing a specific object around or building a request-specific lookup mechanism.
Some environment variables are set before the request is processed, from the real environment and app.yaml.
Then, for each request, the environment is built up from appengine_config.py, then the WSGI environment for that request, then the handler, and then whatever other components contribute (i.e. your own code may populate the environment); all of this is specific to each request.
So the environment is considered threadsafe (i.e. it won't leak things across concurrent requests).
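To make the request scoping concrete, here is a minimal sketch (not NDB's actual code; the key name and helper are made up) of using os.environ as a per-request flag the way tasklets.py does:

import os

_MY_FLAG = '__MY_REQUEST_FLAG__'  # hypothetical key, analogous to _CONTEXT_KEY

def get_per_request_state():
    # os.environ is rebuilt by the App Engine runtime for every request,
    # so this flag (and anything keyed off it) cannot leak between
    # concurrent requests the way a plain module-level global could.
    if not os.getenv(_MY_FLAG):
        os.environ[_MY_FLAG] = '1'
        # ... initialise whatever per-request state you need here ...
    return os.getenv(_MY_FLAG)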

Related

Django Tastypie: prevent file URIs being saved to a FileField

I've got a Django app with Tastypie and a mostly Backbone client side. One of my models has a few ImageFields. Here is a similar setup to help me explain the issue.
settings.py
MEDIA_URL = "/media/"
models.py
class Foo(models.Model):
    bar = models.ImageField()
    baz = models.CharField(max_length=255)
api.py
class FooResource(ModelResource):
    class Meta:
        queryset = models.Foo.objects.all()
        resource_name = "foo"
        authorization = Authorization()
When I make a GET request to the API, it prepends the MEDIA_URL to the file names to return the URI where bar can be accessed. However, when I change the value of baz on a row and then make a PUT request with it, the value of bar also gets changed to that URI. This means that the next time I GET the row, the MEDIA_URL is prepended again, breaking the system by adding it on every successive GET and PUT. I end up with values for bar in the DB that look like:
/media/media/media/bar.jpg
I think I should fix this by overriding a method in my ModelResource, so that when there is a PUT request, it recognizes that it's getting either a URI or a real file, and alters its behavior in some way.
Is this the correct fix? Could you provide some implementation details of a fix?
I found the answer. Tastypie is well designed, similarly to Django. Unfortunately I was not familiar with the terminology, so when I read the docs I didn't understand them. You can easily modify the behavior of the API at many levels. Here is my new API definition, which fixed the issue.
api.py
class FooResource(ModelResource):
    class Meta:
        queryset = models.Foo.objects.all()
        resource_name = "foo"
        authorization = Authorization()

    def hydrate_bar(self, bundle):
        # Strip the MEDIA_URL back off before the value is saved.
        # Note: str.strip() treats its argument as a set of characters,
        # so this happens to work here but can over-strip file names that
        # start or end with those characters.
        bundle.data["bar"] = bundle.data["bar"].strip(MEDIA_URL)
        return bundle
I should add that this only works for me because I exclusively POST my image files individually via a post_detail method, which doesn't call this hydrate method. If I were to POST or PUT image files as part of the entire row, I expect this might raise an error if that case isn't handled.
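As a hedged alternative (not part of the original answer): because str.strip() works on a set of characters rather than a prefix, a startswith() check avoids over-stripping file names, and it also leaves genuine file uploads untouched. Assuming the same FooResource:

from django.conf import settings

class FooResource(ModelResource):
    class Meta:
        queryset = models.Foo.objects.all()
        resource_name = "foo"
        authorization = Authorization()

    def hydrate_bar(self, bundle):
        value = bundle.data.get("bar")
        # Only rewrite values that are the URI strings dehydrate produced;
        # real uploaded files pass through unchanged.
        if isinstance(value, basestring) and value.startswith(settings.MEDIA_URL):
            bundle.data["bar"] = value[len(settings.MEDIA_URL):]
        return bundle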

Google appengine pipelines - define the queue to use

I'd like to be able to set which queue to use within a pipeline, so that I can use custom settings for that pipeline in queue.yaml. The only way I can see to do this is to do so when the stage is started, via:
first_stage = ingest.CustomPipelineA(some_data)
first_stage.start(queue_name=foo)
However, I have nested and pre-requisite pipelines, such as:
with pipeline.InOrder():
    yield CustomPipelineA(some_shared_data)
    future_b = yield CustomPipelineB(some_shared_data)
    with pipeline.After(future_b):
        future_c = yield CustomPipelineC(some_shared_data, future_b)
        with pipeline.After(future_c):
            future_d = yield CustomPipelineD(some_shared_data, future_c)
It would be nice if I could set the queue name on the constructor, but it's not possible based on the pipeline docs: https://code.google.com/p/appengine-pipeline/wiki/GettingStarted#Execution_ordering.
Any ideas?
I think it's possible in Python (but not in Java). Here's an example from the same page you linked to:
stage = MySearchEnginePipeline(15)
stage.start(queue_name='pipelinequeue')
I believe I've figured this out for execution ordering: within the run() method, you can do
self._context.queue_name = "my-custom-queue-name"
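A minimal sketch of what that might look like inside a parent pipeline (the class and queue name are made up, and _context is an undocumented attribute, so treat this as an assumption rather than an official API):

class ParentPipeline(pipeline.Pipeline):
    def run(self, some_shared_data):
        # Assumes the framework has attached _context before run() executes;
        # children yielded after this point should be enqueued on this queue.
        self._context.queue_name = "my-custom-queue-name"
        with pipeline.InOrder():
            yield CustomPipelineA(some_shared_data)
            yield CustomPipelineB(some_shared_data)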

NDB doesn't return same instance for asynchronous gets when memcache is enabled

My program relies on the NDB context cache so that different ndb.Key.get() calls will receive the same model instance.
However, I discovered that this doesn't work properly with asynchronous gets. The expected behavior is that NDB's batcher combines the requests and returns the same model instance, but that doesn't happen.
The problem only occurs when memcache is enabled, which is also strange.
Here is a test case (run it twice):
class Entity(ndb.Model):
    pass

# Disabling memcache fixes the issue
# Entity._use_memcache = False

entity_key = ndb.Key('Entity', 1)

# Set up entity in datastore and memcache on first run
if not entity_key.get():
    entity = Entity(key=entity_key)
    entity.put()
    return

# Clear cache after Key.get() above
ndb.get_context().clear_cache()

# Entity is now in memcache and datastore but not context
entity_future_a = entity_key.get_async()
entity_future_b = entity_key.get_async()
entity_a = entity_future_a.get_result()
entity_b = entity_future_b.get_result()

# FAILS
assert entity_a is entity_b
So far I have only tested this on the local SDK.
It is possible that this is happening because you are not using yield there. Can you try setting things up (i.e. inside a tasklet) so that you can use
entity_a, entity_b = yield entity_future_a, entity_future_b
?
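Whether that actually makes the two gets share one instance is exactly what the question is probing, but a hedged sketch of the suggestion as a tasklet (reusing the question's entity_key) might look like:

@ndb.tasklet
def get_both(key):
    # Yielding both futures together lets NDB's autobatcher combine the
    # lookups, which is where the context cache gets a chance to hand
    # back a single shared instance.
    entity_a, entity_b = yield key.get_async(), key.get_async()
    raise ndb.Return(entity_a, entity_b)

entity_a, entity_b = get_both(entity_key).get_result()
assert entity_a is entity_b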

Serve different WSGI applications depending on request domain on GAE with threadsafe: true

What I'm trying to do is load different applications (webapp2.WSGIApplication) depending on the request domain.
For example, www.domain_1.com should load the application in app1.main.application, while www.domain_2.com should load app2.main.application.
Of course I'm on the same GAE app id, and I'm using namespaces to separate the apps' data.
This works pretty well with 'threadsafe: false' and a runner.py file where a function determines which application to return.
It seems that with 'threadsafe: true' the first request loads the WSGIApplication into the instance, and further requests don't execute the 'application dispatching' logic any more, so the request gets a response from the wrong app.
I'm using python2.7 and webapp2.
What is the best way to do this?
Edit:
A very simplified version of my runner.py:
import os
import wsgiref.handlers

from google.appengine.api import namespace_manager

def main():
    # The original snippet left 'domain' undefined; HTTP_HOST is one way to derive it.
    domain = os.environ.get('HTTP_HOST', '')
    if domain == 'www.mydomain_1.com':
        from app_1 import application
        namespace = 'app_1'
    elif domain == 'www.domain_2.com':
        from app_2 import application
        namespace = 'app_2'
    namespace_manager.set_namespace(namespace)
    return wsgiref.handlers.CGIHandler().run(application)

if __name__ == '__main__':
    main()
And in app.yaml:
- url: /.*
  script: app-runner.py
Your runner script is a CGI script. The full behavior of a CGI script with multithreading turned on is not documented, and the way the docs are written I'm guessing this won't be supported fully. Instead, the docs say you must refer to the WSGI application object directly from app.yaml, using the module path to a global variable containing the object, when multithreading is turned on. (CGI scripts retain their old behavior in Python 2.7 with multithreading turned off.)
The behavior you're seeing is explained by your use of imports. Within a single instance, each import statement only has an effect the first time it is encountered. After that, the module is assumed to be imported and the import statement has no effect on subsequent requests. You can import both values into separate names, then call run() with the appropriate value.
But if you want to enable multithreading (and that's a good idea), your dispatcher should be a WSGI application itself, stored in a module global referred to by app.yaml. I don't know offhand how to dispatch a request to another WSGI application from within a WSGI application, but that might be a reasonable thing to do. Alternatively, you might consider using or building a layer above WSGI to do this dispatch.
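For what it's worth, a minimal sketch of such a dispatcher (module and import names are made up; app.yaml would point at something like dispatcher.application, assuming app_1 and app_2 each expose a module-level webapp2.WSGIApplication named 'application'):

# dispatcher.py -- a sketch, not the poster's actual code.
from app_1 import application as app_1_application
from app_2 import application as app_2_application

def application(environ, start_response):
    # A WSGI application is just a callable, so dispatching is a matter
    # of picking one and forwarding the call to it.
    if environ.get('HTTP_HOST', '').startswith('www.domain_1.com'):
        return app_1_application(environ, start_response)
    return app_2_application(environ, start_response)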
I made it happen by subclassing webapp2.WSGIApplication and overriding __call__(), which is called before dispatching to a RequestHandler.
I prefix the routes (and remove the prefix in the handler's initialize()) and sub-structure the config so I can use the instance memory.
class CustomWSGIApplication(webapp2.WSGIApplication):
    def __call__(self, environ, start_response):
        routes, settings, ns = get_app(environ)
        namespace_manager.set_namespace(ns)
        environ['PATH_INFO'] = '/%s%s' % (ns, environ.get('PATH_INFO'))
        for route in routes:
            r, h = route  # each route is a tuple of (mapping, handler)
            newroute = ('/%s%s' % (ns, r), h,)
            self.router.add(newroute)
        if settings:
            self.config[ns] = settings
        self.debug = debug
        with self.request_context_class(self, environ) as (request, response):
            try:
                if request.method not in self.allowed_methods:
                    # 501 Not Implemented.
                    raise exc.HTTPNotImplemented()
                rv = self.router.dispatch(request, response)
                if rv is not None:
                    response = rv
            except Exception, e:
                try:
                    # Try to handle it with a custom error handler.
                    rv = self.handle_exception(request, response, e)
                    if rv is not None:
                        response = rv
                except HTTPException, e:
                    # Use the HTTP exception as response.
                    response = e
                except Exception, e:
                    # Error wasn't handled so we have nothing else to do.
                    response = self._internal_error(e)
            try:
                return response(environ, start_response)
            except Exception, e:
                return self._internal_error(e)(environ, start_response)
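The snippet above assumes a get_app() helper and a module-level instance that app.yaml points at. A hedged sketch of those missing pieces (names and return values are illustrative only, not the poster's actual code) might look like:

def get_app(environ):
    # Returns (routes, settings, namespace) for the requesting host.
    # app_1_routes / app_1_settings etc. are hypothetical module globals.
    host = environ.get('HTTP_HOST', '')
    if host == 'www.domain_1.com':
        return app_1_routes, app_1_settings, 'app_1'
    return app_2_routes, app_2_settings, 'app_2'

# Referenced from app.yaml as e.g. runner.application (with threadsafe: true).
application = CustomWSGIApplication([])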

How to pass a google mapreduce parameter to done_callback

I'm having trouble setting a parameter when kicking off a mapreduce via start_map so that I can access it in done_callback. Numerous things I've read imply that it's possible, but somehow I've not got the earth, moon, and stars properly aligned. Ultimately, what I'm trying to accomplish is to delete the temporary blob I created for the mapreduce job.
Here's how I kick it off:
mrID = control.start_map(
    "Find friends",
    "findfriendshandler.findFriendHandler",
    "mapreduce.input_readers.BlobstoreLineInputReader",
    {"blob_keys": blobKey},
    shard_count=7,
    mapreduce_parameters={'done_callback': '/fnfrdone', 'blobKey': blobKey})
In done_callback, the context object isn't available:
class FindFriendsDoneHandler(webapp.RequestHandler):
    def post(self):
        ctx = context.get()
        if ctx is not None:
            params = ctx.mapreduce_spec.mapper.params
            try:
                blobKey = params['blobKey']
                logging.info(['BLOBKEY ' + blobKey])
            except KeyError:
                logging.info('blobKey key not found in params')
        else:
            logging.info('context.get did not work')  # THIS IS WHAT GETS OUTPUT
Thanks!
EDIT: It seems like there may be more than one MR library, so I wanted to include my various imports:
from mapreduce import control
from mapreduce import operation as op
from mapreduce import context
from mapreduce import model
Below is the code I used in my done_callback handler to retrieve my blobKey user parameter:
class FindFriendsDoneHandler(webapp.RequestHandler):
    def post(self):
        mrID = self.request.headers['Mapreduce-Id']
        try:
            mapreduceState = MapreduceState.get_by_key_name(mrID)
            mrSpec = mapreduceState.mapreduce_spec
            jsonSpec = mrSpec.to_json()
            jsonParams = jsonSpec['params']
            blobKey = jsonParams['blobKey']
            blobInfo = BlobInfo.get(blobKey)
            blobInfo.delete()
            logging.info('Temp blob deleted successfully for mapreduce: ' + mrID)
        except:
            logging.warning('Unable to delete temp blob for mapreduce: ' + mrID)
This uses the mapreduce ID, passed into the done callback via the header, to retrieve the mapreduce state model object from the mapreduce state table. The model stores any user params sent via start_map in a mapreduce_spec property, which is in JSON format.
Note that MR itself actually stores the blob_key elsewhere in mapreduce_spec.
Thanks again to @Nick for pointing me to the model.py source file.
I'd love to hear if there's a simpler way to get at MR user params...
Context is only available to mappers/reducers - it's largely concerned with things that don't make sense outside the context of one. As you can see from the source, however, the "Mapreduce-Id" header is set, from which you can get the ID of the mapreduce job.
You shouldn't have to do your own cleanup, though - mapreduce has a handler that does it for you.
