How to pass a google mapreduce parameter to done_callback - google-app-engine

I'm having trouble setting a parameter when kicking off a mapreduce via start_map so I can access it in done_callback. Numerous things I've read imply that it's possible, but somehow I've not got the earth-moon-stars properly aligned. Ultimately, what I'm trying to accomplish is to delete the temporary blob I created for the mapreduce job.
Here's how I kick it off:
mrID = control.start_map(
    "Find friends",
    "findfriendshandler.findFriendHandler",
    "mapreduce.input_readers.BlobstoreLineInputReader",
    {"blob_keys": blobKey},
    shard_count=7,
    mapreduce_parameters={'done_callback': '/fnfrdone', 'blobKey': blobKey})
In done_callback, the context object isn't available:
class FindFriendsDoneHandler(webapp.RequestHandler):
    def post(self):
        ctx = context.get()
        if ctx is not None:
            params = ctx.mapreduce_spec.mapper.params
            try:
                blobKey = params['blobKey']
                logging.info(['BLOBKEY ' + blobKey])
            except KeyError:
                logging.info('blobKey key not found in params')
        else:
            logging.info('context.get did not work')  # THIS IS WHAT GETS OUTPUT
Thanks!
EDIT: It seems like there may be more than one MR library, so I wanted to include my various imports:
from mapreduce import control
from mapreduce import operation as op
from mapreduce import context
from mapreduce import model

Below is the code I used in my done_callback handler to retrieve my blobKey user parameter:
class FindFriendsDoneHandler(webapp.RequestHandler):
    def post(self):
        mrID = self.request.headers['Mapreduce-Id']
        try:
            mapreduceState = MapreduceState.get_by_key_name(mrID)
            mrSpec = mapreduceState.mapreduce_spec
            jsonSpec = mrSpec.to_json()
            jsonParams = jsonSpec['params']
            blobKey = jsonParams['blobKey']
            blobInfo = BlobInfo.get(blobKey)
            blobInfo.delete()
            logging.info('Temp blob deleted successfully for mapreduce: ' + mrID)
        except:
            logging.warning('Unable to delete temp blob for mapreduce: ' + mrID)
This uses the mapreduce ID passed to the done callback via the Mapreduce-Id header to retrieve the mapreduce state model object from the mapreduce state table. The model stores any user params sent via start_map in a mapreduce_spec property, which is in JSON format.
Note that MR, itself, actually stores the blob_key elsewhere in mapreduce_spec.
Thanks again to @Nick for pointing me to the model.py source file.
I'd love to hear if there's a simpler way to get at MR user params...

Context is only available to mappers/reducers - it's largely concerned with things that don't make sense outside the context of one. As you can see from the source, however, the "Mapreduce-Id" header is set, from which you can get the ID of the mapreduce job.
You shouldn't have to do your own cleanup, though - mapreduce has a handler that does it for you.

Related

How to send custom DocumentOperation to DocumentProcessing pipeline from a Processor?

Scenario: I've been stuck on this for way too long and I think the solution might be easy, but I just can't see it. This is the scenario:
cURL POST to http://localhost:8080/my_imports (raw JSON data on body)
->
MyImportsCustomHandler (extends ThreadedHttpRequestHandler) [Validations]
->
MyObjectProcessor (extends Processor) [JSON deserialize and data massage]
->
MyFirstDocumentProcessor (extends DocumentProcessor) [Set some fields and save]
The problem is that execution never reaches MyFirstDocumentProcessor, likely because the request didn't start from the document_api endpoints (intentionally).
There are no errors thrown; the processing route just never reaches the document processor chain. I think it should, because in MyObjectProcessor I'm doing:
DocumentType type = localDocHandler.getDocumentTypeManager().getDocumentType("my_doc");
DocumentId id = new DocumentId("id:default:my_doc::2");
Document document = new Document(type, id);
DocumentPut docPut = new DocumentPut(document);
Processing proc = com.yahoo.docproc.Processing.of(docPut);
I got this idea from here: https://github.com/vespa-engine/vespa/blob/master/docproc/src/test/java/com/yahoo/docproc/util/SplitterJoinerTestCase.java
but in that test I see the line splitter.process(p);, for which I'm not able to find a suitable replacement that works inside a Processor; in that context I only have the Request, Execution and DocumentProcessingHandler.
I hope somebody versed in Vespa can shine some light on this; it's just the last hop in the processing chain that I can't bridge :|
To write documents from Java code, you need to use the Document Access API:
http://docs.vespa.ai/documentation/document-api-guide.html#document-access
A working solution is in https://github.com/vespa-engine/sample-apps/pull/44

Django Tastypie prevent file uri's being saved to a FileField

I've got a Django app with Tastypie, and mainly BackBone client side. One of my models has a few ImageFields. Here is a similar setup to help me explain the issue.
settings.py
MEDIA_URL = "/media/"
models.py
class Foo(models.Model):
    bar = models.ImageField()
    baz = models.CharField()
api.py
class FooResource(ModelResource):
    class Meta:
        queryset = models.Foo.objects.all()
        resource_name = "foo"
        authorization = Authorization()
When I make a GET request to the API, it prepends the MEDIA_URL to the file names to return the URI where bar can be accessed. However, when I change the value of baz on a row and then make a PUT request with it, it also changes the value of bar to that URI. This means the next time I GET the row, it prepends the MEDIA_URL again, breaking the system and adding it on each successive GET and PUT. I end up with values for bar in the DB that look like:
/media/media/media/bar.jpg
I think I should fix this by overriding a method in my ModelResource, so that when there is a PUT request, it recognizes that it's getting either a URI or a real file, and alters its behavior in some way.
Is this the correct fix? Could you provide some implementation details of a fix?
I found the answer. Tastypie is well designed, much like Django. Unfortunately, I was not familiar with the terminology, so when I read the docs I didn't understand them. You can easily modify the behavior of the API at many levels. Here is my new API definition, which fixed the issue.
api.py
class FooResource(ModelResource):
    class Meta:
        queryset = models.Foo.objects.all()
        resource_name = "foo"
        authorization = Authorization()

    def hydrate_bar(self, bundle):
        bundle.data["bar"] = bundle.data["bar"].strip(MEDIA_URL)
        return bundle
I should add that this only works for me because I exclusively POST my image files individually with a post_detail method, which doesn't call this hook. If I were to POST or PUT image files as part of the entire row, I expect this might raise an error if that case isn't handled.
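One caveat worth noting: str.strip(MEDIA_URL) removes any of those characters from both ends rather than the literal prefix, so file names that happen to start or end with letters in MEDIA_URL can get mangled. A more defensive variant of the hook might look like this (a sketch only, assuming MEDIA_URL comes from settings):

from django.conf import settings

def hydrate_bar(self, bundle):
    value = bundle.data.get("bar")
    # Remove only a literal leading MEDIA_URL; leave plain file names untouched.
    if isinstance(value, basestring) and value.startswith(settings.MEDIA_URL):
        bundle.data["bar"] = value[len(settings.MEDIA_URL):]
    return bundle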

Provide a callback URL in Google Cloud Storage signed URL

When uploading to GCS (Google Cloud Storage) using the BlobStore's createUploadURL function, I can provide a callback together with header data that will be POSTed to the callback URL.
There doesn't seem to be a way to do that with GCS's signed URLs.
I know there is Object Change Notification, but that won't allow the user to provide upload-specific information in the header of a POST, the way it is possible with createUploadURL's callback.
My feeling is, if createUploadURL can do it, there must be a way to do it with signed URLs, but I can't find any documentation on it. I was wondering if anyone may know how createUploadURL achieves that callback-calling behavior.
PS: I'm trying to move away from createUploadURL because of the __BlobInfo__ entities it creates, which I do not need for my specific use case, and which somehow seem to be indelible and are wasting storage space.
Update: It worked! Here is how:
Short Answer: It cannot be done with PUT, but can be done with POST
Long Answer:
If you look at the signed-URL page, in front of HTTP_Verb, under Description, there is a subtle note that this page is only relevant to GET, HEAD, PUT, and DELETE, but POST is a completely different game. I had missed this, but it turned out to be very important.
There is a whole page of HTTP Headers that does not list an important header that can be used with POST; that header is success_action_redirect, as voscausa correctly answered.
In the POST page Google "strongly recommends" using PUT, unless dealing with form data. However, POST has a few nice features that PUT does not have. They may worry that POST gives us too many strings to hang ourselves with.
But I'd say it is totally worth dropping createUploadURL, and writing your own code to redirect to a callback. Here is how:
Code:
If you are working in Python, voscausa's code is very helpful.
I'm using apejs to write JavaScript in a Java app, so my code looks like this:
var exp = new Date();
exp.setTime(exp.getTime() + 1000 * 60 * 100); // 100 minutes

json['GoogleAccessId'] = String(appIdentity.getServiceAccountName());
json['key'] = keyGenerator();
json['bucket'] = bucket;
json['Expires'] = exp.toISOString();
json['success_action_redirect'] = "https://" + request.getServerName() + "/test2/";
json['uri'] = 'https://' + bucket + '.storage.googleapis.com/';

var policy = {
    'expiration': json.Expires,
    'conditions': [
        ["starts-with", "$key", json.key],
        {'Expires': json.Expires},
        {'bucket': json.bucket},
        {"success_action_redirect": json.success_action_redirect}
    ]
};

var plain = StringToBytes(JSON.stringify(policy));
json['policy'] = String(Base64.encodeBase64String(plain));
var result = appIdentity.signForApp(Base64.encodeBase64(plain, false));
json['signature'] = String(Base64.encodeBase64String(result.getSignature()));
The code above first fills in the relevant fields.
Then it creates a policy object, stringifies it, and converts it into a byte array (you can use .getBytes in Java; I had to write a function for JavaScript).
A base64-encoded version of this array populates the policy field.
The policy is then signed using the appidentity package. Finally, the signature is base64 encoded, and we are done.
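For reference, here is a rough Python equivalent of the same server-side step (a sketch only; build_signed_post_fields and its arguments are made-up names, and it assumes the App Engine app_identity API is available):

# Sketch: build the fields for a GCS POST form, signing the policy with the
# app's service account. The field names and 100-minute expiry mirror the JS above.
import base64
import datetime
import json

from google.appengine.api import app_identity

def build_signed_post_fields(bucket, object_key, redirect_url):
    expires = (datetime.datetime.utcnow() +
               datetime.timedelta(minutes=100)).strftime('%Y-%m-%dT%H:%M:%SZ')
    policy = {
        'expiration': expires,
        'conditions': [
            ['starts-with', '$key', object_key],
            {'bucket': bucket},
            {'success_action_redirect': redirect_url},
        ],
    }
    # The policy document is base64 encoded, then that encoded string is signed.
    encoded_policy = base64.b64encode(json.dumps(policy))
    _, signature = app_identity.sign_blob(encoded_policy)
    return {
        'GoogleAccessId': app_identity.get_service_account_name(),
        'key': object_key,
        'bucket': bucket,
        'success_action_redirect': redirect_url,
        'policy': encoded_policy,
        'signature': base64.b64encode(signature),
        'uri': 'https://%s.storage.googleapis.com/' % bucket,
    }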
On the client side, all members of the json object will be added to the form, except uri, which is the form's address.
var formData = new FormData(document.forms.namedItem('upload'));
var blob = new Blob([thedata], {type: 'application/json'});
var keys = ['GoogleAccessId', 'key', 'bucket', 'Expires', 'success_action_redirect', 'policy', 'signature'];
for (var field in keys) {
    formData.append(keys[field], url[keys[field]]);
}
formData.append('file', blob);

var rest = new XMLHttpRequest();
rest.open('POST', url.uri);
rest.onload = callback_function;
rest.send(formData);
If you do not provide a redirect, the response status will be 204 on success; if you do redirect, the status will be 200. If you get a 403 or 400, something about the signature or policy is probably wrong; look at the responseText, which is often helpful.
A few things to note:
Both POST and PUT have a signature field, but these mean slightly different things. In case of POST, this is a signature of the policy.
PUT uses a base URL that contains the key (object name), but the URL used for POST may only include the bucket name.
PUT requires the expiration as seconds from the UNIX epoch, but POST wants it as an ISO string.
A PUT signature should be URL encoded (Java: by wrapping it with a URLEncoder.encode call). But for POST, Base64 encoding suffices.
By extension, for POST do Base64.encodeBase64String(result.getSignature()), and do not use the Base64.encodeBase64URLSafeString function
You cannot pass extra headers with the POST; only those listed in the POST page are allowed.
If you provide a URL for success_action_redirect, it will receive a GET with the key, bucket and eTag.
The other benefit of using POST is that you can provide size limits. With PUT, however, if a file breaches your size restriction, you can only delete it after it has been fully uploaded, even if it is multiple terabytes.
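In a POST policy document, such a size limit is expressed with a content-length-range condition. For example (shown as an extra entry in the conditions list of the Python sketch above; the 10 MB cap is just an illustration):

# Reject any upload larger than 10 MB at POST time (values are min and max bytes).
policy['conditions'].append(['content-length-range', 0, 10 * 1024 * 1024])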
What is wrong with createUploadURL?
The method above is a manual createUploadURL.
But:
You don't get those __BlobInfo__ objects which create many indexes and are indelible. This irritates me as it wastes a lot of space (which reminds me of a separate issue: issue 4231. Please go give it a star)
You can provide your own object name, which helps create folders in your bucket.
You can provide different expiration dates for each link.
For the very, very few JavaScript App Engine developers, here is the helper used above:
function StringToBytes(sz) {
    var map = function(x) { return x.charCodeAt(0); };
    return sz.split('').map(map);
}
You can include success_action_redirect in a policy document when you use GCS POST Object.
Docs here: https://cloud.google.com/storage/docs/xml-api/post-object
Python example here: https://github.com/voscausa/appengine-gcs-upload
Example callback result:
def ok(self):
    """ GCS upload success callback """
    logging.debug('GCS upload result : %s' % self.request.query_string)
    bucket = self.request.get('bucket', default_value='')
    key = self.request.get('key', default_value='')
    key_parts = key.rsplit('/', 1)
    folder = key_parts[0] if len(key_parts) > 1 else None
A solution I am using is to turn on Object Change Notification. Any time an object is added, a POST is sent to a URL (in my case, a servlet in my project).
In the doPost() I get all the info about the object added to GCS and, from there, I can do whatever I need.
This has worked great in my App Engine project.

Using bottle.py and blobstore GAE

I recently started using bottle and GAE blobstore and while I can upload the files to the blobstore I cannot seem to find a way to download them from the store.
I followed the examples from the documentation but was only successful on the uploading part. I cannot integrate the example in my app since I'm using a different framework from webapp/2.
How would I go about creating an upload handler and download handler so that I can get the key of the uploaded blob and store it in my data model and use it later in the download handler?
I tried using BlobInfo.all() to query the blobstore, but I'm not able to get the key name field value of the entity.
This is my first interaction with the blobstore so I wouldn't mind advice on a better approach to the problem.
For serving a blob I would recommend looking at the source code of the BlobstoreDownloadHandler. It should be easy to port to bottle, since there's nothing very specific to the framework in it.
Here is an example on how to use BlobInfo.all():
for info in blobstore.BlobInfo.all():
    self.response.out.write('Name:%s Key: %s Size:%s Creation:%s ContentType:%s<br>' % (
        info.filename, info.key(), info.size, info.creation, info.content_type))
For downloads, you only really need to generate a response that includes the header "X-AppEngine-BlobKey: [your blob_key]", along with everything else you need, like a Content-Disposition header if desired. Or, if it's an image, you should probably just use the high-performance Image Serving API: generate a URL and redirect to it, and you're done.
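A minimal bottle route along those lines might look like this (a sketch only; the route path and the assumption that you already have the blob key string are mine, not from the original answer):

# App Engine serves the blob itself when X-AppEngine-BlobKey is set on the response.
from bottle import route, response
from google.appengine.ext import blobstore

@route('/serve/<blob_key>')
def serve_blob(blob_key):
    blob_info = blobstore.BlobInfo.get(blob_key)
    if blob_info is None:
        response.status = 404
        return 'blob not found'
    response.set_header('X-AppEngine-BlobKey', blob_key)
    response.content_type = blob_info.content_type
    # Optional: suggest a download filename to the browser.
    response.set_header('Content-Disposition',
                        'attachment; filename="%s"' % blob_info.filename)
    return ''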
For uploads, besides writing a handler for App Engine to call once the upload is safely in the blobstore (that's in the docs), you need a way to find the blob info in the incoming request. I have no idea what the request looks like in bottle. BlobstoreUploadHandler has a get_uploads method, and there's really no reason it needs to be an instance method as far as I can tell. So here's an example generic implementation of it that expects a webob request. For bottle you would need to write something similar that is compatible with bottle's request object.
def get_uploads(request, field_name=None):
    """Get uploads for this request.

    Args:
        request: A webob-style request object.
        field_name: Only select uploads that were sent as a specific field.

    Returns:
        A list of BlobInfo records corresponding to each upload, or an empty
        list if there are no blob-info records for field_name.

    Stolen from the SDK, since they only provide a way to get at this
    through their webapp framework.
    """
    if not getattr(request, "__uploads", None):
        request.__uploads = {}
        for key, value in request.params.items():
            if isinstance(value, cgi.FieldStorage):
                if 'blob-key' in value.type_options:
                    request.__uploads.setdefault(key, []).append(
                        blobstore.parse_blob_info(value))
    if field_name:
        try:
            return list(request.__uploads[field_name])
        except KeyError:
            return []
    else:
        results = []
        for uploads in request.__uploads.itervalues():
            results += uploads
        return results
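Hypothetical usage from an upload-callback handler (assuming a webob-style request and a form field named 'file'; both are illustrative):

# Grab the BlobInfo records for the 'file' field and keep the first blob's key.
uploads = get_uploads(request, field_name='file')
if uploads:
    blob_key = uploads[0].key()  # store this on your datastore model for later serving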
For anyone looking for this answer in the future: to do this you need bottle (d'oh!) and defnull's multipart module.
Since creating upload URLs is simple enough and covered in the GAE docs, I'll just cover the upload handler.
from bottle import request
from multipart import parse_options_header
from google.appengine.ext.blobstore import BlobInfo

def get_blob_info(field_name):
    try:
        field = request.files[field_name]
    except KeyError:
        # Maybe the form isn't multipart or the file wasn't uploaded, or some such error
        return None
    blob_data = parse_options_header(field.content_type)[1]
    try:
        return BlobInfo.get(blob_data['blob-key'])
    except KeyError:
        # Malformed request? Wrong field name?
        return None
Sorry if there are any errors in the code, it's off the top of my head.
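As a usage illustration (my own sketch, not part of the answer above; the route path and field name are made up):

# Upload-callback route in bottle using get_blob_info above.
from bottle import post, redirect

@post('/upload_callback')
def upload_callback():
    info = get_blob_info('file')
    if info is None:
        return 'upload failed'
    # Persist info.key() on your own model here, then send the user somewhere useful.
    redirect('/')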

Automatically cached models in App Engine

I've been working on creating a subclass of db.Model that is automatically cached, i.e.:
instance.put would store the entity in memcache before persisting it to the datastore
class.get_by_key_name would first check the cache, and if missed, would go to the datastore to retrieve it and cache it after retrieval
I developed the approach below (which appears to work for me), but I have a few questions:
I had read Nick Johnson's article on efficient model memcaching which suggests implementing the serialization for memcache through protocol buffers. Looking at the memcache API source code in the SDK, it looks like Google has already implemented protobuf serialization by default. Is my interpretation correct?
Am I missing some important details (which could get me in the future) in the way I am subclassing db.Model or overriding the two methods?
Is there a more efficient way of implementing what I've done below?
Are there guidelines, benchmarks or best practices for when such entity caching would make sense from a performance perspective? Or would it always make sense to cache entities? On a related note, should I be reading anything into the fact that Google hasn't provided a cached model in the modeling API? Are there too many special cases to be thinking about?
Below is my current implementation. I would really appreciate any and all guidance/suggestions on caching entities (even if your response is not a direct answer to one of the 4 questions above, but relevant to the topic overall).
from google.appengine.ext import db
from google.appengine.api import memcache
import os
import logging

class CachedModel(db.Model):
    '''Subclass of db.Model that automatically caches entities for put and
    attempts to load from cache for get_by_key_name
    '''

    @classmethod
    def get_by_key_name(cls, key_names, parent=None, **kwargs):
        cache = memcache.Client()
        # Ensure that every new deployment of the application results in a cache miss
        # by including the application version ID in the namespace of the cache entry
        namespace = os.environ['CURRENT_VERSION_ID'] + '_' + cls.__name__
        if not isinstance(key_names, list):
            key_names = [key_names]
        entities = cache.get_multi(key_names, namespace=namespace)
        if entities:
            logging.info('%s (namespace=%s) retrieved from memcache' %
                         (str(entities.keys()), namespace))
        missing_key_names = list(set(key_names) - set(entities.keys()))
        # For keys missed in memcache, attempt to retrieve entities from the datastore
        if missing_key_names:
            missing_entities = super(CachedModel, cls).get_by_key_name(
                missing_key_names, parent, **kwargs)
            missing_mapping = zip(missing_key_names, missing_entities)
            # Determine entities that exist in the datastore and store them to memcache
            entities_to_cache = dict()
            for key_name, entity in missing_mapping:
                if entity:
                    entities_to_cache[key_name] = entity
            if entities_to_cache:
                logging.info('%s (namespace=%s) cached by get_by_key_name' %
                             (str(entities_to_cache.keys()), namespace))
                cache.set_multi(entities_to_cache, namespace=namespace)
            non_existent = set(missing_key_names) - set(entities_to_cache.keys())
            if non_existent:
                logging.info('%s (namespace=%s) missing from cache and datastore' %
                             (str(non_existent), namespace))
            # Combine entities retrieved from cache and entities retrieved from datastore
            entities.update(missing_mapping)
        if len(key_names) == 1:
            return entities[key_names[0]]
        else:
            return [entities[key_name] for key_name in key_names]

    def put(self, **kwargs):
        cache = memcache.Client()
        namespace = os.environ['CURRENT_VERSION_ID'] + '_' + self.__class__.__name__
        cache.set(self.key().name(), self, namespace=namespace)
        logging.info('%s (namespace=%s) cached by put' % (self.key().name(), namespace))
        return super(CachedModel, self).put(**kwargs)
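For illustration, a hypothetical use of the class above (the kind and property names are my own):

# Any model subclassing CachedModel gets write-through caching on put() and
# cache-first reads on get_by_key_name(); entities need an explicit key_name.
class Account(CachedModel):
    email = db.StringProperty()

Account(key_name='alice', email='alice@example.com').put()  # also populates memcache
acct = Account.get_by_key_name('alice')                     # served from memcache if warm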
Rather than reinventing the wheel, why not switch to NDB, which already implements memcaching of model instances?
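For comparison, a minimal NDB sketch (model and property names are illustrative); ndb caches entities in its in-context cache and in memcache without any subclassing:

from google.appengine.ext import ndb

class Profile(ndb.Model):
    nickname = ndb.StringProperty()

# put() writes through ndb's caches; key.get()/get_by_id() check them first.
Profile(id='alice', nickname='Alice').put()
profile = ndb.Key(Profile, 'alice').get()  # typically served from cache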
You might check out Nick Johnson's article on adding pre and post hooks for data model classes as an alternative to overriding get_by_key_name. That way your hook could work even when using db.get and db.put.
That said, I've found in my app that I've had more dramatic performance improvements caching things at a higher level - like all the content I need to render an entire page, or the page's html itself if possible.
You also might check out the asynctools library which can help you run datastore queries in parallel and cache the results.
A lot of the good tips from Nick Johnson that you want to implement are already implemented in the appengine-mp module, like serialization via protocol buffers or prefetching entities.
As for your get_by_key_name method, you can check that module's code. If you want to create your own db.Model layer, maybe that can help you, but you can also contribute to improving the existing model. ;)
