How to pass parameters dynamically to map function on GAE mapreduce? - google-app-engine

I need to run a mapreduce job that is dynamic in the sense that parameters need to be passed to the map and reduce functions each time the mapreduce job is run (e.g., in response to a user request).
How do I accomplish this? I could not see anywhere in the documentation how to do dynamic processing at runtime for map and reduce.
class MatchProcessing(webapp2.RequestHandler):
def get(self):
requestKeyID=int(self.request.get('riderbeeRequestID'))
userKey=self.request.get('userKey')
pipeline = MatchingPipeline(requestKeyID, userKey)
pipeline.start()
self.redirect(pipeline.base_path + "/status?root=" + pipeline.pipeline_id)
class MatchingPipeline(base_handler.PipelineBase):
def run(self, requestKeyID, userKey):
yield mapreduce_pipeline.MapreducePipeline(
"riderbee_matching",
"tasks.matchingMR.riderbee_map",
"tasks.matchingMR.riderbee_reduce",
"mapreduce.input_readers.DatastoreInputReader",
"mapreduce.output_writers.BlobstoreOutputWriter",
mapper_params={
"entity_kind": "models.rides.RiderbeeRequest",
"requestKeyID": requestKeyID,
"userKey": userKey,
},
reducer_params={
"mime_type": "text/plain",
},
shards=16)
def riderbee_map(riderbeeRequest):
# would like to access the requestKeyID and userKey parameters that were passed in mapper_params
# so that we can do some processing based on that
yield (riderbeeRequest.user.email, riderbeeRequest.key().id())
def riderbee_reduce(key, values):
# would like to access the requestKeyID and userKey parameters that were passed earlier, perhaps through reducer_params
# so that we can do some processing based on that
yield "%s: %s\n" % (key, len(values))
Help please?

I'm pretty sure you can just specify parameters in mapper_parameters, and read them from the context module. See http://code.google.com/p/appengine-mapreduce/wiki/UserGuidePython#Mapper_parameters for more details.

This is how to access the mapper parameters from the mapper function, using the context module:
from mapreduce import context
def riderbee_map(riderbeeRequest):
ctx = context.get()
params = ctx.mapreduce_spec.mapper.params
requestKeyID = params["requestKeyID"]

Related

Flink Scala API functions on generic parameters

It's a follow up question on Flink Scala API "not enough arguments".
I'd like to be able to pass Flink's DataSets around and do something with it, but the parameters to the dataset are generic.
Here's the problem I have now:
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.reflect.ClassTag
object TestFlink {
def main(args: Array[String]) {
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.fromElements(
"Who's there?",
"I think I hear them. Stand, ho! Who's there?")
val split = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
id(split).print()
env.execute()
}
def id[K: ClassTag](ds: DataSet[K]): DataSet[K] = ds.map(r => r)
}
I have this error for ds.map(r => r):
Multiple markers at this line
- not enough arguments for method map: (implicit evidence$256: org.apache.flink.api.common.typeinfo.TypeInformation[K], implicit
evidence$257: scala.reflect.ClassTag[K])org.apache.flink.api.scala.DataSet[K]. Unspecified value parameters evidence$256, evidence$257.
- not enough arguments for method map: (implicit evidence$4: org.apache.flink.api.common.typeinfo.TypeInformation[K], implicit evidence
$5: scala.reflect.ClassTag[K])org.apache.flink.api.scala.DataSet[K]. Unspecified value parameters evidence$4, evidence$5.
- could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[K]
Of course, the id function here is just an example, and I'd like to be able to do something more complex with it.
How it can be solved?
you also need to have TypeInformation as a context bound on the K parameter, so:
def id[K: ClassTag: TypeInformation](ds: DataSet[K]): DataSet[K] = ds.map(r => r)
The reason is, that Flink analyses the types that you use in your program and creates a TypeInformation instance for each type you use. If you want to create generic operations then you need to make sure a TypeInformation of that type is available by adding a context bound. This way, the Scala compiler will make sure an instance is available at the call site of the generic function.

GAE: Writing the API: a simple PATCH method (Python)

I have a google-cloud-endpoints, in the docs, I did'nt find how to write a PATCH method.
My request:
curl -XPATCH localhost:8080/_ah/api/hellogreeting/1 -d '{"message": "Hi"}'
My method handler looks like this:
from models import Greeting
from messages import GreetingMessage
#endpoints.method(ID_RESOURCE, Greeting,`
path='hellogreeting/{id}', http_method='PATCH',
name='greetings.patch')
def greetings_patch(self, request):
request.message, request.username
greeting = Greeting.get_by_id(request.id)
greeting.message = request.message # It's ok, cuz message exists in request
greeting.username = request.username # request.username is None. Writing the IF conditions in each string(checking on empty), I think it not beatifully.
greeting.put()
return GreetingMessage(message=greeting.message, username=greeting.username)
So, now in Greeting.username field will be None. And it's wrong.
Writing the IF conditions in each string(checking on empty), I think it not beatifully.
So, what is the best way for model updating partially?
I do not think there is one in Cloud Endpoints, but you can code yours easily like the example below.
You will need to decide how you want your patch to behave, in particular when it comes to attributes that are objects : should you also apply the patch on the object attribute (in which case use recursion) or should you just replace the original object attribute with the new one like in my example.
def apply_patch(origin, patch):
for name in dir( patch ):
if not name.startswith( '__' ):
setattr(origin,name,getattr(patch,name))

Google AppEngine Pipelines API

I would like to rewrite some of my tasks as pipelines. Mainly because of the fact that I need a way of detecting when a task finished or start a tasks in specific order. My problem is that I'm not sure how to rewrite the recursive tasks to pipelines. By recursive I mean tasks that call themselves like this:
class MyTask(webapp.RequestHandler):
def post(self):
cursor = self.request.get('cursor', None)
[set cursor if not null]
[fetch 100 entities form datastore]
if len(result) >= 100:
[ create the same task in the queue and pass the cursor ]
[do actual work the task was created for]
Now I would really like to write it as a pipeline and do something similar to:
class DoSomeJob(pipeline.Pipeline):
def run(self):
with pipeline.InOrder():
yield MyTask()
yield MyOtherTask()
yield DoSomeMoreWork(message2)
Any help with this one will be greatly appreciated. Thank you!
A basic pipeline just returns a value:
class MyFirstPipeline(pipeline.Pipeline):
def run(self):
return "Hello World"
The value has to be JSON serializable.
If you need to coordinate several pipelines you will need to use a generator pipeline and the yield statement.
class MyGeneratorPipeline(pipeline.Pipeline):
def run(self):
yield MyFirstPipeline()
You can treat the yielding of a pipeline as if it returns a 'future'.
You can pass this future as the input arg to another pipeline:
class MyGeneratorPipeline(pipeline.Pipeline):
def run(self):
result = yield MyFirstPipeline()
yield MyOtherPipeline(result)
The Pipeline API will ensure that the run method of MyOtherPipeline is only called once the result future from MyFirstPipeline has been resolved to a real value.
You can't mix yield and return in the same method. If you are using yield the value has to be a Pipeline instance. This can lead to a problem if you want to do this:
class MyRootPipeline(pipeline.Pipeline):
def run(self, *input_args):
results = []
for input_arg in input_args:
intermediate = yield MyFirstPipeline(input_arg)
result = yield MyOtherPipeline(intermediate)
results.append(result)
yield results
In this case the Pipeline API just sees a list in your final yield results line, so it doesn't know to resolve the futures inside it before returning and you will get an error.
They're not documented but there is a library of utility pipelines included which can help here:
https://code.google.com/p/appengine-pipeline/source/browse/trunk/src/pipeline/common.py
So a version of the above which actually works would look like:
import pipeline
from pipeline import common
class MyRootPipeline(pipeline.Pipeline):
def run(self, *input_args):
results = []
for input_arg in input_args:
intermediate = yield MyFirstPipeline(input_arg)
result = yield MyOtherPipeline(intermediate)
results.append(result)
yield common.List(*results)
Now we're ok, we're yielding a pipeline instance and Pipeline API knows to resolve its future value properly. The source of the common.List pipeline is very simple:
class List(pipeline.Pipeline):
"""Returns a list with the supplied positional arguments."""
def run(self, *args):
return list(args)
...at the point that this pipeline's run method is called the Pipeline API has resolved all of the items in the list to actual values, which can be passed in as *args.
Anyway, back to your original example, you could do something like this:
class FetchEntitites(pipeline.Pipeline):
def run(self, cursor=None)
if cursor is not None:
cursor = Cursor(urlsafe=cursor)
# I think it's ok to pass None as the cursor here, haven't confirmed
results, next_curs, more = MyModel.query().fetch_page(100,
start_cursor=cursor)
# queue up a task for the next page of results immediately
future_results = []
if more:
future_results = yield FetchEntitites(next_curs.urlsafe())
current_results = [ do some work on `results` ]
# (assumes current_results and future_results are both lists)
# this will have to wait for all of the recursive calls in
# future_results to resolve before it can resolve itself:
yield common.Extend(current_results, future_results)
Further explanation
At the start I said we can treat result = yield MyPipeline() as if it returns a 'future'. This is not strictly true, obviously we are actually just yielding the instantiated pipeline. (Needless to say our run method is now a generator function.)
The weird part of how Python's yield expressions work is that, despite what it looks like, the value that you yield goes somewhere outside the function (to the Pipeline API apparatus) rather than into your result var. The value of the result var on the left side of the expression is also pushed in from outside the function, by calling send on the generator (the generator being the run method you defined).
So by yielding an instantiated Pipeline, you are letting the Pipeline API take that instance and call its run method somewhere else at some other time (in fact it will be passed into a task queue as a class name and a set of args and kwargs and re-instantiated there... this is why your args and kwargs need to be JSON serializable too).
Meanwhile the Pipeline API sends a PipelineFuture object into your run generator and this is what appears in your result var. It seems a bit magical and counter-intuitive but this is how generators with yield expressions work.
It's taken quite a bit of head-scratching for me to work it out to this level and I welcome any clarifications or corrections on anything I got wrong.
When you create a pipeline, it hands back an object that represents a "stage". You can ask the stage for its id, then save it away. Later, you can reconstitute the stage from the saved id, then ask the stage if it's done.
See http://code.google.com/p/appengine-pipeline/wiki/GettingStarted and look for has_finalized. There's an example that does most of what you need.

parallel code execution python2.7 ndb

in my app i for one of the handler i need to get a bunch of entities and execute a function for each one of them.
i have the keys of all the enities i need. after fetching them i need to execute 1 or 2 instance methods for each one of them and this slows my app down quite a bit. doing this for 100 entities takes around 10 seconds which is way to slow.
im trying to find a way to get the entities and execute those functions in parallel to save time but im not really sure which way is the best.
i tried the _post_get_hook but the i have a future object and need to call get_result() and execute the function in the hook which works kind of ok in the sdk but gets a lot of 'maximum recursion depth exceeded while calling a Python objec' but i can't really undestand why and the error message is not really elaborate.
is the Pipeline api or ndb.Tasklets what im searching for?
atm im going by trial and error but i would be happy if someone could lead me to the right direction.
EDIT
my code is something similar to a filesystem, every folder contains other folders and files. The path of the Collections set on another entity so to serialize a collection entity i need to get the referenced entity and get the path. On a Collection the serialized_assets() function is slower the more entities it contains. If i could execute a serialize function for each contained asset side by side it would speed things up quite a bit.
class Index(ndb.Model):
path = ndb.StringProperty()
class Folder(ndb.Model):
label = ndb.StringProperty()
index = ndb.KeyProperty()
# contents is a list of keys of contaied Folders and Files
contents = ndb.StringProperty(repeated=True)
def serialized_assets(self):
assets = ndb.get_multi(self.contents)
serialized_assets = []
for a in assets:
kind = a._get_kind()
assetdict = a.to_dict()
if kind == 'Collection':
assetdict['path'] = asset.path
# other operations ...
elif kind == 'File':
assetdict['another_prop'] = asset.another_property
# ...
serialized_assets.append(assetdict)
return serialized_assets
#property
def path(self):
return self.index.get().path
class File(ndb.Model):
filename = ndb.StringProperty()
# other properties....
#property
def another_property(self):
# compute something here
return computed_property
EDIT2:
#ndb.tasklet
def serialized_assets(self, keys=None):
assets = yield ndb.get_multi_async(keys)
raise ndb.Return([asset.serialized for asset in assets])
is this tasklet code ok?
Since most of the execution time of your functions are spent waiting for RPCs, NDB's async and tasklet support is your best bet. That's described in some detail here. The simplest usage for your requirements is probably to use the ndb.map function, like this (from the docs):
#ndb.tasklet
def callback(msg):
acct = yield ndb.get_async(msg.author)
raise tasklet.Return('On %s, %s wrote:\n%s' % (msg.when, acct.nick(), msg.body))
qry = Messages.query().order(-Message.when)
outputs = qry.map(callback, limit=20)
for output in outputs:
print output
The callback function is called for each entity returned by the query, and it can do whatever operations it needs (using _async methods and yield to do them asynchronously), returning the result when it's done. Because the callback is a tasklet, and uses yield to make the asynchronous calls, NDB can run multiple instances of it in parallel, and even batch up some operations.
The pipeline API is overkill for what you want to do. Is there any reason why you couldn't just use a taskqueue?
Use the initial request to get all of the entity keys, and then enqueue a task for each key having the task execute the 2 functions per-entity. The concurrency will be based then on the number of concurrent requests as configured for that taskqueue.

What is the best way to do AppEngine Model Memcaching?

Currently my application caches models in memcache like this:
memcache.set("somekey", aModel)
But Nicks' post at http://blog.notdot.net/2009/9/Efficient-model-memcaching suggests that first converting it to protobuffers is a lot more efficient. But after running some tests I found out it's indeed smaller in size, but actually slower (~10%).
Do others have the same experience or am I doing something wrong?
Test results: http://1.latest.sofatest.appspot.com/?times=1000
import pickle
import time
import uuid
from google.appengine.ext import webapp
from google.appengine.ext import db
from google.appengine.ext.webapp import util
from google.appengine.datastore import entity_pb
from google.appengine.api import memcache
class Person(db.Model):
name = db.StringProperty()
times = 10000
class MainHandler(webapp.RequestHandler):
def get(self):
self.response.headers['Content-Type'] = 'text/plain'
m = Person(name='Koen Bok')
t1 = time.time()
for i in xrange(int(self.request.get('times', 1))):
key = uuid.uuid4().hex
memcache.set(key, m)
r = memcache.get(key)
self.response.out.write('Pickle took: %.2f' % (time.time() - t1))
t1 = time.time()
for i in xrange(int(self.request.get('times', 1))):
key = uuid.uuid4().hex
memcache.set(key, db.model_to_protobuf(m).Encode())
r = db.model_from_protobuf(entity_pb.EntityProto(memcache.get(key)))
self.response.out.write('Proto took: %.2f' % (time.time() - t1))
def main():
application = webapp.WSGIApplication([('/', MainHandler)], debug=True)
util.run_wsgi_app(application)
if __name__ == '__main__':
main()
The Memcache call still pickles the object with or without using protobuf. Pickle is faster with a protobuf object since it has a very simple model
Plain pickle objects are larger than protobuf+pickle objects, hence they save time on Memcache, but there is more processor time in doing the protobuf conversion
Therefore in general either method works out about the same...but
The reason you should use protobuf is it can handle changes between versions of the models, whereas Pickle will error. This problem will bite you one day, so best to handle it sooner
Both pickle and protobufs are slow in App Engine since they're implemented in pure Python. I've found that writing my own, simple serialization code using methods like str.join tends to be faster since most of the work is done in C. But that only works for simple datatypes.
One way to do it more quickly is to turn your model into a dictionary and use the native eval / repr function as your (de)serializers -- with caution of course, as always with the evil eval, but it should be safe here given that there is no external step.
Below an example of a class Fake_entity implementing exactly that.
You first create your dictionary through fake = Fake_entity(entity) then you can simply store your data via memcache.set(key, fake.serialize()). The serialize() is a simple call to the native dictionary method of repr, with some additions if you need (e.g. add an identifier at the beginning of the string).
To fetch it back, simply use fake = Fake_entity(memcache.get(key)). The Fake_entity object is a simple dictionary whose keys are also accessible as attributes. You can access your entity properties normally, except referenceProperties give keys instead of fetching the object (which is actually quite useful). You can also get() the actual entity with fake.get(), or more interestigly, change it and then save with fake.put().
It does not work with lists (if you fetch multiple entities from a query), but could be easily be adjusted with join/split functions using an identifier like '### FAKE MODEL ENTITY ###' as the separator. Use with db.Model only, would need small adjustments for Expando.
class Fake_entity(dict):
def __init__(self, record):
# simple case: a string, we eval it to rebuild our fake entity
if isinstance(record, basestring):
import datetime # <----- put all relevant eval imports here
from google.appengine.api import datastore_types
self.update( eval(record) ) # careful with external sources, eval is evil
return None
# serious case: we build the instance from the actual entity
for prop_name, prop_ref in record.__class__.properties().items():
self[prop_name] = prop_ref.get_value_for_datastore(record) # to avoid fetching entities
self['_cls'] = record.__class__.__module__ + '.' + record.__class__.__name__
try:
self['key'] = str(record.key())
except Exception: # the key may not exist if the entity has not been stored
pass
def __getattr__(self, k):
return self[k]
def __setattr__(self, k, v):
self[k] = v
def key(self):
from google.appengine.ext import db
return db.Key(self['key'])
def get(self):
from google.appengine.ext import db
return db.get(self['key'])
def put(self):
_cls = self.pop('_cls') # gets and removes the class name form the passed arguments
# import xxxxxxx ---> put your model imports here if necessary
Cls = eval(_cls) # make sure that your models declarations are in the scope here
real_entity = Cls(**self) # creates the entity
real_entity.put() # self explanatory
self['_cls'] = _cls # puts back the class name afterwards
return real_entity
def serialize(self):
return '### FAKE MODEL ENTITY ###\n' + repr(self)
# or simply repr, but I use the initial identifier to test and eval directly when getting from memcache
I would welcome speed tests on this, I would assume this is quite faster than the other approaches. Plus, you do not have any risks if your models have changed somehow in the meantime.
Below an example of what the serialized fake entity looks like. Take a particular look at datetime (created) as well as reference properties (subdomain) :
### FAKE MODEL ENTITY ###
{'status': u'admin', 'session_expiry': None, 'first_name': u'Louis', 'last_name': u'Le Sieur', 'modified_by': None, 'password_hash': u'a9993e364706816aba3e25717000000000000000', 'language': u'fr', 'created': datetime.datetime(2010, 7, 18, 21, 50, 11, 750000), 'modified': None, 'created_by': None, 'email': u'chou#glou.bou', 'key': 'agdqZXJlZ2xlcgwLEgVMb2dpbhjmAQw', 'session_ref': None, '_cls': 'models.Login', 'groups': [], 'email___password_hash': u'chou#glou.bou+a9993e364706816aba3e25717000000000000000', 'subdomain': datastore_types.Key.from_path(u'Subdomain', 229L, _app=u'jeregle'), 'permitted': [], 'permissions': []}
Personally I also use static variables (faster than memcache) to cache my entities in the short term, and fetch the datastore when the server has changed or its memory has been flushed for some reason (which happens quite often in fact).

Resources