App Engine ndb parallel fetch by key - google-app-engine

I'm retrieving a batch of items using their keys, with something like this:
from google.appengine.ext.ndb import model
# …
keys = [model.Key('Card', id, namespace=ns) for id in ids]
cards = yield model.get_multi_async(keys)
The result of that in appstats is this:
The reverse-waterfall thing seems to be caused by keys being sent one by one in parallel, each in its own RPC.
My question is, is there a way to retrieve multiple objects by keys with a single RPC call? (Assuming that would speed up the overall response time of the app).

Quoting Guido's response in the thread linked by lecstor:
"You can always try issuing fewer RPCs by passing max_entity_groups_per_rpc=N to the get_multi_async() call. Multiple parallel RPCs should be more efficient than a single multi-key RPC. The engineers responsible for the HRD implementation assure me this is more efficient than issuing a single multi-key Get RPC."
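For reference, a minimal sketch of passing that option through, assuming the keys list from the question; get_cards_async is a hypothetical wrapper and the value 10 is an arbitrary example, not a recommendation:

from google.appengine.ext import ndb

@ndb.tasklet
def get_cards_async(keys):
    # Hint the datastore client to pack more entity groups into each RPC;
    # max_entity_groups_per_rpc=10 is just an illustrative value.
    cards = yield ndb.get_multi_async(keys, max_entity_groups_per_rpc=10)
    raise ndb.Return(cards)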

Related

What is being done in all this non-rpc time for a fetch?

Say I have some model Object and ~2000 entities in my datastore. Using Appstats with the following code:
ndb.get_context().set_cache_policy(lambda x: False)
ndb.get_context().set_memcache_policy(lambda x: False)
objects = Object.query().fetch()
I get the following profile
What is being done for the ~18 seconds that it's not waiting for RPCs?
It's likely deserializing those entities into Python objects, and that process is very, very slow. You shouldn't be fetching that many entities during a single web request from a client anyway, and if it's for some kind of batch job the time shouldn't matter that much (note also that once you go over several thousand items your requests will likely time out at some point, so you will need to use something like query cursors).
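For reference, a minimal sketch of that cursor-based paging, assuming the Object model above; process_all_objects and handle() are hypothetical names, with handle() standing in for whatever per-entity work you actually do:

from google.appengine.ext import ndb

def process_all_objects(batch_size=500):
    cursor = None
    more = True
    while more:
        # fetch_page returns (results, next_cursor, more_flag)
        objects, cursor, more = Object.query().fetch_page(
            batch_size, start_cursor=cursor)
        for obj in objects:
            handle(obj)  # hypothetical per-entity work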
You may also find this and this blog post helpful for some hacks to speed up the deserialization process.
Also, unrelated, but this is one of the many cases where Golang would shine and far outperform Python on the exact same task (that delay would be almost non-existent).

AppEngine: Entity schema migration w/ pipelines

I'm curious to know what the best practices are for migrating entity schemas in Google App Engine. We use pipeline a lot and my inclination was to build a pipeline task to handle this migration. So this is what I came up with (in this example, I store if the user's age is a prime number):
class MigrateUsers(pipeline.Pipeline):
    def run(self, keys):
        futures = []
        users = ndb.get_multi(keys)
        for user in users:
            user.age_is_prime = is_prime(user.age)
            futures.append(user.put_async())
        ndb.Future.wait_all(futures)

class Migration(pipeline.Pipeline):
    def run(self):
        all_results = []
        q = ds.User.query().filter()
        more = True
        next_cursor = None
        # Fetch user keys in batch and create MigrateUsers jobs
        while more:
            user_keys, next_cursor, more = \
                q.fetch_page(500, keys_only=True, start_cursor=next_cursor)
            all_results.append((yield MigrateUsers(keys=user_keys)))
        # Wait for them all to finish
        pipeline.After(*all_results)
My question really is, am I doing this right? It feels a little kludgy that my "Migration" task iterates over all the users in order to create segmented tasks. I did take a look at mapreduce, but I didn't get the feeling it was appropriate. I'd appreciate any advice, and if you're using mapreduce and wouldn't mind transforming my example, I'd really appreciate it.
MapReduce is great for migrations. In my own experience, a migration usually means I need to go over all my entities, update them, and then write them back to the datastore. In this case, I only really need the "map" part, and I don't need the "reduce" part of mapreduce.
The benefit of using mapreduce is that it'll automatically batch your entities over different instances in parallel, so your operation will complete much faster than running serially in your pipeline example. The MR SDK has a DatastoreInputReader() that will fetch every entity of a given kind and call a map function on each; you just have to provide that map function:
from mapreduce import operation as op

def prime_age_map(user_entity):
    user_entity.age_is_prime = is_prime(user_entity.age)
    if user_entity.age_is_prime:
        yield op.db.Put(user_entity)
There is some boilerplate code I'm not including because I haven't switched up to the latest SDK and what I have would probably be incorrect, but it should be pretty simple because you're only using half of the pipeline.
I'm not sure how realistic your example is, but if it is real and you have many entities, it would be much better to precalculate the prime values (http://primes.utm.edu/lists/small/1000.txt - only the first 30 or so are reasonable age values) and execute specific queries on those age values to update just those entities, instead of iterating over the entire Kind. You can do this using the MapReduce pipeline, but you'll have to modify the given DatastoreInputReader to issue a more specific query than fetching your entire Kind.
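As a rough sketch of that targeted approach, assuming ds.User has an integer age property (as the question's code suggests); the prime list and the single unbounded fetch are illustrative only:

from google.appengine.ext import ndb

# Primes in a plausible age range (illustrative, not exhaustive).
PRIME_AGES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97]

def mark_prime_ages():
    # Only users whose age is prime need age_is_prime=True; an IN filter
    # fans out into one subquery per value, so keep the list small.
    users = ds.User.query(ds.User.age.IN(PRIME_AGES)).fetch()
    for user in users:
        user.age_is_prime = True
    ndb.put_multi(users)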
I would strongly recommend looking into App Engine TaskQueues for schema migrations. They're a lot easier to set up and operate than backends or MapReduce, IMO. You can find some info here: blog entry.
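If you go that route, a very rough sketch using the deferred library; the batch size, the cursor chaining, and the reuse of the question's ds.User and is_prime are illustrative assumptions:

from google.appengine.ext import deferred, ndb

def migrate_users(cursor=None, batch_size=500):
    # Grab one page of keys, update that batch, then chain the next batch
    # as a new task so no single request runs too long.
    keys, next_cursor, more = ds.User.query().fetch_page(
        batch_size, keys_only=True, start_cursor=cursor)
    users = ndb.get_multi(keys)
    for user in users:
        user.age_is_prime = is_prime(user.age)
    ndb.put_multi(users)
    if more:
        deferred.defer(migrate_users, cursor=next_cursor, batch_size=batch_size)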

most efficient way to get, modify and put a batch of entities with ndb

In my app I have a few batch operations I perform.
Unfortunately this sometimes takes forever to update 400-500 entities.
What I have is all the entity keys; I need to get them, update a property, and save them to the datastore, and saving them can take up to 40-50 seconds, which is not what I'm looking for.
I'll simplify my model to explain what I do (which is pretty simple anyway):
class Entity(ndb.Model):
    title = ndb.StringProperty()

keys = [key1, key2, key3, key4, ..., key500]
entities = ndb.get_multi(keys)
for e in entities:
    e.title = 'the new title'
ndb.put_multi(entities)
Getting and modifying does not take too long. I tried get_async, getting inside a tasklet, and whatever else is possible, which only changes whether the get or the for loop takes longer.
But what really bothers me is that a put can take up to 50 seconds...
What is the most efficient way to do this operation (or operations) in a decent amount of time? Of course I know that it depends on many factors, like the complexity of the entity, but the time it takes to put is really over the acceptable limit to me.
I have already tried async operations and tasklets...
I wonder if doing smaller batches of e.g. 50 or 100 entities will be faster. If you make that into a tasklet, you can try running those tasklets concurrently.
I also recommend looking at this with Appstats to see if that shows something surprising.
Finally, assuming this uses the HRD, you may find that there is a limit on the number of entity groups per batch. This limit defaults very low. Try raising it.
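As a rough illustration of both suggestions; the chunk size and the entity-group limit here are example values, not tuned numbers:

from google.appengine.ext import ndb

def update_titles(keys, chunk_size=100):
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        entities = ndb.get_multi(chunk)
        for e in entities:
            e.title = 'the new title'
        # Raising the per-RPC entity-group limit lets the datastore pack
        # more of the batch into each underlying RPC.
        ndb.put_multi(entities, max_entity_groups_per_rpc=25)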
Sounds like what MapReduce was designed for. You can do this fast by getting and modifying all the entities in parallel, scaled across multiple server instances. Your cost goes up by using more instances, though.
I'm going to assume that you have the entity design that you want (i.e. I'm not going to ask you what you're trying to do and how maybe you should have one big entity instead of a bunch of small ones that you have to update all the time). Because that wouldn't be very nice. ( =
What if you used the Task Queue? You could create multiple tasks, and each task could take as URL params the keys it is responsible for updating, plus the property and value that should be set. That way the work is broken up into manageable chunks and the user's request can return immediately while the work happens in the background. Would that work?
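That could look roughly like this with the push queue API; the handler URL and parameter names are made up for illustration:

from google.appengine.api import taskqueue

def enqueue_title_updates(keys, new_title, chunk_size=100):
    for i in range(0, len(keys), chunk_size):
        chunk = keys[i:i + chunk_size]
        taskqueue.add(
            url='/tasks/update_titles',  # hypothetical handler that does the get/put
            params={'key': [k.urlsafe() for k in chunk],
                    'title': new_title})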

Mass updates in Google App Engine Datastore

What is the proper way to perform mass updates on entities in a Google App Engine Datastore? Can it be done without having to retrieve the entities?
For example, what would be the GAE equivalent to something like this in SQL:
UPDATE dbo.authors
SET city = replace(city, 'Salt', 'Olympic')
WHERE city LIKE 'Salt%';
There isn't a direct translation. The datastore really has no concept of updates; all you can do is overwrite old entities with a new entity at the same address (key). To change an entity, you must fetch it from the datastore, modify it locally, and then save it back.
There's also no equivalent to the LIKE operator. While wildcard suffix matching is possible with some tricks, if you wanted to match '%Salt%' you'd have to read every single entity into memory and do the string comparison locally.
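For the prefix case in the question ('Salt%'), the usual trick is an inequality range on the property. A sketch with the db API, assuming a model class named authors corresponding to the SQL table:

from google.appengine.ext import db

# Matches city values starting with 'Salt' by bounding the range:
# 'Salt' <= city < 'Salt' + u'\ufffd' (a very high code point).
q = authors.all().filter('city >=', 'Salt').filter('city <', 'Salt' + u'\ufffd')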
So it's not going to be quite as clean or efficient as SQL. This is a tradeoff with most distributed object stores, and the datastore is no exception.
That said, the mapper library is available to facilitate such batch updates. Follow the example and use something like this for your process function:
from mapreduce import operation as op

def process(entity):
    if entity.city.startswith('Salt'):
        entity.city = entity.city.replace('Salt', 'Olympic')
        yield op.db.Put(entity)
There are other alternatives besides the mapper. The most important optimization tip is to batch your updates; don't save back each updated entity individually. If you use the mapper and yield puts, this is handled automatically.
No, it can't be done without retrieving the entities.
There's no such thing as a '1000 max record limit', but there is of course a timeout on any single request - and if you have large amounts of entities to modify, a simple iteration will probably fall foul of that. You could manage this by splitting it up into multiple operations and keeping track with a query cursor, or potentially by using the MapReduce framework.
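A rough sketch of the cursor-splitting approach with the db API and the deferred library, again assuming an authors model; the batch size and task chaining are illustrative:

from google.appengine.ext import db, deferred

def update_cities(cursor=None, batch_size=250):
    q = authors.all()
    if cursor:
        q.with_cursor(cursor)
    batch = q.fetch(batch_size)
    for record in batch:
        if record.city.startswith('Salt'):
            record.city = record.city.replace('Salt', 'Olympic')
    db.put(batch)
    if len(batch) == batch_size:
        # More entities may remain; chain another task from the current cursor.
        deferred.defer(update_cities, cursor=q.cursor(), batch_size=batch_size)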
You could use the Query class: http://code.google.com/appengine/docs/python/datastore/queryclass.html
query = authors.all().filter('city >=', 'Salt').filter('city <', 'Salt' + u'\ufffd')
records = query.fetch(1000)
for record in records:
    record.city = record.city.replace('Salt', 'Olympic')
db.put(records)  # save the modified entities back

Google App Engine: efficient large deletes (about 90000/day)

I have an application that has only one Model with two StringProperties.
The initial number of entities is around 100 million (I will upload those with the bulk loader).
Every 24 hours I must remove about 70000 entities and add 100000 entities. My question is now: what is the best way of deleting those entities?
Is there any way to avoid fetching the entity before deleting it? I was unable to find a way of doing something like:
DELETE from xxx WHERE foo1 IN ('bar1', 'bar2', 'bar3', ...)
I realize that App Engine offers an IN clause (albeit with a maximum length of 30, because of the maximum number of individual requests per GQL query [1]), but to me that still seems strange, because I will have to get the x entities and then delete them again (making two RPC calls per entity).
Note: the entity should be ignored if not found.
EDIT: Added info about problem
These entities are simply domains. The first string is the SLD and the second the TLD (no subdomains). The application can be used to perform a request like http://[...]/available/stackoverflow.com . The application will return a True/False JSON object.
Why do I have so many entities? Because the datastore contains all registered domains (.com for now). I cannot perform a whois request in every case because of TOSs and latency. So I initially populate the datastore with an entire zone file and then daily add/remove the domains that have been registered/dropped... The problem is that these are pretty big quantities, and I have to figure out a way to keep costs down and add/remove 2*~100000 domains per day.
Note: there is hardly any computation going on as an availability request simply checks whether the domain exists in the datastore!
[1]: 'A maximum of 30 datastore queries are allowed for any single GQL query.' (http://code.google.com/appengine/docs/python/datastore/gqlreference.html)
If you are not doing so already, you should be using key_names for this.
You'll want a model something like:
class UnavailableDomain(db.Model):
    pass
Then you will populate your datastore like:
UnavailableDomain.get_or_insert(key_name='stackoverflow.com')
UnavailableDomain.get_or_insert(key_name='google.com')
Then you will query for available domains with something like:
is_available = UnavailableDomain.get_by_key_name('stackoverflow.com') is None
Then when you need to remove a bunch of domains because they have become available, you can build a big list of keys without having to query the database first like:
free_domains = ['stackoverflow.com', 'monkey.com']
db.delete([db.Key.from_path('UnavailableDomain', name) for name in free_domains])
I would still recommend batching up the deletes into something like 200 per RPC, if your free_domains list is really big.
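A sketch of that batching; 200 per call is the figure suggested above, and deleting a key that no longer exists is not an error, which matches the "ignore if not found" requirement:

from google.appengine.ext import db

def delete_free_domains(free_domains, batch_size=200):
    keys = [db.Key.from_path('UnavailableDomain', name) for name in free_domains]
    for i in range(0, len(keys), batch_size):
        db.delete(keys[i:i + batch_size])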
Have you considered the appengine-mapreduce library? It comes with the pipeline library, and you could utilise both to:
Create a pipeline for the overall task that you will run via cron every 24 hours.
Have the 'overall' pipeline start a mapper that filters your entities and yields the delete operations (see the sketch below).
After the delete mapper completes, have the 'overall' pipeline call an 'import' pipeline to start running your entity creation part.
The pipeline API can then send you an email to report on its status.
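For the delete-mapper step, the map function could be as small as this; is_dropped() is a hypothetical placeholder for however you flag domains that were dropped in the last day:

from mapreduce import operation as op

def delete_dropped_domain(entity):
    # Placeholder predicate: however you decide an UnavailableDomain entity
    # should be removed in today's run.
    if is_dropped(entity):
        yield op.db.Delete(entity)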

Resources