I need to loop over a large dataset within App Engine. Of course, since the datastore times out after a fairly short period, I decided to use tasks to solve this problem. Here's an attempt to explain the method I'm trying to use:
Initialization of the task via HTTP POST:
0) Create the query (entity.query()), and set a batch_size limit (e.g. 500)
1) Check if there are any cursors--if this is the first time running, there won't be any.
2a) If there are no cursors, use iter() with the following options: produce_cursors=True, limit=batch_size
2b) If there are cursors, use iter() with the same options as 2a, plus set start_cursor to the cursor.
3) Do a for loop to iterate through the results pulled by iter()
4) Get cursor_after()
5) Queue new task (basically re-run the task that was running) passing the cursor into the payload.
So if this code worked the way I intended, there would only be one task running in the queue at any particular time. However, I started running the task this morning, and when I looked at the queue three hours later, there were four tasks in it! This is weird because the new task should only be launched at the end of the task that launches it.
Here's the actual code with no edits:
class send_missed_swipes(BaseHandler):  # disabled
    def post(self):
        """Loops across entire database (as filtered)"""
        # Settings
        BATCH_SIZE = 500
        cursor = self.request.get('cursor')
        start = datetime.datetime(2014, 2, 13, 0, 0, 0, 0)
        end = datetime.datetime(2014, 3, 5, 0, 0, 0, 0)
        # Filters
        swipes = responses.query()
        swipes = swipes.filter(responses.date > start)
        if cursor:
            num_updated = int(self.request.get('num_updated'))
            cursor = ndb.Cursor.from_websafe_string(cursor)
            swipes = swipes.iter(produce_cursors=True, limit=BATCH_SIZE, start_cursor=cursor)
        else:
            num_updated = 0
            swipes = swipes.iter(produce_cursors=True, limit=BATCH_SIZE)
        count = 0
        for swipe in swipes:
            count += 1
            if swipe.date > end:
                pass
            else:
                uKey = str(swipe.uuId.urlsafe())
                pKey = str(swipe.pId.urlsafe())
                act = swipe.act
                taskqueue.add(queue_name="analyzeData", url="/admin/analyzeData/send_swipes",
                              params={'act': act, 'uKey': uKey, 'pKey': pKey})
                num_updated += 1
        logging.info('count = ' + str(count))
        logging.info('num updated = ' + str(num_updated))
        cursor = swipes.cursor_after().to_websafe_string()
        taskqueue.add(queue_name="default", url="/admin/analyzeData/send_missed_swipes",
                      params={'cursor': cursor, 'num_updated': num_updated})
This is a bit of a complicated question, so please let me know if I need to explain it better. And thanks for the help!
p.s. Threadsafe is false in app.yaml
I believe a task can be executed multiple times; therefore it is important to make your process idempotent.
From the docs at https://developers.google.com/appengine/docs/python/taskqueue/overview-push:
Note that this example is not idempotent. It is possible for the task
queue to execute a task more than once. In this case, the counter is
incremented each time the task is run, possibly skewing the results.
You can create the task with a name to handle this:
https://developers.google.com/appengine/docs/python/taskqueue/#Python_Task_names
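For example, here's a minimal sketch of how the follow-up task from the question could be given a deterministic name (the helper name and the hashing scheme are my own illustration, not part of the original code). With a name, a duplicate add() for the same batch raises an error instead of silently creating a second task:

import hashlib
from google.appengine.api import taskqueue

def enqueue_next_batch(cursor_str, num_updated):
    # Hypothetical helper: derive the task name from the cursor so a second
    # add() for the same batch is rejected instead of creating a duplicate.
    # (Task names only allow [a-zA-Z0-9_-], hence the hash.)
    task_name = 'send-missed-swipes-%s' % hashlib.md5(cursor_str or 'start').hexdigest()
    try:
        taskqueue.add(name=task_name,
                      queue_name='default',
                      url='/admin/analyzeData/send_missed_swipes',
                      params={'cursor': cursor_str, 'num_updated': num_updated})
    except (taskqueue.TaskAlreadyExistsError, taskqueue.TombstonedTaskError):
        # This batch was already enqueued (or already ran); safe to ignore.
        pass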
I'm curious why threadsafe=False in your yaml?
A bit off topic (since I'm not addressing your problems), but this sounds like a job for MapReduce.
On topic: you can create a custom queue with max_concurrent_requests=1. You could still have multiple tasks in the queue, but only one would be executing at a time.
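For reference, such a queue could be declared in queue.yaml roughly like this (the queue name is just an example):

queue:
- name: serial-queue          # example name, not from the question
  rate: 5/s
  max_concurrent_requests: 1  # at most one task of this queue runs at a time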
This may be a trivial question, but I was just hoping to get some practical experience from people who may know more about this than I do.
I wanted to generate a database in GAE from a very large series of XML files -- as a form of validation, I am calculating statistics on the GAE datastore, and I know there should be ~16,000 entities, but when I perform a count, I'm getting more on the order of 12,000.
The way I'm counting is basically this: I apply a filter, fetch a page of 1,000 entities, and then spin up a task for each entity (using its key). Each task then adds 1 to a counter that I'm storing.
I think I may have juiced the datastore writes too much; I set the rate of my task queues to 50/s. I did get some write errors, but not nearly enough to justify the 4,000 difference. Could it be that I was rushing the counting calls so much that it led to inconsistency? Would slowing the rate at which I process tasks to something like 5/s solve the problem? Thanks.
You can count your entities very easily (no tasks and almost for free):
int total = 0;
Query q = new Query("entity_kind").setKeysOnly();
// set your filter on this query
QueryResultList<Entity> results;
Cursor cursor = null;
FetchOptions queryOptions = FetchOptions.Builder.withLimit(1000).chunkSize(1000);
do {
    if (cursor != null) {
        queryOptions.startCursor(cursor);
    }
    results = datastore.prepare(q).asQueryResultList(queryOptions);
    total += results.size();
    cursor = results.getCursor();
} while (results.size() == 1000);
System.out.println("Total entities: " + total);
UPDATE:
If looping like I suggested takes too long, you can spin a task for every 100/500/1000 entities - it's definitely more efficient than creating a task for each entity. Even very complex calculations should take milliseconds in Java if done right.
For example, each task can retrieve a batch of entities, spin a new task (and pass a query cursor to this new task), and then proceed with your calculations.
I use NDB for my app and use iter() with a limit and start cursor to iterate through 20,000 query results in a task. A lot of the time I run into a timeout error.
Timeout: The datastore operation timed out, or the data was temporarily unavailable.
The way I make the call is like this:
results = query.iter(limit=20000, start_cursor=cursor, produce_cursors=True)
for item in results:
    process(item)
save_cursor_for_next_time(results.cursor_after().urlsafe())
I can reduce the limit, but I thought a task can run for as long as 10 minutes. 10 minutes should be more than enough time to go through 20,000 results. In fact, on a good run, the task can complete in about a minute.
If I switched to fetch() or fetch_page(), would they be more efficient and less likely to run into the timeout error? I suspect there's a lot of overhead in iter() that causes the timeout error.
Thanks.
Fetch is not really any more efficient; they all use the same underlying mechanism. The exception is when you know how many entities you want up front - then fetch() can be more efficient, since you end up with just one round trip.
You can increase the batch size for iter(); that can improve things. See https://developers.google.com/appengine/docs/python/ndb/queryclass#kwdargs_options
According to the docs the default batch size is 20, which for 20,000 entities means a lot of batches.
Other things can help too. Consider using map() and/or map_async() for the processing, rather than explicitly calling process(entity). Have a read of https://developers.google.com/appengine/docs/python/ndb/queries#map - introducing async into your processing can also mean improved concurrency.
Having said all of that, you should profile so you can understand where the time is used. For instance, the delays could be in your own processing code.
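To illustrate those two suggestions together, here is a rough sketch (the batch_size value and the process_async name are assumptions for the example, not from the question):

# query, cursor and process() as in the question.
# Larger batches mean fewer datastore round trips for the same 20,000 items.
results = query.iter(limit=20000,
                     start_cursor=cursor,
                     produce_cursors=True,
                     batch_size=500)  # default is 20
for item in results:
    process(item)

# Alternatively, let ndb drive the iteration; process_async would be an
# ndb.tasklet version of process(), letting RPCs overlap with processing.
query.map(process_async, limit=20000, batch_size=500)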
There are other things to consider with ndb, like context caching, which you may need to disable for a job like this. I also used the iter() method for these, and I made an ndb version of the mapper API from the old db.
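As a small illustrative sketch (not from the original answer), the in-context cache can be bypassed per call by passing context options to the query:

# query and process() as in the question; use_cache/use_memcache are ndb
# context options, so thousands of iterated entities are not kept in the
# request's in-context cache.
for item in query.iter(batch_size=500, use_cache=False, use_memcache=False):
    process(item)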
Here is my ndb mapper API, which should solve the timeout problems and the ndb caching issue, and make it easy to build this kind of thing:
http://blog.altlimit.com/2013/05/simple-mapper-class-for-ndb-on-app.html
With this mapper API you can create a job like the one below, or improve on it:
class NameYourJob(Mapper):
    def init(self):
        self.KIND = YourItemModel
        self.FILTERS = [YourItemModel.send_email == True]

    def map(self, item):
        # here is your process(item)
        # process here
        item.send_email = False
        self.update(item)


# Then run it like this
from google.appengine.ext import deferred
deferred.defer(NameYourJob().run, 50,  # <-- this is your batch
               _target='backend_name_if_you_want', _name='a_name_to_avoid_dups')
For potentially long query iterations, we use a time check to ensure slow processing can be handled. Given the disparities in GAE infrastructure performance, you will likely never find an optimal processing number.

The code excerpt below is from an online maintenance handler we use which generally runs within ten seconds. If not, we get a return code saying it needs to be run again, thanks to our timer check. In your case, you would likely break out of the loop after passing the cursor to your next queue task. Here is some sample code, edited down to hopefully give you a good idea of our logic.

One other note: you may choose to break this up into smaller bites and then fan out the smaller tasks by re-enqueueing the task until it completes. Doing 20k things at once seems very aggressive in GAE's highly variable environment. HTH -stevep
def over_dt_limit(start, milliseconds):
    dt = datetime.datetime.now() - start
    mt = float(dt.seconds * 1000) + (float(dt.microseconds) / float(1000))
    if mt > float(milliseconds):
        return True
    return False


# set a start time
start = datetime.datetime.now()

# handle a timeout issue inside your query iteration
for item in query.iter():
    # do your loop logic
    if over_dt_limit(start, 9000):
        # your specific time-out logic here
        break
When I add a task to the task queue, sometimes the task goes missing. I don't get any errors, but I just don't find the tasks in my logs. Suppose I add n tasks. The computation cannot go forward without these n tasks finishing. However, I find that one or more of these n tasks simply went missing after they were added, and my whole algorithm stops in the middle.
What could be the reason?
I keep a variable w to count the number of times a task was added. I observe w = n even though some tasks were not created.
def addtask_whx(index, user, seqlen, vp_compress, iseq_compress):
    global w
    while True:
        timeout_ms = 100
        taskq_name = 'whx' + '--' + str(index[0]) + '-' + str(index[1]) + '-' + str(index[2]) \
                     + '-' + str(index[3]) + '-' + str(index[5]) + '--' + user
        try:
            taskqueue.add(name=taskq_name + str(timeout_ms), queue_name='whx', url='/whx',
                          params={'m': index[0], 'n': index[1], 'o': index[2], 'p': index[3],
                                  'q': 0, 'r': index[5], 'user': user, 'seqlen': seqlen,
                                  'vp': vp_compress, 'iseq': iseq_compress})
            w = w + 1
            break
        except DeadlineExceededError:
            taskq_name = taskq_name + str(timeout_ms)
            time.sleep(float(timeout_ms) / 1000)
            timeout_ms = timeout_ms * 4
            logging.error("WHX Task Queue Add Timeout Retrying")
        except TransientError:
            taskq_name = taskq_name + str(timeout_ms)
            time.sleep(float(timeout_ms) / 1000)
            timeout_ms = timeout_ms * 4
            logging.error("WHX Task Queue Add Transient Error Retrying")
        except TombstonedTaskError:
            logging.error("WHX Task Queue Tombstoned Error")
            break
Disclaimer: this is not the answer you're looking for, but I hope it will help you nonetheless.
The computation cannot go forward without these n tasks finishing
It sounds like you are using the task queue for something it was not designed to do. You should read: http://code.google.com/appengine/docs/java/taskqueue/overview.html#Queue_Concepts
Tasks are not guaranteed to be executed in the order they arrive, and they are not guaranteed to be executed exactly once. In some cases, a single task may be executed more than once or not at all. Further, a task can be cancelled and re-queued at App Engine's discretion based on available resources. For example, your timeout_ms = 100 is very low; if a new JVM has to be started, which could take several seconds, tasks n+1 and n+2 may be executed before task n.
In short, the task queue is not a reliable mechanism for performing strictly sequential computation. You've been warned.
-tjw
I am trying to get the list of Amazon EC2 images into the Google datastore. I want to do this with a cron job inside GAE.
class AmazonEC2uswest(db.Model):
    ami = db.StringProperty(required=True)
    mani = db.StringProperty()
    typ = db.StringProperty()
    arch = db.StringProperty()
    state = db.StringProperty()
    owner = db.StringProperty()


class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        aws_access_key_id_admin = "<secret>"
        aws_secret_access_key_admin = "<secret>"
        conn_us_west = boto.ec2.connect_to_region('us-west-1',
                                                  aws_access_key_id=aws_access_key_id_admin,
                                                  aws_secret_access_key=aws_secret_access_key_admin,
                                                  is_secure=False)
        liste_images_us_west = conn_us_west.get_all_images()
        laenge_liste_images_us_west = len(liste_images_us_west)
        for i in range(laenge_liste_images_us_west):
            datastore_uswest_AMIs = AmazonEC2uswest(ami=liste_images_us_west[i].id,
                                                    mani=str(liste_images_us_west[i].location),
                                                    typ=liste_images_us_west[i].type,
                                                    arch=liste_images_us_west[i].architecture,
                                                    state=liste_images_us_west[i].state,
                                                    owner=liste_images_us_west[i].ownerId)
            datastore_uswest_AMIs.put()
The problem: getting the list with get_all_images() takes only a few seconds, but writing the data to the Google datastore takes far too much CPU time.
My IBM T42p (P4M at 2 GHz) takes approximately one minute for that piece of code!
Is it possible to optimize my code so that it uses less CPU time?
First possible optimisation: create all the entities in your loop, and then call db.put() with a list of all of them after you're finished. Something like:
entities = []
for i in range(laenge_liste_images_us_west):
    datastore_uswest_AMIs = AmazonEC2uswest(...)
    entities.append(datastore_uswest_AMIs)
db.put(entities)
or:
db.put([AmazonEC2uswest(...) for image in liste_images_us_west])
If that's still too slow, the right thing to do is probably:
Get the list of images.
Divide these up into small batches which can complete comfortably in under 30 seconds. So in your example which is currently taking a minute, you want at least 4 batches, maybe more, and the number of batches should depend on the number of images you get.
For each batch, add a task to a task queue, specifying which images to add to the DB. This might be done by specifying all the data, or just a range of images to handle. Which you choose depends on being able to store the data temporarily: there's a limit to how much you can store in a task; if you go past that, you could use memcache, or you could store only the image id rather than all the fields. Or you could create more tasks, so that the data for each batch stays under the limit (a rough sketch of this fan-out appears after this answer).
In the task handler, process just that batch. If you have all the data then great, otherwise get it again with get_all_images. Then generate and store just the entities that belong to this batch.
You don't have to use tasks; cron alone could handle it if you can remember how far you got the last time the job ran and continue from there next time. But tasks seem appropriate to me.
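Here is a rough sketch of that fan-out in Python, using the question's AmazonEC2uswest model; the batch size, the helper name, and the use of the deferred library are illustrative assumptions rather than the only way to do it:

import boto.ec2
from google.appengine.ext import db, deferred

AWS_KEY = '<secret>'     # same credentials as in the question's handler
AWS_SECRET = '<secret>'
BATCH = 25               # illustrative batch size; tune so each task finishes comfortably


def store_images_batch(image_ids):
    # Hypothetical task body: each deferred task runs in its own request, so it
    # opens a fresh EC2 connection, re-fetches only its slice, and batch-puts it.
    # (Requires the deferred builtin/handler to be enabled in app.yaml.)
    conn = boto.ec2.connect_to_region('us-west-1',
                                      aws_access_key_id=AWS_KEY,
                                      aws_secret_access_key=AWS_SECRET,
                                      is_secure=False)
    images = conn.get_all_images(image_ids=image_ids)
    db.put([AmazonEC2uswest(ami=img.id,
                            mani=str(img.location),
                            typ=img.type,
                            arch=img.architecture,
                            state=img.state,
                            owner=img.ownerId)
            for img in images])


# In the cron handler (conn_us_west as in the question): list only the
# image ids, then fan the work out in batches.
all_ids = [img.id for img in conn_us_west.get_all_images()]
for start in range(0, len(all_ids), BATCH):
    deferred.defer(store_images_batch, all_ids[start:start + BATCH])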
I figure one way to do a count is like this:
foo = db.GqlQuery("SELECT * FROM bar WHERE baz = 'baz'")
my_count = foo.count()
What I don't like is that my count will be limited to 1000 max, and my query will probably be slow. Anyone out there with a workaround? I have one in mind, but it doesn't feel clean. If only GQL had a real COUNT function...
You have to flip your thinking when working with a scalable datastore like GAE to do your calculations up front. In this case that means you need to keep counters for each baz and increment them whenever you add a new bar, instead of counting at the time of display.
class CategoryCounter(db.Model):
    category = db.StringProperty()
    count = db.IntegerProperty(default=0)
then when creating a Bar object, increment the counter
def createNewBar(category_name):
    bar = Bar(..., baz=category_name)
    counter = CategoryCounter.all().filter('category =', category_name).get()
    if not counter:
        counter = CategoryCounter(category=category_name)
    # increment whether the counter already existed or was just created
    counter.count += 1
    bar.put()
    counter.put()

db.run_in_transaction(createNewBar, 'asdf')
now you have an easy way to get the count for any specific category
CategoryCounter.all().filter('category =', category_name).get().count
+1 to Jehiah's response.
The official and blessed method of keeping object counters on GAE is to build a sharded counter. Despite the heavy-sounding name, this is pretty straightforward.
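A minimal sketch of the idea, adapted to the bar/baz example above (the shard count and the model/function names are made up for illustration; the App Engine docs have a fuller recipe):

import random
from google.appengine.ext import db

NUM_SHARDS = 20  # illustrative; more shards = more write throughput


class BarCounterShard(db.Model):
    # One of NUM_SHARDS rows per category; their counts are summed on read.
    category = db.StringProperty(required=True)
    count = db.IntegerProperty(default=0)


def increment(category_name):
    # Each write picks a random shard, so concurrent writes rarely contend
    # on the same entity.
    shard_index = random.randint(0, NUM_SHARDS - 1)
    shard_key_name = '%s-%d' % (category_name, shard_index)

    def txn():
        shard = BarCounterShard.get_by_key_name(shard_key_name)
        if shard is None:
            shard = BarCounterShard(key_name=shard_key_name, category=category_name)
        shard.count += 1
        shard.put()
    db.run_in_transaction(txn)


def get_count(category_name):
    # Reads are cheap: sum the small, fixed number of shards.
    return sum(s.count for s in BarCounterShard.all().filter('category =', category_name))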
Count functions in all databases are slow (e.g., O(n)) - the GAE datastore just makes that more obvious. As Jehiah suggests, you need to store the computed count in an entity and refer to that if you want scalability.
This isn't unique to App Engine - other databases just hide it better, up until the point where you're trying to count tens of thousands of records with each request, and your page render time starts to increase exponentially...
According to the GqlQuery.count() documentation, you can set the limit to be some number greater than 1000:
from models import Troll
troll_count = Troll.all(keys_only=True).count(limit=31337)
Sharded counters are the right way to keep track of numbers like this, as folks have said, but if you figure this out late in the game (like me) then you'll need to initialize the counters from an actual count of objects. But this is a great way to burn through your free quota of Datastore Small Operations (50,000 I think). Every time you run the code, it will use up as many ops as there are model objects.
I haven't tried it, and this is an utter resource hog, but perhaps iterating with .fetch() and specifying the offset would work?
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = gql_query.fetch(LIMIT, offset)
        if count < LIMIT:
            return result
        result += count
        offset += LIMIT
orip's solution works with a little tweaking:
LIMIT = 1000

def count(query):
    result = offset = 0
    gql_query = db.GqlQuery(query)
    while True:
        count = len(gql_query.fetch(LIMIT, offset))
        result += count
        offset += LIMIT
        if count < LIMIT:
            return result
We now have Datastore Statistics that can be used to query entity counts and other data. These values do not always reflect the most recent changes as they are updated once every 24-48 hours. Check out the documentation (see link below) for more details:
Datastore Statistics
As pointed out by @Dimu, the stats computed by Google on a periodic basis are a decent go-to resource when precise counts are not needed and the percentage of records is not changing drastically during any given day.
To query the statistics for a given Kind, you can use the following GQL structure:
select * from __Stat_Kind__ where kind_name = 'Person'
There are a number of properties returned by this which are helpful:
count -- the number of Entities of this Kind
bytes -- total size of all Entities stored of this Kind
timestamp -- an as of date/time for when the stats were last computed
Example Code
To answer a follow-up question posted as a comment to my answer, I am now providing some sample C# code that I am using, which admittedly may not be as robust as it should be, but seems to work OK for me:
/// <summary>Returns an *estimated* number of entities of a given kind</summary>
public static long GetEstimatedEntityCount(this DatastoreDb database, string kind)
{
    var query = new GqlQuery
    {
        QueryString = $"select * from __Stat_Kind__ where kind_name = '{kind}'",
        AllowLiterals = true
    };
    var result = database.RunQuery(query);
    return (long) (result?.Entities?[0]?["count"] ?? 0L);
}
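For the Python runtime used elsewhere on this page, roughly the same lookup can be done through the db.stats module; a small sketch, assuming the 'Person' kind from the GQL example above:

from google.appengine.ext.db import stats

def get_estimated_entity_count(kind_name):
    # Queries the __Stat_Kind__ statistics entity for the given kind.
    # The numbers lag by 24-48 hours, so treat them as estimates.
    kind_stat = stats.KindStat.all().filter('kind_name =', kind_name).get()
    return kind_stat.count if kind_stat else 0

print get_estimated_entity_count('Person')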
The best workaround might seem a little counter-intuitive, but it works great in all my appengine apps. Rather than relying on the integer KEY and count() methods, you add an integer field of your own to the datatype. It might seem wasteful until you actually have more than 1000 records, and you suddenly discover that fetch() and limit() DO NOT WORK PAST THE 1000 RECORD BOUNDARY.
class MyObj(db.Model):
    num = db.IntegerProperty()
When you create a new object, you must manually retrieve the highest key:
max = MyObj.all().order('-num').get()
if max:
    max = max.num + 1
else:
    max = 0

newObj = MyObj(num=max)
newObj.put()
This may seem like a waste of a query, but get() returns a single record off the top of the index. It is very fast.
Then, when you want to fetch past the 1000th object limit, you simply do:
MyObj.all().filter('num >', 2345).fetch(67)
I had already done this when I read Aral Balkan's scathing review: http://aralbalkan.com/1504 . It's frustrating, but when you get used to it and you realize how much faster this is than count() on a relational db, you won't mind...