Parallel and bulk process datastore elements in Google Cloud Dataflow

Problem:
I have data for 2M+ users in my datastore project. I would like to send a weekly newsletter to all of them. The mailing API accepts at most 50 email addresses per API call.
Previous Solution:
I used an App Engine backend and a simple datastore query to process all the records in one go. Sometimes, however, the process hits a critical memory overflow error and starts all over again, so some users get the same email more than once. So I moved to Dataflow.
Current Solution:
I use the FlatMap function to send each email id to a function and then send email individually to each user.
def process_datastore(project, pipeline_options):
    p = beam.Pipeline(options=pipeline_options)
    query = make_query()
    entities = (p | 'read from datastore' >> ReadFromDatastore(project, query))
    entities | beam.FlatMap(lambda entity: sendMail([entity.properties.get('emailID', "")]))
    return p.run()
With cloud dataflow, I have ensured that each user gets a mail only once and also nobody is missed out. There are no memory errors.
But the current process takes 7 hours to finish. I tried replacing FlatMap with ParDo, on the assumption that ParDo would parallelize the process, but that takes the same time.
Question:
How can I bunch the email IDs into groups of 50, so that each mail API call is used effectively?
How can I parallelize the process so that it takes less than an hour?

You could use query cursors to split your users in batches of 50 and do the actual batch processing (the email sending) inside push queue or deferred tasks. This would be a GAE-only solution, without cloud dataflow, IMHO a lot simpler.
You can find an example of such processing in Google appengine: Task queue performance (taking the answer into account as well). That solution is using the deferred library, but it is almost trivial to use push queue tasks instead.
The answer touches on the parallelism aspect in the sense that you may want to limit it to keep costs down.
You could also split the batching itself inside the tasks to obtain an indefinitely scalable solution (any number of recipients, without hitting memory or deadline exceeded failures), with the task re-enqueuing itself to continue the work from where it left off.
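The grouping step can be sketched independently of whatever runs the work. This is a minimal sketch, assuming a hypothetical `send_mail_batch` function that stands in for the real mailing API call (here it only records what it was given); the limit of 50 comes from the question.

```python
from itertools import islice

MAX_RECIPIENTS_PER_CALL = 50  # limit stated in the question


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


sent_batches = []


def send_mail_batch(addresses):
    # Hypothetical stand-in for the real mailing API call.
    sent_batches.append(list(addresses))


def send_newsletter(addresses):
    """Send to every address using as few API calls as possible."""
    for batch in chunked(addresses, MAX_RECIPIENTS_PER_CALL):
        send_mail_batch(batch)


emails = ["user%d@example.com" % i for i in range(120)]
send_newsletter(emails)
print([len(b) for b in sent_batches])  # prints [50, 50, 20]
```

Each batch could equally be handed to its own push or deferred task instead of being sent inline, which is where the parallelism would come from.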

Related

How to use Google Cloud Memcache to save/update unique items

In my application I run a cron job that loops over all users (2,500 users) to choose an item for every user out of 4k items, considering that:
- choosing the item is based on some user info,
- I need to make sure that each user takes a unique item that wasn't taken by anyone else, so the relation is one-to-one.
To achieve this, the cron job has to loop over the users one by one, sequentially: pick the item for a user, remove it from the list (so it cannot be chosen for later users), then move on to the next user.
The number of users and items in my system is growing every day, and this cron job now takes 2 hours to assign items to all users.
I need to improve this. One thing I thought about is using threads, but I can't do that since I'm using automatic scaling. So I started thinking about push queues: when the cron job runs, it would loop like this:
for (User user : users) {
    getMyItem(user.getId());
}
where getMyItem will push the task to a servlet to handle it and choose the best item for this person based on his data.
Suppose I do that: what would be the best/most robust way to avoid assigning an item to more than one user?
Since I'm using basic scaling with 8 instances, I can't rely on static variables.
One idea is to create a table in the DB that accepts only unique items and insert the taken items into it. If the insert succeeds, nobody else took the item, so I can assign it to that person. But this lowers performance, because every call needs a DB write (which I want to avoid).
I also thought about Memcache. It's really fast but not robust enough: if I store a Set of items in it (which accepts only unique items) and more than one thread tries to update the Set at the same time, only one thread will manage to save its data, and the other threads' data might be overwritten and lost.
I hope you guys can help to find a solution for this problem, thanks in advance :)
First - I would advise against relying solely on memcache for such an algorithm - the key thing to remember about memcache is that it is volatile and might disappear at any time, breaking the algorithm.
From Service levels:
Note: Whether shared or dedicated, memcache is not durable storage. Keys can be evicted when the cache fills up, according to the cache's LRU policy. Changes in the cache configuration or datacenter maintenance events can also flush some or all of the cache.
And from How cached data expires:
Under rare circumstances, values can also disappear from the cache prior to expiration for reasons other than memory pressure. While memcache is resilient to server failures, memcache values are not saved to disk, so a service failure can cause values to become unavailable.
I'd suggest adding a property, let's say called assigned, to the item entities, by default unset (or set to null/None) and, when it's assigned to a user, set to the user's key or key ID. This allows you:
- to query for unassigned items when you want to make assignments
- to skip items recently assigned but still showing up in the query results due to eventual consistency, so no need to struggle for consistency
- to be certain that an item can uniquely be assigned to only a single user
- to easily find items assigned to a certain user if/when you're doing per-user processing of items, eventually setting the assigned property to a known value signifying done when its processing completes
Note: you may need a one-time migration task to update this assigned property for any existing entities when you first deploy the solution, to have these entities included in the query index, otherwise they would not show up in the query results.
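The check-then-set on the assigned property can be sketched as follows. This is an in-memory illustration only: the `items` dict stands in for the item entities, a `threading.Lock` stands in for a datastore transaction, and the function names are hypothetical.

```python
import threading

# In-memory stand-in for item entities: item id -> assigned user id (None = unassigned).
items = {"item-a": None, "item-b": None, "item-c": None}
_txn_lock = threading.Lock()  # stands in for a datastore transaction


def try_assign(item_id, user_id):
    """Assign the item only if it is still unassigned; return True on success."""
    with _txn_lock:
        if items[item_id] is not None:
            return False  # already taken; the caller should try another item
        items[item_id] = user_id
        return True


def assign_first_free(candidate_ids, user_id):
    """Walk the candidates in preference order and claim the first free one."""
    for item_id in candidate_ids:
        if try_assign(item_id, user_id):
            return item_id
    return None


first = assign_first_free(["item-a", "item-b"], "user-1")   # "item-a"
second = assign_first_free(["item-a", "item-b"], "user-2")  # "item-b"
third = assign_first_free(["item-a", "item-b"], "user-3")   # None, both taken
```

Because the check and the write happen together, a stale query result that still lists an already assigned item is simply skipped rather than assigned twice.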
As for the growing execution time of the cron jobs: just split the work into multiple fixed-size batches (as many as needed) to be performed in separate requests, typically push tasks. The usual approach for splitting is using query cursors. The cron job would only trigger enqueueing the initial batch processing task, which would then enqueue an additional such task if there are remaining batches for processing.
To get a general idea of how such a solution works, take a peek at Google appengine: Task queue performance (it's Python, but the general idea is the same).
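The cursor-chained batching described above can be sketched in plain Python. Everything here is a stand-in under stated assumptions: a list plays the datastore kind, a `deque` plays the push queue, an integer offset plays the query cursor, and the function names are hypothetical.

```python
from collections import deque

USERS = ["user-%d" % i for i in range(23)]  # stand-in for the datastore kind
BATCH_SIZE = 10

task_queue = deque()  # stand-in for a push queue
processed = []


def fetch_page(cursor, limit):
    """Return (results, next_cursor, more), loosely mimicking a paged query."""
    start = cursor or 0
    page = USERS[start:start + limit]
    next_cursor = start + len(page)
    return page, next_cursor, next_cursor < len(USERS)


def process_batch(cursor=None):
    """Handle one batch, then enqueue a follow-up task if work remains."""
    page, next_cursor, more = fetch_page(cursor, BATCH_SIZE)
    processed.extend(page)
    if more:
        task_queue.append(next_cursor)


def run_cron():
    """The cron job only enqueues the first batch."""
    task_queue.append(None)


run_cron()
tasks_run = 0
while task_queue:  # the queue "executes" tasks one at a time
    process_batch(task_queue.popleft())
    tasks_run += 1
```

With 23 users and a batch size of 10, three tasks run in sequence and each one holds only a single page in memory.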
If you plan to enqueue push tasks from the cron job and want them to update key-value pairs to improve speed and performance, you can split the users and the items across multiple key-(list of values) pairs. Each push task then picks one key at random (you would write the logic to pick one out of, say, 4 or 5 keys), removes an item from that key's list, and writes the key back. Take a lock before doing this read-modify-write. Example of key-value pairs:
Userlist1: ["vijay",...]
Userlist2: ["ramana",...]

Does task queue truly run tasks in parallel?

We have an application that takes some input from a user and makes ~50 RPC calls. Each call takes around 4-5 minutes.
In the backend we are using a push queue and enqueuing each of these 50 calls as tasks. This is our queue spec:
queue:
- name: some-name
  rate: 500/s
  bucket_size: 100
  max_concurrent_requests: 500
My understanding is that all 50 requests should be run in parallel, and thus all of them should be complete in 4-5 minutes. But what's actually happening is that only around ~15 of these requests are returning results, while the rest cross the 10 min limit and time out. Another thing to note is that this seems to work fine if we bring down the number of requests to < 10.
There's always the possibility that the requests that timed out did so because the RPC response actually took that long. But what I want to confirm is:
(1) My understanding of the tasks running in parallel is correct.
(2) Our queue config and the number of tasks we're enqueuing have nothing to do with these requests timing out.
Are these correct?
(1) Parallel execution
Yes, tasks can be executed in parallel (up to 500 in your case), but with push queues your app has no control over the order in which tasks are executed and no direct control over how many tasks run at once. (Your app can control the sequence in which tasks are added to a queue, though; see the pattern in (2) below.)
App Engine uses several factors to decide which tasks are executed and how fast, notably the queue configuration and the scaling configuration (e.g. in app.yaml). Since you pay for at least the first 15 minutes of every instance, it could get very expensive to really launch 50 instances and then have them idle for 15 minutes before shutting down (until the next request). In this regard, the mechanism that spawns new instances is a little smarter, whether the load comes from user HTTP requests or from task queues.
(2) Request time outs
Yes, it is very unlikely that the enqueuing has anything to do with these request time-outs, unless the time-outs are an unintentional side effect of wrongly assuming that a particular task had already been executed.
To avoid request time-outs in general, it makes sense to split a task into multiple tasks. For example, if you have a task do_foo whose executions frequently exceed the time-outs (or memory limits), you could instead have do_foo offload work to other tasks that do the actual jobs.
For some migration tasks I use this pattern in a linear / sequential way. E.g. classmethod do_foo just queries entities of a certain kind (ordered by creation timestamp, for example), maybe filtered, page by page (e.g. 50 per page, in transactions with ancestor). It does some writes to the entities first, and only at the very end, after a successful commit, does it create a new transactional do_foo task with a cursor parameter pointing to the next page, possibly with a countdown of 1 sec to avoid transaction errors. The next execution of do_foo continues with the next page (of course only after the task for the previous page has completed).
Depending on the nature of the tasks, you could alternatively have each task fan out into multiple tasks per execution, e.g. do_foo triggers do_bar, do_something and do_more. Also note that up to five tasks can be enqueued within a single transaction.

Getting memory limit exceed error on App Engine when doing a db clean

I have the following code that I run every week through a cron job to clear older db entries. After 3-4 minutes I get "Exceeded soft private memory limit of 128 MB with 189 MB after servicing 1006 requests total."
Then there is also this message: "While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application." Below is the cleanup code.
def clean_user_older_stories(user):
    stories = Story.query(Story.user==user.key).order(-Story.created_time).fetch(offset=200, limit=500, keys_only=True)
    print 'stories len ' + str(len(stories))
    ndb.delete_multi(stories)

def clean_older_stories():
    for user in User.query():
        clean_user_older_stories(user)
I guess there is a better way to deal with this. How do I handle this?
It's because of the in-context cache:
When executing long-running queries in background tasks, it's possible for the in-context cache to consume large amounts of memory. This is because the cache keeps a copy of every entity that is retrieved or stored in the current context.
Try disabling the cache:
To avoid memory exceptions in long-running tasks, you can disable the cache or set a policy that excludes whichever entities are consuming the most memory.
ctx = ndb.get_context()
ctx.set_cache_policy(False)
ctx.set_memcache_policy(False)
Have you tried making your User query a keys_only query? You are not using any User properties besides the key and this would help cut down on memory usage.
You should page through large queries by setting a page_size and using a Cursor.
Your handler can invoke itself through the task queue with the next cursor until the end of the result set is reached. Optionally you can use the deferred API to cut down on boilerplate code for this kind of task.
That being said, the 'join' you are doing between User and Story could make this challenging. I would page through Users first, as it seems from what you have described that Users will grow over time but the number of Stories per User is limited.

GAE Map Reduce - hitting entities more than once?

I'm using Map Reduce (http://code.google.com/p/appengine-mapreduce/) to do an operation over a set of entities. However, I am finding my operations are being duplicated.
Are map reduce maps sometimes called more than once for a specific entity? Is this the case even if they don't fail the initial time?
edit: here are some more details.
def reparent_request(entity):
    # check if the entity has a parent
    if not is_valid_to_reparent(entity):
        return
    # copy it
    try:
        copy = clone_entity(Request, entity, parent=entity.user)
        copy.put()  # we hard put here so we can use the reference later in this function.
    except:
        ...
    ... update some references to the copied object ...
    # delete the original
    yield op.db.Delete(entity)
At the end, I am non-deterministically left with two entities, both with the new parent.
I've reparented a load of entities before - it was a nightmare because of the exact problem you're facing.
What I would do instead is:
Create a new queue. Ensure it's paused and that you have a lot of storage space dedicated to queues. It's only temporary, but you'll need it.
Instead of editing your entities in your map reduce job, add them to the queue as tasks with a name that is unique for each entity. The key works fine.
Because the queue is paused, you'll get an error if you try to add a task with the same name twice - so catch the error and skip it, because you know that entity must already have been touched by the map reduce job.
When you're confident that every entity has a matching queue task and the map reduce job has finished, unpause your queue. The queue will do the reparenting.
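The steps above can be sketched with an in-memory queue. This is only an illustration: `PausedQueue` and `DuplicateTaskError` are hypothetical names, not the App Engine API, and the "reparenting" is reduced to collecting the entity keys that would be worked on.

```python
class DuplicateTaskError(Exception):
    """Raised when a task with the same name was already added (hypothetical)."""


class PausedQueue:
    """In-memory stand-in for a paused queue that rejects duplicate task names."""

    def __init__(self):
        self._tasks = {}

    def add(self, name, payload):
        if name in self._tasks:
            raise DuplicateTaskError(name)
        self._tasks[name] = payload

    def drain(self):
        """Unpause: hand back every task exactly once."""
        tasks, self._tasks = self._tasks, {}
        return list(tasks.values())


def map_entity(queue, entity_key):
    """Mapper body: enqueue the work instead of doing it; skip repeats."""
    try:
        queue.add("reparent-" + entity_key, entity_key)
    except DuplicateTaskError:
        pass  # this entity was already seen by an earlier (re)run of the mapper


queue = PausedQueue()
for key in ["k1", "k2", "k1", "k3", "k2"]:  # the mapper hit k1 and k2 twice
    map_entity(queue, key)

work = queue.drain()  # each entity appears once, however often it was mapped
```

The task name does the deduplication, so the mapper itself can be rerun any number of times without producing extra work.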
A couple of notes:
* the task queue size can get pretty big. Can't remember numbers, but it was gigs. Also the size of the queue doesn't update in real time - so don't worry that it might still say gigs of tasks when the queue is nearly empty.
* the reliability of the queue storage is an unknown I believe. It didn't happen to us, but queue items could disappear I guess. Fortunately, you can rerun this process multiple times to ensure it worked, especially if you're deleting entities.
* you may want to ensure your queue has a concurrency limit on it. Without one, a delay in the execution of a couple of tasks can absolutely cripple your application. Learnt that the hard way! I think 30 concurrent tasks went quite well for us.
Hope that's useful, let me know if you come up with any improvements!
App Engine mapreduce runs on the task queue, and like anything else that uses the task queue, tasks have to be idempotent - that is, running them multiple times should have the same effect as running them once. Tasks will occasionally be run more than once; the mapreduce library may have its own reasons for rerunning mapper tasks, too.
In your situation, I'd suggest creating the new entity with a key whose ID is the same as the old entity; that way running it multiple times will just overwrite the same entity.
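A minimal sketch of that idea, using a dict as a stand-in for the datastore and tuples of (parent, kind, id) as stand-in keys; the `reparent` helper is hypothetical. Because the new key is derived from the old one, a duplicate run writes to the same slot.

```python
datastore = {}  # in-memory stand-in: key -> entity dict


def reparent(old_key, entity, new_parent):
    """Copy `entity` under `new_parent`, reusing the old ID in the new key."""
    _old_parent, kind, entity_id = old_key
    new_key = (new_parent, kind, entity_id)  # deterministic: same input, same key
    datastore[new_key] = dict(entity)        # a rerun overwrites, never duplicates
    datastore.pop(old_key, None)             # deleting twice is harmless
    return new_key


old_key = (None, "Request", 42)
entity = {"title": "example"}
datastore[old_key] = entity

reparent(old_key, entity, "user-7")
reparent(old_key, entity, "user-7")  # simulated duplicate execution
```

After both calls the store holds exactly one entity, under `("user-7", "Request", 42)`.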

Google App Engine Go score counting and saving

I want to make a simple GAE app in Go that will let users vote and store their answers in two ways. The first is raw data (a Datastore record of "voted for X"); the second is a running count of those votes ("12 votes for X, 10 votes for Y"). What is an effective way to store both of those values with the app being accessed by multiple people at the same time? If one instance retrieves the data from the Datastore, changes it, and saves it back, another might want to do the same in parallel, and I'm not sure the final result will be correct.
It seems like a good way to do that is to simply store all vote events as separate entities (the "voted for X" way) and use the Task Queue for the recalculation (the "12 votes for X, 10 votes for Y" way), so the recalculation is done offline and sequentially (without any races and other concurrency issues). Then you'd have to put the recalc task every once in a while to the queue so the results are updated.
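The split between recording events and recalculating totals can be sketched as below. The question asks about Go; this sketch uses Python only to show the shape of the approach, with lists and dicts standing in for the Datastore and with hypothetical function names.

```python
from collections import Counter

vote_events = []  # stand-in for the raw "voted for X" entities
totals = {}       # stand-in for the stored running counts


def record_vote(option):
    """Request handler side: only append an event, never touch the totals."""
    vote_events.append(option)


def recalc_totals():
    """Task side: rebuild the totals from scratch from all recorded events."""
    totals.clear()
    totals.update(Counter(vote_events))


for choice in ["X", "Y", "X", "X", "Y"]:
    record_vote(choice)

recalc_totals()
recalc_totals()  # running the task again gives the same result
print(totals)    # prints {'X': 3, 'Y': 2}
```

Since the totals are always rebuilt from the events rather than incremented in place, concurrent voters never race on a shared counter.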
The Task Queue doesn't allow adding another task with the same name as an existing one, but it also doesn't allow checking whether a specific task is already enqueued. So simply trying to add a task with the same name each time may be enough to ensure that multiple recalc tasks are never queued at once.
Another way would be to use a goroutine waiting for a poke from an input channel in order to recalculate the results. I haven't run such goroutines on App Engine so I'm not sure of the general behavior of this approach.
