For my first mapreduce project, using Google App Engine, Python version, I implemented a simple counter:
def process(entity):
yield op.counters.Increment("counter1")
Partway through the run, I went over quota. When my quota gets reset tomorrow, will it pick up where it left off, and eventually produce the final result, or do I need enough quota do perform the whole task, without being interrupted in this way?
This is only practice. For my "real" mapreduce job, I'm going to be modifying each entry in my database table. Is there some good some way to save my table data, in case something goes awry?
Thanks in advance.
Mapreduce counters are stored in the datastore, so they will persist even if you pause the mapreduce for an extended period of time.
Likewise, modifications made in a mapreduce are executed in batches at regular intervals; changes you make are applied more or less immediately.
Related
Good day,
I am running a back-end to an application as an app engine (Java).
Using endpoints, I receive requests. The problem is, there is something big I need to compute, but I need fast response times for the front end. So as a solution I want to precompute something, and store it a dedicated the memcache.
The way I did this, is by adding in a static block, and then running a deferred task on the default queue. Is there a better way to have something calculated on startup?
Now, this deferred task performs a large amount of datastore operations. Sometimes, they time out. So I created a system where it retries on a timeout until it succeeds. However, when I start up the app engine, it immediately creates two of the deferred task. It also keeps retrying the tasks when they fail, despite the fact that I set DeferredTaskContext.setDoNotRetry(true);.
Honestly, the deferred tasks feel very finicky.
I just want to run a method that takes >5 minutes (probably longer as the data set grows). I want to run this method on startup, and afterwards on a regular basis. How would you model this? My first thought was a cron job but they are limited in time. I would need a cron job that runs a deferred task, hope they don't pile up somehow or spawn duplicates or start retrying.
Thanks for the help and good day.
Dries
Your datastore operations should never time out. You need to fix this - most likely, by using cursors and setting the right batch size for your large queries.
You can perform initialization of objects on instance startup - check if an object is available, if not - do the calculations.
Remember to store the results of your calculations in the datastore (in addition to Memcache) as Memcache is volatile. This way you don't have to recalculate everything a few seconds after the first calculation was completed if a Memcache object was dropped for any reason.
Deferred tasks can be scheduled to perform after a specified delay. So instead of using a cron job, you can create a task to be executed after 1 hour (for example). This task, when it completes its own calculations, can create another task to be excited after an hour, and so on.
I have a Google appengine application, written in Go, that has a cron process which runs once a day at 3am. This process looks at all of the changes that have happened to my data during the day and stores some meta data about what happened. My users can run reports on this meta data to see trends that have happened over several months. The process does around 10-20 million datastore writes every night. It all works just fine, but since I have started running it I have noticed a significant increase in my monthly bill from Google (from around $50/month to around $400/month).
I have just setup a very basic taskqueue that this runs in, I have not changed the default settings at all. Is there a better way that I could be running this process at night that could save me money? I have never messed around with the backends (which are now depreciated) or the modules api, and I know they've changed a lot of this stuff recently so I'm not sure where to start looking. Any advice would be greatly appreciated.
Look at your instances at 3am. It might be that GAE spins up a lot of them to handle the job. You could configure your job to make it run less paralel so it will take longer but perhaps it will need only 1 instance then.
However, if your database writes are indeed the biggest factor this won't make a big impact.
You can try looking at your data models and indexes. Remember that each indexed field costs 2 writes extra, so see if you can remove indexes from some fields if you don't need them.
One improvement that you can do is to batch your write operations, you can use memcache for this (pay the dedicated one since it's more reliable). Write the updates to memcache, once it's about 900K, flush it to datastore. This will reduces the number of write to datastore A LOT, especially if your metadata's size is small.
I'd like to make a Google App Engine app that sends a Facebook message to a user a fixed time (e.g. one day) after they click a button in the app. It's not scalable to use cron or the task queue for potentially millions of tiny jobs. I've also considered implementing my own queue using a background thread, but that's only available using the Backends API as far as I know, which is designed for much larger usage and is not free.
Is there a scalable way for a free Google App Engine app to execute a large number of small tasks after a fixed period of time?
For starters, if you're looking to do millions of tiny jobs, you're going to blow past the free quota very quickly, any way you look at it. The free quota's meant for testing.
It depends on the granularity of your tasks. If you're executing a lot of tasks once per day, cron hooked up to a mapreduce operation (which essentially sends out a bunch of tasks on task queues) works fine. You'll basically issue a datastore query to find the tasks that need to be run, and send them out on the mapreduce.
If you execute this task thousands of times a day (every minute), it may start getting expensive because you're issuing many queries. Note that if most of those queries return nothing, the cost is still minimal.
The other option is to store your tasks in memory rather than in the datastore, that's where you'd want to start using backends. But backends are expensive to maintain. Look into using Google Compute Engine, which gives much cheaper VMs.
EDIT:
If you go the cron/datastore route, you'd store a new entity whenever a user wants to send a deferred message. Most importantly, it'd have a queryable timestamp for when the message should be sent, probably rounded to the nearest minute or the nearest 5 minutes, whatever you decide your granularity should be.
You would then have a cron job that runs at the set interval, say every minute. On each run it would build a query for all the cron jobs it needs to send for the given minute.
If you really do have hundreds of thousands of messages to send each minute, you're not going to want to do it from the cron task. You'd want the cron task to spawn a mapreduce job that will fan out the query and spawn tasks to send your messages.
I'm trying to use the Mapreduce library to do a datastore schema migration on a live App Engine app.
My mapper uses the DatastoreKeyInputReader. Each map operation fetches the old entity using my old schema, does some data manipulation, and then writes the new entity (with a new Kind) under the new schema. I use NDB to get and put the entities transactionally, with the transaction option "force_writes" set to True.
Since the app is live, I want to block all other datastore writes during the migration except for the mapper's transactional writes. Unfortunately, while an individual map operation can run correctly with force_writes set to True in an app with writes disabled, I can't actually start the Mapreduce job - it triggers a CapabilityDisabledError because the Mapreduce library itself doesn't employ force_writes in its own bookkeeping of the jobs.
An obvious hack might be to start the job with writes enabled, and then quickly disable them once the job has started. However, this seems pretty risky to me, especially since my app makes several writes per second under normal conditions. Plus, it would probably break once the job ends unless writes were re-enabled just in time for cleanup.
So, how can I start a Mapreduce job when I need to also disable datastore writes? Seems like the cleanest thing would be for mapreduce.yaml to support a force_writes option for a job...
How long is the mapreduce process going to run for? If it's not a terribly long process (i.e not 3 days or something) you may want to simply add a default hanlder that routes all traffic to your site to a static HTML maintenance page and then run the mapreduce job. As soon it's completed you can redeploy with the default handler removed.
I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from at a regular schedule and without any user interaction. The total number of sites isn't going to static but they're all saved in the datastore [Table0] along side the interval at which they're read at. The interval may vary as regular as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs first item (URL) from datastore, executes HTTP request, operates on data, updates the URL record as having worked on it and the invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
You can see it in action one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.