I have read the docs about task queues and push queues in GAE, which are used to create long-running tasks.
What I don't understand is why there was a need for MapReduce. Since both do their processing in the background, what are the main differences between them?
Can someone please explain this?
Edit: I guess I was comparing apples with monkeys! Hadoop and MapReduce are related, and GAE is a backend framework.
You are confusing two entirely different things.
The MapReduce paradigm is all about distributed, parallel processing of very large amounts of data.
A TaskQueue is a scheduler; it can schedule a task to execute at, say, a certain time. It is just a scheduler, like a Unix cron job.
Note the key terms in the two statements above (distributed parallel processing of data vs. a scheduler of tasks) to see the difference.
From the definition of task queues:
Task queues let applications perform work, called tasks, asynchronously outside of a user request. If an app needs to execute work in the background, it adds tasks to task queues. The tasks are executed later, by worker services.
By definition, task queues work outside of a user request; that means no user request is waiting on the task (it was simply submitted/scheduled at some point in the past). MapReduce programs are submitted by users to execute, though you may use a task queue to schedule one to run in the future.
You are probably getting confused because words like task, queue, and scheduling are also used in the MapReduce world. Those concepts in MapReduce may bear some similarity, since these are generic terms, but they are definitely not the same thing.
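For concreteness, here is a minimal sketch of scheduling work with the GAE task queue API in Python (the /process_data handler and its payload are hypothetical):

```python
from google.appengine.api import taskqueue

# Enqueue a task that a worker will execute later, outside any user request.
taskqueue.add(
    url='/process_data',                # hypothetical handler that does the work
    params={'key': 'some-entity-key'},  # hypothetical payload
    countdown=3600,                     # run roughly one hour from now
)
```

Nothing here distributes a computation over a cluster; it only defers a single unit of work, which is the whole difference from MapReduce.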
My queue task uses urlfetch to get some data from an external API and saves it to ndb Datastore entities.
This takes about 15 seconds total.
Somehow, when the task runs, all other handlers (simple JSON response handlers) become slower. (Slower means +500 ms.)
What could be causing this?
Isn't the idea of background tasks that they don't affect the user-facing requests?
I stumbled upon this blog post, but my task takes longer than 1 second to complete, so I don't see how that's going to help me.
By default, your tasks are executed by the same instances that serve user requests. Background or not, they share the same CPU, memory, and bandwidth. It's a good idea to run these tasks on a different module, which means a different set of instances. You can do that by specifying a target for your task queue.
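A minimal sketch, assuming a module named 'background' has already been deployed (the module name and worker URL are hypothetical):

```python
from google.appengine.api import taskqueue

# Route the task to instances of the 'background' module instead of the
# default module that serves user traffic.
taskqueue.add(
    url='/tasks/fetch_external_api',  # hypothetical worker handler
    target='background',              # module whose instances run this task
)
```

The target can also be set per-queue in queue.yaml, so that every task on that queue is routed to the background module.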
Note that the automatic App Engine scheduler will typically spin up a new instance when responses from your current instances slow down. However, the slowdown in your case is caused not by a growing volume of standard requests, but by an unusual request that takes much longer, which prevents the automatic scheduler from reacting to the increased latencies. You can switch to manual or basic scaling, which gives you more control over capacity (the total number of instances) and the rules for spinning up new instances, but creating a separate module for background tasks is the better solution.
I'd like to make a Google App Engine app that sends a Facebook message to a user a fixed time (e.g. one day) after they click a button in the app. It's not scalable to use cron or the task queue for potentially millions of tiny jobs. I've also considered implementing my own queue using a background thread, but that's only available using the Backends API as far as I know, which is designed for much larger usage and is not free.
Is there a scalable way for a free Google App Engine app to execute a large number of small tasks after a fixed period of time?
For starters, if you're looking to do millions of tiny jobs, you're going to blow past the free quota very quickly, any way you look at it. The free quota's meant for testing.
It depends on the granularity of your tasks. If you're executing a lot of tasks once per day, cron hooked up to a mapreduce operation (which essentially sends out a bunch of tasks on task queues) works fine. You'll basically issue a datastore query to find the tasks that need to be run, and send them out on the mapreduce.
If you execute this task thousands of times a day (every minute), it may start getting expensive because you're issuing many queries. Note that if most of those queries return nothing, the cost is still minimal.
The other option is to store your tasks in memory rather than in the datastore; that's where you'd want to start using backends. But backends are expensive to maintain. Look into using Google Compute Engine, which gives you much cheaper VMs.
EDIT:
If you go the cron/datastore route, you'd store a new entity whenever a user wants to send a deferred message. Most importantly, it'd have a queryable timestamp for when the message should be sent, probably rounded to the nearest minute or the nearest 5 minutes, whatever you decide your granularity should be.
You would then have a cron job that runs at the set interval, say every minute. On each run it would build a query for all the messages it needs to send for the given minute.
If you really do have hundreds of thousands of messages to send each minute, you're not going to want to do it from the cron task. You'd want the cron task to spawn a mapreduce job that will fan out the query and spawn tasks to send your messages.
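A minimal sketch of that route, under assumed names (the DeferredMessage model and the /tasks/send_one and /cron/send_due handlers are hypothetical):

```python
import datetime
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class DeferredMessage(ndb.Model):
    """One message to deliver later; send_at is the queryable timestamp."""
    user_id = ndb.StringProperty()
    send_at = ndb.DateTimeProperty()  # rounded to your chosen granularity
    sent = ndb.BooleanProperty(default=False)

class SendDueMessages(webapp2.RequestHandler):
    """Cron handler: runs every minute and fans each due message out to a task."""
    def get(self):
        now = datetime.datetime.utcnow()
        due = DeferredMessage.query(DeferredMessage.sent == False,
                                    DeferredMessage.send_at <= now)
        for msg in due:
            taskqueue.add(url='/tasks/send_one',
                          params={'key': msg.key.urlsafe()})

app = webapp2.WSGIApplication([('/cron/send_due', SendDueMessages)])
```

At the volumes discussed above you'd swap the for-loop for a mapreduce fan-out, but the entity and the cron query stay the same.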
Is there a way to determine when a set of Google App Engine tasks (and the child tasks they spawn) have all completed?
Let's say that I have 100 tasks to execute and 10 of those spawn 10 child tasks each. That's 200 tasks. Let's also say that those child tasks might spawn more tasks, recursively, and so on.
Is there a way to determine when all tasks have completed? I tried using the app engine pipeline API, but it doesn't look like it's going to work out for my particular use case, even though it is a great API.
My use case is that I want to make a whole bunch of rate limited URL fetch calls while concurrently writing to a blob. At the end of all the URL fetch calls, I want to finalize the blob.
I found a solution with the pipeline API, but it does so much writing to the datastore that it wouldn't be cost-effective for me, given how often I need to run the pipeline.
There's no way around writing to a persistent storage medium of some sort, and the datastore is the only game in town. You could write your own server to track completions using a backend, but that's an awful lot of overhead for a simple task. Using the pipeline API is your best bet.
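If you did want to roll your own tracking, the usual shape is a transactionally updated counter in the datastore (all names here are hypothetical): every task bumps the counter for each child it spawns, decrements it when it finishes, and whichever task drives it to zero finalizes the blob.

```python
from google.appengine.ext import ndb

class JobCounter(ndb.Model):
    """Tracks how many tasks in one job are still outstanding."""
    outstanding = ndb.IntegerProperty(default=0)

@ndb.transactional
def task_spawned(job_key):
    # Call once for every task or child task that gets enqueued.
    counter = job_key.get()
    counter.outstanding += 1
    counter.put()

@ndb.transactional
def task_finished(job_key):
    # Call when a task completes. Returns True for the last finisher,
    # which should then finalize the blob.
    counter = job_key.get()
    counter.outstanding -= 1
    counter.put()
    return counter.outstanding == 0
```

Each child must be counted before its parent reports itself finished (e.g. by enqueuing children with transactional tasks), or the counter can hit zero early. This is essentially a hand-rolled slice of what the pipeline API does for you.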
I'm trying to construct a non-trivial GAE app, and I'm not sure whether a cron job, tasks, backends, or a mix of all three is what I need to use, given the request timeout limit that GAE places on HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from on a regular schedule and without any user interaction. The total number of sites isn't going to be static, but they're all saved in the datastore [Table0] alongside the interval at which they're read. The interval may vary from as often as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, I parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (it'll have a special flag), I need to issue an additional HTTP request to a 3rd-party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info in another table [Table2] in the datastore.
4) As soon as the data from step #3 is available and ready, I need to take all of it, perform some additional transformations, and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on web requests in GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 minutes to complete them, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
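A rough sketch of how those steps could map onto the Pipeline API, assuming the appengine-pipeline library is available; all class names and the placeholder bodies are hypothetical:

```python
import pipeline  # appengine-pipeline library (assumed installed)

class FetchSite(pipeline.Pipeline):
    def run(self, site_id):
        # Step 2: HTTP GET the site, parse the result, save to Table1.
        return 'table1-key-for-%s' % site_id  # placeholder return value

class EnrichData(pipeline.Pipeline):
    def run(self, table1_key):
        # Step 3: call the 3rd-party site, save the result to Table2.
        return 'table2-key-for-%s' % table1_key  # placeholder return value

class TransformAndUpdate(pipeline.Pipeline):
    def run(self, table2_key):
        # Step 4: transform and update the original Table1 entity.
        pass

class ProcessSite(pipeline.Pipeline):
    """Chains steps 2-4 for one site; each yield waits on the previous stage."""
    def run(self, site_id):
        raw = yield FetchSite(site_id)
        enriched = yield EnrichData(raw)
        yield TransformAndUpdate(enriched)
```

A cron handler would then perform step 1: query Table0 for due sites and call ProcessSite(site_id).start() for each. Every stage runs within the task queue deadlines, and the framework handles retries and ordering.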
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing, etc. Until recently, this library only had a mapper implementation, but Google has now also released an in-memory shuffler that is good for jobs that fit in about 100MB (a minimal mapper sketch follows after this list).
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks; which context you execute them in is up to you.
It might be a good idea to design your backend tasks so that they can also be scheduled (manually, or perhaps by querying your current quota usage) in the "frontend" context using task queues, if you have spare frontend CPU cycles.
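For the Mapreduce API item above, a minimal mapper sketch using the GAE mapreduce library; the entity and its processed flag are hypothetical, and the mapper would be wired to a datastore input reader in mapreduce.yaml:

```python
from mapreduce import operation as op

def touch_entity(entity):
    """Mapper: called once per entity by the datastore input reader."""
    entity.processed = True   # hypothetical flag on a hypothetical model
    yield op.db.Put(entity)   # the framework batches these datastore writes
```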
I abandoned GAE before backends came out, so I can't comment on those. But what I did a few times was:
Cron scheduled to kick off the process
The cron handler invokes a task URL
The task grabs the first item (URL) from the datastore, executes the HTTP request, operates on the data, updates the URL record as having been worked on, and then invokes the task URL again.
So cron basically wakes up the task queue periodically, and the task queue runs recursively until it reaches some stopping point.
You can see it in action in one of my public GAE apps: https://github.com/mavenn/watchbots-gae-python.
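A minimal sketch of that chaining pattern (the handler path and PageRecord model are hypothetical):

```python
import webapp2
from google.appengine.api import taskqueue, urlfetch
from google.appengine.ext import ndb

class PageRecord(ndb.Model):
    url = ndb.StringProperty()
    done = ndb.BooleanProperty(default=False)

class ChainWorker(webapp2.RequestHandler):
    """Processes one record per invocation, then re-enqueues itself."""
    def get(self):
        record = PageRecord.query(PageRecord.done == False).get()
        if record is None:
            return  # stopping point: nothing left until cron starts a new chain
        result = urlfetch.fetch(record.url)
        # ... operate on result.content here ...
        record.done = True
        record.put()
        taskqueue.add(url='/worker/chain', method='GET')  # invoke itself again

app = webapp2.WSGIApplication([('/worker/chain', ChainWorker)])
```

Each link in the chain stays well under the task deadline because it handles only one URL per invocation.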
Let's say I have thousands of jobs to perform repeatedly; how would you propose I architect my system on Google App Engine?
I need to be able to add more jobs while effectively scaling the system. Scheduled tasks are of course part of the solution, as are task queues, but I am looking for more insight into how to best utilize these resources.
NOTE: There are no dependencies between "jobs".
Based on what little description you've provided, it's hard to say. You probably want to use the task queue, and maybe the deferred library if you're using Python. All that's required to use these is to call the API to enqueue a task.
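With the deferred library, enqueuing work is essentially a one-liner (do_job here is a hypothetical function):

```python
from google.appengine.ext import deferred

def do_job(job_id):
    # Hypothetical unit of work; runs later on a task queue instance.
    pass

# Serializes the call and enqueues it as a task, to run ~60 seconds from now.
deferred.defer(do_job, 'job-123', _countdown=60)
```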
If you're talking about having many repeating tasks, you have a couple of options:
Start off the first task on the task queue manually, and use 'chaining' to have each invocation queue the next one after the appropriate interval.
Store each schedule in the datastore. Have a cron job regularly scan for any tasks that have reached their ETA; fire off a task queue task for each, updating the ETA for the next run.
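A minimal sketch of the second option (the JobSchedule model and /tasks/run_job handler are hypothetical); the cron handler fires each due job and pushes its ETA forward by its own interval:

```python
import datetime
import webapp2
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class JobSchedule(ndb.Model):
    job_name = ndb.StringProperty()
    interval_secs = ndb.IntegerProperty()
    eta = ndb.DateTimeProperty()  # when this job should next run

class RunDueJobs(webapp2.RequestHandler):
    """Cron handler: fire every schedule whose ETA has passed."""
    def get(self):
        now = datetime.datetime.utcnow()
        for sched in JobSchedule.query(JobSchedule.eta <= now):
            taskqueue.add(url='/tasks/run_job',
                          params={'name': sched.job_name})
            sched.eta = now + datetime.timedelta(
                seconds=sched.interval_secs)
            sched.put()

app = webapp2.WSGIApplication([('/cron/run_due', RunDueJobs)])
```

Adding more jobs is then just adding more JobSchedule entities, which works well here since the jobs have no dependencies between them.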
I think you could use Cron Jobs.
Regards.