Strategy for interleaving jobs on AppEngine? - google-app-engine

Let's say I have thousands of jobs to perform repeatedly. How would you propose I architect my system on Google AppEngine?
I need to be able to add more jobs while the system scales effectively. Scheduled Tasks and Task Queues are part of the solution, of course, but I am looking for more insight into how best to use these resources.
NOTE: There are no dependencies between "jobs".

Based on what little description you've provided, it's hard to say. You probably want to use the Task Queue, and maybe the deferred library if you're using Python. All that's required to use these is to use the API to enqueue a task.
If you're talking about having many repeating tasks, you have a couple of options:
Start off the first task on the task queue manually, and use 'chaining' to have each invocation queue the next one with the appropriate interval.
Store each schedule in the datastore. Have a cron job regularly scan for any tasks that have reached their ETA; fire off a task queue task for each, updating the ETA for the next run.
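The second option (schedules in the datastore, scanned by a cron job) could be sketched roughly as follows. This is a minimal, library-free illustration rather than App Engine code: `JobSchedule` is a hypothetical stand-in for a datastore entity, and `enqueue` stands in for whatever call adds a task to the task queue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobSchedule:
    """Hypothetical stand-in for a datastore entity describing one repeating job."""
    name: str
    interval: timedelta
    next_run: datetime


def dispatch_due_jobs(schedules, now, enqueue):
    """Body of the periodic cron handler: enqueue each due job and advance its ETA."""
    dispatched = []
    for job in schedules:
        if job.next_run <= now:
            enqueue(job.name)  # assumption: in a real app this adds a task queue task
            # Advance from the scheduled time and skip any missed slots, so a late
            # cron run causes neither drift nor a burst of catch-up runs.
            while job.next_run <= now:
                job.next_run += job.interval
            dispatched.append(job.name)
    return dispatched


# Usage: one job is overdue by a minute, the other is not due for three hours.
now = datetime(2024, 1, 1, 12, 0)
schedules = [
    JobSchedule("refresh-feed", timedelta(minutes=5), now - timedelta(minutes=1)),
    JobSchedule("send-digest", timedelta(hours=24), now + timedelta(hours=3)),
]
queue = []
dispatch_due_jobs(schedules, now, queue.append)
```

Because the jobs are independent, adding a job is just adding another schedule record; the cron handler itself does no work beyond enqueueing.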

I think you could use Cron Jobs.
Regards.

Related

Difference between TaskQueue and MapReduce in Google App Engine

I have read the docs about Task Queues and push queues in GAE, which are used to run long-running tasks.
I don't understand why MapReduce is needed as well. Both do their processing in the background, so what are the main differences between them?
Can someone please explain this?
Edit: I guess I was comparing apples with monkeys! Hadoop and MapReduce are related, and GAE is a backend framework.
You are confusing two entirely different things.
The MapReduce paradigm is all about distributed parallel processing of very large amounts of data.
TaskQueue is a scheduler: it can schedule a task to execute at, say, a certain time. It is just a scheduler, like Unix cron jobs.
Note the key terms in the statements above (distributed parallel processing versus scheduler) to see the difference.
From the definition of TaskQueue:
Task queues let applications perform work, called tasks,
asynchronously outside of a user request. If an app needs to execute
work in the background, it adds tasks to task queues. The tasks are
executed later, by worker services.
By definition, TaskQueue work happens outside of a user request: there is no live user request behind the task (it was simply submitted or scheduled at some point in the past). MapReduce programs are submitted by users for execution, though you may use TaskQueue to schedule one for the future.
You are probably confused by words like task, queue, and scheduling, which also appear in the MapReduce world. These are generic terms, so the concepts may look similar, but they are definitely not the same.

Google App Engine - Push Taskqueue - Limit to Countdown?

I am setting up push task queue on my Google App Engine App with a countdown parameter so it will execute at some point in the future.
However, my countdown parameter can be very large in seconds, for instance months or even a year in the future. Just want to make sure this will not cause any problems or overhead cost? Maybe there is a more efficient way to do this?
It probably would work, but it seems like a bad idea. What do you do if you change your task processing code? You can't modify a task in the queue. You'd somehow have to keep track of the tasks, delete the old ones and replace them with new ones that work with your updated code.
Instead, store information about the tasks in the data store. Run a cron job once a day or once a week, process the info in the data store, and launch the tasks as needed. You can still use a countdown if you need a precise execution date and time.
The current limit in Task Queues is 30 days, and we don't have plans to raise that substantially.
Writing scheduled operations to datastore and running a daily cron job to inject that day's tasks is a good strategy. That would allow you to update the semantics as your product evolves.
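A daily cron job of this kind mostly needs to pick out the operations that fall due before its next run and work out a countdown for each. The sketch below shows only that selection step in plain Python; the one-day cron period and the `(task_id, run_at)` record shape are assumptions made for illustration, not anything prescribed by the Task Queue API.

```python
from datetime import datetime, timedelta

CRON_PERIOD = timedelta(days=1)  # assumption: the cron job runs once a day


def countdowns_for_today(scheduled, now):
    """Return (task_id, countdown_seconds) for operations due before the next cron run.

    `scheduled` is an iterable of (task_id, run_at) pairs, a hypothetical
    stand-in for records read from the datastore.
    """
    horizon = now + CRON_PERIOD
    result = []
    for task_id, run_at in scheduled:
        if run_at < horizon:
            # Overdue operations get a countdown of 0 and run immediately.
            seconds = max(0, int((run_at - now).total_seconds()))
            result.append((task_id, seconds))
    return result


# Usage: only the operation two hours away is injected today.
now = datetime(2024, 1, 1)
scheduled = [
    ("renewal-reminder", now + timedelta(hours=2)),
    ("yearly-report", now + timedelta(days=300)),
]
todays_tasks = countdowns_for_today(scheduled, now)
```

Every countdown produced this way stays under one day, well inside the 30-day limit. A real handler would also need to mark each injected record so the next cron run does not enqueue it again.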

Custom Metrics cron job Datastore timeout

I have written code that writes data to custom metrics in Cloud Monitoring from Google App Engine.
To do that, I store the data in the datastore for some amount of time, say 15 minutes; a cron job then runs, reads the data from there, and plots it on the Cloud Monitoring dashboard.
My problem is that the cron job may time out while fetching a large amount of data from the datastore. I would also like to know what happens when a cron job fails.
Can it also fail if the number of records is high? If so, what alternatives do we have? How many records can a cron job safely process within the 10-minute timeout?
Please let me know if any other info is needed.
Thanks!
You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
A failed cron job is not retried. You need to implement a retry mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks to process data as necessary and add them to the queue. In this case you don't have to redo the whole job - or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly than one or a few tasks that take minutes. This way you minimize wasted resources if a task fails, and you have a smaller impact on other processes running on the same instance.
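The fan-out idea can be illustrated without any App Engine imports: the cron handler only splits the work into small batches and enqueues one task per batch. In this sketch `enqueue` is a hypothetical callback standing in for the deferred library or a task queue call, and the batch size is an arbitrary example value.

```python
def split_into_batches(keys, batch_size):
    """Split a list of record keys into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def fan_out(keys, batch_size, enqueue):
    """Enqueue one small task per batch; return the number of tasks created."""
    batches = split_into_batches(keys, batch_size)
    for batch in batches:
        enqueue(batch)  # assumption: a real app would defer a worker call here
    return len(batches)


# Usage: ten record keys become three tasks of at most four records each.
pending = []
task_count = fan_out(list(range(10)), 4, pending.append)
```

If one batch fails, only that batch is retried, which is the property the answer relies on.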

Can I block on a Google AppEngine Pull Task Queue until a Task is available?

Can I block on a Google AppEngine Pull Task Queue until a Task is available? Or, do I need to poll an empty queue until a task is available?
You need to poll the queue. A typical use case for pull queues is to have multiple backends, each obtaining one thousand tasks at a time.
For use cases where there are no tasks in the queue for hours at a time, push queues can be a better fit.
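Since the queue has to be polled, a worker usually backs off while the queue is empty so that it does not spin. The loop below is a generic sketch of that pattern, not pull queue API code: `lease` and `handle` are hypothetical callables supplied by the caller, and the delay values are arbitrary.

```python
import time


def poll_with_backoff(lease, handle, max_polls,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Poll `lease()` up to max_polls times, doubling the wait after each empty poll.

    `lease` returns a (possibly empty) list of tasks and `handle` processes one
    task. Returns the number of tasks handled.
    """
    delay = base_delay
    processed = 0
    for _ in range(max_polls):
        tasks = lease()
        if tasks:
            for task in tasks:
                handle(task)
            processed += len(tasks)
            delay = base_delay  # work arrived, so poll eagerly again
        else:
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return processed


# Usage with a fake queue that is empty for the first two polls.
responses = [[], [], ["task-a", "task-b"], []]
handled = []
waits = []
poll_with_backoff(lambda: responses.pop(0), handled.append, 4, sleep=waits.append)
```

Passing `sleep` as a parameter keeps the loop testable; in a real worker the default `time.sleep` would be used.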
Not 100% sure about your question, but I thought I'd try an answer. Having a pull task queue drained by a cron job may apply, and it saves the expense of running a backend. I have client-side log data that needs to be serialized and stored. The online handler simply passes the client data to a pull queue. Cron fires up the processing task every minute, and up to 10k log items get serialized and stored on each run. (Change the settings according to your loads; these more than meet my modest needs.) In this case the queue acts as a buffer, and load spikes get spread across even processing units. Obviously this is not useful if you want quick access to the queued data or have wildly unpredictable loads. Very importantly, the log data serialization cuts data writes by a factor of 1,000. This may not apply to your question, so I'll end with a big HTH. -stevep

How to logically organize recurring tasks?

What's the best way to create recurring tasks?
Should I create some special syntax and parse it, similar to cron jobs on Linux, or should I rather just use a cron job that runs every hour and keeps creating the next instances of those recurring tasks?
Keep in mind that there can be both endless recurring tasks and tasks with an end date.
Quartz is an open source job scheduling system that uses cron expressions to control the periodicity of the job executions.
My approach is always "minimum effort for maximum effect" (or best bang per buck).
If it can be done with cron, why not use cron? I'd consider it wasted effort to re-implement cron just for the fun of it so, unless you really need features that cron doesn't have, stick with it.
