Scheduling cron jobs - google-app-engine

I want to develop an app where a user can register for multiple alerts, so that whenever a fare drops below some threshold, they get a notification. Fares are fetched from a third-party website. I want to do this on Google App Engine.
From what I understand, I need a process running 24/7 that checks the fares at intervals of, say, 30 minutes and sends out a notification whenever a fare drops below the threshold. Can App Engine's cron service be used for this task? At most 100 cron jobs can be scheduled, so what would be a better way to do this? Also, having a process for each user would waste resources; what scheduling approach would be more efficient?

You want to schedule a single cron that runs every 30 minutes and throws an item onto a task queue. That single item on the task queue would then be able to go through all your users, and generate tasks to fetch whatever you need in the background again. Two important things:
You want the initial cron call to return as quickly as possible, as URLs have a 60 second deadline.
Split up any work into separate task queue tasks to achieve the above, and use them to iterate through data sources and/or users as well.
Based on what you're explaining, you can use push task queues: https://cloud.google.com/appengine/docs/python/taskqueue/overview-push
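A minimal Python sketch of the fan-out described above, with plain functions standing in for the App Engine task queue. The alert tuples, the `fetch_fare` callback, and all function names are hypothetical. Grouping alerts by route is an assumption about how to avoid fetching the same fare once per user; it is not something the answer spells out.

```python
from collections import defaultdict


def group_alerts_by_route(alerts):
    """Group (user, route, threshold) alerts so each route is fetched once."""
    by_route = defaultdict(list)
    for user, route, threshold in alerts:
        by_route[route].append((user, threshold))
    return dict(by_route)


def users_to_notify(subscribers, fare):
    """Return the users whose threshold is above the current fare."""
    return [user for user, threshold in subscribers if fare < threshold]


def run_cycle(alerts, fetch_fare):
    """One 30-minute cycle: one unit of work per route, not per user."""
    notifications = []
    for route, subscribers in group_alerts_by_route(alerts).items():
        # On App Engine, each route would likely be handled by its own task.
        fare = fetch_fare(route)
        for user in users_to_notify(subscribers, fare):
            notifications.append((user, route, fare))
    return notifications
```

With this shape, the single cron entry only has to start the cycle, so the 100-job cron limit is not tied to the number of users.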

Related

Continuously running service in Google Cloud Engine

I am trying to figure out how to run a service (1) when it does not receive any calls.
I want to use a microservices architecture.
Basically, I want to run service (1) while the other service (2) is receiving all the calls and data.
Since service (1) is not receiving calls, it would not have to spawn new instances, and I would want only service (2) to scale.
I have noticed that jobs can be scheduled with cron.yaml, but the number of calls is limited.
I need service (1) to be active every minute while service (2) is active.
It's hard to give a good answer without knowing more about what service (1) has to do when it is 'active'. It sounds like you want cron to launch a task every minute.
You can use cron in conjunction with push queues: https://cloud.google.com/appengine/docs/standard/go/taskqueue/push/
When creating a push queue task, you can set the property delay before adding it to the queue: https://cloud.google.com/appengine/docs/standard/go/taskqueue/reference#Task
(In Python it is called countdown: https://cloud.google.com/appengine/docs/standard/python/refdocs/google.appengine.api.taskqueue.taskqueue#google.appengine.api.taskqueue.taskqueue.add)
You could have a cron job that fires every 24 hours. That cron job would load up your push queue with tasks whose delays are staggered: the delay of the first one is 1 minute, the delay of the second one is 2 minutes, and so on.
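As a rough illustration of the staggering, the Python sketch below only computes the delay for each task; actually enqueueing them is left out. The function name and its parameters are hypothetical. On App Engine, each value would presumably be passed as the task's delay (the `countdown` argument in Python).

```python
def staggered_countdowns(period_seconds=24 * 60 * 60, interval_seconds=60):
    """Return one delay (in seconds) per interval within the period."""
    count = period_seconds // interval_seconds
    # The first task runs after one interval, the second after two, and so on.
    return [interval_seconds * (i + 1) for i in range(count)]
```

For a 24-hour period and a 1-minute interval this yields 1440 delays, so the daily cron job enqueues 1440 tasks in one go.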

Custom Metrics cron job Datastore timeout

I have written code that writes data to custom metrics in Cloud Monitoring from Google App Engine.
For that, I store the data in the datastore for some amount of time, say 15 minutes; then a cron job runs, reads the data from there, and plots it on the Cloud Monitoring dashboard.
My problem is that the cron job may time out while fetching a large amount of data from the datastore. I would also like to know what happens when a cron job fails.
Can it fail if the number of records is high? If so, what alternatives are there? How many records can a cron job safely process within the 10-minute timeout?
Please let me know if any other info is needed.
Thanks!
You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
A cron job is not retried. You need to implement this mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks to process data as necessary and add them to the queue. In this case you don't have to redo the whole job - or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly than one or a few tasks that take minutes. This way you minimize wasted resources if a task fails, and have a smaller impact on other processes running on the same instance.
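The idea of many small, individually retried tasks can be sketched in plain Python as follows. This only simulates the behaviour; the function names, the batch size, and the `max_attempts` limit are hypothetical and do not reflect the task queue's actual retry configuration.

```python
def split_into_tasks(records, batch_size):
    """Split records into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def run_with_retries(tasks, handler, max_attempts=3):
    """Run each task, retrying only the ones that raise.

    Returns the number of attempts each task needed.
    """
    attempts = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            try:
                handler(task)
                attempts.append(attempt)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
    return attempts
```

When one batch fails, only that batch is repeated; the batches that already succeeded are not redone.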

GAE - Execute many small tasks after a fixed time

I'd like to make a Google App Engine app that sends a Facebook message to a user a fixed time (e.g. one day) after they click a button in the app. It's not scalable to use cron or the task queue for potentially millions of tiny jobs. I've also considered implementing my own queue using a background thread, but that's only available using the Backends API as far as I know, which is designed for much larger usage and is not free.
Is there a scalable way for a free Google App Engine app to execute a large number of small tasks after a fixed period of time?
For starters, if you're looking to do millions of tiny jobs, you're going to blow past the free quota very quickly, any way you look at it. The free quota's meant for testing.
It depends on the granularity of your tasks. If you're executing a lot of tasks once per day, cron hooked up to a mapreduce operation (which essentially sends out a bunch of tasks on task queues) works fine. You'll basically issue a datastore query to find the tasks that need to be run, and send them out on the mapreduce.
If you execute this task thousands of times a day (every minute), it may start getting expensive because you're issuing many queries. Note that if most of those queries return nothing, the cost is still minimal.
The other option is to store your tasks in memory rather than in the datastore, that's where you'd want to start using backends. But backends are expensive to maintain. Look into using Google Compute Engine, which gives much cheaper VMs.
EDIT:
If you go the cron/datastore route, you'd store a new entity whenever a user wants to send a deferred message. Most importantly, it'd have a queryable timestamp for when the message should be sent, probably rounded to the nearest minute or the nearest 5 minutes, whatever you decide your granularity should be.
You would then have a cron job that runs at the set interval, say every minute. On each run it would build a query for all the messages it needs to send for the given minute.
If you really do have hundreds of thousands of messages to send each minute, you're not going to want to do it from the cron task. You'd want the cron task to spawn a mapreduce job that will fan out the query and spawn tasks to send your messages.
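A small Python sketch of the timestamp bucketing, assuming naive `datetime` values and a 5-minute granularity. The function names and the `send_at` field are hypothetical, and the list filter only stands in for the datastore query.

```python
from datetime import datetime, timedelta


def round_to_bucket(when, minutes=5):
    """Round a datetime to the nearest multiple of `minutes` within its day."""
    bucket = minutes * 60
    midnight = when.replace(hour=0, minute=0, second=0, microsecond=0)
    seconds = (when - midnight).total_seconds()
    rounded = int((seconds + bucket / 2) // bucket) * bucket
    return midnight + timedelta(seconds=rounded)


def due_messages(messages, bucket_time):
    """Return the messages whose stored send time equals the current bucket."""
    return [m for m in messages if m["send_at"] == bucket_time]
```

Because every stored timestamp already sits on a bucket boundary, each cron run can use an equality filter on the current bucket instead of a range scan.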

How to create X tasks as fast as possible on Google App Engine

We push out alerts from GAE, and let's say we need to push out 50,000 alerts to C2DM (Cloud to Device Messaging). For this we:
Read everyone who wants alerts from the datastore
Loop through and create a "push task" for each notification
The problem is that creating the tasks takes some time, so this doesn't scale when the user base grows. In my experience, we spend 20-30 seconds just creating the tasks when there are a lot of them. The reason for one task per push message is so that we can retry the task if something fails, and it will only affect a single subscriber. Also, C2DM only supports sending to one user at a time.
Will it be faster if we:
Read everyone who wants alerts from the datastore
Loop through and create a "pool task" for each 100 subscribers
Each "Pool task" will generate 100 "push tasks" when they execute
The task execution is very fast so in our scenario it seems like the creation of the tasks is the bottleneck and not the execution of the tasks. That's why I thought about this scenario to be able to increase the parallelism of the application. I would guess this would lead to faster execution but then again I may be all wrong :-)
We do something similar with APNS (Apple Push Notification service): we create a task for a batch of notifications at a time (a "pool task", as you call it). When the task executes, we iterate over the batch and send it to the push server.
The difference from your setup is that we have a separate server for communicating with the push service, as APNS only supports socket communication.
The only downside is that if there is an error, the whole task is repeated and some users might get two notifications.
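The two-level fan-out from the question can be simulated in plain Python as below. A `deque` stands in for the task queue, and the task tuples and function names are hypothetical. The point is only that the initial request enqueues one task per 100 subscribers instead of one per subscriber.

```python
from collections import deque


def enqueue_pool_tasks(queue, subscriber_ids, pool_size=100):
    """Enqueue one pool task per pool_size subscribers; return the task count."""
    count = 0
    for start in range(0, len(subscriber_ids), pool_size):
        queue.append(("pool", subscriber_ids[start:start + pool_size]))
        count += 1
    return count


def run_queue(queue):
    """Drain the queue: pool tasks expand into push tasks, push tasks 'send'."""
    sent = []
    while queue:
        kind, payload = queue.popleft()
        if kind == "pool":
            # Each pool task creates the per-subscriber push tasks.
            for subscriber_id in payload:
                queue.append(("push", subscriber_id))
        else:
            sent.append(payload)
    return sent
```

For 50,000 subscribers the initial loop would enqueue 500 pool tasks, and the remaining task creation is spread across the instances that execute them.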
This sounds like it varies based on the number of alerts you need to send out, how long it takes to send each alert, and the number of active instances you have running.
My guess is that it takes a few milliseconds to tens of milliseconds to send out a C2DM alert, while it takes a few seconds for an instance to spin up, so you can probably issue a few hundred or a few thousand alerts before justifying another task instance. The ratio of the time it takes to send each C2DM message to the time it takes to launch an instance will dictate how many messages you'd want to send per task.
If you already have a fair number of instances running though, you don't have the delay of waiting for instances to spin up.
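The ratio argument can be written as a one-line estimate. The Python function below is a hypothetical helper, and the figures in the usage note are illustrative guesses in the same range as the answer, not measurements.

```python
def min_messages_per_task(spin_up_ms, send_ms):
    """Messages a task can send in the time a new instance takes to start."""
    if send_ms <= 0:
        raise ValueError("send_ms must be positive")
    return max(1, spin_up_ms // send_ms)
```

Assuming a 3-second spin-up and 10 ms per message, `min_messages_per_task(3000, 10)` gives 300, which matches the "few hundred" estimate.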
BTW, this seems almost like a perfect application of the MapReduce API. It mostly does what you describe in the second version, except it takes your initial query, and breaks that up into subqueries that each return a "page" of the result set. A task is launched for each subquery which processes all the items in its "page". This is an improvement from what you describe, because you don't need to spend the time looping through your initial result set.
I believe the default implementation for the MapReduce API just queries for all entities of a particular kind (i.e., all User objects), but you can change the filter used.

How Google App Engine Java Task Queues can be used for mass scheduling for users?

I am focusing on GAE-J to develop a Java web application.
I have a scenario where a user creates a schedule for a set of reminders, and I have to send emails at the specified dates and times.
I cannot create threads on GAE, so my solution is Task Queues.
Can I achieve this functionality with Task Queues? The user will create tasks, and App Engine will execute them at a specific date and time.
Thanks
Although using the task queue directly, as Chris suggests, will work, for longer reminder periods (e.g., 30+ days) and in cases where the reminder might be modified, a more indirect approach is probably wise.
What I would recommend is storing reminders in the datastore, and then taking one of a few approaches, depending on your requirements:
Run a regular cron job (say, hourly) that fetches a list of reminders coming up in the next interval, and schedules task queue tasks for each.
Have a single task that you schedule to be run at the time the next reminder (system-wide) is due, which sends out the reminder(s) and then enqueues a new task for the next reminder that's due.
Run a backend, as Chris suggests, which regularly scans the datastore for upcoming reminders.
In all the above cases, you'll probably need some special case code for when a user sets a reminder in less than the minimum polling interval you've set - probably enqueuing a task directly. You'll also want to consider batching up the sending of reminders, to minimize tasks and wallclock time consumed.
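The first approach (a regular cron job that schedules tasks for the upcoming interval) might look roughly like this. The sketch is in Python rather than Java for brevity; the `(reminder_id, due)` tuples and the function name are hypothetical, and the list stands in for a datastore query.

```python
from datetime import datetime, timedelta


def tasks_for_next_interval(reminders, now, interval=timedelta(hours=1)):
    """Return (reminder_id, delay_seconds) for reminders due in [now, now + interval)."""
    window_end = now + interval
    tasks = []
    for reminder_id, due in reminders:
        if now <= due < window_end:
            delay = int((due - now).total_seconds())
            tasks.append((reminder_id, delay))
    return tasks
```

Each returned delay would be used as the task's delay when it is enqueued. Reminders that are already overdue are skipped here, so a real job would probably need to handle them separately.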
You can do this with Task Queues - basically when you receive the request 'remind me at date/time X by sending an email', you create a new task with the following basic structure:
if current time is close to or past the given date/time X:
    send the email
else:
    fail this task
If the reminder time is far in the future, the first few times the task is scheduled, it will fail and be scheduled for later. The downside of this approach is that it doesn't guarantee that the task will run exactly when the reminder is supposed to be sent - it may be a little while before or afterwards. You could slim down this window by taking into account that your task can run for 10 minutes, so if you're within 10 minutes of the reminder time, sleep until the right time and then send the e-mail.
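The decision the task makes on each run, including the sleep refinement, can be sketched as a pure function. It is written in Python rather than Java for brevity; the function name, the returned action labels, and the 600-second default (taken from the 10-minute task limit mentioned above) are assumptions for illustration.

```python
from datetime import datetime


def reminder_action(now, due, max_sleep_seconds=600):
    """Decide what a reminder task should do on this run.

    Returns (action, seconds_to_sleep).
    """
    remaining = (due - now).total_seconds()
    if remaining <= 0:
        return ("send", 0)
    if remaining <= max_sleep_seconds:
        # Close enough: wait inside the task, then send.
        return ("sleep_then_send", remaining)
    # Too early: fail so the task queue retries later.
    return ("fail", 0)
```

A task handler would call this first and then either send the email, sleep for the returned number of seconds before sending, or return an error status so that the task is retried.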
If the reminders have to be sent out as close in time as possible then just use a Backend - keep an instance running forever and dispatch all reminders to it, and it can continuously look at all reminders it has to send out and send them out at exactly the right time.
