Schedule a GET request with GAE

Schedule a GET request with GAE - google-app-engine

I've been researching ways to schedule a GET request with a GAE application. Specifically, I want my application to respond 1 hour after it's been requested by fetching a different URL that points to another app's API.
Are Deferred Tasks the way to handle this?
I also found that Tasks have an "eta" argument that specifies earliest time of execution. Could this be preferred over "_countdown"?
Or investigate Cron jobs? These GET requests won't be happening regularly, so I don't know if Cron jobs are appropriate.
Thanks! Please help me clarify if necessary.

Yes that's a good way to do it, all you have to do is to set the _countdown in your deferred call, which is how many seconds you want to wait until this task will be executed.
Example as from the docs:
deferred.defer(do_something_expensive, _countdown=3600, _queue="myqueue")
Or you can simply use the Task API where you can set all the different parameters on when and how exactly you want this task to be executed. Whatever suits you best you can use either eta or countdown, from GAE perspective it is exactly the same.

As long as you don't want to the second accuracy (say minute accuracy). I would add the request to the datastore implementing a queue of requests. Then have a cron job run every minute looking for requests scheduled for that time period. Then I would submit a task to perform the requests. Name the task so you are unlikely to have the same task re-submitted. The task can retry a couple of times (if it errors) then you can mark the request as completed in your queue.
This way you can handle any number of scheduled requests. You don't end up with thousands of tasks. You can know if requests will run, when they run etc...

Related

Custom Metrics cron job Datastore timeout

I have written a code to write data to custom metrics cloud monitoring - google app engine.
For that i am storing the data for some amount of time say: 15min into datastore and then a cron job runs and gets the data from there and plots the data on the cloud monitoring dashboard.
Now my problem is : while fetching huge data to plot from the datastore the cron job may timeout. Also i wanted to know what happens when cron job fails ?
Also Can it fail if the number of records is high ? if it can, what alternates could we do. Safely how many records cron could process in 10 min timeout duration.
Please let me know if any other info is needed.
Thanks!

You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
Cron job is not re-tried. You need to implement this mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks to process data as necessary and add them to the queue. In this case you don't have to redo the whole job - or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly as opposed to one or few tasks that take minutes. This way you minimize wasted resources if a task fails, and have smaller impact on other processes running on the same instance.

Google app engine API: Running large tasks

Good day,
I am running a back-end to an application as an app engine (Java).
Using endpoints, I receive requests. The problem is, there is something big I need to compute, but I need fast response times for the front end. So as a solution I want to precompute something, and store it a dedicated the memcache.
The way I did this, is by adding in a static block, and then running a deferred task on the default queue. Is there a better way to have something calculated on startup?
Now, this deferred task performs a large amount of datastore operations. Sometimes, they time out. So I created a system where it retries on a timeout until it succeeds. However, when I start up the app engine, it immediately creates two of the deferred task. It also keeps retrying the tasks when they fail, despite the fact that I set DeferredTaskContext.setDoNotRetry(true);.
Honestly, the deferred tasks feel very finicky.
I just want to run a method that takes >5 minutes (probably longer as the data set grows). I want to run this method on startup, and afterwards on a regular basis. How would you model this? My first thought was a cron job but they are limited in time. I would need a cron job that runs a deferred task, hope they don't pile up somehow or spawn duplicates or start retrying.
Thanks for the help and good day.
Dries

Your datastore operations should never time out. You need to fix this - most likely, by using cursors and setting the right batch size for your large queries.
You can perform initialization of objects on instance startup - check if an object is available, if not - do the calculations.
Remember to store the results of your calculations in the datastore (in addition to Memcache) as Memcache is volatile. This way you don't have to recalculate everything a few seconds after the first calculation was completed if a Memcache object was dropped for any reason.
Deferred tasks can be scheduled to perform after a specified delay. So instead of using a cron job, you can create a task to be executed after 1 hour (for example). This task, when it completes its own calculations, can create another task to be excited after an hour, and so on.

How Google App Engine Java Task Queues can be used for mass scheduling for users?

I am focusing GAE-J for developing a Java web application.
I have a scenario where user will create his schedule for set of reminders. And I have to send emails on that particular date/time.
I can not create thread on GAE. So I have the solution of Task Queues.
So can I achieve this functionality with Task Queues. User will create tasks. And App Engine will execute it on specific date and time.
Thanks

Although using the task queue directly, as Chris suggests, will work, for longer reminder periods (eg, 30+ days) and in cases where the reminder might be modified, a more indirect approach is probably wise.
What I would recommend is storing reminders in the datastore, and then taking one of a few approaches, depending on your requirements:
Run a regular cron job (say, hourly) that fetches a list of reminders coming up in the next interval, and schedules task queue tasks for each.
Have a single task that you schedule to be run at the time the next reminder (system-wide) is due, which sends out the reminder(s) and then enqueues a new task for the next reminder that's due.
Run a backend, as Chris suggests, which regularly scans the datastore for upcoming reminders.
In all the above cases, you'll probably need some special case code for when a user sets a reminder in less than the minimum polling interval you've set - probably enqueuing a task directly. You'll also want to consider batching up the sending of reminders, to minimize tasks and wallclock time consumed.

You can do this with Task Queues - basically when you receive the request 'remind me at date/time X by sending an email', you create a new task with the following basic structure:
if current time is close to or past the given date/time X:
send the email
else
fail this task
If the reminder time is far in the future, the first few times the task is scheduled, it will fail and be scheduled for later. The downside of this approach is that it doesn't guarantee that the task will run exactly when the reminder is supposed to be sent - it may be a little while before or afterwards. You could slim down this window by taking into account that your task can run for 10 minutes, so if you're within 10 minutes of the reminder time, sleep until the right time and then send the e-mail.
If the reminders have to be sent out as close in time as possible then just use a Backend - keep an instance running forever and dispatch all reminders to it, and it can continuously look at all reminders it has to send out and send them out at exactly the right time.

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from at a regular schedule and without any user interaction. The total number of sites isn't going to static but they're all saved in the datastore [Table0] along side the interval at which they're read at. The interval may vary as regular as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.

Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.

GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.

I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs first item (URL) from datastore, executes HTTP request, operates on data, updates the URL record as having worked on it and the invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
You can see it in action one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.

Task Queue VS. URLFetch

I need to run a script (python) in App Engine for many times.
One possibility is just to run a loop and use urlfetch with a link to the script.
The other one is to open a task with the script URL.
What is the difference between both ways? It seems like Tasks have a quota (100,000 daily free tasks) so why should I use them?
Thanks,
Joel

Briefly:
Bulk adding tasks to the queue will probably be easier, and possibly quicker, than using URLFetch. Although using async url-fetches might help with this.
When a task fails, it will automatically retry. Assuming you check the status of your call, URLFetch might just hang for a while before you get some type of error.
You can control the rate at which tasks are executed. So if you add 1,000 tasks fast you can let them slowly run at 10 / minute (or whatever you want), helping you not blow through your other quotas.
If you enable billing, the free quota is 20,000,000 / tasks per day.
Depending on what you are doing, tasks can be transactionally enqueued, which gives you some really powerful abilities.