Custom Metrics cron job Datastore timeout - google-app-engine

I have written code that writes data to custom metrics in Cloud Monitoring on Google App Engine.
To do that, I store the data in the datastore for some amount of time, say 15 minutes. A cron job then runs, reads the data from there, and plots it on the Cloud Monitoring dashboard.
My problem is that the cron job may time out while fetching a large amount of data from the datastore. I also want to know what happens when a cron job fails.
Can it fail if the number of records is high? If it can, what alternatives are there? And how many records can a cron job safely process within the 10-minute timeout?
Please let me know if any other info is needed.
Thanks!

You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
A cron job is not retried. You need to implement this mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks to process data as necessary and add them to the queue. In this case you don't have to redo the whole job - or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly than one or a few tasks that take minutes. This way you minimize wasted resources if a task fails, and you have a smaller impact on other processes running on the same instance.
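As a rough sketch of the "many small tasks" idea: the cron handler only splits the record keys into batches and enqueues one task per batch. The names `chunk`, `fan_out` and the `enqueue` callback are made up for illustration. In a real app `enqueue` would wrap whatever task queue call you use; here a plain list stands in for the queue so the example runs anywhere.

```python
from typing import Callable, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def fan_out(record_keys: Sequence[str], batch_size: int,
            enqueue: Callable[[List[str]], None]) -> int:
    """Enqueue one task per batch and return the number of tasks created."""
    batches = chunk(record_keys, batch_size)
    for batch in batches:
        # Hypothetical hook: a real handler would add a task to the queue here.
        enqueue(batch)
    return len(batches)


if __name__ == "__main__":
    queue: List[List[str]] = []  # in-memory stand-in for a real task queue
    keys = [f"metric-{n}" for n in range(10)]
    created = fan_out(keys, batch_size=4, enqueue=queue.append)
    print(created, [len(batch) for batch in queue])
```

If one batch fails, only that batch is retried, which is the point of keeping the batches small.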

Related

Google App Engine - Push Taskqueue - Limit to Countdown?

I am setting up push task queue on my Google App Engine App with a countdown parameter so it will execute at some point in the future.
However, my countdown parameter can be very large in seconds, for instance months or even a year in the future. Just want to make sure this will not cause any problems or overhead cost? Maybe there is a more efficient way to do this?
It probably would work, but it seems like a bad idea. What do you do if you change your task processing code? You can't modify a task in the queue. You'd somehow have to keep track of the tasks, delete the old ones and replace them with new ones that work with your updated code.
Instead, store information about the tasks in the data store. Run a cron job once a day or once a week, process the info in the data store, and launch the tasks as needed. You can still use a countdown if you need a precise execution date and time.
The current limit in Task Queues is 30 days, and we don't have plans to raise that substantially.
Writing scheduled operations to datastore and running a daily cron job to inject that day's tasks is a good strategy. That would allow you to update the semantics as your product evolves.
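A minimal sketch of that daily injection step, assuming the scheduled operations have already been loaded from the datastore into a dict of id to run time. The function name `tasks_for_window` and the dict layout are assumptions for illustration. The cron job would enqueue each returned operation with the computed countdown.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

MAX_COUNTDOWN = timedelta(days=30)  # queue limit quoted in the answer above


def tasks_for_window(scheduled: Dict[str, datetime], now: datetime,
                     window: timedelta = timedelta(days=1)) -> List[Tuple[str, int]]:
    """Return (operation_id, countdown_seconds) for operations due within the window."""
    if window > MAX_COUNTDOWN:
        raise ValueError("window exceeds the maximum countdown")
    due = []
    for op_id, run_at in scheduled.items():
        if run_at < now + window:
            # Overdue operations get a countdown of 0 so they run immediately.
            countdown = max(0, int((run_at - now).total_seconds()))
            due.append((op_id, countdown))
    return sorted(due, key=lambda item: item[1])


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    scheduled = {
        "send-report": now + timedelta(hours=2),
        "renew-plan": now + timedelta(days=90),
    }
    print(tasks_for_window(scheduled, now))
```

Operations that fall outside the window stay in the datastore untouched, so they can still be edited or cancelled before a later run picks them up.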

how do you deploy a cron script in production?

i would like to write a script that schedules various things throughout the day. unfortunately it will do > 100 different tasks a day, closer to 500 and could be up to 10,000 in the future.
All the tasks are independent in that you can think of my script as a service for end users who sign up and want me to schedule a task for them. so if 5 ppl sign up and person A wants me to send them an email at 9 am, this will be different than person B who might want me to query an api at 10:30 pm etc.
now, conceptually I plan to have a database that tells me what each persons task will be and what time they asked to schedule that task and the frequency. once a day I will get this data from my database so I have an up-to-date record of all the tasks that need to be executed in the day
running them through a loop I can create channels that can execute timers or tickers for each task.
the question I have is how does this get deployed in production to, for example google app engine? since those platforms are for Web servers I'm not sure how this would work...Or am I supposed to use Google Compute Engine and have it act as a computation for 24 hours? Can google compute engine even make http calls?
also if I have to have say 500 channels in go open 24 hrs a day, does that count as 500 containers in google app engine? I imagine that will get very costly quickly, despite what is essentially a very low cost product.
so again the question comes back to, how does a cron script get deployed in production?
any help or guidance will be greatly appreciated as I have done a lot of googling and unfortunately everything leads back to a cron scheduler that has a limit of 100 tasks in google app engine...
Details about cron operation on GAE can be found here.
The tricky part from your perspective is that updating the cron configuration is done from outside the application, so it's at least difficult (if not impossible) to customize the cron jobs based on your app users' actions.
It is however possible to just run a generic cron job (once a minute, for example) and have that job's handler read the users' custom job configs and further generate tasks accordingly to handle them. Running ~10K tasks per day is usually not an issue, they might even fit inside the free app quotas (depending on what the tasks are actually doing).
The same technique can be applied on a regular Linux OS (including on a GCE VM). I haven't used GCE yet, so I can't tell exactly if or how a dynamically updated cron would be possible with it.
You only need one cron job for your requirements. This cron job can run every 30 minutes - or once per day. It will see what has to be done over the next period of time, create tasks to do it, and add these tasks to the queue.
It can all be done by a single App Engine instance. The number of instances you need to execute your tasks depends, of course, on how long each task runs. You have a lot of control over running the task queue.

Google app engine API: Running large tasks

Good day,
I am running a back-end to an application as an app engine (Java).
Using endpoints, I receive requests. The problem is, there is something big I need to compute, but I need fast response times for the front end. So as a solution I want to precompute something and store it in a dedicated memcache.
The way I did this, is by adding in a static block, and then running a deferred task on the default queue. Is there a better way to have something calculated on startup?
Now, this deferred task performs a large number of datastore operations. Sometimes they time out, so I created a system that retries on a timeout until it succeeds. However, when I start up the app engine, it immediately creates two of the deferred tasks. It also keeps retrying the tasks when they fail, despite the fact that I set DeferredTaskContext.setDoNotRetry(true);.
Honestly, the deferred tasks feel very finicky.
I just want to run a method that takes >5 minutes (probably longer as the data set grows). I want to run this method on startup, and afterwards on a regular basis. How would you model this? My first thought was a cron job but they are limited in time. I would need a cron job that runs a deferred task, hope they don't pile up somehow or spawn duplicates or start retrying.
Thanks for the help and good day.
Dries
Your datastore operations should never time out. You need to fix this - most likely, by using cursors and setting the right batch size for your large queries.
You can perform initialization of objects on instance startup - check if an object is available, if not - do the calculations.
Remember to store the results of your calculations in the datastore (in addition to Memcache) as Memcache is volatile. This way you don't have to recalculate everything a few seconds after the first calculation was completed if a Memcache object was dropped for any reason.
Deferred tasks can be scheduled to run after a specified delay. So instead of using a cron job, you can create a task to be executed after 1 hour (for example). This task, when it completes its own calculations, can create another task to be executed after an hour, and so on.
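To illustrate the cursor advice, here is a small sketch of walking a large result set one page at a time. It is written in Python rather than Java only to keep it short. `make_fetch_page` is a hypothetical in-memory stand-in for a cursor-based datastore query, not a real datastore call; the integer cursor is just an offset into a list.

```python
from typing import Callable, List, Optional, Tuple

Page = Tuple[List[int], Optional[int]]  # (results, next_cursor or None)


def make_fetch_page(rows: List[int]) -> Callable[[int, Optional[int]], Page]:
    """Build a hypothetical in-memory stand-in for a cursor-based query."""
    def fetch_page(batch_size: int, cursor: Optional[int]) -> Page:
        start = cursor or 0
        end = start + batch_size
        # A cursor of None signals that there are no more results.
        next_cursor = end if end < len(rows) else None
        return rows[start:end], next_cursor
    return fetch_page


def process_all(fetch_page: Callable[[int, Optional[int]], Page],
                batch_size: int) -> Tuple[int, int]:
    """Walk the whole result set page by page; return (total, pages_fetched)."""
    total, pages, cursor = 0, 0, None
    while True:
        results, cursor = fetch_page(batch_size, cursor)
        total += sum(results)
        pages += 1
        if cursor is None:
            break
    return total, pages


if __name__ == "__main__":
    fetch = make_fetch_page(list(range(1, 11)))
    print(process_all(fetch, batch_size=3))
```

Because each page only needs the cursor from the previous one, the loop body could just as well be a task that processes one page and then enqueues the next task with the new cursor, so no single request has to run for long.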

Scheduling cron jobs

I want to develop an app in which a user can register for multiple alerts, so that whenever a fare drops below some threshold, they get a notification. Fares are fetched from a third-party website. I want to do this on Google App Engine.
From what I understand, I need a process running 24/7 that checks the fares at intervals of, say, 30 minutes and sends out a notification whenever a fare drops below the threshold. Can the App Engine cron service be used for this? At most 100 cron jobs can be scheduled, so what would be a better way to do this? Also, having a process for each user would be a waste of resources. What scheduling approach would be more efficient?
You want to schedule a single cron that runs every 30 minutes and throws an item onto a task queue. That single item on the task queue would then be able to go through all your users, and generate tasks to fetch whatever you need in the background again. Two important things:
You want the initial cron call to return as quickly as possible, as URLs have a 60 second deadline.
Split up any work into separate task queues to achieve the above, and also to iterate through data sources and/or users.
Based on what you're explaining, you can use push task queues: https://cloud.google.com/appengine/docs/python/taskqueue/overview-push

GAE - Execute many small tasks after a fixed time

I'd like to make a Google App Engine app that sends a Facebook message to a user a fixed time (e.g. one day) after they click a button in the app. It's not scalable to use cron or the task queue for potentially millions of tiny jobs. I've also considered implementing my own queue using a background thread, but that's only available using the Backends API as far as I know, which is designed for much larger usage and is not free.
Is there a scalable way for a free Google App Engine app to execute a large number of small tasks after a fixed period of time?
For starters, if you're looking to do millions of tiny jobs, you're going to blow past the free quota very quickly, any way you look at it. The free quota's meant for testing.
It depends on the granularity of your tasks. If you're executing a lot of tasks once per day, cron hooked up to a mapreduce operation (which essentially sends out a bunch of tasks on task queues) works fine. You'll basically issue a datastore query to find the tasks that need to be run, and send them out on the mapreduce.
If you execute this task thousands of times a day (every minute), it may start getting expensive because you're issuing many queries. Note that if most of those queries return nothing, the cost is still minimal.
The other option is to store your tasks in memory rather than in the datastore, that's where you'd want to start using backends. But backends are expensive to maintain. Look into using Google Compute Engine, which gives much cheaper VMs.
EDIT:
If you go the cron/datastore route, you'd store a new entity whenever a user wants to send a deferred message. Most importantly, it'd have a queryable timestamp for when the message should be sent, probably rounded to the nearest minute or the nearest 5 minutes, whatever you decide your granularity should be.
You would then have a cron job that runs at the set interval, say every minute. On each run it would build a query for all the messages it needs to send for the given minute.
If you really do have hundreds of thousands of messages to send each minute, you're not going to want to do it from the cron task. You'd want the cron task to spawn a mapreduce job that will fan out the query and spawn tasks to send your messages.
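A small sketch of the timestamp bucketing described in the edit. Assumptions: timestamps are truncated to the start of their bucket (rounding to the nearest boundary would work too), the granularity divides 60 evenly, and a plain dict stands in for the datastore entities. `bucket_for` and `due_messages` are made-up names.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def bucket_for(when: datetime, granularity_minutes: int = 5) -> datetime:
    """Truncate a timestamp to the start of its granularity bucket."""
    minute = when.minute - when.minute % granularity_minutes
    return when.replace(minute=minute, second=0, microsecond=0)


def due_messages(messages: Dict[str, datetime], run_time: datetime,
                 granularity_minutes: int = 5) -> List[str]:
    """Return ids of messages whose stored bucket matches this cron run's bucket."""
    current = bucket_for(run_time, granularity_minutes)
    return sorted(mid for mid, send_at in messages.items() if send_at == current)


if __name__ == "__main__":
    clicked = datetime(2024, 1, 1, 9, 7, 30)
    # Store the bucketed send time when the user clicks the button.
    messages = {"msg-1": bucket_for(clicked + timedelta(days=1))}
    # The cron run a day later looks up everything in its own bucket.
    print(due_messages(messages, datetime(2024, 1, 2, 9, 5, 2)))
```

The equality match keeps the query cheap, but a skipped cron run would leave its bucket unsent, so a real job would probably also need a way to pick up older unsent buckets.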
