Continuously running service in Google Cloud Engine - google-app-engine

I am trying to figure out how to run a service(1) when it does not receive any calls.
I want to use Microservices Architecture.
Basically i want to run this service (1) when the other service(2) is receiving calls and all data.
As the service(1) i mentioned is not receiving it would not have to spawn new instances and i would want only the service(2) to scale.
I have noticed scheduling jobs with cron yaml but the number of calls is limited.
I need to get this service(1) to be active every 1 min when service(2) is active.

It's hard to give a good answer without knowing more about what service (1) has to do when it is 'active'. It sounds you want cron to launch a task every minute.
You can use cron in conjunction with push queues: https://cloud.google.com/appengine/docs/standard/go/taskqueue/push/
When creating a push queue task, you can set the property delay before adding it to the queue: https://cloud.google.com/appengine/docs/standard/go/taskqueue/reference#Task
(For me in Python they called it countdown https://cloud.google.com/appengine/docs/standard/python/refdocs/google.appengine.api.taskqueue.taskqueue#google.appengine.api.taskqueue.taskqueue.add)
You could have a cron job that fires every 24 hrs. That cron job would load up your push queue with tasks who's delays are staggered. The delay of the first one is 1 min, the delay of the second one is 2 min, etc.

Related

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see is the following screenshot, Cloud Tasks gives an ETA of about 3 mins:
The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent, etc.) and if you don't specify an execution time of a task (e.g., eta) then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.
I see from your screenshot that you have a task with an ETA of 2 min 49 sec which is the time until your task will be run. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you are using when queueing your tasks, especially the scheduled_time attribute. Check your code to see if you are adding a delay to your tasks, and make sure to tune it down.
Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, pubsub was much slower in my tests... about 40ms per message to queue a message.

Scheduling cron jobs

I want to develop an app on which a user can register for alerts( multiple) so that whenever the fare hits below some threshold, he gets a notification. Fares are fetched from a third party website.I want to do this on google app-engine.
Now from what i understand , i need a process running 24/7 which checks the fares at say intervals of 30 mins and send out a notification whenever it hits below the threshold. Probably the cron job of app-engine can be used for this task ? But at max 100 cron jobs can be scheduled, what would be the better way to this. Also having a process for each user would be wastage of resources, what would be better scheduling algorithms for higher efficiency ?
You want to schedule a single cron that runs every 30 minutes and throws an item onto a task queue. That single item on the task queue would then be able to go through all your users, and generate tasks to fetch whatever you need in the background again. Two important things:
You want the initial cron call to return as quickly as possible, as URLs have a 60 second deadline.
Split up any work into separate task queues to achieve above and also iterate through data sources and/or users.
Based on what you're explaining, you can use push task queues: https://cloud.google.com/appengine/docs/python/taskqueue/overview-push

How to create X tasks as fast as possible on Google App Engine

We push out alerts from GAE, and let's say we need to push out 50 000 alerts to CD2M (Cloud 2 Device Messaging). For this we:
Read all who wants alerts from the datastore
Loop through and create a "push task" for each notification
The problem is that the creation of the task takes some time so this doesn't scale when the user base grows. In my experience we are getting 20-30 seconds just creating the tasks when there is a lot of them. The reason for one task pr. push message is so that we can retry the task if something fails and it will only affect a single subscriber. Also C2DM only supports sending to one user at a time.
Will it be faster if we:
Read all who wants alerts from the datastore
Loop through and create a "pool task" for each 100 subscribers
Each "Pool task" will generate 100 "push tasks" when they execute
The task execution is very fast so in our scenario it seems like the creation of the tasks is the bottleneck and not the execution of the tasks. That's why I thought about this scenario to be able to increase the parallelism of the application. I would guess this would lead to faster execution but then again I may be all wrong :-)
We do something similar with APNS (Apple Push Notification Server): we create a task for a batch of notifications at a time (= pool task as you call it). When task executes, we iterate over a batch and send it to push server.
The difference with your setup is that we have a separate server for communicating with push, as APNS only supports socket communication.
The only downside is if there is an error, then whole task will be repeated and some users might get two notifications.
This sounds like it varies based on the number of alerts you need to send out, how long it takes to send each alert, and the number of active instances you have running.
My guess is that it takes a few milliseconds to tens of milliseconds to send out a CD2M alert, while it takes a few seconds for an instance to spin up, so you can probably issue a few hundred or a few thousand alerts before justifying another task instance. The ratio of the amount of time it takes to send each CD2M message vs the time it takes to launch an instance will dictate how many messages you'd want to send per task.
If you already have a fair number of instances running though, you don't have the delay of waiting for instances to spin up.
BTW, this seems almost like a perfect application of the MapReduce API. It mostly does what you describe in the second version, except it takes your initial query, and breaks that up into subqueries that each return a "page" of the result set. A task is launched for each subquery which processes all the items in its "page". This is an improvement from what you describe, because you don't need to spend the time looping through your initial result set.
I believe the default implementation for the MapReduce API just queries for all entities of a particular kind (ie all User objects), but you can change the filter used.

How Google App Engine Java Task Queues can be used for mass scheduling for users?

I am focusing GAE-J for developing a Java web application.
I have a scenario where user will create his schedule for set of reminders. And I have to send emails on that particular date/time.
I can not create thread on GAE. So I have the solution of Task Queues.
So can I achieve this functionality with Task Queues. User will create tasks. And App Engine will execute it on specific date and time.
Thanks
Although using the task queue directly, as Chris suggests, will work, for longer reminder periods (eg, 30+ days) and in cases where the reminder might be modified, a more indirect approach is probably wise.
What I would recommend is storing reminders in the datastore, and then taking one of a few approaches, depending on your requirements:
Run a regular cron job (say, hourly) that fetches a list of reminders coming up in the next interval, and schedules task queue tasks for each.
Have a single task that you schedule to be run at the time the next reminder (system-wide) is due, which sends out the reminder(s) and then enqueues a new task for the next reminder that's due.
Run a backend, as Chris suggests, which regularly scans the datastore for upcoming reminders.
In all the above cases, you'll probably need some special case code for when a user sets a reminder in less than the minimum polling interval you've set - probably enqueuing a task directly. You'll also want to consider batching up the sending of reminders, to minimize tasks and wallclock time consumed.
You can do this with Task Queues - basically when you receive the request 'remind me at date/time X by sending an email', you create a new task with the following basic structure:
if current time is close to or past the given date/time X:
send the email
else
fail this task
If the reminder time is far in the future, the first few times the task is scheduled, it will fail and be scheduled for later. The downside of this approach is that it doesn't guarantee that the task will run exactly when the reminder is supposed to be sent - it may be a little while before or afterwards. You could slim down this window by taking into account that your task can run for 10 minutes, so if you're within 10 minutes of the reminder time, sleep until the right time and then send the e-mail.
If the reminders have to be sent out as close in time as possible then just use a Backend - keep an instance running forever and dispatch all reminders to it, and it can continuously look at all reminders it has to send out and send them out at exactly the right time.

crawler on appengine

i want to run a program continiously on appengine.This program will automatically crawl some website continiously and store the data into its database.Is it possible for the program to
continiously keep doing it on appengine?Or will appengine kill the process?
Note:The website which will be crawled is not stored on appengine
i want to run a program continiously
on appengine.
Can't.
The closest you can get is background-running scheduled tasks that last no more than 30 seconds:
Notably, this means that the lifetime
of a single task's execution is
limited to 30 seconds. If your task's
execution nears the 30 second limit,
App Engine will raise an exception
which you may catch and then quickly
save your work or log process.
A friend of mine suggested following
Create a task queue
Start the queue by passing some data.
Use an Exception handler and handle DeadlineExceededException.
In your handler create a new queue for same purpose.
You can run your job infinitely. You only need to consider used CPU Time and storage.
You might want to consider Backends introduced in the newer version of GAE.
These run continuous processes
Is Possible Yes, I have already build a solution on Appengine - wowprice
Sharing all details here will make my answer lengthy,
Problem - Suppose I want to crawl walmart.com, As i known that I cant crawl in one shot(millions products)
Solution - I have designed my spider to break the task in smaller task.
Step 1 : I input job for walmart.com, Job scheduler will create a task.
Step 2 : My spider will pick the job and its notice that Its index page, now my spider will create more jobs as starting page as categories page, Now its enters 20 more tasks
Step 3 : now spider make more smaller jobs for subcategories, and its will go till it gets product list page and create task for it.
Step 4 : for product list pages, its get the product and make call to to stores the product data and in case of next page It ll make one task to crawl them.
Advantages -
We can crawl without breaking 30 seconds rules, and speed of crawling will depends backend machine, It will provide parallel crawling for single target.
they fixed it for you.
you can run background threads on a manual scaled instance.
check https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.
Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
Try this:
on appengine run any program. You connect from browser, click for start url during ajax. Ajax call server, download some data from internet and return you (your browser) next url. This is not one request, each url is one diferent request. You mast only resolve in JS how ajax is calling url un cycle.
You can using lasted GAE service called backends . Check this http://code.google.com/appengine/docs/java/backends/
Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.

Resources