which queue should I use (App Engine) - google-app-engine

I have app engine application. I have users refresh tokens (In order to have access to google drive) in my database.
Now, I want to create this:
Every week (I mean every 7th day), I want to temorary download users PDF documents from google drive and work with them. I send mails to each user about their pdf documents.
The main problem is that , there might be many users. and each user may have a lot of document too. I should do that work for each users, once a week. But each users data needs much time too.
QUESTION:
So, now I think, which time service should I use? Cron or Task Queue. and why? and if Task Queue, which one. which will be faster, and flexible? I can send mail to the user later too (it's not necessary to send mails immediately when she/he requests)
QUESTION2: can I run Task Queue , for instance, once a week?
For example If I want to run it every day I can use something like that:
<rate>1/d</rate>
but how can I do that once a week?
QUESTION3: because of there might be many user (and because of each user needs much time), can I use something like that?
CRON job will be weekly (once a week). And CRON to call TASK QUEUSE, for each user. Each user data will be downloaded in the app engine server temporary (I think , If I save it in memory, it will be very hard for server). Then I will see PDF documents and send mails to each user. Is this good way? or should I use only CRON? did I have limitation here? On server storage or at queues or something like that.

Use both. Create a cron job to run every 7 days. Have the cron job fire a task (in the push queue) to process your PDFs. I'd use a separate task for each PDF to process, and configure your queue.yaml so that it processes them at the correct rate (depending on budget / rate limiting factors etc).
If you need to send mail, you can do this from the task request, via the mail api.
As a side note, if you have many users, a better approach may be to have the cron job run more frequently than every 7 days (say, once a day, or even more). You can use logic to determine which users need to be processed each time the cron runs. This may even the load, and ultimately save you money.

Related

how do you deploy a cron script in production?

i would like to write a script that schedules various things throughout the day. unfortunately it will do > 100 different tasks a day, closer to 500 and could be up to 10,000 in the future.
All the tasks are independent in that you can think of my script as a service for end users who sign up and want me to schedule a task for them. so if 5 ppl sign up and person A wants me to send them an email at 9 am, this will be different than person B who might want me to query an api at 10:30 pm etc.
now, conceptually I plan to have a database that tells me what each persons task will be and what time they asked to schedule that task and the frequency. once a day I will get this data from my database so I have an up-to-date record of all the tasks that need to be executed in the day
running them through a loop I can create channels that can execute timers or tickers for each task.
the question I have is how does this get deployed in production to, for example google app engine? since those platforms are for Web servers I'm not sure how this would work...Or am I supposed to use Google Compute Engine and have it act as a computation for 24 hours? Can google compute engine even make http calls?
also if I have to have say 500 channels in go open 24 hrs a day, does that count as 500 containers in google app engine? I imagine that will get very costly quickly, despite what is essentially a very low cost product.
so again the question comes back to, how does a cron script get deployed in production?
any help or guidance will be greatly appreciated as I have done a lot of googling and unfortunately everything leads back to a cron scheduler that has a limit of 100 tasks in google app engine...
Details about cron operation on GAE can be found here.
The tricky portion from your prospective is that updating the cron configuration is done from outside the application, so it's at least difficult (if not impossible) to customize the cron jobs based on your app user's actions.
It is however possible to just run a generic cron job (once a minute, for example) and have that job's handler read the users' custom job configs and further generate tasks accordingly to handle them. Running ~10K tasks per day is usually not an issue, they might even fit inside the free app quotas (depending on what the tasks are actually doing).
The same technique can be applied on a regular Linux OS (including on a GCE VM). I didn't yet use GCE, so I can't tell exactly if/how would a dynamically updated cron be possible with it.
You only need one cron job for your requirements. This cron job can run every 30 minutes - or once per day. It will see what has to be done over the next period of time, create tasks to do it, and add these tasks to the queue.
It can all be done by a single App Engine instance. The number of instances you need to execute your tasks depends, of course, on how long each task runs. You have a lot of control over running the task queue.

GAE Datastore / Task queues – Saving only one item per user at a time

I've developed a python app that registers information from incoming emails and saves this information to the GAE Datastore. Registering the emails works just fine. As part of the registration, emails with the same subject and recipients get a conversation ID. However, sometimes emails enter the system so fast after each other, that emails from the same conversation don't get the same ID. This happens because two emails from the same conversation are being processed at the same time and GAE doesn't see the other entry yet when running a query for this conversation.
I've been thinking of a way to prevent this, and think it would be best if the system processes only one email per user at a time (each sender has his own account). This could be done by having a push task queue that first checks if there is currently an email being processed for this user, and if so, put the new task in a pull queue from which it can be retrieved as soon as the previous task has been finished.
The big disadvantage of this, is that (I think) I can't run the push queue asynchronous, which obviously is a big performance disadvantage. Any ideas on what would be a better way to setup such a process?
Apparently this was a typical race-condition. I've made use of the Transactions functionality to prevent multiple processes writing at the same time. Documentation can be found here: https://cloud.google.com/appengine/docs/python/datastore/transactions

How Google App Engine Java Task Queues can be used for mass scheduling for users?

I am focusing GAE-J for developing a Java web application.
I have a scenario where user will create his schedule for set of reminders. And I have to send emails on that particular date/time.
I can not create thread on GAE. So I have the solution of Task Queues.
So can I achieve this functionality with Task Queues. User will create tasks. And App Engine will execute it on specific date and time.
Thanks
Although using the task queue directly, as Chris suggests, will work, for longer reminder periods (eg, 30+ days) and in cases where the reminder might be modified, a more indirect approach is probably wise.
What I would recommend is storing reminders in the datastore, and then taking one of a few approaches, depending on your requirements:
Run a regular cron job (say, hourly) that fetches a list of reminders coming up in the next interval, and schedules task queue tasks for each.
Have a single task that you schedule to be run at the time the next reminder (system-wide) is due, which sends out the reminder(s) and then enqueues a new task for the next reminder that's due.
Run a backend, as Chris suggests, which regularly scans the datastore for upcoming reminders.
In all the above cases, you'll probably need some special case code for when a user sets a reminder in less than the minimum polling interval you've set - probably enqueuing a task directly. You'll also want to consider batching up the sending of reminders, to minimize tasks and wallclock time consumed.
You can do this with Task Queues - basically when you receive the request 'remind me at date/time X by sending an email', you create a new task with the following basic structure:
if current time is close to or past the given date/time X:
send the email
else
fail this task
If the reminder time is far in the future, the first few times the task is scheduled, it will fail and be scheduled for later. The downside of this approach is that it doesn't guarantee that the task will run exactly when the reminder is supposed to be sent - it may be a little while before or afterwards. You could slim down this window by taking into account that your task can run for 10 minutes, so if you're within 10 minutes of the reminder time, sleep until the right time and then send the e-mail.
If the reminders have to be sent out as close in time as possible then just use a Backend - keep an instance running forever and dispatch all reminders to it, and it can continuously look at all reminders it has to send out and send them out at exactly the right time.

Tasks, Cron jobs or Backends for an app

I'm trying to construct a non-trivial GAE app and I'm not sure if a cron job, tasks, backends or a mix of all is what I need to use based on the request time-out limit that GAE has for HTTP requests.
The distinct steps I need to do are:
1) I have upwards of 15,000 sites I need to pull data from at a regular schedule and without any user interaction. The total number of sites isn't going to static but they're all saved in the datastore [Table0] along side the interval at which they're read at. The interval may vary as regular as every day to every 30 days.
2) For each site from step #1 that fits the "pull" schedule criteria, I need to fetch data from it via HTTP GET (again, it might be all of them or as few as 2 or 3 sites). Once I get the response back from the site, parse the result and save this data into the datastore as [Table1].
3) For all of the data that was recently put into the datastore in [Table1] (they'll have a special flag), I need to issue additional HTTP request to a 3rd party site to do some additional processing. As soon as I receive data from this site, I store all of the relevant info into another table [Table2] in the datastore.
4) As soon as data is available and ready from step #3, I need to take all of it and perform some additional transformation and update the original table [Table1] in the datastore.
I'm not certain which of the different components I need to use to ensure that I can complete each piece of the work without exceeding the response deadline that's placed on the web requests of GAE. For requests initiated by cron jobs and tasks, I believe you're allowed 10 mins to complete it, whereas typical user-driven requests are allowed 30 seconds.
Task queues are the best way to do this in general, but you might want to check out the App Engine Pipeline API, which is designed for exactly the sort of workflow you're talking about.
GAE is a tough platform for your use-case. But, out of extreme masochism, I am attempting something similar. So here are my two cents, based on my experience so far:
Backends -- Use them for any long-running, I/O intensive tasks you may have (Web-Crawling is a good example, assuming you can defer compute-intensive processing for later).
Mapreduce API -- excellent for compute-intensive/parallel jobs such as stats collection, indexing etc. Until recently, this library only had a mapper implementation, but recently Google also released an in-memory Shuffler that is good for jobs that fit in about 100MB.
Task Queues -- For when everything else fails :-).
Cron -- mostly to kick off periodic tasks -- which context you execute them in, is up to you.
It might be a good idea to design your backend tasks so that they can be scheduled (manually, or perhaps by querying your current quota usage) in the "Frontend" context using task queues, if you have spare Frontend CPU cycles.
I abandoned GAE before Backends came out, so can't comment on that. But, what I did a few times was:
Cron scheduled to kick off process
Cron handler invokes a task URL
task grabs first item (URL) from datastore, executes HTTP request, operates on data, updates the URL record as having worked on it and the invokes the task URL again.
So cron is basically waking up taskqueue periodically and taskqueue runs recursively until it reaches some stopping point.
You can see it in action one of my public GAE apps - https://github.com/mavenn/watchbots-gae-python.

crawler on appengine

i want to run a program continiously on appengine.This program will automatically crawl some website continiously and store the data into its database.Is it possible for the program to
continiously keep doing it on appengine?Or will appengine kill the process?
Note:The website which will be crawled is not stored on appengine
i want to run a program continiously
on appengine.
Can't.
The closest you can get is background-running scheduled tasks that last no more than 30 seconds:
Notably, this means that the lifetime
of a single task's execution is
limited to 30 seconds. If your task's
execution nears the 30 second limit,
App Engine will raise an exception
which you may catch and then quickly
save your work or log process.
A friend of mine suggested following
Create a task queue
Start the queue by passing some data.
Use an Exception handler and handle DeadlineExceededException.
In your handler create a new queue for same purpose.
You can run your job infinitely. You only need to consider used CPU Time and storage.
You might want to consider Backends introduced in the newer version of GAE.
These run continuous processes
Is Possible Yes, I have already build a solution on Appengine - wowprice
Sharing all details here will make my answer lengthy,
Problem - Suppose I want to crawl walmart.com, As i known that I cant crawl in one shot(millions products)
Solution - I have designed my spider to break the task in smaller task.
Step 1 : I input job for walmart.com, Job scheduler will create a task.
Step 2 : My spider will pick the job and its notice that Its index page, now my spider will create more jobs as starting page as categories page, Now its enters 20 more tasks
Step 3 : now spider make more smaller jobs for subcategories, and its will go till it gets product list page and create task for it.
Step 4 : for product list pages, its get the product and make call to to stores the product data and in case of next page It ll make one task to crawl them.
Advantages -
We can crawl without breaking 30 seconds rules, and speed of crawling will depends backend machine, It will provide parallel crawling for single target.
they fixed it for you.
you can run background threads on a manual scaled instance.
check https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.
Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
Try this:
on appengine run any program. You connect from browser, click for start url during ajax. Ajax call server, download some data from internet and return you (your browser) next url. This is not one request, each url is one diferent request. You mast only resolve in JS how ajax is calling url un cycle.
You can using lasted GAE service called backends . Check this http://code.google.com/appengine/docs/java/backends/
Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.

Resources