Task Queue in GAE JAVA - Battle Engine - google-app-engine

I use Google App Engine Java for over a year. Trying write a simple strategy game.
Unfortunately current solution does not satisfy me. My experience with GAE is even insufficient.
My problem is to simulate a battle at a specific time, which shall be calculated in the task queue.
Current solution:
I have entities in Datastore :
BattleInfo ( field: battleID, startTime, userA, userB)
I have prepared two servlets: BattleInitServlet and BattleServlet.
BattleInitServlet - execute the query to the datastore and loads BattleInfo. Tasks are added to the battle-queue:
Queue queue = QueueFactory.getQueue("battle-queue");
...
for(BattleInfo i : list){
long countdownMillis = i.getStartTime() - currentTime;
queue.add(TaskOptions.Builder.withUrl("/battle").param("battleID", i.getBattleID()).method(TaskOptions.Method.POST).countdownMillis(countdownMillis));
}
BattleServlet - task will start at a specific time. I receive BattleInfo from the battleID and loads the required data. Simulates the battle and save to datastore.
BattleInitServlet is executed every 30 minutes using cron.
With a few simulation , result is OK. Tests with a large amount of simulation, task queue is clogged. I can solve this by changing the settings Queue, but I increase costs, and does not solve the problem. I do not know how to speed up and optimize.
I tested the pipeline API that works fast. The problem is that the pipeline API initiates a moment and executed tasks immediately .
I want to prepare a pipeline using tasks that runs in the background and executed the battle for a specified time and simulates it. Just don't know how to write it using GAE and tasks.
Has anyone had a similar problem?
Please help me. Thank you.

Can't you link the Pipeline API with your Cron Job to delay execution?
FYI, I guess Pipeline API also leverages Task Queues internally but uses sharding, slicing to optimize the performance.

Related

Why is Google Cloud Tasks so slow?

I use Google Cloud Tasks with AppEngine to process tasks, but the tasks wait about 2-3 minutes in the queue before being sent to my App Engine endpoint.
There is no "delay" set on the tasks, and I expect them to be sent right away.
So the question is: Is Cloud Tasks slow?
As you can see is the following screenshot, Cloud Tasks gives an ETA of about 3 mins:
The official word from Google is that this is the best you can expect from their task queues.
In my experience, how you configure tasks seems to influence how quickly they get executed.
It seems that:
If you don't change the default behavior of your task queues (e.g., maximum concurrent, etc.) and if you don't specify an execution time of a task (e.g., eta) then your tasks will execute very soon after submission.
If you mess with either of these two things, then Google takes longer to execute your tasks. My guess is that it is the extra overhead of controlling task rate and execution.
I see from your screenshot that you have a task with an ETA of 2 min 49 sec which is the time until your task will be run. You have high bucket size and concurrency numbers, so I think your issue has more to do with the parameters you are using when queueing your tasks, especially the scheduled_time attribute. Check your code to see if you are adding a delay to your tasks, and make sure to tune it down.
Just adding here, that as of February 2023, I can queue tasks and then consume them VERY fast using the Python 3.7 libraries.
Takes me about 13.5 seconds to queue up 1000 tasks.
Takes about 1 minute to process those 1000 tasks using a Cloud Run deployed python/flask app. (No other processing done, just receive and reply with 200).
So, super fast!
BTW, pubsub was much slower in my tests... about 40ms per message to queue a message.

Difference between TaskQueue and MapReduce in Google App Engine

I have read the docs about taskqueue and push queues in gae which is used to create long running tasks.
I have doubt in why there was the need for MapReduce? As both do the processing in background, what are the main principal differences between them.
Can someone please explain this?
Edit: I guess i was comparing Apples with monkeys! Hadoop, mapreduce are related. And gae is a backend framework.
You are getting confused with two entirely different things altogether.
MapReduce paradigm is all about distributed parallel processing of very huge amount of data.
TaskQueue is a Scheduler; which can schedule a task to execute say at certain time. It is just a scheduler like a unix cronjobs.
Please take note of bold and italic words in above statements to see the difference.
From the definition of TaskQueue
Task queues let applications perform work, called tasks,
asynchronously outside of a user request. If an app needs to execute
work in the background, it adds tasks to task queues. The tasks are
executed later, by worker services.
By definition, TaskQueue work outside of a user request; means there is no actual user request to perform a task (it is simply submitted/scheduled sometime in past). mapreduce programs are submitted by users to execute, though you may use TaskQueue to schedule one in future.
You are probably getting confused due to words like task, queue, scheduling used in mapreduce world. But those all thing in mapreduce may have some similarity, as they are generic terms - but they are definitely not the same.

Custom Metrics cron job Datastore timeout

I have written a code to write data to custom metrics cloud monitoring - google app engine.
For that i am storing the data for some amount of time say: 15min into datastore and then a cron job runs and gets the data from there and plots the data on the cloud monitoring dashboard.
Now my problem is : while fetching huge data to plot from the datastore the cron job may timeout. Also i wanted to know what happens when cron job fails ?
Also Can it fail if the number of records is high ? if it can, what alternates could we do. Safely how many records cron could process in 10 min timeout duration.
Please let me know if any other info is needed.
Thanks!
You can run your cron job on an instance with basic or manual scaling. Then it can run for as long as you need it.
Cron job is not re-tried. You need to implement this mechanism yourself.
A better option is to use deferred tasks. Your cron job should create as many tasks to process data as necessary and add them to the queue. In this case you don't have to redo the whole job - or remember a spot from which to resume, because tasks are automatically retried if they fail.
Note that with tasks you may not need to create basic/manual scaling instances if each task takes less than 10 minutes to execute.
NB: If possible, it's better to create a large number of tasks that execute quickly as opposed to one or few tasks that take minutes. This way you minimize wasted resources if a task fails, and have smaller impact on other processes running on the same instance.

Google app engine API: Running large tasks

Good day,
I am running a back-end to an application as an app engine (Java).
Using endpoints, I receive requests. The problem is, there is something big I need to compute, but I need fast response times for the front end. So as a solution I want to precompute something, and store it a dedicated the memcache.
The way I did this, is by adding in a static block, and then running a deferred task on the default queue. Is there a better way to have something calculated on startup?
Now, this deferred task performs a large amount of datastore operations. Sometimes, they time out. So I created a system where it retries on a timeout until it succeeds. However, when I start up the app engine, it immediately creates two of the deferred task. It also keeps retrying the tasks when they fail, despite the fact that I set DeferredTaskContext.setDoNotRetry(true);.
Honestly, the deferred tasks feel very finicky.
I just want to run a method that takes >5 minutes (probably longer as the data set grows). I want to run this method on startup, and afterwards on a regular basis. How would you model this? My first thought was a cron job but they are limited in time. I would need a cron job that runs a deferred task, hope they don't pile up somehow or spawn duplicates or start retrying.
Thanks for the help and good day.
Dries
Your datastore operations should never time out. You need to fix this - most likely, by using cursors and setting the right batch size for your large queries.
You can perform initialization of objects on instance startup - check if an object is available, if not - do the calculations.
Remember to store the results of your calculations in the datastore (in addition to Memcache) as Memcache is volatile. This way you don't have to recalculate everything a few seconds after the first calculation was completed if a Memcache object was dropped for any reason.
Deferred tasks can be scheduled to perform after a specified delay. So instead of using a cron job, you can create a task to be executed after 1 hour (for example). This task, when it completes its own calculations, can create another task to be excited after an hour, and so on.

crawler on appengine

i want to run a program continiously on appengine.This program will automatically crawl some website continiously and store the data into its database.Is it possible for the program to
continiously keep doing it on appengine?Or will appengine kill the process?
Note:The website which will be crawled is not stored on appengine
i want to run a program continiously
on appengine.
Can't.
The closest you can get is background-running scheduled tasks that last no more than 30 seconds:
Notably, this means that the lifetime
of a single task's execution is
limited to 30 seconds. If your task's
execution nears the 30 second limit,
App Engine will raise an exception
which you may catch and then quickly
save your work or log process.
A friend of mine suggested following
Create a task queue
Start the queue by passing some data.
Use an Exception handler and handle DeadlineExceededException.
In your handler create a new queue for same purpose.
You can run your job infinitely. You only need to consider used CPU Time and storage.
You might want to consider Backends introduced in the newer version of GAE.
These run continuous processes
Is Possible Yes, I have already build a solution on Appengine - wowprice
Sharing all details here will make my answer lengthy,
Problem - Suppose I want to crawl walmart.com, As i known that I cant crawl in one shot(millions products)
Solution - I have designed my spider to break the task in smaller task.
Step 1 : I input job for walmart.com, Job scheduler will create a task.
Step 2 : My spider will pick the job and its notice that Its index page, now my spider will create more jobs as starting page as categories page, Now its enters 20 more tasks
Step 3 : now spider make more smaller jobs for subcategories, and its will go till it gets product list page and create task for it.
Step 4 : for product list pages, its get the product and make call to to stores the product data and in case of next page It ll make one task to crawl them.
Advantages -
We can crawl without breaking 30 seconds rules, and speed of crawling will depends backend machine, It will provide parallel crawling for single target.
they fixed it for you.
you can run background threads on a manual scaled instance.
check https://developers.google.com/appengine/docs/python/modules/#Python_Background_threads
You cannot literally run one continuous process for more than 30 seconds. However, you can use the Task Queue to have one process call another in a continuous chain. Alternatively you can schedule jobs to run with the Cron service.
Use a cron job to periodically check for pages which have not been scraped in the past n hours/days/whatever, and put scraping tasks for some subset of these pages onto a task queue. This way your processes don't get killed for taking too long, and you don't hammer the server you're scraping with excessive bursts of traffic.
I've done this, and it works pretty well. Watch out for task timeouts; if things take too long, split them into multiple phases and be sure to use memcached liberally.
Try this:
on appengine run any program. You connect from browser, click for start url during ajax. Ajax call server, download some data from internet and return you (your browser) next url. This is not one request, each url is one diferent request. You mast only resolve in JS how ajax is calling url un cycle.
You can using lasted GAE service called backends . Check this http://code.google.com/appengine/docs/java/backends/
Backends are special App Engine instances that have no request deadlines, higher memory and CPU limits, and persistent state across requests. They are started automatically by App Engine and can run continously for long periods. Each backend instance has a unique URL to use for requests, and you can load-balance requests across multiple instances.

Resources