Task queue of size 1 for serial processing - google-app-engine

I want a serial queue where only one task may process at a time. This is my declaration in queue.xml:
<queue>
<name>pumpkin</name>
<rate>20/s</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>1</max-concurrent-requests>
</queue>
Does the "rate" parameter have any effect in this setup?
I want tasks to queue up and only process one at a time.
Thanks

The queue rate and bucket size do not limit the number of tasks processed simultaneously. See this answer, which captured a (now gone) official explanation of these settings better than the current official docs do (you may want to revisit your configuration accordingly): https://stackoverflow.com/a/3740846/4495081
However, your max-concurrent-requests setting should ensure that only one task is executed at a time, according to the docs:
You can avoid this possibility by setting max_concurrent_requests to a
lower value. For example, if you set max_concurrent_requests to 10,
our example queue maintains about 20 tasks/second when latency is 0.3
seconds. However, when the latency increases over 0.5 seconds, this
setting throttles the processing rate to ensure that no more than 10
tasks run simultaneously.

Related

Why is GAE task on queue being run more often than specified?

I have a queue defined as follows:
....
<queue>
<name>sendMailsBatch</name>
<rate>1/m</rate>
<max-concurrent-requests>1</max-concurrent-requests>
<retry-parameters>
<min-backoff-seconds>70</min-backoff-seconds>
<max-doublings>1</max-doublings>
</retry-parameters>
</queue>
....
I want there to be time gap of at least 60 seconds between each time a task runs. This must be the case no matter whether it is the same task being run because it fails, or whether it is different tasks being run.
The process starts with one task being put onto the queue. At the end, if all datastore operations are successful, this task adds another task to the queue (it uses a cursor from the datastore operation it executed).
When I look at the log though, the tasks are executed too often:
Why are the tasks executed this frequently, when I have configured that at most one task can run at a time, and at the most one task per minute, and if a task fails, there should be at least 70s between the runs?
Thanks,
-Louise
App Engine throttles a queue with a token bucket. Each task that starts consumes one token, and the queue keeps dispatching tasks, up to the concurrency limit, for as long as tokens remain in the bucket. Once the bucket is empty, no further task starts until a new token arrives. The rate at which tokens are added to the bucket is defined by <rate>.
In your case, you set <rate> correctly, but because you did not explicitly set <bucket-size>, it defaulted to 5, as mentioned here: https://cloud.google.com/appengine/docs/standard/java/config/queueref. Once you explicitly set <bucket-size> to 1, you should no longer run into this issue.
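The effect of the bucket size can be sketched with a small simulation. This is a deliberately simplified model, not App Engine's actual scheduler: it assumes the bucket starts full, that a task is always waiting, that each task finishes instantly, and that one token arrives every sec_per_token seconds. The function name and parameters are made up for the illustration.

```c
/* Simplified token-bucket model (an illustration, not App Engine's code).
 * Returns how many tasks start during the first horizon_sec seconds. */
int tasks_started(int bucket_size, int sec_per_token, int horizon_sec) {
    int tokens = bucket_size;   /* assumption: the bucket starts full */
    int started = 0;
    for (int t = 0; t < horizon_sec; t++) {
        /* One token arrives every sec_per_token seconds, capped at bucket_size. */
        if (t > 0 && t % sec_per_token == 0 && tokens < bucket_size)
            tokens++;
        /* Every available token lets one waiting task start. */
        while (tokens > 0) {
            tokens--;
            started++;
        }
    }
    return started;
}
```

Under these assumptions, tasks_started(5, 60, 60) counts five task starts in the first minute of a 1/m queue, while tasks_started(1, 60, 60) counts only one.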

How do I decide between taskSpawn(), period(), and watchdogs?

We are using embedded C for the VxWorks real time operating system.
Currently, all of our UDP connections are started with taskSpawn().
This routine creates and activates a new task with a specified
priority and options and returns a system-assigned ID.
We specify the task size, a priority, and pass in an entry point.
These are continuous connections, and thus every entry point contains an infinite loop where we delay before the next iteration.
Then I discovered period().
period spawns a task to call a function periodically.
period() sounds like what we should be using instead, but I can't find any information on when you would prefer this function over taskSpawn(). period() also doesn't allow specifying the task size or the priority, so how are they decided? Is the task size dynamic? What will the priority be?
There are also watchdogs.
Any task may create a watchdog timer and use it to run a specified
routine in the context of the system-clock ISR, after a specified
delay.
Again, this seems to be in line with the goal of processing data at a particular rate. Which do I choose when a task must continuously execute code at the same rate (i.e. in real time)?
What are the differences between these 3 methods?
Here is a little clarification:
taskSpawn(..) creates a task that you are free to use however you like.
Watchdogs shall only be used to monitor time constraints. Remember that the callback of the watchdog is executed within the context of the system clock ISR which has many limitations (e.g. free stack size, never use blocking function calls in an ISR, ...). Additionally executing "a lot of code" in the system clock ISR slows down your entire system.
period(..) is intended to be a helper for the VxWorks shell and not to be used by a program.
With that being said your only option is to use taskSpawn(..) unless you're doing some very simple stuff in which case period(..) might be ok to use.
If you need to do things cyclically in a specific time frame you might look at timers or taskDelay(..) in combination with sysClkRateSet(..).
Another option is to create two tasks: one that gives a semaphore after a specific time interval, and a "worker" task that waits for this semaphore before doing its work. With that approach you separate "timing" from "action", which has proved beneficial in my experience. You might also want to monitor the execution time of the "worker" task by using a watchdog.

Need suggestion for handling large number of timers/timeouts

I am working on redesigning existing L2TP (Layer 2 Tunneling Protocol) code.
We support 96K L2TP tunnels. The L2TP protocol has a keep-alive mechanism that requires sending HELLO messages.
If we have 96,000 tunnels for which L2TPd needs to send a HELLO message after a configured timeout, what is the best way to implement it?
Right now we have a timer thread that iterates over the tunnels every second and sends HELLO messages. This is an old design that no longer scales.
Please suggest a design for handling a large number of timers.
There are a couple of ways to implement timers:
1) select: this system call allows you to wait on a file descriptor and then wake up. You can wait on a file descriptor that does nothing and use the call purely as a timeout.
2) POSIX condition variables: similar to select, they have a timeout mechanism built in.
3) If you are using UNIX, you can set a UNIX signal to wake you up.
Those are basic ideas. You can see how well they scale to multiple timers; I would guess you'd have to have multiple condvars/selects for some handful of the threads.
Depending on the behaviour you want, you would probably want a thread for every 100 timers or so, and use one of the mechanisms above to wake up one of the timers. You'd have a thread sitting in a loop, keeping track of each of the 100 timeouts, then waking up.
Once you exceed 100 timers, you would simply create a new thread and have it manage the next 100 timers and so on.
I don't know if 100 is the right granularity, but it's something you'd play with.
Hopefully that's what you are looking for.
Typically, such requirements are met with a delta-queue. When a timeout is required, get the system tick count and add the timeout interval to it. This gives the Timeout Expiry Tick Count (TETC). Insert the socket object into a queue that is sorted by increasing TETC, so the earliest expiry is at the head, and have the thread wait for the TETC of the item at the head of the queue.
Typically, with socket timeouts, queue insertion is cheap because there are many timeouts with the same interval, so a new timeout will normally be inserted at the queue tail.
Since insertion into the sorted queue could take place anywhere, it is really more of a list than a queue. Its management is best kept to one timeout thread that is normally performing a timed wait on a condvar or semaphore for the lowest TETC. New timeout objects can then be queued to the thread on a thread-safe concurrent queue and signaled to the timeout-handler thread by the semaphore/condvar.
When the timeout thread becomes ready on TETC timeout, it could call some 'OnTimeout' method of the object itself, or it might put the timed-out object onto a threadpool input queue.
Such a delta-queue is much more efficient for handling large numbers of timeouts than any polling scheme, especially for requirements with longish intervals. No polling is required, no CPU/memory bandwidth wasted on continual iterations and the typical latency is going to be a system clock-tick or two.
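A minimal sketch of such a sorted timeout list follows, keeping the earliest expiry at the head and trying the tail first on insertion. The type and function names (timeout_node, timeout_list, timeout_insert and so on) are hypothetical, tick counts are plain integers supplied by the caller, and locking and the condvar wait are left out, so this only illustrates the ordering logic.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct timeout_node {
    uint64_t expiry_tick;          /* TETC: tick at which this entry expires */
    void *object;                  /* hypothetical handle, e.g. a tunnel */
    struct timeout_node *next;
} timeout_node;

typedef struct {
    timeout_node *head;            /* earliest expiry */
    timeout_node *tail;            /* latest expiry */
} timeout_list;

/* Insert n so the list stays sorted by increasing expiry_tick. */
void timeout_insert(timeout_list *l, timeout_node *n,
                    uint64_t now, uint64_t interval) {
    n->expiry_tick = now + interval;
    n->next = NULL;
    if (l->head == NULL) {                                /* empty list */
        l->head = l->tail = n;
    } else if (l->tail->expiry_tick <= n->expiry_tick) {  /* common case: append */
        l->tail->next = n;
        l->tail = n;
    } else if (n->expiry_tick < l->head->expiry_tick) {   /* new earliest entry */
        n->next = l->head;
        l->head = n;
    } else {                                              /* walk to the right spot */
        timeout_node *p = l->head;
        while (p->next->expiry_tick <= n->expiry_tick)
            p = p->next;
        n->next = p->next;
        p->next = n;
    }
}

/* Ticks to wait for the head; 0 if already due, UINT64_MAX if the list is empty. */
uint64_t timeout_next_wait(const timeout_list *l, uint64_t now) {
    if (l->head == NULL)
        return UINT64_MAX;
    return l->head->expiry_tick > now ? l->head->expiry_tick - now : 0;
}

/* Detach and return the head if it has expired, otherwise NULL. */
timeout_node *timeout_pop_expired(timeout_list *l, uint64_t now) {
    timeout_node *n = l->head;
    if (n == NULL || n->expiry_tick > now)
        return NULL;
    l->head = n->next;
    if (l->head == NULL)
        l->tail = NULL;
    n->next = NULL;
    return n;
}
```

In a real daemon the timeout thread would call timeout_next_wait() to size its timed wait, then drain timeout_pop_expired() in a loop and re-insert each tunnel for its next HELLO.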
It is dependent on the processor/OS, kernel version, architecture.
In Linux, one option is to use the kernel's timer functionality for multiple timers. You define each timer as a timer_list and initialize its internal values with init_timer (replaced by timer_setup in newer kernels).
Then fill in the timer_list fields appropriately for the respective timer: the timeout (expires), the function to execute after the timeout (function), and the parameter passed to that function (data). Finally, register the timer with add_timer. Once jiffies is greater than or equal to the timeout (expires), the respective timer handler (function) is triggered.
Some processors provide timer wheels (a number of queues placed in slots equally spaced in time), which can be configured for a wide range of timers and timeouts as required.
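The timer-wheel idea can also be sketched in software. The following is a minimal single-level wheel, not any particular processor's or kernel's implementation. The names (timer_wheel, wheel_add, wheel_tick) and the slot count are assumptions for the illustration, and it assumes every timeout is shorter than one revolution of the wheel.

```c
#include <stddef.h>

#define WHEEL_SLOTS 64   /* assumption: all timeouts are under 64 ticks */

typedef struct wheel_timer {
    struct wheel_timer *next;
    int id;              /* hypothetical tunnel identifier */
} wheel_timer;

typedef struct {
    wheel_timer *slot[WHEEL_SLOTS];  /* one list of timers per tick */
    unsigned current;                /* slot belonging to the current tick */
} timer_wheel;

/* Schedule t to fire `ticks` ticks from now, in O(1).
 * Returns 0 on success, -1 if ticks is outside 1..WHEEL_SLOTS-1. */
int wheel_add(timer_wheel *w, wheel_timer *t, unsigned ticks) {
    if (ticks == 0 || ticks >= WHEEL_SLOTS)
        return -1;
    unsigned s = (w->current + ticks) % WHEEL_SLOTS;
    t->next = w->slot[s];
    w->slot[s] = t;
    return 0;
}

/* Advance the wheel by one tick and return the detached list of due timers. */
wheel_timer *wheel_tick(timer_wheel *w) {
    w->current = (w->current + 1) % WHEEL_SLOTS;
    wheel_timer *due = w->slot[w->current];
    w->slot[w->current] = NULL;
    return due;
}
```

Each tick touches only the timers that are actually due, so the per-tick cost does not grow with the total number of tunnels.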

Scheduling events at microsecond granularity in POSIX

I'm trying to determine the granularity I can accurately schedule tasks to occur in C/C++. At the moment I can reliably schedule tasks to occur every 5 microseconds, but I'm trying to see if I can lower this further.
Any advice on how to achieve this / if it is possible would be greatly appreciated.
Since I know timer granularity can often be OS dependent: I am currently running on Linux, but would use Windows if the timing granularity is better (although I don't believe it is, based on what I've found for the QueryPerformanceCounter)
I execute all measurements on bare metal (no VM). /proc/timer_list confirms nanosecond timer resolution for my CPU (but I know that doesn't translate to nanosecond alarm resolution).
Current
My current code can be found as a Gist here
At the moment, I'm able to execute a request every 5 microseconds (5000 nanoseconds) with less than 1% late arrivals. When late arrivals do occur, they are typically only one cycle (5000 nanoseconds) behind.
I'm doing 3 things at the moment:
Setting the process to real-time priority (as pointed out by @Spudd86 here)
struct sched_param schedparm;
memset(&schedparm, 0, sizeof(schedparm));
schedparm.sched_priority = 99; // highest rt priority
sched_setscheduler(0, SCHED_FIFO, &schedparm);
Minimizing the timer slack
prctl(PR_SET_TIMERSLACK, 1);
Using timerfds (part of the 2.6 Linux kernel)
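int timerfd = timerfd_create(CLOCK_MONOTONIC, 0);
struct itimerspec timspec;
memset(&timspec, 0, sizeof(timspec)); // memset() instead of the obsolete bzero()
timspec.it_interval.tv_sec = 0;
timspec.it_interval.tv_nsec = nanosecondInterval; // period between expirations
timspec.it_value.tv_sec = 0;
timspec.it_value.tv_nsec = 1; // first expiration 1 ns after arming (an all-zero it_value would disarm the timer)
timerfd_settime(timerfd, 0, &timspec, NULL);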
Possible improvements
Dedicate a processor to this process?
Use a nonblocking timerfd so that I can create a tight loop, instead of blocking (tight loop will waste more CPU, but may also be quicker to respond to an alarm)
Using an external embedded device for triggering (can't imagine why this would be better)
Why
I'm currently working on creating a workload generator for a benchmarking engine. The workload generator simulates an arrival rate (X requests / second, etc.) using a Poisson process. From the Poisson process, I can determine the relative times at which requests must be made from the benchmarking engine.
So for instance, at 10 requests a second, we may have requests made at:
t = 0.02, 0.04, 0.05, 0.056, 0.09 seconds
These requests need to be scheduled in advance and then executed. As the number of requests per second increases, the granularity required for scheduling these requests increases (thousands of requests per second requires sub-millisecond accuracy). As a result, I'm trying to figure out how to scale this system further.
You're very close to the limits of what vanilla Linux will offer you, and well past what it can guarantee. Adding the real-time patches to your kernel and tuning for full preemption will give you better guarantees under load. I would also remove any dynamic memory allocation from your time-critical code: malloc and friends can (and will) stall for a not-inconsequential (in a real-time sense) period of time if memory has to be reclaimed from the I/O cache. I would also consider removing swap from that machine to help guarantee performance. Dedicating a processor to your task will help to avoid context-switch delays but, again, it's no guarantee.
I would also suggest that you be careful with that level of sched_priority: you're above various important bits of Linux there, which can lead to very strange effects.
What you gain from building a real-time kernel is more reliable guarantees (i.e. lower maximum latency) on the time between an I/O or timer event being handled by the kernel and control being passed to your app in response. This comes at the price of lower throughput, and you might notice an increase in your best-case latency.
However, the only reason for using OS timers to schedule events with high precision is if you're afraid of burning CPU cycles in a loop while you wait for your next due event. OS timers (especially in MS Windows) are not reliable for fine-grained timing events, and are very dependent on the sort of timing/HPET hardware available in your system.
When I require highly accurate event scheduling, I use a hybrid method. First, I measure the worst case latency - that is, the biggest difference between the time I requested to sleep, and the actual clock time after sleeping. Let's call this difference "D". (You can actually do this on-the-fly during normal running, by tracking "D" every time you sleep, with something like "D = (D*7 + lastD) / 8" to produce a temporal average).
Then never request to sleep beyond "N - D*2", where "N" is the time of the next event. When within "D*2" time of the next event, enter a spin loop and wait for "N" to occur.
This eats a lot more CPU cycles, but depending on the accuracy you require, you might be able to get away with a "sched_yield()" in your spin loop, which is kinder to your system.
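A minimal sketch of this hybrid approach, assuming CLOCK_MONOTONIC and nanosleep() as the sleep primitive. The function names (now_ns, update_latency, wait_until) are hypothetical, and the latency estimate is the running average from the formula above rather than a true worst case.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Running average of the oversleep: D = (D*7 + lastD) / 8. */
int64_t update_latency(int64_t d, int64_t last_d) {
    return (d * 7 + last_d) / 8;
}

/* Wait until deadline_ns ("N"): sleep while more than 2*D away, then spin.
 * *d_ns carries the latency estimate "D" from one call to the next. */
void wait_until(int64_t deadline_ns, int64_t *d_ns) {
    int64_t remaining = deadline_ns - now_ns();
    if (remaining > 2 * *d_ns) {
        int64_t request = remaining - 2 * *d_ns;   /* never sleep past N - 2D */
        struct timespec req;
        req.tv_sec = (time_t)(request / 1000000000LL);
        req.tv_nsec = (long)(request % 1000000000LL);
        int64_t before = now_ns();
        nanosleep(&req, NULL);
        int64_t overslept = (now_ns() - before) - request;
        if (overslept < 0)
            overslept = 0;                         /* sleep was cut short */
        *d_ns = update_latency(*d_ns, overslept);
    }
    while (now_ns() < deadline_ns) {
        /* busy-wait; a sched_yield() could go here if the accuracy allows it */
    }
}
```

A caller would keep one int64_t estimate, seeded with a guess such as 50000 (50 microseconds), and pass its address to every wait_until() call so the estimate adapts while running.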

app engine task queue wait limit

How long can a task sit in the task queue waiting to be processed before something happens? If it's not forever, what are those somethings that might happen?
Can I add a very large number of tasks to a queue that has a very low processing rate and have them be processed over the course of days/weeks/months?
Are tasks ejected from the queue if they're waiting for their turn too long?
Task Queue Quota and Limits says
maximum countdown/ETA for a task: 30 days from the current date and time
I think that's talking about intentionally/programmatically setting an ETA in the future, not how long a task is allowed to wait for its turn.
There's no limit on how many tasks you can have in your queue, other than the amount of storage you have allocated to storing tasks. There's likewise no limit how long they can wait to execute, though as you point out, you can't schedule a task with an ETA more than 30 days in the future.
As far as I know, they last forever. I have had some in there for days. Right now I have some that are 9 days old, although the queue is paused. The only limits are the queue size and count (which are not currently enforced).

Resources