There is a single producer thread and a single consumer thread. Data acquisition in the producer thread is slow: it queries a socket, and the time it takes to produce data for the consumer is significantly longer than the time the consumer needs to process the data and send it out. The problem is that I am updating a display, so I want to spread the updates out so they appear continuous rather than arriving in bursts.
I am using a double buffer right now, but the consumer is waiting too long for the buffers to be swapped because the producer takes so long to produce data. Perhaps I should slice the data into smaller blocks and use a queue instead, so that the producer feeds the consumer a little at a time? Has anyone run into this problem?
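Roughly what I have in mind - a minimal sketch, assuming the data can be split into small blocks (the block representation, queue capacity, and method names are just for illustration):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkedPipeline {
    // Bounded queue: small blocks flow to the consumer as soon as they are produced.
    private final BlockingQueue<byte[]> blocks = new ArrayBlockingQueue<>(16);

    // Called from the producer thread after each small block is read from the socket.
    public void produce(byte[] block) throws InterruptedException {
        blocks.put(block); // blocks if the consumer has fallen behind
    }

    // Consumer thread: take one block at a time, process it, and send it out.
    public void consumeLoop() throws InterruptedException {
        while (!Thread.currentThread().isInterrupted()) {
            byte[] block = blocks.take(); // waits until a block is available
            send(block);
        }
    }

    private void send(byte[] block) { /* process and send the block out */ }
}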
Why not have a thread that updates the screen once a second? The thread can sleep for a second, wake up, check what the producer and the consumer are doing, and update the screen based on their progress. You would get updates every second; if you want them faster or slower, change the timer interval.
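Something along these lines (the progress-checking and screen-update calls are placeholders for however your producer and consumer expose their state):

// Wakes up once a second, checks progress, and repaints the display.
Thread updater = new Thread(() -> {
    while (!Thread.currentThread().isInterrupted()) {
        try {
            Thread.sleep(1000); // the update interval; change this to go faster or slower
        } catch (InterruptedException e) {
            break;
        }
        long produced = producerProgress(); // placeholder: read the producer's progress
        long consumed = consumerProgress(); // placeholder: read the consumer's progress
        updateScreen(produced, consumed);   // placeholder: repaint based on progress
    }
});
updater.setDaemon(true);
updater.start();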
I am going to lock the send rate to the client to a frequency based on the rate of the data requests. I originally thought the producer was going to be much faster than it turned out to be, which is why I structured this as a producer/consumer pair. This is really more of a frame-rate problem, where I need to synchronize the output at a consistent rate.
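For that fixed-rate output I'm picturing something like this sketch (the 100 ms period, the latestFrame holder, and sendToClient are placeholders): the producer just overwrites the latest data, and the output goes out on a steady clock.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class PacedSender {
    // The producer overwrites this with the most recent data it has acquired.
    private final AtomicReference<byte[]> latestFrame = new AtomicReference<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Send at a fixed rate (here every 100 ms), independent of how bursty production is.
        scheduler.scheduleAtFixedRate(() -> {
            byte[] frame = latestFrame.get();
            if (frame != null) {
                sendToClient(frame);
            }
        }, 0, 100, TimeUnit.MILLISECONDS);
    }

    public void onDataProduced(byte[] frame) { latestFrame.set(frame); }

    private void sendToClient(byte[] frame) { /* application-specific output */ }
}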
Related
I have a pipeline where I'm applying transformation rules (from broadcast state) to a stream of events. When I run the broadcast stream and the original stream in parallel without connecting them, performance is really good, but the moment I connect the broadcast, performance drops drastically. How can I achieve better performance? Data passed between operators is in byte[] and the data footprint is small as well.
I've attached snapshots of both scenarios:
1. Top row shows the stream consuming events from Kafka and the bottom row shows rules consumed from another topic. With this setup I could achieve throughput of up to ~20K msg/sec per task manager, processing 12 GB of data in 4 minutes.
2. I've connected the broadcast stream with the data stream for processing in the future. Note that, to measure only the performance of the broadcast, I've made sure no records are consumed in the data stream (top row). On the processing side of the broadcast state, I only store the received messages in MapState. With this setup I can get throughput of up to ~1,000 msg/sec per task manager, processing 12 GB of data in 18 minutes.
You've done more than simply connect the broadcast and keyed streams. Before, each event went through just one network shuffle (the rebalance, hash, and broadcast connections), and now there are four or five shuffles for each event.
Every shuffle is expensive. Try to reduce the number of times you change parallelism or use keyBy.
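For reference, a minimal sketch of connecting a non-keyed event stream to the broadcast rules stream without adding extra keyBy/rebalance steps; the stream variables, the state descriptor, and the rule-key handling here are assumptions rather than details from your job:

// Broadcast state descriptor: ruleId -> serialized rule.
MapStateDescriptor<String, byte[]> rulesDesc = new MapStateDescriptor<>(
        "rules",
        BasicTypeInfo.STRING_TYPE_INFO,
        PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);

BroadcastStream<byte[]> ruleBroadcast = ruleStream.broadcast(rulesDesc);

DataStream<byte[]> transformed = eventStream
        .connect(ruleBroadcast)
        .process(new BroadcastProcessFunction<byte[], byte[], byte[]>() {
            @Override
            public void processElement(byte[] event, ReadOnlyContext ctx,
                                       Collector<byte[]> out) throws Exception {
                // Apply the currently broadcast rules to the event here.
                out.collect(event);
            }

            @Override
            public void processBroadcastElement(byte[] rule, Context ctx,
                                                Collector<byte[]> out) throws Exception {
                // Store/update the rule; deriving the rule key is left as a placeholder.
                ctx.getBroadcastState(rulesDesc).put("ruleId", rule);
            }
        });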
We are using Flink 1.9.1.
We have a source, a process function, and a sink. The application consumes from and produces to Kinesis.
The input rate (produced by a simulator) is 20 events per second. The output rate for the process function shows 14 events per second. The back pressure metric for the source is shown as OK (green). The event count (the number of events sent by the source) and the number of events received by the process function also match, with very little delay.
But these counts do not match the event count pushed by the simulator; they match the 14-per-second rate instead.
Now my question is, does Flink regulate the input rate automatically?
In my case, how is the input rate being controlled down to 14 per second?
If it is not, is there some other metric that I should be looking at that I'm missing?
It's not possible to force a Flink pipeline to consume events at a particular rate. By design, there is limited buffering in the network stack, and the slowest task in the execution graph will dictate the rate at which the pipeline will consume and process events.
The back pressure monitoring (that green OK signal) is not a definitive guide to whether back pressure is occurring. So long as the job is able to make steady forward progress, it probably won't indicate that there's a problem. You could examine some of the network queue metrics to get more insight: e.g., inPoolUsage, outPoolUsage, inputQueueLength. See Flink Network Stack Vol. 2: Monitoring, Metrics, and that Backpressure Thing for a lot more on this topic.
20 events per second seems very slow, so I am a bit surprised that something can't keep up with that rate, but that appears to be what's happening.
I am using an F4 instance (because of memory needs) with automatic scaling to do some background processing. It is run from a task queue. Each invocation takes 40s to 60s to complete. Because of the high memory needs, each instance should only handle one request at a time.
The action that needs to be done is not urgent. If it doesn't get scheduled for 30 minutes that isn't a problem; even 60 minutes is acceptable, and I'd rather make use of that time than spin up more instances. However, if the service gets popular and it starts getting more than 60 requests an hour, I want to spin up more instances to make sure there isn't more than a 60-minute wait.
I am having trouble figuring out how to configure the instance and queue parameters to keep my costs down but be able to scale in that way. My initial thought was something like this:
<queue>
<name>non-urgent-queue</name>
<target>slow-service</target>
<rate>1/m</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>1</max-concurrent-requests>
</queue>
<automatic-scaling>
<min-idle-instances>0</min-idle-instances>
<max-idle-instances>0</max-idle-instances>
<min-pending-latency>20m</min-pending-latency>
<max-pending-latency>1h</max-pending-latency>
<max-concurrent-requests>1</max-concurrent-requests>
</automatic-scaling>
First of all those latency settings are invalid, but I can't find documentation on the valid range or units. Can anyone direct me to that info?
Secondly, if I understand the queue settings correctly, this configuration would limit it to 60 invocations an hour getting to the service, even if the task queue had 60+ jobs waiting.
Thanks for your help!
Indeed, throttling at the queue level basically defeats the ability to scale when needed. So you can't use the <rate> in the queue configuration at the values you have right now; you need to use the value matching the maximum rate you're willing to accept (with your max number of instances running simultaneously):
with the max rate of requests that can go through the queue limited to 1/min, you can't scale above 60/h
the <bucket-size> set at 1 means no peaks above the rate can be handled (as soon as one task starts, the token bucket empties)
the <max-concurrent-requests> set at 1 will basically prevent multiple instances from dealing simultaneously with the queued workload. They may be started by the autoscaler because of the request latencies, but they won't be able to help, since only one queued task can be handled at a time.
In the <automatic-scaling> section the <max-concurrent-requests> set to 1 is good - this ensures no instance handles more than 1 request at a time - which is what you want.
The bad news is that the max values for the latencies appear to be 15s. At least when using the app.yaml config for python (but I think it's unlikely for that to differ across language sandboxes):
Error 400: --- begin server output ---
automatic_scaling.min_pending_latency (30s), must be in the range [0.010000s,15.000000s].
--- end server output ---
and
Error 400: --- begin server output ---
automatic_scaling.max_pending_latency (60s), must be in the range [0.010000s,15.000000s].
--- end server output ---
Which probably also explains why your 20m and 1h values aren't accepted - I used 30s and 60s and got the above errors.
This means you won't be able to use the autoscaling parameters to tune such slow-moving processing the way you'd like.
The only alternative I can think of is to have 2 queues:
a fast one feeding just trigger tasks for the slow-service jobs, which your service intercepts and saves in the datastore. This could be handled by some faster service (you don't want these stuck behind a slow-service job execution, as that can cause unnecessary instance launching). Depending on the rest of your implementation, you may be able to replace this queue completely by just storing the job info in the datastore instead of enqueuing tasks in the fast queue.
a slow one for the actual slow-service job execution tasks
You'd also have a cron job executing once a minute, checking how many triggers are pending in the datastore, deciding how much to scale, and enqueuing the corresponding number of slow-service job tasks in the slow queue. The autoscaler would simply bring up the corresponding number of instances (if needed). Low-latency autoscaling configs would be desirable in this case - you've already decided how you want your app to scale.
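A rough sketch of that cron handler's core logic (the "JobTrigger" entity kind, the target URL, and the backlog threshold of 2 are assumptions):

// Runs once a minute from cron: check pending triggers and top up the slow queue.
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
int pendingTriggers = ds.prepare(new Query("JobTrigger"))
                        .countEntities(FetchOptions.Builder.withLimit(1000));

Queue slowQueue = QueueFactory.getQueue("non-urgent-queue");
int alreadyQueued = slowQueue.fetchStatistics().getNumTasks();

// Keep only a couple of jobs enqueued at a time; the pacing/priority logic lives
// here rather than in the queue's <rate> setting.
int toEnqueue = Math.min(pendingTriggers, Math.max(0, 2 - alreadyQueued));
for (int i = 0; i < toEnqueue; i++) {
    slowQueue.add(TaskOptions.Builder.withUrl("/tasks/slow-service"));
}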
This is how I ended up doing it. I use a slow queue and a fast queue configured like this:
<queue>
<name>slow-queue</name>
<target>pdf-service</target>
<rate>2/m</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>1</max-concurrent-requests>
</queue>
<queue>
<name>fast-queue</name>
<target>pdf-service</target>
<rate>10/m</rate>
<bucket-size>1</bucket-size>
<max-concurrent-requests>5</max-concurrent-requests>
</queue>
The max-concurrent-requests in the slow queue ensures only one task will run at a time, so there will only be one instance active.
Before I post to the slow queue I check to see how many items are already on the queue. The result may not be totally reliable, but for my purposes it is sufficient. In Java:
// Check the (approximate) number of tasks waiting on the slow queue.
Queue slowQueue = QueueFactory.getQueue("slow-queue");
QueueStatistics queueStats = slowQueue.fetchStatistics();
if (queueStats.getNumTasks() < 30) {
    // post to slow queue
} else {
    // post to fast queue
}
So when my slow queue gets too full, I post to the fast queue which allows concurrent requests.
The instance is configured like this:
<automatic-scaling>
<min-idle-instances>0</min-idle-instances>
<max-idle-instances>automatic</max-idle-instances>
<min-pending-latency>15s</min-pending-latency>
<max-pending-latency>15s</max-pending-latency>
<max-concurrent-requests>1</max-concurrent-requests>
</automatic-scaling>
So it will create new instances as slowly as possible (15s is the max latency) and make sure only one process runs on an instance at a time.
With this configuration I'll have a max of 6 instances at a time but that should do about 500/hr. I could increase the rate and concurrent requests to do more.
The negative of this solution is an element of unfairness. Under heavy load, some tasks will be stuck in the slow queue while others will get processed more quickly in the fast queue.
Because of that, I have decreased the max items on the slow queue to 13 so the unfairness won't be so extreme, maybe a 10 minute wait for jobs that go to the slow queue when it is full.
I'm trying to solve the following problem:
I have a series of "tasks" which I would like to execute
I have a fixed number of workers to execute these tasks (since they call an external API using urlfetch, and the number of parallel calls to this API is limited)
I would like these "tasks" to be executed "as soon as possible" (i.e. with minimum latency)
These tasks are parts of larger tasks and can be categorized based on the size of the original task (i.e. a small original task might generate 1 to 100 tasks, a medium one 100 to 1000, and a large one over 1000).
The tricky part: I would like to do all this efficiently (i.e. with minimum latency and using as many parallel API calls as possible - without going over the limit), but at the same time prevent the large number of tasks generated from "large" original tasks from delaying the tasks generated from "small" original tasks.
To put it another way: I would like to have a "priority" assigned to each task, with "small" tasks having a higher priority, and thus prevent the "small" tasks from being starved by "large" ones.
Some searching around doesn't seem to indicate that anything pre-made is available, so I came up with the following:
create three push queues: tasks-small, tasks-medium, tasks-large
set a maximum number of concurrent requests for each such that the total is the maximum number of concurrent API calls (for example, if the max number of concurrent API calls is 200, I could set up tasks-small to have a max_concurrent_requests of 30, tasks-medium 60 and tasks-large 100)
when enqueueing a task, check the number of pending tasks in each queue (using something like the QueueStatistics class) and, if another queue is not 100% utilized, enqueue the task there; otherwise just enqueue the task on the queue corresponding to its size.
For example, if we have task T1 which is part of a small task, first check if tasks-small has free "slots" and enqueue it there. Otherwise check tasks-medium and tasks-large. If none of them have free slots, enqueue it on tasks-small anyway and it will be processed after the tasks added before it are processed (note: this is not optimal because if "slots" free up on the other queues, they still won't process pending tasks from the tasks-small queue)
Another option would be to use a PULL queue and have a central "coordinator" pull from that queue based on priorities and dispatch the tasks; however, that seems to add a little more latency.
Still, all of this seems a little hackish and I'm wondering if there are better alternatives out there.
EDIT: after some thought and feedback I'm thinking of using PULL queues after all, in the following way:
have two PULL queues (medium-tasks and large-tasks)
have a dispatcher (PUSH) queue with a concurrency of 1 (so that only one dispatch task runs at any time). Dispatch tasks are created in multiple ways:
by a once-a-minute cron job
after adding a medium/large task to the push queues
after a worker task finishes
have a worker (PUSH) queue with a concurrency equal to the number of workers
And the workflow:
small tasks are added directly to the worker queue
the dispatcher task, whenever it is triggered, does the following:
estimates the number of free workers (by looking at the number of running tasks in the worker queue)
for any "free" slots it takes a task from the medium/large tasks PULL queue and enqueues it on a worker (or more precisely: adds it to the worker PUSH queue which will result in it being executed - eventually - on a worker).
I'll report back once this is implemented and at least moderately tested.
The small/medium/large original task queues won't help much by themselves - once the original tasks are enqueued they'll keep spawning worker tasks, potentially even breaking the worker task queue size limit. So you need to pace/control the enqueuing of the original tasks.
I'd keep track of the "todo" original tasks in the datastore/GCS and enqueue them only when the respective queue size is sufficiently low (1 or maybe 2 pending jobs). This could be done from a recurring task, a cron job, or a deferred task (depending on the rate at which you need to perform the original-task enqueueing), which would implement the desired pacing and priority logic just like a push-queue dispatcher, but without the extra latency you mentioned.
I have not used pull queues, but from my understanding they could suit your use case very well. You could define 3 pull queues and have X workers all pulling tasks from them, first trying the "small" queue and then moving on to "medium" if it is empty (where X is your maximum concurrency). You should not need a central dispatcher.
However, then you would be left to pay for X workers even when there are no tasks (or X / threadsPerMachine?), or scale them down & up yourself.
So, here is another thought: make a single push queue with the correct maximum concurrency. When you receive a new task, push its info to the datastore and queue up a generic job. That generic job will then consult the datastore looking for tasks in priority order, executing the first one it finds. This way a short task will still be executed by the next job, even if that job was already enqueued from a large task.
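A small sketch of what that generic job could do, assuming a hypothetical "PendingTask" datastore kind with a numeric "priority" property:

// Generic worker job: pick up the highest-priority pending task and execute it.
DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
Query q = new Query("PendingTask")
        .addSort("priority", Query.SortDirection.ASCENDING); // smaller value = more urgent
List<Entity> next = ds.prepare(q).asList(FetchOptions.Builder.withLimit(1));

if (!next.isEmpty()) {
    Entity task = next.get(0);
    ds.delete(task.getKey()); // claim it (a transaction would make the claim safer)
    execute(task);            // placeholder for the actual API call / work
}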
EDIT: I now migrated to a simpler solution, similar to what #eric-simonton described:
I have multiple PULL queues, one for each priority
Many workers pull on an endpoint (handler)
The handler generates a random number and does a simple "if less than 0.6, try first the small queue and then the large queue, else vice-versa (large then small)"
If the workers get no tasks or an error, they do semi-random exponential backoff up to a maximum timeout (i.e. they start pulling every second and approximately double the timeout after each empty pull, up to 30 seconds) - a sketch of this loop follows below
This final point is needed - amongst other reasons - because the number of pulls / second from a PULL queue is limited to 10k/s: https://cloud.google.com/appengine/docs/python/taskqueue/overview-pull#Python_Leasing_tasks
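For illustration, a sketch of that pull-and-back-off loop (the queue names, lease length, and jitter are assumptions, and process() stands in for the actual work plus deleteTask()):

void pullLoop() throws InterruptedException {
    Random rnd = new Random();
    long sleepMs = 1000; // start by pulling roughly every second
    while (true) {
        // ~60% of the time try the small queue first, otherwise the large queue first.
        String first = rnd.nextDouble() < 0.6 ? "small-tasks" : "large-tasks";
        String second = first.equals("small-tasks") ? "large-tasks" : "small-tasks";

        List<TaskHandle> leased = QueueFactory.getQueue(first)
                .leaseTasks(300, TimeUnit.SECONDS, 1);
        if (leased.isEmpty()) {
            leased = QueueFactory.getQueue(second).leaseTasks(300, TimeUnit.SECONDS, 1);
        }

        if (leased.isEmpty()) {
            Thread.sleep(sleepMs + rnd.nextInt(250)); // semi-random: add a little jitter
            sleepMs = Math.min(sleepMs * 2, 30000);   // double the timeout, capped at 30 s
        } else {
            process(leased.get(0)); // placeholder for the actual work
            sleepMs = 1000;         // reset after a successful pull
        }
    }
}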
I implemented the solution described in the UPDATE:
two PULL queues (medium-tasks and large-tasks)
a dispatcher (PUSH) queue with a concurrency of 1
a worker (PUSH) queue with a concurrency equal to the number of workers
See the question for more details. Some notes:
there is some delay in task visibility due to eventual consistency (i.e. the dispatcher tasks sometimes don't see the tasks from the pull queue even if they are inserted together) - I worked around this by adding a countdown of 5 seconds to the dispatcher tasks and also adding a cron job that adds a dispatcher task every minute (so if the original dispatcher task doesn't "see" the task from the pull queue, another will come along later)
made sure to name every task to eliminate the possibility of double-dispatching them
you can't lease 0 items from the PULL queues :-)
batch operations have an upper limit, so you have to do your own batching over the batch taskqueue calls
there doesn't seem to be a way to programmatically get the "maximum parallelism" value for a queue, so I had to hard-code that in the dispatcher (to calculate how many more tasks it can schedule)
don't add dispatcher tasks if there are already some (at least 10) in the queue
This sounds more like a design issue to me.
Scenario -
I have an embedded system with multiple threads -
One of the threads is xxx - a networking protocol that informs the neighbouring router - the producer.
Another thread is xxx-TE - a traffic-engineering flavour of the xxx protocol - the consumer.
They communicate with each other via a message queue: the producer puts data into the xxx-TE queue for the xxx-TE thread.
Problem -
When we have a lot of nodes - in simple words, a lot of routing information from xxx - messages put into the xxx-TE queue are lost.
Solution -
Is the following solution correct?
Should we increase the queue depth so that messages are not lost?
[Symptoms] - We see errors while pushing messages into the message queue.
Probably not.
Generally speaking, message queues should stay empty, or close to empty, as much of the time as possible. If your queue is not usually empty, you need to improve the speed at which messages are being processed.
Increasing the size of the queue is generally not a solution; if the queue is being filled faster than it is being emptied, it will always end up full in the end; increasing the size will only make it take slightly longer to fill up.
(An exception is if messages are being produced in an extremely "bursty" pattern. If this is the case, increasing the queue size may help to buffer against those bursts. However, a large burst, or several bursts back to back, may put you back in the same situation.)
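To make the burst-buffering point concrete, here is a tiny sketch (in Java purely for illustration; the queue depth and message type are made up): size the queue for the worst expected burst and have the producer detect a full queue explicitly, rather than relying on a deeper queue to absorb a sustained imbalance.

// Sized for the largest expected burst, not for a producer that is faster on average.
// RouteUpdate stands in for whatever message type xxx sends to xxx-TE.
java.util.concurrent.BlockingQueue<RouteUpdate> teQueue =
        new java.util.concurrent.ArrayBlockingQueue<>(4096);

// In the producer (xxx) thread: handle overflow explicitly instead of losing messages.
if (!teQueue.offer(update)) {
    // Queue full: the consumer (xxx-TE) is not keeping up on average.
    // Options: block with put(), coalesce/drop lower-priority updates, or raise an alarm.
}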