Resource overallocation to slots in Flink - apache-flink

Regarding the features in Flink that allow optimizing resource usage in the cluster (plus latency, throughput, ...), i.e. slot sharing, task chaining, async I/O and dynamic scaling, I would like to ask the following questions (all in the stream processing context):
In which cases would someone be interested in having more slots in a task manager than CPU cores?
In which cases should we prefer to split a pipeline of tasks over multiple slots (disable slot sharing), instead of increasing the parallelism, in order for an application to keep up with the incoming data rates?
Is it possible that, even when using all the features above, the resources reserved for a slot are higher than what all the tasks in the slot require, so that resources are reserved for a slot but not actually used? Can such problems appear when tasks in an application have different latencies (or different parallelisms)? Or even when we perform multiple aggregations (that cannot be optimised using folds or reduces) on the same window?
Thanks in advance.

Usually it is recommended to reserve at least one CPU core for each slot. One reason to configure more slots than cores is if you execute blocking operations in your operators; that way you can keep all of your cores busy.
If you observe that your application cannot keep up with the incoming data rate, then it is usually best to increase the parallelism (given that the bottleneck is not an operator with parallelism 1 and that your data has enough distinct key values).
If you have multiple compute-intensive operators in one pipeline (maybe even chained) and fewer cores per slot than such operators, then it can make sense to split up the pipeline so that these operators run concurrently in separate slots.
Theoretically, it can be the case that you assign more resources to a slot than are actually needed, e.g. a single operator in each slot but multiple cores assigned to it. Also, when operators have different parallelisms, some slots might get more sub-tasks assigned than others. One thing you can always do is monitor the execution of your job to detect under- and over-provisioning.
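For illustration, a minimal DataStream sketch (MySource, MySink, CheapParser and ExpensiveEnricher are hypothetical user classes, and the numbers are arbitrary) that gives the heavy operator its own slot sharing group and a higher parallelism; the number of slots per task manager itself is controlled by the taskmanager.numberOfTaskSlots setting:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4);                         // default parallelism of the job

env.addSource(new MySource())                  // placeholder source
   .map(new CheapParser())                     // cheap work, stays in the "default" slot sharing group
   .keyBy(value -> value)
   .map(new ExpensiveEnricher())               // the heavy operator
   .slotSharingGroup("heavy")                  // give it its own slots instead of sharing with the rest
   .setParallelism(8)                          // and more sub-tasks so it keeps up with the input rate
   .addSink(new MySink());                     // placeholder sink

env.execute("slot sharing example");

Sub-tasks in the "heavy" group then no longer share slots with the rest of the pipeline, so the two groups can be sized independently.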

Related

Does a "rescale()" operation cause serialization?

If I call a rescale() operation in Flink, I assume that there is NO serialization/deserialization (since the data is not crossing nodes), right? Further, is it correct to assume that objects are not copied/deep copied when rescale() is called?
I ask because I'm passing some large objects, 99% of which are common between multiple threads, so it would be a tremendous RAM waste if the objects were recopied in each thread after a rescale(). Instead, all the different threads should point to the same single object in the java heap for that node.
(Of course, if I call a rebalance, I would expect that there would be ONE serialization of the common objects to the other nodes, even if there are dozens of threads on each of the other nodes? That is, on the other nodes, there should only be 1 copy of a common object that all the threads for that node can share, right?)
Based on the rescale() documentation, there will be network traffic (and thus serialization/deserialization), just not as much as a rebalance(). But as several Flink committers have noted, data skew can make the reduction in network traffic insignificant compared to the cost of unbalanced data, which is why rebalance() is the default action when the stream topology changes.
Also, if you're passing around a lot of common data, then maybe look at using a broadcast stream to more efficiently share that across nodes.
Finally, it's conceptually easier to think about sub-tasks than about threads. Each parallel instance of an operator runs as a sub-task; on one Task Manager those sub-tasks do run as threads, but the operator instances are separate objects, which means you don't have to worry about multi-threading at the operator level (unless you use class variables, which is usually a Bad Idea).
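To make the options concrete, here is a rough DataStream sketch (source, largeObjectStream, Enricher, LargeObject and MyBroadcastFunction are placeholders) contrasting rescale(), rebalance() and a broadcast stream for the shared objects:

// rescale(): redistributes only among a local subset of downstream sub-tasks,
// so it causes less network traffic than rebalance(), but evens out skew less well.
source.rescale().map(new Enricher());

// rebalance(): round-robins records across all downstream sub-tasks.
source.rebalance().map(new Enricher());

// Broadcast the large, mostly-common objects once as their own stream instead of
// attaching them to every record; every parallel instance of the downstream
// operator receives them and keeps them in broadcast state.
MapStateDescriptor<String, LargeObject> descriptor =
        new MapStateDescriptor<>("shared-objects", String.class, LargeObject.class);
BroadcastStream<LargeObject> shared = largeObjectStream.broadcast(descriptor);
source.connect(shared).process(new MyBroadcastFunction());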

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a first overview of how threads are managed by the system. Nowadays all processors are actually multi-core and run multiple threads per core, but for the sake of simplicity let's first imagine a single-core processor with a single thread. This is physically limited to performing only a single task at a time, but we are still capable of running multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one and the other, giving the illusion of multitasking. This process of changing from one task to the other is called a context switch.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, the data can be saved in registers, cache, RAM, etc. As the technology has advanced, faster and faster solutions have been found. When the task is resumed, the saved data is fetched and the task continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As previously said, switching between one task and another requires a context switch. To perform it, some data has to be stored and fetched, but these operations are pure overhead for your computation and give you no direct advantage. So having too many tasks requires a lot of context switching, which means a lot of computational time wasted, and in the end your program might run slower than with fewer tasks.
Also, since you tagged this question with pthreads, you should also check that the code is compiled to run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will run on multiple hardware cores!
In your particular application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computing. This sort of task runs great on a GPU, since the operations have no data dependencies and the concurrency is handled in hardware (modern GPUs have thousands of compute cores!).
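To make the "it depends" concrete, here is a small, self-contained Java sketch (the question is tagged pthreads, but the effect is language-independent; the grid size, pass count and thread counts are arbitrary) that runs the same embarrassingly parallel grid-smoothing work with different thread counts. On a 4-core machine the timings typically stop improving, and may even get worse, once the thread count exceeds the number of cores:

import java.util.concurrent.*;

public class ThreadScaling {
    static final int ROWS = 2000, COLS = 2000, PASSES = 50;

    public static void main(String[] args) throws Exception {
        double[][] grid = new double[ROWS][COLS];
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                grid[r][c] = (r * 31 + c) % 97;               // arbitrary initial values

        for (int threads : new int[] {1, 2, 4, 8, 15, 50}) {
            long start = System.nanoTime();
            smooth(grid, threads);
            System.out.printf("%2d threads: %d ms%n", threads,
                    (System.nanoTime() - start) / 1_000_000);
        }
    }

    // Repeatedly replace each cell with the average of itself and its row neighbours,
    // splitting the rows evenly across the worker threads.
    static void smooth(double[][] grid, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (int pass = 0; pass < PASSES; pass++) {
                CountDownLatch done = new CountDownLatch(threads);
                int chunk = (ROWS + threads - 1) / threads;
                for (int t = 0; t < threads; t++) {
                    int from = t * chunk, to = Math.min(ROWS, from + chunk);
                    pool.execute(() -> {
                        for (int r = from; r < to; r++)
                            for (int c = 1; c < COLS - 1; c++)
                                grid[r][c] = (grid[r][c - 1] + grid[r][c] + grid[r][c + 1]) / 3.0;
                        done.countDown();
                    });
                }
                done.await();                                  // wait for every pass to finish
            }
        } finally {
            pool.shutdown();
        }
    }
}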

About global memory access method

In general, for a GPU, which access pattern is faster for reading data from a contiguous block of global memory?
(1) a for-loop in a single thread (or a very small number of threads) that reads the data from global memory;
(2) a lot of threads, possibly from different blocks, reading the data from global memory concurrently.
e.g.
if (threadIdx.x == 0)
{
    for (int i = 0; i < 1000; ++i)
        buffer[i] = data[i];   // data is stored in global memory
}
OR:
buffer[threadIdx.x] = data[threadIdx.x];   // there are 1000 threads in this thread block
In short, the second should generally be faster. The justification follows:
There are two kinds of parallelism: Thread-Level Parallelism (TLP) and Instruction-Level Parallelism (ILP). Your first code (the loop) targets ILP and the second exploits TLP.
When TLP is exploited, many memory requests are issued concurrently, free of any control-flow dependencies. In this situation, the hardware can take advantage of locality among threads to reduce the total number of memory transactions (where possible). Moreover, the hardware can serve the concurrent requests in parallel through L2-cache bank parallelism, memory-controller parallelism, DRAM bank parallelism, and many other levels of parallelism.
In the ILP case, however, the control dependency limits the number of concurrently issued memory requests. This is true even with loop unrolling (hardware resources like the scoreboard size and the instruction window size limit the total number of outstanding instructions). So many of the memory requests are actually serialized unnecessarily. Moreover, the hardware's capability for memory access coalescing is not exploited.
Solution one would be faster, because 1000 threads are effectively 1000 tasks which share one address space, and the process scheduling of the OS costs a lot of CPU resources, so the CPU keeps getting interrupted.
If you do the work in one task, the CPU always processes just that one task.
A multi-core CPU can handle it better, but 1000 threads is too many.

How can I evaluate performances of a lockless queue?

I have implemented a lockless queue using the hazard pointer methodology explained in http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf using GCC CAS instructions for the implementation and pthread local storage for thread local structures.
I'm now trying to evaluate the performance of the code I have written, in particular I'm trying to do a comparison between this implementation and the one that uses locks (pthread mutexes) to protect the queue.
I'm asking this question here because I tried comparing it with the "locked" queue and found that the locked one performs better than the lockless implementation. The only test I have tried is creating 4 threads on a 4-core x86_64 machine doing 10,000,000 random operations on the queue, and the locked version is significantly faster than the lockless one.
I want to know if you can suggest an approach to follow, i.e. what kind of operations I should test on the queue and what kind of tools I can use to see where my lockless code is wasting its time.
I also want to understand whether it is possible that the performance is worse for the lockless queue just because 4 threads are not enough to see a major improvement...
Thanks
First point: lock-free programming doesn't necessarily improve speed. Lock-free programming (when done correctly) guarantees forward progress. When you use locks, it's possible for one thread to crash (e.g., go into an infinite loop) while holding a mutex. When/if that happens, no other thread waiting on that mutex can make any more progress. If that mutex is central to normal operation, you may easily have to restart the entire process before any more work can be done at all. With lock-free programming, no such circumstance can arise. Other threads can make forward progress, regardless of what happens in any one thread¹.
That said, yes, one of the things you hope for is often better performance -- but to see it, you'll probably need more than four threads. Somewhere in the range of dozens to hundreds of threads would give your lock-free code a much better chance of showing improved performance over a lock-based queue. To really do a lot of good, however, you not only need more threads, but more cores as well -- at least based on what I've seen so far, with four cores and well-written code, there's unlikely to be enough contention over a lock for lock-free programming to show much (if any) performance benefit.
Bottom line: More threads (at least a couple dozen) will improve the chances of the lock-free queue showing a performance benefit, but with only four cores, it won't be terribly surprising if the lock-based queue still keeps up. If you add enough threads and cores, it becomes almost inevitable that the lock-free version will win. The exact number of threads and cores necessary is hard to predict, but you should be thinking in terms of dozens at a minimum.
¹ At least with respect to something like a mutex. Something like a fork-bomb that just ate all the system resources might be able to deprive the other threads of enough resources to get anything done -- but some care with things like quotas can usually prevent that as well.
The question is really what workloads you are optimizing for. If contention is rare, lock structures on a modern OS are probably not too bad. They mainly use CAS instructions under the hood as long as they stay on the fast path. Since these are quite optimized, it will be difficult to beat them with your own code.
Your own implementation can only win substantially in the contended case. Just random operations on the queue (you are not too precise in your question) will probably not show this if the average queue length is much longer than the number of threads hacking on it in parallel. So you must ensure that the queue stays short, perhaps by biasing which random operation is chosen when the queue gets too long or too short. Then I would also load the system with at least twice as many threads as there are cores. This would ensure that wait times (for memory) don't play in favor of the lock version.
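As a concrete starting point, here is a minimal benchmark harness sketch. It is written in Java purely to illustrate the methodology (the question is about a C/pthreads queue; ConcurrentLinkedQueue stands in for the hand-written lock-free queue and a synchronized wrapper stands in for the mutex-protected one); the operation mix is biased so the queue stays short, and the thread count deliberately oversubscribes the cores:

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.*;

public class QueueBench {
    static final int OPS_PER_THREAD = 1_000_000;

    public static void main(String[] args) throws Exception {
        int threads = 2 * Runtime.getRuntime().availableProcessors(); // oversubscribe on purpose
        System.out.println("lock-free: " + run(new ConcurrentLinkedQueue<>(), threads) + " ms");
        System.out.println("locked   : " + run(new LockedQueue<>(), threads) + " ms");
    }

    static long run(Queue<Integer> queue, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                for (int i = 0; i < OPS_PER_THREAD; i++) {
                    // Bias towards dequeue so the queue stays short and contended.
                    if (rnd.nextInt(100) < 45) queue.offer(i);
                    else queue.poll();
                }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    // A deliberately simple mutex-protected queue standing in for the pthread_mutex version.
    static final class LockedQueue<E> extends ArrayDeque<E> {
        @Override public synchronized boolean offer(E e) { return super.offer(e); }
        @Override public synchronized E poll() { return super.poll(); }
    }
}

On a 4-core box the two variants may still come out close; as the other answers point out, the gap usually only opens up as the number of contending threads (and cores) grows.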
The best way, in my opinion, is to identify hotspots in your application with locks by profiling the code. Introduce the lockless mechanism and measure the same workload again. As mentioned already by other posters, there may not be a significant improvement at lower scale (number of threads, application scale, number of cores), but you might see throughput improvements as you scale up the system. This is because deadlock situations have been eliminated and threads are always making forward progress.
Another way of looking at an advantage of lockless schemes is that, to some extent, they decouple system state from application performance, because there is no kernel/scheduler involvement and much of the code stays in userland, except for CAS, which is a hardware instruction.
With locks that are heavily contended, threads block and are only scheduled once locks are obtained, which basically means they are placed at the end of the run queue (for a specific priority level). Inadvertently this ties the application to system state, and the response time of the app now depends on the run queue length.
Just my 2 cents.

Fine grained multithreading - how much should a worker task do?

I'm using the work_pile pattern, so the threads are always running and waiting on a semaphore for new function pointers + data to arrive in a queue. That's what the Apple marketing guys now call Grand Central Dispatch and promote as the best thing since sliced bread.
I just wonder how to find out whether it is useful to split a short task into two even shorter ones. Is there a rule by which I could judge whether it is worth queuing a new object?
Two possible answers:
It depends.
Benchmark it.
I prefer the second one.
Anyway, if two tasks always run one after the other (i.e., sequentially), I suppose there is no gain in splitting them.
The limit on multitasking is how many cores you have and how much of the algorithm is concurrent. Various types of overhead, including locking, can reduce the amount of concurrency, lowering or even reversing the benefit of multitasking. That's why it works best when there are independent, long-running tasks. Having said that, so long as the overhead doesn't swallow the performance gains, it pays to divide even a short task up among cores.
The short answer is that you need to think about resources + workload + benchmarking.
Here are some of the ways things could break down:
Do you have idle threads? Is the workload chunky enough that a thread takes so long to complete that another thread is hanging out waiting for re-assignment (i.e., more threads than work)?
Do you have enough work? Is the overall task completed so quickly that it's not worth thinking about additional threads? Remember that increasing multithreading does increase overhead by a small but nonetheless measurable amount.
Do you have available resources? Do you have more threads to give? Do you have CPU cycles that are sitting idle?
So, in short, I'd say that you need to think before you type. If you already have code that works at all, that's like money in the bank. Is it worth investing more of your time to increase the productivity of that code or would the return on investment be too low (or negative!)?
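"Benchmark it" can be as simple as timing the same piece of work both ways: once submitted as a single queued task and once split into two queued halves, keeping the split only if the measured wall-clock time actually drops. A minimal Java sketch of that comparison (an ExecutorService stands in for the work_pile/semaphore machinery, busyWork is a stand-in for the real task, and all numbers are arbitrary):

import java.util.concurrent.*;

public class SplitOrNot {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        Runnable shortTask = () -> busyWork(200_000);           // the task we consider splitting

        long whole = time(() -> {
            pool.submit(shortTask).get();                       // one queued object
            return null;
        });
        long split = time(() -> {
            Future<?> a = pool.submit(() -> busyWork(100_000)); // first half
            Future<?> b = pool.submit(() -> busyWork(100_000)); // second half
            a.get(); b.get();
            return null;
        });

        System.out.printf("whole: %d us, split: %d us%n", whole, split);
        pool.shutdown();
    }

    static void busyWork(int iterations) {
        double x = 0;
        for (int i = 0; i < iterations; i++) x += Math.sqrt(i); // stand-in for real work
        if (x < 0) System.out.println(x);                        // keep the JIT from removing the loop
    }

    static long time(Callable<Void> body) throws Exception {
        long start = System.nanoTime();
        body.call();
        return (System.nanoTime() - start) / 1_000;
    }
}

For work this small, the measurement is usually dominated by queuing and scheduling overhead (plus JIT warm-up), which is exactly the signal that splitting the task further is not worth it.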
