Does a "rescale()" operation cause serialization? - apache-flink

If I call a rescale() operation in Flink, I assume that there is NO serialization/deserialization (since the data is not crossing nodes), right? Further, is it correct to assume that objects are not copied/deep copied when rescale() is called?
I ask because I'm passing some large objects, 99% of which are common between multiple threads, so it would be a tremendous RAM waste if the objects were recopied in each thread after a rescale(). Instead, all the different threads should point to the same single object in the java heap for that node.
(Of course, if I call a rebalance, I would expect that there would be ONE serialization of the common objects to the other nodes, even if there are dozens of threads on each of the other nodes? That is, on the other nodes, there should only be 1 copy of a common object that all the threads for that node can share, right?)

Based on the rescale() documentation, there will be network traffic (and thus serialization/deserialization), just not as much as a rebalance(). But as several Flink committers have noted, data skew can make the reduction in network traffic insignificant compared to the cost of unbalanced data, which is why rebalance() is the default action when the stream topology changes.
Also, if you're passing around a lot of common data, then maybe look at using a broadcast stream to more efficiently share that across nodes.
Finally, it's conceptually easier to think in terms of sub-tasks rather than threads. Each parallel operator instance runs as a sub-task, which (on one Task Manager) does run as a thread, but the operator instances are separate, which means you don't have to worry about multi-threading at the operator level (unless you use class variables, which is usually a Bad Idea).
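Purely as a sketch of how these partitioning operators are wired up (the sources, ports and transformations below are invented for illustration; only rescale(), rebalance() and broadcast() are the point), the Java DataStream API usage looks roughly like this:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class PartitioningSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            // rescale(): round-robin, but only to the subset of downstream sub-tasks
            // reachable "locally" from each upstream sub-task, so less network traffic
            // than rebalance(). Records are still serialized when they cross sub-task
            // boundaries, even within one Task Manager.
            lines.rescale()
                 .map(new MapFunction<String, String>() {
                     @Override
                     public String map(String value) { return value.trim(); }
                 })
                 .print();

            // rebalance(): full round-robin across ALL downstream sub-tasks (the default
            // when parallelism changes) - evens out skew at the cost of more traffic.
            lines.rebalance().print();

            // broadcast(): every downstream sub-task receives every record - one way to
            // share common reference data with all parallel instances.
            DataStream<String> reference = env.socketTextStream("localhost", 9998);
            reference.broadcast().print();

            env.execute("partitioning sketch");
        }
    }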

Related

Resources associated to an aio_context

The semantics of Linux's asynchronous file IO (AIO) are well described in the man pages of io_setup(2), io_submit(2) and io_getevents(2).
However, without diving into the block IO subsystem, the operational side of the implementation is a little less clear.
An aio_context allocates a queue for sending io_events back to a specific client in user-space. But is there more to it?
Consider a file read sequentially, chunk by chunk. Can requests, especially in Direct IO (DIO), be collated? What if requests for two files are interleaved into one aio_context? What if requests for one file are sent to two different aio_contexts?
How are requests prioritized and scheduled in the above cases, with one or multiple aio_contexts?
Is it possible that requests from two aio_contexts get interleaved at some point? (Causing more seek latency than intended.)
Does the thread or the CPU calling io_submit influence how the request is scheduled? Is the NUMA node containing the target buffer taken into consideration?
More broadly, to which hardware resources (NUMA nodes, CPU cores, physical drives, file systems and files) should aio_contexts be assigned, and at which level of granularity?
Maybe it doesn't really matter and aio_contexts are no more than an abstraction for user-space programs.
I'm asking because I have observed a performance decrease when concurrently reading multiple files, each with its own aio_context, compared to a manual round-robin serialization of chunk requests into a single aio_context.
You can mix requests freely in a single context, and I would do so. Otherwise you have to poll two separate contexts, doubling the number of syscalls.
Requests to a context are passed to the kernel's async IO VFS layer. Whether the requests come from multiple files, multiple contexts, or multiple processes or users, they all end up in the same layer. The VFS layer then sends the requests to the relevant filesystems or block devices, and all the usual collation and so on happens naturally.
Requests to the same file through one or more contexts at the same time are, I think, undefined behavior if they overlap. They could be ordered one way or the other; the later request could be processed first, for example. So you need to write your own synchronization if strict ordering is required, just as with one or more threads doing read/write calls in parallel.
Prioritization and scheduling will depend on the lower layers. AFAIK block devices will reorder requests so they happen in increasing block numbers (the elevator code) to minimize seek times on rotating disks.
Yes, requests from different contexts and normal read/write calls will get interleaved.
I think the requesting process, NUMA placement and such are completely ignored.
Note: when dealing with files, make sure the filesystem supports the Linux async IO hooks, and you might need to use O_DIRECT on open(), with all its consequences.
A simple way to test this is to make lots of requests to a file in one io_submit() call and then check whether they all finish simultaneously. If the filesystem falls back to sync IO, then everything submitted will finish at the same time.

How to choose for multithreading - c

I have to write a client-server program in C where the server can use n threads working simultaneously to handle client requests.
To do this, I use a listener socket that puts the FD of each new connection into a list, and the threads then take FDs from that list when they are free.
I know that I could also use a pipe for communication between threads.
Is the socket the best way? Why or why not?
Sorry for my bad English
To communicate between threads you can use sockets as well as shared memory.
For multithreading there are many libraries available on GitHub; one that I have used is this one:
https://github.com/snikulov/prog_posix_threads/blob/master/workq.c
I tried and tested the same approach you want; it works perfectly!
There's one very nice resource related to socket multiplexing which I think you should stop and read after reading this answer. That resource is entitled The C10K problem, and it details numerous solutions to the problem people faced in the year 2000, of handling 10000 clients.
Of those solutions, multithreading is not the primary one. Indeed, multithreading as an optimisation should be one of your last resorts, as that optimisation will interfere with the instruments you use to diagnose other optimisations.
In general, here is how you should perform optimisations, in order to provide guaranteed justifications:
Use a profiler to determine the most significant bottlenecks (in your single-threaded program).
Perform your optimisation upon one of the more significant bottlenecks.
Use the profiler again, with the same set of data, to verify that your optimisation worked correctly.
You can repeat these steps ad infinitum until you decide the improvements are no longer tangible (meaning, good luck observing the differences between before and after). Following these steps will provide you with data you can show your employer, if he/she asks you what you've been doing for the last hour, so make sure you save the output of your profiler at each iteration.
Optimisations are per-machine; what this means is that an optimisation for your machine might actually be slower on another machine. For example, you may use a buffer of 4096 bytes for your machine, while the cache lines for another machine might indicate that 512 bytes is a better idea.
Hence, ideally, we should design programs and modules in such a way that their resources are minimal and can easily be scaled up, substituted and/or otherwise adjusted for other machines. This can be difficult, as it means in the buffer example above you might start off with a buffer of one byte; you'd most likely need to study finite state machines to achieve that, and using buffers of one byte might not always be technically feasible (i.e. when dealing with fields that are guaranteed to be a certain width, you should use that width as your minimum limit and scale up from there). The reward is code that is ultra-portable and ultra-optimisable in all situations.
Keep in mind that extra threads use extra resources; we tend to assume that the stack space reserved for a thread can grow to 1MB, so 10000 sockets occupying 10000 threads (in a thread-per-socket model) would occupy about 10GB of memory! Yikes! The minimal resources method suggests that we should start off with one thread, and scale up from there, using a multithreading profiler to measure performance like in the three steps above.
I think you'll find, though, that for anything purely socket-driven, you likely won't need more than one thread, even for 10000 clients, if you study the C10K problem or use some library which has been engineered based on those findings (see your comments for one such suggestion). We're not talking about masses of number crunching, here; we're talking about socket operations, which the kernel likely processes using a single core, and so you can likely match that single core with a single thread, and avoid any context switching or thread synchronisation troubles/overheads incurred by multithreading.
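To make the "one thread, many sockets" idea concrete: the question is about C, where you would reach for poll/epoll, but the same single-threaded multiplexing pattern is sketched below in Java NIO only because the other examples on this page use Java. The class name and port number are invented for the example.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;

    // One thread, many clients: the Selector reports which sockets are ready,
    // so we never block on any single connection (the moral equivalent of epoll).
    public class SingleThreadEchoServer {
        public static void main(String[] args) throws IOException {
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(9000));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            ByteBuffer buffer = ByteBuffer.allocate(4096);
            while (true) {
                selector.select();                       // block until something is ready
                Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
                while (keys.hasNext()) {
                    SelectionKey key = keys.next();
                    keys.remove();
                    if (key.isAcceptable()) {            // new client connection
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {       // data from an existing client
                        SocketChannel client = (SocketChannel) key.channel();
                        buffer.clear();
                        int n = client.read(buffer);
                        if (n == -1) { client.close(); continue; }
                        buffer.flip();
                        client.write(buffer);            // echo back
                    }
                }
            }
        }
    }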

Resource overallocation to slots in Flink

Regarding the features in Flink that allow optimizing resource usage in the cluster (plus latency, throughput, ...), i.e. slot sharing, task chaining, async I/O and dynamic scaling, I would like to ask the following questions (all in the stream processing context):
In which cases would someone be interested in having the number of slots in a task manager higher than the number of cpu cores?
In which cases should we prefer to split a pipeline of tasks over multiple slots (disabling slot sharing), instead of increasing the parallelism, in order for an application to keep up with the incoming data rates?
Is it possible that even when using all the features above, the resources reserved for a slot may be higher than the amount of resources that all the tasks in the slot require, thus causing us to have resources that are reserved for a slot, but not being used? Is it possible that such problems appear when we have tasks in applications with different latencies (or different parallelisms)? Or even when we are performing multiple aggregations (that cannot be optimised using folds or reduces) on the same window?
Thanks in advance.
Usually, it is recommended to reserve for each slot at least one CPU core. One reason why you would want to reserve more slots than cores is that you execute blocking operations in your operators. That way you can keep all of your cores busy.
If you observe that your application cannot keep up with the incoming data rate, then it is usually best to increase the parallelism (given that the bottleneck is not an operator with parallelism 1 and that your data has enough key values).
If you have multiple compute intensive operators in one pipeline (maybe even chained) and you have fewer cores than these operators per slot, then it might make sense to split up the pipeline. That way the computation of these operators can be better done concurrently.
Theoretically, it can be the case that you assign more resources to a slot than are actually needed, e.g. you have a single operator in each slot but multiple cores assigned to it. Also, in the case of different operator parallelisms, some slots might get more sub-tasks assigned than others. One thing you can always do is monitor the execution of your job to detect under- and over-provisioning.
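As a rough sketch of the knobs discussed above (the operator, group names and transformation are invented for illustration), per-operator parallelism and slot sharing groups look like this in the Java API:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class SlotSharingSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4);                              // default parallelism for the job

            DataStream<String> input = env.socketTextStream("localhost", 9999);

            // A compute-intensive operator: give it its own slot sharing group so its
            // sub-tasks do not share slots (and thus cores) with the rest of the
            // pipeline, and raise its parallelism independently.
            DataStream<String> heavy = input
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String value) { return expensiveTransform(value); }
                })
                .slotSharingGroup("heavy")
                .setParallelism(8);

            // Downstream operators are placed back into the "default" slot sharing group.
            heavy.print().slotSharingGroup("default");

            env.execute("slot sharing sketch");
        }

        // Hypothetical CPU-heavy transformation, just for the sketch.
        private static String expensiveTransform(String value) {
            return value.toUpperCase();
        }
    }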

Using many mutex locks

I have a large tree structure on which several threads are working at the same time. Ideally, I would like to have an individual mutex lock for each cell.
I looked at the definition of pthread_mutex_t in bits/pthreadtypes.h and it is fairly short, so the memory usage should not be an issue in my case.
However, is there any performance penalty when using many (let's say a few thousand) different pthread_mutex_ts for only 8 threads?
If you are locking and unlocking very frequently, there can be a penalty, since obtaining and releasing locks does take some time, and can take a fair amount of time if the locks are contended.
When using many locks in a structure like this, you will have to be very specific about what each lock actually locks, and make sure you are careful of AB-BA deadlocks. For example, if you are changing the tree's structure during a locking operation, you will need to lock all the nodes that will be changed, in a consistent order, and make sure that threads working on descendants do not become confused.
If you have a very large number of locks, spread out across memory, caching issues could cause performance problems, depending on the architecture, as locking operations will generally invalidate at least some part of the cache.
Your best bet is probably to implement a simple locking structure, then profile it, then refine it to improve performance, if necessary. I'm not sure what you're doing with the tree, but a good place to start might be a single reader-writer lock for the whole tree, if you expect to read much more than you update.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
-- Donald Knuth
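In pthreads the corresponding primitive for the suggestion above would be pthread_rwlock_t; purely as an illustration of that "one reader-writer lock for the whole tree" starting point, here is a minimal Java sketch (the bare binary search tree is invented for the example, and the coarse lock is the point, not the tree itself):

    import java.util.Comparator;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Coarse-grained locking: one reader-writer lock guards the whole tree.
    // Many readers may traverse concurrently; writers get exclusive access.
    // Profile first; move to finer-grained (per-node) locks only if this is a bottleneck.
    public class LockedTree<T> {
        private static final class Node<T> {
            T value;
            Node<T> left, right;
            Node(T value) { this.value = value; }
        }

        private final ReadWriteLock lock = new ReentrantReadWriteLock();
        private Node<T> root;

        public boolean contains(T value, Comparator<T> cmp) {
            lock.readLock().lock();                    // shared: many readers at once
            try {
                Node<T> cur = root;
                while (cur != null) {
                    int c = cmp.compare(value, cur.value);
                    if (c == 0) return true;
                    cur = (c < 0) ? cur.left : cur.right;
                }
                return false;
            } finally {
                lock.readLock().unlock();
            }
        }

        public void insert(T value, Comparator<T> cmp) {
            lock.writeLock().lock();                   // exclusive: structural change
            try {
                root = insert(root, value, cmp);
            } finally {
                lock.writeLock().unlock();
            }
        }

        private Node<T> insert(Node<T> node, T value, Comparator<T> cmp) {
            if (node == null) return new Node<>(value);
            int c = cmp.compare(value, node.value);
            if (c < 0) node.left = insert(node.left, value, cmp);
            else if (c > 0) node.right = insert(node.right, value, cmp);
            return node;
        }
    }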
Your locking/access patterns need to be stated in order to properly evaluate this. If each thread only holds one or a few locks at a time, and the probability that two or more threads want the same lock at the same time is low (a random access pattern, or 8 runners at different positions on a circular track running at roughly the same speed, or other more complicated situations), then you will mostly avoid the worst case where a thread has to sleep to get a lock (or, in some cases, has to get the OS involved to decide who wins), because you have so few threads and so many locks.
If each thread might want hundreds or thousands of locks at any one time, then things will start to change.
I won't touch deadlock avoidance because I don't know anything about the container you are using, but you need to be aware of the need to avoid deadlocks.

Is lock free multithreaded programming making anything easier?

I have only read a little about this topic, but it seems that the only benefit is to get around contention problems; it will not have any important effect on the deadlock problem, since the code that is lock-free is so small and fundamental (FIFOs, LIFOs, hashes) that deadlock was never a problem there.
So it's all about performance - is this right?
Lock-free programming is (as far as I can see) always about performance, otherwise using a lock is in most cases much simpler, and therefore preferable.
Note however that with lock-free programming you can end up trading deadlock for live-lock, which is a lot harder to diagnose since no tools that I know of are designed to diagnose it (although I could be wrong there).
I'd say, only go down the path of lock-free if you have to; that is, you have a scenario where you have a heavily contended lock that is hurting your performance. (If it ain't broke, don't fix it).
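To make "lock-free" concrete, here is a minimal sketch of a classic Treiber stack built on a compare-and-set retry loop (shown in Java with AtomicReference; the class is invented for the example). No thread can be blocked by a suspended lock holder, but a thread may retry indefinitely under heavy contention, which is exactly the live-lock/starvation risk mentioned above.

    import java.util.concurrent.atomic.AtomicReference;

    // Minimal Treiber (lock-free) stack: push/pop retry with compareAndSet instead of
    // taking a lock. Progress is guaranteed for the system as a whole (lock-free),
    // but an individual thread may keep losing the CAS race (not wait-free).
    public class LockFreeStack<T> {
        private static final class Node<T> {
            final T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> head = new AtomicReference<>();

        public void push(T value) {
            Node<T> node = new Node<>(value);
            while (true) {
                Node<T> current = head.get();
                node.next = current;
                if (head.compareAndSet(current, node)) return;   // CAS succeeded
                // else: another thread changed head first - retry
            }
        }

        public T pop() {
            while (true) {
                Node<T> current = head.get();
                if (current == null) return null;                // empty stack
                if (head.compareAndSet(current, current.next)) return current.value;
                // else: lost the race - retry
            }
        }
    }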
Couple of issues.
We will soon be facing desktop systems with 64, 128 and 256 cores. Parallelism in this domain is unlike our current experience of 2, 4 or 8 cores; algorithms which run successfully on such small systems will run slower on highly parallel systems due to contention.
In this sense, lock-free is important because it contributes strongly to solving scalability.
There are also some very specific areas where lock-free is extremely convenient, such as the Windows kernel, where there are modes of execution where sleeps of any kind (such as waits) are forbidden, which obviously is very limiting with regard to data structures, but where lock-free provides a good solution.
Also, lock-free data structures often do not have failure modes; they cannot actually fail, where lock-based data structures can of course fail to obtain their locks. Not having to worry about failures simplifies code.
I've written a library of lock free data structures which I'll be releasing soon. I think if a developer can get hold of a well-proven API, then he can just use it - doesn't matter if it's lock-free or not, he doesn't need to worry about the complexity in the underlying implementation - and that's the way to go.
It's also about scalability. In order to get performance gains these days, you'll have to parallelise the problems you're working on so you can scale them across multiple cores - the more, the merrier.
The traditional way of doing this is by locking data structures that require parallel access, but the more threads you can run truly in parallel, the bigger a bottleneck this becomes.
So yes, it is about performance...
For preemptive threading, threads suspended while holding a lock can block threads that would otherwise be making forward progress. Lock-free doesn't have that problem since by Herlihy's definition, some other thread can always make forward progress.
For non-preemptive threading, it doesn't matter that much since even spin lock based solutions are lock-free by Herlihy's definition.
This is about performance, but also about the ability to handle multi-threaded loads:
locks grant exclusive access to a portion of code: while a thread holds a lock, other threads are spinning (looping while trying to acquire the lock) or blocked, sleeping until the lock is released (which usually happens if spinning lasts too long);
atomic operations grant exclusive access to a resource (usually a word-sized variable or a pointer) by using uninterruptible intrinsic CPU instructions.
As locks BLOCK other threads' execution, a program is slowed down.
As atomic operations execute serially (one after another), there is no blocking*.
(*) as long as the number of concurrent CPUs trying to access the same resource does not create a bottleneck - but we don't have enough CPU cores yet to see this as a problem.
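As a tiny illustration of the two approaches described above (a sketch, not a benchmark; the class names are invented): a lock-based counter puts contending threads to sleep on the monitor, while an atomic counter relies on an uninterruptible CPU-level increment.

    import java.util.concurrent.atomic.AtomicLong;

    public class Counters {
        // Lock-based: a thread holding the monitor blocks every other incrementer.
        static class LockedCounter {
            private long value;
            public synchronized void increment() { value++; }
            public synchronized long get() { return value; }
        }

        // Atomic: the increment maps to an uninterruptible CPU instruction
        // (internally a CAS/fetch-and-add); contending threads are never put to sleep.
        static class AtomicCounter {
            private final AtomicLong value = new AtomicLong();
            public void increment() { value.incrementAndGet(); }
            public long get() { return value.get(); }
        }
    }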
I have worked on the matter to write a wait-free (lock-free without wait states) Key-Value store for the server I am working on.
Libraries like Tokyo Cabinet (even TC-FIXED, a simple array) rely on locks to preserve the integrity of a database:
"while a writing thread is operating the database, other reading threads and writing threads are blocked" (Tokyo Cabinet documentation)
The results of a test without concurrency (a one-thread test):
SQLite time: 56.4 ms (a B-tree)
TC time: 10.7 ms (a hash table)
TC-FIXED time: 1.3 ms (an array)
G-WAN KV time: 0.4 ms (something new which works, but I am not sure a name is needed)
With concurrency (several threads writing and reading in the same DB), only the G-WAN KV survived the same test because (by contrast with the others) it never ever blocks.
So, yes, this KV store makes it easier for developers to use it, since they do not have to care about threading issues. Making it work this way was not trivial, however.
I believe I saw an article that mathematically proved that any algorithm can be written in a wait-free manner (which basically means that you can be assured of each thread always making progress towards its goal). This means it can be applied to any large-scale application (after all, a program is just an algorithm with many, many parameters), and because wait-free ensures that neither deadlock nor livelock occurs within it (as long as it doesn't have bugs which preclude it from being truly wait-free), it does simplify that side of the program. On the other hand, a mathematical proof is a far cry from actually implementing the code itself (AFAIK, there isn't even a fully lock-free linked list that can run on PCs; I've seen ones that cover most parts, but they usually either can't handle some common functions, or some functions require the structure to be locked).
On a side note, I've also found another proof showing that any lock-free algorithm can actually be considered wait-free due to the laws of probability and various other factors.
Scalability is a really important issue in efficient multi-/many-core programming. The greatest limiting factor is actually the section of code that has to be executed serially (see Amdahl's Law). However, contention on locks is also very problematic.
Lock-free algorithms address the scalability problem that legacy locks have. So, I would say lock-free is mostly for performance, not for decreasing the possibility of deadlock.
However, keep in mind that with the current x86 architecture, writing a general lock-free algorithm is impossible. This is because we can't atomically exchange arbitrarily sized data on current x86 (and the same is true for other architectures, except for Sun's ROCK). So current lock-free data structures are quite limited and very specialized for specific uses.
I think current lock-free data structures will no longer be used in a decade. I strongly expect that a hardware-assisted general lock-free mechanism (yes, transactional memory, TM) will be implemented within a decade. If any kind of TM is implemented, then although it can't perfectly solve the problems of locks, many problems (including priority inversion and deadlock) will be eliminated. However, implementing TM in hardware is still very challenging, and for x86 only a draft has been proposed so far.
Still too long? A two-sentence summary:
Lock-free data structures are not a panacea for lock-based multithreaded programming (and even TM is not). If you seriously need scalability and have trouble with lock contention, then consider lock-free data structures.