Monitoring performance counters during execution of a specific function - c

For some context, I'm profiling the execution of Memcached, and I would like to monitor dTLB misses during the execution of a specific function. Assuming that Memcached spawns multiple threads, each thread could potentially be executing the function in parallel. One particular solution I discovered, Perf features toggle events (Using perf probe to monitor performance stats during a particular function), should let me achieve this by setting probes on function entry and exit and toggling the event counter on/off on each probe respectively.
My question is:
(a) From my understanding, perf toggle events was included as part of a branch to Linux kernel 3.x. Has this been incorporated in recent LTS releases of Linux kernel 4.x? If not, are there any other alternatives?
(b) Another workaround I found is described here: performance monitoring for subset of process execution. However I'm not too sure if this will work correctly for the problem at hand. I'm concerned since Memcached is multi-threaded, having each thread spawn a new child process may cause too much overhead.
Any suggestions?

I could only find the implementation of the toggle events feature in the /perf/core_toggle repo, which is maintained by the developer of the feature. You can probably compile that code and play with the feature yourself. You can find examples on how to use it here. However, I don't think it has been accepted yet in the main Linux repo for any version of the kernel.
If you want to measure the number of one or more events, then there are alternatives that are easy to use, but require adding a few lines of code to your codebase. You can programmatically use the perf interface or other third-party tools that offer such APIs such as PAPI and LIKWID.

Related

Multicore ARM: how to assign a critical task to one dedicated core

Suppose an embedded system project where I have a multicore ARM processor (to make it simple assume 2 cores with an unshared cache between the 2 cores). Suppose my system contains a critical task and several non-critical tasks.
Therefore, can I assign the critical task to "core 1" exclusively? And all other to "core 2" exclusively?
If so, how to do and what are the best practices from an implementation point of view [assume I use C]? Should I use a library (if so which one)? An RTOS?
Ok, I see that you asked this over in the EE board as well. They gave the same answer I want to give you as well. Use an operating system of some sort to handle thread affinities. If your RTOS or whatever you have does not support this, then look into it and see how it actually handles process/thread scheduling.
Typically, each CPU on a system will be assigned some sort of thread that handles scheduling of tasks. This thread is one of the first things that an OS sets up. Feel free to research some micro kernels out there to see how this is done for your particular processor. You can also find the secret sauce for setting up this thread in the ARM documentation for your particular CPU.
But, I am going out on a limb and assuming this is far, far beyond the scope of any assignment given to you for a project. I would hope that you have some affinity of some sort built into what you were given. Setting up affinity for a known OS is a few seconds task. Setting up affinity on a bare metal system with no OS at all is much more involved.
Original question:
https://electronics.stackexchange.com/questions/356225/multicore-arm-how-to-assign-a-critical-task-to-one-dedicated-core#comment854845_356225
If you don't need real-time functionality, you can do this on a device with a Linux kernel without too much hassle.
See this question here

Erlang spawning large amounts of C processes

I've been looking into how I could embed languages (let's use Lua as an example) in Erlang. This of course isn't a new idea and there are many libraries out there that can do this. However I was wondering if it was possible to start a Genserver with state which is modified by Lua. This means that once you start the Genserver, it will start a (long running) Lua process to manipulate the Genserver's state. I know this is possible as well, but I was wondering if I could spawn 1,000 10,000 or even 100,000 of these processes.
I'm not really familiar with this topic but I have done some research.
(Please correct me if I'm wrong on any of these options).
TLDR; Skip to the last paragraph.
First option: NIFs:
This doesn't seem like an option since it will block the Erlang Scheduler of the current process. If I want to spawn a large amount of these it will freeze the entire runtime.
Second option: Port Driver:
It's like a NIF but communicates by sending data to a specified port, which can also send data back to Erlang. This is nice although this also seems to block the scheduler. I've tried a library which does the boiler plat for you as well, but that seemed to block the scheduler after spawning 10 processes. I've also looked into the postgresql example on the Erlang Documentation which is said to be async but I couldn't get the example code to work (R13?). Is it even possible to run as many Port Driver processes without blocking the runtime?
Third option: C Nodes:
I thought this was very interesting and wanted to try it out, but apparently the project "erlang-lua" already does this. It's nice because it won't crash your Erlang VM if something goes wrong and the processes are isolated. But in order to actually spawn a single process you need to spawn an entire node. I have no idea how expensive this is. Nor am I sure what the limit is for connecting nodes in a cluster, but I don't see myself spawning 100,000 C nodes.
Fourth option: Ports:
At first I thought this was the same as a Port Driver but it's actually different. You spawn a process which executes an application and communicates through STDIN and STDOUT. This would work well for spawning a large amount of processes, and (I think?) they aren't a threat to the Erlang VM. But if I'm going to communicate through STDIN / STDOUT, why even bother with an embeddable language to begin with? Might as well use any other scripting language.
And so after much research in a field I'm not familiar with I've come to this. You could a Genserver as an "entity" where the AI is written in Lua. Which is why I'd like to have a processes for each entity. My question is how do I achieve spawning many Genservers which communicate with long running Lua processes? Is this even possible? Should I be tackling my problem differently?
If you can make the Lua code — or more accurately, its underlying native code — cooperate with the Erlang VM, you have a few choices.
Consider one of the most important functions of the Erlang VM: managing the execution of a (potentially large number of) Erlang's lightweight processes across a relatively small set of scheduler threads. It uses several techniques to know when a process has used up its timeslice or is waiting and so should be scheduled out to give another process a chance to run.
You seem to be asking how you can get native code to run however it likes within the VM, but as you've already hinted, the reason native code can cause problems for the VM is that it has no practical way to stop the native code from completely taking over a scheduler thread and thus preventing regular Erlang processes from executing. Because of this, native code has to cooperatively yield the scheduler thread back to the VM.
For older NIFs, the choices for such cooperation were:
Keep the amount of time NIF calls ran on a scheduler thread to 1ms or less.
Create one or more private threads. Transition each long-running NIF call from its scheduler thread over to a private thread for execution, then return the scheduler thread to the VM.
The problems here are that not all calls can complete in 1ms or less, and that managing private threads can be error-prone. To get around the first problem, some developers would break the work down into chunks and use an Erlang function as a wrapper to manage a series of short NIF calls, each of which completed one chunk of work. As for the second problem, well, sometimes you just can't avoid it, despite its inherent difficulty.
NIFs running on Erlang 17.3 or later can also cooperatively yield the scheduler thread using the enif_schedule_nif function. To use this feature, the native code has to be able to do its work in chunks such that each chunk can complete within the usual 1ms NIF execution window, similar to the approach mentioned earlier but without the need to artificially return to an Erlang wrapper. My bitwise example code provides many details about this.
Erlang 17 also brought an experimental feature, off by default, called dirty schedulers. This is a set of VM schedulers that do not have the same native code execution time constraints as the regular schedulers; work there can block for essentially infinite periods without disrupting normal VM operation.
Dirty schedulers come in two flavors: CPU schedulers for CPU-bound work, and I/O schedulers for I/O-bound work. In a VM compiled to enable dirty schedulers, there are by default as many dirty CPU schedulers as there are regular schedulers, and there are 10 I/O schedulers. These numbers can be altered using command-line switches, but note that to try to prevent regular scheduler starvation, you can never have more dirty CPU schedulers than regular schedulers. Applications use the same enif_schedule_nif function mentioned earlier to execute NIFs on dirty schedulers. My bitwise example code provides many details about this too. Dirty schedulers will remain an experimental feature for Erlang 18 as well.
Native code in linked-in port drivers is subject to the same on-scheduler execution time constraints as NIFs, but drivers have two features NIFs don't:
Driver code can register file descriptors into the VM polling subsystem and be notified when any of those file descriptors becomes I/O-ready.
The driver API supports access to a non-scheduler async thread pool, the size of which is configurable but by default has 10 threads.
The first feature allows native driver code to avoid blocking a thread for I/O. For example, instead of performing a blocking recv call, driver code can register the socket file descriptor so the VM can poll it and call the driver back when the file descriptor becomes readable.
The second feature provides a separate thread pool useful for driver tasks that can't conform to the scheduler thread native code execution time constraints. You can achieve the same in a NIF but you have to set up your own thread pool and write your own native code to manage and access it. But regardless of whether you use the driver async thread pool, your own NIF thread pool, or dirty schedulers, note that they are all regular operating system threads, and so trying to start a huge number of them simply isn't practical.
Native driver code does not yet have dirty scheduler access, but this work is on-going and it might become available as an experimental feature in an 18.x release.
If your Lua code can make use of one or more of these features to cooperate with the Erlang VM, then what you're attempting may be possible.

How to avoid multi Threading

I came across this question and was very impressed by this answer.
I would really like to follow the advices from that answer, but I cannot imagine how to do that. How can I avoid multi threading?
There are often situations, that need to deal with different things concurrently (different hardware resources or networking for example) but at the same time they need to access shared data (like configurations, data to work on, and so on).
How could this be solved single-threaded without using any kinds of huge state-machines or event loops?
I know, that this is a huge topic, which cannot be answered as a whole on a platform like Stackoverflow. I think I really should go reading the advised book from the mentioned answer, but for now I would love to read some input here.
Maybe it is worth noting, that I am interested in solutions in C. Higher languages like Java, C++ and especially frameworks like Qt or similar ones simplify this a lot, but what about pure C?
Any input is very appreciated. Thank you all in advance
You already mentioned event loops, but I still think those provide an excellent alternative to multi-threading for many applications, and also serve as a good base when adding multi-threading later, if warranted.
Say you have an application that needs to handle user input, data received on a socket, timer events, and signals for example:
One multi-threaded design would be to spawn different threads to wait on the different event sources and have them synchronize their actions on some global state as events arrive. This often leads to messy synchronization and termination logic.
A single-threaded design would be to have a unified event loop that receives all types of events and handles them in the same thread as they arrive. On *nix systems, this can be accomplished using e.g. select(2), poll(2), or epoll(7) (the latter Linux-specific). Recent Linux versions also provide signalfd(2), timerfd (timerfd_create(2)), and eventfd(2) for cleanly fitting additional event types into this model, and on other unices you can use various tricks involving e.g. pipe(2)s to signal events. A nice library that abstracts much of this away is libevent, which also works on other platforms.
Besides not having to deal with multi-threading right away, the event loop approach also cleanly lends itself to adding multi-threading later if needed for performance or other reason: You simply let the event handler spawn threads for certain events. Having all event handling in a single location often greatly simplifies application design.
When you do need multiple threads (or processes), it helps to have narrow and well-tested interfaces between them, using e.g. synchronized queues. An alternative design for the event handler would be to have event-generating threads push events to an event queue from which the event handler then reads and dispatches them. This cleanly separates various parts of the program.
Read more about continuations and contination-passing style (and CPS transform).
CPS-transform could be a systematic way to "mimic" multi-threading.
You could have a look into CPC (Continuation Passing C, by Juliusz Chroboczek and Gabriel Kerneis), which is also a source to source C transformer. You could also read old Appel's book: Compiling with Continuations and Queinnec's book Lisp In Small Pieces
Read also more about event loops, callbacks, closures, call stacks, tail calls. These notions are related to your concerns.
See also the (nearly obsolete) setcontext(3) Linux function, and idle functions in event loops, see this.
You can implement concurrent tasks by using coroutines. You then have to explicitly pass control (the cpu) to another coroutine. It won't be done automatically by an interrupt after a small delay.
http://en.wikipedia.org/wiki/Coroutine

Programming a relatively large, threaded application for old systems

Today my boss and I were having a discussion about some code I had written. My code downloads 3 files from a given HTTP/HTTPS link. I had multi-threaded the download so that all 3 files are downloading simultaneously in 3 separate threads. During this discussion, my boss tells me that the code is going to be shipped to people who will most likely be running old hardware and software (I'm talking Windows 2000).
Until this time, I had never considered how a threaded application would scale on older hardware. I realize that if the CPU has only 1 core, threads are useless and may even worsen performance. I have been wondering if this download task is an I/O operation. Meaning, if an API is blocked waiting for information from the HTTP/HTTPS server, will another thread that wants to do some calculation be scheduled meanwhile? Do older OSes do such scheduling?
Another thing he said: Since the code is going to be run on old machines, my application should not eat the CPU. He said use Sleep() calls after CPU intensive tasks to allow other programs some breathing space. Now I was always under the impression that using Sleep() is terrible in any program. Am I wrong? When is using Sleep() justified?
Thanks for looking!
I have been wondering if this download task is an I/O operation.
Meaning, if an API is blocked waiting for information from the
HTTP/HTTPS server, will another thread that wants to do some
calculation be scheduled meanwhile? Do older OSes do such scheduling?
Yes they do. That's the joke of having blocked IO. The thread is suspended and other calculations (threads) take place until an event wakes up the blocked thread. That's why it makes completely sense to split it up into threads even for single core machines instead of doing some poor man scheduling between the downloads yourself in a single thread.
Of course your downloads affect each other regarding bandwith, so threading won't help to speedup the download :-)
Another thing he said: Since the code is going to be run on old
machines, my application should not eat the CPU. He said use Sleep()
calls after CPU intensive tasks to allow other programs some breathing
space.
Actually using sleep AFTER the task finished won't help here. Doing Sleep after a certain time of calculation (doing sort of time slicing) before going on with the calculation could help. But this is only true for cooperative systems (e.g. like Windows 3.11). This does not play a role for preemptive systems where the scheduler uses time slicing to allocate calculation time to threads. Here it would be more important to think about lowering the priority for CPU intensive tasks in order to give other tasks precedence...
Now I was always under the impression that using Sleep() is terrible
in any program. Am I wrong? When is using Sleep() justified?
This really depends on what you are doing. If you implement sort of busy waiting for a certain flag being set which is set maybe after few seconds it's better to recheck if it's set after going to sleep for a while in order to give up your scheduled time slice instead of just buring CPU power with checking for a flag never being set.
In modern systems there is no sense in introducing Sleep in a calculation as it will only slow down your calculation.
Scheduling is subject to the OS's scheduler. He's the one with the "big picture". In my opinion every approach to "do it better" is only valid inside the scope of a specific application where you have the overview over certain relationships that are not obvious to the scheduler.
Addendum:
I did some research and found that Windows supports preemptive multitasking from Windows 95. The Windows NT-line (where Windows 2000 belongs to) always supported preemptive multitasking.

Suspend and Resume thread (Windows, C)

I'm currently developing a heavily multi-threaded application, dealing with lots of small data batch to process.
The problem with it is that too many threads are being spawns, which slows down the system considerably. In order to avoid that, I've got a table of Handles which limits the number of concurrent threads. Then I "WaitForMultipleObjects", and when one slot is being freed, I create a new thread, with its own data batch to handle.
Now, I've got as many threads as I want (typically, one per core). Even then, the load incurred by multi-threading is extremely sensible. The reason for this: the data batch is small, so I'm constantly creating new threads.
The first idea I'm currently implementing is simply to regroup jobs into longer serial lists. Therefore, when I'm creating a new thread, it will have 128 or 512 data batch to handle before being terminated. It works well, but somewhat destroys granularity.
I was asked to look for another scenario: if the problem comes from "creating" threads too often, what about "pausing" them, loading data batch and "resuming" the thread?
Unfortunately, I'm not too successful.
The problem is: when a thread is in "suspend" mode, "WaitForMultipleObjects" does not detect it as available. In fact, I can't efficiently distinguish between an active and suspended thread.
So I've got 2 questions:
How to detect "suspended thread", so that i can load new data into it and resume it?
Is it a good idea? After all, is "CreateThread" really a ressource hog?
Edit
After much testings, here are my findings concerning Thread Pooling and IO Completion Port, both advised in this post.
Thread Pooling is tested using the older version "QueueUserWorkItem".
IO Completion Port requires using CreateIoCompletionPort, GetQueuedCompletionStatus and PostQueuedCompletionStatus;
1) First on performance : Creating many threads is very costly, and both thread pooling and io completion ports are doing a great job to avoid that cost. I am now down to 8-jobs per batch, from an earlier 512-jobs per batch, with no slowdown. This is considerable. Even when going to 1-job per batch, performance impact is less than 5%. Truly remarkable.
From a performance standpoint, QueueUserWorkItem wins, albeit by such a small margin (about 1% better) that it is almost negligible.
2) On usage simplicity :
Regarding starting threads : No question, QueueUserWorkItem is by far the easiest to setup. IO Completion port is heavyweight in comparison.
Regarding ending threads : Win for IO Completion Port.
For some unknown reason, MS provides no function in C to know when all jobs are completed with QueueUserWorkItem. It requires some nasty tricks to successfully implement this basic but critical function. There is no excuse for such a lack of feature.
3) On resource control : Big win for IO Completion Port, which allows to finely tune the number of concurrent threads, while there is no such control with QueueUserWorkItem, which will happily spend all CPU cycles from all available cores. That, in itself, could be a deal breaker for QueueUserWorkItem.
Note that newer version of Completion Port seems to allow that control, but are only available on Windows Vista and later.
4) On compatibility : small win for IO Completion Port, which is available since Windows NT4. QueueUserWorkItem only exists since Windows 2000. This is however good enough. Newer version of Completion Port is a no-go for Windows XP.
As can be guessed, I'm pretty much tied between the 2 solutions. They both answer correctly to my needs.
For a general situation, I suggest I/O Completion Port, mostly for resource control.
On the other hand, QueueUserWorkItem is easier to setup. Quite a pity that it loses most of this simplicity on requiring the programmer to deal alone with end-of-jobs detection.
Instead of implementing your own, consider using CreateThreadpool(). The OS will do the work for you, and you don't have to worry about getting it right.
Yes, there's a fair amount of overhead involved with CreateThread. One solution is to use a thread pool, QueueUserWorkItem. Another is to just start a set of threads and have them retrieve a 'job item' from a thread-safe queue.
If you want to also support Windows XP, you cannot use CreateThreadpool -- otherwise, if Vista and newer is sufficient, Windows thread pools are the easiest way.
If Windows XP support is needed, spawn a number of threads and assign them to an IO completion port, then have each thread block on GetQueuedCompletionStatus(). Completion ports let you post events to the port which will wake exactly one thread per event, and they are very efficient. They use a LIFO strategy on waking threads to keep caches warm, too.
In any case, you will never want to suspend a thread. Never ever. Block, wait, but don't suspend.
The reason is that with suspend you get the problem that you describe, plus you will create deadlocks, e.g. if your thread is within a critical section or mutex. Aside from a debugger, nobody should ever need to suspend a thread.

Resources