Kernel time spent using Open MPI? - C

I'm investigating how to profile scalability tests on Open MPI, on Linux Ubuntu 18.04.
I can profile the benchmark with some useful MPI profiling tools like mpiP, Scalasca, etc. However, one question remains open for me:
What is the kernel usage (time, memory, I/O, etc.) of an MPI job? I need to see kernel activity to profile the MPI tasks (processes) across different ranks. How can I profile the kernel usage? I think everything the profilers mentioned above provide is from the user-space point of view, right?
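For a coarse first look (a minimal sketch, not a replacement for a kernel profiler), getrusage() reports each process's user vs. system CPU time and peak RSS, so every rank can print its own kernel time at the end of a run; the MPI calls are standard, everything else here is illustrative:

    #include <stdio.h>
    #include <sys/resource.h>
    #include <mpi.h>

    /* Print per-rank user vs. kernel (system) CPU time and peak RSS.
     * ru_stime is the time this rank's process spent in the kernel. */
    static void report_rank_usage(void)
    {
        struct rusage ru;
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        getrusage(RUSAGE_SELF, &ru);
        printf("rank %d: user %ld.%06lds  sys %ld.%06lds  maxrss %ld KiB\n",
               rank,
               (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
               (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec,
               ru.ru_maxrss);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        /* ... the MPI workload being profiled ... */
        report_rank_usage();
        MPI_Finalize();
        return 0;
    }

For a finer-grained view of where that system time goes, running each rank under strace -c summarizes time per syscall, and perf can sample kernel stacks.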

Related

Multicore ARM: how to assign a critical task to one dedicated core

Suppose an embedded system project where I have a multicore ARM processor (to make it simple assume 2 cores with an unshared cache between the 2 cores). Suppose my system contains a critical task and several non-critical tasks.
Therefore, can I assign the critical task exclusively to "core 1" and all the others exclusively to "core 2"?
If so, how should I do it, and what are the best practices from an implementation point of view (assume I use C)? Should I use a library (if so, which one)? An RTOS?
OK, I see that you asked this over on the EE board as well. They gave the same answer I want to give you: use an operating system of some sort to handle thread affinities. If your RTOS (or whatever you have) does not support this, then look into how it actually handles process/thread scheduling.
Typically, each CPU in a system is assigned some sort of thread that handles the scheduling of tasks. This thread is one of the first things an OS sets up. Feel free to research some microkernels out there to see how this is done for your particular processor. You can also find the secret sauce for setting up this thread in the ARM documentation for your particular CPU.
But I am going out on a limb and assuming this is far, far beyond the scope of any assignment given to you for a project. I would hope that some sort of affinity support is built into what you were given. Setting up affinity on a known OS is a few seconds' task. Setting up affinity on a bare-metal system with no OS at all is much more involved.
Original question:
https://electronics.stackexchange.com/questions/356225/multicore-arm-how-to-assign-a-critical-task-to-one-dedicated-core#comment854845_356225
If you don't need real-time functionality, you can do this on a device with a Linux kernel without too much hassle.
See this question here
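As an illustration of the Linux route, here is a minimal sketch using the glibc extension pthread_setaffinity_np to pin the calling thread to one core; the core number is just an example, and for real exclusivity you would also keep other tasks off that core (e.g. with the isolcpus= boot parameter):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Restrict the calling thread to a single CPU core. */
    static int pin_current_thread(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);   /* allow only this core */
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    int main(void)
    {
        int err = pin_current_thread(1);   /* critical task -> core 1 */
        if (err != 0) {
            fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(err));
            return 1;
        }
        /* ... critical work now runs on core 1 only ... */
        printf("pinned to core 1\n");
        return 0;
    }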

What are the best ways to measure latencies from user space to kernel space?

I have to measure the latency between a user-space program and the driver it interacts with. I basically send a packet through this application. The latency is between the write in user space and the corresponding write function in the kernel.
I used clock_gettime with CLOCK_MONOTONIC in user space and
getrawmonotonic in the kernel (driver), and the difference I see is huge (around 4 ms). So I am definitely using the wrong approach.
So, what are the best ways to do this?
To measure just a single context switch from user space to kernel space, try the TSC (Time Stamp Counter). It is available on x86 and ARM, in both user and kernel space.
More info about the TSC on Wikipedia: https://en.wikipedia.org/wiki/Time_Stamp_Counter
A BSD-licensed implementation for x86 can be found here,
and one for 64-bit ARM here.
Also, as the comments suggested, consider using any standard tool available to measure a round-trip latency, i.e. user-to-kernel and back.
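As a minimal sketch of the user-space side on x86 with GCC/Clang (serializing instructions such as lfence are omitted for brevity, and you still need the calibrated TSC frequency to convert cycles into time):

    #include <stdio.h>
    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

    int main(void)
    {
        uint64_t start = __rdtsc();
        /* ... the code path to measure, e.g. the write() that
         * enters your driver ... */
        uint64_t end = __rdtsc();
        printf("elapsed: %llu TSC cycles\n",
               (unsigned long long)(end - start));
        return 0;
    }

On the kernel side you can read the same counter (e.g. via get_cycles()), so both timestamps come from one clock source instead of two unrelated ones.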
If I were doing this, I would use ftrace, a tracing tool provided by the Linux kernel.
It can trace almost every function in the kernel.
It first logs the information into an in-memory ring buffer, so it costs very little.
There is a very good document in the Linux kernel source tree, "Documentation/trace/ftrace.txt"; you can also find it here.
1. Prepare the environment and configure ftrace.
2. Run the application.
3.0. In the application, bind it to a CPU and raise its priority.
3.1. In the application, write something to trace_marker.
3.2. In the application, call the function you want to test.
4. Get the log from the ring buffer.
5. Calculate the latency.
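As a minimal sketch of steps 3.1 and 3.2, writing markers into the ftrace ring buffer around the call under test (this assumes tracefs/debugfs is mounted at /sys/kernel/debug/tracing and ftrace was configured in step 1):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
        if (fd < 0) { perror("open trace_marker"); return 1; }

        write(fd, "before-call\n", 12);   /* marker appears in the trace log */
        /* ... call the function you want to measure here ... */
        write(fd, "after-call\n", 11);

        close(fd);
        return 0;
    }

The markers then show up timestamped in the trace output, interleaved with the kernel functions that ran in between, which is what you subtract in step 5.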

Why does a C program spend so much time in kernel mode?

I use the time command to measure the execution time of a C program, and I see that it spends a lot of time in kernel mode (although I expected it to run mostly in user mode). I don't know why, and I don't have any clue where to search for the problem.
Here is an example:
real 0m44.548s
user 0m19.956s
sys 1m19.944s
Here is some information about the test program: it is streamcluster from the PARSEC benchmark suite.
Application domain: data mining
Data sharing: low
Data exchange: medium
Parallelization model: data-parallel
Contains many pthread_mutexes and pthread_conditions
CPU bound
Few memory allocations or file writes
I run this program in a virtual machine.

Running MMU-less Linux on ARM Cortex-R4

I'm using an ARM Cortex-R4 for my system. It has a Memory Protection Unit instead of a Memory Management Unit. Effectively, this means that there's dedicated hardware for memory protection but that there's a one-to-one mapping between physical and virtual addresses. I'm a little confused about which Linux I should go for: the standard Linux kernel with the MMU disabled, or uClinux.
On ARM's evaluation board, I have run the standard kernel compiled with MMU disabled. I used the cramfs filesystem which is available on the official ARM website. After the kernel boots up, I'm in the shell, but I couldn't do much experimentation as I found that, most of the time, the shell stops responding (particularly when I press "tab" for auto-completion).
So I'm still not sure whether the MMU-less kernel should run smoothly if I use the correct filesystem. Also, which distro (buildroot?) should I use for the no-VM Linux?
Any idea or suggestion is welcome.
It's been more than 2 years since I asked this question, so it's time to write down what I found out for myself.
uClinux was a project forked from the Linux kernel long ago, with the aim of developing a kernel for MMU-less systems. However, after a while it was merged back into the mainline Linux branch. So today there is no active uClinux distribution.
So, if you disable the MMU in the mainline kernel configuration, you get an MMU-less kernel. In fact, there are now configuration options in the kernel itself whereby a user can specify the memory layout and the access permissions.
Cheers!
uClinux is a Linux distribution that uses the Linux kernel with the MMU "turned off" and adds some applications and libraries on top of it. You won't choose one or the other, as they work best one on top of the other.
If you got to a point where you have a shell running, you've managed to boot Linux sans MMU on your board but ran into a bug.
I believe uClinux was built for exactly this kind of MMU-less system:
http://www.uclinux.org/description/

Monitor CPU and memory consumption of a specific process in (Windows) C?

I would like to monitor the CPU and memory consumption of a given process on Windows (NT architecture: XP, Vista, Win7) every few seconds, to make a graph.
I have searched around but found only non-C solutions (Java, C#, C++, etc.).
I know there is a PerformanceCounter class, but obviously it is not available in C.
Thanks
Win32 Performance Counters:
http://msdn.microsoft.com/en-us/library/aa373083%28v=vs.85%29.aspx
Developer Audience:
Performance Counters is designed for use by C and
C++ developers.
However, if you just want a tool to show you this information, get Mark Russinovich's Process Explorer. It can show per-process stats and graphs.
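If you want to do the sampling yourself in plain C without the performance-counter machinery, a minimal sketch using GetProcessTimes and GetProcessMemoryInfo could look like this (the PID comes from the command line, the 2-second interval is arbitrary, and error handling is abbreviated):

    #include <windows.h>
    #include <psapi.h>   /* GetProcessMemoryInfo; link with psapi.lib */
    #include <stdio.h>
    #include <stdlib.h>

    /* Convert a FILETIME to a plain 64-bit count of 100-ns units. */
    static ULONGLONG ft_to_100ns(FILETIME ft)
    {
        ULARGE_INTEGER u;
        u.LowPart = ft.dwLowDateTime;
        u.HighPart = ft.dwHighDateTime;
        return u.QuadPart;
    }

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
        HANDLE h = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                               FALSE, (DWORD)atoi(argv[1]));
        if (!h) { fprintf(stderr, "OpenProcess: %lu\n", GetLastError()); return 1; }

        ULONGLONG prev = 0;
        for (;;) {
            FILETIME create, exited, kern, user;
            PROCESS_MEMORY_COUNTERS mem = { sizeof(mem) };
            if (!GetProcessTimes(h, &create, &exited, &kern, &user) ||
                !GetProcessMemoryInfo(h, &mem, sizeof(mem)))
                break;   /* the process has probably exited */

            ULONGLONG cpu = ft_to_100ns(kern) + ft_to_100ns(user);
            if (prev)   /* CPU time used, summed across all cores, per 2 s interval */
                printf("cpu %.1f%%  working set %lu KiB\n",
                       (cpu - prev) / (2.0 * 10000000.0) * 100.0,
                       (unsigned long)(mem.WorkingSetSize / 1024));
            prev = cpu;
            Sleep(2000);
        }
        CloseHandle(h);
        return 0;
    }

Divide the percentage by the core count (available via GetSystemInfo) if you want a 0-100% scale like Task Manager's.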
