I use a cluster which contains several nodes. Each of them has 2 processors with 8 cores inside. I use Open MPI with SLURM.
My tests show the following MPI Send/Recv transfer rates: about 9 GB/s between rank 0 and rank 1, but only about 5 GB/s between rank 0 and rank 2. I assume this happens because those processes execute on different processors.
I'd like to avoid non-local memory access. The recommendations I found here did not help. So the question is: is it possible to run 8 MPI processes, all on THE SAME processor? If it is, how do I do it?
Thanks.
The following set of command-line options to mpiexec should do the trick with versions of Open MPI before 1.7:
--bycore --bind-to-core --report-bindings
The last option will pretty-print the actual binding for each rank. Binding also activates some NUMA-awareness in the shared-memory BTL module.
Starting with Open MPI 1.7, processes are distributed round-robin over the available sockets and bound to a single core by default. To replicate the above command line, one should use:
--map-by core --bind-to core --report-bindings
It appears to be possible. The Process Binding and Rankfiles sections of the Open MPI mpirun man page look promising. I would try some of the options shown there with --report-bindings set, so you can verify that process placement is what you intend and see whether you get the performance improvement you expect out of your code.
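If you want to double-check the placement from inside the code as well, each rank can report the core it is currently running on. A minimal sketch, assuming a Linux target (sched_getcpu() is a glibc extension):

#define _GNU_SOURCE
#include <sched.h>      /* sched_getcpu() */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Report which node and core this rank ended up on. */
    printf("rank %d on %s, core %d\n", rank, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}

The output should agree with what --report-bindings prints; if a rank lands on a core of the second socket, the mapping or binding options are not taking effect.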
You should look at the hostfile / rankfile documentation for your MPI library. Open MPI and MPICH use different formats, but both will give you what you want.
Keep in mind that you will have performance issues if you oversubscribe your processor too heavily. Running more than 8 ranks on an 8-core processor will cause you to lose the performance benefits you gain from having locally shared memory.
With Slurm, set:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=8
to have all cores allocated on the same socket (CPU die), provided Slurm is correctly configured.
I have a hybrid MPI+OpenMP code written in C. Before I run the code, I set the number of threads I want to use in my Linux bash script:
#!/bin/sh
export OMP_NUM_THREADS=1
However, if I want to use the maximum number of threads on this computer, how can I set that in the bash environment like the above?
Typically, OpenMP implementations are smart enough to provide a good default for the number of threads. For instance:
If you run an OpenMP program without any setting of OMP_NUM_THREADS and there's no MPI at all, OpenMP will usually determine how many cores there are and use all of them.
If there's MPI and the MPI processes are bound to a subset of the machine through process pinning, e.g., one rank per processor socket, OpenMP will inherit this subset and the implementation will automatically only use that subset.
In all those cases and most other cases, you do not need to set OMP_NUM_THREADS at all.
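If you want to see what the runtime decided for you, a small check like the following prints the effective thread count (just a sketch; build with your compiler's OpenMP flag, e.g. -fopenmp for GCC):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* The default team size the runtime would use for the next parallel region. */
    printf("max threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("threads actually used: %d\n", omp_get_num_threads());
    }
    return 0;
}

If you really do want to force the maximum explicitly, calling omp_set_num_threads(omp_get_num_procs()) at the start of the program is another option.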
If you can share more details about what you are trying to achieve, one can provide a more detailed answer about what you need to do.
I am working on how a dual core (especially in embedded systems) could be beneficial. I would like to compare two targets: one with a dual-core ARM Cortex-A9 (925 MHz), and the other with a single-core ARM Cortex-A8.
I have some ideas (please see below), but I am not sure they will really exercise the dual-core features:
My questions are:
1. How can I execute several threads on different cores (without OpenMP, because it didn't work on my target and it isn't compatible with VxWorks)?
2. How does the kernel execute code on a dual core with shared memory: how does it allocate the stack, the heap, and memory for global and static variables?
3. Is it possible to add C flags to indicate the number of CPU cores, so that we can use the dual-core features?
4. How does the kernel handle program execution (with a lot of threads) on a dual core?
Some tests to compare the two architectures regarding the OS and dual core vs. single core:
Dual core vs. single core:
Create three threads that execute some routines and depend on each other's results (like a matrix multiplication). Afterwards, measure the time taken once on the dual core and once on the single core (how is this possible without OpenMP?).
Ping pong:
Two processes repeatedly pass a message back and forth. We should investigate how the time taken varies with the size of the message for each architecture (a minimal sketch of this test appears after this list).
One to all:
A process with rank 0 sends the same message to all other processes in the program and then receives a message of the same length from all other processes. How does the time taken vary with the size of the messages and with the number of processes?
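For reference, the ping-pong test could be sketched in MPI roughly as follows (assuming an MPI implementation is available on the target; the message size and iteration count are arbitrary placeholders):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int size  = 1 << 20;   /* 1 MiB message, arbitrary */
    const int iters = 1000;
    int rank;
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("average round trip: %g s\n", (t1 - t0) / iters);

    free(buf);
    MPI_Finalize();
    return 0;
}

The one-to-all test could be built the same way around MPI_Bcast followed by MPI_Gather, varying the message size and the number of processes.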
Short answers wrt. Linux only:
how to execute several threads on different cores
Use multiple processes, or pthreads within a single process.
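For example, a bare-bones pthread version of the three-thread idea could look like this (a sketch; the worker function is just a placeholder, and you would compile with -pthread):

#include <pthread.h>
#include <stdio.h>

/* Placeholder worker; in the real test this would do the matrix work. */
static void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[3];

    for (long i = 0; i < 3; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    /* The kernel is free to schedule the threads on either core. */
    for (int i = 0; i < 3; i++)
        pthread_join(threads[i], NULL);

    return 0;
}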
How does the kernel execute code on a dual core with shared memory: how does it allocate the stack, the heap, and memory for global and static variables?
In Linux, they all belong to the process. (You can declare thread-local variables, though, with for example the __thread keyword.)
In other words, if a thread allocates some memory, that memory is immediately visible to all other threads in the same process. Even if that thread exits, nothing happens to the memory. It is perfectly normal for some other thread to free the memory later on.
Each thread does get its own stack, which by default is quite large. (With pthreads, this is easy to control using a pthread_attr_t that specifies a smaller stack size.)
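For instance, a smaller per-thread stack can be requested roughly like this (the 256 KiB figure is arbitrary):

#include <pthread.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    /* Ask for a 256 KiB stack instead of the default (often 8 MiB on glibc). */
    pthread_attr_setstacksize(&attr, 256 * 1024);

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}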
In general, there is no difference in memory handling or program execution, between a single-threaded and a multi-threaded process, on the kernel side. (Userspace code needs to use memory and threads properly, of course; the kernel does not try to stop stupid userspace code from shooting itself in the head, at all.)
Is it possible to add C flags to indicate the number of CPU cores, so that we can use the dual-core features?
Yes, for example by examining the /proc/cpuinfo pseudofile.
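For completeness, on Linux/glibc there is also sysconf(), which is usually simpler than parsing /proc/cpuinfo by hand; a minimal sketch:

#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Number of cores currently online (_SC_NPROCESSORS_ONLN is a glibc extension). */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", cores);
    return 0;
}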
However, it is much more common to leave such details for the system administrator. Services like Apache etc. let the administrator configure the number, either in a configuration file, or in the command line. Even make supports a -j JOBS parameter, allowing multiple compilation tasks to run in parallel.
I very warmly recommend you forget about any detection magic, and instead let the user specify the number of threads used. (If you insist, you could use detection magic for the default, but then only if the user/admin has not given you any hints about the number of threads to be used.)
It is also possible to set the CPU mask for each thread, specifying the set of cores that thread may run on. Other than in benchmarks, where this is done more for repeatability than anything else, or in dedicated processes designed to hog all the resources on a machine, it is very rare.
How does the kernel handle program execution (with a lot of threads) on a dual core?
The exact same way as it handles a lot of simultaneous processes.
There are some controls (control group stuff, cpu masks) and maybe resource accounting that are handled differently between separate processes and threads within the same process, but in general terms, separate threads in a single process are executed the same way as separate processes are.
I have a parent process, which I use to spawn a series of child processes, each of which runs its own program sequentially. Each of these programs changes a file over time; I want to read the data from this file and see how it changes as each program runs.
I need two sets of data for this to work: the value of the file at some set interval (I haven't decided on the interval yet), and the time each program takes to run. There are other variables that can influence the execution times of these programs, which I also want to observe.
So I figured that, to get more accurate timing of the child processes while still reading from the file, I could run them on different cores. I have 8 cores; I would like to run the parent process on cores 0-3, then fork the children to run on cores 4-7. I'm not sure whether this is possible from within C, though, and a search around hasn't yielded any answers, which makes me think it isn't.
Within Linux, outside of a program, I can use taskset to do this.
I plan on setting aside 4 of the cores using the kernel boot parameter isolcpus. I want as little noise as possible while running the child programs.
Asking the kernel to associate CPU cores with threads or processes is also known as setting the "affinity" between the core and the process/thread.
Under Linux, there exists a set of functions that provide this capability. Take a look at the manual page for one of the functions...
man pthread_setaffinity_np
This family of API calls might be able to give you what you need.
That man page has a "see also" section that links to the other functions in this family.
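As a rough sketch (not a definitive recipe), restricting the calling thread to cores 4-7 could look like this:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    /* Allow this thread to run only on cores 4-7. */
    for (int cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &set);

    int err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err != 0)
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);

    printf("now running on core %d\n", sched_getcpu());
    return 0;
}

In the fork-based setup from the question, the process-level sibling sched_setaffinity() could be called in the child right after fork(), with the same cpu_set_t, before exec'ing the program being measured.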
Typically with features such as these that deal with kernel process and thread scheduling, it is entirely dependent on what mood the kernel is in at the time as to whether your requests are met or ignored. Your mileage may vary due to system load or the number of available cores. Even if a system has 16 cores, these features may be disabled in the kernel compilation settings (think virtual machines). Equally, you may find that there are some additional options that you may be able to add to your kernel to get better results than the defaults.
I am trying to find out how a specific process, written in C, loads the CPU over a certain time frame. The process may switch processor cores during runtime, so I need to handle that too. The CPU is an ARM processor.
I have looked at different ways to get the load: standard top, perf, and also calculating the load from the statistics given in the /proc/[pid]/stat file.
My thought is to have a program that reads the /proc/[pid]/stat file, as suggested in the thread "How to calculate the CPU usage of a process by PID in Linux from C?", and calculates the load accordingly.
But how would I treat core switching? I need to notice it and adjust the load calculation.
How would you recommend me to achieve this?
Update: How can I see which core the process is running on, and thereby check whether it has switched cores since the last check, assuming I poll the process figures/statistics at least twice?
The perf tool can tell you the number of cpu-migrations your process has made, that is, how many times the process has switched CPUs. It won't tell you which CPU cores were involved, though.
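The stat file you mention can fill that gap: according to proc(5), field 39 of /proc/[pid]/stat ("processor") is the CPU the task last ran on, while fields 14 and 15 (utime/stime) are the ones used for the load calculation. A rough sketch of a single poll (error handling kept minimal):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* One poll of /proc/<pid>/stat: CPU time consumed so far and the core last run on.
   Field numbers follow proc(5): utime = 14, stime = 15, processor = 39. */
static int poll_stat(pid_t pid, unsigned long *utime, unsigned long *stime, int *cpu)
{
    char path[64], line[4096];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    if (!(f = fopen(path, "r")))
        return -1;
    if (!fgets(line, sizeof(line), f)) { fclose(f); return -1; }
    fclose(f);

    /* The comm field can contain spaces, so parse from the last ')'. */
    char *p = strrchr(line, ')');
    if (!p)
        return -1;

    /* Tokenise the remaining fields; tok[0] is field 3 (state). */
    char *tok[64];
    int n = 0;
    for (char *t = strtok(p + 1, " "); t && n < 64; t = strtok(NULL, " "))
        tok[n++] = t;
    if (n < 37)                          /* we need up to field 39, i.e. tok[36] */
        return -1;

    *utime = strtoul(tok[11], NULL, 10); /* field 14 */
    *stime = strtoul(tok[12], NULL, 10); /* field 15 */
    *cpu   = atoi(tok[36]);              /* field 39 */
    return 0;
}

int main(void)
{
    unsigned long ut, st;
    int cpu;

    if (poll_stat(getpid(), &ut, &st, &cpu) == 0)
        printf("utime=%lu stime=%lu ticks, last ran on core %d\n", ut, st, cpu);
    return 0;
}

Polling this at least twice lets you both compute the load from the utime/stime deltas and notice when the processor field has changed between polls.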
Linux has many threads and processes executing across (let's say 2) CPU cores. I would like my single-threaded C/C++ application to be the only thread on CPU0. How would I "move" all the other threads to use CPU1?
I know I can use the Linux CPU scheduling functions to set affinity for a thread:
int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
but how can I push all the other threads onto CPU1? Is there a relatively simple way of doing this?
Would I have to get a list of all the PIDs, iterate through them setting them all to CPU1, and then set my application thread to CPU0?
Your idea about doing this seems to be correct. However, I would like to mention a few points regarding this which should be understood carefully.
1. sched_setaffinity() is a kind of request to the kernel/scheduler (not a command) about which CPUs the process/thread is allowed to execute on. The actual scheduling of a process depends on many other, more complicated factors.
2. You have mentioned that you may iterate through all PIDs. This is not a good idea, as by doing so you may try to change the scheduling of kernel services and the init process. In all probability your program would not have sufficient rights to do it for those processes, but we still should not try to alter the attributes of those processes, as we do not know the impact.
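For the half that is safe to do from your own program, pinning the calling process to CPU0 would look roughly like this (a sketch; moving everything else to CPU1 is exactly the part the points above warn about):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                    /* CPU0 only */

    /* A pid of 0 means "the calling process". */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core %d\n", sched_getcpu());
    return 0;
}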