Maximum threads and optimal threads - c

I have a hybrid MPI+OpenMP code written in C. Before I run the code, I set the number of threads I want to use in the Linux Bash environment:
#!/bin/sh
export OMP_NUM_THREADS=1
However, if I want to use the maximum number of threads on this computer, how can I set that in the Linux Bash environment, as above?

Typically, OpenMP implementations are smart enough to provide a good default for the number of threads. For instance:
If you run an OpenMP program without any setting of OMP_NUM_THREADS and there's no MPI at all, OpenMP will usually determine how many cores there are and use all of them.
If there's MPI and the MPI processes are bound to a subset of the machine through process pinning, e.g., one rank per processor socket, OpenMP will inherit this subset and the implementation will automatically only use that subset.
In all those cases and most other cases, you do not need to set OMP_NUM_THREADS at all.
If you can share more details about what you are trying to achieve, one can provide a more detailed answer about what you need to do.
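If you want to see what the runtime actually picked, a small check like the following (a plain OpenMP sketch; the MPI side is left out) prints the processor count and the default team size:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Processors visible to the OpenMP runtime. */
    printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());

    /* Default upper bound on the team size of a parallel region. */
    printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

    #pragma omp parallel
    {
        #pragma omp single
        printf("actual team size      = %d\n", omp_get_num_threads());
    }
    return 0;
}

Compile with your compiler's OpenMP flag (for example -fopenmp with GCC) and run it once with and once without OMP_NUM_THREADS set to see the difference.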

Related

Multi-threading on ARM cortex A9 dual core (Linux or VxWorks)

I am working on how a dual core (especially in embedded systems) could be beneficial. I would like to compare two targets: one with an ARM Cortex-A9 (925 MHz) dual core, and the other with an ARM Cortex-A8 single core.
I have some ideas (please see below), but I am not sure they will make use of the dual-core features:
My questions are:
1. How do I execute several threads on different cores (without OpenMP, because it didn't work on my target and it isn't compatible with VxWorks)?
2. How does the kernel execute code on a dual core with shared memory: how does it allocate the stack, the heap, and memory for global and static variables?
3. Is it possible to add C flags to indicate the number of CPU cores so that we can use the dual-core features?
4. How does the kernel handle program execution (with a lot of threads) on a dual core?
Some tests to compare the two architectures regarding the OS and dual core vs. single core:
Dual core vs. single core:
Create three threads that execute some routines and depend on each other's results (like matrix multiplication). Afterwards, measure the time taken once on the dual core and once on the single core (how is this possible without OpenMP?).
Ping pong:
Two processes repeatedly pass a message back and forth. We should investigate how the time taken varies with the size of the message on each architecture.
One to all:
A process with rank 0 sends the same message to all other processes in the program and then receives a message of the same length from all other processes. How does the time taken vary with the size of the messages and with the number of processes?
Short answers wrt. Linux only:
how to execute several threads on different cores
Use multiple processes, or pthreads within a single process.
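For illustration, a minimal pthreads sketch (the worker function and the thread count of 2 are placeholders); the kernel decides which core each thread actually runs on:

#include <pthread.h>
#include <stdio.h>

/* Placeholder worker; a real test would run e.g. part of a matrix multiplication. */
static void *worker(void *arg)
{
    printf("thread %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    pthread_t threads[2];

    for (long i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

Link with -pthread.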
How does the kernel execute code on a dual core with shared memory: how does it allocate the stack, the heap, and memory for global and static variables?
In Linux, they all belong to the process. (You can declare thread-local variables, though, with for example the __thread keyword.)
In other words, if a thread allocates some memory, that memory is immediately visible to all other threads in the same process. Even if that thread exits, nothing happens to the memory. It is perfectly normal for some other thread to free the memory later on.
Each thread does get their own stack, which by default is quite large. (With pthreads, this is easy to control using a pthread_attr_t that specifies a smaller stack size.)
In general, there is no difference in memory handling or program execution, between a single-threaded and a multi-threaded process, on the kernel side. (Userspace code needs to use memory and threads properly, of course; the kernel does not try to stop stupid userspace code from shooting itself in the head, at all.)
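As a sketch of the two points above (the 256 KiB stack size and the variable name are just examples), a thread-local counter via __thread and a smaller per-thread stack via pthread_attr_setstacksize():

#include <pthread.h>
#include <stdio.h>

/* Each thread gets its own copy of this variable. */
static __thread int calls = 0;

static void *worker(void *arg)
{
    (void)arg;
    calls++;   /* touches only this thread's copy */
    printf("calls seen by this thread: %d\n", calls);
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    /* Ask for a smaller stack than the default (must stay above PTHREAD_STACK_MIN). */
    pthread_attr_setstacksize(&attr, 256 * 1024);

    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}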
Is it possible to add C flags to indicate the number of CPU cores so that we can use the dual-core features?
Yes, for example by examining the /proc/cpuinfo pseudofile.
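A rough sketch of doing that at run time, counting the processor entries (this counts logical CPUs; on Linux, sysconf(_SC_NPROCESSORS_ONLN) is a simpler alternative):

#include <stdio.h>
#include <string.h>

/* Count "processor" lines in /proc/cpuinfo as an estimate of the logical CPU count. */
static int count_cpus(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int count = 0;

    if (!f)
        return 1;   /* fall back to one CPU if the pseudofile is unavailable */

    while (fgets(line, sizeof line, f))
        if (strncmp(line, "processor", 9) == 0)
            count++;

    fclose(f);
    return count > 0 ? count : 1;
}

int main(void)
{
    printf("detected CPUs: %d\n", count_cpus());
    return 0;
}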
However, it is much more common to leave such details to the system administrator. Services like Apache let the administrator configure the number of workers, either in a configuration file or on the command line. Even make supports a -j JOBS parameter, allowing multiple compilation tasks to run in parallel.
I very warmly recommend you forget about any detection magic, and instead let the user specify the number of threads used. (If you insist, you could use detection magic for the default, but then only if the user/admin has not given you any hints about the number of threads to be used.)
It is also possible to set the CPU mask for each thread, specifying the set of cores that thread may run on. Other than in benchmarks, where this is done more for repeatability than anything else, or in dedicated processes designed to hog all the resources on a machine, it is very rare.
How does the kernel handle program execution (with a lot of threads) on a dual core?
The exact same way as it handles a lot of simultaneous processes.
There are some controls (control group stuff, cpu masks) and maybe resource accounting that are handled differently between separate processes and threads within the same process, but in general terms, separate threads in a single process are executed the same way as separate processes are.

MPI and 2-socket node nonuniform memory access

I use a cluster which contains several nodes. Each of them has 2 processors with 8 cores each. I use Open MPI with SLURM.
My tests show the following MPI Send/Recv data transfer rates: between the MPI process with rank 0 and the process with rank 1 it's about 9 GB/sec, but between process 0 and process 2 it's 5 GB/sec. I assume that this happens because our processes execute on different processors.
I'd like to avoid non-local memory access. The recommendations I found here did not help. So the question is, is it possible to run 8 MPI processes - all on THE SAME processor? If it is - how do I do it?
Thanks.
The following set of command-line options to mpiexec should do the trick with versions of Open MPI before 1.7:
--bycore --bind-to-core --report-bindings
The last option will pretty-print the actual binding for each rank. Binding also activates some NUMA-awareness in the shared-memory BTL module.
Starting with Open MPI 1.7, processes are distributed round-robin over the available sockets and bound to a single core by default. To replicate the above command line, one should use:
--map-by core --bind-to core --report-bindings
It appears to be possible. The Process Binding and Rankfiles sections of the Open MPI mpirun man page look promising. I would try some of the options shown with the --report-bindings option set, so you can verify that process placement is as you intend and see whether you get the performance improvement you expect out of your code.
You should look at the hostfile / rankfile documentation for your MPI library. Open MPI and MPICH both use different formats, but both will give you what you want.
Keep in mind that you will have performance issues if you oversubscribe your processor too heavily. Running more than 8 ranks on an 8-core processor will cause you to lose the performance benefits you gain from having locally shared memory.
With Slurm, set:
#SBATCH --ntasks=8
#SBATCH --ntasks-per-socket=8
to have all cores allocated on the same socket (CPU die), provided Slurm is correctly configured.

Move all threads onto one CPU core so a single thread can have the other core to itself?

Linux has many threads and processes executing across (let's say 2) CPU cores. I would like my single-threaded C/C++ application to be the only thread on CPU0. How would I "move" all the other threads to use CPU1?
I know I can use the Linux CPU scheduling functions to set affinity for a thread:
int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
but how can I push all the other threads onto CPU1? Is there a relatively simple way of doing this?
Would I have to get a list of all the pids, iterate through setting them all to CPU1 and then set my application thread to CPU0?
Your idea about doing this seems to be correct. However, I would like to mention a few points that should be understood carefully.
1. sched_setaffinity() is a request to the kernel/scheduler (not a command) that selects which CPUs the process/thread is allowed to execute on. The actual scheduling of a process depends on many other, more complicated factors.
2. You have mentioned that you may iterate through all PIDs. This is not a good idea, because you may end up changing the scheduling of kernel services and the init process. In all probability your program would not have sufficient rights to do it for these processes, but you still should not try to alter their attributes, as you do not know the impact.
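For the application's own side of this, a minimal sketch of pinning the calling process to CPU 0 with sched_setaffinity() (error handling kept short):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   /* allow this process to run only on CPU 0 */

    /* pid 0 means "the calling thread/process" */
    if (sched_setaffinity(0, sizeof mask, &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... the rest of the single-threaded application ... */
    return 0;
}

Pushing everything else onto CPU1 would still mean calling sched_setaffinity() for the other PIDs, which is exactly where the caveats above apply.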

Most portable (among *nix) way to allow a thread to lower its own nice value

What's the best way to grant a process/thread the right to lower its own nice value, without running it with full privileges? Solution can be external to the process itself (ulimit or setcap for example).
I'm looking for something portable at least across modern Linux and Mac OS X (and this is why I didn't reply myself with ulimit or setcap).
You'll need extra privileges to decrease the nice value (increase the logical priority). In Linux, this means either being run by root or having the CAP_SYS_NICE capability. Both can be set for the binary executable (either setuid root via chown and chmod, or setcap). The former will work on all Unix-like systems (but will require root privileges when installed), but the latter is Linux-specific.
The most acceptable portable way is probably to write a wrapper program that can be installed setuid root. It will be very simple, just a couple of dozen lines of C. It simply calls sched_get_priority_min(), sched_get_priority_max(), sched_setscheduler(), and sched_setparam() to lower the nice value (getting it more CPU time), then calls seteuid(0); setregid(getgid(), getgid()); setreuid(getuid(), getuid()); to drop the extra privileges, and finally execv()s the actual program. Note: you most definitely want to hardcode the path to the actual program at install time. This should work without modifications on all Linux and Unix-like systems.
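A minimal sketch of such a wrapper, with the target path /usr/local/bin/myprog and the nice value -10 as placeholders; this version uses setpriority() to lower the nice value instead of the real-time scheduler calls mentioned above:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

/* Hypothetical installed path of the real program; hardcode it at install time. */
#define REAL_PROGRAM "/usr/local/bin/myprog"

int main(int argc, char *argv[])
{
    (void)argc;

    /* Lower the nice value (raise the logical priority) while still privileged.
       -10 is just an example value. */
    if (setpriority(PRIO_PROCESS, 0, -10) != 0)
        perror("setpriority");

    /* Drop the extra privileges before executing the real program. */
    if (setregid(getgid(), getgid()) != 0 ||
        setreuid(getuid(), getuid()) != 0) {
        perror("dropping privileges");
        return 1;
    }

    execv(REAL_PROGRAM, argv);
    perror("execv");   /* only reached if execv() fails */
    return 1;
}

Install it with chown root and chmod u+s, as described above, so that setpriority() is allowed to lower the nice value; the lowered nice value is inherited across execv().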
In your actual program, you simply increase the niceness of the threads that are not so important. In other words, you do not try to lower the niceness of any threads in your program, but increase the niceness of all other threads. The setuid root wrapper program is the portable way to reduce the minimum niceness level. You can obviously check the current niceness and scheduler details first to see if there is enough range to adjust. Perhaps your wrapper program can set command-line parameters or environment variables that tell the actual program which priority levels to use.
Any process can make itself nicer using setpriority() or sched_setscheduler(), and any thread using pthread_setschedparam() and pthread_setschedprio(). Both are defined in POSIX.1-2001, so they should be available on basically all non-Windows systems. For details on the scheduler types and priorities available, see man 2 sched_setscheduler.
Note that higher numerical nice values indicate a nicer process, i.e., a lower logical priority: the larger the value, the less CPU time it gets. To find out the minimum and maximum priority values for a given scheduling policy, you must use sched_get_priority_min() and sched_get_priority_max().
Normally a process or thread should always be able to lower its priority (making it nicer), and use any scheduling policy that does not make it less nice. However, Linux kernels prior to 2.6.12 did not allow that for normal users, so your program should probably just try to make it or some of its threads nicer, but not mind too much if it happens to not be allowed on some rarer architectures. Most importantly, your algorithmic design should not rely on scheduling; strive for more robust code than that.
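As a small Linux-only sketch of making an unimportant thread nicer (on Linux the nice value is effectively per-thread, so setpriority() called from within a thread affects only that thread; the increment of 5 is arbitrary):

#include <pthread.h>
#include <errno.h>
#include <sys/resource.h>

/* A worker that is not time-critical makes itself nicer by 5 before doing its work.
   On Linux, setpriority(PRIO_PROCESS, 0, ...) called from a thread affects only
   the calling thread; portable code should not rely on this. */
static void *background_worker(void *arg)
{
    (void)arg;
    errno = 0;
    int current = getpriority(PRIO_PROCESS, 0);
    if (errno == 0)
        setpriority(PRIO_PROCESS, 0, current + 5);
    /* ... low-priority work ... */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, background_worker, NULL);
    pthread_join(tid, NULL);
    return 0;
}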

Can you ask for additional processors on the fly in MPI?

With MPI in C, you can do the following to run a program:
mpirun -np 5 program
where 5 is the number of processors to use and program is the program to run on these processors. Is it possible to request x processors like above but then whilst the program is running, if you decide you need y processors (let's say y>x), can you request more without restarting the program?
If the answer is yes, then how do you do this? If no then why not?
Many thanks.
This is possible, but it's not trivial. The application needs to have been coded to support this. Fundamentally, there's the issue that MPI provides various global communication and synchronization primitives, and it's not clear what to do with such operations when adding new parallelism - you wouldn't want new processes to non-deterministically deadlock or crash others, after all.
Here's some documentation on IBM's site - any MPI-2 implementation should conform to the same outline. Jonathan points out that the MPI Specification itself includes a pretty good example of doing this for a master-worker sort of problem.
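The mechanism MPI-2 provides for this is dynamic process management. A minimal sketch of a parent spawning extra processes at run time with MPI_Comm_spawn (the worker executable name ./worker and the count of 4 are placeholders):

#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm children;

    MPI_Init(&argc, &argv);

    /* Start 4 more processes running ./worker; they are reached through an
       intercommunicator, not through the parent's MPI_COMM_WORLD. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    /* The workers obtain the matching intercommunicator with MPI_Comm_get_parent()
       and must take part in whatever communication the parent expects, e.g.: */
    MPI_Barrier(children);

    MPI_Finalize();
    return 0;
}

This is exactly the kind of coordination the answer above warns about: both sides have to be written to agree on what happens once the new processes appear.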
