I've recently started running MPI on my computer for some practice, after having some experience using MPI on a cluster. I have a dual core processor but was curious about what would happen if I specified a large number of processes and to my surprise it worked.
mpirun -np 40 ./wha
How exactly is this happening? Even accounting for the number of threads a single one of the processors could run, this doesn't seem possible.
In the case of Open MPI, if the number of processes is larger than the number of processors (i.e. when oversubscription happens), Open MPI runs the MPI processes in degraded mode. Running in degraded mode means each process yields its processor to the other MPI processes so they can make progress (i.e. time sharing happens). The MCA parameter mpi_yield_when_idle can be set to 0 to explicitly force the aggressive mode instead; in that case an MPI process will not voluntarily give up the processor to other processes.
See here
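For example, the parameter can be passed on the mpirun command line (a sketch for Open MPI; ./wha is the executable from the question, and the value 0 forces the aggressive mode described above):

    mpirun --mca mpi_yield_when_idle 0 -np 40 ./wha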
Related
I have a small C program that uses a lot of CPU. It is compiled to an exe, and I run it as a process from my C# GUI.
When I want to run it in parallel across all my CPU cores, I have 2 options.
I have 4 CPU cores.
1. Run this C exe from my C# as 4 processes, so my OS assigns each process to its own core.
2. Edit my C code so it runs 4 threads, so the OS schedules one thread on each core, and from C# I run it as 1 process.
Which way will be faster?
Edit: those processes/threads will run for about 3-5 hours and don't need to communicate with one another.
All of this is running on Windows.
Running 4 threads in C will be faster than running 4 processes in C#.
There is a higher penalty for switching between processes than for switching between threads, and communication between processes is slower than communication between threads.
Creating a process takes more time than creating a thread, because threads share resources (code, data, ...).
If your CPU is continuously switching between processes (which usually happens), processes are not the better solution; using threads is more efficient, especially on a single-CPU system.
https://stackoverflow.com/a/200543/3476815
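For the thread option, a minimal sketch of what the C side could look like on Windows (the worker function and thread count here are placeholders for the real CPU-heavy computation):

    #include <windows.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_WORKERS 4   /* one per core in this example */

    /* Placeholder for the real CPU-heavy computation. */
    static DWORD WINAPI worker(LPVOID arg)
    {
        long id = (long)(intptr_t)arg;
        /* ... hours of number crunching ... */
        printf("worker %ld finished\n", id);
        return 0;
    }

    int main(void)
    {
        HANDLE threads[NUM_WORKERS];
        for (long i = 0; i < NUM_WORKERS; i++)
            threads[i] = CreateThread(NULL, 0, worker, (LPVOID)(intptr_t)i, 0, NULL);
        /* Wait for all workers before the process exits. */
        WaitForMultipleObjects(NUM_WORKERS, threads, TRUE, INFINITE);
        for (int i = 0; i < NUM_WORKERS; i++)
            CloseHandle(threads[i]);
        return 0;
    }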
I was recently doing some experiments with the fork function, and I was puzzled by a "simple" (short) question:
Does fork use concurrency or parallelism mechanisms (if there is more than one core)?
Or, is it the OS which makes the best choice?
Thanks for your answer.
nb: Damn I fork bombed myself again!
Edit:
Concurrency: each operation runs on one core. Interrupts are received in order to switch from one process to another (sequential computation).
Parallelism: operations run on two (or more) cores at the same time.
fork() duplicates the current process, creating another independent process. End of story.
How the kernel chooses to schedule these processes is a different, very broad question. In general the kernel will try to use all available resources (cores) to run as many tasks as possible. When there are more runnable tasks than cores, it has to start making decisions about who gets to run, and for how long.
The fork function creates a separate process. It is up to the operating system how it handles different processes.
Of course, if only one core is available, the OS has no choice but to run all processes interleaved.
If more cores are available, every sane OS will distribute the processes to the different cores, so every core runs at least one process.
However, even then, more processes can be active than there are cores. So even then, it is up to the OS to decide which processes can be run parallel (by distributing to cores) and which have to be run interleaved (on a single core).
In fact, fork() is a system call (a.k.a. system service) which creates a new process from the current process (check the return value to see who you are, the parent or the child).
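A minimal illustration of that return value (a sketch for a POSIX system):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t pid = fork();                 /* duplicate the current process */
        if (pid < 0) {
            perror("fork");
            exit(EXIT_FAILURE);
        } else if (pid == 0) {
            printf("child:  my pid is %d\n", (int)getpid());  /* fork() returned 0 */
            _exit(EXIT_SUCCESS);
        } else {
            printf("parent: my child's pid is %d\n", (int)pid);
            wait(NULL);                     /* reap the child */
        }
        return 0;
    }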
In the UNIX world, processes share the CPU computing time. It works like this:
1) a process is running
2) the clock generates an interrupt, calling into the kernel and pausing the process
3) the kernel takes the list of runnable processes and decides which one to resume (this is called scheduling)
4) go back to 1)
When there are multiple processor cores, the kernel is able to dispatch processes across them.
Well, you can do an experiment. Write a computationally heavy program, say O(n^3), so that it takes a good amount of time to compute. fork() four times (if you have a quad-core). Now open any graphical CPU monitor. Anything cool?
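A quick sketch of that experiment (POSIX assumed; the loop bound is arbitrary busy work):

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Spawn 4 CPU-bound children and watch the CPU monitor light up. */
        for (int i = 0; i < 4; i++) {
            if (fork() == 0) {
                volatile unsigned long x = 0;   /* volatile: keep the loop from being optimized away */
                for (unsigned long n = 0; n < 2000000000UL; n++)
                    x += n;                     /* arbitrary busy work */
                _exit(0);
            }
        }
        for (int i = 0; i < 4; i++)
            wait(NULL);                         /* parent waits for all children */
        return 0;
    }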
I have a C program (a graphics benchmark) that runs on a MIPS processor simulator (I'm looking to graph some performance characteristics). The processor has 8 cores, but it seems like core 0 is executing more than its fair share of instructions. The benchmark is multithreaded, with the work distributed evenly between the threads. Why might core 0 end up running between a quarter and half of the instructions, even though the benchmark is multithreaded on an 8-core processor?
What are some possible reasons this could be happening?
Most application workloads involve some number of system calls, which could block (e.g. for I/O). It's likely that your threads spend some amount of time blocked, and the scheduler simply runs them on the first available core. In an extreme case, if you have N threads but each is able to do work only 1/N of the time, a single core is sufficient to service the entire workload.
You could use pthread_setaffinity_np to assign each thread to a specific core, then see what happens.
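A sketch of what that could look like with pthreads on Linux/glibc (whether it applies depends on what the simulator's runtime supports; the worker body is a placeholder):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to the given core. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg)
    {
        int core = *(int *)arg;
        if (pin_to_core(core) != 0)
            fprintf(stderr, "failed to pin thread to core %d\n", core);
        /* ... this thread's share of the benchmark work ... */
        return NULL;
    }

    int main(void)
    {
        enum { NTHREADS = 8 };
        pthread_t tid[NTHREADS];
        int cores[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            cores[i] = i;
            pthread_create(&tid[i], NULL, worker, &cores[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

If the per-core instruction counts even out after pinning, the imbalance was coming from the scheduler rather than from the benchmark itself.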
You did not mention which OS you are using.
However, most of the code in most OSs is still written for a single-core CPU.
Therefore, the OS will not try to evenly distribute the processes over the array of cores.
When there are multiple cores available, most OSs start a process on the first core that is available (and a blocked process leaves the related core available.)
As an example, on my system (a 4-core AMD64) running Ubuntu Linux 14.04, the CPUs are usually less than 1 percent busy, so everything could run on a single core.
You need lots of applications running (videos, long-running background tasks) and several windows open before you see much real activity on cores other than the first.
I am having issues with OpenMP and MPI execution timings. When I select either 4 Threads (OMP) or 4 Processes (MPI), my execution time is slower than the serial code.
Both scripts have correct timings on other machines and both use the gettimeofday() function for timing. Below is a screen shot of both scripts being executed from 1-8 Threads/Procs:
RAM is not exceeding its limit and the disk is not busy during execution. The machine hosts an Intel i5 2500K (stock, not overclocked) and is running Linux Mint 17 x64.
As mentioned before, both programs produce the correct timings on other machines, so I think the issue has something to do with CPU affinity and the OS.
Has anyone encountered this issue before?
EDIT 1:
When using the argument 'bind-to-core' on the MPI execution, the runtime improves significantly, but it is still much slower than serial:
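For reference, the binding option is passed on the mpirun command line roughly like this (the executable name and process count here are placeholders, and the exact flag spelling differs between Open MPI versions):

    mpirun --bind-to-core -np 4 ./benchmark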
The problem was faulty hardware.
I replaced the motherboard with one of the same series/chipset (so it did not require a reinstall), and now the timings are correct for both scripts.
I have just added threading to a large application I have been developing for years. It is written in C and runs on Mac and Linux. This question is about OS X, 10.8.2 or 10.6.8.
Problem: I see the program opening two threads as I expect. However, apparently both threads are running on the same CPU, or at least, I never get more than 100% of a CPU allocated to the program. This almost defeats the entire purpose of having threads.
I use a fair number of mutexes, if that matters.
How can I force the OS to run each thread at 100% of different CPUs? (There are 8 CPUs on this machine.)
The mutexes may matter a lot here. Open up Instruments and run the time profiler instrument on your program after setting it to "record all thread states". This will let you see where your threads are blocked waiting for something (likely a mutex) instead of running.
Multiple running threads will be concurrent as long as they execute on different cores, as each core has its own instance of the scheduler in every Unix-like OS. Being on separate CPU dies matters little: in fact, there's a benefit to sharing resources between threads running on separate cores of the same die.
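As a rough illustration of why heavy mutex use can keep a multithreaded program pinned near 100% of one CPU, here is a contrived sketch (not the asker's code) where a single hot mutex effectively serializes the work:

    #include <pthread.h>
    #include <stdio.h>

    /* One hot mutex protects the only piece of work, so the threads spend
       most of their time waiting on it and the work is effectively
       serialized even though two threads exist. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static double shared = 0.0;

    static void *worker(void *arg)
    {
        (void)arg;
        for (long i = 0; i < 10000000; i++) {
            pthread_mutex_lock(&lock);   /* held for the entire unit of work */
            shared += (double)i * 0.5;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);
        printf("result: %f\n", shared);
        return 0;
    }

With "record all thread states" enabled, a profile of a program like this would show the threads spending most of their time blocked on the mutex rather than running.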