I am having issues with OpenMP and MPI execution timings. When I select either 4 threads (OpenMP) or 4 processes (MPI), my execution time is slower than the serial code.
Both scripts produce correct timings on other machines, and both use the gettimeofday() function for timing. Below is a screenshot of both scripts being executed with 1-8 threads/processes:
RAM is not exceeding its limit and the disk is not busy during execution. The machine hosts an Intel i5 2500K (stock, not overclocked) and is running Linux Mint 17 x64.
As mentioned before, both programs produce the correct timings on other machines, so I think the issue has something to do with CPU affinity and the OS.
Has anyone encountered this issue before?
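For reference, here is a minimal sketch of the kind of gettimeofday()-based timing harness both programs use; the reduction loop and the value of N below are my own placeholders, not the actual benchmark kernel:

    /* compile with: gcc -O2 -fopenmp timing.c */
    #include <stdio.h>
    #include <sys/time.h>

    #define N 200000000L

    int main(void)
    {
        struct timeval start, end;
        double sum = 0.0;

        gettimeofday(&start, NULL);                 /* wall-clock start */
        #pragma omp parallel for reduction(+:sum)   /* thread count from OMP_NUM_THREADS */
        for (long i = 0; i < N; i++)
            sum += (double)i * 0.5;
        gettimeofday(&end, NULL);                   /* wall-clock end */

        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_usec - start.tv_usec) / 1e6;
        printf("sum = %g, elapsed = %f s\n", sum, elapsed);
        return 0;
    }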
EDIT 1:
When using the 'bind-to-core' argument on the MPI execution, performance is significantly improved, but it is still much slower than serial:
The problem was faulty hardware. I replaced the motherboard with one of the same series/chipset (so it did not require a reinstall), and now both scripts return the correct timings.
I've recently started running MPI on my computer for some practice, after having some experience using MPI on a cluster. I have a dual-core processor, but I was curious what would happen if I specified a large number of processes, and to my surprise it worked.
mpirun -np 40 ./wha
How exactly is this happening? Even considering the number of threads a single one of the processors could spawn, this doesn't seem possible.
In the case of Open MPI, if the number of processes is larger than the number of processors (i.e. when oversubscription happens), Open MPI runs the MPI processes in degraded mode. Running in degraded mode means a process yields its processor so that other MPI processes can make progress (i.e. time sharing happens). The MCA parameter mpi_yield_when_idle can be set to 0 to explicitly force aggressive mode, in which case an MPI process won't voluntarily give up the processor to other processes.
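For example, assuming the same toy run as above and Open MPI 1.x-style MCA options (the exact parameter name may differ across Open MPI versions), aggressive mode can be requested explicitly on the command line:
mpirun --mca mpi_yield_when_idle 0 -np 40 ./wha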
See here
I have a C program (a graphics benchmark) that runs on a MIPS processor simulator (I'm looking to graph some performance characteristics). The processor has 8 cores, but it seems like core 0 is executing more than its fair share of instructions. The benchmark is multithreaded, with the work distributed exactly evenly between the threads. Why could it be that core 0 runs somewhere between a quarter and half of the instructions, even though the benchmark is multithreaded on an 8-core processor?
What are some possible reasons this could be happening?
Most application workloads involve some number of system calls, which could block (e.g. for I/O). It's likely that your threads spend some amount of time blocked, and the scheduler simply runs them on the first available core. In an extreme case, if you have N threads but each is able to do work only 1/N of the time, a single core is sufficient to service the entire workload.
You could use pthread_setaffinity_np to assign each thread to a specific core, then see what happens.
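For example, here is a minimal sketch of pinning each thread to its own core with pthread_setaffinity_np (Linux-specific and non-portable; the busy-loop worker is only a stand-in for your benchmark's threads):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define NTHREADS 8

    static void *worker(void *arg)
    {
        (void)arg;
        volatile unsigned long x = 0;               /* placeholder workload */
        for (unsigned long i = 0; i < 100000000UL; i++)
            x += i;
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];

        for (int t = 0; t < NTHREADS; t++) {
            pthread_create(&tid[t], NULL, worker, NULL);

            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(t, &set);                       /* pin thread t to core t */
            pthread_setaffinity_np(tid[t], sizeof(set), &set);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }

With the threads pinned, the per-core instruction counts should tell you whether the imbalance comes from the scheduler or from the workload itself.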
You did not mention which OS you are using.
However, most of the code in most OSs is still written for a single core CPU.
Therefore, the OS will not try to evenly distribute the processes over the array of cores.
When there are multiple cores available, most OSs start a process on the first core that is available (and a blocked process leaves the related core available.)
As an example, on my system (a 4-core AMD64 machine) running Ubuntu Linux 14.04, the CPUs are usually less than 1 percent busy, so everything could run on a single core.
You need lots of applications running, like videos and long-running background jobs, with several windows open, before you see much real activity on any core other than the first.
I have just added threading to a large application I have been developing for years. It is written in C and runs on Mac and Linux. This question is about OS X, 10.8.2 or 10.6.8.
Problem: I see the program opening two threads as I expect. However, apparently both threads are running on the same CPU, or at least, I never get more than 100% of a CPU allocated to the program. This almost defeats the entire purpose of having threads.
I use a fair number of mutexes, if that matters.
How can I force the OS to run each thread at 100% of different CPUs? (There are 8 CPUs on this machine.)
The mutexes may matter a lot here. Open up Instruments and run the time profiler instrument on your program after setting it to "record all thread states". This will let you see where your threads are blocked waiting for something (likely a mutex) instead of running.
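To see why the mutexes matter, consider a contrived sketch (names are mine, not from the original program) where nearly all of the work happens under one shared lock. The two threads cannot be inside the critical section at the same time, so adding cores buys almost nothing, and the profiler will typically show one thread blocked while the other runs:

    #include <pthread.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long counter = 0;

    /* Almost all of the "work" is done while holding the same mutex,
     * so the threads spend most of their time waiting on each other
     * rather than running in parallel. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 10000000; i++) {
            pthread_mutex_lock(&big_lock);
            counter++;
            pthread_mutex_unlock(&big_lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

The usual fix is to shrink the critical sections or give each thread its own data, so the lock is held only briefly.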
Multiple running threads will be concurrent as long as they execute on different cores, as each core has its own instance of the scheduler in every Unix-like OS. Being on separate CPU dies matters little: in fact, there's a benefit to sharing resources between threads running on separate cores of the same die.
I have a 6-core setup (Intel Xeon) with hyperthreading disabled, running Ubuntu.
I work on software in which I can increase the number of pthreads launched. When I increase the number of threads from, say, 3 to 6, I find that the CPU utilization reported by the top command always stays around 290%. I am assuming this means 3 out of 6 cores are being utilized. This is pretty confusing, as I would expect all the cores to be utilized. I have checked the program's core affinity, and it shows that the program is allowed to run on all cores. Can anyone provide me with some hints or suggestions on what I might be doing wrong?
Thanks!
I am thinking about an idea where a legacy application needs to run at full performance on a Core i7 CPU. Is there any Linux software/utility to combine all cores for that application, so it can process at higher performance than using only 1 core?
The application is readpst, and it only uses 1 core for processing Outlook PST files.
It's OK if I can't use all cores; it would be fine if it could use, say, 3 cores.
Possible? or am i drunk?
I would rewrite it to use multiple cores myself if my C knowledge of forking and multi-process programming were good enough.
Intel Nehalem-based CPUs (i7, i5, i3) already do this to an extent.
By using their Turbo Boost mode, when a single core is being used it is automatically over-clocked until the power and temperature limits are reached.
The newer versions of the i7 (the 2K chips) do this even better.
Read this, and this.
"Possible? or am i drunk?"
You're drunk! If this was easy in the general case, Intel would have built it into the processors by now!
What you're looking for is called 'Single System Image' or SSI. There is scant information on the internet about people doing such a thing, as it tends to be reserved for supercomputing (and perhaps servers).
http://en.wikipedia.org/wiki/Single_system_image
No, the application needs to be multi-threaded to use more than one core. You're of course free to write a multi-threaded version of that application if you wish, but it may not be easy to make sure the different threads don't mess each other up.
If you want it to take advantage of multiple cores, then you could write a multi-threaded version of your program. But only in the case that the work is actually parallelizable. You said you were reading from PST files; take care not to run into I/O bottlenecks.
A great library for working with threads, mutexes, semaphores and so on is POSIX Threads.
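As a rough sketch of what a pthreads version could look like, here is a pattern that splits a list of input files across a few worker threads; the file names and process_file() below are placeholders, not readpst's real API:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 3          /* e.g. "like 3 cores" */

    static const char *files[] = { "a.pst", "b.pst", "c.pst",
                                   "d.pst", "e.pst", "f.pst" };
    #define NFILES (sizeof(files) / sizeof(files[0]))

    /* Stand-in for whatever per-file processing the real tool does. */
    static void process_file(const char *path)
    {
        printf("processing %s\n", path);
    }

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* Each thread takes every NTHREADS-th file; the threads share
         * no mutable state, so no mutexes are needed for this split. */
        for (size_t i = (size_t)id; i < NFILES; i += NTHREADS)
            process_file(files[i]);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        return 0;
    }

This only helps if the PST processing is CPU-bound; if the disk is the bottleneck, three threads will just take turns waiting on it.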
There isn't such an application available, but it is possible.
If the OS were run inside a VM, the hypervisor could use a few CPUs to identify which parts of the code can run in parallel (i.e. are not required to run sequentially), and those parts could then actually be executed on a few other CPUs at once.
Then, whenever the executing CPUs are idle (because they finished their work faster than the manager can supply new work), they could start calculating the next batch of instructions.
The reason this would have to be done at the hypervisor level, and not within the OS, is memory locking: inside the OS it wouldn't be possible.