I have a 6-core setup (Intel Xeon) with hyperthreading disabled, running Ubuntu.
I work on software that launches a configurable number of pthreads. When I increase the thread count from, say, 3 to 6, the CPU utilization reported by the top command stays around 290%. I assume this means only about 3 of the 6 cores are being utilized, which is confusing, as I would expect all the cores to be used. I have checked the program's core affinity and it shows the process is affine to all cores. Can anyone give me some hints or suggestions on what I might be doing wrong?
Thanks!
I've recently started running MPI on my computer for practice, after having some experience using MPI on a cluster. I have a dual-core processor but was curious what would happen if I specified a large number of processes, and to my surprise it worked:
mpirun -np 40 ./wha
How exactly is this happening? Even considering the number of threads a single one of the processors could spawn, this doesn't seem possible.
In the case of OpenMPI, if the number of running processes is larger than the number of processors (i.e. when oversubscription happens), OpenMPI runs the MPI processes in degraded mode. Running in degraded mode means each process yields its processor to other MPI processes so they can make progress (i.e. time sharing happens). The MCA parameter mpi_yield_when_idle can be set to 0 to explicitly force aggressive mode instead; in that case an MPI process won't voluntarily give up the processor to other processes.
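For illustration, using the executable name from the question (the `--mca` flag is OpenMPI's way of setting MCA parameters on the command line):

```shell
# Oversubscribe: 40 ranks on 2 cores; OpenMPI detects this and runs degraded.
mpirun -np 40 ./wha

# Explicitly force aggressive mode so ranks never yield voluntarily.
mpirun --mca mpi_yield_when_idle 0 -np 40 ./wha
```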
See here
I have a C program (a graphics benchmark) that runs on a MIPS processor simulator (I'm looking to graph some performance characteristics). The processor has 8 cores, but core 0 seems to execute more than its fair share of instructions. The benchmark is multithreaded, with the work exactly distributed between the threads. Why might core 0 run between a quarter and half of all instructions, even though the program is multithreaded on an 8-core processor?
What are some possible reasons this could be happening?
Most application workloads involve some number of system calls, which could block (e.g. for I/O). It's likely that your threads spend some amount of time blocked, and the scheduler simply runs them on the first available core. In an extreme case, if you have N threads but each is able to do work only 1/N of the time, a single core is sufficient to service the entire workload.
You could use pthread_setaffinity_np to assign each thread to a specific core, then see what happens.
You did not mention which OS you are using.
However, much of the code in most OSs was still written with a single-core CPU in mind.
Such an OS will not try to distribute processes evenly over the array of cores.
When multiple cores are available, most OSs start a process on the first core that is available (and a blocked process leaves its core available).
As an example, on my system (a 4-core AMD64 box running Ubuntu Linux 14.04), the CPUs are usually less than 1 percent busy, so everything could run on a single core.
It takes a lot of simultaneous activity, such as videos, long-running background applications, and several open windows, to show much real activity on any core other than the first.
I am having issues with OpenMP and MPI execution timings. When I select either 4 threads (OpenMP) or 4 processes (MPI), my execution time is slower than the serial code.
Both programs produce correct timings on other machines, and both use the gettimeofday() function for timing. Below is a screenshot of both being executed with 1-8 threads/processes:
RAM is not exceeding its limit and the disk is not busy during execution. The machine hosts an Intel i5 2500K (stock, not overclocked) and runs Linux Mint 17 x64.
As mentioned before, both programs produce correct timings on other machines, so I think the issue has something to do with CPU affinity and the OS.
Has anyone encountered this issue before?
EDIT 1:
When using the --bind-to-core argument with the MPI execution, the runtime improves significantly but is still much slower than serial:
The problem was faulty hardware.
I replaced the motherboard with one of the same series/chipset (so no reinstall was required), and now both programs report correct timings.
I'm trying to figure out why our software runs so much slower under virtualization. Most of the stats I've seen say it should be only a 10% performance penalty in the worst case, but on a Windows virtual server the penalty can be 100-400%. I've been trying to profile the differences, but the profile results don't make a lot of sense to me. Here's what I see when I profile on my Vista 32-bit box with no virtualization:
And here's one run on a Windows 2008 64-bit server with virtualization:
The slow one is spending a very large amount of its time in RtlInitializeExceptionChain, which shows as 0.0s on the fast one. Any idea what that function does? Also, when I attach to the process on my machine, there is only a single thread, PulseEvent; however, when I connect on the server, there are two threads, GetDurationFormatEx and RtlInitializeExceptionChain. As far as I know, the code as we've written it uses only a single thread. Also, for what it's worth, this is a console-only application written in pure C with no UI at all.
Can anybody shed any light on any of this for me? Even just information on what some of these ntdll and kernel32 calls are doing? I'm also unsure how much of the difference is 64/32-bit related and how much is virtual/non-virtual related. Unfortunately, I don't have easy access to other configurations to determine the difference.
I suppose we could divide reasons for slower performance on a virtual machine into two classes:
1. Configuration Skew
This category is for all the things that have nothing to do with virtualization per se but where the configured virtual machine is not as good as the real one. A really easy thing to do is to give the virtual machine just one CPU core and then compare it to an application running on a 2-CPU 8-core 16-hyperthread Intel Core i7 monster. In your case, at a minimum you did not run the same OS. Most likely there is other skew as well.
2. Bad Virtualization Fit
Things like databases that do a lot of locking do not virtualize well, so the typical overhead figures may not apply to your test case. It's not your exact case, but I've been told the penalty is 30-40% for MySQL. I notice an entry point called ...semaphore in your list; that's a sign of something that will virtualize slowly.
The basic problem is that constructs that can't be executed natively in user mode will require traps (slow, all by themselves) and then further overhead in hypervisor emulation code.
I'm assuming that you're providing enough resources for your virtual machines. The benefit of virtualization is consolidating five machines that run at only 10-15% CPU/memory onto a single machine running at 50-75% CPU/memory, which still leaves 25-50% headroom for those "bursty" times.
Personal anecdote: 20 machines were virtualized but each was using as much CPU as it could. This caused problems when a single machine was trying to use more power than a single core could provide. Therefore the hypervisor was virtualizing a single core over multiple cores, killing performance. Once we throttled the CPU usage of each VM to the maximum available from any single core, performance skyrocketed.
I am thinking about an idea where a legacy application needs to run at full performance on a Core i7 CPU. Is there any Linux software/utility to combine all the cores for that application, so it can process at higher performance than using only one core?
The application is readpst, and it only uses one core for processing Outlook PST files.
It's OK if I can't use all the cores; it would be fine to use, say, 3 cores.
Possible? or am i drunk?
I would rewrite it to use multiple cores myself if my C knowledge of multiprocessing and forking were good enough.
Intel Nehalem-based CPUs (i7, i5, i3) already do this to an extent.
With their Turbo Boost mode, when only a single core is in use it is automatically overclocked until the power and temperature limits are reached.
The newer versions of the i7 (the 2000-series chips) do this even better.
Read this, and this.
"Possible? or am i drunk?"
You're drunk! If this were easy in the general case, Intel would have built it into the processors by now!
What you're looking for is called 'Single System Image' or SSI. There is scant information on the internet about people doing such a thing, as it tends to be reserved for super computing (and perhaps servers).
http://en.wikipedia.org/wiki/Single_system_image
No, the application needs to be multi-threaded to use more than one core. You're of course free to write a multi-threaded version of that application if you wish, but it may not be easy to ensure the different threads don't interfere with each other.
If you want it to exploit multiple cores, you could write a multi-threaded version of your program, but only if the work is actually parallelizable. Since you said you are reading from PST files, take care not to run into I/O bottlenecks.
A good library for working with threads, mutexes, semaphores and so on is POSIX Threads (pthreads).
There isn't such an application available, but it is possible in principle.
When an OS runs in a VM, the hypervisor could use a few CPUs to identify which parts of the instruction stream could run in parallel rather than sequentially, and then actually execute them at once on a few other CPUs.
Then, whenever the executing CPUs go idle (because they finished their work faster than the manager can supply new work), they could start computing the next stretch of instructions.
The reason this would have to be done at the hypervisor level, rather than within the OS, is that memory locking would make it impossible inside the OS.